Deep Reinforcement Learning
CS294-112, 2017 Fall
Lecture 13
์†๊ทœ๋นˆ

Korea University, Department of Industrial Management Engineering
Contents
1. IRL: inferring the reward function from an expert's demonstrations
2. MaxEnt IRL
   1. Picks a sensible reward when many ambiguous rewards are consistent with the demonstrations
   2. Simple and efficient to compute via dynamic programming (small spaces)
   3. How to use it in large, continuous spaces
3. MaxEnt IRL with GANs
   1. Guided cost learning algorithm
   2. Connection to GAN
   3. Generative adversarial imitation learning
Where does the reward function come from?
In games, a numeric signal such as the score is clearly available.
In the real world there is no such clear reward: it is hard even to tell whether a task has been completed, and a deep understanding of the task itself is required.
Where does the reward function come from?
Automated tech support system
- For a computer-repair support system, the final reward would be whether the system actually helped.
- The goal is intangible -> ground truth is hard to obtain (e.g. customer satisfaction, annoyance at hearing the same answer repeated).
- There is no convention or rule for writing the reward function that every engineer would agree on -> it stays ambiguous.
Writing a reward function by hand is very difficult
(e.g. in autonomous driving, "good manners" toward other drivers).
Why should we learn the reward?

• For tasks and rewards that are hard to describe, it is often much easier to simply demonstrate them (e.g. the conscience, manners, and etiquette a driver should have).
• Imitation learning needs no understanding of the task at all: the agent just copies the expert, so it also copies unnecessary behaviors, and performance varies wildly with how skilled the demonstrating expert is.
• The video of the child on the left of the slide is a very famous experiment.
Why should we learn the reward?

• Infants can understand another person's intention.
• The child is not blindly mimicking the behavior; it understands the task itself.
• If our RL system were a pure imitation-learning model, it could not behave like the child.
• But a model that understands intention, that understands the system, can go beyond raw performance and efficiency and even achieve domain transfer.
• Adult on the right of the slide: tries to put objects into a bag and cannot pick up an item that fell on the floor.
• The child watches the scene and picks it up for him.
Inverse Optimal Control / Inverse Reinforcement Learning

Given
• state & action space
• samples from ฯ€*
• dynamics model

Goal
• Recover the reward function
• Use the reward to get a policy

Challenges
• The problem is underdefined
• The learned reward is hard to evaluate
• The demonstrations may themselves be suboptimal
Challenges of IRL

1. Underdefined problem -> many possible answers
   1. The problem has to be specified very concretely.
   2. In the earlier experiment, the child already has a great deal of prior knowledge about the situation.
   3. When we cast this as an ML problem, our model tries to solve it without even that minimal understanding of the world.

ex) Simple world
• Interpret the triangle, circle, and arrows above
• Many different interpretations are possible
• With no prior knowledge it is ambiguous how to act next

For any observed policy in general, there is an infinite set of reward functions that will all make that policy appear optimal.
Challenges of IRL

2. Evaluating the learned reward is difficult
   1. Typical IRL structure:
      1. Improve the reward function
      2. Evaluate the reward function (e.g. by computing gradients)
   2. In this structure, a full RL procedure runs in the inner loop of the IRL process.
   3. Because RL is repeated inside IRL, the whole procedure is extremely expensive.
3. Sub-optimality of experts
   1. The expert demonstrations themselves may be a poor fit for the task.
   2. Even if the previous two problems were solved perfectly, this one alone leads to bad performance.
A bit more formally

Forward RL
given:
- state & action space
- transitions p(s'|s, a)
- reward function r(s, a)
learn ฯ€*(a|s)

Inverse RL
given:
- state & action space
- transitions p(s'|s, a)
- trajectory samples {ฯ„แตข} sampled from ฯ€*(ฯ„)
learn rฯˆ(s, a) (reward parameters ฯˆ)
----> the learned reward function is then used to train a policy
Feature matching IRL

Linear reward function:

rฯˆ(s, a) = ฮฃแตข ฯˆแตข fแตข(s, a) = ฯˆแต€ f(s, a)

Here the weight ฯˆแตข attached to each feature fแตข expresses how strongly that feature is wanted.

Feature matching condition:

E_{ฯ€^{rฯˆ}}[f(s, a)] = E_{ฯ€*}[f(s, a)]

- Left side: the policy that is optimal under the current reward function
- Right side: the unknown optimal (expert) policy, estimated using expert samples
If the feature expectations of the learned policy and of the expert policy are equal, the learned policy matches the expert's features (see the sketch below). On its own this is ambiguous -> use the maximum margin principle.
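To make the matching condition concrete, here is a minimal sketch (my own addition, not from the lecture) of estimating both sides from sampled trajectories; it assumes each trajectory is given as an array of per-step feature vectors f(s, a).

```python
# Minimal sketch: empirical (discounted) feature expectations for feature matching IRL.
import numpy as np

def feature_expectation(trajectories, gamma=1.0):
    """trajectories: list of arrays of shape (T, n_features), one row per (s, a) step."""
    totals = []
    for traj in trajectories:
        discounts = gamma ** np.arange(len(traj))          # 1, gamma, gamma^2, ...
        totals.append((discounts[:, None] * traj).sum(axis=0))
    return np.mean(totals, axis=0)                         # average over trajectories

# Feature matching asks that these two vectors be (approximately) equal:
# expert_fe = feature_expectation(expert_trajs)   # right-hand side, from expert samples
# policy_fe = feature_expectation(policy_trajs)   # left-hand side, from the learned policy
```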
Feature matching IRL

Maximum margin principle -> goal: maximize the margin m

ฯˆแต€ E_{ฯ€*}[f(s, a)] โ‰ฅ max_ฯ€ ฯˆแต€ E_ฯ€[f(s, a)] + m

- Left side: the expectation of the features under the expert policy ฯ€*, dot-producted with ฯˆ, i.e. the expert's expected reward.
- Right side: the expected reward obtained the same way under the best policy we can find, plus the margin m.
Feature matching IRL & maximum margin

Apply the "SVM trick":

ฯˆแต€ E_{ฯ€*}[f(s, a)] โ‰ฅ max_ฯ€ ฯˆแต€ E_ฯ€[f(s, a)] + m          < problem: maximize m >
ฯˆแต€ E_{ฯ€*}[f(s, a)] โ‰ฅ max_ฯ€ ฯˆแต€ E_ฯ€[f(s, a)] + D(ฯ€, ฯ€*)   < approach: minimize the magnitude of the weights ฯˆ themselves >

Here D(ฯ€, ฯ€*) denotes the difference in feature expectations.

Problems
1. The resolution is somewhat arbitrary: it is unclear what meaning the margin actually carries.
2. There is no particular way to handle unskilled or otherwise suboptimal experts.
3. Even for a linear model the constraints are numerous and the optimization is complicated.
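For reference, the "minimize the weight magnitude" reformulation can be written in the usual SVM style; this is my reconstruction of the standard max-margin IRL form, assumed to match the slide rather than copied from it.

```latex
% SVM-style reformulation of max-margin feature matching (assumed standard form)
\min_{\psi} \ \tfrac{1}{2}\lVert \psi \rVert^{2}
\quad \text{s.t.} \quad
\psi^{T} E_{\pi^{*}}[f(s,a)] \;\ge\; \max_{\pi \in \Pi} \ \psi^{T} E_{\pi}[f(s,a)] + D(\pi, \pi^{*})
```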
MaxEnt IRL algorithm

Repeat steps 1-5 above (the numbered steps are in the algorithm figure on the slide).
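Since the numbered steps live in the slide figure, here is a minimal tabular sketch of the standard MaxEnt IRL loop in the spirit of Ziebart et al. (2008); it is my reconstruction, assuming small discrete state/action spaces, known dynamics P, and a linear reward ฯˆแต€f.

```python
# Minimal sketch of tabular MaxEnt IRL (not the slide's code).
import numpy as np

def maxent_irl(P, features, expert_trajs, horizon, lr=0.01, iters=100):
    """P: (S, A, S) transition probs; features: (S, F); expert_trajs: lists of state indices."""
    S, A, _ = P.shape
    psi = np.zeros(features.shape[1])

    # Empirical expert feature expectation (average total feature counts per trajectory).
    expert_fe = np.mean([features[traj].sum(axis=0) for traj in expert_trajs], axis=0)

    for _ in range(iters):
        r = features @ psi                                   # current per-state reward

        # 1-2. Backward pass: soft value iteration; use the final Q as a stationary soft policy.
        V = np.zeros(S)
        for _ in range(horizon):
            Q = r[:, None] + P @ V                           # (S, A) soft backup
            Qmax = Q.max(axis=1, keepdims=True)
            V = (Qmax + np.log(np.exp(Q - Qmax).sum(axis=1, keepdims=True))).squeeze(1)
        policy = np.exp(Q - V[:, None])                      # rows sum to 1

        # 3. Forward pass: expected state visitation frequencies under that policy.
        d = np.zeros(S)
        for traj in expert_trajs:
            d[traj[0]] += 1.0 / len(expert_trajs)            # empirical initial-state distribution
        mu = d.copy()
        for _ in range(horizon - 1):
            d = np.einsum('s,sa,sat->t', d, policy, P)       # propagate one step
            mu += d

        # 4-5. Gradient step: expert feature counts minus expected feature counts.
        psi += lr * (expert_fe - features.T @ mu)
    return psi
```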
MaxEnt IRL Case study : Road navigation
1. A model that predicts a taxi driver's destination from driving data.
2. Discrete state and action space -> large, but still small enough for a tabular representation.
3. Going further, the feature weights themselves are recovered:
   1. Driver interviews were used to survey preferences such as local roads vs. highways and which turns drivers prefer.
   2. The aim was for the reward function to better reflect how human drivers actually drive.
   3. A good example of predicting real-world behavior with only a tabular-sized space.
MaxEnt IRL Case study : MaxEnt Deep IRL
1. Used for tasks such as robot navigation and drawing indoor maps -> the reward needs a complex representation.
2. The state and action spaces are still discrete, but the reward function is a neural network.
3. The environment is filmed continuously with a camera:
   1. The recorded output need not be a raw image; it can encode a large number of features.
   2. A large amount of real data is collected and used to learn the reward function.
Unknown dynamics & large state / action spaces
Extending deep IRL to high-dimensional spaces with unknown dynamics:
- First term: computed from real (expert) data by simply summing rewards, so its computational cost is low.
- Second term: involves the soft-optimal trajectory distribution; the idea is to estimate it from a model-free perspective.
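The "first term" and "second term" refer to the gradient of the MaxEnt log-likelihood shown in the slide figure; a standard form of that objective and its gradient (my reconstruction, assumed to match the slide) is:

```latex
% MaxEnt IRL objective and gradient (standard form)
\mathcal{L}(\psi) = \frac{1}{N}\sum_{i=1}^{N} r_{\psi}(\tau_i) - \log Z(\psi),
\qquad Z(\psi) = \int \exp\!\big(r_{\psi}(\tau)\big)\, d\tau
\nabla_{\psi}\mathcal{L}
  = \underbrace{\frac{1}{N}\sum_{i=1}^{N} \nabla_{\psi} r_{\psi}(\tau_i)}_{\text{first term: sum over expert data}}
  \;-\;
  \underbrace{\mathbb{E}_{\tau \sim p(\tau\mid\psi)}\!\big[\nabla_{\psi} r_{\psi}(\tau)\big]}_{\text{second term: soft-optimal distribution}}
```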
More efficient sample-based updates
1. To learn p(a|s), any MaxEnt IRL algorithm can be used.
2. Learning it naively from a model-free perspective means running the system dynamics over and over, so the time complexity is enormous; with a full RL algorithm in the inner loop it is practically infeasible.
3. Instead of learning the policy to completion, improve it only slightly and then take a gradient step on the reward.
4. But the samples then no longer come from the fully optimized policy, so the estimate is biased.
5. Correct that bias with importance sampling.
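A minimal sketch of that importance-sampling correction (guided-cost-learning style); reward_fn / reward_grad are illustrative helpers for the learned reward, and the trajectories are assumed to come from the current, only partially improved policy with known log-probabilities.

```python
# Minimal sketch: importance-weighted estimate of the MaxEnt IRL gradient.
import numpy as np

def sampled_gradient(expert_taus, policy_taus, policy_logp, reward_fn, reward_grad, psi):
    """grad_psi L ~= E_expert[grad r] - E_soft-opt[grad r], the latter via IS weights."""
    # First term: plain average over expert demonstrations.
    expert_term = np.mean([reward_grad(tau, psi) for tau in expert_taus], axis=0)

    # Second term: samples come from the current policy, not from the soft-optimal
    # distribution p(tau) โˆ exp(r_psi(tau)), so reweight by w โˆ exp(r_psi(tau)) / pi(tau).
    log_w = np.array([reward_fn(tau, psi) - lp for tau, lp in zip(policy_taus, policy_logp)])
    w = np.exp(log_w - log_w.max())            # subtract max for numerical stability
    w /= w.sum()                               # self-normalized importance weights

    sample_term = np.sum([wi * reward_grad(tau, psi)
                          for wi, tau in zip(w, policy_taus)], axis=0)
    return expert_term - sample_term
```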
Connection to Generative Adversarial Networks
There is a close connection to GANs.
Guided cost learning algorithm - Finn et al. ICML 2016
IRL as adversarial optimization
As in an ordinary GAN, training works by discriminating whether a behavior came from the robot (the policy) or from a real demonstration.
Questions
Thank you.
