Continuous Deep Q-Learning
with Model-based Acceleration
2016 ICML
S. Gu, T. Lillicrap, I. Sutskever, S. Levine.
Presenter : Hyemin Ahn
Introduction
2016-12-02 CPSLAB (EECS) 2
 Another, and another, improved work on
Deep Reinforcement Learning
 Tries to incorporate the advantages of
Model-free Reinforcement Learning
&&
Model-based Reinforcement Learning
Results : Preview
2016-12-02 CPSLAB (EECS) 3
Reinforcement Learning : overview
2016-12-02 CPSLAB (EECS) 4
Agent
How can we formalize our behavior?
Reinforcement Learning : overview
2016-12-02 CPSLAB (EECS) 5
At each time $t$, the agent receives an observation $x_t$ from the environment $E$.
(Slide image caption, doge-meme style: "wow / so scare / such gun / so many bullets / nice suit btw")
Reinforcement Learning : overview
2016-12-02 CPSLAB (EECS) 6
The agent takes an action $u_t \in \mathcal{U}$ and receives a scalar reward $r_t$.
(Slide diagram: the agent sends $u_t$ to the environment and receives $x_t$.)
Reinforcement Learning : overview
2016-12-02 CPSLAB (EECS) 7
The agent chooses an action according to its current policy $\pi(u_t \mid x_t)$,
which maps each state to a probability distribution over actions.
(Slide diagram: the policy $\pi(u_t \mid x_t)$ assigns probabilities to candidate actions $u_1$, $u_2$.)
Reinforcement Learning : overview
2016-12-02 CPSLAB (EECS) 8
(Slide diagram, MDP: states $x_1, x_2, x_3, \dots$ with actions $u_1, u_2, u_3, \dots$ drawn from $\pi$, transitions $p(x_2 \mid x_1, u_1)$, $p(x_3 \mid x_2, u_2)$, and rewards $r(x_1, u_1)$, $r(x_2, u_2)$, $r(x_3, u_3)$.)

$R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$ : the cumulative sum of discounted rewards over a sequence ($\gamma \in [0,1]$: discounting factor).

$Q^{\pi}(x_t, u_t) = \mathbb{E}[R_t \mid x_t, u_t]$ : the state-action value function.

Objective of RL: find $\pi$ maximizing $\mathbb{E}[R_1]$!
Reinforcement Learning : overview
2016-12-02 CPSLAB (EECS) 9
$Q^{\pi_{\text{trinity}}}(x_t, u_t) < Q^{\pi_{\text{neo}}}(x_t, u_t)$
Reinforcement Learning : overview
2016-12-02 CPSLAB (EECS) 10
• From the environment $E$:
  $x \in \mathcal{X}$ : state
  $u \in \mathcal{U}$ : action
• $\pi(u_t \mid x_t)$ : a policy defining the agent's behavior; it maps each state to a probability distribution over actions.
• With $\mathcal{X}$, $\mathcal{U}$, and an initial state distribution $p(x_1)$, the agent experiences a transition to a new state sampled from the dynamics distribution $p(x_{t+1} \mid x_t, u_t)$.
• $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$ : the sum of future rewards with a discounting factor $\gamma \in [0,1]$.
• Objective of RL: learn a policy $\pi$ maximizing the expected return $\mathbb{E}[R_1]$ (a minimal code sketch of these quantities follows below).
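To make these definitions concrete, here is a minimal sketch (my own naming, not from the slides) of the discounted return $R_t$ and a Monte-Carlo estimate of $Q^\pi(x_0, u_0)$; `env.reset_to(state)`, `env.step(action)`, and `policy(state)` are hypothetical interfaces used only for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{i>=t} gamma^(i-t) * r(x_i, u_i), computed here for t = 0."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def monte_carlo_q(env, policy, x0, u0, gamma=0.99, n_rollouts=100, horizon=200):
    """Monte-Carlo estimate of Q^pi(x0, u0): take u0 in x0, follow pi, average the returns."""
    returns = []
    for _ in range(n_rollouts):
        env.reset_to(x0)                 # hypothetical: restart from the queried state
        x, r = env.step(u0)              # take the queried action first
        rewards = [r]
        for _ in range(horizon - 1):
            x, r = env.step(policy(x))   # then follow the policy pi
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```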
Reinforcement Learning : Model Free?
2016-12-02 CPSLAB (EECS) 11
• Model-free RL is used when the system dynamics $p(x_{t+1} \mid x_t, u_t)$ are not known.
• We define the Q-function $Q^{\pi}(x_t, u_t)$, corresponding to a policy $\pi$, as the expected return from $x_t$ after taking $u_t$ and following $\pi$ thereafter.
• Q-learning learns a greedy deterministic policy, which corresponds to $\mu(x) = \arg\max_u Q(x, u)$.
• The learning objective is to minimize the Bellman error
  $L(\theta^Q) = \mathbb{E}_{x_t \sim \rho^\beta,\, u_t \sim \beta}\left[\left(Q(x_t, u_t \mid \theta^Q) - y_t\right)^2\right]$
   $\beta$ : an arbitrary exploration policy; $\rho^\beta$ : the resulting state visitation frequency of the policy $\beta$,
   $\theta^Q$ : the parameters of the Q-function,
   Assume that there is a fixed target $y_t = r(x_t, u_t) + \gamma\, Q(x_{t+1}, \mu(x_{t+1}))$, which uses the greedy value $Q(x, \mu(x))$.
(A code sketch of this objective follows below.)
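As an illustration, here is a minimal PyTorch-style sketch of this objective (my own naming, not the authors' code): transitions collected under the exploration policy $\beta$ are used to regress $Q$ onto the fixed target $y_t$. The callables `q_net(x, u)`, `target_q_net(x, u)`, and `mu_net(x)` are placeholders.

```python
import torch
import torch.nn.functional as F

def bellman_loss(q_net, target_q_net, mu_net, batch, gamma=0.99):
    """Squared Bellman error on a batch of (x, u, r, x_next) transitions."""
    x, u, r, x_next = batch
    with torch.no_grad():
        u_next = mu_net(x_next)                        # greedy action mu(x')
        y = r + gamma * target_q_net(x_next, u_next)   # fixed target y_t
    q = q_net(x, u)
    return F.mse_loss(q, y)
```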
Continuous Q-Learning with
Normalized Advantage Functions
2016-12-02 CPSLAB (EECS) 12
How did the authors learn a parameterized Q-function with deep learning
when the state-action domain is continuous?

They suggest using a neural network that separately outputs a value-function term
and an advantage term for a given policy $\pi$:
$Q(x, u \mid \theta^Q) = V(x \mid \theta^V) + A(x, u \mid \theta^A)$,
$A(x, u \mid \theta^A) = -\tfrac{1}{2}\,(u - \mu(x \mid \theta^\mu))^T\, P(x \mid \theta^P)\,(u - \mu(x \mid \theta^\mu))$.

 $P(x \mid \theta^P) = L(x \mid \theta^P)\, L(x \mid \theta^P)^T$ : a state-dependent, positive-definite square matrix.
 $L(x \mid \theta^P)$ : a lower-triangular matrix whose entries come from a linear output layer of a neural network.

Because the advantage is a negative quadratic in $u$, the action that maximizes the Q-function is always given by $\mu(x \mid \theta^\mu)$.
(A code sketch of this network head follows below.)
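A minimal sketch of such a normalized-advantage-function head. The layer sizes, the shared torso, and the tanh squashing of $\mu$ are my own choices; only the $V + A$ decomposition with $P = L L^T$ (lower-triangular $L$, exponentiated diagonal so $P$ stays positive definite) follows the slide.

```python
import torch
import torch.nn as nn

class NAFHead(nn.Module):
    """Sketch of a Q-network with a normalized advantage function: Q = V + A."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.action_dim = action_dim
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(x)
        self.mu = nn.Linear(hidden, action_dim)           # mu(x): greedy action
        self.l_entries = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)

    def forward(self, x, u):
        h = self.torso(x)
        V = self.value(h)
        mu = torch.tanh(self.mu(h))
        # Build the lower-triangular L from a linear output layer;
        # exponentiating the diagonal keeps P = L L^T positive definite.
        tril = torch.tril_indices(self.action_dim, self.action_dim, device=x.device)
        L = torch.zeros(x.shape[0], self.action_dim, self.action_dim, device=x.device)
        L[:, tril[0], tril[1]] = self.l_entries(h)
        diag = torch.arange(self.action_dim, device=x.device)
        L[:, diag, diag] = L[:, diag, diag].exp()
        P = L @ L.transpose(1, 2)
        d = (u - mu).unsqueeze(-1)
        A = -0.5 * (d.transpose(1, 2) @ P @ d).squeeze(-1)   # advantage term
        return V + A                                         # Q(x, u) = V(x) + A(x, u)
```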
Continuous Q-Learning with
Normalized Advantage Functions
2016-12-02 CPSLAB (EECS) 13
Trick: assume that we have a target network.
 $Q'(x, u \mid \theta^{Q'})$ : the SLOW-LEARNER (target network).
 $Q(x, u \mid \theta^{Q})$ : the EXPLORER (the network being trained).
 $R$ : the EXPERIENCE CONTAINER (replay buffer of collected transitions).
(A code sketch of one update step using these three pieces follows below.)
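A minimal sketch of how the explorer, the slow-learner, and the experience container could interact in one training step, reusing the hypothetical `bellman_loss` from the earlier sketch; the soft-update rate `tau` and batch size are illustrative, not the paper's hyperparameters. (In NAF the greedy action $\mu$ comes from the same network's head; it is passed separately here only for clarity.)

```python
import random
import torch

def train_step(q_net, target_q_net, mu_net, optimizer, replay_buffer,
               batch_size=64, gamma=0.99, tau=0.001):
    """Sample from the experience container R, minimize the Bellman error of the
    explorer Q, then move the slow-learner Q' a small step toward Q."""
    batch = random.sample(replay_buffer, batch_size)      # R: list of (x, u, r, x_next)
    x, u, r, x_next = map(torch.stack, zip(*batch))
    loss = bellman_loss(q_net, target_q_net, mu_net, (x, u, r, x_next), gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # soft target update: theta' <- tau * theta + (1 - tau) * theta'
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), target_q_net.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
    return loss.item()
```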
Accelerating Learning with Imagination Rollouts
2016-12-02 CPSLAB (EECS) 14
 The sample complexity of model-free algorithms tends to be high when
using high-dimensional function approximators.
 To reduce the sample complexity and accelerate the learning phase,
how about using good exploratory behavior
obtained from trajectory optimization?
Accelerating Learning with Imagination Rollouts
2016-12-02 CPSLAB (EECS) 15
 How about using good exploratory behavior obtained from trajectory optimization?
(Slide diagram: the explorer $Q(x, u \mid \theta^Q)$, the slow-learner $Q'(x, u \mid \theta^{Q'})$, and the real-experience buffer $R$ are combined with a fitted dynamics model $\mathcal{M}$, an iLQG policy $\pi_t^{iLQG}$ and the greedy policy $\mu(x \mid \theta^\mu)$ producing actions $u_t$ for states $x_t$, and additional buffers $B_f$, $B_{old}$, and $R_f$ used for the model-generated rollouts.)
(A code sketch of the imagination-rollout idea follows below.)
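A minimal sketch of the imagination-rollout idea under my own simplifying assumptions: short synthetic rollouts are generated from the fitted dynamics model, branching off states already seen in real data, and stored in a separate buffer that the Q-learner also trains on. `model.predict(x, u)` and `reward_fn(x, u)` are hypothetical interfaces, not the paper's API.

```python
import random
import torch

def imagination_rollouts(model, policy, real_buffer, reward_fn,
                         n_rollouts=10, horizon=10, noise_std=0.1):
    """Generate short model-based ("imagined") rollouts branching off real states."""
    imagined = []
    for _ in range(n_rollouts):
        x, *_ = random.choice(real_buffer)            # branch off a state seen in real data
        for _ in range(horizon):
            u = policy(x)
            u = u + noise_std * torch.randn_like(u)   # exploratory noise
            x_next = model.predict(x, u)              # one step of the fitted model
            imagined.append((x, u, reward_fn(x, u), x_next))
            x = x_next
    return imagined
```

These imagined transitions can then be mixed with real ones when sampling training batches, reducing the number of real environment interactions needed.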
Experiment : Results
2016-12-02 CPSLAB (EECS) 16
Experiment : Results
2016-12-02 CPSLAB (EECS) 17
Editor's Notes
  • #13: Of these, the least novel are the value/advantage decomposition of Q(s,a) and the use of locally-adapted linear-Gaussian dynamics.
  • #14: But we don’t know the target…!
  • #15: But we don’t know the target…!
  • #16: But we don’t know the target…!