DQN algorithm
kv
Physics Department, National Taiwan University
kelispinor@gmail.com
These slides are largely credited to David Silver's slides and CS294.
July 16, 2018
Overview
1 Overview
2 Introduction
What is Reinforcement Learning
Markov Decision Process
Dynamic Programming
What is Reinforcement Learning?
RL is a general framework for AI.
RL is for agents with the ability to interact
Each action influences the agent's future states
Success is measured by a scalar reward signal
RL in a nutshell: Select actions to maximize future reward.
Reinforcement Learning Framework
In reinforcement learning, the agent observes the current state S_t, receives
reward R_t, then acts on the environment with action A_t chosen by its
policy.
[Figure: agent–environment loop — the agent sends action a_t to the environment, which returns new state s_{t+1} and reward r_{t+1}.]
Markov Decision Process
Markov Property
The future is independent of the past given the present.
P(S_{t+1} | S_t) = P(S_{t+1} | S_t, S_{t-1}, ..., S_2, S_1)
An MDP is a tuple ⟨S, A, P, R, γ⟩ defined by the following components:
S: state space
A: action space
P(r, s′ | s, a): transition probability for the transition (s, a) → (r, s′)
γ: discount factor
Policy
A policy is any function mapping states to actions, π : S → A
Deterministic policy a = π(s)
Stochastic policy a ∼ π(a|s)
Policy Evaluation and Value Functions
Policy optimization: maximize expected reward with respect to the policy π:
max_π E[Σ_t r_t]
Policy evaluation: compute the expected return for a given π
State value function: V^π(s) = E[Σ_{t=0}^∞ γ^t r_t | S_t = s]
State-action value function: Q^π(s, a) = E[Σ_{t=0}^∞ γ^t r_t | S_t = s, A_t = a]
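To make these definitions concrete, here is a minimal Python sketch; the rollout sampler is hypothetical, and the discount factor is an arbitrary choice:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one trajectory's reward list."""
    return float(np.sum(np.asarray(rewards) * gamma ** np.arange(len(rewards))))

# Monte Carlo estimate of V^pi(s): average discounted returns over many
# rollouts; sample_rollout(s, pi) is a hypothetical environment helper.
# returns = [discounted_return(sample_rollout(s, pi), 0.99) for _ in range(1000)]
# v_s = np.mean(returns)
```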
Value Functions
Q-function or state-action value function: expected total reward from
state s and action a under a policy π
Q^π(s, a) = E_π[r_0 + γ r_1 + γ² r_2 + ... | s_0 = s, a_0 = a]    (1)
State value function: expected (long-term) return starting from state s
V^π(s) = E_π[r_0 + γ r_1 + γ² r_2 + ... | S_t = s]    (2)
       = E_{a∼π}[Q^π(s, a) | S_t = s]    (3)
Advantage function
A^π(s, a) = Q^π(s, a) − V^π(s)    (4)
Bellman Equation
The state-action value function can be unrolled recursively
Q^π(s, a) = E[r_0 + γ r_1 + γ² r_2 + ... | s, a]    (5)
          = E_{s′}[r + γ Q^π(s′, a′) | s, a]    (6)
The optimal Q-function Q∗(s, a) can be unrolled recursively
Q∗(s, a) = E_{s′}[r + γ max_{a′} Q∗(s′, a′) | s, a]    (7)
The value iteration algorithm solves the Bellman equation by iterating this backup
Q_{i+1}(s, a) = E_{s′}[r + γ max_{a′} Q_i(s′, a′) | s, a]    (8)
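A minimal tabular sketch of this iteration in Python; the toy MDP (random transitions P[s, a, s′] and rewards R[s, a]) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s'], rows sum to 1
R = rng.normal(size=(n_s, n_a))                   # expected reward R(s, a)

Q = np.zeros((n_s, n_a))
for _ in range(200):
    # Q_{i+1}(s, a) = R(s, a) + gamma * E_{s'}[ max_{a'} Q_i(s', a') ]
    Q = R + gamma * P @ Q.max(axis=1)

print(Q.max(axis=1))  # approximate optimal state values V*(s)
```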
Bellman Backup Operator
The Q-function with an explicit time index
Q^π(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[r_0 + γ E_{a_1∼π} Q^π(s_1, a_1)]    (9)
Define the Bellman backup operator T^π, acting on Q-functions
[T^π Q](s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[r_0 + γ E_{a_1∼π} Q(s_1, a_1)]    (10)
Q^π is a fixed point of T^π
T^π Q^π = Q^π    (11)
If we apply T^π repeatedly to any Q, the sequence converges to Q^π
Q, T^π Q, (T^π)² Q, ... → Q^π    (12)
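A sketch of this fixed-point iteration for policy evaluation, again on an invented toy MDP; because T^π is a γ-contraction, the loop converges regardless of the initial Q:

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s']
R = rng.normal(size=(n_s, n_a))                   # R(s, a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)        # stochastic policy pi[s, a]

def bellman_backup(Q):
    """[T^pi Q](s, a) = R(s, a) + gamma * E_{s'} E_{a'~pi} Q(s', a')."""
    v = (pi * Q).sum(axis=1)      # E_{a'~pi} Q(s', a') for each s'
    return R + gamma * P @ v

Q = np.zeros((n_s, n_a))
for _ in range(200):              # Q, T^pi Q, (T^pi)^2 Q, ... -> Q^pi
    Q = bellman_backup(Q)
```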
Introducing Q∗
Let π∗ denote an optimal policy.
Q∗(s, a) = Q^{π∗}(s, a) = max_π Q^π(s, a)
The optimal policy satisfies π∗(s) = argmax_a Q∗(s, a)
Then the Bellman equation
Q^π(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[r_0 + γ E_{a_1∼π} Q^π(s_1, a_1)]    (13)
becomes
Q∗(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[r_0 + γ max_{a_1} Q∗(s_1, a_1)]    (14)
We can also define a corresponding Bellman backup operator.
Bellman Backup Operator on Q∗
The Bellman optimality backup operator T, acting on a Q-function
[T Q](s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[r_0 + γ max_{a_1} Q(s_1, a_1)]    (15)
Q∗ is a fixed point of T
T Q∗ = Q∗    (16)
If we apply T repeatedly to any Q, the sequence converges to Q∗
Q, T Q, T² Q, ... → Q∗    (17)
Deep Q-Learning
Represent the value function by a deep Q-network with weights w
Q(s, a; w) ≈ Q^π(s, a)
The objective on Q-values is the mean-squared error
L(w) = E[(r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w))²]
where r + γ max_{a′} Q(s′, a′; w) is the TD target.
Q-learning gradient
∂L(w)/∂w = E[(r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w)) ∂Q(s, a; w)/∂w]
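A sketch of this loss and gradient step in PyTorch; the network sizes (4 state dimensions, 2 actions) and the optimizer are placeholder choices, not from the slides:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def td_step(s, a, r, s_next, done):
    """One gradient step on the squared TD error for a batch of transitions.

    a: LongTensor of action indices; done: float tensor in {0, 1}.
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; w)
    with torch.no_grad():  # treat the TD target as a constant
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```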
Deep Q-Learning
Backup estimate: T Q_t = r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
To approximate Q ← T Q_t, minimize ‖T Q_t − Q(s_t, a_t)‖²
Caveat: T is a contraction under ‖·‖_∞ but not under ‖·‖_2, so this regression step is not guaranteed to converge.
Stability Issues
1 Data is sequential
Successive samples are highly correlated (non-i.i.d.)
2 Policy changes rapidly with slight changes in Q-values
π may oscillate
The data distribution may swing
3 The scale of rewards and Q-values is unknown
Large gradients can cause unstable backpropagation
Deep Q Network
Proposed solutions
1 Use experience replay
Breaks correlations in the data, restoring an i.i.d.-like setting
2 Freeze the target network
The old Q-function is frozen for many timesteps between updates
Breaks the correlation between the Q-function and its target
3 Clip rewards and normalize adaptively to a sensible range
Yields robust gradients
Stabilize DQN: Experience Replay
Goal: remove correlations by building a dataset of the agent's experience
a_t is sampled from an ε-greedy policy
Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample random mini-batches of transitions (s, a, r, s′) from D
Optimize the MSE between the Q-network and the Q-learning target
L(w) = E_{(s,a,r,s′)∼D}[(r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w))²]
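A minimal replay memory sketch; the capacity and interface are arbitrary choices for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO replay memory D; uniform random sampling breaks temporal correlation."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)
```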
Stabilize DQN: Fixed Target
Goal: avoid oscillations by fixing the parameters used in the target
Compute the Q-learning target with respect to old, fixed parameters w⁻
r + γ max_{a′} Q(s′, a′; w⁻)
Optimize the MSE between the Q-network and the Q-learning target
L(w) = E_{(s,a,r,s′)∼D}[(r + γ max_{a′} Q(s′, a′; w⁻) − Q(s, a; w))²]
where r + γ max_{a′} Q(s′, a′; w⁻) is the fixed target.
Periodically update fixed parameters w− ← w
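A sketch of the fixed-target mechanism in PyTorch; the network shape is a placeholder and the update period C is a tuning choice:

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)  # w-: frozen copy used only to compute targets
for p in target_net.parameters():
    p.requires_grad_(False)

def fixed_target(r, s_next, done, gamma=0.99):
    """r + gamma * max_a' Q(s', a'; w-), with w- held fixed."""
    with torch.no_grad():
        return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

# Every C steps (e.g. C = 10_000, a tuning choice): w- <- w
# target_net.load_state_dict(q_net.state_dict())
```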
Stabilize DQN: Reward/Value Range
Clip rewards to [−1, 1]
This keeps gradients well-conditioned
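In code this is a one-liner; a minimal sketch:

```python
import numpy as np

def clip_reward(r):
    """Clip each per-step reward to [-1, 1] before it enters the TD target."""
    return float(np.clip(r, -1.0, 1.0))
```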
DQN in Atari
Figure: Deep Q Learning
DQN in Atari
End-to-end learning of Q from pixels s
Input s is a stack of the last 4 frames
Output: Q(s, a) for each of 18 actions
Reward is the change in score for that step
Figure: Q-Network Architecture
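A sketch of such a Q-network in PyTorch, following the convolutional architecture reported in Mnih et al. (2015) for 84×84 inputs; the slide's exact architecture may differ:

```python
import torch.nn as nn

def make_q_network(n_actions=18, in_frames=4):
    """Pixels -> Q(s, a): conv stack over the 4 stacked frames, then two FC layers."""
    return nn.Sequential(
        nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map from 84x84 input
        nn.Linear(512, n_actions),              # one Q-value per action
    )
```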
DQN Results
[Figure: DQN results]
Do Q-values have meaning?
But Q-values are usually overestimated.
Double Q-Learning
E_{X_1,X_2}[max(X_1, X_2)] ≥ max(E[X_1], E[X_2])
Q-values are noisy and overestimated
Solution: use two networks and compute the max with the other network
Q_A(s, a) ← r + γ Q_B(s′, argmax_{a′} Q_A(s′, a′))
Q_B(s, a) ← r + γ Q_A(s′, argmax_{a′} Q_B(s′, a′))
Original DQN
Q(s, a) ← r + γ max_{a′} Q_target(s′, a′) = r + γ Q_target(s′, argmax_{a′} Q_target(s′, a′))
Double DQN
Q(s, a) ← r + γ Q_target(s′, argmax_{a′} Q(s′, a′))    (18)
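A sketch of the Double DQN target in PyTorch (network shapes are placeholders): the online network selects the action, the target network evaluates it:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def double_dqn_target(r, s_next, done, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a' Q(s', a'; w))."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # select with online net
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate with target net
        return r + gamma * (1 - done) * q_eval
```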