Continuous control with deep reinforcement learning
2016-06-28
Taehoon Kim
Motivation
• DQN can only handle
• discrete (not continuous) actions
• low-dimensional action spaces
• A simple approach to adapting DQN to a continuous domain is to discretize the action space
• e.g. a 7-degree-of-freedom system with discretization a_i ∈ {−k, 0, k}
• the action space dimensionality becomes 3^7 = 2187
• an explosion of the number of discrete actions
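A rough Python sketch of this combinatorial blow-up (the value of k and the dimension count are placeholders taken from the example above):

```python
from itertools import product

# Discretize each of 7 action dimensions into {-k, 0, +k}; the number of
# joint discrete actions grows as 3**num_dims.
k = 1.0
bins = (-k, 0.0, k)
num_dims = 7

joint_actions = list(product(bins, repeat=num_dims))
print(len(joint_actions))  # 3**7 = 2187
```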
Contribution
• Presents a model-free, off-policy actor-critic algorithm
• that learns policies in high-dimensional, continuous action spaces
• The work builds on DPG (Deterministic Policy Gradient)
Background
• actions a_t ∈ ℝ^N, action space 𝒜 = ℝ^N
• history of observation-action pairs s_t = (x_1, a_1, …, a_{t−1}, x_t)
• assume full observability, so s_t = x_t
• policy π: 𝒮 → 𝒫(𝒜)
• Model the environment E as a Markov decision process
• initial state distribution p(s_1)
• transition dynamics p(s_{t+1} | s_t, a_t)
Background
• Discounted future reward: R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i)
• Goal of RL is to learn a policy π which maximizes the expected return
• from the start distribution: J = 𝔼_{r_i, s_i∼E, a_i∼π}[R_1]
• Discounted state visitation distribution for a policy π: ρ^π
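A minimal Python sketch of the discounted return above, evaluated for one finite episode by the backward recursion R_t = r_t + γ·R_{t+1} (the reward list is a placeholder):

```python
def discounted_return(rewards, gamma=0.99):
    """R_1 = sum_i gamma**(i-1) * r_i, accumulated backwards over one episode."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99**2 * 2.0 = 2.9602
```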
Background
• action-value function: Q^π(s_t, a_t) = 𝔼_{r_{i≥t}, s_{i>t}∼E, a_{i>t}∼π}[R_t | s_t, a_t]
• the expected return after taking action a_t in state s_t and following policy π
• Bellman equation
• Q^π(s_t, a_t) = 𝔼_{r_t, s_{t+1}∼E}[r(s_t, a_t) + γ 𝔼_{a_{t+1}∼π}[Q^π(s_{t+1}, a_{t+1})]]
• With a deterministic policy μ: 𝒮 → 𝒜
• Q^μ(s_t, a_t) = 𝔼_{r_t, s_{t+1}∼E}[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))]
Background
• The expectation depends only on the environment
• so it is possible to learn Q^μ off-policy, using transitions generated from a different stochastic behavior policy β
• Q-learning (a commonly used off-policy algorithm) uses the greedy policy μ(s) = argmax_a Q(s, a)
• L(θ^Q) = 𝔼_{s_t∼ρ^β, a_t∼β, r_t∼E}[(Q(s_t, a_t | θ^Q) − y_t)^2]
• where y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q)
• To scale Q-learning to large non-linear function approximators, DQN introduced:
• a replay buffer and a separate target network
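A minimal PyTorch sketch of this critic loss, assuming the replay buffer and target networks mentioned above; `critic`, `target_critic`, `target_actor`, and the batch layout are hypothetical names, and a termination mask is added for episodic tasks (not shown on the slide):

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, critic, target_critic, target_actor, gamma=0.99):
    s, a, r, s_next, done = batch                 # minibatch sampled from the replay buffer
    with torch.no_grad():                         # y_t is treated as a fixed target
        a_next = target_actor(s_next)             # mu(s_{t+1})
        y = r + gamma * (1.0 - done) * target_critic(s_next, a_next)
    q = critic(s, a)                              # Q(s_t, a_t | theta_Q)
    return F.mse_loss(q, y)                       # E[(Q(s_t, a_t | theta_Q) - y_t)^2]
```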
Deterministic Policy Gradient (DPG)
• In continuous spaces, finding the greedy policy requires an optimization over a_t at every timestep
• too slow for large, unconstrained function approximators and nontrivial action spaces
• Instead, use an actor-critic approach based on the DPG algorithm
• actor: μ(s | θ^μ): 𝒮 → 𝒜
• critic: Q(s, a | θ^Q)
Learning algorithm
• The actor is updated by applying the chain rule to the expected return from the start distribution J with respect to the actor parameters θ^μ
• ∇_{θ^μ} J ≈ 𝔼_{s∼ρ^β}[∇_{θ^μ} Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t|θ^μ)}]
•            = 𝔼_{s∼ρ^β}[∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t}]
• Silver et al. (2014) proved this is the policy gradient
• the gradient of the policy's performance
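A minimal PyTorch sketch of the corresponding actor update: minimizing −Q(s, μ(s|θ^μ)) lets autograd apply exactly this chain rule. `actor`, `critic`, and `actor_opt` (an optimizer holding only the actor's parameters) are hypothetical names:

```python
def actor_update(s, actor, critic, actor_opt):
    actor_opt.zero_grad()
    loss = -critic(s, actor(s)).mean()   # -E[Q(s, mu(s | theta_mu))]
    loss.backward()                      # grad_a Q * grad_theta_mu mu, via autograd
    actor_opt.step()                     # only theta_mu is updated by actor_opt
    return loss.item()
```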
Contributions
• Introducing non-linear function approximators means that convergence is no longer guaranteed
• but they are essential to learn and generalize on large state spaces
• Contribution
• modifications to DPG, inspired by the success of DQN
• allow neural network function approximators to learn online in large state and action spaces
Challenges 1
• NNs for RL usually assume that samples are i.i.d.
• but when samples are generated by exploring sequentially in an environment, this assumption no longer holds
• As in DQN, a replay buffer is used to address this issue
• As in DQN, target networks are used for stable learning, but with "soft" target updates
• θ′ ← τθ + (1 − τ)θ′, with τ ≪ 1
• the target networks change slowly, which greatly improves the stability of learning
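A minimal PyTorch sketch of the soft update θ′ ← τθ + (1 − τ)θ′, applied to every parameter of a target network (the network names are hypothetical):

```python
import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta', per parameter tensor
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```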
Challenges 2
• When learning from a low-dimensional feature vector, observations may have different physical units (e.g. positions and velocities)
• this makes it difficult to learn effectively and to find hyper-parameters which generalize across environments
• Use batch normalization [Ioffe & Szegedy, 2015] to normalize each dimension across the samples in a minibatch to zero mean and unit variance
• it also maintains running averages of the mean and variance for normalization during testing (exploration or evaluation)
• applied to all layers of μ and to all layers of Q prior to the action input
• allows training on different units without manually ensuring they fall within a set range
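A small PyTorch sketch of this idea, batch-normalizing the state features before the first hidden layer; the state dimension 17 is a placeholder, and in eval() mode the running statistics are used, matching the testing behaviour above:

```python
import torch.nn as nn

state_encoder = nn.Sequential(
    nn.BatchNorm1d(17),    # normalize each state dimension across the minibatch
    nn.Linear(17, 400),
    nn.ReLU(),
)
```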
Challenges 3
• An advantage of off-policy algorithms (such as DDPG) is that the problem of exploration can be treated independently from the learning algorithm
• Construct an exploration policy μ′ by adding noise sampled from a noise process 𝒩
• μ′(s_t) = μ(s_t | θ_t^μ) + 𝒩
• Use an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in problems with inertia
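A minimal NumPy sketch of an Ornstein-Uhlenbeck noise process with the θ and σ reported on the experiment-details slide; μ = 0 and a unit time step are assumptions here:

```python
import numpy as np

class OUNoise:
    def __init__(self, action_dim, theta=0.15, sigma=0.2, mu=0.0):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.x = np.full(action_dim, mu)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, I): noise is correlated across steps
        self.x = self.x + self.theta * (self.mu - self.x) \
                 + self.sigma * np.random.randn(len(self.x))
        return self.x.copy()
```

At each step the exploration action would be μ(s_t) plus one sample from this process, with the process reset at episode boundaries.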
Experiment details
• Adam optimizer: lr_μ = 10^{−4}, lr_Q = 10^{−3}
• Q includes L2 weight decay of 10^{−2}; γ = 0.99
• τ = 0.001
• ReLU for hidden layers; tanh for the output layer of the actor to bound the actions
• Networks: 2 hidden layers with 400 and 300 units
• the action is not included until the 2nd hidden layer of Q
• Final layer weights and biases are initialized from a uniform distribution [−3×10^{−3}, 3×10^{−3}]
• to ensure the initial outputs of the policy and value estimates are near zero
• The other layers are initialized from uniform distributions [−1/√f, 1/√f], where f is the fan-in of the layer
• Replay buffer size ℛ = 10^6; Ornstein-Uhlenbeck process: θ = 0.15, σ = 0.2
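A minimal PyTorch sketch of actor and critic networks matching these hyper-parameters (batch normalization from Challenges 2 is omitted for brevity; class and variable names are mine, not from the slides):

```python
import torch
import torch.nn as nn

def fan_in_init(layer):
    # Hidden layers: U(-1/sqrt(f), 1/sqrt(f)), where f is the layer's fan-in
    bound = 1.0 / layer.weight.size(1) ** 0.5
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

def final_init(layer):
    # Final layer: U(-3e-3, 3e-3) so initial policy/value outputs are near zero
    nn.init.uniform_(layer.weight, -3e-3, 3e-3)
    nn.init.uniform_(layer.bias, -3e-3, 3e-3)

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)
        fan_in_init(self.fc1)
        fan_in_init(self.fc2)
        final_init(self.out)

    def forward(self, s):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(h))
        return torch.tanh(self.out(h))                # tanh bounds the actions

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)   # action enters at the 2nd hidden layer
        self.out = nn.Linear(300, 1)
        fan_in_init(self.fc1)
        fan_in_init(self.fc2)
        final_init(self.out)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=1)))
        return self.out(h)
```

Under the same assumptions, the optimizers would be `torch.optim.Adam(actor.parameters(), lr=1e-4)` and `torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=1e-2)`.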
