Introduction to Deep Reinforcement Learning
Khaled Saleh
PhD Researcher at IISRI/ Deakin University
Australia
Agenda
• Motivation
• What is Reinforcement Learning (RL) ?
• Characteristics of RL
• Formulation of the RL Problem
• Different Components of RL
• Taxonomy of Algorithms for Solving RL
• Q-Learning
• Deep Q Network (DQN)
• Policy Gradient Methods
• Inverse RL
• Deep RL/IRL Potential Applications
Motivation
Video credit: Ng et al. NIPS 2007; Google DeepMind 2015
What is Reinforcement Learning (RL) ?
Image credit: Sutton and Barto (1998)
Characteristics of RL
• In comparison to other machine learning paradigms, the following characteristics make RL different:
• No supervision is needed, only a reward signal
• Feedback is delayed, not instantaneous
• Sequential decision making
Formulation of RL
• The most common way to formulate an RL problem is as a Markov Decision Process (MDP)
• One episode of this process forms a finite sequence of states, actions and rewards (see the interaction-loop sketch below):
• s_0, a_0, r_1, s_1, a_1, r_2, s_2, …, s_{n-1}, a_{n-1}, r_n, s_n
Image credit: Wikipedia; Sutton and Barto (1998)
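To make the episode notation concrete, below is a minimal sketch of the agent-environment interaction loop that generates such a sequence. It assumes a Gym-style environment (`reset()`/`step()`) and a placeholder `choose_action` policy; both names are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of one MDP episode, assuming a Gym-style environment
# (env.reset(), env.step()) and a placeholder policy `choose_action`.
def run_episode(env, choose_action):
    states, actions, rewards = [], [], []
    s = env.reset()                       # s_0
    done = False
    while not done:
        a = choose_action(s)              # a_t = pi(s_t)
        s_next, r, done, _ = env.step(a)  # environment returns r_{t+1}, s_{t+1}
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards       # s_0, a_0, r_1, ..., r_n

# Usage sketch: a random policy, e.g.
# states, actions, rewards = run_episode(env, lambda s: env.action_space.sample())
```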
Formulation of RL
• A good policy needs to take into account not only the immediate reward, but also the future rewards we are going to get.
• Thus, the ultimate goal of an RL agent is to select actions that maximize the total future reward.
• Given one run of a Markov decision process, we can easily calculate the total reward for one episode from time step t onward as follows:
• R_t = r_t + r_{t+1} + r_{t+2} + … + r_n
• Due to the inherent uncertainty in the environment, we usually use the discounted future reward instead:
• R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n-t} r_n = r_t + γ R_{t+1}
Components of RL
• An RL agent may include one or more of these components:
• Policy: the agent's behaviour function, a = π(s)
• Value function: a prediction of future reward - how good is each state and/or action
• Model: the agent's representation of the environment; given state s and action a, the model gives us both the reward for that state-action pair and the probability of the next state s′
Components of RL: Policy
Example adapted from: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
• Given the following maze example, the policy π(s) assigns an action to every state (cell of the maze); in the figure, the arrow in each cell shows the action the policy takes there.
Components of RL: Value Function
• Used to evaluate the goodness/badness of states
• And therefore to select between actions:
Q^π(s, a) = max_π R_{t+1}
Taxonomy of Algorithms for Solving RL
• Model-Free
• Policy and/or Value Function
• Model-Based
• Model + Policy and/or Value Function
• Approximated/Learned Model + Policy and/or Value Function
Q-Learning
• Q-learning is a model-free paradigm for learning the value function of an RL problem.
• In Q-learning, we define a function Q(s, a) representing the discounted future reward when we perform action a in state s and continue optimally from that point on:
• Q(s_t, a_t) = max_π R_{t+1}
• Once we have the Q-function, the question of which action to choose in a given state s boils down to:
• π(s) = argmax_a Q(s, a)
Q-Learning (2)
• To obtain the Q-function, we will focus on just one transition <s, a, r, s′>.
• Recall,
R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n-t} r_n = r_t + γ R_{t+1}
• Similarly, we can express the Q-value of state s and action a in terms of the Q-value of the next state s′:
Q(s, a) = r + γ max_{a′} Q(s′, a′)    (the Bellman equation)
Q-Learning (3)
Algorithm adapted from: http://artint.info/html/ArtInt_265.html
• We can then iteratively approximate the Q-function using the Bellman equation, as follows:
Q[s, a] ← Q[s, a] + α (r + γ max_{a′} Q[s′, a′] − Q[s, a])
where α is the learning rate, which controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account.
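To make the update rule concrete, here is a minimal tabular Q-learning sketch. It assumes a Gym-style environment with a small, hashable (discrete) state space and a discrete action space; the function name and hyperparameter values are illustrative, not taken from the slides.

```python
from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch for a Gym-style env with discrete, hashable states."""
    Q = defaultdict(float)                # Q[(s, a)] defaults to 0
    n_actions = env.action_space.n

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration (discussed in more detail later in the deck)
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s_next, r, done, _ = env.step(a)
            # Bellman-based update; alpha is the learning rate
            td_target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q
```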
Deep Q-Networks
• The Q-function can be represented with a neural network that takes the state and action as input and outputs the corresponding Q-value.
• Alternatively, we can take only the game screens as input and output the Q-value for each possible action.
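As an illustration of the second design (screens in, one Q-value per action out), below is a sketch of such a network in PyTorch. The layer sizes roughly follow the Mnih et al. (2015) architecture shown on the next slide, but treat this as an assumption-laden sketch rather than the exact reference implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of a DQN-style Q-network: 4 stacked 84x84 frames in,
    one Q-value per action out (sizes roughly follow Mnih et al. 2015)."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),            # Q(s, a) for every action a
        )

    def forward(self, screens):                   # screens: (batch, 4, 84, 84)
        return self.head(self.conv(screens))
```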
DQN: Atari
Image credit: Mnih et al. Nature 2015
DQN: Training
• Given a transition <s, a, r, s′> and the loss function
L = ½ [r + γ max_{a′} Q(s′, a′) − Q(s, a)]^2
where r + γ max_{a′} Q(s′, a′) is the target and Q(s, a) is the prediction:
1. Do a feedforward pass for the current state s to get the predicted Q-values for all actions.
2. Do a feedforward pass for the next state s′ and calculate the maximum over all network outputs, max_{a′} Q(s′, a′).
3. Set the Q-value target for action a to r + γ max_{a′} Q(s′, a′) (use the max calculated in step 2). For all other actions, set the Q-value target to the same value originally returned in step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
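A minimal sketch of steps 1-4 for a single transition is shown below, assuming a PyTorch Q-network like the earlier sketch and an already-constructed optimizer. Regressing only the output of the taken action is equivalent to setting the other targets to their own predictions (error 0). The helper name `dqn_train_step` is hypothetical.

```python
import torch

def dqn_train_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One update for a single transition <s, a, r, s'> (sketch, batch size 1)."""
    # Step 1: feedforward pass for the current state -> Q-values for all actions.
    q_pred = q_net(s.unsqueeze(0)).squeeze(0)          # shape: (n_actions,)

    # Step 2: feedforward pass for the next state -> max over all network outputs.
    with torch.no_grad():
        q_next_max = q_net(s_next.unsqueeze(0)).max().item()

    # Step 3: target for the taken action only; every other output keeps its own
    # prediction (error 0), so regressing q_pred[a] alone is enough.
    target = r + (0.0 if done else gamma * q_next_max)

    # Step 4: backpropagate L = 1/2 (target - Q(s, a))^2 and update the weights.
    loss = 0.5 * (torch.tensor(target) - q_pred[a]) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```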
DQN: Experience Replay
• One of the engineering tricks that made the training of DQN much more stable.
• During gameplay, all the experiences <s, a, r, s′> are stored in a replay memory.
• When training the network, random samples from the replay memory are used instead of the most recent transition:
1. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum.
2. It makes the training task more similar to usual supervised learning, which simplifies debugging and testing of the algorithm.
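A replay memory can be as simple as a bounded buffer with uniform random sampling. The sketch below is an illustration (capacity and batch size are arbitrary choices), not the exact DeepMind implementation: push transitions during play, sample minibatches during training.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions <s, a, r, s', done>, sampled uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # A random minibatch breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```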
DQN: ε-greedy exploration
• When the Q-network is initialized randomly, its predictions are initially random as well.
• If we pick the action with the highest Q-value, the action will be random and the agent performs crude “exploration”.
• As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.
• Another engineering trick is ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action that has the highest Q-value.
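A minimal sketch of ε-greedy action selection follows. The decay of ε from 1.0 to 0.1 matches the behaviour described in the speaker notes for DQN; the linear schedule, step count, and function signature are assumptions for illustration.

```python
import random
import torch

def epsilon_greedy(q_net, s, n_actions, step,
                   eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    # Linearly anneal epsilon from eps_start down to eps_end over anneal_steps steps.
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / anneal_steps)
    if random.random() < eps:
        return random.randrange(n_actions)           # explore
    with torch.no_grad():
        return int(q_net(s.unsqueeze(0)).argmax())   # exploit: highest Q-value
```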
DQN: Algorithm
Algorithm adapted from: http://artint.info/html/ArtInt_265.html
(Figure: the full DQN algorithm, annotated to show where experience replay and ε-greedy exploration enter the loop.)
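To tie the pieces together, here is a sketch of the overall DQN loop: act ε-greedily, store every transition in the replay memory, and learn from random minibatches. It reuses the hypothetical `ReplayMemory`, `epsilon_greedy`, and Q-network sketches above and assumes observations are already PyTorch tensors (e.g. stacked frames).

```python
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, optimizer, memory, n_actions,
              total_steps=1_000_000, batch_size=32, gamma=0.99):
    """Sketch of the DQN outer loop: act, store, then learn from replay."""
    s = env.reset()                        # assumed to be a torch tensor
    for step in range(total_steps):
        a = epsilon_greedy(q_net, s, n_actions, step)
        s_next, r, done, _ = env.step(a)
        memory.push(s, a, r, s_next, done)

        if len(memory) >= batch_size:
            batch = memory.sample(batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            states = torch.stack(states)
            next_states = torch.stack(next_states)
            actions = torch.tensor(actions)
            rewards = torch.tensor(rewards, dtype=torch.float32)
            dones = torch.tensor(dones, dtype=torch.float32)

            # Q(s, a) for the actions that were actually taken.
            q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            # Targets r + gamma * max_a' Q(s', a'); no bootstrapping at episode end.
            with torch.no_grad():
                q_next = q_net(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * q_next

            loss = F.mse_loss(q_pred, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        s = env.reset() if done else s_next
```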
Policy Gradient Methods
• Another common paradigm for solving the RL problem is to learn the policy directly.
• Learning the policy directly can be much more efficient in the case of continuous action spaces (human locomotion, etc.).
• One of the key families of methods in this paradigm is policy gradient methods (gradient descent, conjugate gradient, quasi-Newton).
• The formulation is as follows: let J(θ) be any policy objective function.
• Policy gradient methods search for a local maximum of J(θ) by ascending the gradient of the policy objective w.r.t. the parameters θ:
Δθ = α ∇_θ J(θ)
where ∇_θ J(θ) is the policy gradient and α is a step-size parameter.
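The slide leaves the choice of gradient estimator open; one common concrete instance is the REINFORCE (score-function) estimator, sketched below for a PyTorch policy that records the log-probability of each sampled action (e.g. via `torch.distributions.Categorical(...).log_prob(a)`). This particular estimator and the helper name are our assumptions, not something stated on the slide.

```python
import torch

def reinforce_update(policy_optimizer, log_probs, rewards, gamma=0.99):
    """REINFORCE sketch: ascend grad J(theta) ~ sum_t grad log pi(a_t|s_t) * R_t."""
    # Discounted return R_t for every step of the episode, computed backwards.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # Gradient ascent on J(theta) is gradient descent on -J(theta).
    loss = -(torch.stack(log_probs) * returns).sum()
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
    return loss.item()
```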
Policy Gradient Methods
Heess, Nicolas, et al. "Emergence of locomotion behaviours in rich environments." arXiv preprint arXiv:1707.02286 (2017).
Inverse RL
Adapted from CS 294: Deep Reinforcement Learning, UC Berkeley, Fall 2017
Inverse RL
• In most real-world applications, the notion of reward is not obvious and can be really hard to specify.
• In the IRL problem, we try to learn the reward (and the transition model as well) from expert or human demonstrations.
Inverse RL: Autonomous Driving
Image credit: Wulfmeier et al. IROS 2016
(Figure panels: Reward; Features.)
Inverse RL: Intent Prediction
Image credit: KITTI Dataset
(Figure: a pedestrian highlighted in a KITTI scene.)
Deep RL/IRL Potential Applications
• Autonomous Navigation
• Semantic Segmentation
• Recommendation Systems
• Chatbots
• Inventory Management
• Power Systems
• Financial investment decisions*
• Medical Sector (Dynamic treatment regime)
* http://pit.ai/
Further Educational Resources
• Reinforcement Learning: An Introduction (Sutton and Barto’s
Book, 2nd Edition)
• David Silver's Reinforcement Learning Course (UCL, 2015)
• CS 294: Deep Reinforcement Learning, Fall 2017
• Deep RL Bootcamp, Summer 2017
DeepMind AlphaGo
Image and Video credit: Google Brain & DeepMind
References
1. Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.
2. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
3. Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first
international conference on Machine learning. ACM, 2004.
4. Cassandra, Anthony Rocco. "Exact and approximate algorithms for partially observable Markov decision processes." (1998).
5. Heess, Nicolas, et al. "Emergence of Locomotion Behaviours in Rich Environments." arXiv preprint arXiv:1707.02286 (2017).
6. Heess, Nicolas, et al. "Learning and Transfer of Modulated Locomotor Controllers." arXiv preprint arXiv:1610.05182 (2016).
7. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep Spatial Autoencoders for Visuomotor
Learning. In ICRA, 2016.
8. Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to Communicate to Solve Riddles with Deep
Distributed Recurrent QNetworks. arXiv:1602.02672, 2016.
9. Sham M Kakade. A Natural Policy Gradient. In NIPS, 2002
10. Nate Kohl and Peter Stone. Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. In ICRA, volume 3, 2004
11. Sascha Lange, Martin Riedmiller, and Arne Voigtlander. Autonomous Reinforcement Learning on Raw Visual Input Data in a Real World
Application. In IJCNN, 2012.
12. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Nature, 521 (7553):436–444, 2015.
13. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end Training of Deep Visuomotor Policies. JMLR, 17(39):1–40,
2016
14. Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, and Ji He. Recurrent Reinforcement Learning: A Hybrid
Approach. arXiv:1509.03044,
15. Wulfmeier, Markus, Dominic Zeng Wang, and Ingmar Posner. "Watch this: Scalable cost-function learning for path planning in urban
environments." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.
Thank You!
Editor's Notes
• #5: In reinforcement learning, we have an agent that interacts with the environment: at each time step it gets an observation from the environment about its state s_t, it executes an action a_t, and it receives a reward r_t from the environment. From the agent's perspective, it only outputs an action and gets as input from the environment an observation s_t and a reward r_t. From the environment's perspective, it outputs both an observation of the agent's state and a reward r_t. The reward is a scalar feedback signal that indicates how well the agent is doing at each time step. The job of the agent is to maximize the cumulative reward.
• #6: Sequential decision making -> the agent's actions affect the subsequent data it receives; that's why time really matters. This is the distinction between RL and supervised learning, where you only make independent predictions for each input sample.
• #7: The set of states and actions, together with the rules for transitioning from one state to another and for getting rewards, make up a Markov decision process. The episode ends with the terminal state s_n (e.g. a "game over" screen). The rules for how you choose those actions are called a policy. A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and the performed action a_i, not on preceding states or actions.
• #8: But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more it may diverge. For that reason it is common to use the discounted future reward. Here γ is the discount factor between 0 and 1 – the further into the future the reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same thing at time step t+1. If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set the discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set the discount factor γ=1.
  • #9: P predicts the next state
• #10: Rewards: -1 per time step -> motivates the agent to finish as quickly as possible. Actions: N, E, S, W. States: the agent's location. Arrows represent the policy π(s) for each state s.
  • #11: Numbers represent value vπ(s) of each state s
• #12: The main distinction: in model-free RL you learn on the job by trial and error, whereas in model-based RL you learn about the environment offline or from demonstrations. Policy-based methods have better convergence properties and are effective in high-dimensional or continuous action spaces.
• #13: The way to think about Q(s,a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function because it represents the "quality" of a certain action in a given state. Once you have the magical Q-function, the answer becomes really simple – pick the action with the highest Q-value!
• #14: This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function. Let's focus on just one transition <s,a,r,s′>. Just like with the discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s′. If you think about it, it is quite logical – the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.
• #15: In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1 the two Q[s,a] terms cancel and the update is exactly the Bellman equation. The max_{a′} Q[s′,a′] that we use to update Q[s,a] is only an estimate, and in the early stages of learning it may be completely wrong. However, the estimates get more and more accurate with every iteration, so that if we perform this update enough times the Q-function converges and represents the true Q-value. The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the existence of each individual brick. This intuitive representation is however game specific. Could we come up with something more universal that would be suitable for all games? The obvious choice is screen pixels: they implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens would have these covered as well.
• #16: In the case of the Breakout Atari game from the first videos, constructing the Q(s,a) table from raw pixels as the state space (84×84×4) would mean an enormous number of possible game states, corresponding to millions of rows in our (s,a) table. This is the point where deep learning steps in. Neural networks are exceptionally good at coming up with good features for highly structured data. We could represent our Q-function with a neural network that takes the state (four game screens) and action as input and outputs the corresponding Q-value. This approach has the advantage that, if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.
• #17: This is a classical convolutional neural network with three convolutional layers, followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers. If you really think about it, pooling layers buy you translation invariance – the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward, and we wouldn't want to discard this information!
• #20: So we could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds. In their system, DeepMind actually decreases ε over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
• #24: A 15-month-old infant can interpret the intentions of another human demonstrator, even when seeing the demonstration for the first time.
• #28: Reinforcement learning is used to develop a distributed control structure for a set of distributed generation sources; the exchange of information between these sources is governed by a communication graph topology. Reinforcement learning algorithms can be built to reduce transit time for stocking as well as retrieving products in a warehouse, optimizing space utilization and warehouse operations. Pit.ai is at the forefront of leveraging reinforcement learning for evaluating trading strategies. A dynamic treatment regime (DTR) is a subject of medical research that sets rules for finding effective treatments for patients. Diseases like cancer demand treatment over a long period, where drugs and treatment levels are administered over time. Reinforcement learning addresses this DTR problem: RL algorithms help process clinical data to come up with a treatment strategy, using various clinical indicators collected from patients as inputs.