Introduction to Deep Reinforcement Learning
Khaled Saleh
PhD Researcher at IISRI/ Deakin University
Australia
Agenda
• Motivation
• What is Reinforcement Learning (RL) ?
• Characteristics of RL
• Formulation of the RL Problem
• Different Components of RL
• Taxonomy of Algorithms for Solving RL
• Q-Learning
• Deep Q Network (DQN)
• Policy Gradient Methods
• Inverse RL
• Deep RL/IRL Potential Applications
Motivation
Video credit: Ng et al. NIPS 2007; Google DeepMind 2015
What is Reinforcement Learning (RL) ?
Image credit: Sutton and Barto (1998)
Characteristics of RL
• In comparison to other machine learning paradigms, the following characteristics make RL different:
• No supervision is needed, only a reward signal
• Feedback is delayed, not instantaneous
• Sequential decision making
Formulation of RL
• The most common way to formulate an RL problem is as a Markov Decision Process (MDP)
• One episode of this process forms a finite sequence of states, actions and rewards (see the interaction-loop sketch below):
• s_0, a_0, r_1, s_1, a_1, r_2, s_2, …, s_{n-1}, a_{n-1}, r_n, s_n
Image credit: Wikipedia; Sutton and Barto (1998)
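To make the episode notation concrete, below is a minimal sketch of the agent-environment interaction loop that generates such a sequence. It assumes a Gym-style environment (`reset()`/`step()`) and a placeholder `choose_action` policy; both names are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of one MDP episode, assuming a Gym-style environment
# (env.reset(), env.step()) and a placeholder policy `choose_action`.
def run_episode(env, choose_action):
    states, actions, rewards = [], [], []
    s = env.reset()                       # s_0
    done = False
    while not done:
        a = choose_action(s)              # a_t = pi(s_t)
        s_next, r, done, _ = env.step(a)  # environment returns r_{t+1}, s_{t+1}
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards       # s_0, a_0, r_1, ..., r_n

# Usage sketch: a random policy, e.g.
# states, actions, rewards = run_episode(env, lambda s: env.action_space.sample())
```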
Formulation of RL
• A good policy needs to take into account not only the immediate reward, but also the future rewards we are going to get.
• Thus, the ultimate goal of an RL agent is to select actions that maximize the total future reward.
• Given one run of a Markov decision process, we can easily calculate the total reward for one episode from time step t onward as follows:
• R_t = r_t + r_{t+1} + r_{t+2} + … + r_n
• Due to the inherent uncertainty in the environment, we usually use the discounted future reward instead:
• R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n-t} r_n = r_t + γ R_{t+1}
Components of RL
• An RL agent may include one or more of these components:
• Policy: the agent's behaviour function, a = π(s)
• Value function: a prediction of future reward - how good is each state and/or action
• Model: the agent's representation of the environment; given state s and action a, the model gives us both the reward for that state-action pair and the probability of the next state s′
Components of RL: Policy
Example adapted from: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
• Given the following maze example, the policy π(s) assigns an action to every state (cell of the maze); in the figure, the arrow in each cell shows the action the policy takes there.
Components of RL: Value Function
• Used to evaluate the goodness/badness of states
• And therefore to select between actions:
Q^π(s, a) = max_π R_{t+1}
Taxonomy of Algorithms for Solving RL
• Model-Free
• Policy and/or Value Function
• Model-Based
• Model + Policy and/or Value Function
• Approximated/Learned Model + Policy and/or Value Function
Q-Learning
• Q-learning is a model-free paradigm for learning the value function of an RL problem.
• In Q-learning, we define a function Q(s, a) representing the discounted future reward when we perform action a in state s and continue optimally from that point on:
• Q(s_t, a_t) = max_π R_{t+1}
• Once we have the Q-function, the question of which action to choose in a given state s boils down to:
• π(s) = argmax_a Q(s, a)
Q-Learning (2)
• To obtain the Q-function, we will focus on just one transition <s, a, r, s′>.
• Recall,
R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n-t} r_n = r_t + γ R_{t+1}
• Similarly, we can express the Q-value of state s and action a in terms of the Q-value of the next state s′:
Q(s, a) = r + γ max_{a′} Q(s′, a′)    (the Bellman equation)
Q-Learning (3)
Algorithm adapted from: http://artint.info/html/ArtInt_265.html
• We can then iteratively approximate the Q-function using the Bellman equation, as follows:
Q[s, a] ← Q[s, a] + α (r + γ max_{a′} Q[s′, a′] − Q[s, a])
where α is the learning rate, which controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account.
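To make the update rule concrete, here is a minimal tabular Q-learning sketch. It assumes a Gym-style environment with a small, hashable (discrete) state space and a discrete action space; the function name and hyperparameter values are illustrative, not taken from the slides.

```python
from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch for a Gym-style env with discrete, hashable states."""
    Q = defaultdict(float)                # Q[(s, a)] defaults to 0
    n_actions = env.action_space.n

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration (discussed in more detail later in the deck)
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s_next, r, done, _ = env.step(a)
            # Bellman-based update; alpha is the learning rate
            td_target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q
```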
Deep Q-Networks
• The Q-function can be represented with a neural network that takes the state and action as input and outputs the corresponding Q-value.
• Alternatively, we can take only the game screens as input and output the Q-value for each possible action.
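As an illustration of the second design (screens in, one Q-value per action out), below is a sketch of such a network in PyTorch. The layer sizes roughly follow the Mnih et al. (2015) architecture shown on the next slide, but treat this as an assumption-laden sketch rather than the exact reference implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of a DQN-style Q-network: 4 stacked 84x84 frames in,
    one Q-value per action out (sizes roughly follow Mnih et al. 2015)."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),            # Q(s, a) for every action a
        )

    def forward(self, screens):                   # screens: (batch, 4, 84, 84)
        return self.head(self.conv(screens))
```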
DQN: Atari
Image credit: Mnih et al. Nature 2015
DQN: Training
• Given a transition <s, a, r, s′> and the loss function
L = ½ [r + γ max_{a′} Q(s′, a′) − Q(s, a)]^2
where r + γ max_{a′} Q(s′, a′) is the target and Q(s, a) is the prediction:
1. Do a feedforward pass for the current state s to get the predicted Q-values for all actions.
2. Do a feedforward pass for the next state s′ and calculate the maximum over all network outputs, max_{a′} Q(s′, a′).
3. Set the Q-value target for action a to r + γ max_{a′} Q(s′, a′) (use the max calculated in step 2). For all other actions, set the Q-value target to the same value originally returned in step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
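A minimal sketch of steps 1-4 for a single transition is shown below, assuming a PyTorch Q-network like the earlier sketch and an already-constructed optimizer. Regressing only the output of the taken action is equivalent to setting the other targets to their own predictions (error 0). The helper name `dqn_train_step` is hypothetical.

```python
import torch

def dqn_train_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One update for a single transition <s, a, r, s'> (sketch, batch size 1)."""
    # Step 1: feedforward pass for the current state -> Q-values for all actions.
    q_pred = q_net(s.unsqueeze(0)).squeeze(0)          # shape: (n_actions,)

    # Step 2: feedforward pass for the next state -> max over all network outputs.
    with torch.no_grad():
        q_next_max = q_net(s_next.unsqueeze(0)).max().item()

    # Step 3: target for the taken action only; every other output keeps its own
    # prediction (error 0), so regressing q_pred[a] alone is enough.
    target = r + (0.0 if done else gamma * q_next_max)

    # Step 4: backpropagate L = 1/2 (target - Q(s, a))^2 and update the weights.
    loss = 0.5 * (torch.tensor(target) - q_pred[a]) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```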
DQN: Experience Replay
• One of the engineering tricks that made the training of DQN much more stable.
• During gameplay, all the experiences <s, a, r, s′> are stored in a replay memory.
• When training the network, random samples from the replay memory are used instead of the most recent transition:
1. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum.
2. It makes the training task more similar to usual supervised learning, which simplifies debugging and testing of the algorithm.
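A replay memory can be as simple as a bounded buffer with uniform random sampling. The sketch below is an illustration (capacity and batch size are arbitrary choices), not the exact DeepMind implementation: push transitions during play, sample minibatches during training.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions <s, a, r, s', done>, sampled uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # A random minibatch breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```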
DQN: ε-greedy exploration
• When the Q-network is initialized randomly, its predictions are initially random as well.
• If we pick the action with the highest Q-value, the action will be random and the agent performs crude “exploration”.
• As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.
• Another engineering trick is ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action that has the highest Q-value.
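A minimal sketch of ε-greedy action selection follows. The decay of ε from 1.0 to 0.1 matches the behaviour described in the speaker notes for DQN; the linear schedule, step count, and function signature are assumptions for illustration.

```python
import random
import torch

def epsilon_greedy(q_net, s, n_actions, step,
                   eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    # Linearly anneal epsilon from eps_start down to eps_end over anneal_steps steps.
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / anneal_steps)
    if random.random() < eps:
        return random.randrange(n_actions)           # explore
    with torch.no_grad():
        return int(q_net(s.unsqueeze(0)).argmax())   # exploit: highest Q-value
```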
DQN: Algorithm
Algorithm adapted from: http://artint.info/html/ArtInt_265.html
(Figure: the full DQN algorithm, annotated to show where experience replay and ε-greedy exploration enter the loop.)
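To tie the pieces together, here is a sketch of the overall DQN loop: act ε-greedily, store every transition in the replay memory, and learn from random minibatches. It reuses the hypothetical `ReplayMemory`, `epsilon_greedy`, and Q-network sketches above and assumes observations are already PyTorch tensors (e.g. stacked frames).

```python
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, optimizer, memory, n_actions,
              total_steps=1_000_000, batch_size=32, gamma=0.99):
    """Sketch of the DQN outer loop: act, store, then learn from replay."""
    s = env.reset()                        # assumed to be a torch tensor
    for step in range(total_steps):
        a = epsilon_greedy(q_net, s, n_actions, step)
        s_next, r, done, _ = env.step(a)
        memory.push(s, a, r, s_next, done)

        if len(memory) >= batch_size:
            batch = memory.sample(batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            states = torch.stack(states)
            next_states = torch.stack(next_states)
            actions = torch.tensor(actions)
            rewards = torch.tensor(rewards, dtype=torch.float32)
            dones = torch.tensor(dones, dtype=torch.float32)

            # Q(s, a) for the actions that were actually taken.
            q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            # Targets r + gamma * max_a' Q(s', a'); no bootstrapping at episode end.
            with torch.no_grad():
                q_next = q_net(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * q_next

            loss = F.mse_loss(q_pred, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        s = env.reset() if done else s_next
```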
Policy Gradient Methods
• Another common paradigm for solving the RL problem is to learn the policy directly.
• Learning the policy directly can be much more efficient in the case of continuous action spaces (human locomotion, etc.).
• One of the key families of methods in this paradigm is policy gradient methods (gradient descent, conjugate gradient, quasi-Newton).
• The formulation is as follows: let J(θ) be any policy objective function.
• Policy gradient methods search for a local maximum of J(θ) by ascending the gradient of the policy objective w.r.t. the parameters θ:
Δθ = α ∇_θ J(θ)
where ∇_θ J(θ) is the policy gradient and α is a step-size parameter.
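The slide leaves the choice of gradient estimator open; one common concrete instance is the REINFORCE (score-function) estimator, sketched below for a PyTorch policy that records the log-probability of each sampled action (e.g. via `torch.distributions.Categorical(...).log_prob(a)`). This particular estimator and the helper name are our assumptions, not something stated on the slide.

```python
import torch

def reinforce_update(policy_optimizer, log_probs, rewards, gamma=0.99):
    """REINFORCE sketch: ascend grad J(theta) ~ sum_t grad log pi(a_t|s_t) * R_t."""
    # Discounted return R_t for every step of the episode, computed backwards.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # Gradient ascent on J(theta) is gradient descent on -J(theta).
    loss = -(torch.stack(log_probs) * returns).sum()
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
    return loss.item()
```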
Policy Gradient Methods
Heess, Nicolas, et al. "Emergence of locomotion behaviours in rich environments." arXiv preprint arXiv:1707.02286 (2017).
Inverse RL
Adapted from CS 294: Deep Reinforcement Learning, UC Berkeley, Fall 2017
Inverse RL
• In most real-world applications, the notion of reward is not obvious and can be really hard to specify.
• In the IRL problem, we try to learn the reward (and the transition model as well) from expert or human demonstrations.
Inverse RL: Autonomous Driving
Image credit: Wulfmeier et al. IROS 2016
(Figure panels: Reward; Features.)
Inverse RL: Intent Prediction
Image credit: KITTI Dataset
(Figure: a pedestrian highlighted in a KITTI scene.)
Deep RL/IRL Potential Applications
• Autonomous Navigation
• Semantic Segmentation
• Recommendation Systems
• Chatbots
• Inventory Management
• Power Systems
• Financial investment decisions*
• Medical Sector (Dynamic treatment regime)
* http://pit.ai/
Further Educational Resources
• Reinforcement Learning: An Introduction (Sutton and Barto’s
Book, 2nd Edition)
• David Silver's Reinforcement Learning Course (UCL, 2015)
• CS 294: Deep Reinforcement Learning, Fall 2017
• Deep RL Bootcamp, Summer 2017
DeepMind AlphaGo
Image and Video credit: Google Brain & DeepMind
References
1. Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.
2. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
3. Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first
international conference on Machine learning. ACM, 2004.
4. Cassandra, Anthony Rocco. "Exact and approximate algorithms for partially observable Markov decision processes." (1998).
5. Heess, Nicolas, et al. "Emergence of Locomotion Behaviours in Rich Environments." arXiv preprint arXiv:1707.02286 (2017).
6. Heess, Nicolas, et al. "Learning and Transfer of Modulated Locomotor Controllers." arXiv preprint arXiv:1610.05182 (2016).
7. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep Spatial Autoencoders for Visuomotor
Learning. In ICRA, 2016.
8. Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to Communicate to Solve Riddles with Deep
Distributed Recurrent QNetworks. arXiv:1602.02672, 2016.
9. Sham M Kakade. A Natural Policy Gradient. In NIPS, 2002
10. Nate Kohl and Peter Stone. Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. In ICRA, volume 3, 2004
11. Sascha Lange, Martin Riedmiller, and Arne Voigtlander. Autonomous Reinforcement Learning on Raw Visual Input Data in a Real World
Application. In IJCNN, 2012.
12. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Nature, 521 (7553):436–444, 2015.
13. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end Training of Deep Visuomotor Policies. JMLR, 17(39):1–40,
2016
14. Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, and Ji He. Recurrent Reinforcement Learning: A Hybrid
Approach. arXiv:1509.03044,
15. Wulfmeier, Markus, Dominic Zeng Wang, and Ingmar Posner. "Watch this: Scalable cost-function learning for path planning in urban
environments." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.
Thank You!
Editor's Notes
• #5: In reinforcement learning, we have an agent that interacts with the environment: at each time step it gets an observation from the environment about its state s_t, it executes an action a_t, and it receives a reward r_t from the environment. From the agent's perspective, it only outputs an action and gets as input from the environment an observation s_t and a reward r_t. From the environment's perspective, it outputs both an observation of the agent's state and a reward r_t. The reward is a scalar feedback signal that indicates how well the agent is doing at each time step. The job of the agent is to maximize the cumulative reward.
• #6: Sequential decision making -> the agent's actions affect the subsequent data it receives; that's why time really matters. This is the distinction between RL and supervised learning, where you only make independent predictions for each input sample.
• #7: The set of states and actions, together with the rules for transitioning from one state to another and for getting rewards, make up a Markov decision process. The episode ends with the terminal state s_n (e.g. a "game over" screen). The rules for how you choose those actions are called a policy. A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and the performed action a_i, not on preceding states or actions.
• #8: But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more it may diverge. For that reason it is common to use the discounted future reward. Here γ is the discount factor between 0 and 1 – the further into the future the reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same thing at time step t+1. If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set the discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set the discount factor γ=1.
  • #9: P predicts the next state
• #10: Rewards: -1 per time step -> motivates the agent to finish as quickly as possible. Actions: N, E, S, W. States: the agent's location. Arrows represent the policy π(s) for each state s.
  • #11: Numbers represent value vπ(s) of each state s
• #12: The main distinction: in model-free RL you learn on the job by trial and error, whereas in model-based RL you learn about the environment offline or from demonstrations. Policy-based methods have better convergence properties and are effective in high-dimensional or continuous action spaces.
• #13: The way to think about Q(s,a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function because it represents the "quality" of a certain action in a given state. Once you have the magical Q-function, the answer becomes really simple – pick the action with the highest Q-value!
• #14: This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function. Let's focus on just one transition <s,a,r,s′>. Just like with the discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s′. If you think about it, it is quite logical – the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.
• #15: In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1 the two Q[s,a] terms cancel and the update is exactly the Bellman equation. The max_{a′} Q[s′,a′] that we use to update Q[s,a] is only an estimate, and in the early stages of learning it may be completely wrong. However, the estimates get more and more accurate with every iteration, so that if we perform this update enough times the Q-function converges and represents the true Q-value. The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the existence of each individual brick. This intuitive representation is however game specific. Could we come up with something more universal that would be suitable for all games? The obvious choice is screen pixels: they implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens would have these covered as well.
• #16: In the case of the Breakout Atari game from the first videos, constructing the Q(s,a) table from raw pixels as the state space (84×84×4) would mean an enormous number of possible game states, corresponding to millions of rows in our (s,a) table. This is the point where deep learning steps in. Neural networks are exceptionally good at coming up with good features for highly structured data. We could represent our Q-function with a neural network that takes the state (four game screens) and action as input and outputs the corresponding Q-value. This approach has the advantage that, if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.
• #17: This is a classical convolutional neural network with three convolutional layers, followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers. If you really think about it, pooling layers buy you translation invariance – the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward, and we wouldn't want to discard this information!
• #20: So we could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds. In their system, DeepMind actually decreases ε over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
• #24: A 15-month-old infant can interpret the intentions of another human demonstrator, even when seeing the demonstration for the first time.
• #28: Reinforcement learning is used to develop a distributed control structure for a set of distributed generation sources; the exchange of information between these sources is governed by a communication graph topology. Reinforcement learning algorithms can be built to reduce transit time for stocking as well as retrieving products in a warehouse, optimizing space utilization and warehouse operations. Pit.ai is at the forefront of leveraging reinforcement learning for evaluating trading strategies. A dynamic treatment regime (DTR) is a subject of medical research that sets rules for finding effective treatments for patients. Diseases like cancer demand treatment over a long period, where drugs and treatment levels are administered over time. Reinforcement learning addresses this DTR problem: RL algorithms help process clinical data to come up with a treatment strategy, using various clinical indicators collected from patients as inputs.