© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Principal Solutions Architect – AWS Deep Learning
Amazon Web Services
Game Playing RL Agent
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Inspiration From Nature
https://newatlas.com/bae-smartskin/33458/ https://www.nasa.gov/ames/feature/go-go-green-wing-mighty-morphing-materials-in-aircraft-design
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
A single artificial neuron with inputs I1, I2, bias B, weights w1, w2, w3 and output O:

f(x_i, w_i) = Φ(B + Σ_i (w_i · x_i))

Φ(x) = 1 if x ≥ 0.5, and 0 if x < 0.5

Truth table for P ∧ Q:

P | Q | P ∧ Q
T | T |  T
T | F |  F
F | T |  F
F | F |  F
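As a concrete illustration of the threshold unit above, here is a minimal Python sketch; the weights and bias are illustrative choices (not from the slide) that realize the P ∧ Q truth table:

def phi(x):
    # step activation: output 1 if the weighted sum reaches the 0.5 threshold
    return 1 if x >= 0.5 else 0

def neuron(inputs, weights, bias):
    # f(x_i, w_i) = phi(B + sum_i(w_i * x_i))
    return phi(bias + sum(w * x for w, x in zip(weights, inputs)))

# illustrative weights/bias that implement logical AND (P ∧ Q)
weights, bias = [0.3, 0.3], 0.0
for p in (1, 0):
    for q in (1, 0):
        print(p, q, neuron([p, q], weights, bias))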
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Process of Learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Process of Learning
Agent–environment interaction loop (after Sutton and Barto): at each step t the agent observes state S_t and reward R_t, takes action A_t, and the environment returns the next state S_{t+1} and reward R_{t+1}.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Markov State
An information state (a.k.a. Markov state) contains all
useful information from the history.
A state S_t is Markov if and only if:
P[ S_{t+1} | S_t ] = P[ S_{t+1} | S_1, …, S_t ]
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Expected Return
• Expected return G_t: the sum of future rewards, potentially discounted by a factor
γ where γ ∈ [0, 1]:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bellman Expectation Equations
!" # = % &' (' = # = % )'*+ + -!" #'*+ #' = #
Value of s is the expected return at state s following policy
. subsequently.
This is Bellman Expectation Equation that can be also
expressed as action-value function for policy .
/" #, 1 = % )'*+ + -/" #'*+, 1'*+ #' = #, 1' = 1
= ℛ3
4 + - 5
3678
9336
4
!" #′
Value of taking action a at state s under policy .
.
# → !"(#)
#, 1 → /"(#, 1)
#′ → !"(#′)
1
#′
>
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bellman Equations - Example
(Backup-tree diagrams omitted; the two calculations below correspond to the example trees on the slide, with rewards R = 5, R = 2 and transition probabilities P = .5, .25, .25, .4, .1.)

v(s) = 10 × .5 + 5 × .25 + (−3) × .25 = 5.5

v(s) = 5 × .5 + .5 × [ .4 × 2 + .5 × 5 + .1 × 4.4 ] = 4.4
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimal Policy
(Backup diagram for the max over actions: state s, actions a, reward r, successor states s'.)

v*(s) = max{ −1 + 10, +2 + 5, +3 − 3 } = 9   (successor values 10, 5, −3 with R = −1, R = 2, R = 3)

A policy π is better than π' if v_π(s) ≥ v_π'(s) ∀ s ∈ S
v*(s) ≡ max_π v_π(s) ∀ s ∈ S
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
4 × 4 gridworld with non-terminal states numbered 1–14 and terminal states in two opposite corners.

R_t = −1 for every transition
π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
k = 0: v(s) = 0.00 for all non-terminal states 1–14; the terminal states (0 and 15) stay at 0.

π: random policy, π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
R_t = −1
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
!"#$ 1 = .25× −1 + 0.(0)
"#2 →
+.25× −1 + 0.($)
"#2 ↑
+
.25× −1 + 0.(5)
"#2 ↓
+.25× −1 + 0.(7)
"#2 ←
= −.25 − .25 − .25 − .25 = −9
-1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00
0
0
0.00
1
0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00
0
0
: → . = : ↑ . =
: ↓ . = : ← . = .25
;< = −1
= = 0 = = 1
!"#$ 7 =.25× −1 + 0.(?)
"#2 →
+.25× −1 + 0.(@)
"#2 ↑
+
.25× −1 + 0.($$)
"#2 ↓
+.25× −1 + 0.(A)
"#2 ←
= −.25 − .25 − .25 − .25 = −9
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
!"#$ 1
=.25× −1 + −1.00.($)
"#1 →
+ . 25× −1 + −1.00.(1)
"#1 ↑
+
.25× −1 + −1.00.(4)
"#1 ↓
+.25× −1 + 0.(6)
"#1 ←
= .25× −8 − 8 − 8 − 9 = −9. :;
-1.75 -2.00 -2.00
-2.00 -2.00 -2.00 -2.00
-2.00 -2.00 -2.00 -1.75
-2.00 -2.00 -1.75
0
0
-1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00
0
0
!"#$ 7
= −1 ×.25 − 1.00.(=)
"#1 →
+ −1 ×.25 − 1.00.(>)
"#1 ↑
+
−1 ×.25 − 1.00.(11)
"#1 ↓
+ −1 ×.25 − 1.00
.
.(?)
"#1 ←
=
= .25× −8 − 8 − 8 − 9 = −8
@ → . = @ ↑ . =
@ ↓ . = @ ← . = .25
AB = −1
C = 1 C = 2
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
!"#$ 1
=.25× −1 + −2.00.(0)
"#0 →
+ . 25× −1 + −1.75.(4)
"#0 ↑
+
.25× −1 + −2.00.(6)
"#0 ↓
+.25× −1 + 0.(8)
"#0 ←
= .25× −: − ;. <= − : − > = −;. ?:
-2.43 -2.93 -3.00
-2.43 -2.93 -3.00 -2.93
-2.93 -3.00 -2.93 -2.43
-3.00 -2.93 -2.43
0
0
-1.75 -2.00 -2.00
-1.75 -2.00 -2.00 -2.00
-2.00 -2.00 -2.00 -1.75
-2.00 -2.00 -1.75
0
0
!"#$ 7
= −1 ×.25 − 2.00.(@)
"#0 →
+ −1 ×.25 − 2.00.($)
"#0 ↑
+
−1 ×.25 − 1.75.(44)
"#0 ↓
+ −1 ×.25 − 2.00
.
.(A)
"#0 ←
=
.25× −: − : − ;. <= − : = −;.93
D → . = D ↑ . =
D ↓ . = D ← . = .25
EF = −1
G = 2 G = 3
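A minimal NumPy sketch of these sweeps on the 4 × 4 gridworld (uniform random policy, reward −1 per step, terminal corners); it reproduces the values above up to rounding:

import numpy as np

n = 4
V = np.zeros((n, n))
terminals = {(0, 0), (n - 1, n - 1)}
actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

for k in range(3):  # sweeps k = 1, 2, 3
    V_new = np.zeros_like(V)
    for i in range(n):
        for j in range(n):
            if (i, j) in terminals:
                continue
            for di, dj in actions:
                ni, nj = i + di, j + dj
                if not (0 <= ni < n and 0 <= nj < n):
                    ni, nj = i, j  # off-grid moves leave the state unchanged
                V_new[i, j] += 0.25 * (-1 + V[ni, nj])
    V = V_new

print(np.round(V, 2))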
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Policy Improvement and Control
! "
! → $%
! → &'(()*(")
Evaluation
Improvement
!∗
"∗
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Policy Improvement and Control
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
GridWorld Demo
https://github.com/rlcode/reinforcement-learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Limitation of Dynamic Programming
• Assumes full knowledge of the MDP
• DP uses full-width backups.
• The number of states can grow rapidly.
• Suitable only for medium-sized problems of up to a few million states.
…
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Monte Carlo Learning
• Model-free learning
• Learns from episodes of experience
• All episodes must have a terminal state
… …
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Temporal Difference (TD) Learning
• Learning from episodes of experience.
• Model-Free
• TD learns from incomplete episodes.
• Updating an estimate towards an estimate.
…
TD(1)
TD(2)
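As a sketch of "updating an estimate towards an estimate", a minimal TD(0) state-value update (α and γ are illustrative; V is a dict or array of value estimates):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # bootstraps: move V(s) toward the estimated target r + gamma * V(s')
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])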
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Exploration and Exploitation
• Exploitation is maximizing reward using known
information about a system.
• Going to school, applying to college, choosing a degree such as engineering or
medicine that has a better comparative yield, graduating as quickly as possible by
taking all the recommended degree courses, getting a job, putting money into retirement
schemes, and retiring comfortably in a middle-class house.
• Always following a system based on known information
results in missing out on potentials for better results.
• Going to school, applying to college, choosing a degree such as engineering or
medicine that has a better comparative yield, taking a course on Neural Networks out of
curiosity, changing subject, graduating, starting an AI company, growing the company,
becoming a billionaire, and never retiring :)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q-Learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q-Learning
• Q-learning updates the Q value slightly in the
direction of the best possible next Q value.
(Backup diagram: s, a → r → s', max over a'.)

Q(s, a) ← Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q-Learning Properties
• Model-free
• Change of task (reinforcement) requires re-training
• A special kind of Temporal Difference learning
• Convergence assured only for Markov states
• Tabular approach requires every observed state-action
pair to have an entry
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Action Selection
• Greedy – always pick the action with the highest value
• Break ties randomly
• ε-greedy – choose a random action with a low probability ε, otherwise act greedily
• Softmax – always choose randomly, weighted by the
respective Q-values
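A sketch of the three selection rules in NumPy; `q_values` is assumed to be the vector of Q-values for the current state:

import numpy as np

def greedy(q_values):
    # pick the highest-valued action, breaking ties randomly
    best = np.flatnonzero(q_values == q_values.max())
    return np.random.choice(best)

def epsilon_greedy(q_values, epsilon=0.1):
    # random action with low probability epsilon, otherwise greedy
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return greedy(q_values)

def softmax_action(q_values, temperature=1.0):
    # always random, weighted by exp(Q / temperature)
    prefs = np.exp((q_values - q_values.max()) / temperature)
    return np.random.choice(len(q_values), p=prefs / prefs.sum())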
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reinforcement Function
• Implicitly supplies the goal to the agent
• Designing the function is an art
• Mistakes result in agent learning wrong behavior
• When the agent needs to learn the behavior with the shortest duration,
penalize every action a little for “wasting time”.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q –Learning Demos
https://github.com/dbatalov/reinforcement-learning
Rocket Lander Demo / Grid World Demo
https://github.com/rlcode/reinforcement-learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Tabular Approach and its Limitation
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Universal Function Approximation Theorem
• Let φ(·) be a nonconstant, bounded, and monotonically increasing continuous function.
• Let I_m denote the m-dimensional unit hypercube [0,1]^m.
• The space of continuous functions on I_m is denoted by C(I_m).
Then, given ε > 0 and any function f ∈ C(I_m), there exist
• an integer N
• real constants v_i, b_i ∈ ℝ
• real vectors w_i ∈ ℝ^m
where i = 1, 2, …, N, such that we may define:
F(x) = Σ_{i=1}^{N} v_i φ( w_i^T x + b_i )
as an approximate realization of the function f (where F is independent of φ); that is,
| F(x) − f(x) | < ε
for all x in I_m.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Deep Reinforcement Learning
• An Artificial Neural Network is a
Universal Function
Approximator.
• We can use an ANN to
approximate the function the agent
uses to choose which action to take
to maximize reward.
Check this link for proof of the theorem:
https://en.wikipedia.org/wiki/Universal_approximation_theorem
David Silver
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN Network
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
• The DQN agent achieves >75%
of the human score in 29
out of 49 games
• The DQN agent beats the human
score (>100%) in 22 games

score% = (Agent score − Random-play score) / (Human score − Random-play score) × 100
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN for Breakout
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN Algorithm
• Techniques that increase stability and improve convergence
• ε-greedy Exploration
• Technique: Choose the action given by the current greedy policy with probability (1 − ε) and a random action
with probability ε
• Advantage: Minimizes overfitting of the network
• Experience (s_t, a_t, r_t, s_{t+1}) Replay (see the sketch below)
• Technique: Store the agent’s experiences and use samples from them to update the Q-
network
• Advantage: Removes correlations in the observation sequence
• Periodic update of Q towards the target
• Technique: Every C updates, clone the Q-network and use the clone (Q̂) for
generating targets for the following C updates to the Q-network
• Advantage: Reduces correlations with the target
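A minimal replay-memory sketch for the experience-replay technique; the capacity and batch size are illustrative:

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform sampling removes correlations in the observation sequence
        return random.sample(self.buffer, batch_size)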
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN-Algorithm
DQN and a (cloned) target DQN, with a replay memory D of transitions (s_{t−1}, a_{t−1}, r_{t−1}, s_t), (s_{t−2}, a_{t−2}, r_{t−2}, s_{t−1}), …

Initialize the replay memory D (N = 1M) with random play.
Initialize the DQN Q(s, a; θ) and the cloned target network Q̂(s, a; θ⁻) with random weights θ.

Episode 1 … m: select s_1 and get φ_1.
Time step t:
  a_t = random action with probability ε, else argmax_a Q(φ_t, a; θ)
  Observe reward r_t and move to s_{t+1}
  Add (φ_t, a_t, r_t, φ_{t+1}) to D
  Generate training data:
    U(D) = random sample (minibatch) from D
    For each (φ_j, a_j, r_j, φ_{j+1}) ∈ U(D):
      y_j = r_j                                   if the episode terminates at step j + 1
      y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)   otherwise
  Update the DQN on U(D) with the targets y_j (gradient step on θ)
  Every C steps (10K in the paper), clone the DQN: θ⁻ = θ
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Function Approximation
Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Iterative update:
Q_i(s, a) = E_{s'}[ r + γ max_{a'} Q_{i−1}(s', a') | s, a ],   Q_i → Q* as i → ∞

Function approximation:
Q*(s, a) ≈ Q(s, a; θ)

Modified iterative update:
Q(s, a; θ_i) ≈ E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]

Loss function to minimize:
L_i(θ_i) = E_{s,a,ρ}[ ( E_{s'}[ y | s, a ] − Q(s, a; θ_i) )² ],   where y = r + γ max_{a'} Q(s', a'; θ_{i−1})
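A sketch of this loss in MXNet/Gluon terms, assuming `dqn` and `target_dqn` are networks like the one defined later in the deck and that a batch of transitions has already been sampled from replay memory (all variable names are illustrative):

from mxnet import nd, gluon

loss_fn = gluon.loss.L2Loss()

def dqn_loss(dqn, target_dqn, states, actions, rewards, next_states, dones, gamma=0.99):
    # y = r + gamma * max_a' Q(s', a'; theta_{i-1}); the bootstrap term is dropped at terminal states
    next_q = target_dqn(next_states).max(axis=1)
    targets = rewards + gamma * next_q * (1 - dones)
    # Q(s, a; theta_i) for the actions that were actually taken
    q_taken = nd.pick(dqn(states), actions, axis=1)
    return loss_fn(q_taken, targets)

In a training step this would be wrapped in autograd.record(), followed by backward() and a Trainer step on the online network's parameters.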
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Network Architecture
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
Input φ(s): a stack of 4 preprocessed 84 × 84 frames (84 × 84 × 4). Output: Q(s, a; θ), one value per action.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Deep Convolutional Network - Nature
from mxnet import gluon

# num_action: number of valid actions for the game (defined elsewhere)

DQN = gluon.nn.Sequential()
with DQN.name_scope():
    # first layer
    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8, strides=4, padding=0))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # second layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4, strides=2))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # third layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3, strides=1))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    DQN.add(gluon.nn.Flatten())
    # fourth layer
    DQN.add(gluon.nn.Dense(512, activation='relu'))
    # fifth layer: one output per action
    # (note: the Nature DQN uses a linear output layer; this example applies ReLU)
    DQN.add(gluon.nn.Dense(num_action, activation='relu'))
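A quick sanity check of the network above on the 84 × 84 × 4 input described earlier; the batch size and `num_action` value are illustrative:

import mxnet as mx
from mxnet import nd

num_action = 4  # illustrative; set to the game's action count before building DQN
# ... build DQN as above ...
DQN.initialize(mx.init.Xavier(), ctx=mx.cpu())
frames = nd.random.uniform(shape=(32, 4, 84, 84))  # NCHW: a batch of 4 stacked 84x84 frames
q_values = DQN(frames)
print(q_values.shape)  # expected: (32, num_action)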
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Issues with DQN
• Q-Learning can overestimate action values because of the
maximization term over estimated values.
• Over-estimation has been attributed to noise and to
insufficiently flexible function approximation.
• DQN provides a flexible function approximator.
• The deterministic nature of Atari games eliminates noise.
• Yet DQN still significantly overestimates action values.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Double Q Learning and DDQN
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Double Q-Learning
• The max operator uses the same values both to select and to evaluate an action.
This leads to over-optimism.
• Decoupling action selection from evaluation can prevent this
over-optimism. This is the idea behind Double Q-Learning.
• In Double Q-Learning, two value functions are learned by randomly assigning
experiences to update one of the two, resulting in two sets of
weights, θ and θ′.
• For each update, one set of weights is used to determine the greedy
policy and the other to determine its value.
Y_t^Q    ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)
Y_t^DQN  ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Untangling Evaluation and Action Selection
• For action selection we use θ.
• For evaluation we use θ′.

Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)
      → Y_t^Q ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t )

Y_t^DQN ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
      → Y_t^DoubleQ ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ′_t )
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Over-optimism and Error Estimation – upper bound
• Thrun and Schwartz showed that the upper bound on the
error due to over-optimization, when action values are
uniformly distributed in an interval [−ε, ε], is γε (m − 1)/(m + 1),
where m is the number of actions.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Over-optimism and Error Estimation – lower bound
• Consider a state s at which all true optimal action values are equal: Q*(s, a) = V*(s) for some V*(s).
• Let Q̃ be arbitrary value estimates that are on the
whole unbiased, so that Σ_a (Q̃(s, a) − V*(s)) = 0, but that are not
all correct, such that (1/m) Σ_a (Q̃(s, a) − V*(s))² = C for
some C > 0, where m ≥ 2 is the number of actions in s.
• Then max_a Q̃(s, a) ≥ V*(s) + √( C / (m − 1) ).
• This lower bound is tight. The lower bound on the
absolute error of the Double Q-Learning estimate is zero.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Number of Actions and Bias
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bias in Q-Learning vs Double Q-Learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DDQN
• Use DQN’s online network to select the greedy
action.
• Use DQN’s target network to estimate its value.

Y_t^DoubleDQN ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻ )
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Results
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bayesian DQN or BDQN
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Focusing on Efficient Exploration
• The central claim is that mechanisms such as ε-greedy
exploration are inefficient.
• Thompson Sampling allows for targeted exploration in
higher dimensions but is computationally too expensive.
• BDQN aims to implement Thompson Sampling at
scale through function approximation.
• BDQN combines DQN with a BLR (Bayesian Linear
Regression) model on the last layer.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Thompson Sampling
• Thompson sampling involves maintaining a prior
distribution over the environment models (reward
and/or dynamics)
• The distribution is updated as observations are made
• To choose an action, a sample from the posterior belief
is drawn and an action is selected that maximizes the
expected return under the sampled belief.
• For more information, see “A Tutorial on
Thompson Sampling”, Daniel Russo et al.,
https://arxiv.org/abs/1707.02038
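A minimal Beta-Bernoulli bandit sketch of the idea (this illustrates Thompson Sampling itself, not the BDQN algorithm; the arm probabilities are made up):

import numpy as np

true_probs = [0.3, 0.5, 0.7]              # illustrative Bernoulli arms
alpha = np.ones(len(true_probs))          # Beta posterior: success counts + 1
beta = np.ones(len(true_probs))           # Beta posterior: failure counts + 1

for _ in range(1000):
    sampled = np.random.beta(alpha, beta)       # sample one model from the posterior
    arm = int(np.argmax(sampled))               # act greedily w.r.t. the sampled belief
    reward = np.random.rand() < true_probs[arm]
    alpha[arm] += reward                        # posterior update from the observation
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))  # posterior means concentrate on the best arm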
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TS vs ε-Greedy
• ε-greedy focuses on the greedy action.
• TS explores actions with a higher estimated return with
higher probability.
• A TS-based strategy improves the
exploration/exploitation balance by trading off
the expected returns against their uncertainties,
while the ε-greedy strategy ignores all of this information.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TS vs !-Greedy
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TS vs ε-Greedy
• TS finds the optimal Q-function faster.
• It randomizes over Q-functions with promising returns
and high uncertainty.
• When the true Q-function is selected, its posterior
probability increases.
• When other functions are selected, wrong values are
estimated and their posterior probability goes to zero.
• An ε-greedy agent randomizes its action with probability
ε even after having chosen the true Q-function;
therefore, it takes exponentially many trials to
reach the target.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
BDQN Algorithm
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Network Architecture
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
BDQN Performance
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Closing Words
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Value Alignment
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
References
• DQN: https://www.nature.com/articles/nature14236
• DDQN: https://arxiv.org/abs/1509.06461
• BDQN: https://arxiv.org/abs/1802.04412
• DQN MXNet Code:
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
• DQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DQN.ipynb
• DDQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DDQN.ipynb
• BDQN MXNet/Gluon Code: https://github.com/kazizzad/BDQN-MxNet-Gluon
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.