© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Principal Solutions Architect – AWS Deep Learning
Amazon Web Services
Game Playing RL Agent
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Inspiration From Nature
https://newatlas.com/bae-smartskin/33458/ https://www.nasa.gov/ames/feature/go-go-green-wing-mighty-morphing-materials-in-aircraft-design
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
A single artificial neuron with inputs I1, I2, bias B, weights w1, w2, w3 and output O:

f(x_i, w_i) = Φ(B + Σ_i (w_i · x_i))

Φ(x) = 1 if x ≥ 0.5, and 0 if x < 0.5

Truth table for P ∧ Q:

P | Q | P ∧ Q
T | T |  T
T | F |  F
F | T |  F
F | F |  F
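As a concrete illustration of the threshold unit above, here is a minimal Python sketch; the weights and bias are illustrative choices (not from the slide) that realize the P ∧ Q truth table:

def phi(x):
    # step activation: output 1 if the weighted sum reaches the 0.5 threshold
    return 1 if x >= 0.5 else 0

def neuron(inputs, weights, bias):
    # f(x_i, w_i) = phi(B + sum_i(w_i * x_i))
    return phi(bias + sum(w * x for w, x in zip(weights, inputs)))

# illustrative weights/bias that implement logical AND (P ∧ Q)
weights, bias = [0.3, 0.3], 0.0
for p in (1, 0):
    for q in (1, 0):
        print(p, q, neuron([p, q], weights, bias))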
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Process of Learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Process of Learning
Agent–environment interaction loop (after Sutton and Barto): at each step t the agent observes state S_t and reward R_t, takes action A_t, and the environment returns the next state S_{t+1} and reward R_{t+1}.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Markov State
An information state (a.k.a. Markov state) contains all
useful information from the history.
A state S_t is Markov if and only if:
P[ S_{t+1} | S_t ] = P[ S_{t+1} | S_1, …, S_t ]
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Expected Return
• Expected return G_t: the sum of future rewards, potentially discounted by a factor
γ where γ ∈ [0, 1]:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bellman Expectation Equations
!" # = % &' (' = # = % )'*+ + -!" #'*+ #' = #
Value of s is the expected return at state s following policy
. subsequently.
This is Bellman Expectation Equation that can be also
expressed as action-value function for policy .
/" #, 1 = % )'*+ + -/" #'*+, 1'*+ #' = #, 1' = 1
= ℛ3
4 + - 5
3678
9336
4
!" #′
Value of taking action a at state s under policy .
.
# → !"(#)
#, 1 → /"(#, 1)
#′ → !"(#′)
1
#′
>
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bellman Equations - Example
(Backup-tree diagrams omitted; the two calculations below correspond to the example trees on the slide, with rewards R = 5, R = 2 and transition probabilities P = .5, .25, .25, .4, .1.)

v(s) = 10 × .5 + 5 × .25 + (−3) × .25 = 5.5

v(s) = 5 × .5 + .5 × [ .4 × 2 + .5 × 5 + .1 × 4.4 ] = 4.4
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimal Policy
(Backup diagram for the max over actions: state s, actions a, reward r, successor states s'.)

v*(s) = max{ −1 + 10, +2 + 5, +3 − 3 } = 9   (successor values 10, 5, −3 with R = −1, R = 2, R = 3)

A policy π is better than π' if v_π(s) ≥ v_π'(s) ∀ s ∈ S
v*(s) ≡ max_π v_π(s) ∀ s ∈ S
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
4 × 4 gridworld with non-terminal states numbered 1–14 and terminal states in two opposite corners.

R_t = −1 for every transition
π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
k = 0: v(s) = 0.00 for all non-terminal states 1–14; the terminal states (0 and 15) stay at 0.

π: random policy, π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
R_t = −1
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
!"#$ 1 = .25× −1 + 0.(0)
"#2 →
+.25× −1 + 0.($)
"#2 ↑
+
.25× −1 + 0.(5)
"#2 ↓
+.25× −1 + 0.(7)
"#2 ←
= −.25 − .25 − .25 − .25 = −9
-1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00
0
0
0.00
1
0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00
0
0
: → . = : ↑ . =
: ↓ . = : ← . = .25
;< = −1
= = 0 = = 1
!"#$ 7 =.25× −1 + 0.(?)
"#2 →
+.25× −1 + 0.(@)
"#2 ↑
+
.25× −1 + 0.($$)
"#2 ↓
+.25× −1 + 0.(A)
"#2 ←
= −.25 − .25 − .25 − .25 = −9
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
!"#$ 1
=.25× −1 + −1.00.($)
"#1 →
+ . 25× −1 + −1.00.(1)
"#1 ↑
+
.25× −1 + −1.00.(4)
"#1 ↓
+.25× −1 + 0.(6)
"#1 ←
= .25× −8 − 8 − 8 − 9 = −9. :;
-1.75 -2.00 -2.00
-2.00 -2.00 -2.00 -2.00
-2.00 -2.00 -2.00 -1.75
-2.00 -2.00 -1.75
0
0
-1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00
0
0
!"#$ 7
= −1 ×.25 − 1.00.(=)
"#1 →
+ −1 ×.25 − 1.00.(>)
"#1 ↑
+
−1 ×.25 − 1.00.(11)
"#1 ↓
+ −1 ×.25 − 1.00
.
.(?)
"#1 ←
=
= .25× −8 − 8 − 8 − 9 = −8
@ → . = @ ↑ . =
@ ↓ . = @ ← . = .25
AB = −1
C = 1 C = 2
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Iterative Policy Evaluation
!"#$ 1
=.25× −1 + −2.00.(0)
"#0 →
+ . 25× −1 + −1.75.(4)
"#0 ↑
+
.25× −1 + −2.00.(6)
"#0 ↓
+.25× −1 + 0.(8)
"#0 ←
= .25× −: − ;. <= − : − > = −;. ?:
-2.43 -2.93 -3.00
-2.43 -2.93 -3.00 -2.93
-2.93 -3.00 -2.93 -2.43
-3.00 -2.93 -2.43
0
0
-1.75 -2.00 -2.00
-1.75 -2.00 -2.00 -2.00
-2.00 -2.00 -2.00 -1.75
-2.00 -2.00 -1.75
0
0
!"#$ 7
= −1 ×.25 − 2.00.(@)
"#0 →
+ −1 ×.25 − 2.00.($)
"#0 ↑
+
−1 ×.25 − 1.75.(44)
"#0 ↓
+ −1 ×.25 − 2.00
.
.(A)
"#0 ←
=
.25× −: − : − ;. <= − : = −;.93
D → . = D ↑ . =
D ↓ . = D ← . = .25
EF = −1
G = 2 G = 3
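A minimal NumPy sketch of these sweeps on the 4 × 4 gridworld (uniform random policy, reward −1 per step, terminal corners); it reproduces the values above up to rounding:

import numpy as np

n = 4
V = np.zeros((n, n))
terminals = {(0, 0), (n - 1, n - 1)}
actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

for k in range(3):  # sweeps k = 1, 2, 3
    V_new = np.zeros_like(V)
    for i in range(n):
        for j in range(n):
            if (i, j) in terminals:
                continue
            for di, dj in actions:
                ni, nj = i + di, j + dj
                if not (0 <= ni < n and 0 <= nj < n):
                    ni, nj = i, j  # off-grid moves leave the state unchanged
                V_new[i, j] += 0.25 * (-1 + V[ni, nj])
    V = V_new

print(np.round(V, 2))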
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Policy Improvement and Control
! "
! → $%
! → &'(()*(")
Evaluation
Improvement
!∗
"∗
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Policy Improvement and Control
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
GridWorld Demo
https://github.com/rlcode/reinforcement-learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Limitation of Dynamic Programming
• Assumes full knowledge of the MDP
• DP uses full-width backups.
• The number of states can grow rapidly.
• Suitable only for medium-sized problems of up to a few million states.
…
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Monte Carlo Learning
• Model-free learning
• Learns from episodes of experience
• All episodes must have a terminal state
… …
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Temporal Difference (TD) Learning
• Learning from episodes of experience.
• Model-Free
• TD learns from incomplete episodes.
• Updating an estimate towards an estimate.
…
TD(1)
TD(2)
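As a sketch of "updating an estimate towards an estimate", a minimal TD(0) state-value update (α and γ are illustrative; V is a dict or array of value estimates):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # bootstraps: move V(s) toward the estimated target r + gamma * V(s')
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])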
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Exploration and Exploitation
• Exploitation is maximizing reward using known
information about a system.
• Going to school, applying to college, choosing a degree such as engineering or
medicine that has a better comparative yield, graduating as quickly as possible by
taking all the recommended degree courses, getting a job, putting money into retirement
schemes, and retiring comfortably in a middle-class house.
• Always following a system based on known information
results in missing out on potentials for better results.
• Going to school, applying to college, choosing a degree such as engineering or
medicine that has a better comparative yield, taking a course on Neural Networks out of
curiosity, changing subject, graduating, starting an AI company, growing the company,
becoming a billionaire, and never retiring :)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q-Learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q-Learning
• Q-learning updates the Q value slightly in the
direction of the best possible next Q value.
(Backup diagram: s, a → r → s', max over a'.)

Q(s, a) ← Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q-Learning Properties
• Model-free
• Change of task (reinforcement) requires re-training
• A special kind of Temporal Difference learning
• Convergence assured only for Markov states
• Tabular approach requires every observed state-action
pair to have an entry
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Action Selection
• Greedy – always pick the action with the highest value
• Break ties randomly
• ε-greedy – choose a random action with a low probability ε, otherwise act greedily
• Softmax – always choose randomly, weighted by the
respective Q-values
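A sketch of the three selection rules in NumPy; `q_values` is assumed to be the vector of Q-values for the current state:

import numpy as np

def greedy(q_values):
    # pick the highest-valued action, breaking ties randomly
    best = np.flatnonzero(q_values == q_values.max())
    return np.random.choice(best)

def epsilon_greedy(q_values, epsilon=0.1):
    # random action with low probability epsilon, otherwise greedy
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return greedy(q_values)

def softmax_action(q_values, temperature=1.0):
    # always random, weighted by exp(Q / temperature)
    prefs = np.exp((q_values - q_values.max()) / temperature)
    return np.random.choice(len(q_values), p=prefs / prefs.sum())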
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reinforcement Function
• Implicitly supplies the goal to the agent
• Designing the function is an art
• Mistakes result in agent learning wrong behavior
• When the agent needs to learn the behavior with the shortest duration,
penalize every action a little for “wasting time”.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Q –Learning Demos
https://github.com/dbatalov/reinforcement-learning
Rocket Lander Demo / Grid World Demo
https://github.com/rlcode/reinforcement-learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Tabular Approach and its Limitation
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Universal Function Approximation Theorem
• Let φ(·) be a nonconstant, bounded, and monotonically increasing continuous function.
• Let I_m denote the m-dimensional unit hypercube [0,1]^m.
• The space of continuous functions on I_m is denoted by C(I_m).
Then, given ε > 0 and any function f ∈ C(I_m), there exist
• an integer N
• real constants v_i, b_i ∈ ℝ
• real vectors w_i ∈ ℝ^m
where i = 1, 2, …, N, such that we may define:
F(x) = Σ_{i=1}^{N} v_i φ( w_i^T x + b_i )
as an approximate realization of the function f (where F is independent of φ); that is,
| F(x) − f(x) | < ε
for all x in I_m.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Deep Reinforcement Learning
• An Artificial Neural Network is a
Universal Function
Approximator.
• We can use an ANN to
approximate the function the agent
uses to choose which action to take
to maximize reward.
Check this link for proof of the theorem:
https://en.wikipedia.org/wiki/Universal_approximation_theorem
David Silver
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN Network
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
• The DQN agent achieves >75%
of the human score in 29
out of 49 games
• The DQN agent beats the human
score (>100%) in 22 games

score% = (Agent score − Random-play score) / (Human score − Random-play score) × 100
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN for Breakout
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN Algorithm
• Techniques that increase stability and improve convergence
• ε-greedy Exploration
• Technique: Choose the action given by the current greedy policy with probability (1 − ε) and a random action
with probability ε
• Advantage: Minimizes overfitting of the network
• Experience (s_t, a_t, r_t, s_{t+1}) Replay (see the sketch below)
• Technique: Store the agent’s experiences and use samples from them to update the Q-
network
• Advantage: Removes correlations in the observation sequence
• Periodic update of Q towards the target
• Technique: Every C updates, clone the Q-network and use the clone (Q̂) for
generating targets for the following C updates to the Q-network
• Advantage: Reduces correlations with the target
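A minimal replay-memory sketch for the experience-replay technique; the capacity and batch size are illustrative:

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform sampling removes correlations in the observation sequence
        return random.sample(self.buffer, batch_size)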
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DQN-Algorithm
DQN and a (cloned) target DQN, with a replay memory D of transitions (s_{t−1}, a_{t−1}, r_{t−1}, s_t), (s_{t−2}, a_{t−2}, r_{t−2}, s_{t−1}), …

Initialize the replay memory D (N = 1M) with random play.
Initialize the DQN Q(s, a; θ) and the cloned target network Q̂(s, a; θ⁻) with random weights θ.

Episode 1 … m: select s_1 and get φ_1.
Time step t:
  a_t = random action with probability ε, else argmax_a Q(φ_t, a; θ)
  Observe reward r_t and move to s_{t+1}
  Add (φ_t, a_t, r_t, φ_{t+1}) to D
  Generate training data:
    U(D) = random sample (minibatch) from D
    For each (φ_j, a_j, r_j, φ_{j+1}) ∈ U(D):
      y_j = r_j                                   if the episode terminates at step j + 1
      y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)   otherwise
  Update the DQN on U(D) with the targets y_j (gradient step on θ)
  Every C steps (10K in the paper), clone the DQN: θ⁻ = θ
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Function Approximation
Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Iterative update:
Q_i(s, a) = E_{s'}[ r + γ max_{a'} Q_{i−1}(s', a') | s, a ],   Q_i → Q* as i → ∞

Function approximation:
Q*(s, a) ≈ Q(s, a; θ)

Modified iterative update:
Q(s, a; θ_i) ≈ E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]

Loss function to minimize:
L_i(θ_i) = E_{s,a,ρ}[ ( E_{s'}[ y | s, a ] − Q(s, a; θ_i) )² ],   where y = r + γ max_{a'} Q(s', a'; θ_{i−1})
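A sketch of this loss in MXNet/Gluon terms, assuming `dqn` and `target_dqn` are networks like the one defined later in the deck and that a batch of transitions has already been sampled from replay memory (all variable names are illustrative):

from mxnet import nd, gluon

loss_fn = gluon.loss.L2Loss()

def dqn_loss(dqn, target_dqn, states, actions, rewards, next_states, dones, gamma=0.99):
    # y = r + gamma * max_a' Q(s', a'; theta_{i-1}); the bootstrap term is dropped at terminal states
    next_q = target_dqn(next_states).max(axis=1)
    targets = rewards + gamma * next_q * (1 - dones)
    # Q(s, a; theta_i) for the actions that were actually taken
    q_taken = nd.pick(dqn(states), actions, axis=1)
    return loss_fn(q_taken, targets)

In a training step this would be wrapped in autograd.record(), followed by backward() and a Trainer step on the online network's parameters.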
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Network Architecture
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
Input φ(s): a stack of 4 preprocessed 84 × 84 frames (84 × 84 × 4). Output: Q(s, a; θ), one value per action.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Deep Convolutional Network - Nature
from mxnet import gluon

# num_action: number of valid actions for the game (defined elsewhere)

DQN = gluon.nn.Sequential()
with DQN.name_scope():
    # first layer
    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8, strides=4, padding=0))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # second layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4, strides=2))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # third layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3, strides=1))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    DQN.add(gluon.nn.Flatten())
    # fourth layer
    DQN.add(gluon.nn.Dense(512, activation='relu'))
    # fifth layer: one output per action
    # (note: the Nature DQN uses a linear output layer; this example applies ReLU)
    DQN.add(gluon.nn.Dense(num_action, activation='relu'))
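A quick sanity check of the network above on the 84 × 84 × 4 input described earlier; the batch size and `num_action` value are illustrative:

import mxnet as mx
from mxnet import nd

num_action = 4  # illustrative; set to the game's action count before building DQN
# ... build DQN as above ...
DQN.initialize(mx.init.Xavier(), ctx=mx.cpu())
frames = nd.random.uniform(shape=(32, 4, 84, 84))  # NCHW: a batch of 4 stacked 84x84 frames
q_values = DQN(frames)
print(q_values.shape)  # expected: (32, num_action)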
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Issues with DQN
• Q-Learning can overestimate action values because of the
maximization term over estimated values.
• Over-estimation has been attributed to noise and to
insufficiently flexible function approximation.
• DQN provides a flexible function approximator.
• The deterministic nature of Atari games eliminates noise.
• Yet DQN still significantly overestimates action values.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Double Q Learning and DDQN
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Double Q-Learning
• The max operator uses the same values both to select and to evaluate an action.
This leads to over-optimism.
• Decoupling action selection from evaluation can prevent this
over-optimism. This is the idea behind Double Q-Learning.
• In Double Q-Learning, two value functions are learned by randomly assigning
experiences to update one of the two, resulting in two sets of
weights, θ and θ′.
• For each update, one set of weights is used to determine the greedy
policy and the other to determine its value.
Y_t^Q    ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)
Y_t^DQN  ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Untangling Evaluation and Action Selection
• For action selection we use θ.
• For evaluation we use θ′.

Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)
      → Y_t^Q ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t )

Y_t^DQN ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
      → Y_t^DoubleQ ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ′_t )
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Over-optimism and Error Estimation – upper bound
• Thrun and Schwartz showed that the upper bound on the
error due to over-optimization, when action values are
uniformly distributed in an interval [−ε, ε], is γε (m − 1)/(m + 1),
where m is the number of actions.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Over-optimism and Error Estimation – lower bound
• Consider a state s at which all true optimal action values are equal: Q*(s, a) = V*(s) for some V*(s).
• Let Q̃ be arbitrary value estimates that are on the
whole unbiased, so that Σ_a (Q̃(s, a) − V*(s)) = 0, but that are not
all correct, such that (1/m) Σ_a (Q̃(s, a) − V*(s))² = C for
some C > 0, where m ≥ 2 is the number of actions in s.
• Then max_a Q̃(s, a) ≥ V*(s) + √( C / (m − 1) ).
• This lower bound is tight. The lower bound on the
absolute error of the Double Q-Learning estimate is zero.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Number of Actions and Bias
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bias in Q-Learning vs Double Q-Learning
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DDQN
• Use DQN’s online network to select the greedy
action.
• Use DQN’s target network to estimate its value.

Y_t^DoubleDQN ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻ )
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Results
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Bayesian DQN or BDQN
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Focusing on Efficient Exploration
• The central claim is that mechanisms such as ε-greedy
exploration are inefficient.
• Thompson Sampling allows for targeted exploration in
higher dimensions but is computationally too expensive.
• BDQN aims to implement Thompson Sampling at
scale through function approximation.
• BDQN combines DQN with a BLR (Bayesian Linear
Regression) model on the last layer.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Thompson Sampling
• Thompson sampling involves maintaining a prior
distribution over the environment models (reward
and/or dynamics)
• The distribution is updated as observations are made
• To choose an action, a sample from the posterior belief
is drawn and an action is selected that maximizes the
expected return under the sampled belief.
• For more information, see “A Tutorial on
Thompson Sampling”, Daniel Russo et al.,
https://arxiv.org/abs/1707.02038
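A minimal Beta-Bernoulli bandit sketch of the idea (this illustrates Thompson Sampling itself, not the BDQN algorithm; the arm probabilities are made up):

import numpy as np

true_probs = [0.3, 0.5, 0.7]              # illustrative Bernoulli arms
alpha = np.ones(len(true_probs))          # Beta posterior: success counts + 1
beta = np.ones(len(true_probs))           # Beta posterior: failure counts + 1

for _ in range(1000):
    sampled = np.random.beta(alpha, beta)       # sample one model from the posterior
    arm = int(np.argmax(sampled))               # act greedily w.r.t. the sampled belief
    reward = np.random.rand() < true_probs[arm]
    alpha[arm] += reward                        # posterior update from the observation
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))  # posterior means concentrate on the best arm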
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TS vs ε-Greedy
• ε-greedy focuses on the greedy action.
• TS explores actions with a higher estimated return with
higher probability.
• A TS-based strategy improves the
exploration/exploitation balance by trading off
the expected returns against their uncertainties,
while the ε-greedy strategy ignores all of this information.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TS vs !-Greedy
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TS vs ε-Greedy
• TS finds the optimal Q-function faster.
• It randomizes over Q-functions with promising returns
and high uncertainty.
• When the true Q-function is selected, its posterior
probability increases.
• When other functions are selected, wrong values are
estimated and their posterior probability goes to zero.
• An ε-greedy agent randomizes its action with probability
ε even after having chosen the true Q-function;
therefore, it takes exponentially many trials to
reach the target.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
BDQN Algorithm
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Network Architecture
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
BDQN Performance
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Closing Words
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Value Alignment
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
References
• DQN: https://www.nature.com/articles/nature14236
• DDQN: https://arxiv.org/abs/1509.06461
• BDQN: https://arxiv.org/abs/1802.04412
• DQN MXNet Code:
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
• DQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DQN.ipynb
• DDQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DDQN.ipynb
• BDQN MXNet/Gluon Code: https://github.com/kazizzad/BDQN-MxNet-Gluon
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.