Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu
Google DeepMind
Montreal Institute for Learning Algorithms (MILA), University of Montreal
Journal reference: ICML 2016
Cite as: arXiv:1602.01783 [cs.LG]
(or arXiv:1602.01783v2 [cs.LG] for this version)
Presenter: 李思叡 (Institute of Information Systems and Applications, student ID 105065702)
1
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
2
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
3
Introduction
Combining online RL algorithms with deep neural networks is fundamentally unstable
- the sequence of observations encountered by an online RL agent is non-stationary
- online RL updates are strongly correlated
- solution idea → experience replay
(experience replay has achieved unprecedented success)
Drawbacks of experience replay:
- uses more memory and computation per real interaction
- requires an off-policy learning algorithm that can update from data generated by
an older policy
4
Introduction
In this paper:
Asynchronously execute multiple agents in parallel on multiple instances of the environment.
(instead of experience replay)
Benefits:
- decorrelates the agents’ data into a more stationary process
- applies to both on-policy and off-policy RL algorithms
- GPU or massively distributed machines → multi-core CPU (single machine)
- takes far less training time than GPU-based algorithms
- uses far fewer resources than massively distributed approaches
5
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
6
Related Work
The General Reinforcement Learning Architecture (Gorila) performs asynchronous training of
reinforcement learning agents in a distributed setting. (Nair et al., 2015)
MapReduce framework for batch reinforcement learning with linear function approximation. (Li & Schuurmans, 2011)
A parallel version of Sarsa that uses multiple separate actor-learners. (Grounds & Kudenko, 2008)
Convergence properties of Q-learning in the asynchronous optimization setting. (Tsitsiklis, 1994)
The related problem of distributed dynamic programming. (Bertsekas, 1982)
7
Related Work: Gorila Architecture
Performs asynchronous training of reinforcement learning agents in a distributed setting.
Actor
- acts in its own copy of the environment
- has a separate replay memory
Learner
- samples data from the replay memory
- computes gradients of the DQN loss with respect to the policy parameters
8
Related Work: Gorila Architecture
Gradients are asynchronously sent to a central parameter server, which updates a central
copy of the model.
The updated policy parameters are sent to the actor-learners at fixed intervals.
Setting:
- 100 separate actor-learner processes
- 30 parameter server instances
- 130 machines in total
Performance:
- outperformed DQN over 49 Atari games
- reached DQN's scores over 20 times faster in many games
9
Related Work
MapReduce framework for parallelizing batch reinforcement learning with linear function
approximation
- Parallelism was used to speed up large matrix operations
- Not to parallelize the collection of experience or to stabilize learning
Parallel version of the Sarsa algorithm
- Multiple separate actor-learners to accelerate training
- Each learner learns separately and periodically sends updates for weights that have changed
significantly to the other learners, using peer-to-peer (P2P) communication.
10
Related Work
Q-learning is still guaranteed to converge when some of the information is outdated, as
long as outdated information is always eventually discarded and several other
assumptions are satisfied.
Evolutionary methods
- Often straightforward to parallelize by distributing fitness evaluations over multiple
machines or threads.
11
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
12
Reinforcement Learning Background
one-step Q-learning (TD(0))
one-step Sarsa
n-step Q-learning (TD(0) → TD(λ))
actor-critic
13
RL Background: Q-learning and Sarsa
Q-learning:
Sarsa:
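The update rules referenced on this slide, reconstructed from the paper's background section (the original equation images are not reproduced here):

L_i(\theta_i) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\big] \quad \text{(one-step Q-learning loss)}

\text{target} = r + \gamma\, Q(s', a'; \theta_{i-1}),\ \text{with } a' \text{ the action actually taken in } s' \quad \text{(one-step Sarsa)}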
14
RL Background: Actor-Critic
Actor-critic methods follow an approximate policy gradient:
with an advantage function:
(using the TD(0) error as an example)
15
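The gradient referenced on this slide, reconstructed from the paper:

\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t), \qquad A(s_t, a_t) = R_t - V(s_t; \theta_v)

and, using the TD(0) error as the advantage estimate,

A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}; \theta_v) - V(s_t; \theta_v)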
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
16
Asynchronous RL Framework
Copy global network parameters
Worker interacts with environment
Accumulate gradients
Update global network with gradients
Copy global network parameters
Worker interacts with environment
....
source: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2
Framework of A3C (the steps above repeat in a loop)
17
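A compressed, runnable toy of this worker loop (a sketch only: a quadratic loss stands in for the real RL gradient, and none of the names come from the paper's code):

import threading
import numpy as np

global_theta = np.zeros(8)            # shared "global network" parameters
TARGET = np.linspace(0.0, 1.0, 8)     # what every worker should eventually learn

def worker(theta, t_max=5, updates=400, lr=0.05):
    rng = np.random.default_rng()
    for _ in range(updates):
        local_theta = theta.copy()                         # 1. copy global network parameters
        accumulated = np.zeros_like(local_theta)
        for _ in range(t_max):                             # 2. "interact" for t_max steps
            noise = rng.normal(scale=0.1, size=local_theta.shape)
            accumulated += (local_theta - TARGET) + noise  # 3. accumulate (toy) gradients
        theta -= lr * accumulated / t_max                  # 4. update shared parameters in place,
                                                           #    lock-free (Hogwild!-style)

threads = [threading.Thread(target=worker, args=(global_theta,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.round(global_theta, 2))      # ends up close to TARGET despite unsynchronized updates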
Asynchronous RL Framework
source: Naruto comic (315th and 617th episodes)
18
Asynchronous RL Framework (source: https://github.com/coreylynch/async-rl)
19
Two main ideas in practice
Similar to the Gorila framework, but:
- separate machines → multiple CPU threads on a single machine
- removes the communication cost of sending gradients and parameters
- uses Hogwild! (Recht et al., 2011) style updates for training
Multiple actor-learners running in parallel
- explore different parts of the environment
- maximize diversity
- data becomes less correlated in time (decorrelated)
- no replay memory is used
(→ on-policy RL can be used to train neural networks in a stable way)
20
Asynchronous RL Framework
Asynchronous one-step Q-learning (from one-step Q-learning)
Asynchronous one-step Sarsa (from one-step Sarsa)
Asynchronous n-step Q-learning (from n-step Q-learning)
Asynchronous advantage actor-critic (A3C) (from actor-critic)
21
Algo: DQN vs. Asynchronous one-step Q-learning
(algorithm boxes: Deep Q-Network (DQN) on the left, Asynchronous one-step Q-learning on the right)
22
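Because the algorithm boxes on this slide are images, here is a minimal runnable sketch of the asynchronous one-step Q-learning idea, with a tabular Q and a toy chain MDP standing in for the deep network and Atari (illustrative names only, not the paper's code):

import threading
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
Q = np.zeros((N_STATES, N_ACTIONS))   # shared parameters (tabular stand-in for the network)
Q_target = Q.copy()                   # shared target-network parameters

def step(s, a):
    """Deterministic chain: action 1 moves right, action 0 moves left; reward 1 at the last state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def actor_learner(total_steps=5000, i_update=5, i_target=100, lr=0.1, eps=0.1):
    rng = np.random.default_rng()
    s, t = 0, 0
    acc = np.zeros_like(Q)            # accumulated updates (the paper accumulates gradients)
    for T in range(total_steps):      # per-thread counter here; the paper uses a shared global T
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        y = r if done else r + GAMMA * np.max(Q_target[s2])   # one-step Q-learning target
        acc[s, a] += y - Q[s, a]
        s, t = (0 if done else s2), t + 1
        if T % i_target == 0:
            Q_target[:] = Q           # periodically refresh the shared target network
        if t % i_update == 0 or done:
            Q[:] += lr * acc          # asynchronous, lock-free update of the shared Q
            acc[:] = 0.0

threads = [threading.Thread(target=actor_learner) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(np.argmax(Q, axis=1))           # expect action 1 ("right") in every non-terminal state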
Algo: Asynchronous Advantage Actor-Critic
can add entropy regularization (H) to the objective (gradient shown below)
23
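The A3C gradient with the entropy term, as given in the paper:

\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t; \theta, \theta_v) + \beta\, \nabla_{\theta'} H(\pi(s_t; \theta'))

where the n-step advantage estimate is

A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)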
Optimization
1. SGD with momentum
2. RMSProp without shared statistics
3. RMSProp with shared statistics (most robust)
→ RMSProp where the statistics g are shared across threads
24
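The RMSProp update used here, from the paper; in the shared variant the moving average g is shared across all threads rather than kept per-thread:

g \leftarrow \alpha g + (1 - \alpha)\, \Delta\theta^2, \qquad \theta \leftarrow \theta - \eta\, \frac{\Delta\theta}{\sqrt{g + \epsilon}}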
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
25
Experiments
Atari 2600 Games (experiments on 57 games)
TORCS Car Racing Simulator (3D game)
MuJoCo Physics Simulator (Continuous Action Control)
Labyrinth (3D maze game)
26
Experimental setup (A3C on Atari and TORCS)
- number of threads: 16 (on a single machine and no GPUs)
- updates every 5 actions (t_max = 5 and I_Update = 5)
- optimization: shared RMSProp
- network architecture: 2 Conv layers and 1 FC layer (followed by ReLU)
- input preprocessing and network architecture as (Mnih et al., 2015; 2013)
- discount factor 𝛾 = 0.99, RMSProp decay factor 𝛼 = 0.99, entropy weight 𝛽 = 0.01
- learning rate: sampled from a LogUniform(10⁻⁴, 10⁻²) distribution
(for more details, see paper sections 8 & 9)
27
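The same setup as a compact Python configuration sketch (the key names are chosen here for illustration and are not from the paper's code):

a3c_atari_setup = {
    "num_threads": 16,                 # single machine, CPU only
    "t_max": 5,                        # update every 5 actions
    "optimizer": "shared RMSProp",
    "rmsprop_decay_alpha": 0.99,
    "discount_gamma": 0.99,
    "entropy_beta": 0.01,
    "learning_rate": "sampled from LogUniform(1e-4, 1e-2)",
    "network": "2 conv layers + 1 fully connected layer, ReLU after each",
}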
Learning speed comparison (5 Atari games)
DQN: trained on an Nvidia K40 GPU
Asynchronous methods: trained on 16 CPU cores
28
Score result on 57 Atari games
All hyperparameters fixed across all 57 games
A3C, LSTM: adds 256 LSTM cells after the final hidden layer (for an additional comparison)
Mean, median: human-normalized scores over the 57 Atari games
29
A3C on other environments
TORCS 3D car racing game: https://youtu.be/0xo1Ldx3L5Q
MuJoCo physics simulator: https://youtu.be/Ajjc08-iPx8
Labyrinth: https://youtu.be/nMR5mjCFZCw
30
Scalability and Data Efficiency
- superlinear speedups
(especially for the one-step methods)
(speedups averaged over 7 Atari games)
31
Robustness and Stability
50 different learning rates and random initializations
Results:
- Robust to the choice of learning rate and random initialization
- Stable: training does not collapse or diverge once learning is underway
(the same conclusion holds for all four asynchronous methods)
32
Comparison of three optimization methods
50 experiments on n-step Q and A3C
with 50 different random learning rates and initializations
1. Momentum SGD
2. RMSProp
3. Shared RMSProp
33
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
34
Conclusions and Discussion
In this framework
- Stable training of neural networks is possible in many settings
(value-based/policy-based, on-policy/off-policy, discrete/continuous)
- Greatly reduces training time
- Could potentially be improved by using other ways of estimating the advantage
function
- A number of complementary improvements to the neural network architecture are
possible.
35
Thanks for listening :)
36
Q & A
37