PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
TENSORFLOW + KERAS & OPENAI GYM
1
CONTENTS
Playing Atari Deep Reinforcement Learning
 Playing Atari with Deep Reinforcement Learning
 Human Level Control through Deep
Reinforcement Learning
 Deep Reinforcement Learning with Q-Learning
2
PLAYING ATARI WITH DEEP REINFORCEMENT
LEARNING
3
ATARI 2600
http://atariage.com/index.php
The Atari 2600 is a classic game console released in 1977
 One of the earliest home video game consoles
 Supports a 160 × 192 screen resolution with up to 128 colors; the console has 128 bytes of RAM and 6 KB of ROM
 The Famicom (FC) did not appear until about a decade later
4
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
DeepMind: the objective is to find an optimal policy
 Shows how a computer can learn to play Atari 2600 games
 What makes the result striking is that the computer only observes the screen pixels and receives a reward when the game score increases
 Uses the same model architecture
 Learns seven different games
 Plays three of them better than humans
5
HUMAN LEVEL
Original Results on Atari Games Beating Human Level
6
A3C (ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC)
RESULTS ON ATARI GAMES
7
PLAYING ATARI
8
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
Reinforcement Learning: the objective is to find an optimal policy
1. Given the current state
2. Take an action based on that state
3. Get the current reward
9
BREAKOUT
Breakout (tested on Ubuntu 16.04)
 State
 The position of the ball on the screen
 Action
 Train the computer to play the game
 Input: a screenshot of the screen
 Output: control the paddle (left, right, launch the ball)
 Reward
 The upper half of the screen is full of bricks; when the ball hits a brick, the brick is destroyed and you score points
10
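A minimal sketch of the state/action/reward loop described above, using OpenAI Gym's Breakout environment (assuming the classic gym API and the Breakout-v0 id); the observation is the raw screen image, the action moves the paddle, and the reward comes from destroyed bricks:

import gym

env = gym.make('Breakout-v0')
obs = env.reset()                        # state: raw screen image (210 x 160 x 3)
total_reward, done = 0.0, False
while not done:
    env.render()
    action = env.action_space.sample()   # action: NOOP / FIRE / RIGHT / LEFT (random here)
    obs, reward, done, info = env.step(action)  # reward: points for destroyed bricks
    total_reward += reward
print('episode reward:', total_reward)

A trained agent would replace the random action with the one suggested by the Q-network described later.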
RESOURCES
Playing Atari with Deep Reinforcement Learning
 https://courses.cs.ut.ee/MTAT.03.291/2014_spring/uploads/Main/Replicating%20DeepMind.pdf
Replicating-DeepMind
 https://github.com/kristjankorjus/Replicating-DeepMind
11
RESOURCES
DeepMind Atari Deep Q Learner
 https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner
 LuaJIT and Torch 7.0
 nngraph
 Xitari (fork of the Arcade Learning Environment (Bellemare et al., 2013))
 AleWrap (a Lua interface to Xitari)
 An install script for these dependencies is provided
Asynchronous RL in TensorFlow + Keras + OpenAI's Gym
 https://github.com/coreylynch/async-rl
 tensorflow
 gym
 gym's Atari environment (https://github.com/openai/gym#atari)
 skimage
 Keras
12
RESOURCES
The Arcade Learning Environment
 http://www.arcadelearningenvironment.org/
ALE (Visual Studio version)
 https://github.com/mvacha/A.L.E.-0.4.4.-Visual-Studio
13
APT-GET INSTALL
 libtiff5-dev
 libjpeg8-dev
 zlib1g-dev
 liblcms2-dev
 libwebp-dev
 tcl8.6-dev
 tk8.5-dev
 python-tk
 cmake
 xvfb
14
DEEP NEURAL NETWORKS
 TensorFlow is a good, flexible deep learning framework
 Backpropagation and deep neural networks do much of the work; the reinforcement learning challenge is finding the right loss function to train against
15
HOW TO RUN AI AGENTS ON GAMES?
https://gym.openai.com/ OpenAI Gym
 Library of Environments
 Pong
 Breakout
 Cart-Pole
 Same API
 Provides way to share and compare results
16
HOW TO RUN AI AGENTS ON GAMES?
https://gym.openai.com/
pip install -e '.[atari]'
import gym

env = gym.make('SpaceInvaders-v0')
obs = env.reset()
done = False
while not done:
    env.render()
    action = env.action_space.sample()  # pick a random action for illustration
    obs, reward, done, _ = env.step(action)
17
OTHER OPTIONS
https://github.com/DanielSlater/PyGamePlayer PyGame
 1000’s of games
 Easy to change game code
 PyGamePlayer
 Half pong
18
python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8
Checkpoints
/tmp/checkpoints/
TensorBoard Summary
tensorboard --logdir /tmp/summaries/breakout
"created":1485854183,
"episode_types":["t"],
"episode_lengths":[1717],
"object":"episode_batch",
"initial_reset_timestamps":[
1485853848.3293480873],
"episode_rewards":[62.0],
"data_sources":[0],
"seeds":[],
"main_seeds":[],
"timestamps":[1485853853.
9296009541],
"env_id":"Breakout-v0",
"initial_reset_timestamp":1
485853848.3293480873,
"id":"eb_taFBJqLFThuZ5jBw
O0NFTQ"
19
ALE GRAYSCALE CONVERSION METHOD
RGB images grayscale conversion
20
SCREENSHOT
Frame skipping; taking the maximum over two consecutive frames
21
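A rough sketch of the preprocessing mentioned on the last two slides, using skimage (one of the listed dependencies); the 84 × 84 output size is an assumption taken from the DQN papers rather than from these slides:

import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

def preprocess(frame, previous_frame):
    # Remove sprite flicker: pixel-wise maximum over two consecutive frames
    merged = np.maximum(frame, previous_frame)
    # RGB to grayscale, then downsample to 84 x 84 (assumed size)
    return resize(rgb2gray(merged), (84, 84))

With frame skipping, the agent only selects an action every k-th frame and the chosen action is repeated on the skipped frames.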
100-EPISODE (2 HOURS) AVERAGE REWARD WAS 68.97
Training episode batch video (mp4) Visualizing training with tensorboard
22
VISUALIZING TRAINING WITH TENSORBOARD
Episode Reward Max Q Value
23
REINFORCEMENT LEARNING
24
MARKOV DECISION PROCESS
 A policy for choosing these actions
 In general, the environment is stochastic
 so the next state is also stochastic
 MDP < S, A, P, R, 𝛾 >
 S: set of states
 A: set of actions
 T(s, a, s’): transition probability
 Reward(s): reward function
 𝛾: discount factor
 Trace: {<s0,a0,r0>, …, <sn,an,rn>}
25
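To make the < S, A, P, R, 𝛾 > tuple concrete, here is a tiny hand-written MDP (a hypothetical two-state example, not taken from the slides):

# Hypothetical two-state MDP: T[s][a] maps each next state to its probability
states = ['s0', 's1']
actions = ['stay', 'go']
T = {
    's0': {'stay': {'s0': 0.9, 's1': 0.1}, 'go': {'s1': 1.0}},
    's1': {'stay': {'s1': 1.0},            'go': {'s0': 1.0}},
}
reward = {'s0': 0.0, 's1': 1.0}   # Reward(s)
gamma = 0.99                      # discount factor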
Convolutional networks Network architecture
26
REINFORCEMENT LEARNING
3 categories of reinforcement learning
 Value learning: Q-learning
 Given a state and a set of possible actions, decide which action yields the best reward
 Policy learning: policy gradients
 Use gradients to find the optimal policy
 Model learning
 Learn the transitions between states
 Min-Max
 Monte-Carlo sampling
Definitions
 Return: total discounted reward R_t = r_t + 𝛾 r_{t+1} + 𝛾² r_{t+2} + …
 Policy: the agent’s behavior
 Deterministic policy: π(s) = a
 Stochastic policy: π(a | s) = P[A_t = a | S_t = s]
 Value function: expected return starting from state s
 State-value function: V_π(s) = E_π[R | S_t = s]
 Action-value function: Q_π(s, a) = E_π[R | S_t = s, A_t = a]
27
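The return defined above can be computed directly from a recorded trace; a small sketch with made-up numbers:

# Trace of <state, action, reward> tuples (illustrative values only)
trace = [('s0', 'go', 0.0), ('s1', 'stay', 1.0), ('s1', 'stay', 1.0)]
gamma = 0.99

# Return: R = r0 + gamma * r1 + gamma^2 * r2 + ...
R = sum((gamma ** t) * r for t, (_, _, r) in enumerate(trace))
print(R)  # 0.0 + 0.99 + 0.9801 = 1.9701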
LEARNING
Deep Q Learning
 Model-free, off-policy technique to learn the optimal Q(s, a):
 Q_{i+1}(s, a) ← Q_i(s, a) + 𝛼 (r + 𝛾 max_{a’} Q_i(s’, a’) − Q_i(s, a))
 The optimal policy is then π(s) = argmax_{a’} Q(s, a’)
 Requires exploration (ε-greedy) to explore the various transitions from the states:
 take a random action with probability ε, starting ε high and decaying it to a low value as training progresses.
 Deep Q Learning: approximate Q(s, a) with a neural network Q(s, a, 𝜃)
 Do stochastic gradient descent using the loss
 L(𝜃_i) = MSE_{s,a}( Q(s, a, 𝜃_i), r + 𝛾 max_{a’} Q(s’, a’, 𝜃_{i−1}) )
Policy Gradient
 Given a policy π_𝜃(a | s), find the 𝜃 that maximizes the expected return:
 J(𝜃) = ∑_s d^π(s) V(s)
 In deep RL, we approximate π_𝜃(a | s) with a neural network,
 usually with a softmax layer on top to estimate the probability of each action.
 We can estimate J(𝜃) from samples of observed behavior: ∑_{k=0..T} p_𝜃(𝜏_k | π) R(𝜏_k)
 Do stochastic gradient ascent using the update:
 𝜃_{i+1} = 𝜃_i + 𝛼 (1/T) ∑_{k=0..T} ∇ log p_𝜃(𝜏_k | π) R(𝜏_k)
28
DQN OPTIMIZATION
29
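A simplified sketch of one DQN training step using the loss from the previous slide; q_network and target_network are assumed to be Keras models with num_actions outputs and an MSE loss, and this is not the exact code of the repositories listed earlier:

import numpy as np

def dqn_train_step(q_network, target_network, batch, gamma=0.99):
    # batch: numpy arrays of states, integer actions, rewards, next states, done flags (0/1)
    states, actions, rewards, next_states, dones = batch
    # Bootstrapped target r + gamma * max_a' Q(s', a') from the older, frozen network
    next_q = target_network.predict(next_states)
    targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
    # Only the Q value of the action actually taken is pushed toward the target
    q_values = q_network.predict(states)
    q_values[np.arange(len(actions)), actions] = targets
    q_network.train_on_batch(states, q_values)

Here q_network would be compiled with something like optimizer='rmsprop' and loss='mse', and target_network is a periodically updated copy of it.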
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
 Asynchronous: uses multiple instances of environments and networks
 Actor-Critic: uses both a policy and an estimate of the value function
 Advantage: estimates how different the outcome was from what was expected
30
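A compact, synchronous sketch of the quantities A3C combines; the asynchronous workers and shared parameters are omitted, and the numpy version below only illustrates the three loss terms (in a real implementation the gradients flow through the policy and value networks):

import numpy as np

def actor_critic_losses(log_probs, values, returns, entropies, beta=0.01):
    advantages = returns - values                     # how much better than expected
    policy_loss = -(log_probs * advantages).mean()    # actor: favor better-than-expected actions
    value_loss = 0.5 * (advantages ** 2).mean()       # critic: regress V(s) toward the return
    entropy_bonus = -beta * entropies.mean()          # keep the policy exploratory
    return policy_loss + value_loss + entropy_bonus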
TENSORFLOW-RL/EXAMPLES/ATARI-RL.PY
31
ACTING
Environment Random Agent
32
Q-NETWORK
Q-network Optimization
33
Q-NETWORK
Q-network layers
 Convolutional layers
 16 filters of 8 × 8 with stride 4 × 4, followed by a ReLU nonlinearity
 32 filters of 4 × 4 with stride 2 × 2, followed by a ReLU nonlinearity
 Flatten
 Unroll the response into a one-dimensional vector
 Fully-connected layers
 256 neurons with a ReLU nonlinearity
 num_actions neurons with a linear activation, giving the score (the Q value) for each action
 Pooling layers
 none
34
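The layer list above can be sketched in Keras roughly as follows; the 84 × 84 × 4 input shape (four stacked grayscale frames) is an assumption from the DQN papers, not something stated on the slide:

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

def build_q_network(num_actions, input_shape=(84, 84, 4)):
    return Sequential([
        # 16 filters of 8 x 8, stride 4, ReLU
        Conv2D(16, (8, 8), strides=(4, 4), activation='relu', input_shape=input_shape),
        # 32 filters of 4 x 4, stride 2, ReLU
        Conv2D(32, (4, 4), strides=(2, 2), activation='relu'),
        Flatten(),                                  # unroll into a one-dimensional vector
        Dense(256, activation='relu'),              # fully connected, 256 units
        Dense(num_actions, activation='linear'),    # one Q value per action; no pooling layers
    ])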
Q-NETWORK
Q-network Monitored Training Session
35
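A minimal sketch of a monitored training session with the TensorFlow 1.x API; the checkpoint directory matches slide 19, while the stand-in train_op and the step limit are assumptions:

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# Stand-in for the real Q-network update op; here it only advances the step counter
train_op = tf.assign_add(global_step, 1)

with tf.train.MonitoredTrainingSession(
        checkpoint_dir='/tmp/checkpoints',                       # periodic checkpoints
        hooks=[tf.train.StopAtStepHook(last_step=100)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)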
POLICY NETWORK
Policy Network Optimization
36
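A rough sketch of the policy-gradient update from slide 28 for a Keras policy network with a softmax output; weighting a categorical cross-entropy loss by the return is a common implementation trick, not necessarily what the repository above does:

import numpy as np

def policy_gradient_step(policy_network, states, actions, returns, num_actions):
    # One-hot targets for the actions actually taken
    targets = np.zeros((len(actions), num_actions))
    targets[np.arange(len(actions)), actions] = 1.0
    # With loss='categorical_crossentropy', weighting each sample by its return gives
    # a gradient proportional to -R * grad log pi(a|s), i.e. the policy-gradient update
    policy_network.train_on_batch(states, targets, sample_weight=returns)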
POLICY AND VALUE NETWORKS
Optimization of the networks
37
PROBLEM
Temporal credit assignment
 Assigning credit over time
 Earlier actions influence the reward received now
 The order of actions matters
 Experience replay
 All experience tuples <S, A, R, S’> are stored in a table
Balancing exploration and exploitation
 Balancing which action to take:
 exploit the existing policy,
 or explore other, possibly better, policies
 ε-greedy exploration
 Act greedily on the highest Q value,
 with probability ε choose a random action instead
38
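The two mechanisms above can be sketched as follows (a hypothetical minimal replay buffer and ε-greedy action selection; the buffer size and decay schedule are assumptions):

import random
from collections import deque

import numpy as np

replay_buffer = deque(maxlen=100000)        # table of <s, a, r, s', done> experience tuples

def remember(s, a, r, s_next, done):
    replay_buffer.append((s, a, r, s_next, done))

def epsilon_greedy(q_network, state, epsilon, num_actions):
    # Explore: with probability epsilon take a random action
    if random.random() < epsilon:
        return random.randrange(num_actions)
    # Exploit: act greedily on the highest Q value
    q_values = q_network.predict(state[np.newaxis])[0]
    return int(np.argmax(q_values))

# epsilon starts high and decays toward a small value as training progresses (assumed schedule)
epsilon, epsilon_min, epsilon_decay = 1.0, 0.1, 0.999995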
THANK YOU!
39