Introduction to Reinforcement Learning
Artificial Intelligence
• What is intelligence?
 The ability to understand increasingly abstract information
• What is artificial intelligence?
 The study of reproducing this phenomenon of intelligence artificially
Artificial Intelligence & Machine Learning
Deep Learning in RL
• In deep RL, deep learning is used only as a single module.
• Strengths of deep learning :
1. Stacking layers enables progressively more
abstract feature learning.
2. It is a universal function approximator.
Universal Function Approximator
What is Reinforcement Learning?
•Supervised Learning :
y = f(x)
•Unsupervised Learning :
x ~ p(x) or x = f(x)
•Reinforcement Learning :
Find a policy p(a|s) that maximizes the expected sum of rewards
Machine Learning
Example of Supervised Learning :
Polynomial Curve Fitting
Trend line in Microsoft Excel 2007
Example of Unsupervised Learning :
Clustering
http://www.frankichamaki.com/data-driven-market-segmentation-more-effective-marketing-to-segments-using-ai/
Example of Reinforcement Learning :
Optimal Control Problem
http://graveleylab.cam.uchc.edu/WebData/mduff/older_papers.html
https://studywolf.wordpress.com/2015/03/29/reinforcement-learning-part-3-egocentric-learning/
Markov Decision Processes (MDP)
• Discrete state space : S = { A , B }
• State transition probability : P(S’|S) = { P(A|A) = 0.7 , P(B|A) = 0.3 , P(A|B) = 0.5 , P(B|B) = 0.5 }
• Purpose : Finding a steady state distribution
Markov Processes
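For the two-state chain above, the steady-state distribution can be computed directly. A minimal sketch (Python), using the transition probabilities from the slide:

```python
import numpy as np

# Two-state Markov chain from the slide: S = {A, B}
# P[i][j] = P(next state = j | current state = i), state order [A, B]
P = np.array([[0.7, 0.3],
              [0.5, 0.5]])

# Power iteration: push a distribution through P until it stops changing.
dist = np.array([1.0, 0.0])            # start in state A
for _ in range(1000):
    new_dist = dist @ P
    if np.allclose(new_dist, dist):
        break
    dist = new_dist

print(dict(zip("AB", dist.round(4))))  # steady state: {'A': 0.625, 'B': 0.375}
```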
• Discrete state space : S = { A , B }
• State transition probability : P(S’|S) = { P(A|A) = 0.7 , P(B|A) = 0.3 , P(A|B) = 0.5 , P(B|B) = 0.5 }
• Reward function : R(S’=A) = +1 , R(S’=B) = -1
• Purpose : Finding an expected reward distribution
Markov Processes with rewards
• Discrete state space : S = { A , B }
• Discrete action space : A = { X, Y }
• (Action conditional) State transition probability : P(S’|S , A) = { … }
• Reward function : R(S’=A) = +1 , R(S’=B) = -1
• Purpose : Finding an optimal policy (one that maximizes the expected sum of future rewards)
Markov Decision Processes
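As a concrete reference for the planning examples later on, here is one way to encode the two-state, two-action MDP above in Python. The slide leaves the action-conditional transition probabilities as "{ … }", so the numbers below are illustrative placeholders, not values from the original:

```python
# A toy encoding of the 2-state, 2-action MDP above.
STATES  = ["A", "B"]
ACTIONS = ["X", "Y"]

# P[(s, a)] = {next_state: probability} -- placeholder values for illustration.
P = {
    ("A", "X"): {"A": 0.9, "B": 0.1},
    ("A", "Y"): {"A": 0.2, "B": 0.8},
    ("B", "X"): {"A": 0.6, "B": 0.4},
    ("B", "Y"): {"A": 0.1, "B": 0.9},
}

# Reward depends only on the next state, as on the slide.
R = {"A": +1.0, "B": -1.0}
```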
•Markov decision processes :
a mathematical framework for modeling decision making.
• MDPs are solved via dynamic programming and reinforcement learning.
• Applications : robotics, automated control, economics and manufacturing.
• Examples of MDP :
1) AlphaGo models the game of Go as an MDP
2) Driving a car can be modeled as an MDP
3) The stock market can be modeled as an MDP
Markov Decision Processes
• Objective : Finding an optimal policy which maximizes the expected sum of
future rewards
• Algorithms
1) Planning : Exhaustive Search / Dynamic Programming
2) Reinforcement Learning : MC methods / TD Learning (Q-learning, …)
Agent-Environment Interaction
Discount Factor
• Sum of future rewards in episodic tasks
 G_t := R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
• Sum of future rewards in continuing tasks
 G_t := R_{t+1} + R_{t+2} + R_{t+3} + …  →  ∞ (diverges)
• Sum of discounted future rewards, covering both cases
 G_t := R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=1}^{∞} γ^{k−1} R_{t+k}  (converges)
 (R_t is bounded / γ : discount factor, 0 ≤ γ < 1)
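A quick sanity check of the discounted return; the reward sequence and γ = 0.9 below are arbitrary illustration values:

```python
# Discounted return G_t for a sampled reward sequence (gamma < 1 keeps it finite).
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):      # rewards = [R_{t+1}, R_{t+2}, ...]
        g += (gamma ** k) * r
    return g

print(discounted_return([1, -1, 1, 1]))  # 1 - 0.9 + 0.81 + 0.729 = 1.639
```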
Value-based Reinforcement Learning
Deterministic Policy, a = f(s)
State s | f(s)
A       | Y
B       | Y

Stochastic Policy, p(a|s)
State s | P(X|s) | P(Y|s)
A       | 0.3    | 0.7
B       | 0.4    | 0.6
Solution of the MDP : Planning
• So our remaining job is to find an optimal policy, and there are two approaches:
1) Exhaustive Search
2) Dynamic Programming
Find Optimal Policy with Exhaustive Search
If we know the one-step dynamics of the MDP, P(s’, r | s, a), we can search
exhaustively over every action sequence up to the final step T and pick the
optimal action path, but this costs O(N^T)!
[Search-tree diagram: starting from state A, actions X and Y branch at every step to successor states A (+1) or B (-1).]
Dynamic Programming
• We can apply DP to this problem, and the computational cost
drops to O(N²·T). (But we still need to know the environment
dynamics.)
• DP is a technique that computes the final value by composing cached
solutions to overlapping subproblems.
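A minimal sketch of this backward-induction DP on the toy MDP encoded earlier (it reuses the P and R dictionaries from that sketch); with the action set fixed, the cost grows as O(N²·T) instead of O(N^T):

```python
# Finite-horizon DP: reuse the optimal value of each state at each remaining
# horizon instead of enumerating all O(N^T) action paths.
def optimal_values(P, R, states, actions, T, gamma=0.9):
    V = {s: 0.0 for s in states}                  # value with 0 steps to go
    for _ in range(T):
        V = {
            s: max(
                sum(p * (R[s2] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return V

# print(optimal_values(P, R, STATES, ACTIONS, T=10))
```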
• State : 5x5 grid
• Action : move in one of 4 directions (up/down/left/right)
• Reward : +10 on reaching A, +5 on reaching B, -1 for bumping into a wall, 0 otherwise
• Discount factor : 0.9
Gridworld
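A minimal sketch of this gridworld's one-step dynamics; the positions of A and B are not given on the slide, so the coordinates below are assumptions:

```python
# 5x5 gridworld as described above; A_POS and B_POS are assumed coordinates.
SIZE = 5
A_POS, B_POS = (0, 1), (0, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward) for one move."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):   # bumped into the wall
        return state, -1.0
    if (nr, nc) == A_POS:
        return (nr, nc), +10.0
    if (nr, nc) == B_POS:
        return (nr, nc), +5.0
    return (nr, nc), 0.0

print(step((1, 1), "up"))   # ((0, 1), 10.0) -- moved onto A
print(step((0, 0), "up"))   # ((0, 0), -1.0) -- hit the wall
```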
1) Policy from the state-value function
 One-step lookahead over all actions, using the state transition probabilities (the model).
2) Policy from the action-value function
 a = f(s) = argmax_a q_π(s, a)
Optimal Policy from Value Function
Value Function
•We will introduce a value function, which tells us the expected
sum of future rewards from a given state s.
1) State-value function
2) Action-value function
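The definitions themselves were shown as images on the original slides; for reference, the standard forms (Sutton & Barto) are:

```latex
% State-value and action-value functions under policy pi
v_\pi(s)   = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right]
           = \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]
q_\pi(s,a) = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s,\; A_t = a \,\right]
```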
Policy Iteration
•How can we get the state-value function with DP?
(The action-value function is computed similarly.)
Policy Iteration = Policy Evaluation + Policy Improvement
Policy Iteration
• Policy iteration consists of two simultaneous, interacting processes:
• (policy evaluation) one makes the value function consistent with the current policy;
• (policy improvement) the other makes the policy greedy with respect to the current value function.
Policy Iteration
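A compact sketch of the full policy-iteration loop on the toy MDP from earlier (again reusing its P and R dictionaries); evaluation and improvement alternate until the policy stops changing:

```python
# Policy iteration = iterative policy evaluation + greedy policy improvement.
def policy_iteration(P, R, states, actions, gamma=0.9, theta=1e-8):
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: make V consistent with the current policy.
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(p * (R[s2] + gamma * V[s2])
                            for s2, p in P[(s, policy[s])].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily w.r.t. the current value function.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(
                p * (R[s2] + gamma * V[s2]) for s2, p in P[(s, a)].items()))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```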
Solution of the MDP : Learning
•The planning methods must know the exact dynamics of the
environment, P(s’, r | s, a).
•But in practice these dynamics are very hard to know and often impossible
to obtain. So we drop this term and instead estimate expected rewards from
samples. This is the point where machine learning enters.
1) Monte Carlo Methods
2) Temporal-Difference Learning
(a form of reinforcement learning)
Monte Carlo Methods
Tabular state-value function
Starting state | Value
S1             | Average of G(S1)
S2             | Average of G(S2)
S3             | Average of G(S3)
Monte Carlo Methods
•We need a complete episode of experience
for each starting state; it is very
time-consuming to wait for the end of an
episode just to update one state.
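A minimal first-visit Monte Carlo sketch; episodes are assumed to be lists of (S_t, R_{t+1}) pairs collected under some fixed policy:

```python
from collections import defaultdict

# First-visit Monte Carlo: average the full return G observed after the first
# visit to each state, over many complete episodes.
def mc_state_values(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:                 # episode: [(S_t, R_{t+1}), ...]
        # Record the first time step each state is visited.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # Walk backwards accumulating the discounted return G_t.
        gs = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            gs[t] = r + gamma * gs[t + 1]
        for s, t in first_visit.items():
            returns[s].append(gs[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}
```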
Q-learning (Temporal Difference Learning)
Temporal-Difference Learning
•TD learning is a combination of Monte Carlo ideas and dynamic
programming (DP) ideas.
•Like Monte Carlo methods, TD methods can learn directly from
raw experience without a model of the environment's dynamics.
•Like DP, TD methods update estimates based in part on other learned
estimates, without waiting for a final outcome (they bootstrap).
Bellman Equation
(iterative formulation)
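The equation on the original slide was an image; the standard Bellman optimality equation for the action-value function, which Q-learning turns into a sample-based iterative update, is:

```latex
% Bellman optimality equation for q_*
q_*(s,a) = \mathbb{E}\!\left[\, R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \;\middle|\; S_t = s,\; A_t = a \right]
         = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} q_*(s', a') \,\bigr]
```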
Q-learning Algorithm
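The algorithm box was likewise an image; below is a minimal tabular Q-learning sketch. The env interface (reset/step) is an assumption made for illustration, not something defined in the slides:

```python
import random
from collections import defaultdict

# Tabular Q-learning with an epsilon-greedy behavior policy.
# Assumed env interface: reset() -> state, step(state, action) -> (next_state, reward, done).
def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                       # Q[(state, action)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:            # explore
                a = random.choice(actions)
            else:                                # exploit
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(s, a)
            # Off-policy TD target: bootstrap from the greedy action in s2.
            target = r + gamma * max(Q[(s2, x)] for x in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```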
Temporal-Difference Learning
On policy / Off policy
• On policy : Target policy = Behavior policy
there can be only one policy.
 The learned policy can stay stochastic (e.g., ε-soft). Ex) SARSA
• Off policy : Target policy != Behavior policy
there can be several policies.
 Broader applications (e.g., learning from another policy’s experience). Ex) Q-learning
Sarsa: On-Policy TD Control
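SARSA's update differs from Q-learning's only in its TD target; a sketch of just that update (Q, states, and actions as in the Q-learning sketch above):

```python
# SARSA bootstraps from the action actually selected by the behavior policy
# (on-policy), not the greedy one:
#   Q-learning target: r + gamma * max_a Q[(s2, a)]
#   SARSA target:      r + gamma * Q[(s2, a_next)]
def sarsa_update(Q, s, a, r, s2, a_next, alpha=0.1, gamma=0.9, done=False):
    target = r + gamma * Q[(s2, a_next)] * (not done)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```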
Eligibility Trace
•Eligibility traces smoothly combine TD and MC.
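A minimal TD(λ) sketch with accumulating eligibility traces; λ = 0 recovers one-step TD and λ → 1 approaches Monte Carlo. The episode format here is an assumption:

```python
from collections import defaultdict

# TD(lambda) state-value prediction with accumulating eligibility traces.
def td_lambda(episodes, alpha=0.1, gamma=0.9, lam=0.8):
    V = defaultdict(float)
    for episode in episodes:                 # episode: [(s, r, s_next, done), ...]
        e = defaultdict(float)               # eligibility trace per state
        for s, r, s_next, done in episode:
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                      # accumulate trace for the visited state
            for state in list(e):
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam      # traces decay every step
    return V
```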
Comparisons
Planning & Learning
•There is only one difference between planning and learning: whether a
model of the environment is available.
•So planning methods are called model-based, and learning methods are
called model-free.
Planning + Learning
Planning vs Learning
• After one-step learning.
Deep Reinforcement Learning
•Value-function approximation with deep learning
 Large-scale or even infinite-dimensional state spaces become tractable
 Generalization across states comes from deep learning
 This requires supervised-learning techniques and regression on an online, moving target.
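A minimal sketch of that idea, with linear features standing in for the deep network: the regression target is recomputed from the current weights at every step (the "moving target"), and phi(s) is an assumed feature function, not something defined in the slides:

```python
import numpy as np

# Semi-gradient Q update with linear function approximation.
# w has shape (n_actions, feature_dim); phi_s = phi(s) is a feature vector.
def td_update(w, phi_s, a, r, phi_s2, done, n_actions, alpha=0.01, gamma=0.99):
    # Moving target: computed from the current weights, then treated as a constant.
    target = r + (0.0 if done else gamma * max(w[b] @ phi_s2 for b in range(n_actions)))
    error = target - w[a] @ phi_s
    w[a] += alpha * error * phi_s            # semi-gradient regression step
    return w

# Usage: w = np.zeros((n_actions, feature_dim)); call td_update on every transition.
```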
Appendix
• Atari 2600 - https://www.youtube.com/watch?v=iqXKQf2BOSE
• Super MARIO - https://www.youtube.com/watch?v=qv6UVOQ0F44
• Robot Learns to Flip Pancakes - https://www.youtube.com/watch?v=W_gxLKSsSIE
• Stanford Autonomous Helicopter - Airshow #2 -
https://www.youtube.com/watch?v=VCdxqn0fcnE
• OpenAI Gym - https://gym.openai.com/envs
• Awesome RL - https://github.com/aikorea/awesome-rl
• Udacity RL course
• TensorFlow DRL - https://github.com/nivwusquorum/tensorflow-deepq
• Karpathy rldemo -
http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Thank you.
