Chapter 3: Finite Markov Decision Processes
Seungjae Ryan Lee
Markov Decision Process (MDP)
● Simplified, flexible formulation of the reinforcement learning problem
● Consists of States, Actions, and Rewards
○ States: information available to the agent
○ Actions: choices made by the agent
○ Rewards: basis for evaluating choices
Agent and Environment
● Agent: the learner; takes actions
● Environment: everything outside the agent; returns states and rewards
Agent-Environment Boundary
● Anything the agent cannot arbitrarily change is part of the environment
○ The agent might still know everything about the environment
● Different boundaries for different purposes
(Diagram: a robot's “Brain”, Machinery, Sensors, and Battery, illustrating possible boundary placements)
Agent-Environment Interactions
1. Agent observes a state
2. Agent takes an action
3. Agent receives a reward and a new state
4. Agent takes another action
5. Repeat
Transition Probability
● p(s', r | s, a): probability of reaching state s' with reward r by taking action a in state s
● Fully describes the dynamics of a finite MDP
● Can deduce other properties of the environment
Expected Rewards
● r(s, a): expected reward of taking action a in state s
● r(s, a, s'): expected reward of arriving in state s' after taking action a in state s
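As a rough illustration, the dynamics p(s', r | s, a) of a small finite MDP can be stored as a table mapping each (s, a) pair to its possible outcomes, and r(s, a) then follows by summation. The states, actions, and numbers below are made up for this sketch.

```python
# A minimal sketch: p[(s, a)] is a list of (probability, next_state, reward) triples.
p = {
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 2.0)],
    ("s1", "stay"): [(1.0, "s1", 0.5)],
}

def expected_reward(state, action):
    """r(s, a) = sum over outcomes of p(s', r | s, a) * r."""
    return sum(prob * reward for prob, _, reward in p[(state, action)])

# Sanity check: outcome probabilities for each (s, a) pair should sum to 1
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9

print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```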
Recycling Robot Example
● States: Battery status (high or low)
● Actions
○ Search: High reward. Battery status can be lowered or depleted.
○ Wait: Low reward. Battery status does not change.
○ Recharge: No reward. Battery status changed to high.
● If the battery is depleted: -3 reward, and the battery status is changed to high.
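A sketch of the recycling robot's dynamics in the same (probability, next state, reward) format. The parameters alpha, beta, r_search, and r_wait follow the book's parameterization of this example; the numeric values chosen here are assumptions for illustration only.

```python
# Illustrative parameter values (assumptions, not from the slides)
alpha, beta = 0.9, 0.6       # chance the battery level is unchanged while searching
r_search, r_wait = 2.0, 1.0  # r_search > r_wait

p = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("low",  "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # battery depleted: -3, back to high
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "wait"):     [(1.0, "low",  r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

# Each (state, action) row should describe a full probability distribution
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
```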
Transition Graph
● Graphical summary of MDP dynamics
Designing Rewards
● Reward hypothesis
○ Goals and purposes can be represented by maximization of cumulative reward
● Tell what you want to achieve, not how
● Example reward designs: +1 for each box; reward proportional to forward motion; always -1 per time step
Episodic Tasks
● Interactions can be broken into episodes
● Episodes end in a special terminal state
● Each episode is independent
● Examples: finished when the game ends; finished when the agent exits the maze
Return for Episodic Tasks
● G_t = R_{t+1} + R_{t+2} + ... + R_T: sum of rewards from time step t
● T: time of termination
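In code, the episodic return is just a sum over the rewards observed after time step t (the reward values below are made up for illustration).

```python
# Undiscounted episodic return G_t = R_{t+1} + R_{t+2} + ... + R_T
rewards = [0.0, 0.0, 1.0, -1.0, 5.0]  # R_{t+1}, ..., R_T for one episode
G_t = sum(rewards)
print(G_t)  # 5.0
```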
Continuing Tasks
● Cannot be naturally broken into episodes
● Goes on without limit
● Example: stock trading
Return for Continuing Tasks
● Sum of rewards is almost always infinite
● Need to discount future rewards by a factor γ (0 ≤ γ < 1): G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
○ If γ = 0, the return only considers the immediate reward (myopic)
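A small sketch of the discounted return: with γ < 1 and bounded rewards the infinite sum stays finite, so truncating a long reward stream gives a good approximation. The reward stream and γ below are illustrative.

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}, approximated by truncation
gamma = 0.9
rewards = [1.0] * 1000  # a long stream of +1 rewards, standing in for a continuing task

G_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G_t)             # close to 1 / (1 - gamma) = 10 for a constant +1 reward
print(1 / (1 - gamma))
```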
Unified Notation for Return
● Cumulative reward: G_t = Σ_{k=t+1}^{T} γ^{k-t-1} R_k
● T can be a finite number or infinity
● Future rewards can be discounted with a factor γ
○ If T = ∞, then γ must be less than 1
Policy
● Mapping from states to probabilities of selecting each possible action
● π(a | s): probability of selecting action a in state s
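A policy can be sketched as a nested dictionary giving π(a | s), with actions selected by sampling. The states and actions reuse the recycling-robot names; the probabilities are arbitrary choices for illustration.

```python
import random

# pi[state][action] = probability of selecting that action in that state
pi = {
    "high": {"search": 0.8, "wait": 0.2},
    "low":  {"search": 0.1, "wait": 0.5, "recharge": 0.4},
}

def sample_action(state):
    """Draw an action according to pi(. | state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(pi["low"]["recharge"])  # pi(recharge | low) = 0.4
print(sample_action("high"))  # 'search' about 80% of the time
```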
State-value function
● v_π(s): expected return when starting in state s and following policy π thereafter
Action-value function
● q_π(s, a): expected return after taking action a in state s and following policy π thereafter
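One way to read these definitions operationally: estimate v_π(s) by averaging the discounted returns of rollouts that start in s and follow π. The dynamics, policy, and numbers below are the same illustrative assumptions as in the earlier sketches, and truncating rollouts at a fixed horizon is itself an approximation.

```python
import random

gamma = 0.9
p = {  # dynamics p(s', r | s, a) as (probability, next_state, reward) triples
    ("high", "search"):   [(0.9, "high", 2.0), (0.1, "low", 2.0)],
    ("low",  "search"):   [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", 1.0)],
    ("low",  "wait"):     [(1.0, "low", 1.0)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
pi = {"high": {"search": 0.8, "wait": 0.2},
      "low":  {"search": 0.1, "wait": 0.5, "recharge": 0.4}}

def sample_return(state, horizon=200):
    """One rollout under pi, returning the (truncated) discounted return from `state`."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        action = random.choices(list(pi[state]), weights=list(pi[state].values()))[0]
        probs, next_states, rewards = zip(*p[(state, action)])
        i = random.choices(range(len(probs)), weights=probs)[0]
        g += discount * rewards[i]
        discount *= gamma
        state = next_states[i]
    return g

estimate = sum(sample_return("high") for _ in range(2000)) / 2000
print(estimate)  # crude Monte Carlo estimate of v_pi(high)
```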
Bellman Equation
● Recursive relationship between the value of a state and the values of its successor states: v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]
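Turning the Bellman equation into an update rule gives iterative policy evaluation, a method developed in Chapter 4 of the book. A minimal sketch on the same illustrative dynamics and policy as above:

```python
gamma = 0.9
p = {
    ("high", "search"):   [(0.9, "high", 2.0), (0.1, "low", 2.0)],
    ("low",  "search"):   [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", 1.0)],
    ("low",  "wait"):     [(1.0, "low", 1.0)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
pi = {"high": {"search": 0.8, "wait": 0.2},
      "low":  {"search": 0.1, "wait": 0.5, "recharge": 0.4}}

v = {"high": 0.0, "low": 0.0}
for _ in range(1000):  # sweep repeatedly; values settle well before 1000 sweeps here
    # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    v = {s: sum(prob_a * sum(prob * (r + gamma * v[s2])
                             for prob, s2, r in p[(s, a)])
                for a, prob_a in pi[s].items())
         for s in v}

print(v)  # v_pi(high), v_pi(low): approximately satisfy the Bellman equation
```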
Optimal Policies and Value Functions
● An optimal policy π* satisfies v_{π*}(s) ≥ v_π(s) for any policy π and all states s
● There can be multiple optimal policies
● All optimal policies share the same optimal value functions: v_*(s) = max_π v_π(s) and q_*(s, a) = max_π q_π(s, a)
Bellman Optimality Equation
● Bellman equation for an optimal policy: v_*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v_*(s') ]
Solving Bellman Optimality Equation
● A system of n equations in n unknowns (one per state), nonlinear because of the max
● Possible to find the exact optimal policy
● Impractical in most environments
○ Need to know the dynamics of the environment
○ Need extreme computational power
○ Need Markov property
→ In most cases, approximation is the best possible solution.
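The same idea of turning an equation into an update rule applies to the Bellman optimality equation, giving value iteration (also a Chapter 4 topic): repeat the max-backup until the values settle, then act greedily. Again a sketch on the illustrative dynamics used above:

```python
gamma = 0.9
p = {
    ("high", "search"):   [(0.9, "high", 2.0), (0.1, "low", 2.0)],
    ("low",  "search"):   [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", 1.0)],
    ("low",  "wait"):     [(1.0, "low", 1.0)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

def backup(s, a, v):
    """Expected one-step return of (s, a) under value estimates v."""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])

v = {"high": 0.0, "low": 0.0}
for _ in range(1000):
    # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    v = {s: max(backup(s, a, v) for a in actions[s]) for s in v}

# A policy that is greedy with respect to the converged values is optimal for this MDP
pi_star = {s: max(actions[s], key=lambda a: backup(s, a, v)) for s in v}
print(v, pi_star)
```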
Approximation
● Does not require complete knowledge of environment
● Less memory and computational power needed
● Can focus learning on frequently encountered states
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
