Chapter 3: Finite Markov Decision Processes
Seungjae Ryan Lee
Markov Decision Process (MDP)
● Simplified, flexible formulation of the reinforcement learning problem
● Consists of States, Actions, and Rewards
○ States: information available to the agent
○ Actions: choices made by the agent
○ Rewards: basis for evaluating choices
Agent and Environment
● Agent: the learner; takes actions
● Environment: everything outside the agent; returns states and rewards
Agent-Environment Boundary
● Anything the agent cannot arbitrarily change is part of the environment
○ The agent might still know everything about the environment
● Different boundaries for different purposes
(Diagram: a robot's “Brain”, Machinery, Sensors, and Battery, illustrating possible boundary placements)
Agent-Environment Interactions
1. Agent observes a state
2. Agent takes an action
3. Agent receives a reward and a new state
4. Agent takes another action
5. Repeat
Transition Probability
● p(s', r | s, a): probability of reaching state s' with reward r by taking action a in state s
● Fully describes the dynamics of a finite MDP
● Can deduce other properties of the environment
Expected Rewards
● r(s, a): expected reward of taking action a in state s
● r(s, a, s'): expected reward of arriving in state s' after taking action a in state s
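As a rough illustration, the dynamics p(s', r | s, a) of a small finite MDP can be stored as a table mapping each (s, a) pair to its possible outcomes, and r(s, a) then follows by summation. The states, actions, and numbers below are made up for this sketch.

```python
# A minimal sketch: p[(s, a)] is a list of (probability, next_state, reward) triples.
p = {
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 2.0)],
    ("s1", "stay"): [(1.0, "s1", 0.5)],
}

def expected_reward(state, action):
    """r(s, a) = sum over outcomes of p(s', r | s, a) * r."""
    return sum(prob * reward for prob, _, reward in p[(state, action)])

# Sanity check: outcome probabilities for each (s, a) pair should sum to 1
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9

print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```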
Recycling Robot Example
● States: Battery status (high or low)
● Actions
○ Search: High reward. Battery status can be lowered or depleted.
○ Wait: Low reward. Battery status does not change.
○ Recharge: No reward. Battery status changed to high.
● If the battery is depleted: -3 reward, and the battery status is changed to high.
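A sketch of the recycling robot's dynamics in the same (probability, next state, reward) format. The parameters alpha, beta, r_search, and r_wait follow the book's parameterization of this example; the numeric values chosen here are assumptions for illustration only.

```python
# Illustrative parameter values (assumptions, not from the slides)
alpha, beta = 0.9, 0.6       # chance the battery level is unchanged while searching
r_search, r_wait = 2.0, 1.0  # r_search > r_wait

p = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("low",  "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # battery depleted: -3, back to high
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "wait"):     [(1.0, "low",  r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

# Each (state, action) row should describe a full probability distribution
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
```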
Transition Graph
● Graphical summary of MDP dynamics
Designing Rewards
● Reward hypothesis
○ Goals and purposes can be represented by maximization of cumulative reward
● Tell what you want to achieve, not how
● Example reward designs: +1 for each box; reward proportional to forward motion; always -1 per time step
Episodic Tasks
● Interactions can be broken into episodes
● Episodes end in a special terminal state
● Each episode is independent
● Examples: finished when the game ends; finished when the agent exits the maze
Return for Episodic Tasks
● G_t = R_{t+1} + R_{t+2} + ... + R_T: sum of rewards from time step t
● T: time of termination
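In code, the episodic return is just a sum over the rewards observed after time step t (the reward values below are made up for illustration).

```python
# Undiscounted episodic return G_t = R_{t+1} + R_{t+2} + ... + R_T
rewards = [0.0, 0.0, 1.0, -1.0, 5.0]  # R_{t+1}, ..., R_T for one episode
G_t = sum(rewards)
print(G_t)  # 5.0
```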
Continuing Tasks
● Cannot be naturally broken into episodes
● Goes on without limit
● Example: stock trading
Return for Continuing Tasks
● Sum of rewards is almost always infinite
● Need to discount future rewards by a factor γ (0 ≤ γ < 1): G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
○ If γ = 0, the return only considers the immediate reward (myopic)
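A small sketch of the discounted return: with γ < 1 and bounded rewards the infinite sum stays finite, so truncating a long reward stream gives a good approximation. The reward stream and γ below are illustrative.

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}, approximated by truncation
gamma = 0.9
rewards = [1.0] * 1000  # a long stream of +1 rewards, standing in for a continuing task

G_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G_t)             # close to 1 / (1 - gamma) = 10 for a constant +1 reward
print(1 / (1 - gamma))
```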
Unified Notation for Return
● Cumulative reward: G_t = Σ_{k=t+1}^{T} γ^{k-t-1} R_k
● T can be a finite number or infinity
● Future rewards can be discounted with a factor γ
○ If T = ∞, then γ must be less than 1
Policy
● Mapping from states to probabilities of selecting each possible action
● π(a | s): probability of selecting action a in state s
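A policy can be sketched as a nested dictionary giving π(a | s), with actions selected by sampling. The states and actions reuse the recycling-robot names; the probabilities are arbitrary choices for illustration.

```python
import random

# pi[state][action] = probability of selecting that action in that state
pi = {
    "high": {"search": 0.8, "wait": 0.2},
    "low":  {"search": 0.1, "wait": 0.5, "recharge": 0.4},
}

def sample_action(state):
    """Draw an action according to pi(. | state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(pi["low"]["recharge"])  # pi(recharge | low) = 0.4
print(sample_action("high"))  # 'search' about 80% of the time
```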
State-value function
● v_π(s): expected return when starting in state s and following policy π thereafter
Action-value function
● q_π(s, a): expected return after taking action a in state s and following policy π thereafter
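One way to read these definitions operationally: estimate v_π(s) by averaging the discounted returns of rollouts that start in s and follow π. The dynamics, policy, and numbers below are the same illustrative assumptions as in the earlier sketches, and truncating rollouts at a fixed horizon is itself an approximation.

```python
import random

gamma = 0.9
p = {  # dynamics p(s', r | s, a) as (probability, next_state, reward) triples
    ("high", "search"):   [(0.9, "high", 2.0), (0.1, "low", 2.0)],
    ("low",  "search"):   [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", 1.0)],
    ("low",  "wait"):     [(1.0, "low", 1.0)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
pi = {"high": {"search": 0.8, "wait": 0.2},
      "low":  {"search": 0.1, "wait": 0.5, "recharge": 0.4}}

def sample_return(state, horizon=200):
    """One rollout under pi, returning the (truncated) discounted return from `state`."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        action = random.choices(list(pi[state]), weights=list(pi[state].values()))[0]
        probs, next_states, rewards = zip(*p[(state, action)])
        i = random.choices(range(len(probs)), weights=probs)[0]
        g += discount * rewards[i]
        discount *= gamma
        state = next_states[i]
    return g

estimate = sum(sample_return("high") for _ in range(2000)) / 2000
print(estimate)  # crude Monte Carlo estimate of v_pi(high)
```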
Bellman Equation
● Recursive relationship between the value of a state and the values of its successor states: v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]
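Turning the Bellman equation into an update rule gives iterative policy evaluation, a method developed in Chapter 4 of the book. A minimal sketch on the same illustrative dynamics and policy as above:

```python
gamma = 0.9
p = {
    ("high", "search"):   [(0.9, "high", 2.0), (0.1, "low", 2.0)],
    ("low",  "search"):   [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", 1.0)],
    ("low",  "wait"):     [(1.0, "low", 1.0)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
pi = {"high": {"search": 0.8, "wait": 0.2},
      "low":  {"search": 0.1, "wait": 0.5, "recharge": 0.4}}

v = {"high": 0.0, "low": 0.0}
for _ in range(1000):  # sweep repeatedly; values settle well before 1000 sweeps here
    # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    v = {s: sum(prob_a * sum(prob * (r + gamma * v[s2])
                             for prob, s2, r in p[(s, a)])
                for a, prob_a in pi[s].items())
         for s in v}

print(v)  # v_pi(high), v_pi(low): approximately satisfy the Bellman equation
```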
Optimal Policies and Value Functions
● An optimal policy π* satisfies v_{π*}(s) ≥ v_π(s) for any policy π and all states s
● There can be multiple optimal policies
● All optimal policies share the same optimal value functions: v_*(s) = max_π v_π(s) and q_*(s, a) = max_π q_π(s, a)
Bellman Optimality Equation
● Bellman equation for an optimal policy: v_*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v_*(s') ]
Solving Bellman Optimality Equation
● A system of n equations in n unknowns (one per state), nonlinear because of the max
● Possible to find the exact optimal policy
● Impractical in most environments
○ Need to know the dynamics of the environment
○ Need extreme computational power
○ Need Markov property
→ In most cases, approximation is the best possible solution.
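The same idea of turning an equation into an update rule applies to the Bellman optimality equation, giving value iteration (also a Chapter 4 topic): repeat the max-backup until the values settle, then act greedily. Again a sketch on the illustrative dynamics used above:

```python
gamma = 0.9
p = {
    ("high", "search"):   [(0.9, "high", 2.0), (0.1, "low", 2.0)],
    ("low",  "search"):   [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("high", "wait"):     [(1.0, "high", 1.0)],
    ("low",  "wait"):     [(1.0, "low", 1.0)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

def backup(s, a, v):
    """Expected one-step return of (s, a) under value estimates v."""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])

v = {"high": 0.0, "low": 0.0}
for _ in range(1000):
    # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    v = {s: max(backup(s, a, v) for a in actions[s]) for s in v}

# A policy that is greedy with respect to the converged values is optimal for this MDP
pi_star = {s: max(actions[s], key=lambda a: backup(s, a, v)) for s in v}
print(v, pi_star)
```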
Approximation
● Does not require complete knowledge of environment
● Less memory and computational power needed
● Can focus learning on frequently encountered states
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
