Reinforcement Learning - Unit 1
1. Extended Example: Self-Driving Car at Traffic Signal
• Objective:
• The objective of the agent (self-driving car) is to safely and efficiently cross a traffic
intersection by deciding whether to stop, slow down, or proceed, based on the current traffic
light, speed, and distance.
• RL Elements:
• Agent - The self-driving AI controlling the car
• Environment - The road, other vehicles, and traffic lights
• State - A tuple: (traffic light color, car speed, distance to signal) e.g., (RED, 40 km/h, 15 meters)
• Action - Stop, slow down, maintain speed, accelerate
• Reward - +10 for safe crossing, -50 for running red light, -20 for emergency stop, 0 for waiting
• Policy - The learned strategy to decide which action to take in a given state
• Value Function - Long-term value of being in a specific traffic condition based on expected future rewards
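• A small Python sketch of these elements, assuming the state tuple and reward numbers from the slide; all names here are illustrative, not part of any real driving stack.
```python
from collections import namedtuple

# State tuple from the slide: (traffic light color, car speed, distance to signal)
State = namedtuple("State", ["light", "speed_kmh", "distance_m"])

ACTIONS = ["stop", "slow_down", "maintain_speed", "accelerate"]

def reward(outcome):
    """Reward scheme from the slide, keyed by the outcome of an action."""
    return {
        "safe_crossing": +10,
        "ran_red_light": -50,
        "emergency_stop": -20,
        "waiting": 0,
    }[outcome]

s = State("RED", 40, 15)   # e.g., (RED, 40 km/h, 15 meters)
print(s, ACTIONS, reward("waiting"))
```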
2. Limitations of Reinforcement Learning
Major Challenges in RL:
1. Sample Inefficiency:
→ Requires many trials to learn optimal behavior.
2. Delayed Rewards:
→ Hard to assign credit for earlier actions when rewards come late.
3. Exploration vs Exploitation Dilemma:
→ Balancing between trying new actions and sticking with known ones.
4. Non-Stationary Environments:
→ If the environment changes, the learned policy may fail.
5. Sparse Rewards:
→ Some environments give feedback rarely (e.g., games like chess).
6. Computation Cost:
→ Training time and resources can be expensive.
3. Scope of Reinforcement Learning
• Robotics: Navigation, locomotion, grasping
• Game Playing: Chess (AlphaZero), Go, Atari games
• Recommendation Systems: Adaptive and contextual recommendations
• Healthcare: Adaptive treatment strategies
• Finance: Portfolio optimization, algorithmic trading
RL works best in dynamic, interactive, feedback-rich environments.
4. Introduction to the Multi-Armed Bandit (MAB) Problem
• What is the Multi-Armed Bandit Problem?
Real-world analogy: Imagine you're in a casino. You see a row of slot machines
(called "bandits"). Each machine gives different rewards. You don't know which one
is best, so you must try and learn.
Definition:
• In MAB, an agent must choose between k different options (arms), each with an
unknown probability of giving a reward.
The goal is to maximize total reward over time.
5. Problem Setup
• Agent - Decision maker (you or the learning system)
• Arms - Each action choice (slot machine)
• Reward - Numerical feedback after pulling an arm
• Action - Choosing which arm to pull at each time step
• Objective - Learn which arms give higher rewards and pull them more often
6. Key Challenge: Exploration vs. Exploitation
• Exploration: Try different arms to gather information
• Exploitation: Choose the best-known arm for higher reward
• RL must balance both to succeed.
Bandit Loop
While learning:
1. Choose an arm (A)
2. Receive reward (R)
3. Update knowledge about arm A
4. Repeat
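• A minimal Python sketch of this loop, assuming three hypothetical arms with Bernoulli rewards; the arm is chosen at random here only because the selection strategies come later in this unit.
```python
import random

true_probs = [0.2, 0.5, 0.7]     # hidden reward probabilities (illustrative)
Q = [0.0] * len(true_probs)      # estimated value of each arm
N = [0] * len(true_probs)        # pull counts

for t in range(1000):
    a = random.randrange(len(Q))                           # 1. choose an arm A
    r = 1.0 if random.random() < true_probs[a] else 0.0    # 2. receive reward R
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                              # 3. update knowledge about arm A
                                                           # 4. repeat
print(Q)   # the estimates approach the true probabilities
```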
7. Real-world Examples of Bandits
• Online Ads - Show ad variant A/B/C and learn the best click-through rate
• A/B Testing - Website design optimization
• Medical Trials - Select the best treatment with unknown outcomes
• Recommender Systems - Which product to recommend next?
8. K-Armed Bandit Problem
Real-World Analogy:
• Imagine you are in a casino with K slot machines (bandits). Each machine gives a
random reward. You want to find the best one to win the most money over time.
But you don’t know which is the best until you try them.
Definition:
• The K-armed bandit problem is a simplified reinforcement learning setting where
an agent repeatedly chooses from K different actions (arms), each with an
unknown reward distribution, aiming to maximize total reward.
9. Problem Setup:
• Agent - The learner or decision-maker interacting with the environment
• Actions - K possible choices (arms) the agent can select at each step
• Reward - Numerical feedback after choosing an action
• Q-value - Estimated average reward for each action
• Time Step t - Discrete iteration index at which the agent selects an action and gets a reward
• Objective - Maximize total expected reward or minimize regret over time
10. Key Challenge: Exploration vs Exploitation
• Exploration: Try new actions to gather more data
• Exploitation: Use current best-known action to maximize reward
Real-World Example:
• Choosing ads to display (Google, Facebook)
• Trying different pricing strategies in e-commerce
• A/B testing for website design
11. Epsilon-Greedy Strategy
• Epsilon-Greedy is a widely-used action selection strategy in Reinforcement
Learning, especially in the context of Multi-Armed Bandit problems. It is
designed to balance two competing goals:
• Exploitation: Leveraging what the agent knows to maximize reward.
• Exploration: Trying out unknown actions to gather more information.
• This strategy uses a probabilistic approach to action selection, allowing the agent
to explore occasionally while mostly choosing the best-known action.
12. Problem Setup Table:
• Agent - The decision-maker learning through trial and error
• Action Set A - A set of K actions to choose from
• Estimated Value Q(a) - Running average reward for each action
• Epsilon (ε) - A small number between 0 and 1 representing the exploration rate
• Selection Rule - Randomly explore with probability ε; exploit the best-known action with probability 1 − ε
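• A short Python sketch of this selection rule, assuming the Q estimates are kept in a simple list; the ε value and example numbers are illustrative.
```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                # explore: random arm
    return max(range(len(Q)), key=lambda a: Q[a])      # exploit: highest estimated value

# Example: with Q = [0.2, 0.9, 0.4], the rule returns arm 1 most of the time.
print(epsilon_greedy([0.2, 0.9, 0.4]))
```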
13. Pros & Cons
Pros:
• Simple to implement
• Good initial performance in stochastic environments
Cons:
• Can waste steps exploring poor actions
• Does not consider uncertainty in a principled way
14. Epsilon-Greedy Strategy
Real-World Analogy:
• Imagine you're choosing which restaurant to order food from. You usually go with
your favorite (best-known), but once in a while you try a new one to see if it’s better.
Definition:
• With probability ε, the agent chooses a random action (exploration), and with
probability 1 - ε, it selects the action with the highest estimated reward
(exploitation).
Real-World Example:
• Online recommendation systems occasionally showing something new
• Game AI occasionally trying non-optimal moves to learn more
15. Optimistic Initial Values
• Conceptual Overview:
• Optimistic Initial Values is a method to encourage exploration without
randomness. By initializing all Q-values to high numbers, the agent is incentivized
to try all actions at least once. It eventually learns the true values based on
feedback, reducing exploration over time.
• Learning Behavior:
• Strong initial incentive to try all actions
• After multiple trials, Q-values converge to actual expected reward
16. Problem Setup Table:
• Agent - Learns optimal action values by trial and error
• Actions - Each action starts with an optimistic (high) estimated reward
• Initial Q(a) - Initialized higher than the maximum expected reward (e.g., Q(a) = 5)
• Reward Signal - Agent receives the actual reward and updates Q(a) after each action
• Exploration Trigger - High Q(a) values drive initial exploration
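• A minimal sketch, assuming a purely greedy agent whose Q-values all start at the optimistic value 5 from the table; the Bernoulli arm probabilities are illustrative.
```python
import random

true_probs = [0.3, 0.6, 0.8]     # hidden reward probabilities (illustrative)
Q = [5.0] * len(true_probs)      # optimistic initial estimates, well above any real reward
N = [0] * len(true_probs)

for t in range(500):
    a = max(range(len(Q)), key=lambda i: Q[i])   # purely greedy: no epsilon needed
    r = 1.0 if random.random() < true_probs[a] else 0.0
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]    # each disappointment pulls Q(a) down toward its true value

print(Q)   # after enough pulls the greedy choice settles on the best arm
```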
17. Pros & Cons
Pros:
• Eliminates need for ε parameter
• Systematic exploration based on assumptions
Cons:
• Sensitive to initial value selection
• May converge slowly or not at all in noisy environments
Real-World Analogy:
• You assume all restaurants have amazing food without knowing anything. You try
each one to confirm. Over time, you find some are not that great, so you stop going.
Definition:
• Instead of initializing Q(a) to 0, we set them to high optimistic values to encourage
early exploration.
18. Key Challenges
• May lead to slow learning in noisy environments
• Can bias the agent toward incorrect assumptions if not adjusted
• Real-World Example:
• Assuming all investment options are great and trying them early in a portfolio
• Testing all marketing channels thinking they perform equally well
19. Upper Confidence Bound (UCB)
Conceptual Overview:
• UCB is a more intelligent approach to exploration. It adds a confidence bonus to the
estimated reward of each action. The agent selects the action with the highest upper
confidence bound, encouraging exploration of uncertain actions.
Definition:
• UCB selects actions based on their estimated value plus a confidence bonus. It prefers
actions that are uncertain and potentially better.
Real-World Example:
• Optimizing user click-through rates in ad serving
• Balancing medical treatment tests between new and known protocols
20. Problem Setup Table:
• Agent - Makes sequential decisions and receives feedback
• Estimated Q(a) - The average reward received from action a
• N(a) - Number of times action a has been selected
• t (time) - Total number of steps taken
• UCB Score - Q(a) + c × sqrt( ln(t) / N(a) ), combining value and uncertainty
Pros:
• No need for ε or injected randomness
• Smarter exploration based on confidence and data
Cons:
• Requires careful computation of logs and counters
• Needs tuning of the parameter c
21. Mathematical Formula
• UCB(a) = Q(a) + c × sqrt( ln(t) / N(a) )
• Q(a): Estimated value
• N(a): Number of times a has been selected
• t: Current timestep
• c: Controls balance between exploration and exploitation
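• A small Python sketch of this formula, using the counters from the problem setup table; untried arms are selected first so that N(a) is never zero in the denominator, and the example numbers are illustrative.
```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Pick the arm with the highest upper confidence bound Q(a) + c*sqrt(ln(t)/N(a))."""
    scores = []
    for a in range(len(Q)):
        if N[a] == 0:
            return a                                   # try every arm once before scoring
        scores.append(Q[a] + c * math.sqrt(math.log(t) / N[a]))
    return max(range(len(scores)), key=lambda a: scores[a])

# Example: arm 1 has a slightly lower estimate but far fewer pulls, so its bonus may win.
print(ucb_select(Q=[0.50, 0.45, 0.30], N=[100, 5, 80], t=185))
```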
22. Strategy Comparison Overview
• Epsilon-Greedy - Analogy: occasionally try new restaurants. Key idea: random exploration. Trade-off: simplicity vs. wasteful trials.
• Optimistic Initialization - Analogy: start with high expectations for all arms. Key idea: early forced exploration. Trade-off: may overestimate.
• UCB - Key idea: use a reward + uncertainty bonus for directed exploration. Trade-off: complex but smart decisions.
23. Where Each Strategy Really Works Best
• Epsilon-Greedy - Problems with a large action space and stochastic rewards where simplicity matters. Suitable for web recommendations, simple games, and online learning where slight inefficiency is tolerable.
• Optimistic Initialization - Early-phase exploration in static environments with limited noise. Ideal for deterministic settings like product testing, robotics tasks with stable dynamics, or when rewards are sparse.
• Upper Confidence Bound (UCB) - High-stakes environments requiring efficient exploration. Best for ad serving, clinical trials, and online platforms where early bad choices are costly and informed exploration is crucial.
24. 10-Armed Testbed
• The 10-Armed Testbed is a simulated environment commonly used to test how well different
reinforcement learning strategies perform in learning which actions yield the best rewards.
• It’s called “10-Armed” because there are 10 possible actions (or arms) that the agent can
choose from — just like 10 slot machines in a casino, each giving unpredictable rewards.
• The agent doesn't know which arm is best in advance, so it needs to try different arms,
observe the rewards, and gradually learn to pick the one that gives the highest average
reward.
• It is widely used as a benchmark for comparing strategies like:
• Epsilon-Greedy, Optimistic Initial Values, Upper Confidence Bound (UCB)
• A testbed provides a repeatable, structured way to:
• Evaluate how quickly a strategy finds the optimal action
• Compare exploration vs. exploitation performance
• Visualize learning curves across strategies
25. How the 10-Armed Testbed Works
• The 10-Armed Testbed is built to simulate a situation where an agent interacts
with an environment by repeatedly choosing from 10 different actions (or arms),
each with its own hidden reward potential. The idea is to let the agent learn which
action gives the highest reward on average, through trial and error.
• Step-by-Step Setup:
• 10 Actions
• Hidden Mean Rewards
• Agent Decision
• Reward Returned
• Update Estimate
Purpose of Running Many Episodes:
• Since each simulation starts with a different random setup of arm rewards, we run thousands of
episodes (e.g., 2000) and average the results to compare strategies fairly.
26. Evaluation Metrics
• Average Reward - Mean of all rewards received over time
• % Optimal Action Selection - Fraction of time steps on which the agent chose the action with the highest true value
Problem Setup Table
• Number of Arms - 10
• True Means μᵢ - Sampled from N(0, 1)
• Rewards R(aᵢ) - Sampled from N(μᵢ, 1) for each chosen arm
• Agent Estimation - Maintains Q(aᵢ), updated with each new reward
• Action Selection - Strategy-dependent (Epsilon-Greedy, UCB, etc.)
• Goal - Maximize total reward; identify and exploit the optimal arm efficiently
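• A compact sketch of one testbed run in Python (NumPy), following the setup above: ten true means drawn from N(0, 1), rewards drawn from N(μᵢ, 1), and an ε-greedy agent as one example strategy; the step count and ε are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
k, steps, epsilon = 10, 1000, 0.1

mu = rng.normal(0, 1, size=k)    # hidden true means, sampled from N(0, 1)
Q = np.zeros(k)                  # value estimates
N = np.zeros(k)                  # selection counts
optimal = int(np.argmax(mu))
optimal_picks = 0

for t in range(steps):
    if rng.random() < epsilon:
        a = int(rng.integers(k))     # explore
    else:
        a = int(np.argmax(Q))        # exploit
    r = rng.normal(mu[a], 1)         # reward drawn from N(mu_a, 1)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]        # sample-average update
    optimal_picks += (a == optimal)

print("% optimal action:", 100 * optimal_picks / steps)
```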
27. Algorithm Evaluation via Testbed
Compared Algorithms:
• Greedy (no exploration)
• Epsilon-Greedy (with ε = 0.1 or 0.01)
• Optimistic Initial Values (e.g., Q = 5)
• UCB (e.g., c = 2)
What You Learn:
• Which strategy converges to the best action faster
• How average reward differs across time
• Whether strategies over-explore or under-explore
28. Real-World Analogy
Advertising Campaigns
• Imagine you are testing 10 different advertising campaigns. You don’t know
which performs best. Each time you run one (select an arm), you get varying ROI
(reward).
Your goal:
• Learn which campaign consistently yields the highest ROI
• Stop wasting money on ineffective ones
• This is exactly what the 10-armed testbed simulates — trying out options, learning
from feedback, and gradually choosing the best.
29. Incremental Implementation
Definition
• A technique to update action-value estimates without storing all past rewards, by
adjusting the old estimate towards the new reward using a learning rate.
Why Important?
• In real-world scenarios, data is often streaming (e.g., online ads, user clicks).
• Storing all data for averages is expensive and not practical.
• Incremental method allows real-time updates.
Mathematical Intuition: Qₙ₊₁ = Qₙ + α(Rₙ − Qₙ)
• The error term (Rₙ − Qₙ) measures surprise.
• Multiply by α to control the size of the adjustment.
30. Incremental Implementation
Step-by-Step Formula
• New Estimate = Old Estimate + α × (New Info − Old Estimate)
Example
• Old estimate Qₙ = 5
• New reward Rₙ = 7
• α = 0.1
• Qₙ₊₁ = 5 + 0.1 × (7 − 5) = 5.2
Analogy
• Guessing someone’s favorite color → You guessed blue, they like green → Adjust slightly
toward green (not completely change your guess).
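• A one-function Python sketch of this update, reproducing the worked example above (the numbers are the ones from the slide).
```python
def incremental_update(q_old, reward, alpha):
    """New estimate = old estimate + alpha * (reward - old estimate)."""
    return q_old + alpha * (reward - q_old)

# Worked example from the slide: Q_n = 5, R_n = 7, alpha = 0.1
print(incremental_update(5.0, 7.0, 0.1))   # -> 5.2
```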
31. Tracking a Non-Stationary Problem
Definition
• A method to keep estimates accurate when reward probabilities change over time, by
giving more weight to recent data.
Why Important?
• Real-world is dynamic: stock prices, ad clicks, user tastes.
• Old data becomes irrelevant.
Key Idea
• Use constant α instead of decreasing α.
• Recent rewards influence more → quick adaptation.
Formula
• Qₙ₊₁ = Qₙ + α(Rₙ − Qₙ), with α held constant
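• A brief sketch, assuming a single arm whose true mean drifts upward over time; the constant step size keeps the estimate close to the moving target (the drift rate and α are illustrative).
```python
import random

q, alpha, true_mean = 0.0, 0.1, 0.0

for t in range(2000):
    true_mean += 0.005                    # the environment slowly drifts
    r = random.gauss(true_mean, 1.0)      # noisy reward around the current mean
    q += alpha * (r - q)                  # constant-alpha update: recent rewards weigh more

print(round(true_mean, 2), round(q, 2))   # the estimate tracks the moving mean
```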
32. Associative Search / Contextual Bandits
Definition
• A bandit problem where the best action depends on the context (state of environment)
Why Important?
• Many real problems are context-dependent:
• Ads shown depend on user profile.
• Medicine depends on patient condition.
Formula
• Q(s,a)=Expected reward for state s and action a
• Example
• State: Morning → Action: Coffee
• State: Evening → Action: Tea
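• A tiny sketch of a contextual bandit, assuming a Q-table keyed by (state, action) pairs and the morning/evening example from the slide; the states, actions, and reward rule are all illustrative.
```python
import random

states = ["Morning", "Evening"]
actions = ["Coffee", "Tea"]
Q = {(s, a): 0.0 for s in states for a in actions}   # Q(s, a) table
N = {(s, a): 0 for s in states for a in actions}

def reward(s, a):
    # Illustrative rule: Coffee works best in the morning, Tea in the evening.
    best = "Coffee" if s == "Morning" else "Tea"
    return 1.0 if a == best else 0.0

for t in range(1000):
    s = random.choice(states)                            # observe the context
    if random.random() < 0.1:                            # epsilon-greedy within each state
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q[(s, act)])
    r = reward(s, a)
    N[(s, a)] += 1
    Q[(s, a)] += (r - Q[(s, a)]) / N[(s, a)]

print(Q)   # learns Coffee for Morning and Tea for Evening
```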