Extended Example: Self-Driving Car at Traffic Signal
• Objective:
• The objective of the agent (self-driving car) is to safely and efficiently cross a traffic
intersection by deciding whether to stop, slow down, or proceed, based on the current traffic
light, speed, and distance.
• RL Element
• Agent - The self-driving AI controlling the car
• Environment - The road, other vehicles, and traffic lights
• State - A tuple: (traffic light color, car speed, distance to signal) e.g., (RED, 40 km/h, 15 meters)
• Action - Stop, slow down, maintain speed, accelerate
• Reward - +10 for safe crossing, -50 for running red light, -20 for emergency stop, 0 for waiting
• Policy - The learned strategy to decide which action to take in a given state
• Value Function - Long-term value of being in a specific traffic condition based on expected future rewards
Limitations of Reinforcement Learning
Major Challenges in RL:
1. Sample Inefficiency:
→ Requires many trials to learn optimal behavior.
2. Delayed Rewards:
→ Hard to assign credit for earlier actions when rewards come late.
3. Exploration vs Exploitation Dilemma:
→ Balancing between trying new actions and sticking with known ones.
4. Non-Stationary Environments:
→ If the environment changes, the learned policy may fail.
5. Sparse Rewards:
→ Some environments give feedback rarely (e.g., games like chess).
6. Computation Cost:
→ Training time and resources can be expensive.
Scope of Reinforcement Learning
• Robotics: Navigation, locomotion, grasping
• Game Playing: Chess (AlphaZero), Go, Atari games
• Recommendation Systems: Adaptive and contextual recommendations
• Healthcare: Adaptive treatment strategies
• Finance: Portfolio optimization, algorithmic trading
RL works best in dynamic, interactive, feedback-rich environments.
Introduction to the Multi-Armed Bandit (MAB) Problem
• What is the Multi-Armed Bandit Problem?
Real-world analogy: Imagine you're in a casino. You see a row of slot machines
(called "bandits"). Each machine gives different rewards. You don’t know which one
is best — you must try and learn.
Definition:
• In MAB, an agent must choose between k different options (arms), each with an
unknown probability of giving a reward.
The goal is to maximize total reward over time.
Problem Setup
• Agent - Decision maker (you or the learning system)
• Arms - Each action choice (a slot machine)
• Reward - Numerical feedback after pulling an arm
• Action - Choosing which arm to pull at each time step
• Objective - Learn which arms give higher rewards and pull them more often
Key Challenge: Exploration vs. Exploitation
• Exploration: Try different arms to gather information
• Exploitation: Choose the best-known arm for higher reward
• RL must balance both to succeed.
Bandit Loop
While learning:
1. Choose an arm (A)
2. Receive reward (R)
3. Update knowledge about arm A
4. Repeat
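A minimal Python sketch of this loop, assuming three Bernoulli arms with made-up reward probabilities (the agent here picks arms uniformly at random; the strategies discussed later only change that selection step):

```python
import random

# Hidden reward probabilities of three hypothetical Bernoulli arms (illustrative values).
true_probs = [0.2, 0.5, 0.75]
counts = [0] * len(true_probs)      # how often each arm has been pulled
values = [0.0] * len(true_probs)    # running average reward per arm (Q-values)

for t in range(1000):
    arm = random.randrange(len(true_probs))                      # 1. choose an arm
    reward = 1.0 if random.random() < true_probs[arm] else 0.0   # 2. receive its reward
    counts[arm] += 1                                             # 3. update knowledge
    values[arm] += (reward - values[arm]) / counts[arm]          #    (incremental average)
                                                                 # 4. repeat

print(values)   # estimates approach the hidden probabilities
```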
Real-world Examples of Bandits
Domain Example Use Case
Online Ads Show ad variant A/B/C — learn best click-through rate
A/B Testing Website design optimization
Medical Trials Select best treatment with unknown outcomes
Recommender Systems Which product to recommend next?
K-Armed Bandit Problem
Real-World Analogy:
• Imagine you are in a casino with K slot machines (bandits). Each machine gives a
random reward. You want to find the best one to win the most money over time.
But you don’t know which is the best until you try them.
Definition:
• The K-armed bandit problem is a simplified reinforcement learning setting where
an agent repeatedly chooses from K different actions (arms), each with an
unknown reward distribution, aiming to maximize total reward.
Problem Setup:
• Agent - The learner or decision-maker interacting with the environment
• Actions - K possible choices (arms) the agent can select at each step
• Reward - Numerical feedback after choosing an action
• Q-value - Estimated average reward for each action
• Time Step t - Discrete iteration index at which the agent selects an action and gets a reward
• Objective - Maximize total expected reward (or minimize regret) over time
Key Challenge: Exploration vs Exploitation
• Exploration: Try new actions to gather more data
• Exploitation: Use current best-known action to maximize reward
Real-World Example:
• Choosing ads to display (Google, Facebook)
• Trying different pricing strategies in e-commerce
• A/B testing for website design
Epsilon-Greedy Strategy
• Epsilon-Greedy is a widely-used action selection strategy in Reinforcement
Learning, especially in the context of Multi-Armed Bandit problems. It is
designed to balance two competing goals:
• Exploitation: Leveraging what the agent knows to maximize reward.
• Exploration: Trying out unknown actions to gather more information.
• This strategy uses a probabilistic approach to action selection, allowing the agent
to explore occasionally while mostly choosing the best-known action.
Problem Setup Table:
• Agent - The decision-maker learning through trial and error
• Action Set A - A set of K actions to choose from
• Estimated Value Q(a) - Running average reward for each action
• Epsilon (ε) - A small number between 0 and 1 giving the exploration rate
• Selection Rule - Explore randomly with probability ε; exploit the best-known action with probability 1 − ε
Pros & Cons
Pros:
• Simple to implement
• Good initial performance in stochastic environments
Cons:
• Can waste steps exploring poor actions
• Does not consider uncertainty in a principled way
Epsilon-Greedy Strategy
Real-World Analogy:
• Imagine you're choosing which restaurant to order food from. You usually go with
your favorite (best-known), but once in a while you try a new one to see if it’s better.
Definition:
• With probability ε, the agent chooses a random action (exploration), and with
probability 1 - ε, it selects the action with the highest estimated reward
(exploitation).
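A minimal Python sketch of this rule; the function name and the example Q-values are illustrative:

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon pick a random arm (explore),
    otherwise pick the arm with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: with Q = [0.2, 0.9, 0.5] the function returns arm 1 about 90% of the time.
print(epsilon_greedy_action([0.2, 0.9, 0.5]))
```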
Real-World Example:
• Online recommendation systems occasionally showing something new
• Game AI occasionally trying non-optimal moves to learn more
Optimistic Initial Values
• Conceptual Overview:
• Optimistic Initial Values is a method to encourage exploration without
randomness. By initializing all Q-values to high numbers, the agent is incentivized
to try all actions at least once. It eventually learns the true values based on
feedback, reducing exploration over time.
• Learning Behavior:
• Strong initial incentive to try all actions
• After multiple trials, Q-values converge to actual expected reward
Problem Setup Table:
• Agent - Learns optimal action values by trial and error
• Actions - Each action starts with an optimistic (high) estimated reward
• Initial Q(a) - Initialized above the maximum expected reward (e.g., Q(a) = 5)
• Reward Signal - The agent receives the actual reward and updates Q(a) after each action
• Exploration Trigger - High initial Q(a) values drive early exploration
Pros & Cons
Pros:
• Eliminates need for ε parameter
• Systematic exploration based on assumptions
Cons:
• Sensitive to initial value selection
• May converge slowly or not at all in noisy environments
Real-World Analogy:
• You assume all restaurants have amazing food without knowing anything. You try
each one to confirm. Over time, you find some are not that great, so you stop going.
Definition:
• Instead of initializing Q(a) to 0, we set them to high optimistic values to encourage
early exploration.
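A minimal Python sketch of optimistic initialization with a purely greedy learner; the arm probabilities and the initial value of 5 are illustrative:

```python
import random

# A purely greedy learner with optimistic starting estimates; arm probabilities
# and the initial value of 5 are illustrative (true rewards here are at most 1).
true_probs = [0.3, 0.6, 0.8]
q_values = [5.0] * len(true_probs)   # optimistic initial Q(a)
counts = [0] * len(true_probs)

for t in range(500):
    arm = max(range(len(q_values)), key=lambda a: q_values[a])   # greedy selection
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]      # estimate falls toward truth

print(q_values)   # every arm is tried early, because untried arms still carry the value 5
```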
Key Challenges
• May lead to slow learning in noisy environments
• Can bias the agent toward incorrect assumptions if not adjusted
• Real-World Example:
• Assuming all investment options are great and trying them early in a portfolio
• Testing all marketing channels thinking they perform equally well
Upper Confidence Bound (UCB)
Conceptual Overview:
• UCB is a more intelligent approach to exploration. It adds a confidence bonus to the
estimated reward of each action. The agent selects the action with the highest upper
confidence bound, encouraging exploration of uncertain actions.
Definition:
• UCB selects actions based on their estimated value plus a confidence bonus. It prefers
actions that are uncertain and potentially better.
Real-World Example:
• Optimizing user click-through rates in ad serving
• Balancing medical treatment tests between new and known protocols
Problem Setup Table:
• Agent - Makes sequential decisions and receives feedback
• Estimated Q(a) - The average reward received from action a
• N(a) - Number of times action a has been selected
• t (time) - Total number of steps taken so far
• UCB Score - Q(a) + c × sqrt( ln(t) / N(a) ), combining value and uncertainty
Pros:
• No need for an ε parameter or purely random exploration
• Smarter exploration based on confidence and data
Cons:
• Requires maintaining per-action counts and computing logarithms
• Needs tuning of the parameter c
Mathematical Formula
• UCB(a) = Q(a) + c × sqrt( ln(t) / N(a) )
• Q(a): Estimated value
• N(a): Number of times a has been selected
• t: Current timestep
• c: Controls balance between exploration and exploitation
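A minimal Python sketch of this selection rule; ln(t) is the natural logarithm, and untried arms are pulled first since their uncertainty bonus is effectively unbounded:

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the arm maximizing Q(a) + c * sqrt(ln(t) / N(a)).
    Arms that have never been pulled are selected first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# Example: arm 1 has a slightly lower estimate but far fewer pulls, so its bound wins.
print(ucb_action([0.55, 0.50], [100, 4], t=104))
```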
Strategy Comparison Overview
• Epsilon-Greedy - Analogy: occasionally try new restaurants; Key idea: random exploration; Trade-off: simplicity vs. wasteful trials
• Optimistic Initialization - Analogy: start with high expectations for all arms; Key idea: early forced exploration; Trade-off: may overestimate
• UCB - Analogy: use a reward + uncertainty bonus; Key idea: directed exploration; Trade-off: complex but smart decisions
Where Each Strategy Really Works Best
• Epsilon-Greedy - Problems with large action spaces and stochastic rewards where simplicity matters. Suitable for web recommendations, simple games, and online learning where slight inefficiency is tolerable.
• Optimistic Initialization - Early-phase exploration in static environments with limited noise. Ideal for deterministic settings like product testing, robotics tasks with stable dynamics, or when rewards are sparse.
• Upper Confidence Bound (UCB) - High-stakes environments requiring efficient exploration. Best for ad serving, clinical trials, and online platforms where early bad choices are costly and informed exploration is crucial.
10-Armed Testbed
• The 10-Armed Testbed is a simulated environment commonly used to test how well different
reinforcement learning strategies perform in learning which actions yield the best rewards.
• It’s called “10-Armed” because there are 10 possible actions (or arms) that the agent can
choose from — just like 10 slot machines in a casino, each giving unpredictable rewards.
• The agent doesn't know which arm is best in advance, so it needs to try different arms,
observe the rewards, and gradually learn to pick the one that gives the highest average
reward.
• It is widely used as a benchmark for comparing strategies like:
• Epsilon-Greedy, Optimistic Initial Values, Upper Confidence Bound (UCB)
• A testbed provides a repeatable, structured way to:
• Evaluate how quickly a strategy finds the optimal action
• Compare exploration vs. exploitation performance
• Visualize learning curves across strategies
How the 10-Armed Testbed Works
• The 10-Armed Testbed is built to simulate a situation where an agent interacts
with an environment by repeatedly choosing from 10 different actions (or arms),
each with its own hidden reward potential. The idea is to let the agent learn which
action gives the highest reward on average, through trial and error.
• Step-by-Step Setup:
• 10 Actions
• Hidden Mean Rewards
• Agent Decision
• Reward Returned
• Update Estimate
Purpose of Running Many Episodes:
• Since each simulation starts with a different random setup of arm rewards, we run thousands of
episodes (e.g., 2000) and average the results to compare strategies fairly.
Evaluation Metrics
• Average Reward - Mean of all rewards received over time
• % Optimal Action Selection - Fraction of time steps on which the agent chose the action with the highest true value
Problem Setup Table
• Number of Arms - 10
• True Means μᵢ - Sampled from N(0, 1)
• Rewards R(aᵢ) - Sampled from N(μᵢ, 1) for each chosen arm
• Agent Estimation - Maintains Q(aᵢ), updated with each new reward
• Action Selection - Strategy-dependent (Epsilon-Greedy, UCB, etc.)
• Goal - Maximize total reward; identify and exploit the optimal arm efficiently
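A minimal Python sketch of a single run under this setup, using epsilon-greedy for action selection (the other strategies change only the selection line); the seed, step count, and ε are illustrative. Averaging many such runs (e.g., 2000) produces the comparison curves:

```python
import numpy as np

rng = np.random.default_rng(0)
k, steps, epsilon = 10, 1000, 0.1

true_means = rng.normal(0, 1, k)     # mu_i ~ N(0, 1), hidden from the agent
q = np.zeros(k)                      # estimated action values Q(a_i)
n = np.zeros(k)                      # pull counts
best_arm = int(np.argmax(true_means))
optimal_picks = 0

for t in range(steps):
    if rng.random() < epsilon:
        a = int(rng.integers(k))                 # explore
    else:
        a = int(np.argmax(q))                    # exploit
    reward = rng.normal(true_means[a], 1)        # R ~ N(mu_a, 1)
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]               # sample-average update
    optimal_picks += (a == best_arm)

print(f"% optimal action: {100 * optimal_picks / steps:.1f}")
```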
Algorithm Evaluation via Testbed
Compared Algorithms:
• Greedy (no exploration)
• Epsilon-Greedy (with ε = 0.1 or 0.01)
• Optimistic Initial Values (e.g., Q = 5)
• UCB (e.g., c = 2)
What You Learn:
• Which strategy converges to the best action faster
• How average reward differs across time
• Whether strategies over-explore or under-explore
Real-World Analogy
Advertising Campaigns
• Imagine you are testing 10 different advertising campaigns. You don’t know
which performs best. Each time you run one (select an arm), you get varying ROI
(reward).
Your goal:
• Learn which campaign consistently yields the highest ROI
• Stop wasting money on ineffective ones
• This is exactly what the 10-armed testbed simulates — trying out options, learning
from feedback, and gradually choosing the best.
Incremental Implementation
Definition
• A technique to update action-value estimates without storing all past rewards, by
adjusting the old estimate towards the new reward using a learning rate.
Why Important?
• In real-world scenarios, data is often streaming (e.g., online ads, user clicks).
• Storing all data for averages is expensive and not practical.
• Incremental method allows real-time updates.
Mathematical Intuition: Qₙ₊₁ = Qₙ + α(Rₙ − Qₙ)
• The error term (Rₙ − Qₙ) measures surprise.
• Multiplying by α controls the size of the adjustment.
Incremental Implementation
Step-by-Step Formula
• New Estimate = Old Estimate + α × (New Info − Old Estimate)
Example
• Old estimate Qₙ = 5
• New reward Rₙ = 7
• α = 0.1
• Qₙ₊₁ = 5 + 0.1 × (7 − 5) = 5.2
Analogy
• Guessing someone’s favorite color → You guessed blue, they like green → Adjust slightly
toward green (not completely change your guess).
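A one-line Python sketch of this update, checked against the slide's numbers:

```python
def update_estimate(old_q, reward, alpha):
    """Incremental update: Q_(n+1) = Q_n + alpha * (R_n - Q_n)."""
    return old_q + alpha * (reward - old_q)

# Worked example from the slide: Q_n = 5, R_n = 7, alpha = 0.1
print(update_estimate(5.0, 7.0, 0.1))   # 5.2
```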
Tracking a Non-Stationary Problem
Definition
• A method to keep estimates accurate when reward probabilities change over time, by
giving more weight to recent data.
Why Important?
• Real-world is dynamic: stock prices, ad clicks, user tastes.
• Old data becomes irrelevant.
Key Idea
• Use constant α instead of decreasing α.
• Recent rewards influence more → quick adaptation.
Formula
• Qₙ₊₁ = Qₙ + α(Rₙ − Qₙ), with a constant step size α
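A minimal Python sketch contrasting the sample-average (1/n) update with the constant-α update on a single arm whose reward probability drifts halfway through; the probabilities and α are illustrative:

```python
import random

p, alpha = 0.8, 0.1
q_avg, q_const, n = 0.0, 0.0, 0

for t in range(2000):
    if t == 1000:
        p = 0.2                          # the environment changes
    r = 1.0 if random.random() < p else 0.0
    n += 1
    q_avg += (r - q_avg) / n             # 1/n step size: weights all history equally
    q_const += alpha * (r - q_const)     # constant alpha: weights recent rewards more

print(round(q_avg, 2), round(q_const, 2))   # q_const tracks the new rate (~0.2), q_avg lags (~0.5)
```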
Associative Search / Contextual Bandits
Definition
• A bandit problem in which the best action depends on the context (the state of the environment).
Why Important?
• Many real problems are context-dependent:
• Ads shown depend on user profile.
• Medicine depends on patient condition.
Formula
• Q(s, a) = expected reward for state s and action a
• Example
• State: Morning → Action: Coffee
• State: Evening → Action: Tea
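A minimal Python sketch of a contextual bandit that keeps one Q-value per (state, action) pair and runs epsilon-greedy within each state; the states, actions, and reward rule below are illustrative placeholders:

```python
import random

states = ["morning", "evening"]
actions = ["coffee", "tea"]
q = {(s, a): 0.0 for s in states for a in actions}   # Q(s, a) estimates
n = {(s, a): 0 for s in states for a in actions}     # pull counts per (s, a)

def reward(state, action):
    # Hypothetical environment: coffee pays off in the morning, tea in the evening.
    best = "coffee" if state == "morning" else "tea"
    return 1.0 if action == best and random.random() < 0.9 else 0.0

for t in range(2000):
    s = random.choice(states)                              # observe the context
    if random.random() < 0.1:                              # explore within this state
        a = random.choice(actions)
    else:                                                  # exploit the best action for s
        a = max(actions, key=lambda act: q[(s, act)])
    r = reward(s, a)
    n[(s, a)] += 1
    q[(s, a)] += (r - q[(s, a)]) / n[(s, a)]               # incremental average per (s, a)

print(q)   # highest estimates land on (morning, coffee) and (evening, tea)
```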