2. INTRODUCTION
What is Reinforcement Learning?
• A type of Machine Learning where an agent learns to make decisions by interacting with an environment.
• The agent receives rewards or penalties based on its actions and aims to maximize total reward.
• Core elements:
• Agent: The learner or decision-maker.
• Environment: Where the agent operates.
• Action: Choices the agent can make.
• State: Current situation returned by the environment.
• Reward: Feedback from the environment.
• Reinforcement Learning ≠ Supervised Learning (no labeled data; learns from experience).
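To make these elements concrete, here is a minimal sketch of the agent-environment loop in Python. The tiny GridEnv class and its reset/step interface are assumptions made for illustration only: the agent observes a state, picks an action, and the environment returns the next state and a reward or penalty.

import random

class GridEnv:
    # Hypothetical toy environment: a 1-D grid with states 0..4 and the goal at state 4.
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = move left, 1 = move right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else -0.1   # reward at the goal, small penalty otherwise
        done = self.state == 4                      # the episode ends at the goal
        return self.state, reward, done

env = GridEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])            # the agent chooses an action
    state, reward, done = env.step(action)    # the environment returns the new state and reward
    total_reward += reward
print("total reward:", total_reward)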
3. INTRODUCTION
What is Q-Learning?
• Q-learning is a model-free, value-based, off-policy algorithm that finds the best sequence of
actions to take given the agent's current state.
• The “Q” stands for quality.
• Quality represents how valuable the action is in maximizing future rewards.
• Model-free algorithms learn the consequences of their actions through experience, without
needing the environment's transition and reward functions.
• Value-based methods train a value function to learn which states are more valuable, and act
accordingly.
• Policy-based methods train the policy directly to learn which action to take in a given state.
• In off-policy learning, the algorithm evaluates and updates a policy that differs from the policy
used to take actions (see the sketch after this list).
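As a rough illustration of the off-policy idea: actions are taken with an exploratory epsilon-greedy behaviour policy, while the update target evaluates the greedy policy (the max over the next state's Q-values). The Q-table shape and epsilon value below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros((5, 2))          # assumed Q-table: 5 states x 2 actions
epsilon = 0.1

def behavior_policy(state):
    # Policy used to TAKE actions: epsilon-greedy, so it keeps exploring.
    if rng.random() < epsilon:
        return int(rng.integers(2))
    return int(np.argmax(Q[state]))

def greedy_target_value(next_state):
    # Value used to UPDATE Q: the greedy max over the next state's actions,
    # i.e. a different policy than the one that chose the action (off-policy).
    return float(np.max(Q[next_state]))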
4. TERMINOLOGIES
• State (s): the current position of the agent in the environment.
• Action (a): a step taken by the agent in a particular state.
• Rewards: for every action, the agent receives a reward or a penalty.
• Episodes: an episode ends when the agent can take no new action, i.e., when it has
achieved the goal or failed.
• Q(St+1, a): the expected optimal Q-value of taking an action in the next state.
• Q(St, At): the current Q-value estimate for state St and action At, which is updated toward
the target built from Q(St+1, a).
• Q-Table: the agent maintains a Q-table with one entry for every (state, action) pair.
• Temporal Difference (TD): the update signal for Q(St, At), computed from the current reward,
the discounted value of the next state, and the previous estimate (see the sketch below).
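One possible way to hold these quantities in code: a Q-table indexed by (state, action), with the temporal difference computed from the current reward and the best Q-value of the next state. The table size, reward, and discount factor below are assumptions for illustration.

import numpy as np

n_states, n_actions = 6, 2
Q = np.zeros((n_states, n_actions))     # Q-Table: one value per (state, action) pair

s, a = 2, 1                             # current state St and action At
r, s_next = -0.1, 3                     # reward received and next state St+1
gamma = 0.9                             # discount factor

td_target = r + gamma * np.max(Q[s_next])    # estimate built from Q(St+1, a)
td_error = td_target - Q[s, a]               # the temporal difference for Q(St, At)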
6. THE BELLMAN EQUATION
In order to find the Q-value, we use the Bellman equation, which is as follows:
Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)]
Where:
• Q(s, a): Current Q-value for state s and action a
• α (Learning rate): Impact of new information
• r (Reward): Immediate reward received
• γ (Discount factor): Future reward significance
• max Q(s', a'): Best Q-value in next state
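Translated almost line for line into code, the update could look like the sketch below; the learning rate, discount factor, and Q-table shape are illustrative assumptions.

import numpy as np

alpha, gamma = 0.1, 0.9                 # learning rate and discount factor (assumed values)
Q = np.zeros((6, 2))                    # assumed Q-table: 6 states x 2 actions

def q_update(s, a, r, s_next):
    # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max Q(s', a') - Q(s, a)]
    best_next = np.max(Q[s_next])       # max Q(s', a'): best Q-value in the next state
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])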
7. APPLICATIONS OF Q-LEARNING
Game Strategy Learning
Q-Learning is widely used in training agents to play games such as Tic-Tac-Toe, Chess, and Gridworld.
The agent learns optimal strategies by interacting with the environment and receiving rewards,
allowing it to improve performance over time without needing a model of the game.
Robot Navigation
Robots use Q-Learning to learn how to navigate mazes or physical spaces by trial and error. The agent
learns the best sequence of actions (e.g., move forward, turn) to reach a target location while avoiding
obstacles, even when the environment is partially unknown.
Control Problems (e.g., Cart-Pole Balancing)
In classic control tasks like the Cart-Pole problem, Q-Learning teaches the agent to take actions (like
moving the cart left or right) that keep the pole balanced. These problems are fundamental
benchmarks in reinforcement learning and show how Q-Learning can handle continuous state
variables, typically by discretizing them into bins (see the sketch below).
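Because tabular Q-Learning needs discrete states, continuous observations such as the cart's position and velocity are usually binned before indexing the Q-table. The bin counts and ranges below are assumptions chosen only to sketch the idea.

import numpy as np

# Assumed bins for a Cart-Pole-like observation: [position, velocity, angle, angular velocity]
bins = [np.linspace(-2.4, 2.4, 9),
        np.linspace(-3.0, 3.0, 9),
        np.linspace(-0.21, 0.21, 9),
        np.linspace(-3.0, 3.0, 9)]

def discretize(observation):
    # Map each continuous variable to a bin index; the resulting tuple of
    # integers is a discrete state that can be used as a key in a Q-table dictionary.
    return tuple(int(np.digitize(x, b)) for x, b in zip(observation, bins))

state = discretize([0.1, -0.5, 0.02, 1.0])   # a discrete state tuple, ready to index a Q-table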
8. ADVANTAGES AND DISADVANTAGES
Advantages:
• Does not require a model of the environment (model-free).
• Can handle stochastic transitions and rewards.
• Converges to the optimal policy given sufficient exploration and learning time.
• Simple to implement and understand.
Disadvantages:
• Not scalable for environments with large state/action spaces (Q-table becomes too big).
• Requires lots of interactions with the environment.
• Exploration vs exploitation balance can be tricky.
• May converge slowly in complex environments.