MACHINE LEARNING (INTEGRATED)
(21ISE62)
Module 5
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098
Department of Information Science & Engineering
Course Outcomes
After completion of the course, the student will be able to:
• Illustrate regression techniques and the decision tree learning algorithm.
• Apply SVM, ANN and KNN algorithms to solve appropriate problems.
• Apply Bayesian techniques and derive effective learning rules.
• Illustrate the performance of AI and ML algorithms using evaluation techniques.
• Understand reinforcement learning and its application to real-world problems.
Text Books:
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, India Edition, 2013.
2. Ethem Alpaydın, Introduction to Machine Learning, MIT Press, Second Edition.
3. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining,
Pearson, First Impression, 2014.
Module 5: Reinforcement Learning
• Reinforcement learning (RL) is a Machine Learning (ML) technique that trains software to make decisions that achieve the most optimal results.
• It mimics the trial-and-error learning process that humans use to achieve their goals.
• It is a feedback-based ML approach in which an agent learns which action to perform by looking at the environment and the result of its actions.
• For each correct action, the agent gets positive feedback, and for each incorrect action, the agent gets negative feedback or a penalty.
Fig 5.1: Reinforcement Learning (the agent takes an action in the environment and receives the resulting state and reward).
Learning to Optimize Rewards
• The agent interacts with the environment and identifies the possible actions it can perform.
• The primary goal of an agent in reinforcement learning is to perform actions, based on its observation of the environment, that earn the maximum positive reward.
• In reinforcement learning, the agent learns automatically from feedback, without any labelled data, unlike supervised learning.
• Since there is no labelled data, the agent is bound to learn from its experience alone.
• There are two types of reinforcement learning: positive and negative.
• In positive reinforcement learning, a behavior recurs because of positive rewards.
• Rewards increase the strength and frequency of a specific behavior.
• This encourages the agent to execute similar actions that yield maximum rewards.
• Similarly, in negative reinforcement learning, negative rewards are used as a deterrent to weaken a behavior and to avoid it.
• Such rewards decrease the strength and the frequency of a specific behavior.
Cont…
• The agent can take any path to reach the final point, but it should do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward.
• The action performed by the agent is referred to as "a".
• The state reached by performing the action is "s".
• The reward/feedback obtained for each good or bad action is "R".
• The discount factor is gamma, "γ".
V(s) = max_a [R(s,a) + γ V(s′)]
Cont…
• V(s) = the value calculated at a particular state.
• R(s,a) = the reward obtained at state s by performing action a.
• γ = the discount factor.
• V(s′) = the value of the next state s′.
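To make the update concrete, here is a minimal Python sketch; the three-state chain, the rewards, and γ = 0.9 are illustrative assumptions, not taken from the slides. It simply applies V(s) = max_a [R(s,a) + γ V(s′)] repeatedly until the values settle.

```python
# Minimal sketch (illustrative example): repeatedly applying
# V(s) = max_a [ R(s,a) + gamma * V(s') ] on a tiny deterministic chain.
gamma = 0.9  # discount factor (assumed value)

# Toy 3-state chain s0 -> s1 -> s2, where s2 is terminal; names are hypothetical.
next_state = {"s0": {"right": "s1"}, "s1": {"right": "s2"}, "s2": {}}
reward     = {"s0": {"right": 0.0},  "s1": {"right": 1.0},  "s2": {}}

V = {s: 0.0 for s in next_state}                 # start with all state values at zero
for _ in range(10):                              # sweep until the values stop changing
    for s, actions in next_state.items():
        if actions:                              # the terminal state keeps V = 0
            V[s] = max(reward[s][a] + gamma * V[next_state[s][a]] for a in actions)

print(V)  # {'s0': 0.9, 's1': 1.0, 's2': 0.0} - value propagates back, discounted by gamma
```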
How to represent the agent state?
• We can represent the agent state using the Markov state, which contains all the required information from the history. A state St is a Markov state if it satisfies the condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
• A Markov Decision Process (MDP) is a tuple of four elements (S, A, Pa, Ra):
• A set of finite states S.
• A set of finite actions A.
• The reward Ra received after transitioning from state s to state s′ due to action a.
• The transition probability Pa of moving from s to s′ under action a.
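For illustration, here is a hedged sketch of how the (S, A, Pa, Ra) tuple could be represented in Python; the class name, states, and numbers are assumptions made up for the example, not a library API.

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    """Illustrative container for the tuple (S, A, Pa, Ra); names are assumed."""
    states: set                              # S  : finite set of states
    actions: set                             # A  : finite set of actions
    P: dict = field(default_factory=dict)    # Pa : (s, a, s') -> transition probability
    R: dict = field(default_factory=dict)    # Ra : (s, a, s') -> reward for that transition

# Hypothetical two-state example.
mdp = MDP(
    states={"s1", "s2"},
    actions={"stay", "move"},
    P={("s1", "move", "s2"): 1.0, ("s1", "stay", "s1"): 1.0},
    R={("s1", "move", "s2"): 10.0, ("s1", "stay", "s1"): 0.0},
)
print(mdp.P[("s1", "move", "s2")], mdp.R[("s1", "move", "s2")])  # 1.0 10.0
```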
Reinforcement Learning Algorithms
• Reinforcement learning algorithms are mainly used in AI applications and gaming applications. The most widely used algorithm is Q-learning.
• Q-Learning:
• Q-learning is an off-policy RL algorithm based on temporal-difference learning. Temporal-difference learning methods are a way of comparing temporally successive predictions.
• It learns the action-value function Q(s, a), which measures how good it is to take action "a" in a particular state "s".
• The flowchart below explains the working of Q-learning.
Credit Assignment Problem
• The Credit Assignment Problem (CAP) is a fundamental challenge in
reinforcement learning.
• It arises when an agent receives a reward for a particular action, but the agent
must determine which of its previous actions led to the reward.
• In reinforcement learning, an agent applies a set of actions in an environment
to maximize the overall reward.
• The agent updates its policy based on feedback received from the
environment.
• The CAP refers to the problem of measuring the influence and impact of an action taken by an agent on future rewards.
• The core aim is to guide the agent to take the corrective actions that maximize the reward.
• The difficulty of assigning credit can make it hard for the agent to build an effective policy.
• Additionally, there are situations where the agent takes a sequence of actions and the reward signal is only received at the end of the sequence.
• In these cases, the agent must determine which of its previous actions positively contributed to the final reward.
Cont…
• Example: As the agent explores the maze, it receives a reward of +10 for
reaching the goal state. Additionally, if it hits a stone, we penalize the action by
providing a -10 reward.
Path 1: 1-5-9
Path 2: 1-4-8
Path 3: 1-2-3-6-9
Path 4: 1-2-5-9
and so on.
Temporal Difference Learning
• Temporal Difference Learning in reinforcement learning works as
an unsupervised learning method.
• It helps predict the total expected future reward.
• At its core, Temporal Difference Learning (TD Learning) aims to
predict a variable's future value in a state sequence. TD Learning
made a big leap in solving reward prediction problems.
Figure 3: The temporal difference reinforcement learning algorithm.
Cont…
• More formally, according to the TD algorithm the prediction error δ(t) is defined as the immediate reward R(t) plus the discounted predicted future value V(t+1) minus the current value prediction V(t) (Eqn (1)):
δ(t) = R(t) + γ · V(t+1) − V(t) -----(1)
• The prediction error δ(t) is used to update the old value prediction (Eqn (2)):
V(t)_new = V(t)_old + α · δ(t) -----(2)
• Alpha (α): the learning rate.
It shows how much our estimates should be adjusted based on the error. This rate varies between 0 and 1.
• Gamma (γ): the discount rate.
This indicates how much future rewards are valued.
• The (exponential) discount factor 0 < γ < 1 in Eqn (1) accounts for the fact that humans (and other animals) tend to discount the value of future reward.
• The learning rate 0 < α < 1 in Eqn (2) determines how much a specific event affects future value predictions. A learning rate close to 1 would suggest that the most recent outcome has a strong effect on the value prediction.
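The two equations can be turned into a few lines of Python. The sketch below is purely illustrative: the three-step reward sequence, α = 0.1, γ = 0.9, and the number of episodes are all assumed values.

```python
# Sketch of TD(0) value prediction using Eqn (1) and Eqn (2); all numbers are assumed.
alpha, gamma = 0.1, 0.9            # learning rate and discount rate

V = [0.0, 0.0, 0.0]                # value predictions V(t) for a fixed 3-step sequence
R = [0.0, 0.0, 1.0]                # immediate rewards R(t); reward only at the final step

for episode in range(200):         # replay the same sequence many times
    for t in range(len(V) - 1):
        delta = R[t] + gamma * V[t + 1] - V[t]   # Eqn (1): prediction error
        V[t] = V[t] + alpha * delta              # Eqn (2): update the old prediction
    delta = R[-1] - V[-1]                        # final step has no successor state
    V[-1] = V[-1] + alpha * delta

print([round(v, 2) for v in V])    # roughly [0.81, 0.9, 1.0]: earlier steps learn the discounted reward
```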
Q-learning
• Q-learning is a reinforcement learning algorithm that finds an
optimal action-selection policy for any finite Markov Decision
Process (MDP).
• It helps an agent learn to maximize the total reward over time
through repeated interactions with the environment, even when
the model of that environment is not known.
How Does Q-Learning Work?
1. Learning and Updating Q-values:
• The algorithm maintains a table of Q-values for each state-action
pair.
• These Q-values represent the expected utility of taking a given
action in a given state and following the optimal policy after that.
• The Q-values are initialized arbitrarily and are updated iteratively
using the experiences gathered by the agent.
Q-learning
2. Q-value Update Rule:
The Q-values are updated using the formula:
Q(s,a) ← Q(s,a) + α [r + γ max_a′ Q(s′,a′) − Q(s,a)]
For a deterministic environment this simplifies to the form used in the worked problems below:
Q(s,a) = r(s,a) + γ max_a′ Q(δ(s,a), a′)
Where:
s is the current state,
a is the action taken,
r is the reward received after taking action a in state s,
s′ (= δ(s,a)) is the new state after the action,
a′ is any possible action from the new state s′,
α is the learning rate (0 < α ≤ 1),
γ is the discount factor (0 ≤ γ < 1).
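A one-step numeric sketch of this update rule in Python; the table size, α = 0.5, γ = 0.9, and the sample transition are assumptions chosen only to show the arithmetic.

```python
import numpy as np

n_states, n_actions = 6, 6
Q = np.zeros((n_states, n_actions))     # Q-table, initialized here to zeros

alpha, gamma = 0.5, 0.9                 # learning rate and discount factor (assumed)
s, a, r, s_next = 1, 2, 100, 2          # one observed transition (s, a, r, s')

# Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
print(Q[s, a])                          # 50.0 after this single update
```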
Cont…
3. Policy Derivation: The policy determines what action to take in
each state and can be derived from the Q-values.
Typically, the policy chooses the action with the highest Q-value in
each state.
4. Exploration vs. Exploitation: Q-learning manages the trade-off
between exploration (choosing random actions to discover new
strategies) and exploitation (choosing actions based on
accumulated knowledge).
5. Convergence: Under certain conditions, such as ensuring all state-action pairs are visited infinitely often, Q-learning converges to the optimal Q-values, and hence to a policy that yields the maximum expected reward from any state.
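Point 4 is commonly implemented with an ε-greedy rule: with probability ε take a random action, otherwise take the action with the highest Q-value. A brief sketch follows; ε = 0.1 and the table contents are assumed.

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])      # explore: pick a random action
    return int(np.argmax(Q[state]))              # exploit: pick the highest-Q action

Q = np.zeros((6, 6))
Q[0, 3] = 10.0                                   # pretend action 3 currently looks best in state 0
print(epsilon_greedy(Q, state=0))                # usually 3, occasionally a random action
```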
Q Learning Algorithm
For each s, a initialize the table entry Q(s,a) to zero.
Observe the current state s.
Do forever:
• Select an action a and execute it.
• Receive the immediate reward r.
• Observe the new state s′.
• Update the table entry for Q(s,a) as follows:
Q(s,a) ← r + γ max_a′ Q(s′,a′)
• s ← s′
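The loop above can be sketched in Python for a deterministic "room" environment of the kind used in the problems below, where the reward matrix R encodes which moves are allowed. The episode structure, the random exploration, and the goal detection are assumptions added to make the example runnable; they are not prescribed by the slides.

```python
import numpy as np

def q_learning(R, gamma=0.9, episodes=1000, seed=0):
    """Tabular Q-learning sketch for a deterministic room environment.
    R[s, a] is the instant reward for moving from state s to state a;
    -1 marks moves that are not allowed (no door)."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    Q = np.zeros_like(R, dtype=float)              # initialize every Q(s, a) to zero
    goal = int(np.argwhere(R == R.max())[0][1])    # the state whose incoming moves reward most

    for _ in range(episodes):
        s = int(rng.integers(n))                   # start each episode in a random state
        while s != goal:
            valid = np.where(R[s] >= 0)[0]         # actions allowed from state s
            a = int(rng.choice(valid))             # explore: pick any valid action
            s_next = a                             # here the action index is the next state
            # Q(s,a) <- r + gamma * max_a' Q(s', a')
            Q[s, a] = R[s, a] + gamma * Q[s_next].max()
            s = s_next
    return Q
```

With enough episodes the random exploration visits every allowed state-action pair, so the table settles at the fixed point of the update rule.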
Problem
Problem 1: Suppose there are 6 rooms, as given in the table below. Room 3 is the goal: actions that lead directly to it have an instant reward of 100, while actions between rooms not directly connected to the target room have zero reward. Each row of the reward matrix contains the instant reward values for one state. Construct the reward matrix and the Q-learning matrix, and calculate the Q-value for each state-action pair. The discount factor γ is 0.9.
Solution:
Reward matrix R (rows = state, columns = action):
        1     2     3     4     5     6
  1    -1     0    -1     0    -1    -1
  2     0    -1   100    -1     0    -1
  3    -1    -1     0    -1    -1    -1
  4     0    -1    -1    -1     0    -1
  5    -1     0    -1     0    -1     0
  6    -1    -1   100    -1     0    -1

Q-learning matrix Q (all entries initialized to zero):
        1     2     3     4     5     6
  1     0     0     0     0     0     0
  2     0     0     0     0     0     0
  3     0     0     0     0     0     0
  4     0     0     0     0     0     0
  5     0     0     0     0     0     0
  6     0     0     0     0     0     0
Cont…
Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
Q(s,a) = r(s,a) + γ max_a′ Q(δ(s,a), a′)
At state 3: Q(2,3) = R(2,3) + 0.9 * max [Q(3,3)]
= 100 + 0.9 * max [0]
= 100 + 0 = 100
Q(6,3) = R(6,3) + 0.9 * max [Q(3,3)]
= 100 + 0.9 * max [0]
= 100 + 0 = 100
At state 2: Q(1,2) = R(1,2) + 0.9 * max [Q(2,3), Q(2,5)]
= 0 + 0.9 * max [100,0]
= 0.9*100 = 90
At state 1: Q(2,1) = R(2,1) + 0.9 * max [Q(1,2), Q(1,4)]
= 0 + 0.9 * max [90,0]
= 0.9*90 = 81
Q(4,1) = R(4,1) + 0.9 * max [Q(1,2), Q(1,4)]
= 0 + 0.9 * max [90,0]
= 0.9*90=81
State transition diagram
Cont…
At state 4: Q(1,4) = R(1,4) + 0.9 * max [Q(4,1), Q(4,5)]
= 0 + 0.9 * max [81,0]
= 0.9*81 = 72
At state 2: Q(5,2) = R(5,2) + 0.9 * max [Q(2,1), Q(2,3), Q(2,5)]
= 0 + 0.9 * max [81,100,0]
= 0.9*100 = 90
At state 4: Q(5,4) = R(5,4) + 0.9 * max [Q(4,1), Q(4,5)]
= 0 + 0.9 * max [81,0]
= 0.9*81 = 72
At state 5: Q(2,5) = R(2,5) + 0.9 * max [Q(5,2), Q(5,4), Q(5,6)]
= 0 + 0.9 * max [90, 72, 0]
= 0.9*90 = 81
Q(4,5) = R(4,5) + 0.9 * max [Q(5,2), Q(5,4), Q(5,6)]
= 0 + 0.9 * max [90, 72, 0]
= 0.9*90 = 81
Cont…
At state 5: Q(6,5) = R(6,5) + 0.9 * max [Q(5,2), Q(5,4), Q(5,6)]
= 0 + 0.9 * max [90, 72, 0]
= 0.9*90 = 81
At state 6: Q(5,6) = R(5,6) + 0.9 * max [Q(6,3), Q(6,5)]
= 0 + 0.9 * max [100,81]
= 0.9*100 = 90
Cont…
Updated Q-matrix of Q(s,a) values (from which V*(s) and one optimal policy can be read off):
        1     2     3     4     5     6
  1     0    90     0    72     0     0
  2    81     0   100     0    81     0
  3     0     0     0     0     0     0
  4    81     0     0     0    81     0
  5     0    90     0    72     0    90
  6     0     0   100     0    81     0
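Running the q_learning sketch given after the algorithm slide on Problem 1's reward matrix (rooms renumbered 0-5, γ = 0.9) reproduces these hand-computed values almost exactly; a usage sketch:

```python
import numpy as np

# Reward matrix of Problem 1 with rooms renumbered 0-5 (room 3 above becomes index 2).
R1 = np.array([
    [-1,   0,  -1,   0,  -1,  -1],
    [ 0,  -1, 100,  -1,   0,  -1],
    [-1,  -1,   0,  -1,  -1,  -1],
    [ 0,  -1,  -1,  -1,   0,  -1],
    [-1,   0,  -1,   0,  -1,   0],
    [-1,  -1, 100,  -1,   0,  -1],
], dtype=float)

Q1 = q_learning(R1, gamma=0.9)   # q_learning: the sketch defined earlier
print(Q1)
# Matches the matrix above (90, 81, 100, ...), except that entries such as Q(1,4)
# settle at 72.9 rather than 72, because the loop iterates to the fixed point
# while the hand calculation stops after a single pass.
```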
Cont…
Problem 2: Suppose we have 5 rooms in a building connected by doors, as shown in the figure below. We number the rooms 0 through 4. The outside of the building can be thought of as one big room (5). Note that doors 1 and 4 lead into the building from room 5 (outside). The goal room is number 5. The doors that lead immediately to the goal have an instant reward of 100; other doors not directly connected to the target room have 0 reward. Construct the reward matrix and the Q-learning matrix, and calculate the Q-value for each state-action pair. The discount factor γ is 0.8.
Cont…
Solution:
Reward matrix R (rows = state, columns = action):
        0     1     2     3     4     5
  0    -1    -1    -1    -1     0    -1
  1    -1    -1    -1     0    -1   100
  2    -1    -1    -1     0    -1    -1
  3    -1     0     0    -1     0    -1
  4     0    -1    -1     0    -1   100
  5    -1     0    -1    -1     0   100

Q-learning matrix Q (all entries initialized to zero):
        0     1     2     3     4     5
  0     0     0     0     0     0     0
  1     0     0     0     0     0     0
  2     0     0     0     0     0     0
  3     0     0     0     0     0     0
  4     0     0     0     0     0     0
  5     0     0     0     0     0     0
Cont…
• Now let's imagine what would happen if our agent were in state 5 (the next state).
• Look at the 6th row of the reward matrix R (i.e. state 5).
• It has 3 possible actions: go to state 1, 4, or 5.
Q(s,a) = r(s,a) + γ max_a′ Q(δ(s,a), a′)
Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
At state 5:
Q(1,5) = R(1,5) + 0.8 * max [Q(5,1), Q(5,4)]
= 100 + 0.8 * max [0, 0]
= 100 + 0.8*0 = 100
Q(4,5) = R(4,5) + 0.8 * max [Q(5,1), Q(5,4)]
= 100 + 0.8 * max [0, 0] = 100
At state 1:
Now we imagine that we are in state 1 (the next state).
It has 2 possible actions: go to state 3 or state 5.
Then we compute the Q-value:
Q(3,1) = R(3,1) + 0.8 * max [Q(1,3), Q(1,5)]
= 0 + 0.8 * max [0, 100]
= 0.8*100 = 80
Cont…
Q(5,1) = R(5,1) + 0.8 * max [Q(1,3), Q(1,5)]
= 0 + 0.8 * max [64, 100]
= 0.8*100 = 80
At state 3:
Q(1,3) = R(1,3) + 0.8 * max [Q(3,1), Q(3,2), Q(3,4)]
= 0 + 0.8 * max [80, 0, 0]
= 0+ 0.8*80 = 64
Q(4,3) = R(4,3) + 0.8 * max [Q(3,4), Q(3,2), Q(3,1)]
= 0 + 0.8 * max [0, 0, 80]
= 0+ 0.8*80 = 64
Q(2,3) = R(2,3) + 0.8 * max [Q(3,2), Q(3,1), Q(3,4)]
= 0 + 0.8 * max [0, 80, 0]
= 0+ 0.8*80 = 64
At state 4:
Q(5, 4) = R(5, 4) + 0.8 * max [Q(4,5), Q(4,3), Q(4, 0)]
= 0 + 0.8 * max [100, 64, 0]
= 80
Cont…
Q(3, 4) = R(3, 4) + 0.8 * max [Q(4,3), Q(4,5), Q(4, 0)]
= 0 + 0.8 * max [64,100, 0]= 80
Q(0, 4) = R(0, 4) + 0.8 * max [Q(4,0), Q(4,3), Q(4,5)]
= 0 + 0.8 * max [0, 0, 100] = 80
At state 2:
Q(3,2) = R(3,2) + 0.8 * max [Q(2,3)]
= 0 + 0.8 * 64 = 51.2 ≈ 51
At state 0:
Q(4, 0) = R(4,0) + 0.8*max[Q(0,4)]
= 0 + 0.8 * 80 = 64
Final updated Q-learning matrix:
        0     1     2     3     4     5
  0     0     0     0     0    80     0
  1     0     0     0    64     0   100
  2     0     0     0    64     0     0
  3     0    80    51     0    80     0
  4    64     0     0    64     0   100
  5     0    80     0     0    80   100
Updated state diagram
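Once the Q-matrix has converged, the optimal policy simply follows the largest Q-value out of each room. A short sketch traces the greedy path from room 2 to the goal, using the matrix above; the starting room is an arbitrary choice for the example.

```python
import numpy as np

# Final Q-matrix of Problem 2 (rows/columns are rooms 0-5, values copied from the table above).
Q2 = np.array([
    [ 0,  0,  0,  0, 80,   0],
    [ 0,  0,  0, 64,  0, 100],
    [ 0,  0,  0, 64,  0,   0],
    [ 0, 80, 51,  0, 80,   0],
    [64,  0,  0, 64,  0, 100],
    [ 0, 80,  0,  0, 80, 100],
])

def greedy_path(Q, start, goal):
    """Follow the highest-valued action from each room until the goal is reached."""
    path = [start]
    while path[-1] != goal:
        path.append(int(np.argmax(Q[path[-1]])))
    return path

print(greedy_path(Q2, start=2, goal=5))   # [2, 3, 1, 5]: room 2 -> 3 -> 1 -> outside (5)
```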
Cont…
Case Study:
Artificial Intelligence Powering Google Products,
Recent AI Tools leveraged by Tesla, AI for
Facebook,
Robo-Banking: Artificial Intelligence at JPMorgan
Chase, Audio AI,
A Machine Learning Approach — Building a Hotel
Recommendation Engine