Q-LEARNING
CO-3
AIM
To familiarize students with the concepts of reinforcement learning, the Q-function, and the Q-learning
algorithm
INSTRUCTIONAL OBJECTIVES
This session is designed to:
1. Introduce the Q-function and its relation to the optimal policy and value function
2. Explain the Q-learning update rule and temporal difference learning
LEARNING OUTCOMES
At the end of this session, you should be able to:
1. Define the Q-function and derive the optimal policy from it
2. Apply the Q-learning update rule to a worked example
3. Explain the exploration/exploitation trade-off
4. Describe extensions such as function approximation and applications of reinforcement learning
Q-Function
• One approach to RL is then to try to estimate V*(s).
• However, this approach requires you to know r(s,a) and delta(s,a).
• This is unrealistic in many real problems: what is the reward if a robot is
exploring Mars and decides to take a right turn?
• Fortunately, we can circumvent this problem by exploring and experiencing
how the world reacts to our actions. We need to learn r & delta.
• We want a function that directly learns good state-action pairs, i.e., what action
should I take in this state. We call this Q(s,a).
• Given Q(s,a), it is now trivial to execute the optimal policy without knowing r(s,a)
and delta(s,a). We have:
π*(s) = argmax_a Q(s, a)
V*(s) = max_a Q(s, a)
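As a minimal sketch of how trivial this lookup is (the state, actions, and Q-values below are hypothetical), a Python dictionary serving as the Q-table:

```python
# Minimal sketch: reading the greedy policy and value function off a learned
# Q-table. The state, actions, and Q-values are made up for illustration.

Q = {
    ("s1", "left"): 66.0,
    ("s1", "right"): 81.0,
    ("s1", "up"): 100.0,
}
actions = ["left", "right", "up"]

def policy(s):
    """pi*(s) = argmax_a Q(s, a): pick the action with the highest Q-value."""
    return max(actions, key=lambda a: Q[(s, a)])

def value(s):
    """V*(s) = max_a Q(s, a): the value of acting greedily from s."""
    return max(Q[(s, a)] for a in actions)

print(policy("s1"))  # 'up'
print(value("s1"))   # 100.0
```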
Example II
π*(s) = argmax_a Q(s, a)
V*(s) = max_a Q(s, a)
Check that these relations hold in the example.
Q-Learning
• This still depends on r(s , a) and delta(s , a).
• However, imagine the robot is exploring its environment, trying new actions as it goes.
• At every step it receives some reward “r”, and it observes the environment change into a
new state s’ for action a.
• How can we use these observations (s, a, s', r) to learn a model?
s' = s_{t+1}
Q-Learning
• The Q-learning update, Q̂(s, a) ← r + γ max_a' Q̂(s', a'), continually estimates Q at state s
consistent with an estimate of Q at state s', one step in the future: temporal difference (TD) learning.
• Note that s’ is closer to goal, and hence more “reliable”, but still an estimate itself.
• Updating estimates based on other estimates is called bootstrapping.
• We do an update after each state-action pair. I.e., we are learning online!
• We are learning useful things about explored state-action pairs. These are typically most useful
because they are likely to be encountered again.
• Under suitable conditions, these updates can actually be proved to converge to the real answer.
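A minimal sketch of this online loop in Python, under the deterministic setting assumed in these slides; `env_step`, the action set, and the start state are hypothetical stand-ins for a real environment:

```python
import random
from collections import defaultdict

# Tabular Q-learning for a deterministic world, updating after every
# state-action pair (online learning). `env_step` is a hypothetical stand-in
# for the real environment: it returns (next_state, reward).

gamma = 0.9
actions = ["left", "right", "up", "down"]
Q = defaultdict(float)  # Q-hat: defaults to 0 for unseen (s, a) pairs

def env_step(s, a):
    # Placeholder dynamics; in a real problem the world supplies s' and r.
    return s, 0.0

def q_update(s, a):
    """One TD update: Q(s, a) <- r + gamma * max_a' Q(s', a')."""
    s_next, r = env_step(s, a)
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return s_next

s = "start"
for _ in range(100):
    a = random.choice(actions)  # crude exploration; see the later slide
    s = q_update(s, a)
```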
Example Q-Learning
Q̂(s₁, a_right) ← r + γ max_a' Q̂(s₂, a')
   = 0 + 0.9 × max{66, 81, 100}
   = 90
Q-learning propagates Q-estimates 1-step backwards
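The backup, checked in a line of Python (r = 0, γ = 0.9, successor Q-estimates 66, 81, 100, as in the slide):

```python
# One-step Q-learning backup from the example above.
print(0 + 0.9 * max(66, 81, 100))  # 90.0
```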
Exploration / Exploitation
• It is very important that the agent does not simply follow the current policy when learning Q
(off-policy learning). The reason is that you may get stuck in a suboptimal solution, i.e., there
may be other solutions out there that you have never seen.
• Hence it is good to try new things every now and then, e.g., pick actions with probability
P(a | s) ∝ e^{Q̂(s, a)/T}
If T is large there is lots of exploring; if T is small, follow the current policy. One can
decrease T over time.
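A sketch of this Boltzmann (softmax) action selection; the Q-values are the illustrative ones from the earlier example, and the temperature values are arbitrary:

```python
import math
import random

# Boltzmann (softmax) exploration: P(a|s) is proportional to exp(Q(s,a)/T),
# where T is the temperature.

def boltzmann_action(q_values, T):
    """Sample an action with probability proportional to exp(Q/T)."""
    acts = list(q_values)
    m = max(q_values.values())  # subtract the max for numerical stability
    weights = [math.exp((q_values[a] - m) / T) for a in acts]
    return random.choices(acts, weights=weights, k=1)[0]

q = {"left": 66.0, "right": 81.0, "up": 100.0}
print(boltzmann_action(q, T=100.0))  # large T: near-uniform (explore)
print(boltzmann_action(q, T=1.0))    # small T: almost surely 'up' (exploit)
```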
Improvements
• One can trade off memory and computation by caching (s, a, s', r) for observed transitions.
After a while, as Q(s', a') has changed, you can "replay" the update (see the sketch below).
• One can actively search for state-action pairs for which Q(s, a) is expected to change a lot
(prioritized sweeping).
• One can do updates along the sampled path much further back than just one step (TD(λ) learning).
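A sketch of the caching-and-replay idea under the same deterministic update; the transitions and the three-state chain are made up for illustration:

```python
import random
from collections import defaultdict

# Cache observed transitions (s, a, s', r) and re-apply the Q-update on them
# later, once Q(s', a') has had a chance to improve.

gamma = 0.9
actions = ["left", "right"]
Q = defaultdict(float)
cache = []  # observed (s, a, s_next, r) transitions

def q_update(s, a, s_next, r):
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

def observe(s, a, s_next, r):
    """Apply the update once and cache the transition for replay."""
    cache.append((s, a, s_next, r))
    q_update(s, a, s_next, r)

def replay(n):
    """Re-run the update on n cached transitions with the current Q."""
    for s, a, s_next, r in random.sample(cache, min(n, len(cache))):
        q_update(s, a, s_next, r)

# A made-up 3-state chain with a single reward of 100 at the end.
observe("s1", "right", "s2", 0.0)    # Q(s1, right) stays 0 for now
observe("s2", "right", "s3", 100.0)  # Q(s2, right) becomes 100
replay(2)                            # replaying propagates the reward back
print(Q[("s1", "right")])            # 90.0
```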
Extensions
• To deal with stochastic environments, we need to maximize the expected future discounted reward:
Q(s, a) = E[ r(s, a) + γ max_a' Q(s', a') ]
In this case the update averages toward the sampled target rather than overwriting the estimate
(see the sketch after this slide).
• Often the state space is too large to deal with all states. In this case we need to learn a function:
Q(s, a) ≈ f_θ(s, a)
• Neural networks trained with back-propagation have been quite successful.
• For instance, TD-Gammon is a backgammon program that plays at expert level: its state space is
very large, it is trained by playing against itself, it uses a neural network to approximate the
value function, and it uses TD(λ) for learning.
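A sketch of that averaged update, using the standard learning-rate form of Q-learning; the value of α and the dictionary-based Q-table are illustrative choices, not prescribed by the slides:

```python
# Stochastic environments: move the estimate a fraction alpha toward the
# sampled target instead of overwriting it, so Q(s, a) averages over random
# rewards and transitions.

gamma, alpha = 0.9, 0.1

def q_update_stochastic(Q, s, a, r, s_next, actions):
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target

Q = {}
q_update_stochastic(Q, "s1", "right", 10.0, "s2", ["left", "right"])
print(Q[("s1", "right")])  # 1.0: one-tenth of the way toward the target 10
```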
More on Function Approximation
• For instance, a linear function:
Q_θ(s, a) = Σ_k θ_k φ_k(s, a)
• The features φ_k are fixed measurements of the state (e.g., the number of stones on the board).
• We only learn the parameters θ.
• Update rule (start in state s, take action a, observe reward r, and end up in state s'):
θ_k ← θ_k + α [ r + γ max_a' Q_θ(s', a') − Q_θ(s, a) ] φ_k(s, a)
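A sketch of this linear approximator and its update in NumPy, following the standard semi-gradient Q-learning rule; the feature function, step size, and transition below are made-up illustrations:

```python
import numpy as np

# Linear Q-function approximation: Q_theta(s, a) = theta . phi(s, a), with
# the semi-gradient Q-learning update for theta. The feature function phi is
# a made-up example of "fixed measurements of the state".

gamma, alpha = 0.9, 0.01
theta = np.zeros(2)  # the only learned quantities

def phi(s, a):
    """Fixed feature vector for (s, a)."""
    return np.array([float(s), 1.0 if a == "right" else 0.0])

def q(s, a):
    return theta @ phi(s, a)

def update(s, a, r, s_next, actions):
    """theta += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * phi(s,a)."""
    global theta
    td_error = r + gamma * max(q(s_next, a2) for a2 in actions) - q(s, a)
    theta = theta + alpha * td_error * phi(s, a)

update(s=1, a="right", r=100.0, s_next=2, actions=["left", "right"])
print(theta)  # [1. 1.]: nudged toward explaining the observed reward
```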
Conclusion
• Reinforcement learning addresses a very broad and relevant question:
• How can we learn to survive in our environment?
• We have looked at Q-learning, which simply learns from experience.
• No model of the world is needed.
• We made simplifying assumptions: e.g., state of the world only depends on
last state and action. This is the Markov assumption. The model is called a
Markov Decision Process (MDP).
• We assumed deterministic dynamics and a deterministic reward function, but the world really
is stochastic.
• There are many extensions to speed up learning.
• There have been many successful real-world applications.
Applications of
Reinforcement Learning
• Robotics for industrial automation.
• Business strategy planning
• Machine learning and data processing
• Training systems that provide custom instruction and materials according to
students' requirements
• Aircraft control and robot motion control
• Traffic Light Control
• A robot cleaning a room and recharging its battery
• Robot-soccer
• How to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
THANK YOU
TEAM ML