Department of Computer
Science and Engineering
IIT Kharagpur
Imitation Learning: Learning to
Act like Humans from Humans
SSN College of Engineering – Faculty Development Program Talk
10:45-12:15 IST, 25 Nov 2017
Anirban Santara
santara.github.io
About me
Anirban Santara
Google India Ph.D. Fellow at
IIT Kharagpur (2015-Present)
Graduate Research Intern at
Intel Labs for Autonomous
Driving (2017-Present)
B.Tech. in Electronics and
Electrical Communication
Engineering from IIT
Kharagpur in 2015
Contents
1. Building the motivation
2. Problem definition and Different Approaches to Solution
3. Issues of Safety and Reliability
Description of the Imitation
Learning Problem
Imitation Learning
Imitation Learning techniques aim to mimic human behavior at a given task [1].
[1] Hussein, Ahmed, et al. "Imitation Learning: A Survey of Learning Methods." ACM Computing Surveys (CSUR) 50.2 (2017): 21.
Image source: GRASP Lab, University of Pennsylvania
Why should you care?
• Imitation learning methods are rooted in neuroscience and form an important part of learning in humans
• Makes it possible to teach robots complex tasks with minimal expert knowledge of the tasks
• No need for explicit programming or task-specific reward function design
• It's high time!
  • Modern sensors are able to collect and transmit high volumes of data at high speed
  • High-performance computing is cheaper, more capable and more ubiquitous than ever
  • Virtual Reality systems, which are considered the best portal for human-machine interaction, are widely available
Example Application Areas
Autonomous Driving
No more accidents due to human error. No more traffic jams.
Robotic Surgery
Complex Actions in Critical Situations – Accurate. Every time.
Industrial Automation
Efficiency. Precise Quality Control. Safety.
Assistive Robotics
Elderly Care. Rehabilitation. Special Needs.
Conversational Agents
Assistance. Recommendation. Therapy.
Approaches to Solution
A quick primer on Machine Learning
Reference application – Driving a Racing Car
State variables (X):
• Position in track
• Distance from track edges along different directions
• Direction of heading
• Current speed
Action Variables (Y):
• Steering
• Acceleration
• Brake
Comparison of ML paradigms
Supervised Learning
• Would require training examples in the form $\{(X_i, Y_i)\}_{i=1}^{N}$
• where $Y_i$ are the true/correct actions that must be taken in state $X_i$
Unsupervised Learning
• Works only with the input state information $X_i$
• Does not use any
kind of feedback
from the environment
regarding performance
of the agent
Reinforcement Learning
• Requires feedback from the
environment in the form of
reward signals
• Reward signals might be sparse and delayed
• But they should indicate the quality of the actions taken by the agent in different states
e.g. +1 if the car makes progress, -1 if it comes to a halt, -10 if it bumps into an obstacle, +100 if it finishes the race
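The example reward signal above can be written as a small function. This is only a sketch: the boolean event flags, the order in which they are checked, and the 0 returned when nothing happens are assumptions made for illustration and are not specified on the slide.

```python
def step_reward(made_progress: bool, halted: bool, crashed: bool, finished: bool) -> int:
    """Reward for a single time step, using the example values from the slide."""
    # Assumption: if several events occur at once, the first match below wins.
    if finished:
        return 100
    if crashed:
        return -10
    if halted:
        return -1
    if made_progress:
        return 1
    return 0  # assumption: the slide gives no value for any other situation


# A short hypothetical episode: two steps of progress, then a collision.
rewards = [
    step_reward(True, False, False, False),
    step_reward(True, False, False, False),
    step_reward(False, False, True, False),
]
print(rewards)  # [1, 1, -10]
```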
Problem Setting
Our Agent has to achieve its
goal by taking a sequence of
actions in an environment
whose states change in
response to the agent’s
actions.
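[Figure: the agent-environment loop. The agent sends an action to the environment, and the environment returns a new state to the agent.]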
Mathematical Formulation
Markov Decision Process (MDP)
Imitation Learning problems are often specified in terms of a Markov Decision Process (MDP). An MDP is defined as $\mathcal{M} = (S, A, T, r, \rho_0, \gamma)$
• State Space $S$: Set of all possible states/configurations of the environment
• Action Space $A$: Set of all possible actions
• Transition Probability $T$: maps each state-action pair in $S \times A$ to a probability distribution over next states in $S$; $T(s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$
• Reward function $r: S \times A \to \mathbb{R}$; we write $r(s_t, a_t) = r_t$
• Initial state distribution $\rho_0$; $\rho_0(s) = P(s_0 = s)$
• Temporal discount factor $\gamma \in [0, 1]$
“Markov” because it assumes:
$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t)$
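As a rough illustration of the six components of an MDP, the sketch below stores a finite MDP in plain dictionaries. The class name `TabularMDP`, its methods, and the two-state racing example with all of its numbers are hypothetical and chosen only to make the definition concrete.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """A finite MDP M = (S, A, T, r, rho0, gamma) stored in plain dictionaries."""
    states: list
    actions: list
    T: dict      # T[(s, a)] -> {s_next: P(s_next | s, a)}
    r: dict      # r[(s, a)] -> immediate reward
    rho0: dict   # rho0[s] -> P(s_0 = s)
    gamma: float

    def validate(self) -> None:
        """Check that rho0 and every T(s, a) are probability distributions."""
        assert abs(sum(self.rho0.values()) - 1.0) < 1e-9
        for s in self.states:
            for a in self.actions:
                assert abs(sum(self.T[(s, a)].values()) - 1.0) < 1e-9
        assert 0.0 <= self.gamma <= 1.0

    def reset(self, rng: random.Random):
        """Sample the initial state s_0 from rho0."""
        return rng.choices(list(self.rho0), weights=list(self.rho0.values()))[0]

    def step(self, s, a, rng: random.Random):
        """Sample s_{t+1} from T(s_t, a_t); the past beyond (s_t, a_t) is never used."""
        dist = self.T[(s, a)]
        s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
        return s_next, self.r[(s, a)]


# Hypothetical two-state racing example (all numbers are made up).
mdp = TabularMDP(
    states=["on_track", "off_track"],
    actions=["steer", "coast"],
    T={
        ("on_track", "steer"): {"on_track": 0.9, "off_track": 0.1},
        ("on_track", "coast"): {"on_track": 0.6, "off_track": 0.4},
        ("off_track", "steer"): {"on_track": 0.7, "off_track": 0.3},
        ("off_track", "coast"): {"on_track": 0.0, "off_track": 1.0},
    },
    r={
        ("on_track", "steer"): 1.0,
        ("on_track", "coast"): 1.0,
        ("off_track", "steer"): -1.0,
        ("off_track", "coast"): -1.0,
    },
    rho0={"on_track": 1.0, "off_track": 0.0},
    gamma=0.99,
)
mdp.validate()
```

The `step` method takes only the current state and action as input, which is exactly the Markov assumption stated above.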
Some more definitions
• Policy $\pi: S \to A$: A function that predicts an action for a given state
• Trajectory $\tau$: A sequence of $(s_t, a_t)$ tuples that describes an episode of experiences of an agent as it executes a policy.
$\tau = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t, \ldots, s_T)$
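To make the two definitions concrete, here is a minimal sketch of how a trajectory is produced by executing a policy. The helper `rollout`, the toy policy `drive_until_3`, and the deterministic dynamics `move` are hypothetical names introduced for illustration; a real environment would usually be stochastic.

```python
from typing import Callable, List


def rollout(policy: Callable, step: Callable, s0, horizon: int) -> List:
    """Execute `policy` for `horizon` steps and return tau = [s_0, a_0, ..., s_T]."""
    tau = [s0]
    s = s0
    for _ in range(horizon):
        a = policy(s)   # a_t = pi(s_t)
        s = step(s, a)  # s_{t+1}; deterministic here for simplicity
        tau.extend([a, s])
    return tau


# Hypothetical toy task: the state is the car's position, the action is its speed.
def drive_until_3(s):
    return 1 if s < 3 else 0


def move(s, a):
    return s + a


tau = rollout(drive_until_3, move, s0=0, horizon=4)
print(tau)  # [0, 1, 1, 1, 2, 1, 3, 0, 3]
```

The trajectory alternates states and actions and ends with the final state $s_T$, so a horizon of $T$ steps yields $2T + 1$ entries.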
Approaches to Imitation Learning
Broad Categories
Imitation Learning
• Learning from a dataset of expert demonstrations
  • Behavioral Cloning
  • Apprenticeship Learning
• Active learning with an expert
Learning from a Dataset of
Expert Demonstrations
Problem Definition
• Given: a dataset of trajectories demonstrated by an expert: $\{\tau_i\}_{i=1}^{N}$,
where each trajectory is a sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t, \ldots, s_T)$
• Goal: Find a policy $\pi^*$ that achieves “expert-like performance”
Behavioral Cloning
Supervised learning of a mapping from states to the expert’s actions in those states
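[Figure: the model takes the state $x = (x_1, x_2, \ldots, x_n)$ as input and predicts an action. The loss is a statistical divergence between the predicted action and the expert's action, and it is minimized w.r.t. the model parameters.]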
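The supervised-learning view of behavioral cloning can be sketched with a one-dimensional linear policy fitted by least squares. The squared error stands in for the "statistical divergence" here, which is an assumption; the function `fit_linear_policy`, the expert rule, and the data are hypothetical.

```python
def fit_linear_policy(states, actions):
    """Fit a = w * x + b to expert (state, action) pairs by ordinary least squares."""
    n = len(states)
    mean_x = sum(states) / n
    mean_a = sum(actions) / n
    var_x = sum((x - mean_x) ** 2 for x in states)
    cov_xa = sum((x - mean_x) * (a - mean_a) for x, a in zip(states, actions))
    w = cov_xa / var_x        # slope that minimizes the squared error
    b = mean_a - w * mean_x   # intercept
    return w, b


# Hypothetical demonstrations: the state is the lateral offset from the track
# center, and the expert steers against it with gain 0.5.
demo_states = [-2.0, -1.0, 0.0, 1.0, 2.0]
demo_actions = [-0.5 * x for x in demo_states]

w, b = fit_linear_policy(demo_states, demo_actions)


def cloned_policy(x):
    """The learned policy: predicts the expert's steering for a given offset."""
    return w * x + b


print(w, b, cloned_policy(4.0))
```

Note that every training pair is treated as an independent example; nothing in the fit accounts for how one action changes the states the agent will see later.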
Pros and Cons of Behavioral Cloning
• Advantages:
• Simplicity!
• Drawbacks:
• Fails to work well with limited data
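• Assumes that observations are i.i.d. and learns to fit single time-step decisions.
This leads to the problem of compounding error due to covariate shift: a small mistake takes the agent to states that the expert's demonstrations do not cover, where it is likely to make further mistakes.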
Apprenticeship Learning
Reinforcement Learning
Reinforcement Learning
refers to learning through
trial and error using
feedback from the
environment.
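[Figure: the reinforcement learning loop. The agent sends an action to the environment, and the environment returns a reward and a new state to the agent.]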
Goal of RL
Find a policy $\pi^*$ that maximizes the expectation of the reward function $R(\tau)$ over trajectories $\tau$:
$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau}[R(\tau)]$
The reward of a trajectory, $R(\tau)$, is a function of all the rewards received in the trajectory,
e.g. $R(\tau) = \sum_t r_t$ or $R(\tau) = \sum_t \gamma^t r_t$
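Both example forms of the trajectory reward can be computed with one short function. This is a minimal sketch; the name `trajectory_return` and the reward sequence are illustrative choices.

```python
def trajectory_return(rewards, gamma=1.0):
    """R(tau) = sum_t gamma^t * r_t; gamma = 1.0 gives the plain sum of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical per-step rewards r_0, ..., r_3 of one episode.
rewards = [1, 1, -10, 100]

print(trajectory_return(rewards))             # 92
print(trajectory_return(rewards, gamma=0.5))  # 11.5
```

With a discount factor below 1, rewards that arrive later in the episode contribute less, which is why the large final reward counts for much less in the second call.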
Apprenticeship Learning
1. Inverse Reinforcement Learning (IRL): Use the dataset of expert demonstrations to uncover the reward function that the expert is trying to optimize.
  • This reward function is expected to succinctly encode the expert’s behavior…
2. Reinforcement Learning (RL): Learn the optimal policy for this recovered reward function using RL.
[Figure: expert demonstrations → IRL → reward function → RL → optimum policy]
Pros and Cons of Apprenticeship Learning
• Advantages:
• Does not fit single time-step decisions in isolation, and hence compounding error is not a problem, unlike in behavioral cloning
• Drawbacks:
• IRL is a computationally expensive algorithm because it needs RL to run in an inner loop
• Scalability issues in large environments
• Agent needs to act in the environment during learning – this may be unsafe in risk-sensitive applications
Active Learning
Active Learning
In Active Learning, the agent is able to query the expert for an optimal action in any given state and uses these active samples to improve its policy
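[Figure: given a state, the agent estimates its confidence. If the confidence is high, the agent takes its own action. If the confidence is low, the agent queries the expert, takes the expert's action, and rectifies its policy.]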
Workflow of Active Learning
1. Train the agent by behavioral cloning
2. Deploy the agent in the real world in presence of an expert
3. Agent queries the expert whenever it is in doubt and rectifies itself
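The workflow can be sketched as a loop in which the agent acts on its own when confident and otherwise asks the expert. Everything here is a deliberately crude stand-in: the lookup-table policy, the `confidence` function (1.0 for states seen before, 0.0 otherwise), the threshold, and the expert rule are all assumptions made for illustration.

```python
def confidence(policy_table, state):
    """Crude stand-in for a confidence estimate: 1.0 for seen states, else 0.0."""
    return 1.0 if state in policy_table else 0.0


def act_with_expert(states, policy_table, expert, threshold=0.5):
    """Act in each state; query the expert and update the policy when in doubt."""
    actions, n_queries = [], 0
    for s in states:
        if confidence(policy_table, s) >= threshold:
            a = policy_table[s]   # high confidence: the agent takes its own action
        else:
            a = expert(s)         # low confidence: the agent queries the expert
            policy_table[s] = a   # the agent rectifies its policy with the new sample
            n_queries += 1
        actions.append(a)
    return actions, n_queries


def expert(s):
    """Hypothetical expert: steer against the lateral offset."""
    return -s


# Step 1 (assumed already done): behavioral cloning covered only two states.
policy_table = {0: 0, 1: -1}

# Steps 2 and 3: deploy next to the expert and query whenever in doubt.
actions, n_queries = act_with_expert([0, 1, 2, 2, -1], policy_table, expert)
print(actions, n_queries)  # [0, -1, -2, -2, 1] 2
```

The second visit to state 2 needs no query because the policy was rectified on the first visit.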
Pros and Cons of Active Learning
• Advantages:
• Safe during both training and testing
• Drawbacks:
• Getting robust confidence estimates is tough
• Requires longer supervision by the expert
Issue of Safety
Types of Safety
• Safety during training
• Safety after deployment
Different Approaches to Ensuring Safety
• Vigilance during exploration
  • External Knowledge
    • Prior knowledge
    • Expert demonstration
    • Teacher advice
  • Risk-directed exploration
• Engineering the optimization criterion
  • Worst case criteria
  • Risk-sensitive criteria
  • Constrained criteria
Case study on how to make
an existing algorithm safe
GAIL: Generative Adversarial Imitation
Learning
Problem of heavy tail
RAIL: Risk-Averse Imitation Learning
Santara et al. 2017. Accepted at Deep Reinforcement Learning Symposium at NIPS 2017
CVaR of trajectory risk
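As a rough illustration of why CVaR is sensitive to a heavy tail, the sketch below estimates the Value at Risk (VaR) and the Conditional Value at Risk (CVaR) from a sample of trajectory costs. It uses a simple empirical estimator (the mean of the worst (1 − α) fraction of samples); it is not the RAIL objective or its training procedure, and the function name and cost values are hypothetical.

```python
import math


def empirical_var_cvar(costs, alpha):
    """Return (VaR_alpha, CVaR_alpha) estimated from a sample of trajectory costs.

    CVaR_alpha is estimated as the mean of the worst (1 - alpha) fraction of
    the costs; VaR_alpha is the smallest cost inside that tail.
    """
    ordered = sorted(costs)
    n = len(ordered)
    # round() guards against floating-point noise such as 1.9999999999999996
    k = max(1, math.ceil(round((1 - alpha) * n, 9)))  # number of tail samples
    tail = ordered[-k:]
    return tail[0], sum(tail) / k


# Hypothetical trajectory costs: mostly small, with one catastrophic outlier.
costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

mean_cost = sum(costs) / len(costs)
var_80, cvar_80 = empirical_var_cvar(costs, alpha=0.8)
print(mean_cost, var_80, cvar_80)  # 14.5 9 54.5
```

The mean cost hides the rare catastrophic trajectory, while the CVaR is dominated by it, which is what makes it a natural quantity to penalize when tail events matter.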
Results
Any Questions?
Thank You
