Presentation on
“Safe, Multi-Agent, RL for
Autonomous Driving”
Kiho Suh
Modulabs, May 8th 2017
About Paper
• Published in October 2016
• From Mobileye (founded in 1999)
• Mobileye develops vision-based advanced driver-assistance systems
• Acquired by Intel in March 2017 for $15.3 billion (roughly 17 trillion KRW)
• Shai (Shalev-Shwartz) and Shaked (Shammah) also wrote "Failures of Deep Learning",
which drew attention on reddit and Hacker News
Why is this problem hard?
• Driving is not a single-agent problem
• It is a "multi-agent" game.
• Other agents react to what our agent does, so their behavior cannot be treated as fixed.
• Example: https://www.bloomberg.com/news/articles/2015-12-18/humans-are-slamming-into-driverless-cars-and-exposing-a-key-flaw
Real-world dense traffic (video)
https://www.youtube.com/watch?v=-2RCPpdmSVg
Paris, France
Mumbai, India
Three Elements of Autonomous Driving
• Sensing: Environmental Model, 360° awareness
• Mapping: Localization at High Accuracy (10 cm), Drivable Path
• Driving Policy (Planning): Negotiating in a Multi-Agent game, Strategy
Three Elements of Autonomous Driving
• Sensing: Environmental Model, 360° awareness
• Mapping: Localization at High Accuracy (10 cm), Drivable Path
• Driving Policy (Planning): Negotiating in a Multi-Agent game, Strategy
This paper is about the Driving Policy (Planning) part!
Why is Driving Policy hard?
• Goal: learn a driving policy
• It is a Multi-Agent Game
• Other agents: will they yield? cut in? hold their lane?
• Defensive/Aggressive Tradeoff: too defensive blocks traffic, too aggressive risks accidents.
• Sensing uncertainty: e.g., noisy radar measurements
• Blind zones: how should the policy behave toward what it cannot see?
Reinforcement Learning
Markov Decision Process
Some Approaches to RL
• Imitation Learning (Behavior Cloning)
• Learn from pairs {(si, ai)}, where ai is the action chosen by a good agent (e.g. a human driver)
at state si. One can then use supervised learning to learn a policy π such that π(si) ≈ ai
• Direct Policy Optimization
• Express the policy π in parametric form and directly optimize it using SGD
• Value-based Learning
• Learn the Q or V functions. Only meaningful in the context of an MDP.
• Model-based Learning and Planning
• Learn the state-transition probabilities and solve the optimization problem of finding the
optimal Q. Clearly relies on the Markov Assumption.
Some Approaches to RL
• Imitation Learning (Behavior Cloning, sketched below)
• Learn from pairs {(si, ai)}, where ai is the action chosen by a good agent (e.g. a human driver)
at state si. One can then use supervised learning to learn a policy π such that π(si) ≈ ai
• Direct Policy Optimization
• Express the policy π in parametric form and directly optimize it using SGD
• Value-based Learning
• Learn the Q or V functions. Only meaningful in the context of an MDP.
• Model-based Learning and Planning
• Learn the state-transition probabilities and solve the optimization problem of finding the
optimal Q. Clearly relies on the Markov Assumption.
In practice: which of these can we actually use?
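For concreteness, here is a minimal behavior-cloning sketch for the imitation-learning option above (illustrative only: the state dimension, action set and recorded dataset are random placeholders, not from the paper).

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, N_SAMPLES = 8, 3, 1000

states = rng.normal(size=(N_SAMPLES, STATE_DIM))               # s_i (placeholder data)
expert_actions = rng.integers(0, N_ACTIONS, size=N_SAMPLES)    # a_i from a "good" agent (placeholder)

W = np.zeros((STATE_DIM, N_ACTIONS))                           # linear-softmax policy parameters

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.1
for epoch in range(50):
    probs = softmax(states @ W)                                # pi(a | s_i) for every sample
    probs[np.arange(N_SAMPLES), expert_actions] -= 1.0         # cross-entropy gradient w.r.t. logits
    W -= lr * states.T @ probs / N_SAMPLES                     # SGD step toward pi(s_i) ≈ a_i

accuracy = (softmax(states @ W).argmax(axis=1) == expert_actions).mean()
print("behavior-cloning train accuracy (random placeholder data):", accuracy)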
Markov Decision Process (MDP)
The Markovian Assumption:
st+1 is conditionally independent of the past given st and at, i.e. P(st+1 | st, at, st-1, at-1, …) = P(st+1 | st, at)
• With multiple agents the Markovian Assumption is problematic:
• other agents' behavior depends on state we cannot observe, and modelling it as "hidden state"
(as in POMDPs) does not really solve the problem. Can we simply drop the assumption?
• Most of what we need does not actually require Markovity.
• An agnostic RL setting still works (the optimal policy can still be written as a function of st).
• Value functions are the exception (V and Q are only well defined under the Markovian Assumption).
Takeaway: do RL "without the Markovian Assumption"
RL without the Markovian Assumption
• O  Imitation learning (behavior cloning)
• ?  Direct policy optimization
• X  Value based learning
• X  Model based learning and planning
The first two do not need st to satisfy the Markovian Assumption!
Even though a dynamical system that contains other agents violates the Markov Assumption,
we can still learn a policy!
Direct Policy Optimization
Policy Stochastic Gradient without Markovian Assumption
Parameterize the policy by θ!
Actions are sampled from πθ
Policy Stochastic Gradient without Markovian Assumption
Parameterize the policy by θ!
Actions are sampled from πθ, and we take stochastic gradient steps on the parameters θ
Policy Gradient Theorem
(the gradient equation was shown as an image in the original slides)
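One standard form of this gradient is the score-function (REINFORCE-style) estimator, ∇θ E[R(τ)] = E[R(τ) Σt ∇θ log πθ(at | st)], which needs no Markov assumption on the environment. Below is a minimal numpy sketch (illustrative only; the toy environment with deliberately history-dependent dynamics and the linear-softmax policy are my own placeholders, not the paper's setup).

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HORIZON = 4, 2, 10
theta = np.zeros((STATE_DIM, N_ACTIONS))          # linear-softmax policy parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def env_step(history, action):
    # Placeholder dynamics that depend on the whole history (deliberately non-Markovian).
    s_next = rng.normal(size=STATE_DIM) + 0.1 * len(history)
    reward = 1.0 if action == 0 else 0.0
    return s_next, reward

def sample_trajectory():
    s, history, grad_log, total_r = rng.normal(size=STATE_DIM), [], np.zeros_like(theta), 0.0
    for _ in range(HORIZON):
        p = softmax(s @ theta)
        a = rng.choice(N_ACTIONS, p=p)
        g = -np.outer(s, p); g[:, a] += s          # grad_theta log pi(a | s) for linear-softmax
        grad_log += g
        history.append((s, a))
        s, r = env_step(history, a)
        total_r += r
    return grad_log, total_r

lr = 0.05
for it in range(200):
    batch = [sample_trajectory() for _ in range(16)]
    # Estimate E[ R(trajectory) * sum_t grad log pi(a_t | s_t) ] and ascend it.
    grad = np.mean([R * g for g, R in batch], axis=0)
    theta += lr * grad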
Variance Reduction
• The stochastic (SGD) estimate of the policy gradient has high variance
• A value-function baseline can reduce this variance.
• Variance reduction does not require the Markovian Assumption
• As in SVRG (Johnson and Zhang) and SDCA (Shalev-Shwartz and Zhang), reducing the variance
is what makes stochastic optimization practical.
Variance Reduction Lemma
ξ: a random variable
The lemma (shown as an equation in the original slide) says that subtracting a quantity that does
not depend on the sampled action leaves the expectation of the gradient estimate unchanged while
it can reduce the variance.
Applied to RL: Bt,i is a baseline (for example a Value function, the expected future reward as a
function of the state). Q̂θ is the expected future reward as a function of the action as well, and
their difference is the Advantage function. This is the idea behind Actor-Critic methods.
Importantly, none of this is tied to Markovity: no assumption on the environment is needed!
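A tiny numeric check of the lemma's message (illustrative only; the two-action policy and the per-action returns below are made-up numbers): subtracting an action-independent baseline leaves the mean of the gradient estimate unchanged while shrinking its variance.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.3])          # pi_theta over two actions (single fixed state, for simplicity)
q = np.array([10.0, 11.0])        # per-action returns ("Q hat")

def estimates(baseline, n=100_000):
    a = rng.choice(2, size=n, p=p)
    # Score term d/dtheta log pi(a) for a Bernoulli-style parameterization of pi:
    score = np.where(a == 0, 1 - p[0], -p[0])
    return score * (q[a] - baseline)

for b in [0.0, q @ p]:            # no baseline vs. value-function baseline V = E[Q]
    est = estimates(b)
    print(f"baseline={b:5.2f}  mean={est.mean():+.3f}  variance={est.var():.3f}")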
Imitation Learning Policy Gradient
• Imitation learning is used to initialize πθ
• SGD on the policy-gradient objective then fine-tunes πθ
• No Markovian assumptions are needed (a rough sketch of the two-phase recipe follows below)
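The referenced sketch of how the two pieces can be combined (illustrative only; behavior_cloning_init, rollout and policy_gradient_step are stand-in stubs for the steps sketched on the earlier slides, not the paper's code).

import numpy as np

def behavior_cloning_init(expert_pairs):
    # Phase 1 (stub): fit theta to expert (state, action) pairs; here we just return zeros.
    return np.zeros((4, 2))

def rollout(theta, n):
    # Stub environment interaction: returns n (trajectory_gradient, trajectory_reward) pairs.
    return [(np.random.normal(size=theta.shape), np.random.random()) for _ in range(n)]

def policy_gradient_step(theta, trajectories, lr=0.05):
    grad = np.mean([reward * grad_log for grad_log, reward in trajectories], axis=0)
    return theta + lr * grad

theta = behavior_cloning_init(expert_pairs=None)     # imitation-learning warm start for pi_theta
for _ in range(100):                                 # then fine-tune with policy-gradient SGD
    theta = policy_gradient_step(theta, rollout(theta, n=16))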
Imitation Learning Policy Gradient (video)
https://www.dropbox.com/s/egfcghug4y612fp/mobileye3.mp4?dl=0
Next: safe RL!
Safety Challenge
The RL objective is the expected reward:
• Suppose R(s) ∈ [-1, 1] for "typical events" (e.g. +1 for smooth driving, -1 for discomfort)
• With some small probability p << 1 the policy leads to an accident (e.g. p = 10^-8)
• An accident contributes p·r to the expected reward, where -r is the accident reward
• To make accidents sufficiently costly we therefore need -r << -1/p, i.e. an extremely large penalty.
• But such a reward blows up the variance of the gradient estimate. Variance:
Claim: The variance of the policy gradient estimator grows with 1/p
Safety Challenge
Is the learned policy guaranteed to be safe?
The RL objective is the expected reward:
• Suppose R(s) ∈ [-1, 1] for "typical events" (e.g. +1 for smooth driving, -1 for discomfort)
• With some small probability p << 1 the policy leads to an accident (e.g. p = 10^-8)
• An accident contributes p·r to the expected reward, where -r is the accident reward
• To make accidents sufficiently costly we therefore need -r << -1/p, i.e. an extremely large penalty.
• But such a reward blows up the variance of the gradient estimate. Variance:
Claim: The variance of the policy gradient estimator grows with 1/p
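A quick numeric illustration of the claim (not from the paper; the probabilities and the penalty choice r = 1/p are made up): even the per-sample variance of the reward already grows like 1/p, before the policy-gradient estimator is formed.

import numpy as np

rng = np.random.default_rng(0)
for p in [1e-2, 1e-3, 1e-4]:
    r = 1.0 / p                                              # accident penalty chosen so that -r <= -1/p
    rewards = np.where(rng.random(1_000_000) < p, -r, 1.0)   # typical reward +1, rare accident -r
    print(f"p={p:.0e}  mean={rewards.mean():+.3f}  variance={rewards.var():.1f}  1/p={1/p:.0f}")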
Safe RL
• The policy is a mapping π : S -> A
• Decompose the policy function as a composition: π = π(T) ∘ πθ(D)
• πθ(D) is the learned part; π(T) is not learned and is where safety is enforced.
πθ(D)
• This is the learned, strategic ("comfort of driving") part of the policy.
• πθ(D): S -> D maps the state space to a set of Desires.
• This function is parameterized by θ and is being learned.
• The desires produced by πθ(D) are translated into a cost function over driving trajectories.
π(T)
• This part is not learned.
• π(T) maps the state and the set of Desires to an action (a continuous trajectory).
• This function is implemented by solving an optimization problem with hard constraints on safety
and a cost that is defined by the "desires".
• The hard constraints are what make the framework safe. Since this part is not being learned,
it has no parameters θ. (A structural sketch of the full composition follows.)
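The referenced sketch of the composition (illustrative only; the Desires fields, the trajectory representation, the 2 m safety gap and the cost are simplified placeholders, not the paper's specification).

from collections import namedtuple
import numpy as np

Trajectory = namedtuple("Trajectory", "speed lateral min_gap")   # min_gap: closest distance to any other car

def pi_theta_D(state, theta):
    # Learned part: maps the state to high-level Desires (here just a target speed and a lane offset).
    z = np.tanh(state @ theta)
    return {"target_speed": 10.0 + 5.0 * z[0], "lateral_pos": z[1]}

def pi_T(desires, candidates):
    # Non-learned part: hard safety constraint first, then a cost defined by the Desires.
    safe = [t for t in candidates if t.min_gap > 2.0]            # hard constraint: never closer than 2 m
    cost = lambda t: (t.speed - desires["target_speed"])**2 + (t.lateral - desires["lateral_pos"])**2
    return min(safe, key=cost)

def policy(state, theta, candidates):
    # pi = pi_T o pi_theta_D
    return pi_T(pi_theta_D(state, theta), candidates)

# Toy usage with made-up candidate trajectories.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))
cands = [Trajectory(speed=s, lateral=l, min_gap=g)
         for s, l, g in zip([8, 10, 12, 14], [-0.5, 0.0, 0.5, 0.0], [1.5, 3.0, 4.0, 2.5])]
print(policy(rng.normal(size=4), theta, cands))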
π = π(T) ∘ πθ(D)
• Illustrated with the double-merge scenario (see the video on the next slide).
• The host vehicle has to negotiate with several other vehicles at once.
• It must decide both lateral (which lane) and longitudinal (how fast) behavior under time pressure.
• Hand-crafted heuristics and brute-force trajectory search do not scale to this kind of scenario.
The Double-Merge Scenario (video)
https://www.dropbox.com/s136nbndtdyehtgidoubleMerge.m4v?dl=0
Double-merge Scenario
The Desires set consists of:
• the desired target speed of the host vehicle
• the desired lateral position, in lane units
• a classification label assigned to each of the n other vehicles:
• g: give way (yield to it)
• t: take way (go ahead of it)
• o: maintain an offset distance to it
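One possible way to encode this Desires set as a data structure (a sketch; the field names are mine, not the paper's notation).

from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Desires:
    target_speed: float                           # desired target speed of the host vehicle
    lateral_position: float                       # desired lateral position, in lane units
    vehicle_labels: List[Literal["g", "t", "o"]]  # one label per other vehicle:
                                                  # g = give way, t = take way, o = keep offset distance

example = Desires(target_speed=13.0, lateral_position=0.5, vehicle_labels=["g", "t", "o", "g"])
print(example)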
Desires (video)
https://www.dropbox.com/s/2uxw1v87zyuhrhn/mobileye.mp4?dl=0
Desires (video)
https://www.dropbox.com/s/qv36iwz58068rxq/mobileye2.mp4?dl=0
Desires
Markovity and Safety
• Markovity:
• The Markov Assumption is avoided by taking an agnostic RL approach.
• Variance reduction (baselines / the advantage) also works without the Markovian Assumption.
• Safety: the learned "desires" are constrained by rule-based "hard constraints".
Safety is therefore guaranteed while:
• comfort and efficiency are still learned
• the defensive/aggressive balance is still learned
• safety itself does not depend on the learning process.
How is the learning made tractable?
• "Curse of Dimensionality"
• Semantic Abstraction: structure the driving policy as an options graph.
• Continuous Trajectory Planner: we use a dynamic programming planning module that optimizes
the "desires" subject to the "hard constraints" on safety (see the sketch after this list).
• Data: self-play.
• Evaluation: comparing policies.
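The referenced sketch: a minimal dynamic-programming planner over a small lateral lattice (illustrative only; the grid, the blocked cells standing in for hard safety constraints, and the desire-based cost are made up, not the paper's planner).

import numpy as np

T, LANES = 6, 5                       # planning horizon and number of lateral grid positions
desired_lat = 3                       # desired lateral position, taken from the Desires
blocked = {(2, 3), (3, 3), (3, 4)}    # (t, lane) cells that violate a hard safety constraint

INF = float("inf")
cost = np.full((T, LANES), INF)
cost[0, 0] = 0.0                      # start in lane 0
back = np.zeros((T, LANES), dtype=int)

for t in range(1, T):
    for lane in range(LANES):
        if (t, lane) in blocked:      # hard constraint: this cell can never be chosen
            continue
        step_cost = (lane - desired_lat) ** 2            # cost defined by the "desires"
        for prev in (lane - 1, lane, lane + 1):          # at most one lateral step per time step
            if 0 <= prev < LANES and cost[t - 1, prev] + step_cost < cost[t, lane]:
                cost[t, lane] = cost[t - 1, prev] + step_cost
                back[t, lane] = prev

# Recover the cheapest safe trajectory by backtracking.
lane = int(np.argmin(cost[-1]))
path = [lane]
for t in range(T - 1, 0, -1):
    lane = int(back[t, lane])
    path.append(lane)
print("lateral positions over time:", path[::-1])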
Options Graph
• The Desires are produced by traversing an options graph.
• Each internal node chooses one of its children; each child has its own (sub-)policy.
• A leaf node maps the state space to a subset of the Desires output space.
• Reducing variance and sample complexity of the learning problem.
(Figure: a possible options graph for the double-merge scenario)
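For intuition, a small sketch of what an options-graph node could look like (illustrative; the node names, the traversal rule and the toy chooser are made up in a double-merge flavour and are not the paper's graph).

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class OptionNode:
    name: str
    children: List["OptionNode"] = field(default_factory=list)
    # Each internal node has a (learnable) chooser over its children; a leaf emits part of the Desires.
    choose: Optional[Callable[[dict], int]] = None
    emit_desires: Optional[Callable[[dict], Dict]] = None

def traverse(node: OptionNode, state: dict) -> Dict:
    # Walk from the root to a leaf, letting each node pick one child; the leaf outputs Desires.
    while node.children:
        node = node.children[node.choose(state)]
    return node.emit_desires(state)

# Toy double-merge-flavoured graph: the root decides the merge side, a leaf sets lane-level desires.
stay  = OptionNode("stay",        emit_desires=lambda s: {"lateral_pos": 0.0,  "target_speed": s["speed"]})
left  = OptionNode("merge_left",  emit_desires=lambda s: {"lateral_pos": -1.0, "target_speed": s["speed"] + 2})
right = OptionNode("merge_right", emit_desires=lambda s: {"lateral_pos": +1.0, "target_speed": s["speed"] + 2})
root  = OptionNode("root", children=[stay, left, right],
                   choose=lambda s: 1 if s["gap_left"] > s["gap_right"] else 2)

print(traverse(root, {"speed": 12.0, "gap_left": 8.0, "gap_right": 3.0}))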
Conclusions
• Driving in realistic dense traffic scenes calls for negotiating with other agents
• The other agents are human drivers: there is a defensive/aggressive tradeoff
• "The rules of breaking the rules …"
• Key ingredients of the approach:
• a safety component enforced by hard constraints rather than by learning
• Agnostic RL with no Markovian assumption for the Multi-Agent setting
• Semantic abstraction using the Options Graph
• Self-play
Reference
• Reinforcement Learning: An Introduction, Sutton and Barto
• https://www.youtube.com/watch?v=n8T7A3wqH3Q
• www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
• https://www.bloomberg.com/news/articles/2015-12-18/humans-are-slamming-into-driverless-cars-and-exposing-a-key-flaw
• https://media.nips.cc/Conferences/2016/Slides/6198-Slides.pdf
• https://www.cs.huji.ac.il/~shais/
• https://www.dropbox.com/s136nbndtdyehtgidoubleMerge.m4v?dl=0
