Presentation on
“Safe, Multi-Agent, RL for
Autonomous Driving”
Kiho Suh
Modulabs, May 8th 2017
About Paper
• Published in October 2016
• From Mobileye (founded in 1999)
• Mobileye develops vision-based advanced driver-assistance systems
• Acquired by Intel in March 2017 for $15.3 billion (roughly 17 trillion KRW)
• Shai (Shalev-Shwartz) and Shaked (Shammah) also wrote "Failures of Deep Learning",
which drew attention on reddit and Hacker News
Why is this problem hard?
• Driving is not a single-agent problem
• It is a "multi-agent" game.
• Other agents react to what our agent does, so their behavior cannot be treated as fixed.
• Example: https://www.bloomberg.com/news/articles/2015-12-18/humans-are-slamming-into-driverless-cars-and-exposing-a-key-flaw
Real-world dense traffic (video)
https://www.youtube.com/watch?v=-2RCPpdmSVg
Paris, France
Mumbai, India
Three Elements of Autonomous Driving
• Sensing: Environmental Model, 360° awareness
• Mapping: Localization at High Accuracy (10 cm), Drivable Path
• Driving Policy (Planning): Negotiating in a Multi-Agent game, Strategy
Three Elements of Autonomous Driving
• Sensing: Environmental Model, 360° awareness
• Mapping: Localization at High Accuracy (10 cm), Drivable Path
• Driving Policy (Planning): Negotiating in a Multi-Agent game, Strategy
This paper is about the Driving Policy (Planning) part!
Why is Driving Policy hard?
• Goal: learn a driving policy
• It is a Multi-Agent Game
• Other agents: will they yield? cut in? hold their lane?
• Defensive/Aggressive Tradeoff: too defensive blocks traffic, too aggressive risks accidents.
• Sensing uncertainty: e.g., noisy radar measurements
• Blind zones: how should the policy behave toward what it cannot see?
Reinforcement Learning
Markov Decision Process
Some Approaches to RL
• Imitation Learning (Behavior Cloning)
• Learn from pairs {(si, ai)}, where ai is the action chosen by a good agent (e.g. a human driver)
at state si. One can then use supervised learning to learn a policy π such that π(si) ≈ ai
• Direct Policy Optimization
• Express the policy π in parametric form and directly optimize it using SGD
• Value-based Learning
• Learn the Q or V functions. Only meaningful in the context of an MDP.
• Model-based Learning and Planning
• Learn the state-transition probabilities and solve the optimization problem of finding the
optimal Q. Clearly relies on the Markov Assumption.
Some Approaches to RL
• Imitation Learning (Behavior Cloning, sketched below)
• Learn from pairs {(si, ai)}, where ai is the action chosen by a good agent (e.g. a human driver)
at state si. One can then use supervised learning to learn a policy π such that π(si) ≈ ai
• Direct Policy Optimization
• Express the policy π in parametric form and directly optimize it using SGD
• Value-based Learning
• Learn the Q or V functions. Only meaningful in the context of an MDP.
• Model-based Learning and Planning
• Learn the state-transition probabilities and solve the optimization problem of finding the
optimal Q. Clearly relies on the Markov Assumption.
In practice: which of these can we actually use?
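For concreteness, here is a minimal behavior-cloning sketch for the imitation-learning option above (illustrative only: the state dimension, action set and recorded dataset are random placeholders, not from the paper).

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, N_SAMPLES = 8, 3, 1000

states = rng.normal(size=(N_SAMPLES, STATE_DIM))               # s_i (placeholder data)
expert_actions = rng.integers(0, N_ACTIONS, size=N_SAMPLES)    # a_i from a "good" agent (placeholder)

W = np.zeros((STATE_DIM, N_ACTIONS))                           # linear-softmax policy parameters

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.1
for epoch in range(50):
    probs = softmax(states @ W)                                # pi(a | s_i) for every sample
    probs[np.arange(N_SAMPLES), expert_actions] -= 1.0         # cross-entropy gradient w.r.t. logits
    W -= lr * states.T @ probs / N_SAMPLES                     # SGD step toward pi(s_i) ≈ a_i

accuracy = (softmax(states @ W).argmax(axis=1) == expert_actions).mean()
print("behavior-cloning train accuracy (random placeholder data):", accuracy)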
Markov Decision Process (MDP)
The Markovian Assumption:
st+1 is conditionally independent of the past given st and at, i.e. P(st+1 | st, at, st-1, at-1, …) = P(st+1 | st, at)
• With multiple agents the Markovian Assumption is problematic:
• other agents' behavior depends on state we cannot observe, and modelling it as "hidden state"
(as in POMDPs) does not really solve the problem. Can we simply drop the assumption?
• Most of what we need does not actually require Markovity.
• An agnostic RL setting still works (the optimal policy can still be written as a function of st).
• Value functions are the exception (V and Q are only well defined under the Markovian Assumption).
Takeaway: do RL "without the Markovian Assumption"
RL without the Markovian Assumption
• O  Imitation learning (behavior cloning)
• ?  Direct policy optimization
• X  Value based learning
• X  Model based learning and planning
The first two do not need st to satisfy the Markovian Assumption!
Even though a dynamical system that contains other agents violates the Markov Assumption,
we can still learn a policy!
Direct Policy Optimization
Policy Stochastic Gradient without Markovian Assumption
Parameterize the policy by θ!
Actions are sampled from πθ
Policy Stochastic Gradient without Markovian Assumption
Parameterize the policy by θ!
Actions are sampled from πθ, and we take stochastic gradient steps on the parameters θ
Policy Gradient Theorem
(the gradient equation was shown as an image in the original slides)
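One standard form of this gradient is the score-function (REINFORCE-style) estimator, ∇θ E[R(τ)] = E[R(τ) Σt ∇θ log πθ(at | st)], which needs no Markov assumption on the environment. Below is a minimal numpy sketch (illustrative only; the toy environment with deliberately history-dependent dynamics and the linear-softmax policy are my own placeholders, not the paper's setup).

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HORIZON = 4, 2, 10
theta = np.zeros((STATE_DIM, N_ACTIONS))          # linear-softmax policy parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def env_step(history, action):
    # Placeholder dynamics that depend on the whole history (deliberately non-Markovian).
    s_next = rng.normal(size=STATE_DIM) + 0.1 * len(history)
    reward = 1.0 if action == 0 else 0.0
    return s_next, reward

def sample_trajectory():
    s, history, grad_log, total_r = rng.normal(size=STATE_DIM), [], np.zeros_like(theta), 0.0
    for _ in range(HORIZON):
        p = softmax(s @ theta)
        a = rng.choice(N_ACTIONS, p=p)
        g = -np.outer(s, p); g[:, a] += s          # grad_theta log pi(a | s) for linear-softmax
        grad_log += g
        history.append((s, a))
        s, r = env_step(history, a)
        total_r += r
    return grad_log, total_r

lr = 0.05
for it in range(200):
    batch = [sample_trajectory() for _ in range(16)]
    # Estimate E[ R(trajectory) * sum_t grad log pi(a_t | s_t) ] and ascend it.
    grad = np.mean([R * g for g, R in batch], axis=0)
    theta += lr * grad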
Variance Reduction
• The stochastic (SGD) estimate of the policy gradient has high variance
• A value-function baseline can reduce this variance.
• Variance reduction does not require the Markovian Assumption
• As in SVRG (Johnson and Zhang) and SDCA (Shalev-Shwartz and Zhang), reducing the variance
is what makes stochastic optimization practical.
Variance Reduction Lemma
ξ: a random variable
The lemma (shown as an equation in the original slide) says that subtracting a quantity that does
not depend on the sampled action leaves the expectation of the gradient estimate unchanged while
it can reduce the variance.
Applied to RL: Bt,i is a baseline (for example a Value function, the expected future reward as a
function of the state). Q̂θ is the expected future reward as a function of the action as well, and
their difference is the Advantage function. This is the idea behind Actor-Critic methods.
Importantly, none of this is tied to Markovity: no assumption on the environment is needed!
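A tiny numeric check of the lemma's message (illustrative only; the two-action policy and the per-action returns below are made-up numbers): subtracting an action-independent baseline leaves the mean of the gradient estimate unchanged while shrinking its variance.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.3])          # pi_theta over two actions (single fixed state, for simplicity)
q = np.array([10.0, 11.0])        # per-action returns ("Q hat")

def estimates(baseline, n=100_000):
    a = rng.choice(2, size=n, p=p)
    # Score term d/dtheta log pi(a) for a Bernoulli-style parameterization of pi:
    score = np.where(a == 0, 1 - p[0], -p[0])
    return score * (q[a] - baseline)

for b in [0.0, q @ p]:            # no baseline vs. value-function baseline V = E[Q]
    est = estimates(b)
    print(f"baseline={b:5.2f}  mean={est.mean():+.3f}  variance={est.var():.3f}")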
Imitation Learning Policy Gradient
• Imitation learning is used to initialize πθ
• SGD on the policy-gradient objective then fine-tunes πθ
• No Markovian assumptions are needed (a rough sketch of the two-phase recipe follows below)
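The referenced sketch of how the two pieces can be combined (illustrative only; behavior_cloning_init, rollout and policy_gradient_step are stand-in stubs for the steps sketched on the earlier slides, not the paper's code).

import numpy as np

def behavior_cloning_init(expert_pairs):
    # Phase 1 (stub): fit theta to expert (state, action) pairs; here we just return zeros.
    return np.zeros((4, 2))

def rollout(theta, n):
    # Stub environment interaction: returns n (trajectory_gradient, trajectory_reward) pairs.
    return [(np.random.normal(size=theta.shape), np.random.random()) for _ in range(n)]

def policy_gradient_step(theta, trajectories, lr=0.05):
    grad = np.mean([reward * grad_log for grad_log, reward in trajectories], axis=0)
    return theta + lr * grad

theta = behavior_cloning_init(expert_pairs=None)     # imitation-learning warm start for pi_theta
for _ in range(100):                                 # then fine-tune with policy-gradient SGD
    theta = policy_gradient_step(theta, rollout(theta, n=16))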
Imitation Learning Policy Gradient (video)
https://www.dropbox.com/s/egfcghug4y612fp/mobileye3.mp4?dl=0
Next: safe RL!
Safety Challenge
The RL objective is the expected reward:
• Suppose R(s) ∈ [-1, 1] for "typical events" (e.g. +1 for smooth driving, -1 for discomfort)
• With some small probability p << 1 the policy leads to an accident (e.g. p = 10^-8)
• An accident contributes p·r to the expected reward, where -r is the accident reward
• To make accidents sufficiently costly we therefore need -r << -1/p, i.e. an extremely large penalty.
• But such a reward blows up the variance of the gradient estimate. Variance:
Claim: The variance of the policy gradient estimator grows with 1/p
Safety Challenge
Is the learned policy guaranteed to be safe?
The RL objective is the expected reward:
• Suppose R(s) ∈ [-1, 1] for "typical events" (e.g. +1 for smooth driving, -1 for discomfort)
• With some small probability p << 1 the policy leads to an accident (e.g. p = 10^-8)
• An accident contributes p·r to the expected reward, where -r is the accident reward
• To make accidents sufficiently costly we therefore need -r << -1/p, i.e. an extremely large penalty.
• But such a reward blows up the variance of the gradient estimate. Variance:
Claim: The variance of the policy gradient estimator grows with 1/p
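A quick numeric illustration of the claim (not from the paper; the probabilities and the penalty choice r = 1/p are made up): even the per-sample variance of the reward already grows like 1/p, before the policy-gradient estimator is formed.

import numpy as np

rng = np.random.default_rng(0)
for p in [1e-2, 1e-3, 1e-4]:
    r = 1.0 / p                                              # accident penalty chosen so that -r <= -1/p
    rewards = np.where(rng.random(1_000_000) < p, -r, 1.0)   # typical reward +1, rare accident -r
    print(f"p={p:.0e}  mean={rewards.mean():+.3f}  variance={rewards.var():.1f}  1/p={1/p:.0f}")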
Safe RL
• The policy is a mapping π : S -> A
• Decompose the policy function as a composition: π = π(T) ∘ πθ(D)
• πθ(D) is the learned part; π(T) is not learned and is where safety is enforced.
πθ(D)
• This is the learned, strategic ("comfort of driving") part of the policy.
• πθ(D): S -> D maps the state space to a set of Desires.
• This function is parameterized by θ and is being learned.
• The desires produced by πθ(D) are translated into a cost function over driving trajectories.
π(T)
• This part is not learned.
• π(T) maps the state and the set of Desires to an action (a continuous trajectory).
• This function is implemented by solving an optimization problem with hard constraints on safety
and a cost that is defined by the "desires".
• The hard constraints are what make the framework safe. Since this part is not being learned,
it has no parameters θ. (A structural sketch of the full composition follows.)
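The referenced sketch of the composition (illustrative only; the Desires fields, the trajectory representation, the 2 m safety gap and the cost are simplified placeholders, not the paper's specification).

from collections import namedtuple
import numpy as np

Trajectory = namedtuple("Trajectory", "speed lateral min_gap")   # min_gap: closest distance to any other car

def pi_theta_D(state, theta):
    # Learned part: maps the state to high-level Desires (here just a target speed and a lane offset).
    z = np.tanh(state @ theta)
    return {"target_speed": 10.0 + 5.0 * z[0], "lateral_pos": z[1]}

def pi_T(desires, candidates):
    # Non-learned part: hard safety constraint first, then a cost defined by the Desires.
    safe = [t for t in candidates if t.min_gap > 2.0]            # hard constraint: never closer than 2 m
    cost = lambda t: (t.speed - desires["target_speed"])**2 + (t.lateral - desires["lateral_pos"])**2
    return min(safe, key=cost)

def policy(state, theta, candidates):
    # pi = pi_T o pi_theta_D
    return pi_T(pi_theta_D(state, theta), candidates)

# Toy usage with made-up candidate trajectories.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))
cands = [Trajectory(speed=s, lateral=l, min_gap=g)
         for s, l, g in zip([8, 10, 12, 14], [-0.5, 0.0, 0.5, 0.0], [1.5, 3.0, 4.0, 2.5])]
print(policy(rng.normal(size=4), theta, cands))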
π = π(T) ∘ πθ(D)
• Illustrated with the double-merge scenario (see the video on the next slide).
• The host vehicle has to negotiate with several other vehicles at once.
• It must decide both lateral (which lane) and longitudinal (how fast) behavior under time pressure.
• Hand-crafted heuristics and brute-force trajectory search do not scale to this kind of scenario.
The Double-Merge Scenario (video)
https://www.dropbox.com/s136nbndtdyehtgidoubleMerge.m4v?dl=0
Double-merge Scenario
The Desires set consists of:
• the desired target speed of the host vehicle
• the desired lateral position, in lane units
• a classification label assigned to each of the n other vehicles:
• g: give way (yield to it)
• t: take way (go ahead of it)
• o: maintain an offset distance to it
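One possible way to encode this Desires set as a data structure (a sketch; the field names are mine, not the paper's notation).

from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Desires:
    target_speed: float                           # desired target speed of the host vehicle
    lateral_position: float                       # desired lateral position, in lane units
    vehicle_labels: List[Literal["g", "t", "o"]]  # one label per other vehicle:
                                                  # g = give way, t = take way, o = keep offset distance

example = Desires(target_speed=13.0, lateral_position=0.5, vehicle_labels=["g", "t", "o", "g"])
print(example)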
Desires (video)
https://www.dropbox.com/s/2uxw1v87zyuhrhn/mobileye.mp4?dl=0
Desires (video)
https://www.dropbox.com/s/qv36iwz58068rxq/mobileye2.mp4?dl=0
Desires
Markovity and Safety
• Markovity:
• The Markov Assumption is avoided by taking an agnostic RL approach.
• Variance reduction (baselines / the advantage) also works without the Markovian Assumption.
• Safety: the learned "desires" are constrained by rule-based "hard constraints".
Safety is therefore guaranteed while:
• comfort and efficiency are still learned
• the defensive/aggressive balance is still learned
• safety itself does not depend on the learning process.
How is the learning made tractable?
• "Curse of Dimensionality"
• Semantic Abstraction: structure the driving policy as an options graph.
• Continuous Trajectory Planner: we use a dynamic programming planning module that optimizes
the "desires" subject to the "hard constraints" on safety (see the sketch after this list).
• Data: self-play.
• Evaluation: comparing policies.
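The referenced sketch: a minimal dynamic-programming planner over a small lateral lattice (illustrative only; the grid, the blocked cells standing in for hard safety constraints, and the desire-based cost are made up, not the paper's planner).

import numpy as np

T, LANES = 6, 5                       # planning horizon and number of lateral grid positions
desired_lat = 3                       # desired lateral position, taken from the Desires
blocked = {(2, 3), (3, 3), (3, 4)}    # (t, lane) cells that violate a hard safety constraint

INF = float("inf")
cost = np.full((T, LANES), INF)
cost[0, 0] = 0.0                      # start in lane 0
back = np.zeros((T, LANES), dtype=int)

for t in range(1, T):
    for lane in range(LANES):
        if (t, lane) in blocked:      # hard constraint: this cell can never be chosen
            continue
        step_cost = (lane - desired_lat) ** 2            # cost defined by the "desires"
        for prev in (lane - 1, lane, lane + 1):          # at most one lateral step per time step
            if 0 <= prev < LANES and cost[t - 1, prev] + step_cost < cost[t, lane]:
                cost[t, lane] = cost[t - 1, prev] + step_cost
                back[t, lane] = prev

# Recover the cheapest safe trajectory by backtracking.
lane = int(np.argmin(cost[-1]))
path = [lane]
for t in range(T - 1, 0, -1):
    lane = int(back[t, lane])
    path.append(lane)
print("lateral positions over time:", path[::-1])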
Options Graph
• The Desires are produced by traversing an options graph.
• Each internal node chooses one of its children; each child has its own (sub-)policy.
• A leaf node maps the state space to a subset of the Desires output space.
• Reducing variance and sample complexity of the learning problem.
(Figure: a possible options graph for the double-merge scenario)
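For intuition, a small sketch of what an options-graph node could look like (illustrative; the node names, the traversal rule and the toy chooser are made up in a double-merge flavour and are not the paper's graph).

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class OptionNode:
    name: str
    children: List["OptionNode"] = field(default_factory=list)
    # Each internal node has a (learnable) chooser over its children; a leaf emits part of the Desires.
    choose: Optional[Callable[[dict], int]] = None
    emit_desires: Optional[Callable[[dict], Dict]] = None

def traverse(node: OptionNode, state: dict) -> Dict:
    # Walk from the root to a leaf, letting each node pick one child; the leaf outputs Desires.
    while node.children:
        node = node.children[node.choose(state)]
    return node.emit_desires(state)

# Toy double-merge-flavoured graph: the root decides the merge side, a leaf sets lane-level desires.
stay  = OptionNode("stay",        emit_desires=lambda s: {"lateral_pos": 0.0,  "target_speed": s["speed"]})
left  = OptionNode("merge_left",  emit_desires=lambda s: {"lateral_pos": -1.0, "target_speed": s["speed"] + 2})
right = OptionNode("merge_right", emit_desires=lambda s: {"lateral_pos": +1.0, "target_speed": s["speed"] + 2})
root  = OptionNode("root", children=[stay, left, right],
                   choose=lambda s: 1 if s["gap_left"] > s["gap_right"] else 2)

print(traverse(root, {"speed": 12.0, "gap_left": 8.0, "gap_right": 3.0}))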
Conclusions
• Driving in realistic dense traffic scenes calls for negotiating with other agents
• The other agents are human drivers: there is a defensive/aggressive tradeoff
• "The rules of breaking the rules …"
• Key ingredients of the approach:
• a safety component enforced by hard constraints rather than by learning
• Agnostic RL with no Markovian assumption for the Multi-Agent setting
• Semantic abstraction using the Options Graph
• Self-play
Reference
• Reinforcement Learning: An Introduction, Sutton and Barto
• https://www.youtube.com/watch?v=n8T7A3wqH3Q
• www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
• https://www.bloomberg.com/news/articles/2015-12-18/humans-are-slamming-into-driverless-cars-and-exposing-a-key-flaw
• https://media.nips.cc/Conferences/2016/Slides/6198-Slides.pdf
• https://www.cs.huji.ac.il/~shais/
• https://www.dropbox.com/s136nbndtdyehtgidoubleMerge.m4v?dl=0
