Lecture	2:	Sampling-based	Approximations
And
Function	Fitting
Yan	(Rocky)	Duan
Berkeley	AI	Research	Lab
Many	slides	made	with	John	Schulman,	Xi	(Peter)	Chen	and	Pieter	Abbeel
n Optimal	Control
=	
given	an	MDP	(S, A, P, R, γ, H)
find	the	optimal	policy	π*
Quick	One-Slide	Recap
n Exact	Methods:
n Value	Iteration
n Policy	Iteration
Limitations:	
• Update	equations	require	access	to	dynamics	
model
• Iteration	over	/	Storage	for	all	states	and	actions:	
requires	small,	discrete	state-action	space
->	sampling-based	approximations
->	Q/V	function	fitting
n Q	Value	Iteration
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
Sampling-Based	Approximation
Recap	Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter)
acting optimally
Bellman Equation:
Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]
Q-Value Iteration:
Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]
n Q-value	iteration:
n Rewrite	as	expectation:	
n (Tabular)	Q-Learning:	replace	expectation	by	samples
n For a state-action pair (s, a), receive:
n Consider	your	old	estimate:
n Consider	your	new	sample	estimate:
n Incorporate	the	new	estimate	into	a	running	average:
(Tabular)	Q-Learning
Q_{k+1}(s,a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)} \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]

Sample: s' \sim P(s'|s,a)
Old estimate: Q_k(s,a)
New sample estimate: \mathrm{target}(s') = R(s,a,s') + \gamma \max_{a'} Q_k(s',a')
Running average: Q_{k+1}(s,a) \leftarrow (1-\alpha)\, Q_k(s,a) + \alpha \, \mathrm{target}(s')
(Tabular)	Q-Learning
Algorithm:
Start with Q_0(s, a) for all s, a.
Get initial state s
For k = 1, 2, … till convergence
  Sample action a, get next state s'
  If s' is terminal:
    target = R(s, a, s')
    Sample new initial state s'
  else:
    target = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')
  Q_{k+1}(s, a) \leftarrow (1 - \alpha)\, Q_k(s, a) + \alpha \, [target]
  s \leftarrow s'
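To make the loop concrete, here is a minimal Python sketch of tabular Q-learning. The environment interface (reset() and step(a) returning (s', r, done)) and all names are illustrative assumptions, not something defined on these slides.

import numpy as np
import random

def tabular_q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.5,
                       epsilon=0.1, num_steps=100_000):
    Q = np.zeros((n_states, n_actions))    # Q_0(s, a) = 0 for all s, a
    s = env.reset()                        # get initial state s
    for _ in range(num_steps):
        # epsilon-greedy action selection (see the next slide)
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)      # sample next state s'
        if done:
            target = r                     # terminal: no bootstrap term
        else:
            target = r + gamma * np.max(Q[s_next])
        # incorporate the sample into a running average
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = env.reset() if done else s_next
    return Q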
n Choose random actions?
n Choose the action that maximizes Q_k(s, a) (i.e., act greedily)?
n ε-Greedy: choose a random action with prob. ε, otherwise choose the action greedily
How to sample actions?
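As a small illustration, an ε-greedy action sampler over a tabular Q (a sketch; names are illustrative):

import random
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    # With probability epsilon explore; otherwise exploit the current Q estimates.
    n_actions = Q.shape[1]
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(Q[s]))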
n Amazing	result:	Q-learning	converges	to	optimal	policy	--
even	if	you’re	acting	suboptimally!
n This	is	called	off-policy	learning
n Caveats:
n You	have	to	explore	enough
n You	have	to	eventually	make	the	learning	rate
small	enough
n …	but	not	decrease	it	too	quickly
Q-Learning	Properties
n Technical	requirements.	
n All	states	and	actions	are	visited	infinitely	often
n Basically,	in	the	limit,	it	doesn’t	matter	how	you	select	actions	(!)
n Learning	rate	schedule	such	that	for	all	state	and	action	
pairs	(s,a):
Q-Learning	Properties
For details, see Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative
dynamic programming algorithms. Neural Computation, 6(6), November 1994.
\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty \qquad \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty
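As a concrete example (not from the slides), the schedule α_t(s,a) = 1/N_t(s,a), where N_t(s,a) counts how often (s,a) has been visited so far, satisfies both conditions: the harmonic series Σ_k 1/k diverges while Σ_k 1/k² converges.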
Q-Learning	Demo:	Gridworld
• States:	11	cells
• Actions:	{up,	down,	left,	right}
• Deterministic	transition	function
• Learning	rate:	0.5
• Discount:	1
• Reward:	+1	for	getting	diamond,	-1	for	falling	into	trap
Q-Learning	Demo:	Crawler
• States:	discretized	value	of	2d	state:	(arm	angle,	hand	angle)
• Actions:	Cartesian	product	of	{arm	up,	arm	down}	and	{hand	up,	hand	down}
• Reward:	speed	in	the	forward	direction
Sampling-Based	Approximation
n Q Value Iteration → (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
n Value	Iteration
n unclear how to draw samples through the max…
Value	Iteration	w/	Samples?
V^*_{i+1}(s) \leftarrow \max_a \mathbb{E}_{s' \sim P(s'|s,a)} \left[ R(s,a,s') + \gamma V^*_i(s') \right]
n Q Value Iteration → (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
Sampling-Based	Approximation
Recap:	Policy	Iteration
One	iteration	of	policy	iteration:
n Policy evaluation for current policy \pi_k:
n Iterate until convergence
V^{\pi_k}_{i+1}(s) \leftarrow \mathbb{E}_{s' \sim P(s'|s,\pi_k(s))} \left[ R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s') \right]
Can be approximated by samples
This is called Temporal Difference (TD) Learning
n Policy improvement: find the best action according to one-step look-ahead
\pi_{k+1}(s) \leftarrow \arg\max_a \mathbb{E}_{s' \sim P(s'|s,a)} \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]
Unclear what to do with the max (for now)
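A minimal Python sketch of sample-based policy evaluation (tabular TD(0)), under the same illustrative environment interface assumed in the Q-learning sketch above:

import numpy as np

def td0_policy_evaluation(env, policy, n_states, gamma=0.99, alpha=0.1,
                          num_steps=100_000):
    # policy maps state -> action (e.g. an array or dict); V estimates V^pi
    V = np.zeros(n_states)
    s = env.reset()
    for _ in range(num_steps):
        a = policy[s]
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * V[s_next]
        # temporal-difference update: move V(s) toward the sampled one-step target
        V[s] = (1 - alpha) * V[s] + alpha * target
        s = env.reset() if done else s_next
    return V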
n Q Value Iteration → (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy Evaluation → (Tabular) TD-learning
n Policy	Improvement	(for	now)
Sampling-Based	Approximation
n Optimal	Control
=	
given	an	MDP	(S, A, P, R, γ, H)
find	the	optimal	policy	π*
Quick	One-Slide	Recap
n Exact	Methods:
n Value	Iteration
n Policy	Iteration
Limitations:	
• Update	equations	require	access	to	dynamics	
model
• Iteration	over	/	Storage	for	all	states	and	actions:	
requires	small,	discrete	state-action	space
->	sampling-based	approximations
->	Q/V	function	fitting
n Discrete environments
Can tabular methods scale?
Tetris: ~10^60 states
Atari: ~10^308 states (RAM), ~10^16992 states (pixels)
Gridworld: ~10^1 states
n Continuous environments (by crude discretization)
Crawler: ~10^2 states
Hopper: ~10^10 states
Humanoid: ~10^100 states
Can tabular methods scale?
Generalizing	Across	States
n Basic	Q-Learning	keeps	a	table	of	all	q-values
n In	realistic	situations,	we	cannot	possibly	learn	
about	every	single	state!
n Too	many	states	to	visit	them	all	in	training
n Too	many	states	to	hold	the	q-tables	in	memory
n Instead,	we	want	to	generalize:
n Learn	about	some	small	number	of	training	states	from	
experience
n Generalize	that	experience	to	new,	similar	situations
n This	is	a	fundamental	idea	in	machine	learning,	and	
we’ll	see	it	over	and	over	again
n Instead	of	a	table,	we	have	a	parametrized	Q	function:
n Can	be	a	linear	function	in	features:	
n Or	a	complicated	neural	net
n Learning	rule:
n Remember:	
n Update:
Approximate	Q-Learning
Q_\theta(s, a)
Q_\theta(s, a) = \theta_0 f_0(s, a) + \theta_1 f_1(s, a) + \cdots + \theta_n f_n(s, a)
\mathrm{target}(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')
\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - \mathrm{target}(s') \right)^2 \right] \Big|_{\theta = \theta_k}
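A sketch of one such gradient step for a linear Q-function; the feature map f(s, a) and the other names are illustrative assumptions:

import numpy as np

def approx_q_step(theta, f, s, a, r, s_next, actions, gamma=0.99, alpha=0.01,
                  done=False):
    # Q_theta(s, a) = theta . f(s, a), with f(s, a) a feature vector
    q_sa = theta @ f(s, a)
    if done:
        target = r
    else:
        target = r + gamma * max(theta @ f(s_next, a2) for a2 in actions)
    # gradient of 1/2 (Q_theta(s,a) - target)^2 w.r.t. theta is (Q_theta(s,a) - target) * f(s,a)
    grad = (q_sa - target) * f(s, a)
    return theta - alpha * grad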
Connection	to	Tabular	Q-Learning
n Suppose \theta \in \mathbb{R}^{|S| \times |A|} and Q_\theta(s, a) \equiv \theta_{sa}
n Plug into update:
\nabla_{\theta_{sa}} \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - \mathrm{target}(s') \right)^2 \right] = \nabla_{\theta_{sa}} \left[ \tfrac{1}{2} \left( \theta_{sa} - \mathrm{target}(s') \right)^2 \right] = \theta_{sa} - \mathrm{target}(s')
\theta_{sa} \leftarrow \theta_{sa} - \alpha \left( \theta_{sa} - \mathrm{target}(s') \right) = (1 - \alpha)\, \theta_{sa} + \alpha \left[ \mathrm{target}(s') \right]
n Compare with the Tabular Q-Learning update:
Q_{k+1}(s, a) \leftarrow (1 - \alpha)\, Q_k(s, a) + \alpha \left[ \mathrm{target}(s') \right]
n state: naïve board configuration + shape of the falling piece: ~10^60 states!
n action:	rotation	and	translation	applied	to	the	falling	piece
n 22	features	aka	basis	functions	
n Ten	basis	functions,	0,	.	.	.	,	9,	mapping	the	state	to	the	height	h[k]	of	each	column.
n Nine	basis	functions,	10,	.	.	.	,	18,	each	mapping	the	state	to	the	absolute	difference	
between	heights	of	successive	columns:	|h[k+1]	−	h[k]|,	k	=	1,	.	.	.	,	9.
n One	basis	function,	19,	that	maps	state	to	the	maximum	column	height:	maxk h[k]
n One	basis	function,	20,	that	maps	state	to	the	number	of	‘holes’	in	the	board.
n One	basis	function,	21,	that	is	equal	to	1	in	every	state.
[Bertsekas &	Ioffe,	1996	(TD);	Bertsekas &	Tsitsiklis 1996	(TD);	Kakade 2002	(policy	gradient);	Farias &	Van	Roy,	2006	(approximate	LP)]
\hat{V}(s) = \sum_{i=0}^{21} \theta_i \phi_i(s) = \theta^\top \phi(s)
Engineered	Approximation	Example:	Tetris
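A sketch of this 22-dimensional feature vector, assuming the board has already been summarized into its 10 column heights and a hole count (helper names are illustrative):

import numpy as np

def tetris_features(column_heights, num_holes):
    # column_heights: heights h[0..9] of the 10 columns; num_holes: holes in the board
    h = list(column_heights)
    feats = []
    feats += h                                           # features 0..9: column heights
    feats += [abs(h[k + 1] - h[k]) for k in range(9)]    # features 10..18: |h[k+1] - h[k]|
    feats += [max(h)]                                    # feature 19: maximum column height
    feats += [num_holes]                                 # feature 20: number of holes
    feats += [1.0]                                       # feature 21: constant
    return np.array(feats)

# Value estimate: V_hat(s) = theta . phi(s), e.g. theta @ tetris_features(heights, holes)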
Deep	Reinforcement	Learning
Pong Enduro Beamrider Q*bert
• From	pixels	to	actions
• Same	algorithm	(with	effective	tricks)
• CNN	function	approximator,	w/	3M	free	parameters
n We have now covered enough material for Lab 1.
n Will	be	released	on	Piazza	by	this	afternoon.
n Covers	value	iteration,	policy	iteration,	and	tabular	Q-learning.
Lab	1
n The	bad:	it	is	not	guaranteed	to	converge…
n Even	if	the	function	approximation	is	expressive	enough	to	
represent	the	true	Q	function
Convergence	of	Approximate	Q-Learning
Two states x1 and x2, all rewards r = 0. Function approximator: V_\theta = [1\;\;2]\,\theta, i.e. V_\theta(x1) = \theta and V_\theta(x2) = 2\theta.
Simple	Example**
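A tiny numerical sketch of the divergence; it assumes that both states transition to x2 with reward 0 and that each iteration does a Bellman backup followed by a least-squares fit of θ (the exact transition structure on the slide may differ):

import numpy as np

phi = np.array([1.0, 2.0])    # features: V_theta(x1) = theta, V_theta(x2) = 2*theta
gamma, theta = 0.9, 1.0       # any nonzero initialization of theta

for k in range(10):
    # Bellman backup: both backed-up values equal gamma * V_theta(x2) = 2*gamma*theta
    targets = gamma * 2 * theta * np.ones(2)
    # least-squares fit of theta to the backed-up values: theta <- 1.2 * gamma * theta
    theta = (phi @ targets) / (phi @ phi)
    print(k, theta)
# The true values are 0, yet |theta| grows without bound whenever gamma > 5/6.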
n Definition. An operator G is a non-expansion with respect to a norm || . || if ||GQ − GQ'|| ≤ ||Q − Q'|| for all Q, Q'.
n Fact. If the operator F is a γ-contraction with respect to a norm || . || and the operator G is a non-expansion with respect to the same norm, then the sequential application of the operators G and F is a γ-contraction, i.e., ||GFQ − GFQ'|| ≤ γ ||Q − Q'||.
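The composition fact follows in one line, applying the non-expansion of G first and the contraction of F second: ||GFQ − GFQ'|| ≤ ||FQ − FQ'|| ≤ γ ||Q − Q'||.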
n Corollary. If	the	supervised	learning	step	is	a	non-expansion,	then	iteration	in	
value	iteration	with	function	approximation	is	a	γ-contraction,	and	in	this	case	
we	have	a	convergence	guarantee.
Composing	Operators**
n Examples:	
n nearest	neighbor	(aka	state	aggregation)
n linear	interpolation	over	triangles	
(tetrahedrons,	…)
Averager Function	Approximators Are	Non-Expansions**
Averager Function	Approximators Are	Non-Expansions**
Example	taken	from	Gordon,	1995
Linear Regression ☹ (not a non-expansion) **
n I.e.,	if	we	pick	a	non-expansion	function	approximator which	can	approximate	
J*	well,	then	we	obtain	a	good	value	function	estimate.
n To	apply	to	discretization:	use	continuity	assumptions	to	show	that	J*	can	be	
approximated	well	by	chosen	discretization	scheme.
Guarantees	for	Fixed	Point**