Reinforcement Learning (RL)
Mehdi Elahi
Free University of Bozen / Bolzano

www.linkedin.com/in/mehdielahi
Introduction
§  Supervised Learning:
	the input-output examples are known
§  Reinforcement Learning:
	the input-output examples are unknown;
	instead, rewards and punishments are known
Motivation
•  Typical Active Learning
[Figure: Mean Absolute Error (MAE) vs. number of iterations for Strategy 1 and Strategy 2]
More info on Active Learning:
Rubens, Neil; Elahi, Mehdi; Sugiyama, Masashi; Kaplan, Dain. "Active Learning in Recommender Systems." Recommender Systems Handbook, Springer US (2015).
Motivation
•  Adaptive Active Learning
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the Adaptive Strategy, with the switching point marked]
More info on Adaptive Active Learning:
Elahi, Mehdi; Ricci, Francesco; Rubens, Neil. "A Survey of Active Learning in Collaborative Filtering Recommender Systems." Computer Science Review (2016).
Motivation
•  Adaptive Active Learning
[Figure: MAE vs. number of iterations for the Adaptive Strategy alone]
More info on Adaptive Active Learning:
Elahi, Mehdi; Ricci, Francesco; Rubens, Neil. "A Survey of Active Learning in Collaborative Filtering Recommender Systems." Computer Science Review (2016).
n-Armed Bandit

§  A slot machine with n arms, each of which gives a different reward
§  In every play, we should find the best arm to maximize the total reward

Predict the next reward → Choose the best arm → Learn from the reward
Example

§  Example:

	1st play	2nd play	3rd play
§  Every play is an Action (a)
§  Then the system makes a transition to the next State (s)
§  In every play a reward (r) is given, based on the chosen arm
§  How to play is a Policy (π), which maps states to actions
Action Value

§  Action value Qt(a):
	the estimated value of an action (a)
§  This method is called sample-average:

	$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$

	where $Q_t(a)$ is the action value at time t, $r_i$ are the rewards, and $k_a$ is the number of times action a is chosen
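As a minimal sketch (the function name is illustrative, not from the slides), the sample-average estimate is just the mean of the rewards observed so far for an arm:

```python
def sample_average(rewards):
    """Q_t(a): the mean of all rewards received so far for action a."""
    return sum(rewards) / len(rewards)

print(sample_average([1.0, 0.0, 2.0]))  # 1.0
```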
Optimal Action Value

§  Optimal action value Q*(a):
	the true value of an action (a)

§  The law of large numbers guarantees the convergence:

	$\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$
Estimation

$$
\begin{aligned}
Q_{k+1} &= \frac{1}{k+1}\sum_{i=1}^{k+1} r_i \\
&= \frac{1}{k+1}\Big[r_{k+1} + \sum_{i=1}^{k} r_i\Big] \\
&= \frac{1}{k+1}\big[r_{k+1} + kQ_k + Q_k - Q_k\big] \\
&= \frac{1}{k+1}\big[r_{k+1} + (k+1)Q_k - Q_k\big] \\
&= Q_k + \frac{1}{k+1}\big[r_{k+1} - Q_k\big]
\end{aligned}
$$
New estimation = Old estimation + Step size × [Target − Old estimation]
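This derivation means the sample average can be maintained incrementally, without storing past rewards. A minimal Python sketch (names are illustrative):

```python
def incremental_update(q_old, reward, k):
    """New estimate = old estimate + step size * (target - old estimate).

    k is the number of rewards already averaged into q_old,
    so the step size is 1 / (k + 1)."""
    return q_old + (reward - q_old) / (k + 1)

# Equivalent to averaging all rewards directly:
q = 0.0
for k, r in enumerate([1.0, 0.0, 2.0]):
    q = incremental_update(q, r, k)
print(q)  # 1.0, the mean of [1.0, 0.0, 2.0]
```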
Challenges
§  Stationary and non-stationary problems
§  Exploration and exploitation
§  Reinforcement comparison
Stationary and non-Stationary
§  If the rewards are fixed, we have a stationary problem.
§  You can keep what you have learned.
§  But in many cases the rewards change over time.
	This is called a non-stationary problem.
§  This means that once you have learned, you cannot keep it forever.
[Figure: rewards (r) vs. number of plays (t), an example of a non-stationary problem]
Stationary and non-Stationary

$$
\begin{aligned}
Q_k &= Q_{k-1} + \alpha\,[r_k - Q_{k-1}] \\
&= \alpha r_k + (1-\alpha)Q_{k-1} \\
&= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 Q_{k-2} \\
&= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 \alpha r_{k-2} + \dots + (1-\alpha)^{k-1}\alpha r_1 + (1-\alpha)^k Q_0 \\
&= (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha(1-\alpha)^{k-i} r_i, \qquad 0 < \alpha \le 1
\end{aligned}
$$
Weight
§  Introducing the weight (α):
	weights recent rewards more heavily than long-past ones.
This method is called the exponential, recency-weighted average.
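A minimal sketch of the constant-step-size update (function name and reward sequence are illustrative), showing how a recent change in reward dominates the estimate:

```python
def constant_alpha_update(q_old, reward, alpha):
    """Exponential, recency-weighted average: recent rewards weigh more."""
    return q_old + alpha * (reward - q_old)

q = 0.0
for r in [1.0, 1.0, 5.0]:   # the reward jumps at the end
    q = constant_alpha_update(q, r, alpha=0.9)
print(q)  # ~4.6: the estimate tracks the most recent reward closely
```

With a large α the estimate adapts quickly to a non-stationary reward; with a small α it averages over a longer history.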
Exploration and Exploitation
§  Exploit:
	use what has already been learned in order to obtain a better reward
	Example: choosing the best action

§  Explore:
	learn from what has not been selected before, by trying other possible options
	Example: choosing a random action
Exploration and Exploitation
§  ε-Greedy:

	$a_t^* = \arg\max_a Q_t(a)$  (the best action: the maximum action value)

	$a_t = a_t^*$ with probability $1-\varepsilon$  (choosing the best action: Exploitation)
	$a_t =$ random action with probability $\varepsilon$  (choosing a not-the-best action: Exploration)

§  Softmax:
	Example: the Boltzmann method
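The ε-greedy rule can be sketched in a few lines of Python (names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability 1 - epsilon exploit (argmax Q); else explore (random arm)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploitation

print(epsilon_greedy([0.2, 1.5, 0.7], epsilon=0.0))  # 1 (purely greedy)
```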
Performance
[Figure: average performance of ε-greedy methods vs. the greedy method]
Boltzmann Distribution

$$
P_t(a) = \frac{e^{Q_t(a)/T}}{\sum_{b=1}^{n} e^{Q_t(b)/T}}
$$

Gives the probability of choosing the action a at the play t; T is the Temperature.

Lower temperature → more greedy action selection → more Exploitation
Higher temperature → less greedy action selection → more Exploration
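A minimal sketch of the Boltzmann (softmax) probabilities, illustrating the temperature effect described above (function name is illustrative):

```python
import math

def boltzmann_probs(q_values, temperature):
    """P(a) = exp(Q(a)/T) / sum_b exp(Q(b)/T)."""
    exps = [math.exp(q / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

# Low temperature -> nearly greedy (exploitation);
# high temperature -> nearly uniform (exploration).
print(boltzmann_probs([1.0, 2.0], temperature=0.1))    # second arm gets almost all the mass
print(boltzmann_probs([1.0, 2.0], temperature=100.0))  # close to [0.5, 0.5]
```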
Reinforcement Comparison
§  We know that in RL:
	actions with large rewards should be followed more often than actions with small rewards
§  If the reward is 5, is it large or small?
§  A natural reference reward is the average of previously received rewards:
	large rewards > reference reward
	small rewards < reference reward

§  Methods based on this idea are called Reinforcement Comparison
§  This method:
	introduces the probability of choosing an action into the action-selection process:
	high rewards should increase the probability of reselecting the action that was taken
More Challenges
§  Optimistic Initial Value
§  Associative Search
Optimistic Initial Value
§  Imagine we set the initial action value very high (say +5 instead of 0)
§  Whatever action is chosen, the next action value would be less than +5
§  The system may be DISAPPOINTED!!
§  It ends up with temporary exploration actions
	Well-suited for stationary problems
	But not for non-stationary problems
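The effect can be sketched with a purely greedy learner and deterministic rewards (both are simplifying assumptions for illustration; names are invented):

```python
def greedy_plays(true_rewards, q_init, plays=20):
    """Purely greedy selection with sample-average updates; returns the arms tried."""
    q = [float(q_init)] * len(true_rewards)
    counts = [0] * len(true_rewards)
    tried = set()
    for _ in range(plays):
        a = max(range(len(q)), key=q.__getitem__)      # always pick the current best
        tried.add(a)
        counts[a] += 1
        q[a] += (true_rewards[a] - q[a]) / counts[a]   # sample-average update
    return tried

print(greedy_plays([1.0, 0.5, 2.0], q_init=5.0))  # {0, 1, 2}: every arm gets "disappointing" first pulls
print(greedy_plays([1.0, 0.5, 2.0], q_init=0.0))  # {0}: stuck on the first arm it tried
```

With the optimistic start, each pull lowers that arm's estimate below +5, so the greedy rule moves on to untried arms; with a zero start, the first rewarding arm looks best forever.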
Associative Search
§  Associative:
	inputs are mapped to outputs; learn the best output for each input
§  Non-Associative:
	learn (find) one best output
	Example: the bandit machine is changing over time
Preliminary Result
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the RL Strategy]
No averaging
ε = 0.1
α = 0.9
Exploration
Non-stationary problem
Preliminary Result
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the RL Strategy]
5-fold averaging
ε = 0.1
α = 0.9
Non-stationary problem
Preliminary Result
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the RL Strategy]
10-fold averaging
ε = 0.1
α = 0.9
Non-stationary problem
RL References
•  R. S. Sutton et al., Reinforcement Learning: An Introduction, The MIT Press, Cambridge
•  B. Bakker, Decision Making in Intelligent Systems, Lecture 2, UvA, Amsterdam
•  C. Rothkopf, N-Armed Bandit Problems, FIAS
Thank	you!	
www.linkedin.com/in/mehdielahi
