Asset Price Prediction with Machine Learning
Which variables matter for predicting S1?
When assessing this task, it is important to remember that this is time-series data. As such, and particularly with stock-related data, multicollinearity will most likely be an issue. This presents major problems for regression analysis, since multicollinearity inflates the variance of the estimated coefficients and makes them unreliable. In addition, we must recognize that including all of the variables in a model would lead to overfitting in sample and, subsequently, poor predictive performance on out-of-sample data. With these problems in mind, we remedy them with Principal Component Analysis.
Principal Component Analysis (PCA) is a statistical method used to reduce the dimensionality of a data set. Simply stated, we transform the data into new variables called principal components and eliminate the components that explain negligible amounts of the variance exhibited within the data set. The benefit of this technique is that we preserve the bulk of the variance of the data set while being able to perform visual and exploratory analysis much more easily than before the transformation. When forming the matrix of data on which we perform PCA, we remove S1, since it is the response variable, and retain columns S2 through S10.
After running principal component analysis on the first 50 rows of S2 through S10, we see the following:
Each row index represents a principal component number, and each value represents the percentage of the total variability that the component explains. In this experiment, our threshold for retaining a principal component is 1%. We notice that only the first 5 principal components meet this threshold, so we remove the remaining components. Translating this elimination back to the original data, we choose to keep columns S2 through S6 and eliminate the rest from our training data.
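A minimal sketch of this selection step, assuming the series live in a pandas DataFrame `df` with columns S1 through S10 (the names and layout are illustrative, not taken from the original data):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# First 50 rows of the predictors only (S1 is the response and is excluded).
X = df.loc[:49, "S2":"S10"]
X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to scale

pca = PCA().fit(X_std)
explained = pca.explained_variance_ratio_ * 100  # percent variance per component

keep = explained >= 1.0  # the 1% retention threshold used above
print(np.round(explained, 2))
print("Components retained:", keep.sum())
```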
		
Does S1 go up or down cumulatively (on an open-to-close basis) over this period?
S1 represents the daily open-to-close change of a stock. We find that S1 increases cumulatively over this 50-day period by 5.92 points. When observing the cumulative change in S1 over the first 50 days, we see the following:
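A minimal sketch of the cumulative calculation behind this figure, using the same hypothetical DataFrame `df` as above:

```python
# Running total of the daily open-to-close changes in S1.
cumulative = df.loc[:49, "S1"].cumsum()
print(f"Cumulative change over 50 days: {cumulative.iloc[-1]:.2f} points")
```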
What Techniques Did You Use? Why?
We began our experiment with principal component analysis and, from this technique, determined our explanatory variables to be S2 through S6. As stated previously, the benefit of this technique is that we preserve the variance of the data set while transforming it in a manner that lets us understand the contribution of each principal component to the total variance in the data. After the training data for the explanatory variables has been determined, we cross-validate by randomly sampling rows of the explanatory variables, together with the corresponding response values, from within the range of the training set. By doing this, we not only prevent overfitting but are also able to test our model on "new" data, which gives us a more realistic perspective on how it would perform on out-of-sample data.
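A hedged sketch of this random hold-out step; the use of scikit-learn's train_test_split (rather than hand-rolled index sampling) and the 20% split size are assumptions:

```python
from sklearn.model_selection import train_test_split

# Explanatory variables S2..S6 and response S1 from the 50-row training window.
X = df.loc[:49, "S2":"S6"].values
y = df.loc[:49, "S1"].values

# Randomly sampled hold-out rows play the role of "new" data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```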
		
Models Used to Predict S1
For this experiment, the following five models were chosen for evaluation. The scikit-learn module was used for several of the implementations, while one model was constructed manually in stepwise fashion. The models used are as follows:
		
a.    Ridge Regression - a method for analyzing multiple-regression data that suffers from multicollinearity (linear or near-linear relationships between explanatory variables). This regression deliberately introduces a small amount of bias, in exchange for which the standard errors are reduced, making the estimates more reliable than those of traditional regression methods. [scikit-learn]
		
b.    Support Vector Regression - a regression that uses kernels (functions that operate in feature space without explicitly computing the coordinates of the data, instead computing inner products between pairs of data points) to optimize the bounds of the regression. [scikit-learn]
c.    Kernel Ridge Regression - ridge regression in which the linear function is learned in the space induced by the chosen kernel. [scikit-learn]
d.    Neural Network using Ridge Regression - a system of "neurons" into which the data is fed, each carrying weights that are updated on every iteration of the algorithm. Ridge regression is used as the function within the neurons. [implemented manually]
		
e.    Stochastic Gradient Descent - finds a local minimum of a function by repeatedly stepping in the negative direction of the gradient (the direction of steepest descent). [scikit-learn]
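A hedged sketch of how the scikit-learn models above might be instantiated; the hyperparameter values are illustrative defaults, not the settings used in the original experiment, and the manually implemented ridge-based neural network is omitted:

```python
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge

models = {
    "Ridge Regression": Ridge(alpha=1.0),
    "Support Vector Regression": SVR(kernel="rbf", epsilon=0.1),
    "Kernel Ridge Regression": KernelRidge(kernel="rbf", alpha=1.0),
    "Stochastic Gradient Descent": SGDRegressor(max_iter=1000),
}
```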
For this experiment, we chose to run each of these algorithms for 100 trials. The reasoning behind this is to obtain a more reliable approximation of the following summary statistics for the sum of squared residuals (a sketch of the trial loop follows the list):
• Maximum
• Minimum
• Mean
• Standard deviation
• Range
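A minimal sketch of the 100-trial loop, assuming the `models` dictionary and the X, y arrays defined in the earlier sketches; each trial redraws the random hold-out split, refits the model, and records its sum of squared residuals (SSR):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def run_trials(model, X, y, n_trials=100):
    """Collect summary statistics of the hold-out SSR over repeated trials."""
    ssrs = []
    for _ in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        pred = model.fit(X_tr, y_tr).predict(X_te)
        ssrs.append(np.sum((y_te - pred) ** 2))
    ssrs = np.array(ssrs)
    return {"max": ssrs.max(), "min": ssrs.min(), "mean": ssrs.mean(),
            "std": ssrs.std(), "range": ssrs.max() - ssrs.min()}
```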
We shall also compute the R-squared value, which tells us how much of the variability in y can be explained by x. However, this value does not change from iteration to iteration, since only the orientation of the data is changing. Our objective is to choose the model with the lowest sum of squared residuals while also maximizing the R-squared value. Upon completion of the iterations, we observe the following:
Determining the Model to Choose and Why
We find that, in general, the Support Vector Regression performs best with respect to our objectives. Of all the models evaluated, it has the highest R-squared value, the lowest standard deviation of the sum of squared residuals, and the lowest maximum sum of squared residuals. While it has neither the lowest range nor the lowest minimum of the observed sums of squares, its differences from the best-performing models on these statistics are minimal.
The strong performance of the support vector regression is due in part to its epsilon-insensitive loss function, which ignores errors within a certain distance of the true value of a data point. Using this function, we achieve a global minimum while still retaining generalization within the bounds of the hyperplane or set of hyperplanes (the bounds within which we observe the given data, defined by the kernel). The model is robust and can handle both linear and nonlinear regression, making it a suitable choice for the task at hand. Be that as it may, our model is not perfect, and we must understand its limits, particularly within the context of financial data.
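In scikit-learn, the epsilon-insensitive tube is exposed through SVR's epsilon parameter. A sketch using the hold-out split from the earlier example; the parameter values are illustrative, not those used in the experiment:

```python
from sklearn.svm import SVR

# Residuals smaller than epsilon incur no penalty in the loss.
svr = SVR(kernel="rbf", epsilon=0.1, C=1.0)
svr.fit(X_train, y_train)
print("R-squared on hold-out:", svr.score(X_test, y_test))
```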
		
How Much Confidence Do You Have in Your Model? Why and When Would It Fail?
As stated previously, financial data presents many problems that must be accounted for. When examining the volatility of S1 in our training data set, we observe the following:
where
Y-axis: F-M, Tu-Th, and Total represent Fridays and Mondays, Tuesdays through Thursdays, and all days, respectively.
X-axis: Vol, #Days, %Days, and SSRs represent volatility, number of days, percentage of days, and the sum of squared residuals for a particular iteration.
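Before interpreting these figures, here is a sketch of how the day-bucket volatilities might be computed, assuming a hypothetical `weekday` column (0 = Monday through 4 = Friday) alongside S1 in `df`:

```python
# Group the daily changes into the F-M and Tu-Th day buckets.
fri_mon = df[df["weekday"].isin([4, 0])]["S1"]
tue_thu = df[df["weekday"].isin([1, 2, 3])]["S1"]

print(f"F-M volatility:   {fri_mon.std():.4f}")
print(f"Tu-Th volatility: {tue_thu.std():.4f}")
```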
It is worth noting that F-M comprises 10 two-day pairs and Tu-Th 10 three-day pairs. We can see that there is more variability on Fridays and Mondays than on Tuesdays through Thursdays. Cumulatively, however, the most inaccurate predictions in this observation come from the Tuesday-through-Thursday period. Below, we observe the actual S1 in red and our predicted S1 in green for the training period (not the cross-validated data):
The algorithm does perform well with respect to its predictive ability; however, the technique still has shortcomings. The main one is that the model generally overestimates returns slightly, with moderate variability in the residuals. This is most likely due to the kernel we selected, which is nonlinear. Different kernels produce different hyperplanes, and therefore different predictions. In general, we would like to keep our models more generalized for out-of-sample prediction, but support vector regression is noted for often requiring careful kernel selection to achieve better predictive results. Determining which kernel to choose would require a significant amount of time, and whatever we gain in more accurate in-sample predictions we trade away in the general accuracy of the model, particularly on out-of-sample data.
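One hedged way to automate that kernel search is a cross-validated grid search; the parameter grid below is illustrative, and nothing in the original experiment implies these exact values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Exhaustively score each kernel/epsilon combination with 5-fold CV.
grid = GridSearchCV(
    SVR(),
    {"kernel": ["linear", "rbf", "poly"], "epsilon": [0.05, 0.1, 0.2]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train, y_train)
print("Best kernel settings:", grid.best_params_)
```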
As for when this model would operate best, that would likely be when the systemic factors within the market stay the same, so that the chosen kernel remains appropriate across the entire data set. In periods such as 2008, this model would likely be less useful than in periods of relative stability, when the market is trading "sideways." In conclusion, support vector regression on our reduced data set (via principal component analysis) is the best model for our regression, but fine-tuning appropriate to the situation, such as the choice of kernel, is still required. So long as the model is used in periods in which systemic factors are constant, its predictive power is significantly enhanced, and it is therefore recommendable as a component of a decision-making process.
