Kaggle	Challenge:
Predicting	Housing	Prices
BEN	BRUNSON,	NICHOLAS MALOOF,	AARON OWEN,	JOSH	YOON
Introduction
Challenge:	Predicting	Housing	Prices	in	Ames,	Iowa	using	various	machine	
learning	techniques
Data:
◦ Train	Data	Set:	1460	Observations	x	80	Variables	(Including	Response	Variable:	Sale	Price)
◦ Test	Data	Set:	1459	Observations	x	79	Variables
Useful	Links:
◦ Kaggle Homepage: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
◦ Data Description: https://storage.googleapis.com/kaggle-competitions-data/kaggle/5407/data_description.txt
Understanding	the	Data
Total	Predictor	Variables	Provided:	79	
◦ Continuous	Variables:	28
◦ Categorical	Variables:	51
Combined the train and test data sets to get a holistic view of each variable
◦ (e.g., total missing values, total categories in each categorical variable)
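A minimal sketch of this combined view, assuming pandas and a toy pair of data frames. The helper name `summarize_variables` and the sample values are hypothetical; only the idea of stacking train and test predictors comes from the slide.

```python
import numpy as np
import pandas as pd


def summarize_variables(train, test, response="SalePrice"):
    """Stack the predictors of both sets and summarize every column."""
    predictors = pd.concat(
        [train.drop(columns=[response]), test], axis=0, ignore_index=True
    )
    summary = pd.DataFrame({
        "dtype": predictors.dtypes.astype(str),
        "n_missing": predictors.isna().sum(),          # total missing values
        "n_unique": predictors.nunique(dropna=True),   # total categories/levels
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy data standing in for the real train/test files (illustrative values only).
train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0],
    "PoolQC": [np.nan, "Gd", np.nan],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "LotArea": [11622.0, np.nan],
    "PoolQC": [np.nan, "Ex"],
})
summary = summarize_variables(train, test)
print(summary)
```

Sorting by missing count puts the variables that need the most attention at the top.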
Processing	the	Data:	
Response	Variable
Treat the response variable with a log(1 + x) transformation
Remember to invert the transformation (exp(x) − 1) before submitting predictions to Kaggle
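The transformation and its inverse can be sketched with NumPy's `log1p` and `expm1`. The prices below are illustrative, and the model-fitting step is left out.

```python
import numpy as np

# Illustrative sale prices in dollars.
sale_price = np.array([208500.0, 181500.0, 223500.0])

# Forward transform applied to the response before fitting a model.
log_price = np.log1p(sale_price)

# ... a model would be trained on log_price and would predict on that scale ...

# Inverse transform applied to predictions before writing the submission file.
recovered = np.expm1(log_price)
print(recovered)
```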
Processing	the	Data:	
Overview	of	Missingness
34 predictors with missing values
Processing	the	Data:
Handling	Missing	Data
1)	Are	data	really	missing?
Ex.	Pool	Quality
◦ 2909 out of 2919 observations have “NA” values
◦ Most	NAs	are	due	to	houses	not	having	pools
◦ Solution:
◦ Replace	(most)	NAs	with	new	category:	“None”
2)	Not	all	NA	values	indicate	a	missing	feature
Ex.	Pool	Quality
◦ Solution: Use a related numerical variable (Pool Area) to impute the categorical variable
◦ Calculate the average pool area of each class within Pool Quality and use it to fill the NAs of houses that do have a pool
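One way to sketch both Pool Quality steps, assuming pandas. The column names `PoolQC` and `PoolArea`, the helper name, and the rule of picking the class whose average area is closest are assumptions made for illustration; the slides do not spell out the exact fill rule.

```python
import numpy as np
import pandas as pd


def impute_pool_quality(df, quality_col="PoolQC", area_col="PoolArea"):
    """Fill missing pool quality using pool area as the related numeric variable."""
    out = df.copy()
    missing = out[quality_col].isna()

    # Step 1: a missing quality with zero pool area means there is no pool.
    out.loc[missing & (out[area_col] == 0), quality_col] = "None"

    # Step 2: average pool area of each observed quality class.
    class_means = out.loc[~missing].groupby(quality_col)[area_col].mean()

    # Assumed rule: assign the class whose average area is closest.
    still_missing = missing & (out[area_col] > 0)
    for idx in out.index[still_missing]:
        area = out.at[idx, area_col]
        out.at[idx, quality_col] = (class_means - area).abs().idxmin()
    return out


# Illustrative data: the last row has a pool but no recorded quality.
pools = pd.DataFrame({
    "PoolArea": [0, 0, 500, 520, 800, 780, 560],
    "PoolQC": [np.nan, np.nan, "Gd", "Gd", "Ex", "Ex", np.nan],
})
imputed_pools = impute_pool_quality(pools)
print(imputed_pools)
```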
2)	Not	all	NA	values	indicate	a	missing	feature
Ex. Sale Type (1 missing observation, but we know its Sale Condition)
◦ Solution: Use related categorical variables to impute
◦ Among observations whose Sale Condition is “Normal”, by far the most common Sale Type is “WD”, so we impute “WD”.
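The Sale Type example can be sketched as a group-wise mode fill. The helper name and the toy rows are hypothetical, and the column names `SaleType` and `SaleCondition` are assumed to match the Kaggle files.

```python
import numpy as np
import pandas as pd


def impute_by_group_mode(df, target, group):
    """Fill missing `target` values with the most common value in the same `group`."""
    out = df.copy()
    # Most frequent observed target value per group (NaN if a group has none).
    group_mode = out.groupby(group)[target].agg(
        lambda s: s.mode().iloc[0] if s.notna().any() else np.nan
    )
    missing = out[target].isna()
    out.loc[missing, target] = out.loc[missing, group].map(group_mode)
    return out


# Illustrative rows: the last sale has a known condition but no sale type.
sales = pd.DataFrame({
    "SaleCondition": ["Normal", "Normal", "Normal", "Abnorml", "Normal"],
    "SaleType": ["WD", "WD", "COD", "COD", np.nan],
})
imputed_sales = impute_by_group_mode(sales, target="SaleType", group="SaleCondition")
print(imputed_sales)
```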
3)	Use	domain	knowledge
Ex.	Lot	Frontage	(486	NAs)
◦ Houses in close proximity likely have similar lot areas
◦ Solution: Use a categorical variable to impute a numerical one
◦ Use the median Lot Frontage by neighborhood to impute missing values
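A short sketch of the neighborhood-median fill, assuming pandas and the column names `LotFrontage` and `Neighborhood`. The fallback to the overall median for neighborhoods with no observed frontage is an added assumption, not something stated in the slides.

```python
import numpy as np
import pandas as pd


def impute_by_group_median(df, target="LotFrontage", group="Neighborhood"):
    """Fill missing numeric values with the median of the same group."""
    out = df.copy()
    group_median = out.groupby(group)[target].transform("median")
    out[target] = out[target].fillna(group_median)
    # Assumed fallback: overall median when a whole group is missing.
    out[target] = out[target].fillna(out[target].median())
    return out


# Illustrative data: neighborhood C has no observed frontage at all.
lots = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan, np.nan],
})
imputed_lots = impute_by_group_median(lots)
print(imputed_lots)
```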
4) Variables with little to no relation to other variables
Ex. Electrical
◦ Solution: Impute with the most commonly occurring class within the variable
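A mode fill of this kind might look as follows, assuming pandas. The helper name and the sample values are illustrative.

```python
import numpy as np
import pandas as pd


def impute_with_mode(series):
    """Replace missing values with the most frequent observed value."""
    return series.fillna(series.mode().iloc[0])


# Illustrative values for the Electrical variable.
electrical = pd.Series(["SBrkr", "SBrkr", "FuseA", np.nan, "SBrkr"])
filled_electrical = impute_with_mode(electrical)
print(filled_electrical)
```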
Processing	the	Data:	
Categorical	Variables	(Ordinal)
Some	machine	learning	algorithms	cannot	handle	non-numerical	values
Ex.	Kitchen	Quality
◦ Solution:	Use	average	Sale	Price	to	assign	ordered	numerical	values	to	categories
('None'	=	0,	'Po'	=	1,	'Fa'	=	2,	'TA'	=	3,	'Gd'	=	4,	'Ex'	=	5)
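The ordinal mapping can be sketched as a dictionary lookup, with the average sale price per class used as a sanity check on the ordering. The column name `KitchenQual` and the sample prices are assumptions for illustration.

```python
import pandas as pd

# Ordered scale from the slide.
quality_scale = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Illustrative data.
kitchens = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa", "TA", "Gd"],
    "SalePrice": [140000, 210000, 330000, 105000, 150000, 200000],
})

# Average sale price per class, sorted ascending, to confirm the ordering.
mean_price = kitchens.groupby("KitchenQual")["SalePrice"].mean().sort_values()

# Replace each category with its ordered numerical value.
kitchens["KitchenQualNum"] = kitchens["KitchenQual"].map(quality_scale)
print(mean_price)
print(kitchens)
```

If the classes sorted by average price come out in the same order as the scale, the numerical codes preserve the price ranking.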
Processing	the	Data:
Categorical	Variables	(Nominal)
Some	machine	learning	algorithms	cannot	handle	non-numerical	values
Ex.	Land	Contour
◦ Solution: One-hot encoding, which converts each class of a variable into its own binary (0/1) column
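A minimal one-hot encoding sketch using `pandas.get_dummies`. The column name `LandContour` and its levels are assumed to match the data description; the rows are illustrative.

```python
import pandas as pd

# Illustrative values for the Land Contour variable.
land = pd.DataFrame({"LandContour": ["Lvl", "Bnk", "HLS", "Low", "Lvl"]})

# One binary column per class; each row has exactly one 1.
encoded = pd.get_dummies(land, columns=["LandContour"], dtype=int)
print(encoded)
```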
Processing	the	Data:
Outliers
Some observations may be abnormally far from other values
Ex. Ground Living Area vs Sale Price
◦ Two points with very large area but very low sale price
◦ Solution: Remove outliers
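A possible filter for such points, assuming pandas. The thresholds (more than 4000 square feet and under $300,000) are assumptions chosen for illustration, since the slides do not give the exact cutoffs; the column names `GrLivArea` and `SalePrice` are also assumed.

```python
import pandas as pd


def drop_area_price_outliers(df, area_col="GrLivArea", price_col="SalePrice",
                             min_area=4000, max_price=300000):
    """Drop rows with a very large living area but a very low sale price."""
    is_outlier = (df[area_col] > min_area) & (df[price_col] < max_price)
    return df.loc[~is_outlier].reset_index(drop=True)


# Illustrative rows: two large, cheap houses and one large, expensive house.
houses = pd.DataFrame({
    "GrLivArea": [1710, 4676, 5642, 4316],
    "SalePrice": [208500, 184750, 160000, 755000],
})
cleaned = drop_area_price_outliers(houses)
print(cleaned)
```

Large houses that sold at a high price are kept, because they follow the overall trend.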
Processing	the	Data:
Skewness	and	Scaling
Distributions	of	some	variables	may	be	highly	skewed
Ex.	Lot	Area
◦ Solution: log(1 + x) transformation
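A sketch of transforming only the skewed columns, assuming pandas and NumPy. The skewness threshold of 0.75 is an assumption; the slides do not state which cutoff, if any, was used.

```python
import numpy as np
import pandas as pd


def log_transform_skewed(df, columns, threshold=0.75):
    """Apply log(1 + x) to the columns whose absolute skewness exceeds `threshold`."""
    out = df.copy()
    skewness = out[columns].skew()
    skewed_cols = skewness[skewness.abs() > threshold].index.tolist()
    out[skewed_cols] = np.log1p(out[skewed_cols])
    return out, skewed_cols


# Illustrative data: LotArea is strongly right-skewed, Symmetric is not.
features = pd.DataFrame({
    "LotArea": [1000.0, 2000.0, 4000.0, 8000.0, 16000.0, 32000.0, 64000.0, 128000.0],
    "Symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
transformed, skewed_cols = log_transform_skewed(features, ["LotArea", "Symmetric"])
print(skewed_cols)
print(transformed.skew())
```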
Processing	the	Data:
Near	Zero	Variance	Predictors
Low variance predictors add little value to models
◦ Calculate the ratio of the most frequent value to the second most frequent value
◦ Ratios >> 1 suggest very low variance
◦ Solution: Remove near-zero-variance predictors whose frequency ratio exceeds the 95:5 cutoff
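The frequency-ratio check can be sketched in pandas as below. This covers only the 95:5 ratio described above; the helper names and the toy columns are hypothetical, and treating a column with a single value as near zero variance is an added assumption.

```python
import pandas as pd


def frequency_ratio(series):
    """Ratio of the most frequent value's count to the second most frequent's."""
    counts = series.value_counts(dropna=True)
    if len(counts) < 2:
        return float("inf")  # assumed: a single-valued column has no variance
    return counts.iloc[0] / counts.iloc[1]


def near_zero_variance_columns(df, cutoff=95 / 5):
    """Names of the columns whose frequency ratio exceeds the cutoff."""
    return [col for col in df.columns if frequency_ratio(df[col]) > cutoff]


# Illustrative data: Street is dominated by one class, LotShape is not.
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "LotShape": ["Reg"] * 60 + ["IR1"] * 40,
})
to_drop = near_zero_variance_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```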
Processing	the	Data:
Numerical	Variables
As expected, important quantitative factors to consider are space/size, date, and overall quality.
[Figure: Top 10 numerical variables with greatest covariance vs. SalePrice]
Feature	Engineering
Ideas	for	new	features:
◦ Remodeled – Year Built not equal to Year of Remodel/Addition
◦ Seasonality – combine Month Sold and Year Sold
◦ New House – Year Built same as Year Sold
◦ Total Area – sum of all variables denoting square footage
◦ Inside Area – sum of all variables denoting square footage inside the house
◦ Overall Basement – Basement Quality and Basement Condition
◦ Overall Condition – Condition 1 and Condition 2
◦ Overall Quality – External Quality and External Condition
◦ Overall Sale – Sale Type and Sale Condition
◦ Sale and Condition – Sale Type and Overall Condition
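A few of these ideas could be sketched as follows, assuming pandas. The column names are assumed to match the Kaggle files, and the exact way the variables were combined (string concatenation, which area columns were summed) is an assumption, not something the slides specify.

```python
import pandas as pd


def add_engineered_features(df):
    """Add a handful of the proposed features to a copy of the data."""
    out = df.copy()
    # Remodeled: remodel year differs from construction year.
    out["Remodeled"] = (out["YearBuilt"] != out["YearRemodAdd"]).astype(int)
    # New House: sold in the year it was built.
    out["NewHouse"] = (out["YearBuilt"] == out["YrSold"]).astype(int)
    # Seasonality: year and zero-padded month combined into one label.
    out["SeasonSold"] = (
        out["YrSold"].astype(str) + "-" + out["MoSold"].astype(str).str.zfill(2)
    )
    # Inside Area: assumed here to be basement plus first and second floor.
    out["InsideArea"] = out[["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]].sum(axis=1)
    # Overall Sale: sale type and sale condition joined into one category.
    out["OverallSale"] = out["SaleType"] + "_" + out["SaleCondition"]
    return out


# Illustrative rows.
raw = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2007],
    "YearRemodAdd": [2003, 1990, 2007],
    "YrSold": [2008, 2007, 2007],
    "MoSold": [2, 5, 8],
    "TotalBsmtSF": [856, 1262, 1000],
    "1stFlrSF": [856, 1262, 1000],
    "2ndFlrSF": [854, 0, 0],
    "SaleType": ["WD", "WD", "New"],
    "SaleCondition": ["Normal", "Abnorml", "Partial"],
})
engineered = add_engineered_features(raw)
print(engineered[["Remodeled", "NewHouse", "SeasonSold", "InsideArea", "OverallSale"]])
```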
[Figure: Top 40 features by relative importance, Gradient Boost]
[Figure: Top 40 features by relative importance, XGBoost]
[Figure: Coefficients of top 40 predictors, Regularized Linear Regression]
Models
Random Forest
◦ Pros: Lower variance, decorrelates data, scale invariant
◦ Cons: High bias, difficult to interpret
◦ Hyperparameters: Num features = 48, Num trees = 1000
◦ Cross-validated RMSE score: 0.14997
◦ Kaggle score: 0.14758
Gradient Boost
◦ Pros: Feature scaling not needed, high accuracy
◦ Cons: Computationally expensive, overfitting
◦ Hyperparameters: Num trees = 1000, Depth = 2, Num features = sqrt, Samples/leaf = 15, Learning rate = 0.05
◦ Cross-validated RMSE score: 0.1128
◦ Kaggle score: 0.12421
XGBoost
◦ Pros: Extremely fast, allows parallel computing
◦ Cons: Difficult to interpret, overfits vs gradient boosting
◦ Hyperparameters: Num trees = 2724, Max depth = 30, Gamma = 0.0, Minimum child weight = 4
◦ Cross-validated RMSE score: 0.13642
◦ Kaggle score: 0.13082
Regularized Linear Regression
◦ Pros: Easily interpretable, computationally inexpensive, less prone to overfitting
◦ Cons: Requires scaled variables, requires numerical variables
◦ Hyperparameters: Lambda = 0.0005, Alpha = 0.9
◦ Cross-validated RMSE score: 0.1111
◦ Kaggle score: 0.11922
Ensembling
◦ Pros: Can improve accuracy
◦ Cons: Lose interpretability
◦ Models combined: Lasso, Enet, Gradient Boost, Gradient Boost Lite
◦ Cross-validated RMSE score: 0.1071
◦ Kaggle score: 0.11751
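To illustrate why ensembling can improve accuracy, here is a small sketch of scoring on the log scale and blending two sets of predictions. Plain averaging is an assumption (the slides do not say how the models were combined), and the predictions are made-up numbers, not output from the models above.

```python
import numpy as np


def rmse_log(actual_price, predicted_price):
    """RMSE between log(1 + price) values."""
    diff = np.log1p(predicted_price) - np.log1p(actual_price)
    return float(np.sqrt(np.mean(diff ** 2)))


def average_ensemble(log_predictions, weights=None):
    """Blend several models' log-scale predictions by (weighted) averaging."""
    return np.average(np.vstack(log_predictions), axis=0, weights=weights)


# Illustrative prices and two hypothetical models with different errors.
actual = np.array([200000.0, 150000.0, 300000.0])
log_actual = np.log1p(actual)
log_pred_a = log_actual + np.array([0.10, -0.05, 0.08])
log_pred_b = log_actual + np.array([-0.06, 0.09, -0.02])

log_blend = average_ensemble([log_pred_a, log_pred_b])

rmse_a = rmse_log(actual, np.expm1(log_pred_a))
rmse_b = rmse_log(actual, np.expm1(log_pred_b))
rmse_blend = rmse_log(actual, np.expm1(log_blend))
print(rmse_a, rmse_b, rmse_blend)
```

When the individual models err in different directions, their errors partly cancel in the average, which is the effect the ensemble relies on.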
Conclusions
Prediction
Our RMSE yields an error of ≈ ±$9,000 on the average sale price ($181,000)
What	Drives	Sale	Price?
Size,	Age
Overall	Quality/Condition
Neighborhood	(both	good	and	bad)
Commercial	Zone
Year	sold	(housing	crash)
Questions?
