We're so skewed_presentation

Kaggle Challenge:
Predicting Housing Prices
BEN BRUNSON, NICHOLAS MALOOF, AARON OWEN, JOSH YOON

Introduction
Challenge: Predicting Housing Prices in Ames, Iowa using various machine
learning techniques
Data:
◦ Train Data Set: 1460 Observations x 80 Variables (Including Response Variable: Sale Price)
◦ Test Data Set: 1459 Observations x 79 Variables
Useful Links:
◦ Kaggle Homepage: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
◦ Data Description: https://storage.googleapis.com/kaggle-competitions-data/kaggle/5407/data_description.txt

Understanding the Data
Total Predictor Variables Provided: 79
◦ Continuous Variables: 28
◦ Categorical Variables: 51
Combined test and train data sets to get a holistic view of each variable
◦ (i.e., total missing values, total categories in categorical variable)

Processing the Data:
Response Variable

Treat the response variable with a log(1 + x) transformation
Remember to invert the transformation, exp(x) - 1, before submitting to Kaggle
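A minimal sketch of the forward and inverse transformation, assuming NumPy is used; the prices below are made-up values for illustration, not rows from the data set.

```python
import numpy as np

# Hypothetical sale prices in dollars, for illustration only.
sale_price = np.array([100000.0, 181000.0, 250000.0])

# Forward transform applied to the response before fitting: log(1 + y).
y_log = np.log1p(sale_price)

# Inverse transform applied to predictions before submission: exp(p) - 1.
recovered = np.expm1(y_log)
```

Using `log1p` and `expm1` as a pair keeps the round trip exact up to floating-point error.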

Overview of Missingness
34 predictors with missing values

Handling Missing Data
1) Are data really missing?
Ex. Pool Quality
◦ 2909 out of 2919 observations have “NA” values
◦ Most NAs are due to houses not having pools
◦ Solution: Replace (most) NAs with new category: “None”
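One way this could look in pandas. The column names PoolQC and PoolArea follow the Kaggle data description; the rule "NA means no pool only when the pool area is zero" is an assumption about how "most" NAs were selected, and the rows are toy values.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; only the third house has a pool.
df = pd.DataFrame({
    "PoolArea": [0, 0, 512, 0],
    "PoolQC": [np.nan, np.nan, "Ex", np.nan],
})

# Assumption: an NA quality with zero pool area means "no pool", not missing data.
no_pool = df["PoolQC"].isna() & (df["PoolArea"] == 0)
df.loc[no_pool, "PoolQC"] = "None"
```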

2) Not all NA values indicate a missing feature
Ex. Pool Quality
◦ Solution: Use related numerical variable to impute categorical variable
◦ Calculate the average pool area of each Pool Quality class and use it to fill the remaining NAs
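A sketch of one possible reading of this step: a house with a pool but no recorded quality gets the class whose average pool area is closest to its own. The "closest average" rule is an assumption, since the slide does not spell out how the class averages are matched, and the data are toy values.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; the last house has a pool but no recorded quality.
df = pd.DataFrame({
    "PoolArea": [500, 520, 800, 820, 510],
    "PoolQC": ["Gd", "Gd", "Ex", "Ex", np.nan],
})

# Average pool area of each known quality class (NA keys are ignored by groupby).
class_means = df.groupby("PoolQC")["PoolArea"].mean()

# Assumed rule: assign the class whose average area is nearest to the house's pool area.
missing = df["PoolQC"].isna() & (df["PoolArea"] > 0)
for idx in df.index[missing]:
    nearest = (class_means - df.loc[idx, "PoolArea"]).abs().idxmin()
    df.loc[idx, "PoolQC"] = nearest
```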

2) Not all NA values indicate a missing feature
Ex. Sale Type (1 Missing observation, but we know Sale Condition)
◦ Solution: Use related categorical variables to impute
◦ For Sale Condition “Normal”, by far the most common Sale Type is “WD”, so we impute that value.
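The same idea as a pandas sketch, assuming the column names SaleType and SaleCondition from the data description; the rows are toy values.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; the last sale has no recorded Sale Type.
df = pd.DataFrame({
    "SaleCondition": ["Normal", "Normal", "Normal", "Partial", "Normal"],
    "SaleType": ["WD", "WD", "COD", "New", np.nan],
})

# Most common Sale Type within each Sale Condition (mode ignores NA values).
mode_by_condition = df.groupby("SaleCondition")["SaleType"].agg(lambda s: s.mode().iloc[0])

# Fill each missing Sale Type with the mode for that row's Sale Condition.
df["SaleType"] = df["SaleType"].fillna(df["SaleCondition"].map(mode_by_condition))
```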

3) Use domain knowledge
Ex. Lot Frontage (486 NAs)
◦ Houses in close proximity likely have similar lot areas
◦ Solution: Use categorical variable to impute numerical
◦ Use median Lot Frontage by neighborhood to impute missing values
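A minimal pandas sketch of the neighborhood-median imputation, assuming the column names LotFrontage and Neighborhood from the data description; the values are toy data.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; one missing frontage in each neighborhood.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotFrontage": [70.0, 80.0, np.nan, 50.0, np.nan, 60.0],
})

# Median frontage of each neighborhood, broadcast back to every row.
neighborhood_median = df.groupby("Neighborhood")["LotFrontage"].transform("median")

# Only the missing entries are replaced; observed values are untouched.
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_median)
```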

4) Variables with little to no relation to other variables
◦ Solution: Impute by most commonly occurring class within variable
Ex. Electrical
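As a sketch, mode imputation in pandas might look like the following; the values are toy data, and the column name Electrical follows the data description.

```python
import numpy as np
import pandas as pd

# Toy column for illustration; one value is missing.
electrical = pd.Series(["SBrkr", "SBrkr", "FuseA", np.nan, "SBrkr"], name="Electrical")

# Most commonly occurring class (mode ignores NA values).
most_common = electrical.mode().iloc[0]

# Replace the missing entry with that class.
electrical = electrical.fillna(most_common)
```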

Categorical Variables (Ordinal)
Some machine learning algorithms cannot handle non-numerical values
Ex. Kitchen Quality
◦ Solution: Use average Sale Price to assign ordered numerical values to categories
('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
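A sketch of the ordinal encoding, assuming pandas and the column names KitchenQual and SalePrice from the data description. The prices are toy values, and KitchenQualNum is a hypothetical name for the encoded column.

```python
import pandas as pd

# Toy rows for illustration.
df = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa", "TA"],
    "SalePrice": [140000, 210000, 330000, 105000, 150000],
})

# Check the category order implied by average Sale Price (lowest to highest).
order_by_price = df.groupby("KitchenQual")["SalePrice"].mean().sort_values().index.tolist()

# Ordered numerical values from the slide.
quality_scale = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["KitchenQualNum"] = df["KitchenQual"].map(quality_scale)
```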

Categorical Variables (Nominal)
Some machine learning algorithms cannot handle non-numerical values
Ex. Land Contour
◦ Solution: One-hot encoding technique: binarizing classes of each variable
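One-hot encoding can be sketched with `pandas.get_dummies`; the column name LandContour follows the data description and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration.
df = pd.DataFrame({"LandContour": ["Lvl", "Bnk", "HLS", "Lvl"]})

# One indicator column per class; each row has exactly one indicator set.
encoded = pd.get_dummies(df, columns=["LandContour"])
```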

Outliers
Some observations may be abnormally far from other values
Ex. Ground Living Area vs Sale Price
◦ Two points with very large area but very low sale price
◦ Solution: Remove outliers
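A sketch of the removal step in pandas. The thresholds (area above 4000 and price below 300000) are assumptions chosen for illustration, since the slide does not give the cutoffs used, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration; the last two have very large area but low price.
df = pd.DataFrame({
    "GrLivArea": [1500, 2000, 2500, 4700, 5600],
    "SalePrice": [150000, 200000, 260000, 185000, 160000],
})

# Assumed cutoffs, not taken from the slides.
is_outlier = (df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)

# Keep everything that is not flagged.
df_clean = df.loc[~is_outlier].reset_index(drop=True)
```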

Skewness and Scaling
Distributions of some variables may be highly skewed
Ex. Lot Area
◦ Solution: log(1 + x) transformation
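A small sketch of measuring skewness before and after the transformation, assuming pandas and NumPy; the lot areas are toy values with a long right tail.

```python
import numpy as np
import pandas as pd

# Toy lot areas for illustration, with two very large lots.
lot_area = pd.Series([8000, 9000, 9500, 10000, 11000, 12000, 50000, 215000], name="LotArea")

# Sample skewness of the raw values.
skew_before = lot_area.skew()

# Apply log(1 + x) and measure skewness again.
lot_area_log = np.log1p(lot_area)
skew_after = lot_area_log.skew()
```

In this toy example the transformed values are still right-skewed, only less so than the raw values.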

Near Zero Variance Predictors
Low variance predictors add little value to models
◦ Calculate ratio of most frequent vs. second most frequent value
◦ Ratios >> 1 suggest very low variance
◦ Solution: Remove near zero variance predictors using a frequency-ratio cutoff of 95:5
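The frequency-ratio filter can be sketched as below. `frequency_ratio` is a hypothetical helper, the 95:5 cutoff is read as a ratio of 19, and the columns are toy data; any further criteria the team may have applied are not shown on the slide.

```python
import pandas as pd

def frequency_ratio(col: pd.Series) -> float:
    """Count of the most frequent value divided by the count of the second most frequent."""
    counts = col.value_counts()  # sorted from most to least frequent
    if len(counts) < 2:
        return float("inf")  # a constant column has no second value
    return counts.iloc[0] / counts.iloc[1]

# Toy columns for illustration: Street is nearly constant, HouseStyle is not.
df = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "HouseStyle": ["1Story"] * 60 + ["2Story"] * 40,
})

cutoff = 95 / 5  # 95:5 expressed as a ratio
ratios = {name: frequency_ratio(df[name]) for name in df.columns}

# Keep only predictors whose ratio does not exceed the cutoff.
kept = df[[name for name, ratio in ratios.items() if ratio <= cutoff]]
```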

Numerical Variables
As expected, important quantitative factors to consider are space/size, date, overall quality.
Top 10 Numerical Variables With Greatest Covariance vs. SalePrice

Feature Engineering
Ideas for new features:
◦ Remodeled – Year Built not equal to Year Remodel/Addition
◦ Seasonality – Combine Month Sold and Year Sold
◦ New House – Year Built same as Year Sold
◦ Total Area – sum all variables denoting square footage
◦ Inside Area – sum all variables denoting square footage referring to space inside the house
◦ Overall Basement – Basement Quality and Basement Condition
◦ Overall Condition – Condition 1 and Condition 2
◦ Overall Quality – External Quality and External Condition
◦ Overall Sale – Sale Type and Sale Condition
◦ Sale and Condition – Sale Type and Overall Condition
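A sketch of a few of these features in pandas. Column names follow the data description; the choice of columns summed into InsideArea and the "YYYY-MM" format of Season are assumptions, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration.
df = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2008],
    "YearRemodAdd": [2003, 1990, 2008],
    "YrSold": [2008, 2007, 2008],
    "MoSold": [2, 5, 9],
    "TotalBsmtSF": [856, 1262, 920],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
})

# Remodeled: remodel/addition year differs from construction year.
df["Remodeled"] = (df["YearBuilt"] != df["YearRemodAdd"]).astype(int)

# New House: sold in the year it was built.
df["NewHouse"] = (df["YearBuilt"] == df["YrSold"]).astype(int)

# Seasonality: year and month sold combined into one label (assumed format).
df["Season"] = df["YrSold"].astype(str) + "-" + df["MoSold"].astype(str).str.zfill(2)

# Inside Area: sum of interior square footage columns (assumed subset).
df["InsideArea"] = df[["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]].sum(axis=1)
```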

Models
Random Forest
◦ Pros: Lower variance, decorrelates data, scale invariant
◦ Cons: High bias, difficult to interpret
◦ Hyperparameters: Num features = 48, Num trees = 1000
◦ Cross-Validated RMSE Score: 0.14997
◦ Kaggle Score: 0.14758

Models
Gradient Boost
◦ Pros: Feature scaling not needed, high accuracy
◦ Cons: Computationally expensive, overfitting
◦ Hyperparameters: Num trees = 1000, Depth = 2, Num features = sqrt, Samples/leaf = 15, Learning rate = 0.05
◦ Cross-Validated RMSE Score: 0.1128
◦ Kaggle Score: 0.12421
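A sketch of how the Gradient Boost settings above could map onto scikit-learn's `GradientBoostingRegressor`, together with a cross-validated RMSE. The use of scikit-learn, the 5-fold split, and the synthetic data are assumptions; the slides do not name the library or the fold count, so the resulting score is not comparable to the numbers above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the processed features and the log-transformed sale price.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = 12.0 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hyperparameters as listed on the slide, mapped to scikit-learn argument names.
model = GradientBoostingRegressor(
    n_estimators=1000,      # Num trees
    max_depth=2,            # Depth
    max_features="sqrt",    # Num features
    min_samples_leaf=15,    # Samples/leaf
    learning_rate=0.05,     # Learning rate
    random_state=0,
)

# scikit-learn reports negated RMSE, so flip the sign before averaging.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
cv_rmse = -scores.mean()
```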

Top 40 Features by Relative Importance: Gradient Boost

Models
XGBoost
◦ Pros: Extremely fast, allows parallel computing
◦ Cons: Difficult to interpret, overfits vs. gradient boosting
◦ Hyperparameters: Num trees = 2724, Max depth = 30, Gamma = 0.0, Minimum child weight = 4
◦ Cross-Validated RMSE Score: 0.13642
◦ Kaggle Score: 0.13082

Top 40 Features by Relative Importance: XGBoost

Models
Regularized Linear Regression
◦ Pros: Easily interpretable, computationally inexpensive, less prone to overfitting
◦ Cons: Requires scaled variables, requires numerical variables
◦ Hyperparameters: Lambda = 0.0005, Alpha = 0.9
◦ Cross-Validated RMSE Score: 0.1111
◦ Kaggle Score: 0.11922
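A sketch of a regularized linear model with these settings, assuming the slide's Lambda is the overall penalty strength and Alpha is the L1/L2 mix. Under that reading, scikit-learn's `ElasticNet` takes them as `alpha` and `l1_ratio` respectively. The library choice, the scaling pipeline, the 5-fold split, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed features and the log-transformed sale price.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.2, 0.0, 0.1, 0.0]) + 12.0 + rng.normal(scale=0.1, size=200)

# Scale inside the pipeline so each fold is standardized on its own training part.
model = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=0.0005, l1_ratio=0.9, max_iter=10000),  # Lambda -> alpha, Alpha -> l1_ratio
)

# scikit-learn reports negated RMSE, so flip the sign before averaging.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
cv_rmse = -scores.mean()
```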

Coefficients of Top 40 Predictors

Models
Ensembling
◦ Pros: Can improve accuracy
◦ Cons: Lose interpretability
◦ Hyperparameters (models combined): Lasso, Enet, Gradient Boost, Gradient Boost Lite
◦ Cross-Validated RMSE Score: 0.1071
◦ Kaggle Score: 0.11751
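A minimal sketch of combining the four models' predictions. Equal-weight averaging on the log scale is an assumption, since the slides do not say how the models were weighted, and the predictions below are made-up numbers.

```python
import numpy as np

# Hypothetical log-scale predictions for two houses from each fitted model.
preds_log = {
    "lasso": np.array([12.10, 11.80]),
    "enet": np.array([12.12, 11.82]),
    "gradient_boost": np.array([12.06, 11.78]),
    "gradient_boost_lite": np.array([12.08, 11.76]),
}

# Assumed scheme: simple average of the models on the log scale.
ensemble_log = np.mean(list(preds_log.values()), axis=0)

# Undo the log(1 + x) transformation to get prices for submission.
ensemble_price = np.expm1(ensemble_log)
```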

Conclusions
Prediction
Our RMSE yields an error of ≈ ± $9,000 for the average sale price ($181,000)
What Drives Sale Price?
Size, Age
Overall Quality/Condition
Neighborhood (both good and bad)
Commercial Zone
Year sold (housing crash)
