Musings of a Kaggler
By Kai Xin

I am not a good student. Skipped school, played games all day,
almost got kicked out of school.

I play a different game now. But at the core it is the same: understand
the game, devise a strategy, keep playing.

Every piece of data is unique
but some data is more important than others

It is not about the tools or the model or the stats.
It is about the steps to put everything together.

https://github.com/thiakx/RUGS-Meetup
Remember to download the data from the Kaggle competition and put it here

First look at the data
223,129 rows

First look at the data
Plot on map?
Not really free text? Some repeats
Need to predict these
Related to summary / description?

Graph by Ryn Locar
Understand the data via visualization

Visualize the data - Interactive maps
LeafletR demo: http://www.thiakx.com/misc/playground/scfMap/scfMap.html
Cities: Oakland, Chicago, New Haven, Richmond

Step 1: Draw Boundary Polygon
Step 2: Create Base (each hex 1km wide)
Step 3: Point in Polygon Analysis
Step 4: Local Moran’s I

Obtain Boundary Polygon Lat Long
App can be found at: leafletMaps/latlong.html
leafletMaps/regionPoints.csv

Generating Hex
Code can be found at: baseFunctions_map.R
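
A minimal sketch of laying out hex centres for a base grid, assuming planar coordinates in km rather than raw lat/long (the real code in baseFunctions_map.R may work quite differently). The function name hexCenters and its arguments are hypothetical. For hexes 1 km wide (flat side to flat side), centres in a row sit 1 km apart, rows sit sqrt(3)/2 km apart, and every second row is shifted by half a hex.

```r
# Hypothetical helper: centres of a hex grid covering a bounding box.
# Coordinates are assumed planar (e.g. km), not lat/long.
hexCenters <- function(xmin, xmax, ymin, ymax, width = 1) {
  dy <- width * sqrt(3) / 2                 # vertical distance between rows
  ys <- seq(ymin, ymax, by = dy)
  do.call(rbind, lapply(seq_along(ys), function(r) {
    offset <- if (r %% 2 == 0) width / 2 else 0   # shift every second row
    xs <- seq(xmin + offset, xmax, by = width)
    data.frame(x = xs, y = ys[r], row = r)
  }))
}

grid <- hexCenters(0, 5, 0, 5, width = 1)
nearest <- min(dist(grid[, c("x", "y")]))   # closest pair of centres
```

With real lat/long data the boundary would first need projecting to a planar coordinate system so that "1 km" means the same thing everywhere.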

Point in Polygon Analysis
1. dataExplore_map.R
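
To make the point in polygon step concrete, here is a small base R sketch using the standard ray casting test. It is an illustration only: the repo script may well use a spatial package instead, and pointInPolygon is a hypothetical name. A square stands in for a hex cell.

```r
# Ray casting: count how many polygon edges a horizontal ray from the
# point crosses; an odd count means the point is inside.
pointInPolygon <- function(px, py, polyX, polyY) {
  n <- length(polyX)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does edge (j -> i) straddle the horizontal line y = py?
    if ((polyY[i] > py) != (polyY[j] > py)) {
      xAtY <- polyX[i] +
        (py - polyY[i]) * (polyX[j] - polyX[i]) / (polyY[j] - polyY[i])
      if (px < xAtY) inside <- !inside
    }
    j <- i
  }
  inside
}

# Toy cell (a square) and two toy issue locations
cellX <- c(0, 4, 4, 0)
cellY <- c(0, 0, 4, 4)
inCell  <- pointInPolygon(2, 2, cellX, cellY)
outCell <- pointInPolygon(5, 2, cellX, cellY)
```

Running this test for every issue against every cell gives the per-cell counts that feed the next step.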

Local Moran’s I
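
As a rough illustration of what Local Moran's I measures, the sketch below computes it in base R for six cells in a row, assuming each cell's neighbours are the adjacent cells and weights are row-standardized. The function name localMoran and the toy counts are made up for this example; a spatial statistics package would normally be used for real data.

```r
# Local Moran's I: I_i = (z_i / m2) * sum_j w_ij * z_j
# where z are deviations from the mean and m2 = sum(z^2) / n.
localMoran <- function(x, w) {
  z <- x - mean(x)
  m2 <- sum(z^2) / length(x)
  wRow <- w / rowSums(w)            # row-standardize the weights
  as.vector(z * (wRow %*% z)) / m2
}

# Toy data: issue counts for 6 cells in a line, high on the left
counts <- c(10, 10, 10, 0, 0, 0)
n <- length(counts)
w <- matrix(0, n, n)
for (i in 1:(n - 1)) {              # adjacent cells are neighbours
  w[i, i + 1] <- 1
  w[i + 1, i] <- 1
}
ii <- localMoran(counts, w)
```

Cells surrounded by similar values (the ends of each cluster) score high, while the two cells on the boundary between the high and low clusters score 0.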

LeafletR
Kx’s layered demo map: leafletMaps/scfMap_kxDemoVer

[Figure: training data timeline, with segments labeled Ignore, Model, and MAD]

In Search of the 20% Data
Detection of “Anomalies”
Can we justify this using statistics?

2-sample Kolmogorov–Smirnov test
ksTest <- ks.test(trainData$num_views[trainData$month == 4 & trainData$year == 2013],
                  trainData$num_views[trainData$month == 9 & trainData$year == 2012])
# D is the distance between the two empirical distributions:
# a smaller D means the two data sets probably come from the same distribution
ksTest$statistic
Jan’12 to Oct’12 and Mar’13 training data ignored
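
A small sketch of how the KS test can screen months, using synthetic data only (not the competition data). It assumes each month is compared against one reference month; the column names mirror the snippet above, while refMonth and dByMonth are hypothetical names.

```r
set.seed(1)
# Synthetic views: months 1-3 share a distribution, month 4 is shifted
views <- data.frame(
  month = rep(1:4, each = 300),
  num_views = c(rlnorm(900, meanlog = 1, sdlog = 1),
                rlnorm(300, meanlog = 3, sdlog = 1))
)

refMonth <- 1
# KS distance D of every other month against the reference month
dByMonth <- sapply(2:4, function(m) {
  unname(ks.test(views$num_views[views$month == m],
                 views$num_views[views$month == refMonth])$statistic)
})
names(dByMonth) <- paste0("month", 2:4)
```

Months with a large D relative to the rest are candidates for being ignored when training.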

What happened here?
No need to model? Just assume all Chicago data to be 0?
Chicago data generated by remote_API is mostly 0s, no need to model
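
A tiny sketch of replacing a model with a median estimate for one segment. The toy data frame and the column name source are assumptions made for illustration; the city label "chicargo" follows the spelling used in the deck's own code.

```r
# Toy training rows (made up): remote_API rows in one city are mostly 0
train <- data.frame(
  city = c("chicargo", "chicargo", "chicargo", "chicargo", "oakland", "oakland"),
  source = c("remote_API", "remote_API", "remote_API", "remote_API", "web", "mobile"),
  num_views = c(0, 0, 0, 1, 12, 30)
)

isRemote <- train$city == "chicargo" & train$source == "remote_API"
medianViews <- median(train$num_views[isRemote])

# Apply the constant estimate to matching test rows; leave the rest for the models
test <- data.frame(city = c("chicargo", "oakland"),
                   source = c("remote_API", "web"))
test$predViews <- NA_real_
test$predViews[test$city == "chicargo" & test$source == "remote_API"] <- medianViews
```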

Separate Outliers using Median Absolute Deviation (MAD)
MAD is robust and can handle skewed data, which makes it useful for identifying outliers. We
separated out data points that are more than 3 median absolute deviations from the median.
baseFunctions_cleanData.R
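
A minimal sketch of the 3 MAD rule on toy numbers. It assumes R's default mad(), which scales the raw MAD by 1.4826; the function in baseFunctions_cleanData.R may define the cutoff differently, and isMadOutlier is a hypothetical name.

```r
# Flag values more than `cutoff` MADs away from the median
isMadOutlier <- function(x, cutoff = 3) {
  med <- median(x, na.rm = TRUE)
  madVal <- mad(x, na.rm = TRUE)    # scaled by 1.4826 by default
  abs(x - med) > cutoff * madVal
}

views <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 250)
outlier <- isMadOutlier(views)
modelData   <- views[!outlier]      # kept for the main models
outlierData <- views[outlier]       # handled separately
```

One caveat: when more than half the values are identical (for example mostly 0s), the MAD is 0 and every other value gets flagged, so such segments need a different rule.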


[Figure: training data timeline, with segments labeled Ignore, Model, and MAD]
KS test: 27% of training data are of a different distribution
59% of data are Chicago data generated by remote_API, mostly 0s, no need to model, just estimate using median
4% of data is identified as outliers by MAD
10% of training data is used for modeling
Key Advantage: Rapid prototyping!
When you can focus on a small but representative subset of data, you
can run many, many experiments very quickly (I did several hundred)

Now we have the raw ingredients prepared,
it is time to make the dishes

Experiment with Different Models
❖ Random Forest
❖ Generalized Boosted Regression Models (GBM)
❖ Support Vector Machines (SVM)
❖ Bootstrap aggregated (bagged) linear models
How to use? Ask Google & RTFM

I don’t spend time on optimizing/tuning model settings (learning rate etc)
with cross validation. I find it really boring and really slow

Obsessing over tuning model variables is
like being obsessed with tuning the oven

Instead, the magic happens when we combine data and
when we create new data - aka feature creation

Creating Simple Features: City
trainData$city[trainData$longitude == "-77"] <- "richmond"
# the same pattern assigns "new_haven", "chicargo" and "oakland"

Creating Complex Features: Local Moran’s I

Creating Complex Features: Predicted View
The task is to predict views, votes and comments. But logically, won’t the
number of votes and comments be correlated with the number of views?
baseFunctions_model.R

Creating Complex Features: Predicted View
Storing the predicted value of view as new column
and using it as a new feature to predict votes & comments…
very risky business but powerful if you know what you are doing
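
A hedged sketch of using a predicted view as a feature, on synthetic data with plain lm() models. One way to reduce the risk is to build the new column from out-of-fold predictions, so no row's feature comes from a model that saw that row's own target. That precaution is my assumption here, not necessarily what baseFunctions_model.R does, and all names below are illustrative.

```r
set.seed(7)
n <- 200
# Synthetic data: votes depend partly on views
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$num_views <- 2 * toy$x1 + rnorm(n)
toy$num_votes <- 0.5 * toy$num_views + toy$x2 + rnorm(n, sd = 0.5)

# Out-of-fold predicted views: each row is predicted by a model
# trained on the other folds only
folds <- sample(rep(1:5, length.out = n))
toy$predView <- NA_real_
for (k in 1:5) {
  held <- folds == k
  viewMod <- lm(num_views ~ x1, data = toy[!held, ])
  toy$predView[held] <- predict(viewMod, newdata = toy[held, ])
}

# Use the predicted view as an extra feature for votes
voteMod <- lm(num_votes ~ x2 + predView, data = toy)
```

At prediction time the test set has no num_views at all, so predView there would come from a view model fitted on the full training data.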

Creating Complex Features: SplitTag, wordMine
Code can be found at: baseFunctions_cleanData.R

Adjusting Features: Simplify Tags

Adjusting Features: Recode Unknown Tags

Adjusting Features: Combine Low Count Tags
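
A short sketch of the two tag adjustments (recode unknown tags, then combine low count tags), assuming missing or empty tags count as unknown and a minimum count of 3. The function name, threshold, labels, and toy tags are all hypothetical.

```r
# Recode missing tags as "unknown", then lump rare tags into "other"
combineLowCountTags <- function(tags, minCount = 3, otherLabel = "other") {
  tags <- as.character(tags)
  tags[is.na(tags) | tags == ""] <- "unknown"
  counts <- table(tags)
  rare <- names(counts)[counts < minCount]
  tags[tags %in% rare] <- otherLabel
  tags
}

rawTags <- c(rep("pothole", 4), rep("graffiti", 3), NA, "", NA, "tree", "bench")
cleaned <- combineLowCountTags(rawTags)
tagCounts <- table(cleaned)
```

Fewer, better populated tag levels mean fewer sparse dummy variables for the models to deal with.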

Full List of Features Used
Original Feature (1), Created Features (13)
Only used 1 original feature, I created the other 13 features
Fed into models to predict views, votes and comments respectively:
+ Num Views as Y variable
+ Num Comments as Y variable
+ Num Votes as Y variable

An ensemble of good enough models can be surprisingly strong

An ensemble of the 4 base models has less error

Each model is good for different scenarios
GBM is rock solid, good for all scenarios
SVM is a counterweight, don’t trust anything it says
GLM is amazing for predicting comments, not so much for others
RandomForest is moderate, provides a balanced view

Ensemble (Stacking using regression)
testDataAns  rfAns  gbmAns  svmAns  glmBagAns
2.3          2      2.5     2.4     1.8
2            1.8    2.2     1.7     1.6
1.3          1.3    1.7     1.2     1.0
1.5          1.4    1.9     1.6     1.2
…            …      …       …       …
glm(testDataAns~rfAns+gbmAns+svmAns+glmBagAns)
We are interested in the coefficients

Ensemble (Stacking using regression)
Sort and column bind the predictions from the 4 models
Run regression (logistic or linear) and obtain coefficients
Scale ensemble ratio back to 1 (100%)
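
The three steps above can be sketched on synthetic predictions as follows. The data is made up (each "model" is the truth plus noise), the column names follow the table on the earlier slide, and dropping the intercept before rescaling is an assumption of this sketch rather than a confirmed detail of getEnsembleRatio.r.

```r
set.seed(3)
n <- 300
truth <- rexp(n, rate = 0.5)
# Column bind the held-out answers and the 4 models' predictions
preds <- data.frame(
  testDataAns = truth,
  rfAns     = truth + rnorm(n, sd = 0.6),
  gbmAns    = truth + rnorm(n, sd = 0.4),
  svmAns    = truth + rnorm(n, sd = 1.0),
  glmBagAns = truth + rnorm(n, sd = 0.8)
)

# Linear regression (glm defaults to the gaussian family)
fit <- glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns, data = preds)
coefs <- coef(fit)[-1]            # drop the intercept
ratio <- coefs / sum(coefs)       # scale the weights back to 1 (100%)

# Weighted blend of the base predictions
ensembleAns <- as.vector(as.matrix(preds[, names(ratio)]) %*% ratio)
rmse <- function(a, b) sqrt(mean((a - b)^2))
rmseEnsemble <- rmse(ensembleAns, truth)
rmseBestBase <- rmse(preds$gbmAns, truth)
```

In this toy setup the least noisy model gets the largest share. With real models the coefficients can come out negative or unstable when predictions are highly correlated, which is one reason the ratios need a separate validation set.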

Obtaining the ensemble ratio for each model
Inside the 3. testMod_generateEnsembleRatio folder: getEnsembleRatio.r

Ensemble is not perfect…
❖ Simple to implement? Kind of. But very tedious to
update. Will need to rerun every single model every time
you make any changes to the data (as the ensemble
ratio may change).
❖ Easy to overfit test data (will require another set of
validation data or cross validation).
❖ Very hard to explain to business users what is going on.

All this should get you to a top rank
49/532

[Figure: training data timeline, with segments labeled Ignore, Model, and MAD]
KS test: Too different from rest of data
59% of data are Chicago data generated by remote_API, mostly 0s, no need to model, just estimate using median
4% of data is identified as outliers by MAD
10% of training data is used for modeling
Key Advantage: Rapid prototyping!

Thank you!
thiakx@gmail.com
Data Science SG Facebook
Data Science SG Meetup
