Musings of a Kaggler
By Kai Xin
I am not a good student. Skipped school, played games all day,
almost got kicked out of school.
I play a different game now. But at the core it is the same: understand
the game, devise a strategy, keep playing.
My Overall Strategy
Every piece of data is unique
but some data is more important than others
It is not about the tools or the model or the stats.
It is about the steps to put everything together.
The Kaggle Competition
https://github.com/thiakx/RUGS-Meetup
Remember to download data from Kaggle Competition and put it here
First look at the data
223,129 rows
Notes on the data sample:
❖ Plot on map?
❖ Not really free text? Some repeats
❖ Need to predict these
❖ Related to summary / description?
Graph by Ryn Locar
Understand the data via visualization
Oakland, Chicago, New Haven, Richmond
http://www.thiakx.com/misc/playground/scfMap/scfMap.html
LeafletR Demo
Visualize the data - Interactive maps
Step 1: Draw Boundary Polygon
Step 2: Create Base (each hex 1km wide)
Step 3: Point in Polygon Analysis
Step 4: Local Moran’s I
Obtain Boundary Polygon Lat Long
App can be found at: leafletMaps/latlong.html
leafletMaps/regionPoints.csv
Generating Hex
Code can be found at:
baseFunctions_map.R
Point in Polygon Analysis
Code can be found at:
1. dataExplore_map.R
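For readers who want to see what a point in polygon check does, here is a minimal base R sketch using the ray-casting rule. It is not the code in dataExplore_map.R; the function name `pointInPolygon`, the square boundary and the three issue locations are all made up for illustration.

```r
# Ray-casting point-in-polygon test in base R (illustrative helper, not from the repo).
# polyX, polyY: polygon vertices in order, without repeating the first vertex at the end.
pointInPolygon <- function(px, py, polyX, polyY) {
  n <- length(polyX)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge (j -> i) straddle the horizontal line through the point?
    straddles <- (polyY[i] > py) != (polyY[j] > py)
    if (straddles) {
      # x-coordinate where the edge crosses that horizontal line
      xCross <- polyX[i] + (py - polyY[i]) * (polyX[j] - polyX[i]) / (polyY[j] - polyY[i])
      # Each crossing to the right of the point flips inside/outside
      if (px < xCross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Hypothetical square boundary and three made-up issue locations
squareX <- c(0, 4, 4, 0)
squareY <- c(0, 0, 4, 4)
issues <- data.frame(x = c(1, 5, 2), y = c(1, 1, 3.5))
issues$inRegion <- mapply(pointInPolygon, issues$x, issues$y,
                          MoreArgs = list(polyX = squareX, polyY = squareY))
print(issues)
```

Running the same check against each hex instead of one square gives a count of issues per hex, which is the kind of input the Local Moran’s I step needs.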
Local Moran’s I
Code can be found at:
1. dataExplore_map.R
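As a rough idea of what Local Moran’s I measures, the sketch below computes it in base R with row-standardised neighbour weights. It is a simplified stand-in, not the code in dataExplore_map.R. The function name `localMoran`, the four hexes in a row and their counts are assumptions for illustration, and every location is assumed to have at least one neighbour.

```r
# Local Moran's I with row-standardised weights (illustrative, base R only).
# x: value per location (e.g. issue count per hex)
# w: square 0/1 neighbour matrix, every row must have at least one neighbour
localMoran <- function(x, w) {
  w <- w / rowSums(w)          # each location's neighbour weights sum to 1
  z <- x - mean(x)             # deviation from the overall mean
  m2 <- sum(z^2) / length(x)   # variance of x (dividing by n)
  as.vector(z * (w %*% z) / m2)
}

# Made-up example: 4 hexes in a row, each touching only the hex next to it
counts <- c(10, 9, 1, 2)
neighbours <- matrix(c(0, 1, 0, 0,
                       1, 0, 1, 0,
                       0, 1, 0, 1,
                       0, 0, 1, 0), nrow = 4, byrow = TRUE)
moranI <- localMoran(counts, neighbours)
print(round(moranI, 3))
```

A positive value means a hex sits among neighbours that are similar to it (high next to high, or low next to low); a value near zero means its neighbours average out to the overall mean.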
LeafletR
Code can be found at:
1. dataExplore_map.R
Kx’s layered demo map: leafletMaps/scfMap_kxDemoVer
In Search of the 20% data
[Diagram: Training Data split into segments labeled Ignore, Model and MAD]
Detection of “Anomalies”
Can we justify this using statistics?
2-sample Kolmogorov–Smirnov test
ksTest <- ks.test(trainData$num_views[trainData$month == 4 & trainData$year == 2013],
                  trainData$num_views[trainData$month == 9 & trainData$year == 2012])
# D is like the distance between the two samples: a smaller D means the two
# data sets probably come from the same distribution
Jan’12 to Oct’12 and Mar’13 training data ignored
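To get a feel for how D behaves, here is a small sketch on made-up data, assuming nothing about the real competition data. The vectors `viewsOld`, `viewsRecentA` and `viewsRecentB` are simulated stand-ins for monthly num_views, not values from trainData.

```r
# Two-sample KS test on simulated data (all three vectors are made up).
set.seed(42)
viewsOld     <- rlnorm(500, meanlog = 1)  # stand-in for an "old" month
viewsRecentA <- rlnorm(500, meanlog = 2)  # stand-in for a recent month
viewsRecentB <- rlnorm(500, meanlog = 2)  # another month from the same distribution

ksDifferent <- ks.test(viewsOld, viewsRecentA)
ksSame      <- ks.test(viewsRecentA, viewsRecentB)

# D is the largest gap between the two empirical cumulative distributions
dDifferent <- unname(ksDifferent$statistic)
dSame      <- unname(ksSame$statistic)
cat("D (different distributions):", round(dDifferent, 3), "\n")
cat("D (same distribution):      ", round(dSame, 3), "\n")
```

Months whose D against the reference month is large are the candidates to ignore; where to draw the line is a judgment call.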
What happened here?
No need to model? Just assume all Chicago data to be 0?
Chicago data generated by remote_API is mostly 0s, no need to model
Separate Outliers using Median Absolute Deviation (MAD)
MAD is robust and can handle skewed data. It helps to identify outliers. We separated out data points that are more than 3 Median Absolute Deviations from the median.
Code can be found at:
baseFunctions_cleanData.R
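A minimal sketch of the 3 MAD rule in base R, assuming the raw (unscaled) MAD is used; the actual code in baseFunctions_cleanData.R may differ. The helper name `flagOutliersMAD` and the sample values are made up.

```r
# Flag values more than k median absolute deviations from the median.
# constant = 1 gives the raw MAD (R's default of 1.4826 rescales it for normal data).
flagOutliersMAD <- function(x, k = 3) {
  med <- median(x)
  madVal <- mad(x, center = med, constant = 1)
  abs(x - med) > k * madVal
}

# Made-up view counts with one extreme value
numViews <- c(1, 2, 3, 4, 5, 100)
isOutlier <- flagOutliersMAD(numViews)
print(numViews[isOutlier])   # values set aside as outliers
print(numViews[!isOutlier])  # values kept for the main model
```

Because both the centre and the spread are medians, one extreme value barely moves the threshold, which is why this works on skewed data where mean and standard deviation would not.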
❖ 10% of training data is used for modeling
❖ 59% of data are Chicago data generated by remote_API, mostly 0s; no need to model, just estimate using median
❖ 4% of data is identified as outliers by MAD
❖ KS test: 27% of training data are of different distribution
Key Advantage: Rapid prototyping!
When you can focus on a small but representative subset of data, you
can run many, many experiments very quickly (I did several hundreds)
Now we have the raw ingredients prepared,
it is time to make the dishes
Experiment with Different Models
❖ Random Forest
❖ Generalized Boosted Regression Models (GBM)
❖ Support Vector Machines (SVM)
❖ Bootstrap aggregated (bagged) linear models
How to use? Ask Google & RTFM
Or just download my code
I don’t spend time on optimizing/tuning model settings (learning rate etc.) with cross validation. I find it really boring and really slow.
Obsessing with tuning model variables is
like being obsessed with tuning the oven
Instead, the magic happens when we combine data and
when we create new data - aka feature creation
Creating Simple Features: City
# Assumes longitude has already been truncated to whole degrees
trainData$city[trainData$longitude == "-77"] <- "richmond"
trainData$city[trainData$longitude == "-72"] <- "new_haven"
trainData$city[trainData$longitude == "-87"] <- "chicargo"
trainData$city[trainData$longitude == "-122"] <- "oakland"
Code can be found at:
1. dataExplore_map.R
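If the longitude column still holds full coordinates, the same idea can be sketched with a lookup table, assuming the whole-degree part is enough to tell the four cities apart. The helper `assignCity` and the sample longitudes are made up; the city labels are spelled as in the code above.

```r
# Map longitude to city via its whole-degree part (illustrative helper).
cityLookup <- c("-77"  = "richmond",
                "-72"  = "new_haven",
                "-87"  = "chicargo",   # label spelled as in the source code
                "-122" = "oakland")

assignCity <- function(longitude) {
  # trunc() drops the decimals, e.g. -72.9 becomes -72
  key <- as.character(trunc(longitude))
  unname(cityLookup[key])  # NA for anything outside the four cities
}

# Made-up longitudes, one per city
sampleLongitude <- c(-77.4, -72.9, -87.6, -122.2)
sampleCity <- assignCity(sampleLongitude)
print(sampleCity)
```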
Creating Complex Features: Local Moran’s I
Code can be found at:
1. dataExplore_map.R
Creating Complex Features: Predicted View
The task is to predict views, votes and comments, but logically, won’t the number of votes and comments be correlated with the number of views?
Code can be found at:
baseFunctions_model.R
Storing the predicted value of views as a new column and using it as a new feature to predict votes & comments… very risky business, but powerful if you know what you are doing
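The chaining idea can be sketched with two plain linear models on simulated data. This is only an illustration, not the code in baseFunctions_model.R: the columns `hour` and `descLength`, the simulated relationships and the 150/50 split are all assumptions.

```r
# Two-stage sketch: predict views first, then feed the prediction into the votes model.
set.seed(1)
n <- 200
issues <- data.frame(hour = sample(0:23, n, replace = TRUE),
                     descLength = rpois(n, 40))
issues$num_views <- 0.1 * issues$descLength + rnorm(n)
issues$num_votes <- 0.5 * issues$num_views + rnorm(n, sd = 0.5)

trainSet <- issues[1:150, ]
testSet  <- issues[151:n, ]

# Stage 1: model views from the base features, store the prediction as a new column
viewModel <- lm(num_views ~ hour + descLength, data = trainSet)
trainSet$predView <- predict(viewModel, newdata = trainSet)
testSet$predView  <- predict(viewModel, newdata = testSet)

# Stage 2: votes model uses predicted views only, since real views are unknown at test time
voteModel <- lm(num_votes ~ predView, data = trainSet)
predVotes <- predict(voteModel, newdata = testSet)
print(head(round(predVotes, 2)))
```

The risk is that any error in the views model flows straight into the votes and comments models, so a bad first stage quietly damages everything downstream.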
Creating Complex Features: SplitTag, wordMine
Code can be found at:
baseFunctions_cleanData.R
Adjusting Features: Simplify Tags
Code can be found at:
baseFunctions_cleanData.R
Adjusting Features: Recode Unknown Tags
Code can be found at:
baseFunctions_cleanData.R
Adjusting Features: Combine Low Count Tags
Code can be found at:
baseFunctions_cleanData.R
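One common way to combine low count tags is to relabel every tag below a count threshold as a single catch-all level. The sketch below assumes that approach; the helper `combineLowCountTags`, the threshold of 2, the label "other" and the sample tags are all made up, and the tags are assumed to be a character vector rather than a factor.

```r
# Collapse rare tags into one catch-all label (illustrative helper).
combineLowCountTags <- function(tags, minCount = 2, otherLabel = "other") {
  counts <- table(tags)
  rareTags <- names(counts)[counts < minCount]
  tags[tags %in% rareTags] <- otherLabel
  tags
}

# Made-up tags: "tree" and "sign" each appear only once
tags <- c("pothole", "pothole", "pothole", "graffiti", "graffiti", "tree", "sign")
combinedTags <- combineLowCountTags(tags)
print(table(combinedTags))
```

Fewer, better populated levels give tree-based models less chance to memorise tags that appear only a handful of times.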
Full List of Features Used
Code can be found at:
baseFunctions_model.R
Only used 1 original feature, I created the other 13 features
Original Feature (1), Created Feature (13)
+ Num View as Y variable
+ Num Comments as Y variable
+ Num Votes as Y variable
Fed into models to predict view, votes, comments respectively
An ensemble of good enough models can be surprisingly strong
An ensemble of the 4 base models has less error
Each model is good for a different scenario
❖ GBM is rock solid, good for all scenarios
❖ SVM is a counterweight, don’t trust anything it says
❖ GLM is amazing for predicting comments, not so much for others
❖ RandomForest is moderate, provides a balanced view
Ensemble (Stacking using regression)
testDataAns  rfAns  gbmAns  svmAns  glmBagAns
2.3          2      2.5     2.4     1.8
2            1.8    2.2     1.7     1.6
1.3          1.3    1.7     1.2     1.0
1.5          1.4    1.9     1.6     1.2
…            …      …       …       …
glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns)
We are interested in the coefficients
❖ Sort and column bind the predictions from the 4 models
❖ Run regression (logistic or linear) and obtain coefficients
❖ Scale ensemble ratio back to 1 (100%)
Obtaining the ensemble ratio for each model:
Inside 3. testMod_generateEnsembleRatio folder - getEnsembleRatio.r
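The three steps above can be sketched as follows on simulated predictions. This is not getEnsembleRatio.r: the column names follow the table in the slides, but the numbers are made up, and dividing by the sum of the coefficients is one assumed way to scale the ratio back to 1.

```r
# Stacking sketch: regress the true answers on the 4 model predictions,
# then rescale the coefficients so they sum to 1.
set.seed(7)
n <- 200
testDataAns <- runif(n, 0, 3)  # made-up true values

# Step 1: column bind the (simulated) predictions from the 4 models
preds <- data.frame(
  testDataAns = testDataAns,
  rfAns     = testDataAns + rnorm(n, sd = 0.3),
  gbmAns    = testDataAns + rnorm(n, sd = 0.2),
  svmAns    = testDataAns + rnorm(n, sd = 0.5),
  glmBagAns = testDataAns + rnorm(n, sd = 0.4)
)

# Step 2: linear regression (glm defaults to the gaussian family), drop the intercept
fit <- glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns, data = preds)
coefs <- coef(fit)[-1]

# Step 3: scale the ensemble ratio back to 1 (100%)
ensembleRatio <- coefs / sum(coefs)
print(round(ensembleRatio, 3))

# Blend the 4 predictions using the ratio
blended <- as.vector(as.matrix(preds[, names(ensembleRatio)]) %*% ensembleRatio)
```

On real predictions a coefficient can come out negative, in which case the ratio still sums to 1 but one model is being subtracted; check the signs before trusting the blend.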
Ensemble is not perfect…
❖ Simple to implement? Kind of. But very tedious to update. Will need to rerun every single model every time you make any changes to the data (as the ensemble ratio may change).
❖ Easy to overfit test data (will require another set of validation data or cross validation).
❖ Very hard to explain to business users what is going on.
All this should get you to a top rank: 49/532
Thank you!
thiakx@gmail.com
Data Science SG Facebook
Data Science SG Meetup
