Starting Data Science with Kaggle.com
Nathaniel Shimoni
6/25/2017
Talk outline

• What is Kaggle?
• Why is Kaggle so great? The everyone-wins approach
• Kaggle tiers & top Kagglers
• Frequently used terms and the main rules
• The benefits of starting with Kaggle
• A common Kaggle data science process
What is Kaggle?

• An online platform that runs data science competitions
• Declares itself to be the home of data science
• Has over 1M registered users & over 60k active users
• One of the most vibrant communities for data scientists
• A great place to meet other “data people”
• A great place to learn and test your data & modeling skills
Why is Kaggle so great? (the everyone-wins approach)

Competitors
• Receive prizes, knowledge, exposure & a portfolio showcase

Kaggle
• Receives money from competition sponsors
• Gains influence on the community
• Gains knowledge of the platforms & algorithmic trends
• Drives rapid development & adoption of high-performing platforms

Competition sponsors
• Have data & a business task, but no data scientists
• Receive state-of-the-art models quickly, without hiring data scientists
My Kaggle profile
Kaggle tiers

• Novice – a new Kaggle user
• Contributor – participated in one or more competitions, ran a kernel, and is active in the forums
• Expert – 2 top-25% finishes
• Master – 2 top-10% finishes & 1 top-10 (places) finish
• Grandmaster – 5 top-10 finishes & 1 solo top-10 finish
Top Kagglers
Frequently used terms

• Leaderboard (LB) – public & private

The competition data, available once you have accepted the rules, is split into training data and testing data. The testing data feeds two leaderboards:
• Public LB – used for ranking submissions throughout the competition; it can serve as an additional validation frame, but can also be a source of overfitting
• Private LB – used for final scoring (the only score that truly matters)
Frequently used terms

• Leakage – the introduction of information about the target that is not a legitimate predictor (usually by a mistake in the data preparation process)
• Team merger – 2 or more participants competing together
Frequently used terms

• LB shuffle – the re-ranking that occurs at the end of the competition (upon moving from the public to the private LB)
Main rules for Kaggle competitions
• One account per user
• No private sharing outside teams (public sharing is usually allowed and endorsed)
• Limited number of entries per day & per competition
• Winning solutions must be written in open-source code
• Winners must hand in well-documented source code to be eligible for the prize
• Each participant usually selects 2 solutions for final evaluation
Why start with Kaggle?

• Project-based learning – learn by doing
• Solve real-world challenges
• Great supporting community
• Benchmark solutions & shared code samples
• Clear business objective and modeling task
• Develop a work portfolio and rank yourself against other competitors (and get recognition)
• Compete against state-of-the-art solutions
• Learn (a lot!!!) when the competition ends
Why start with Kaggle?

• Ability to team up with others:
 learn from better Kagglers
 learn how to collaborate effectively
 merge different solutions to achieve a score boost
 meet exciting new people
• Answer the questions of others – you only truly learn something when you teach it to someone else
• Ability to apply new ideas at work with little effort
• Varied areas of activity (verticals)
Why start with Kaggle?

• The ability to follow many experts, each of whom specializes in a particular area (a sample from my list):
 Ensemble learning – Mathias Müller
 Feature extraction – Darius Barušauskas
 Validation – Gert Jacobusse
 Super-fast draft modeling – ZFTurbo (real name unknown)
 Inspiration (no minimal age for data science) – Mikel Bober-Irizar
Common Kaggle data science process

Exploratory data analysis → data cleaning → data augmentation / adding external data → feature engineering → set the correct validation method → single models → diverse single models → ensemble learning → final prediction

(Adding external data is not always allowed, yet it is good practice to consider when possible.)

Approximate share of total time spent in each activity:
• EDA – 20%
• Data cleaning, augmenting & feature generation – 40%
• Modeling – 30%
• Ensemble learning – 10%
Data cleaning

• Impute missing values (mean, median, most common value, or a separate prediction task)
• Remove zero-variance features
• Remove duplicated features
• Outlier removal – use caution, as it can be harmful; at the cleaning stage we remove only clearly irrelevant values (e.g. a negative price)
• NA encoding / imputing
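The cleaning steps above can be sketched in pandas (a minimal sketch; the toy frame and its column names are illustrative, not from the talk):

```python
import pandas as pd

# Toy frame: a numeric feature with NAs, a zero-variance column,
# and an exact duplicate column
df = pd.DataFrame({
    "price":     [10.0, None, 30.0, 20.0],
    "rooms":     [2.0, 3.0, None, 3.0],
    "constant":  [1, 1, 1, 1],
    "rooms_dup": [2.0, 3.0, None, 3.0],
})

# Impute missing values with the per-column median
# (mean / most-common-value imputation works the same way)
df = df.fillna(df.median())

# Remove zero-variance features
df = df.loc[:, df.nunique(dropna=False) > 1]

# Remove duplicated features
df = df.loc[:, ~df.T.duplicated()]

# Cleaning-stage outlier removal: drop only clearly irrelevant
# values, e.g. a negative price
df = df[df["price"] >= 0]
```

After these steps only `price` and `rooms` survive: the constant and duplicated columns are gone, and the NAs are filled.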
Data augmentation & external data

• External data sources:
 OpenStreetMap
 weather measurement data
 online calendars
• APIs
• Scraping (using Scrapy / Beautiful Soup)
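A minimal sketch of folding external data into the training set, assuming a hypothetical, already-fetched weather API response (the JSON shape and field names here are invented for illustration):

```python
import json

# Hypothetical weather API response, keyed by date
api_response = '{"2017-06-25": {"temp_c": 29.5, "rain_mm": 0.0}}'
weather = json.loads(api_response)

train_rows = [{"date": "2017-06-25", "sales": 120}]

# Join the external weather features onto each training row by date
for row in train_rows:
    row.update(weather.get(row["date"], {}))
```

Each training row now carries the external weather features alongside its original ones.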
Feature engineering

• Rescaling / standardization of existing features
• Data transformations: TF-IDF, log1p, min-max scaling, binning of numeric features
• Turn categorical features into numeric ones (label encoding / one-hot encoding)
• Create count features
• Parse textual features to get more generalizable features
• The hashing trick
• Extract date/time features, e.g. DayOfWeek, month, year, dayOfMonth, isHoliday, etc.
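Several of these transformations in one short pandas/NumPy sketch (the frame and column names are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative frame: a timestamp, a categorical feature, and a count
df = pd.DataFrame({
    "ts":   pd.to_datetime(["2017-06-25", "2017-12-31"]),
    "city": ["tlv", "nyc"],
    "cnt":  [3, 1000],
})

# Extract date/time features (DayOfWeek, month, dayOfMonth, ...)
df["day_of_week"] = df["ts"].dt.dayofweek   # Monday=0 ... Sunday=6
df["month"] = df["ts"].dt.month
df["day_of_month"] = df["ts"].dt.day

# log1p transform to tame a heavy-tailed count feature
df["cnt_log1p"] = np.log1p(df["cnt"])

# Categorical -> numeric via one-hot encoding
df = pd.get_dummies(df, columns=["city"])
```

Label encoding (`df["city"].astype("category").cat.codes`) is the compact alternative to one-hot when a model can handle ordinal-looking integers.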
Feature selection

• Remove near-zero-variance features
• Use feature importance and eliminate the least important features
• Recursive Feature Elimination (RFE)
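Recursive Feature Elimination in scikit-learn, on a toy problem (the estimator choice and feature counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy problem: 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE repeatedly refits the model and drops the weakest feature
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

X_selected = X[:, selector.support_]  # keep only the surviving features
```

The same loop works with any estimator that exposes coefficients or feature importances (e.g. a gradient-boosted tree model).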
Hyperparameter optimization

• Grid search CV (exhaustive, rarely better than the alternatives)
• Random search CV
• Hyperopt
• Bayesian optimization

* Hyperparameter tuning will usually improve results, but not as much as the other activities
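A random-search sketch with scikit-learn's `RandomizedSearchCV` (the model and the parameter values below are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Random search samples a handful of configurations instead of
# exhaustively trying them all, as grid search would
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [20, 50, 100],
                         "max_depth": [2, 4, 8]},
    n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)
best = search.best_params_
```

Swapping in scipy distributions for the lists (or a Hyperopt / Bayesian-optimization search space) changes the sampling strategy but not the overall shape of the loop.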
Validation

• Train/test split
• Shuffle split
• K-fold (the most commonly used)
• Time-based separation
• Group k-fold
• Leave one group out
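A sketch of group k-fold, the scheme to reach for when rows belong to groups and plain k-fold would leak group information across the split (the toy data is illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 6 samples in 3 groups (e.g. one group per customer)
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# Group k-fold keeps each whole group on one side of the split,
# so no group ever appears in both train and validation
for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))
```

`KFold`, `ShuffleSplit`, `TimeSeriesSplit`, and `LeaveOneGroupOut` all share this same `.split(...)` iterator interface, so they are drop-in replacements for one another.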
Ensemble learning

• Simple/weighted average of previous best models
• Bagging of the same type of model (e.g. different random seeds, different hyperparameters)
• Majority vote
• Using out-of-fold predictions as meta features, a.k.a. stacking
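Simple averaging, weighted averaging, and majority voting in plain NumPy (the probabilities and weights are made up for illustration):

```python
import numpy as np

# Predicted probabilities from three hypothetical single models
p1 = np.array([0.9, 0.2, 0.6])
p2 = np.array([0.8, 0.4, 0.4])
p3 = np.array([0.7, 0.1, 0.5])

# Simple average of previous best models
simple_avg = (p1 + p2 + p3) / 3

# Weighted average – trust the historically better model more
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = weights[0] * p1 + weights[1] * p2 + weights[2] * p3

# Majority vote on the hard labels
votes = np.stack([p1, p2, p3]) > 0.5
majority_vote = votes.sum(axis=0) >= 2
```

Averaging works best when the single models are diverse, which is why the process slide emphasizes building diverse single models before ensembling.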
Out-of-fold predictions – a.k.a. meta features

Divide the training data into folds (4 in this example). For each fold, train on the other 3 folds, then predict both the held-out fourth fold and the testing data:
• the held-out predictions (oof 1 … oof 4) together cover the whole training set – these are the out-of-fold predictions
• the 4 per-fold test predictions are averaged into a single set of test predictions
Out Of Fold predictions – a.k.a meta features
fold1
fold2
fold3
fold4
oof 1
oof 2
oof 3
oof 4
Out of fold
predictions
Averaged
test
predictions
Test
predictions
fold1
Test
predictions
fold2
Test
predictions
fold3
Test
predictions
fold4
Divided training data - train on 3 folds
predict the forth fold and the testing data
6/25/2017
Starting Data Science with Kaggle.com
Nathaniel Shimoni
25
Out Of Fold predictions – a.k.a meta features
oof 1
oof 2
oof 3
oof 4
Model 1
e.g. knn
Averaged test
predictions
Out of fold
predictions
oof 1
oof 2
oof 3
oof 4
Model 2
e.g. NN
oof 1
oof 2
oof 3
oof 4
Model 3
e.g. gbm
Train
labels
Model 1
e.g. knn
Model 2
e.g. NN
Model 3
e.g. gbm
After training several models using this method (3 different models in this sample)
We can now train a new model using our newly formed meta features
* Note that we can either train our meta model using only these new features or use
the new features along with our original train data for training
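The stacking scheme above can be sketched end to end with scikit-learn (a minimal sketch: the base models, the fold count, and the reuse of the first 20 training rows as a stand-in test set are all illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_test = X[:20]  # stand-in "testing data" for the sketch

base_models = [KNeighborsClassifier(), RandomForestClassifier(random_state=0)]
kf = KFold(n_splits=4, shuffle=True, random_state=0)

oof = np.zeros((len(X), len(base_models)))       # out-of-fold meta features
test_meta = np.zeros((len(X_test), len(base_models)))

for m, model in enumerate(base_models):
    fold_test_preds = []
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        # predict the held-out fold ...
        oof[val_idx, m] = model.predict_proba(X[val_idx])[:, 1]
        # ... and the testing data
        fold_test_preds.append(model.predict_proba(X_test)[:, 1])
    # averaged test predictions
    test_meta[:, m] = np.mean(fold_test_preds, axis=0)

# Meta model trained on the out-of-fold predictions
# (optionally concatenate the original features as well)
meta = LogisticRegression().fit(oof, y)
final_prediction = meta.predict(test_meta)
```

Each column of `oof` is one base model's out-of-fold predictions over the whole training set, exactly as in the diagram above.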
Disadvantages of Kaggle

• A large focus on modeling relative to the rest of the steps in the process
• Little weight given to runtime and scalability
• Little reasoning behind the choice of a specific evaluation metric
• Competing for the last few percentage points isn’t always valuable
• The “click and submit” phenomenon
Additional reading resources

• MOOCs:
 Machine Learning – Stanford, Coursera
 Data Science track – Johns Hopkins, Coursera
 Udacity deep learning course
• Documentation:
 scikit-learn documentation
 Keras documentation
 R caret package documentation
Links to sources

This presentation draws heavily from the following sources:
• Mark Peng’s presentation “Tips for participating Kaggle challenges”
• Darius Barušauskas’s presentation “Tips and tricks to win Kaggle data science competitions”
• Kaggle discussion forums and blog
Questions?