DATA SCIENCE COMPETITION
Conversion Logic @ Whisper
2. 1. 2017
ATTRIBUTION
DATA SCIENCE COMPETITION
Since 1997
2006 - 2009
Since 2010
COMPETITION STRUCTURE
• Training Data - Feature and Label are provided
• Test Data - Feature is provided; Label is predicted in the Submission
• Submission is scored on the Public LB and the Private LB
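As a rough illustration of the structure above, the sketch below scores one submission against hidden test labels that are split into a public and a private leaderboard portion. The 30% public share, the accuracy metric, and every function name are assumptions made for illustration, not the rules of any particular competition.

```python
import random


def split_leaderboard(n_rows, public_share=0.3, seed=0):
    """Randomly assign test row indices to the public or private leaderboard."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_public = int(n_rows * public_share)
    return sorted(indices[:n_public]), sorted(indices[n_public:])


def accuracy(labels, predictions, indices):
    """Share of correct predictions over the given row indices."""
    correct = sum(1 for i in indices if labels[i] == predictions[i])
    return correct / len(indices)


def score_submission(hidden_labels, submission, public_share=0.3, seed=0):
    """Return (public_score, private_score) for one submission."""
    public_idx, private_idx = split_leaderboard(
        len(hidden_labels), public_share, seed
    )
    return (
        accuracy(hidden_labels, submission, public_idx),
        accuracy(hidden_labels, submission, private_idx),
    )


# Hypothetical data: 1,000 hidden test labels and a submission that is
# right on roughly 80% of rows.
rng = random.Random(42)
hidden_labels = [rng.randint(0, 1) for _ in range(1000)]
submission = [y if rng.random() < 0.8 else 1 - y for y in hidden_labels]
public_score, private_score = score_submission(hidden_labels, submission)
print(f"Public LB: {public_score:.3f}  Private LB: {private_score:.3f}")
```

The two scores usually differ slightly because they are computed on different rows, which is why tuning against the public score alone can mislead.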
KAGGLE
• 237 competitions since 2010
• 500K+ users
• 50K+ competitors
• $3MM+ in prizes paid out
WHY COMPETITION
• For fun
• For experience
• For learning
• For networking
FUN
• Competing with others
• Incremental improvement
EXPERIENCE
LEARNING
NETWORKING
BS ON COMPETITIONS
• No ETL
• No EDA
• Not worth it
• Not for production
NO ETL?
• Deloitte Western Australia Rental Prices
• Outbrain Click Prediction - 2B page views, 16.9MM clicks, 700MM users, 560 sites
NO EDA?
• Most competitions provide actual labels - typical EDA
• Anonymized data - more creative EDA
• People decode age, states, time intervals, income, etc.
NOT WORTH IT?
• Performance matters
• You can walk more easily once you know how to run
NOT FOR PRODUCTION?
• Kaggle Kernel
• Max execution time: 10 minutes
• Max file output: 500MB
• Memory limit: 8GB
ENSEMBLE PIPELINE AT CL
BEST PRACTICES
• Feature Engineering
• Algorithms
• Cross Validation
• Ensemble
FEATURE ENGINEERING
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot encoding, TF-IDF (text), Weight-of-Evidence
• Timeseries - Stats, FFT, MFCC, ERP (EEG)
• Numerical/Timeseries to Categorical - RF/GBM*
• Dimensionality Reduction - PCA, SVD, Autoencoder
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
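A few of the numerical and categorical transformations listed above can be sketched with the Python standard library alone. This is a minimal illustration under assumed function names and toy inputs; in practice a library such as Scikit-Learn would typically be used instead.

```python
import math
from statistics import mean, pstdev


def log1p_transform(values):
    """Log(1 + x): compresses heavy right tails and is defined at x = 0."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Normalization to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def binarize(values, threshold):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot_encode(categories):
    """One-hot encoding with one column per level, in sorted level order."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


# Hypothetical toy columns.
clicks = [0, 1, 9, 99]
device = ["mobile", "desktop", "mobile", "tablet"]

print(log1p_transform(clicks))
print(standardize(clicks))
print(binarize(clicks, threshold=5))
print(one_hot_encode(device))
```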
ALGORITHMS
Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | -
Extremely Random Trees | Scikit-Learn | -
Neural Networks / Deep Learning | Keras, MXNet | Blends well with GBM. Best at image recognition competitions, NLP.
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
Support Vector Machine | Scikit-Learn | -
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
CROSS VALIDATION
Training data are split into five stratified folds: each fold has the same
sample size and preserves the overall dropout rate.
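The stratified five-fold split described above might look like the following standard-library sketch. The function name, the 20% dropout rate, and the round-robin assignment are assumptions for illustration; Scikit-Learn's StratifiedKFold is the usual tool for this.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each row to a fold so every fold keeps roughly the overall label rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the rows of this class round-robin; the running offset keeps
        # total fold sizes balanced across classes.
        for position, idx in enumerate(indices):
            fold_of[idx] = (offset + position) % n_folds
        offset += len(indices)
    return fold_of


# Hypothetical data: 1,000 rows with a 20% dropout rate (label 1 = dropout).
labels = [1] * 200 + [0] * 800
folds = stratified_folds(labels)
for k in range(5):
    rows = [i for i, f in enumerate(folds) if f == k]
    rate = sum(labels[i] for i in rows) / len(rows)
    print(f"fold {k}: size={len(rows)} dropout_rate={rate:.2f}")
```

Each fold then serves once as the validation set while the other four are used for training.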
Ensemble Model Training
ENSEMBLE
* For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
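The transcript does not preserve which ensemble the slide depicts, so the sketch below shows only the simplest variant, a weighted average of model predictions, as an assumed example. The model outputs and labels are made-up toy values.

```python
def blend(predictions_by_model, weights=None):
    """Weighted average of per-model prediction lists (equal weights by default)."""
    n_models = len(predictions_by_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions_by_model)
    ]


def mean_squared_error(labels, predictions):
    """Average squared difference between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical validation labels and predictions from two models whose
# errors fall on different rows.
labels = [1, 0, 1, 0]
model_a = [0.9, 0.4, 0.6, 0.1]
model_b = [0.6, 0.1, 0.9, 0.4]
blended = blend([model_a, model_b])

print("model A:", mean_squared_error(labels, model_a))
print("model B:", mean_squared_error(labels, model_b))
print("blend:  ", mean_squared_error(labels, blended))
```

Averaging helps here because the two models make their larger errors on different rows; models with highly correlated errors gain much less from blending.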
KDD CUP 2015 SOLUTION
WHY COMPETITION
• For fun
• For experience
• For learning
• For networking
