An Intro to Kaggle
By Lex Toumbourou
Senior Consultant at Thoughtworks
Part 1: Kaggle Overview
What is Kaggle?
● Founded in 2010 in Australia
● Acquired by Google in 2017
● Host of data science competitions
● Largest data science community, at 536,000 registered users
Why Kaggle?
● Good resource for turning theoretical skills into practical skills
● Learn from other data scientists
● Gain reputation
Getting started with competitions
● What problems are you interested in solving?
● What computational budget do you have?
● Is the competition a good match for your level?
Competition evaluation and rules
● What is the goal of the competition?
● How is it evaluated?
○ Accuracy
○ Log Loss
○ Root mean squared error
○ Area under the ROC curve
○ F1 score
○ (many more)
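Most of these metrics are available in scikit-learn. A minimal sketch of three of them (the arrays are illustrative placeholders, not competition data):

from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

y_true = [0, 1, 1, 0]           # ground-truth labels
y_pred = [0, 1, 0, 0]           # hard class predictions
y_prob = [0.1, 0.9, 0.4, 0.2]   # predicted probability of class 1

print(accuracy_score(y_true, y_pred))  # fraction predicted correctly
print(log_loss(y_true, y_prob))        # penalizes confident wrong answers
print(roc_auc_score(y_true, y_prob))   # area under the ROC curve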
Datasets
● 3 main files:
○ train.csv
○ test.csv
○ sample_submission.csv
● Important to read the data documentation
● The Kaggle CLI is useful for downloading datasets on headless computers:
kaggle competitions download -c house-prices-advanced-regression-techniques
Loading dataset (useful Pandas one-liner)
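The slide's code screenshot isn't preserved; a plausible equivalent, assuming the standard file names above:

import pandas as pd

# One line per file, following the usual train/test layout.
train, test = pd.read_csv('train.csv'), pd.read_csv('test.csv')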
Leaderboard
● Split into public and private leaderboards.
● Be careful not to overfit to the public test set.
● Equal scores = the earliest submission wins.
Submissions
● Predictions are provided as a CSV with a row id and prediction value(s)
● Some rows are scored on the public leaderboard, the rest on the private one.
● Usually limited to 5 submissions per day.
● At the competition's conclusion, pick 2 submissions to use on the private leaderboard.
Generating submission one-liner
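This screenshot also isn't preserved. One hedged reconstruction, assuming a test DataFrame and a preds array from a fitted model ('Id' and 'SalePrice' are house-prices column names; other competitions differ):

import pandas as pd

# Two-column CSV of row ids and predictions, with no index column.
pd.DataFrame({'Id': test.Id, 'SalePrice': preds}).to_csv('submission.csv', index=False)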
Kernels
● Kaggle-provided computers - GPUs included
● Allow sharing results with others.
● Scripts let you submit predictions directly after running your code.
Discussion forums
● Lots of useful insights.
● Competition winners have usually read the forums in full.
Part 2: Getting Started
Tools
● Usually Python or R
● Jupyter Notebooks (interactive development)
● NumPy (linear algebra)
● Pandas (structured data)
● Matplotlib (plotting)
● Scikit-learn (models and ML tools)
● PyTorch or TensorFlow/Keras (neural networks)
Model selection
● Dependent on the problem
● Tree-based (Random Forests, XGBoost, LightGBM) - a good starting point for structured data
● Linear models (SVM, Logistic Regression) - still useful for certain problems.
● Neural networks (CNN, RNN) - image, text and speech data, sometimes structured
Choosing a validation method
● Train / val split
● Cross-validation
● Out-of-bag error
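A minimal sketch of all three with scikit-learn, using its built-in iris data as a stand-in:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# Train / val split: hold out 20% of rows for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(model.fit(X_train, y_train).score(X_val, y_val))

# 5-fold cross-validation: average the score over 5 held-out folds.
print(cross_val_score(model, X, y, cv=5).mean())

# Out-of-bag error (bagged trees only): score on rows each tree never saw.
print(RandomForestClassifier(oob_score=True, random_state=0).fit(X, y).oob_score_)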
Fast iteration
● Run experiments on a subset of your data.
● Use a good validation strategy.
● Save complex model stacking and ensembling until after you've maximized feature engineering.
Preparing data
● Model dependent
● Careful feature preparation and engineering is usually quite important.
● 4 main column types: continuous, ordinal, categorical and date
Image by Tobias Fischer
Continuous (aka numeric) features
● Scaling recommended (non-tree models)
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
● Outlier cleaning (non-tree models)
Winsorization: clip values beyond the 1st and 99th percentiles
log(x)
● Data imputation (fill in missing values)
df.SomeValue.fillna(df.SomeValue.median())
df['SomeValue_isna'] = df.SomeValue.isna()
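Tying those fragments together into one runnable sketch (df and SomeValue are illustrative names; note the missing-value flag is captured before imputing):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'SomeValue': [1.0, 2.0, np.nan, 250.0]})

df['SomeValue_isna'] = df.SomeValue.isna()                 # flag missing values
df.SomeValue = df.SomeValue.fillna(df.SomeValue.median())  # impute the median
lo, hi = df.SomeValue.quantile([0.01, 0.99])
df.SomeValue = df.SomeValue.clip(lo, hi)                   # winsorize outliers
df['SomeValue_scaled'] = StandardScaler().fit_transform(df[['SomeValue']]).ravel()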
Categorical features
● Ensure order of ordinal columns
df['Rating'] = df.Rating.cat.set_categories([1, 2, 3], ordered=True)
● One-hot encode non-ordinal columns
dummies = pd.get_dummies(df[cat_columns], dummy_na=True)
df = pd.concat([df, dummies], axis=1)
https://datascience.stackexchange.com/questions/30215/what-is-one-hot-encoding-in-tensorflow
Date time features
● Lots of information in a single date:
○ Day of week
○ Day of month
○ Is it a weekend?
○ Is it a public holiday?
● Lots of handy methods in the dt attribute of a Pandas column, which can be added as new columns
Image by Charisse Kenion
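A short sketch of expanding a datetime column through the dt accessor (df and SaleDate are made-up names; a public-holiday flag usually needs an external calendar, e.g. the holidays package):

import pandas as pd

df = pd.DataFrame({'SaleDate': pd.to_datetime(['2018-01-06', '2018-03-14'])})

df['DayOfWeek'] = df.SaleDate.dt.dayofweek        # Monday=0 ... Sunday=6
df['DayOfMonth'] = df.SaleDate.dt.day
df['IsWeekend'] = df.SaleDate.dt.dayofweek >= 5   # Saturday or Sunday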
Feature engineering
● Combining columns (adding values together, multiplying, dividing, etc.)
● Adding additional data sources*
○ Things near the house
○ Weather on the day
○ Etc.
* Ensure the competition allows it
● Discover Feature Engineering - a great article
Image by Chester Alvarez
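Column combinations are usually one-liners in Pandas; a sketch with a few house-price-style columns (all names illustrative):

import pandas as pd

df = pd.DataFrame({'LotArea': [8450, 9600], 'GrLivArea': [1710, 1262], 'TotRmsAbvGrd': [8, 6]})

df['TotalArea'] = df.LotArea + df.GrLivArea          # adding values together
df['AreaPerRoom'] = df.GrLivArea / df.TotRmsAbvGrd   # dividing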
Hyperparameter (aka settings) tuning
● Hyperparam = a parameter that isn't learned by the model.
● Manually (try some values and see what happens)
● Automated
○ RandomizedSearchCV
(sklearn)
○ GridSearchCV (sklearn)
○ Hyperopt
○ Spearmint
○ Lots more...
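A minimal RandomizedSearchCV sketch (parameter ranges are illustrative, not tuned recommendations):

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample 20 random hyperparameter combinations, scored with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'n_estimators': randint(50, 500),
                         'max_depth': randint(2, 10)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)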
Stacking / ensembling (aka combining models)
● Most winning solutions are a combination of models.
● Averaging predictions of multiple models
● “Meta models”: a model trained on the predictions of multiple models.
http://www.chioka.in/stacking-blending-and-stacked-generalization/
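A sketch of both ideas (the dataset and models are stand-ins):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(random_state=0)
lr = LogisticRegression(max_iter=5000)

# Out-of-fold probabilities for each base model.
p_rf = cross_val_predict(rf, X, y, cv=5, method='predict_proba')[:, 1]
p_lr = cross_val_predict(lr, X, y, cv=5, method='predict_proba')[:, 1]

blend = (p_rf + p_lr) / 2  # averaging: a simple mean of predictions

# Meta model: a second-level model trained on the base predictions.
meta = LogisticRegression().fit(np.column_stack([p_rf, p_lr]), y)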
Fin.