An Intro to Kaggle
By Lex Toumbourou
Senior Consultant at Thoughtworks
Part 1: Kaggle Overview
What is Kaggle?
● Founded in 2010 in Australia
● Acquired by Google in 2017
● Host of data science competitions
● Largest data science community, at 536,000 registered users
Why Kaggle?
● Good resource for turning theoretical skills into practical skills
● Learn from other data scientists
● Gain reputation
Getting started with competitions
● What problems are you interested in solving?
● What computational budget do you have?
● Is the competition a good match for your level?
Competition evaluation and rules
● What is the goal of the competition?
● How is it evaluated?
○ Accuracy
○ Log Loss
○ Root mean squared error
○ Area under the ROC curve
○ F1 score
○ (many more)
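Most of these metrics are available in scikit-learn. A minimal sketch of three of them (the arrays are illustrative placeholders, not competition data):

from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

y_true = [0, 1, 1, 0]           # ground-truth labels
y_pred = [0, 1, 0, 0]           # hard class predictions
y_prob = [0.1, 0.9, 0.4, 0.2]   # predicted probability of class 1

print(accuracy_score(y_true, y_pred))  # fraction predicted correctly
print(log_loss(y_true, y_prob))        # penalizes confident wrong answers
print(roc_auc_score(y_true, y_prob))   # area under the ROC curve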
Datasets
● 3 main files:
○ train.csv
○ test.csv
○ sample_submission.csv
● Important to read the data documentation
● The Kaggle CLI is useful for downloading datasets on headless computers:
kaggle competitions download -c house-prices-advanced-regression-techniques
Loading dataset (useful Pandas one-liner)
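The slide's code screenshot isn't preserved; a plausible equivalent, assuming the standard file names above:

import pandas as pd

# One line per file, following the usual train/test layout.
train, test = pd.read_csv('train.csv'), pd.read_csv('test.csv')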
Leaderboard
● Split into public and private leaderboards.
● Be careful not to overfit to the public test set.
● Equal scores = the earliest submission wins.
Submissions
● Predictions are provided as a CSV with a row id and prediction value(s)
● Some rows are scored on the public leaderboard, the rest on the private one.
● Usually limited to 5 submissions per day.
● At the competition's conclusion, pick 2 submissions to use on the private leaderboard.
Generating submission one-liner
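This screenshot also isn't preserved. One hedged reconstruction, assuming a test DataFrame and a preds array from a fitted model ('Id' and 'SalePrice' are house-prices column names; other competitions differ):

import pandas as pd

# Two-column CSV of row ids and predictions, with no index column.
pd.DataFrame({'Id': test.Id, 'SalePrice': preds}).to_csv('submission.csv', index=False)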
Kernels
● Kaggle-provided computers - GPUs included
● Allow sharing results with others.
● Scripts let you submit predictions directly after running your code.
Discussion forums
● Lots of useful insights.
● Competition winners have usually read the forums in full.
Part 2: Getting Started
Tools
● Usually Python or R
● Jupyter Notebooks (interactive development)
● NumPy (linear algebra)
● Pandas (structured data)
● Matplotlib (plotting)
● Scikit-learn (models and ML tools)
● PyTorch or TensorFlow/Keras (neural networks)
Model selection
● Dependent on the problem
● Tree-based (Random Forests, XGBoost, LightGBM) - a good starting point for structured data
● Linear models (SVM, Logistic Regression) - still useful for certain problems.
● Neural networks (CNN, RNN) - image, text and speech data, sometimes structured
Choosing a validation method
● Train / val split
● Cross-validation
● Out-of-bag error
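A minimal sketch of all three with scikit-learn, using its built-in iris data as a stand-in:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# Train / val split: hold out 20% of rows for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(model.fit(X_train, y_train).score(X_val, y_val))

# 5-fold cross-validation: average the score over 5 held-out folds.
print(cross_val_score(model, X, y, cv=5).mean())

# Out-of-bag error (bagged trees only): score on rows each tree never saw.
print(RandomForestClassifier(oob_score=True, random_state=0).fit(X, y).oob_score_)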
Fast iteration
● Run experiments on a subset of your data.
● Use a good validation strategy.
● Save complex model stacking and ensembling until after you've maximized feature engineering.
Preparing data
● Model dependent
● Careful feature preparation and engineering is usually quite important.
● 4 main column types: continuous, ordinal, categorical and date
Image by Tobias Fischer
Continuous (aka numeric) features
● Scaling recommended (non-tree models)
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
● Outlier cleaning (non-tree models)
Winsorization: clip values beyond the 1st and 99th percentiles
log(x)
● Data imputation (fill in missing values)
df.SomeValue.fillna(df.SomeValue.median())
df['SomeValue_isna'] = df.SomeValue.isna()
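Tying those fragments together into one runnable sketch (df and SomeValue are illustrative names; note the missing-value flag is captured before imputing):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'SomeValue': [1.0, 2.0, np.nan, 250.0]})

df['SomeValue_isna'] = df.SomeValue.isna()                 # flag missing values
df.SomeValue = df.SomeValue.fillna(df.SomeValue.median())  # impute the median
lo, hi = df.SomeValue.quantile([0.01, 0.99])
df.SomeValue = df.SomeValue.clip(lo, hi)                   # winsorize outliers
df['SomeValue_scaled'] = StandardScaler().fit_transform(df[['SomeValue']]).ravel()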
Categorical features
● Ensure order of ordinal columns
df['Rating'] = df.Rating.cat.set_categories([1, 2, 3], ordered=True)
● One-hot encode non-ordinal columns
dummies = pd.get_dummies(df[cat_columns], dummy_na=True)
df = pd.concat([df, dummies], axis=1)
https://datascience.stackexchange.com/questions/30215/what-is-one-hot-encoding-in-tensorflow
Date time features
● Lots of information in a single date:
○ Day of week
○ Day of month
○ Is it a weekend?
○ Is it a public holiday?
● Lots of handy methods in the dt attribute of a Pandas column, which can be added as new columns
Image by Charisse Kenion
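A short sketch of expanding a datetime column through the dt accessor (df and SaleDate are made-up names; a public-holiday flag usually needs an external calendar, e.g. the holidays package):

import pandas as pd

df = pd.DataFrame({'SaleDate': pd.to_datetime(['2018-01-06', '2018-03-14'])})

df['DayOfWeek'] = df.SaleDate.dt.dayofweek        # Monday=0 ... Sunday=6
df['DayOfMonth'] = df.SaleDate.dt.day
df['IsWeekend'] = df.SaleDate.dt.dayofweek >= 5   # Saturday or Sunday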
Feature engineering
● Combining columns (adding values together, multiplying, dividing, etc.)
● Adding additional data sources*
○ Things near the house
○ Weather on the day
○ Etc.
* Ensure the competition allows it
● Discover Feature Engineering - a great article
Image by Chester Alvarez
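Column combinations are usually one-liners in Pandas; a sketch with a few house-price-style columns (all names illustrative):

import pandas as pd

df = pd.DataFrame({'LotArea': [8450, 9600], 'GrLivArea': [1710, 1262], 'TotRmsAbvGrd': [8, 6]})

df['TotalArea'] = df.LotArea + df.GrLivArea          # adding values together
df['AreaPerRoom'] = df.GrLivArea / df.TotRmsAbvGrd   # dividing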
Hyperparameter (aka settings) tuning
● Hyperparam = a parameter that isn't learned by the model.
● Manually (try some values and see what happens)
● Automated
○ RandomizedSearchCV
(sklearn)
○ GridSearchCV (sklearn)
○ Hyperopt
○ Spearmint
○ Lots more...
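A minimal RandomizedSearchCV sketch (parameter ranges are illustrative, not tuned recommendations):

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample 20 random hyperparameter combinations, scored with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'n_estimators': randint(50, 500),
                         'max_depth': randint(2, 10)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)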
Stacking / ensembling (aka combining models)
● Most winning solutions are a combination of models.
● Averaging predictions of multiple models
● “Meta models”: a model trained on the predictions of multiple models.
http://www.chioka.in/stacking-blending-and-stacked-generalization/
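A sketch of both ideas (the dataset and models are stand-ins):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(random_state=0)
lr = LogisticRegression(max_iter=5000)

# Out-of-fold probabilities for each base model.
p_rf = cross_val_predict(rf, X, y, cv=5, method='predict_proba')[:, 1]
p_lr = cross_val_predict(lr, X, y, cv=5, method='predict_proba')[:, 1]

blend = (p_rf + p_lr) / 2  # averaging: a simple mean of predictions

# Meta model: a second-level model trained on the base predictions.
meta = LogisticRegression().fit(np.column_stack([p_rf, p_lr]), y)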
Fin.