© DataRobot, Inc. All rights reserved.
Kaggle and Data Science
Japan, 2018
Sergey Yurgenson
Director, Advanced Data Science Services
Kaggle Grandmaster
Kaggle
● Kaggle is a platform for data science competitions
● It was created by Anthony Goldbloom in 2010 in Australia and then moved to San Francisco
● In March of 2017 it was acquired by Google
● Many other start-ups are now trying to replicate the same idea, but Kaggle is still the best-known name in the data science community
● As of 2018, Kaggle has hosted more than 280 competitions and has more than 1 million members from more than 190 countries
Kaggle competitions
● Most Kaggle competitions are predictive modeling competitions
● Participants are provided with training data to train their models and test data with unknown targets
● Participants calculate predictions for the test data and submit those predictions to the Kaggle platform
● The accuracy of the predictions is evaluated using a predefined objective metric, and the result is reported back to participants
● The model performance of all participants is publicly available, so participants can compare the quality of their models with the models of other participants
● Many competitions have monetary prizes for top finishers
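The submit-and-score loop described above can be sketched in a few lines. This is a minimal illustration, not Kaggle's implementation: the hidden targets, the row ids, the choice of RMSE as the predefined metric, and the function names are all assumptions made for the example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, used here as an example of a predefined metric."""
    if len(y_true) != len(y_pred):
        raise ValueError("prediction count must match test row count")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical hidden targets: known to the platform, never to participants.
HIDDEN_TARGETS = {"id1": 3.0, "id2": 5.0, "id3": 4.0}

def score_submission(submission):
    """Score a {row_id: prediction} mapping against the hidden targets."""
    ids = sorted(HIDDEN_TARGETS)
    missing = [i for i in ids if i not in submission]
    if missing:
        raise ValueError(f"submission is missing rows: {missing}")
    return rmse([HIDDEN_TARGETS[i] for i in ids], [submission[i] for i in ids])

def leaderboard(submissions):
    """Rank participants by score; lower RMSE is better."""
    scores = {name: score_submission(sub) for name, sub in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])

board = leaderboard({
    "alice": {"id1": 3.0, "id2": 5.0, "id3": 4.0},
    "bob": {"id1": 4.0, "id2": 4.0, "id3": 4.0},
})
for name, score in board:
    print(f"{name}: {score:.4f}")
```

Participants only ever see the score returned by `score_submission`, never the targets themselves, which is what makes the leaderboard a fair comparison.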
Kaggle competitions
Kaggle ranking
● Based on competition performance, Kaggle ranks members using points and awards titles for top finishes in competitions
● For example, to earn the title of Master a member needs one gold medal and two silver medals. For competitions with 1000 participants, that means finishing once in the top 10 places and twice in the top 50.
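The Master example can be written out as a small check. This sketch uses only the numbers quoted above for a 1000-participant competition; Kaggle's actual cutoffs vary with competition size, and the function names are hypothetical.

```python
from collections import Counter

# Cutoffs quoted on the slide for a competition with 1000 participants.
# Assumption: real cutoffs depend on competition size and are not modeled here.
GOLD_CUTOFF = 10
SILVER_CUTOFF = 50

def medal_for_rank(rank):
    """Return 'gold', 'silver', or None for a final leaderboard rank."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    if rank <= GOLD_CUTOFF:
        return "gold"
    if rank <= SILVER_CUTOFF:
        return "silver"
    return None

def qualifies_for_master(final_ranks):
    """One gold and two silver medals, counted literally as stated on the slide."""
    medals = Counter(medal_for_rank(rank) for rank in final_ranks)
    return medals["gold"] >= 1 and medals["silver"] >= 2

print(qualifies_for_master([7, 30, 45, 400]))
print(qualifies_for_master([7, 30, 400]))
```

The sketch counts medals literally; whether a surplus gold medal can stand in for a silver one is not covered by the slide and is left out.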
Kaggle ranking
Kaggle and Data Science
Why do you dislike Kaggle?
● Kaggle competitions do not have much in common with real data science
○ The problems are already well formulated with metrics predefined. In an industry setting there is ambiguity, and knowing what to solve is one of the key steps towards a solution.
○ Data in most cases is already provided and is relatively clean.
○ The goal is more leaderboard driven rather than understanding driven. Winning a competition versus why an approach works is a top priority. Results may not be trustworthy.
○ There are chances of overfitting to test data with repeated submissions.
○ In most cases the solution is an ensemble of algorithms and not “productionizable”.
https://www.quora.com/Why-do-you-dislike-Kaggle
True or False?
● “The problems are already well formulated with metrics predefined. In an
industry setting there is ambiguity, and knowing what to solve is one of the
key steps towards a solution.”
https://www.quora.com/Why-do-you-dislike-Kaggle
Problem is well formulated
Mostly true, however...
● The need for criteria is an inherent property of any competition.
● In the real world, not all data scientists are free to select and reformulate the problem. Many problems are already defined, with specific success criteria assigned.
● We learn many subjects and skills by solving predefined problems and doing predefined exercises. We learn math and physics by solving problems from textbooks, where the problems are already formulated. By solving problems we also learn how to formulate them and which approach suits a particular data science situation.
● We also have to admit that evaluating the business value of solving a problem is completely out of scope for Kaggle competitions, whereas business value analysis and problem prioritization are an important part of many real-life data science projects.
True or False?
● “Data in most cases is already provided and is relatively clean.”
https://www.quora.com/Why-do-you-dislike-Kaggle
Data is clean
Half true
● In many competitions the datasets
○ Are very big
○ Have multiple tables
○ Contain duplicated and mislabeled records
○ Combine structured and unstructured data
● Some competitions encourage the search for additional sources of data
● Data leaks are common
● Often feature names and meanings are not provided, making the problem even more difficult than in the real world
● Data may be intentionally distorted to conform to data privacy laws
Data is clean
● Complex data structure
● Big datasets
● No meaningful feature names
Data is clean
● Kaggle competitions teach unique data manipulation skills:
○ Working with data under hardware limitations: efficient code, smart sampling, clever encoding…
○ Using EDA to uncover the meaning of data without relying on labels or other provided information
○ Discovering data leaks through data analysis
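As one illustration of "smart sampling" under hardware limits, the sketch below draws a fixed-size uniform sample from a stream without loading it into memory. Reservoir sampling is a technique swapped in for the example; the slides do not name a specific method, and the function name and seed parameter are assumptions.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from an iterable of unknown length.

    Memory use is O(k) no matter how many rows the stream contains.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# Usage: sample 100 rows from a generator standing in for a file too large for RAM.
big_stream = (n for n in range(100_000))
subset = reservoir_sample(big_stream, k=100)
print(len(subset))
```

In practice `rows` could be a file object or a CSV reader, so exploratory analysis can start on the sample while the full dataset stays on disk.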
True or False?
● The goal is more leaderboard driven rather than understanding driven. Winning
a competition versus why an approach works is a top priority. Results may not
be trustworthy.
https://www.quora.com/Why-do-you-dislike-Kaggle
No understanding
True, but maybe not that important
● The criticism assumes that a model we cannot understand is less valuable than a model we can understand
○ A model is not necessarily used for knowledge discovery
○ In real life we often use and rely on things we do not completely understand
○ If something we do not understand cannot be trusted, then how do we ever trust other people?
○ Even a complex machine learning model may be a simplification of an even more complex real system
No understanding
● The criticism ignores all the new research on model interpretability
○ Feature importance
○ Reason codes
○ Partial dependence plots
○ Surrogate models
○ Neuron activation visualization
○ ...
● These methods allow us to analyze and understand the behaviour of models as complicated as GBMs and neural networks
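To make "feature importance" concrete, here is a minimal sketch of permutation importance, one common way to compute it for any black-box model: shuffle one feature column and measure how much the error grows. The toy model, the synthetic data, and all names are assumptions for illustration; the slides do not prescribe a particular method.

```python
import random

def mse(model, X, y):
    """Mean squared error of a callable model over rows X with targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy black box: strong dependence on feature 0, weak on feature 1, none on feature 2.
def black_box(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [black_box(row) for row in X]

scores = permutation_importance(black_box, X, y)
for index, score in enumerate(scores):
    print(f"feature {index}: {score:.4f}")
```

The procedure never looks inside the model, so the same loop works for a GBM or a neural network wrapped in a callable.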
No understanding ?
True or False?
● There are chances of overfitting to test data with repeated submissions.
https://www.quora.com/Why-do-you-dislike-Kaggle
Overfitting
False
● This reflects a complete misunderstanding of how Kaggle works
○ Test data in a Kaggle competition is split into two parts: public and private
○ During the competition, models are evaluated only on the public part of the test set
○ Final results are based only on the private part of the test set
○ Thus the final model evaluation is based on completely new data
● One of the first lessons all competition participants learn, and learn fast:
○ Do not overfit the leaderboard
○ Create a training/validation partition that reflects the test data as closely as possible, including seasonality effects and data drift
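The public/private mechanism can be simulated with synthetic data. In this sketch a participant submits 200 random guesses and keeps whichever scores best on the public part; the split fraction, the dataset size, and the function names are assumptions, not Kaggle's actual settings.

```python
import random

def split_test_ids(test_ids, public_fraction=0.3, seed=0):
    """Split test row ids into disjoint public and private sets."""
    rng = random.Random(seed)
    ids = list(test_ids)
    rng.shuffle(ids)
    cut = round(len(ids) * public_fraction)
    return set(ids[:cut]), set(ids[cut:])

def accuracy(predictions, targets, ids):
    """Share of correct predictions on the given subset of row ids."""
    return sum(predictions[i] == targets[i] for i in ids) / len(ids)

rng = random.Random(1)
targets = {i: rng.randint(0, 1) for i in range(2000)}  # synthetic binary targets
public_ids, private_ids = split_test_ids(targets)

# "Tuning" by brute force: many random submissions, keep the best public score.
submissions = [{i: rng.randint(0, 1) for i in targets} for _ in range(200)]
best = max(submissions, key=lambda sub: accuracy(sub, targets, public_ids))

public_score = accuracy(best, targets, public_ids)
private_score = accuracy(best, targets, private_ids)
print(f"public: {public_score:.3f}  private: {private_score:.3f}")
```

Selecting on the public score inflates only the public score. The private score of the same submission is computed on rows that played no part in the selection, which is why repeated submissions do not translate into a better final ranking.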
True or False?
● In most cases the solution is an ensemble of algorithms and not
“productionizable”.
https://www.quora.com/Why-do-you-dislike-Kaggle
Difficult to put in production
Half true, half false
● Yes, in most cases the top models are complicated ensembles
● Putting them in production is difficult if it is done one-by-one for each model separately
● It is easy with an appropriately developed platform that can handle many models and blenders
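The idea of handling "many models and blenders" as one unit can be sketched as a blender that exposes the same call as a single model. This is a hypothetical minimal design, not DataRobot's platform; the class name, the weighting scheme, and the toy models are assumptions.

```python
class Blender:
    """Weighted average of several models behind a single predict() call."""

    def __init__(self, models, weights=None):
        if not models:
            raise ValueError("at least one model is required")
        self.models = list(models)
        raw = list(weights) if weights is not None else [1.0] * len(self.models)
        if len(raw) != len(self.models):
            raise ValueError("one weight per model is required")
        total = sum(raw)
        if total <= 0:
            raise ValueError("weights must sum to a positive number")
        # Normalize so the blend stays on the same scale as its members.
        self.weights = [w / total for w in raw]

    def predict(self, row):
        return sum(w * model(row) for w, model in zip(self.weights, self.models))

# Two toy models standing in for trained estimators.
def model_a(row):
    return 2.0 * row[0]

def model_b(row):
    return row[0] + 1.0

blender = Blender([model_a, model_b], weights=[3, 1])
prediction = blender.predict([2.0])
print(prediction)
```

Because the caller sees only `predict`, the serving code does not change whether the object behind it is one model or an ensemble of dozens.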
True or False?
● Sometimes, a 0.01 difference in AUC can be the difference between 1st place
and 294th place (out of 626). Those marginal gains take significant time and
effort that may not be worthwhile in the face of other projects and priorities.
https://www.quora.com/How-similar-are-Kaggle-competitions-to-what-data-scientists-do
Marginal gain is not valuable
Not always true
● We ourselves often advise clients on the balance between time spent and model performance
● However, in the investment world a 0.01 AUC difference means a difference of millions of dollars in gain or loss
● The competitive aspect of a data science problem with small margins drives innovation
○ New preprocessing steps
○ New feature engineering ideas
○ Continuous testing of new algorithms and implementations (GBM - XGBoost - LightGBM - CatBoost)
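To show what a 0.01 difference in AUC measures, the sketch below computes AUC directly from its pairwise definition: the share of (positive, negative) pairs in which the positive example gets the higher score. The function name and the tiny dataset are illustrative assumptions; production code would normally use a library routine.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

Under this definition, a 0.01 gain means that 1% of all positive/negative pairs switch from the wrong order to the right one. Whether that matters depends on how many decisions are made from the ranking and what each one is worth.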
Kaggle and Data Science
● “Kaggle competitions cover a decent amount of what a data scientist does.
The two big missing pieces are:
○ 1. taking a business problem and specifying it as a data science problem
(which includes pulling the data and structuring it so that it addresses that
business problem).
○ 2. putting models into production.”
Anthony Goldbloom
Kaggle and Data Science
● Kaggle is a competition
● “Real” Data Science is ... also a competition
Kaggle to “real life” Data Science
● DataRobot - created by top Kagglers
○ Owen Zhang, Product Advisor. Highest: 1st
○ Xavier Conort, Chief Data Scientist. Highest: 1st
○ Sergey Yurgenson, Director, AI Services. Highest: 1st
○ Jeremy Achin, CEO & Co-Founder. Highest: 20th
○ Tom de Godoy, CTO & Co-Founder. Highest: 20th
○ Amanda Schierz, Data Scientist. Highest: 24th
DataRobot automatically replicates the steps seasoned data scientists take. This allows
non-technical business users to create accurate predictive models and data scientists to add
to their existing tool set.
Kaggle and Data Science

