Kaggle
The home of data
science
GE Flight Quest 2
Optimize flight routes based
on weather & traffic
$250,000
122 teams
Hewlett Foundation: Automated Essay Scoring
Develop an automated scoring algorithm
for student-written essays
$100,000
155 teams
Allstate Purchase Prediction Challenge
Predict the insurance policy a customer
will purchase from their quote history
$50,000
1,570 teams
Merck Molecular Activity Challenge
Help develop safe and effective medicines
by predicting molecular activity
$40,000
236 teams
Higgs Boson Machine Learning Challenge
Use the ATLAS experiment to
identify the Higgs boson
$13,000
1,302 teams
The Kaggle Approach

Training Data
Age   Income    Default
58    $95,824   True
73    $20,708   False
59    $82,152   False
66    $25,334   True

Test Data (Default withheld)
Age   Income    Default
73    $53,445
61    $36,679
47    $90,422
44    $79,040
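
The training and test tables can be turned into a small worked example: fit something on the labeled rows, then predict the withheld Default column for the test rows. The sketch below is only an illustration. The slide does not name a model, so the 1-nearest-neighbour rule, the function names, and the rescaling step are all assumptions chosen for brevity.

```python
# Toy illustration of the train/test setup on the slide.
# The 1-nearest-neighbour model is an assumption chosen for brevity,
# not a method named in the presentation.

TRAIN = [  # (age, income, default) rows copied from the slide
    (58, 95824, True),
    (73, 20708, False),
    (59, 82152, False),
    (66, 25334, True),
]
TEST = [  # (age, income) rows; the Default label is withheld
    (73, 53445),
    (61, 36679),
    (47, 90422),
    (44, 79040),
]


def feature_ranges(rows):
    """Return (min, max) per feature column so features can be rescaled."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def rescale(features, ranges):
    """Map each feature onto roughly [0, 1] using training-set ranges."""
    return [
        (value - low) / (high - low) if high > low else 0.0
        for value, (low, high) in zip(features, ranges)
    ]


def predict_default(train, test):
    """Label each test row with the label of its nearest training row."""
    ranges = feature_ranges([row[:2] for row in train])
    scaled_train = [(rescale(row[:2], ranges), row[2]) for row in train]
    predictions = []
    for row in test:
        point = rescale(row, ranges)
        nearest = min(
            scaled_train,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
        )
        predictions.append(nearest[1])
    return predictions


predictions = predict_default(TRAIN, TEST)
for (age, income), label in zip(TEST, predictions):
    print(age, income, label)
```

With four training rows the predictions carry no real signal. The point is the shape of the task: competitors see labels only for the training rows and submit one prediction per test row.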
Kaggle presentation at SF Data Mining Meetup - Trulia June 23, 2015
Mapping Dark Matter
Competition Progress
[Chart: accuracy (lower is better) by week, Week 1 through End; y-axis ticks at .0150 and .0170]
Martin O’Leary
PhD student in Glaciology, Cambridge U
“In less than a week, Martin O’Leary,
a PhD student in glaciology,
outperformed the state-of-the-art
algorithms”
“The world’s brightest physicists have
been working for decades on solving
one of the great unifying problems of
our universe”
Mapping Dark Matter
Competition Progress
[Chart: accuracy (lower is better) by week, Week 1 through End; y-axis ticks at .0150 and .0170]
Martin O’Leary
PhD student in Glaciology, Cambridge U
Marius Cobzarenco
Grad student in computer vision, University College London
Ali Hassaïne & Eu Jin Lok
Signature Verification, Qatar U & Grad Student @ Deloitte
deepZot (David Kirkby & Daniel Margala)
Particle Physicist & Cosmologist
Other
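
The progress chart tracks a score where lower is better. One way to picture how such a leaderboard could be computed is sketched below. It assumes root-mean-square error as the metric (the slide does not say which metric was used), and the team names and numbers are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and answer lengths differ")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def rank_submissions(submissions, hidden_answers):
    """Score every team against the withheld answers, best (lowest) first."""
    scores = {
        team: rmse(preds, hidden_answers) for team, preds in submissions.items()
    }
    return sorted(scores.items(), key=lambda entry: entry[1])


# Hypothetical withheld answers and two hypothetical submissions.
hidden_answers = [0.10, -0.20, 0.05]
submissions = {
    "team_b": [0.20, -0.10, 0.00],
    "team_a": [0.10, -0.20, 0.05],
}

leaderboard = rank_submissions(submissions, hidden_answers)
for position, (team, score) in enumerate(leaderboard, start=1):
    print(position, team, round(score, 4))
```

Because the answers stay hidden from competitors, a team can only improve its rank by submitting better predictions, which is what the week-by-week curve on the chart records.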
We can work with difficult data

EXAMPLE ESSAY QUESTION:
We all understand the benefits of laughter. For example, someone once said, “Laughter is the shortest distance between two people.” Many other people believe that laughter is an important part of any relationship. Tell a true story in which laughter was one element or part.
Mayo Clinic: Seizure detection from EEG readings
The winning model correctly predicted seizures 82% of the time. Until that point, researchers had struggled to develop an algorithm that did better than chance.
We’ve worked with many of the world’s largest companies
Healthcare & Pharma
Consumer Internet
Finance
Industrial
Consumer Marketing
Oil & Gas
$50b+ Beverage Co.
Global Bank
Top Credit Card Issuer
Top 5 E&P
Top 20 E&P
Community of over 320K data scientists
That submit over 100K machine learning models per month
[Chart: Monthly Submissions to Kaggle Competitions, May-10 through May-15; y-axis 0 to 160,000]
Feature
engineering
matters most
Good software
engineering
practices and
robust statistical
methods are key
80% of data science is grunt work and only 20% involves deep thinking
A good pipeline makes data scientists more productive and their work higher quality and more
enjoyable
Our workflow environment will be the central repository for all data science work in a company
Anthony Goldbloom
a@kaggle.com
650 283 9781

