Center for Data Science Paris-Saclay
CNRS & University Paris Saclay

BALÁZS KÉGL
WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY

Why am I so critical?
Why do I play down our own success with the HiggsML challenge?

Because I believe that there is enormous potential in open innovation/crowdsourcing in science.
The current data challenge format is a single point in the landscape.

INTERMEDIARIES: THE GROWING INTEREST IN « CROWDS » -> EXPLOSION OF TOOLS
(Olga Kokshagina, 2015)
• Crowdsourcing
  • is a model leveraging novel technologies (web 2.0, mobile apps, social networks)
  • to build content and a structured set of information by gathering contributions from large groups of individuals

CROWDSOURCING ANNOTATION

CROWDSOURCING COLLECTION AND ANNOTATION

CROWDSOURCING MATH

CROWDSOURCING ANALYTICS

OPEN SOURCE

NEW PUBLICATION MODELS

THE BOOK TO READ

OUTLINE
• Summary of our conclusions after the HiggsML challenge
• The good, the bad and the ugly
• Elaborating on some of the points
• Rapid Analytics and Model Prototyping
  • an experimental format we have been developing

CIML WORKSHOP TOMORROW

THE GOOD
• Publicity, awareness
  • both in physics (about the technology) and in ML (about the problem)
• Triggering open data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• Learning a lot from Gábor on how to win a challenge
• Gábor getting hired by Google DeepMind
• Benchmarking
• Tool dissemination (xgboost, keras)

THE BAD
• No direct access to code
• No direct access to data scientists
• No fundamentally new ideas
• No incentive to collaborate

THE UGLY
• 18 months to prepare
  • legal issues, access to data
  • problem formulation: intellectually far more interesting than the challenge itself, but difficult to “market” or to crowdsource
  • once a problem is formalized/formatted as a challenge, the problem is solved (“learning is easy” - Gaël Varoquaux)

THE UGLY
• We asked the wrong question, on purpose!
  • because the right questions are complex and don’t fit the challenge setup
  • asking them would have led to far less participation
  • asking them would have led to bitterness among the participants, bad (?) for marketing

PUBLICITY, AWARENESS
• The HiggsML challenge on Kaggle
  • https://www.kaggle.com/c/higgs-boson

PUBLICITY, AWARENESS
[Embedded slide: “Classification for discovery”, from B. Kégl / AppStat@LAL, “Learning to discover”]

AWARENESS DYNAMICS
• HEPML workshop @NIPS14
  • JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
• CERN Open Data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• DataScience@LHC
  • http://indico.cern.ch/event/395374/
• Flavors of physics challenge
  • https://www.kaggle.com/c/flavours-of-physics

LEARNING FROM THE WINNER
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf

LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
  • variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf

BENCHMARKING
[Embedded slide: “Classification for discovery”]

BENCHMARKING
But what score did we optimize?
And why?

CLASSIFICATION FOR DISCOVERY
Goal: optimize the expected discovery significance
[Figure: signal and background distributions (axes: probability; count per year = flux × time) with a selection threshold]
• expected background: say, b = 100 events
• total count: say, 150 events
• excess is s = 50 events
• AMS = s/√b = 5 sigma
[Excerpt from the accompanying paper, cropped on the left; cut-off text is marked with […]]
[…] [back]ground expectation $\mu_b$. When optimizing the design of […] [re]gion $G = \{x : g(x) = s\}$, we do not know $n$ and $\mu_b$. As […] we estimate the expectation $\mu_b$ by its empirical counter[part] […] $+\, b$ to obtain the approximate median significance
$$\mathrm{AMS}_2 = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)}. \quad (14)$$
[…] $(x + 1)\ln(x + 1) = x + x^2/2 + O(x^3)$, $\mathrm{AMS}_2$ can be rewritten as
$$\mathrm{AMS}_2 = \mathrm{AMS}_3 \times \sqrt{1 + O\left(\left(\frac{s}{b}\right)^3\right)},$$
[…]
$$\mathrm{AMS}_3 = \frac{s}{\sqrt{b}}. \quad (15)$$
[…]tically indistinguishable when $b \gg s$. This approxima[tion] […]nding on the chosen search region, be a valid surrogate […]

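As a quick check of the numbers on this slide, here is a minimal Python sketch of the two scores in Eqs. (14) and (15). The function names `ams2` and `ams3` are chosen here for illustration only, and the sketch uses the formulas exactly as they appear on the slide; any extra terms used in the official challenge scoring are not reproduced.

```python
import math


def ams2(s, b):
    """Approximate median significance, Eq. (14):
    sqrt(2 * ((s + b) * ln(1 + s / b) - s))."""
    if b <= 0:
        raise ValueError("expected background b must be positive")
    inner = (s + b) * math.log1p(s / b) - s
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(2.0 * max(0.0, inner))


def ams3(s, b):
    """Simplified score, Eq. (15): s / sqrt(b)."""
    if b <= 0:
        raise ValueError("expected background b must be positive")
    return s / math.sqrt(b)


if __name__ == "__main__":
    s, b = 50.0, 100.0  # the example counts from the slide
    print(f"AMS3 = {ams3(s, b):.3f} sigma")
    print(f"AMS2 = {ams2(s, b):.3f} sigma")
```

With the slide's counts (s = 50, b = 100) the simplified score gives exactly 5 sigma, while Eq. (14) gives a somewhat smaller value because s is not small compared to b here.
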
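How to handle systematic (model) uncertainties?
• OK, so let’s design an objective function that can take background systematics into consideration
• Likelihood with unknown background $b \sim N(\mu_b, \sigma_b)$
$$L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!} e^{-(\mu_s + \mu_b)} \, \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-(b - \mu_b)^2 / (2\sigma_b^2)}$$
• Profile likelihood ratio
$$\lambda(0) = \frac{L(0, \hat{\hat{\mu}}_b)}{L(\hat{\mu}_s, \hat{\mu}_b)}$$
• The new Approximate Median Significance (by Glen Cowan)
$$\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
where
$$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)$$
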
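The following is a minimal Python sketch of the new AMS with a Gaussian background uncertainty, transcribed from the formula on this slide. The function name and the example value sigma_b = 10 are assumptions made here for illustration; they do not come from the talk.

```python
import math


def ams_with_background_uncertainty(s, b, sigma_b):
    """AMS with background systematics, as written on the slide:
    sqrt(2 * ((s + b) * ln((s + b) / b0) - s - b + b0) + (b - b0)^2 / sigma_b^2)
    with b0 = (b - sigma_b^2 + sqrt((b - sigma_b^2)^2 + 4 (s + b) sigma_b^2)) / 2.
    """
    if b <= 0 or sigma_b <= 0:
        raise ValueError("b and sigma_b must be positive")
    var_b = sigma_b ** 2
    b0 = 0.5 * (b - var_b + math.sqrt((b - var_b) ** 2 + 4.0 * (s + b) * var_b))
    poisson_term = 2.0 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
    gauss_term = (b - b0) ** 2 / var_b
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, poisson_term + gauss_term))


if __name__ == "__main__":
    s, b = 50.0, 100.0
    for sigma_b in (0.001, 1.0, 10.0):  # illustrative values only
        z = ams_with_background_uncertainty(s, b, sigma_b)
        print(f"sigma_b = {sigma_b:6.3f}  ->  AMS = {z:.3f}")
```

As sigma_b goes to zero, b0 tends to b and the score reduces to the version without systematics, sqrt(2((s + b) ln(1 + s/b) - s)); a larger background uncertainty lowers the significance for the same s and b.
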
HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
Why didn’t we use it?

How to handle systematic (model) uncertainties?
• The new Approximate Median Significance
$$\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
where
$$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)$$
[Figure: curves labeled “New AMS”, “ATLAS”, and “Old AMS”]

LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
  • variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked

THE TWO MOST COMMON DATA CHALLENGE KILLERS
Leakage
Variance of the test score

VARIANCE OF THE TEST SCORE

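To make the variance of the test score concrete, here is a rough Python sketch that bootstraps a significance score on a small selected sample. Everything in it is an assumption for illustration: the toy data (50 signal and 100 background events with unit weights), the function names, the use of s/sqrt(b) as the score, and the bootstrap itself, which is not necessarily how the organizers or the winner estimated the variance.

```python
import math
import random


def significance(s, b):
    """Simplified discovery significance s / sqrt(b)."""
    return s / math.sqrt(b)


def bootstrap_score(is_signal, weights, n_boot=200, seed=0):
    """Resample the selected events with replacement and return the
    mean and standard deviation of the resulting scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(weights)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        s = sum(weights[i] for i in idx if is_signal[i])
        b = sum(weights[i] for i in idx if not is_signal[i])
        if b > 0:  # skip degenerate resamples with no background
            scores.append(significance(s, b))
    mean = sum(scores) / len(scores)
    var = sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    # Toy selected region: 50 signal and 100 background events, unit weights.
    is_signal = [True] * 50 + [False] * 100
    weights = [1.0] * 150
    mean, std = bootstrap_score(is_signal, weights)
    print(f"score = {mean:.2f} +/- {std:.2f}")
```

If the spread reported by such an estimate is comparable to the gaps between leaderboard entries, the final ranking is partly decided by noise, which is the kind of risk a participant may want to assess before investing effort.
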
DATA CHALLENGES
• Challenges are useful for
  • generating visibility in the data science community about novel application domains
  • benchmarking state-of-the-art techniques fairly on well-defined problems
  • finding talented data scientists
• Limitations
  • not necessarily suited to solving complex and open-ended data science problems in realistic environments
  • no direct access to solutions and data scientists
  • no incentive to collaborate

We decided to design something better

RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Direct access to code, prototyping
• Incentivizing diversity
• Incentivizing collaboration
• Training
• Networking

WHERE DOES IT COME FROM?
• Our experience with the HiggsML challenge
• Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
• Collaboration with management scientists specializing in managing innovation
• Michael Nielsen’s book: Reinventing Discovery
• 5+ iterations so far

UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion

CENTER FOR DATA SCIENCE PARIS-SACLAY
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
250 researchers in 35 laboratories
• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Visualization: INRIA, LIMSI
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
• LIST/CEA
[Diagram labels]
• machine learning, information retrieval, signal processing, data visualization, databases
• Domain science: human society, life, brain, earth, universe
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Domain scientist, Software engineer

THE DATA SCIENCE LANDSCAPE
• Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Data scientist, Data trainer, Applied scientist, Domain scientist, Software engineer, Data engineer

https://medium.com/@balazskegl

TOOLS: LANDSCAPE TO ECOSYSTEM
• Data scientist, Data trainer, Applied scientist, Domain expert, Software engineer, Data engineer
• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
Tools:
• interdisciplinary projects
• matchmaking tool
• design and innovation strategy workshops
• data challenges
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and TSs
• IT platform for linked data
• annotation tools
• SaaS data science platform

NIELSEN’S CROWDSOURCING PRINCIPLES
• Modularizing the collaboration
  • independent subtasks
  • reduces barriers
  • broadens the range of available expertise
• Encouraging small contributions
• Rich and well-structured information commons
  • so people can build on earlier work

RAMPS
• Single-day coding sessions
  • 20-40 participants
  • preparation is similar to challenges
• Goals
  • focusing and motivating top talents
  • promoting collaboration, speed, and efficiency
  • solving (prototyping) real problems

TRAINING SPRINTS
• Single-day training sessions
  • 20-40 participants
  • focusing on a single subject (deep learning, model tuning, functional data, etc.)
  • preparing RAMPs

ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE

ANALYTICS TOOLS TO MONITOR PROGRESS

RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Jan 15
The HiggsML challenge

RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Apr 10
Classifying variable stars

VARIABLE STARS
accuracy improvement: 89% to 96%

RAPID ANALYTICS AND MODEL PROTOTYPING
2015 June 16 and Sept 26
Predicting El Niño
RMSE improvement: 0.9 °C to 0.4 °C

RAPID ANALYTICS AND MODEL PROTOTYPING
2015 October 8
Insect classification
accuracy improvement: 30% to 70%

CONCLUSIONS
• Explore the open innovation space
  • read Nielsen’s book
• Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool
• Come to our CIML WS tomorrow

THANK YOU!
