Coding for science and innovation
Gaël Varoquaux
to change the world!
Science
The process of discovering
knowledge and mechanisms
Computing is a central part of how we do science
G Varoquaux 2
Science + Computers = Computational science
Nuclear physics Fluid dynamics Chemistry
Psychology
Marketing
Data science: using data to acquire insights
“Science is not a political construct or a belief system.
Scientific progress depends on openness, transparency,
and the free flow of ideas and people.”
— Dr. Rush Holt, CEO of AAAS,
testimony to the House Committee on Science, Space, and Technology, Feb 8, 2017
Science helps shape society
Growth in a time of debt [Reinhart & Rogoff 2010]:
Wrong conclusions due to flawed Excel processing
⇒ Public debt blamed for financial crisis (Osborne UK MP)
Autism and vaccines:
forged study: [Wakefield et al, Lancet 1998]
⇒ Drop in vaccination, measles outbreak
Loss of trust in science is very costly
Innovation
Putting the right technology to the right use
Light bulb:
Invented ∼ 1835 by Lindsay
Extra progress: vacuum pumps (Swan ∼ 1880)
Economics: availability of electric power
⇒ Edison’s company
Outbox: company digitizing physical mail
But citizens aren’t the USPS customers, junk mailers are
⇒ No cooperation from USPS, Outbox dies
Power balances drive innovation as much as technology
Coding for science and innovation:
Computing is the new electricity:
a driver for change
With new data sources,
it reaches beyond physics & engineering
Coding for science and innovation:
1 Coding as a scientist
2 Building software for science
3 An ecosystem
1 Coding as a scientist
1 Data in brain sciences
The mental world
cognition, emotions
autism, depression
Historically studied
via verbal interactions
Psychology
The brain
an organ:
neurons, firing
Imaging brain activity
Quantitative data
1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
Comparing the brain activity of many subjects
Supervised machine learning to discriminate Autism
1. Extract brain networks
Unsupervised feature learning
complex model fit to 1 TB of data
2. Per-subject connections
Information geometry,
Lie algebra...
3. Supervised learning
Scikit-learn
Limits to impact:
Cannot outperform clinicians that define Autism/Control
Psychiatrists unhappy with current blurry definition
But not ready to accept black-box algorithmic definition
Lots of moving parts
Practitioners need to
make the tools theirs
1 A quest for trust: reproducible research
“if it’s not open and verifiable by others, it’s not science,
or engineering, or whatever it is you call what we do”
— V. Stodden, The scientific method in practice
Computational reproducibility:
Automate everything
Control the environment
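As a minimal sketch of the two items above, the snippet below automates a stand-in analysis and records the environment it ran in. The names `run_analysis` and `run_with_provenance` are hypothetical, and the "analysis" is just a seeded random draw; a real pipeline would also pin its package versions.

```python
import json
import platform
import random
import sys


def run_analysis(seed):
    """Stand-in for a real analysis: a seeded draw, so that reruns agree."""
    rng = random.Random(seed)  # local generator, no hidden global state
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(sample) / len(sample)


def run_with_provenance(seed=0):
    """Run the analysis and return the result with a record of the environment."""
    return {
        "result": run_analysis(seed),
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    # Storing this record next to the outputs lets others check how they were made.
    print(json.dumps(run_with_provenance(seed=42), indent=2))
```

Running the script twice with the same seed on the same machine should give the same record, which is what makes the step auditable.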
1 Automate everything
Just a simple matter of programming
Some operations work better with a human in the loop
Scientific research is an iterative process
Tension between needs for interaction and replay
Mayavi
Reflexivity between dialogs and objects
Record mode
Jupyter, and its widgets:
Exploring the space between interaction and code
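One way to picture the tension between interaction and replay is a session that logs each interactive step so it can be re-run later. This is only an illustrative sketch: `RecordingSession`, `threshold` and `scale` are hypothetical names, and it is not how Mayavi's record mode or Jupyter are implemented.

```python
class RecordingSession:
    """Apply operations interactively while logging them for later replay."""

    def __init__(self, operations):
        self.operations = operations  # name -> function(data, **params)
        self.log = []                 # recorded (name, params) pairs

    def apply(self, data, name, **params):
        """Run one step now and remember how it was called."""
        result = self.operations[name](data, **params)
        self.log.append((name, params))
        return result

    def replay(self, data):
        """Re-run every recorded step, in order, on fresh data."""
        for name, params in self.log:
            data = self.operations[name](data, **params)
        return data


def threshold(data, minimum):
    return [x for x in data if x >= minimum]


def scale(data, factor):
    return [x * factor for x in data]


session = RecordingSession({"threshold": threshold, "scale": scale})
raw = [1, 5, 2, 8]
step1 = session.apply(raw, "threshold", minimum=2)  # explore step by step
step2 = session.apply(step1, "scale", factor=10)
replayed = session.replay(raw)                      # same steps, no human
```

The log plays the role of a generated script: the exploration stays interactive, yet the final result can be reproduced without the human in the loop.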
1 Beyond computational reproducibility
Make every computational step reproducible,
and good science will emerge
Estimating the reproducibility of psychological science
[Science 2015] 36% of effects replicate
Reasons:
Statistical challenges: analysis degrees of freedom
Weak incentives: winner’s curse in publication
Seldom computational reproducibility
I think that reproducibility is a misnomer.
What matters is that operations be
verifiable or reusable.
1 Managing complexity
In practice, the best way to improve research
is to use the right (conceptual) tools.
The everyday roadblock is cognitive load
Machine learning, brain anatomy, psychology
R, Python, shell scripts
Funding agencies, reviewer 3, courting VCs
Coding as a scientist
Final code should be auditable,
ideally reusable
Tension between interactive computing
& automating
Main enemy: cognitive overload
In industry
Reusable
Verifiable? Not for Silicon Valley,
but in insurance, healthcare, banking...
Moving data-scientist code
to production?
Software projects going over budget?
Code quality in exploratory work
(practices listed in order of increasing cost)
Use pyflakes in your editor seriously
Coding convention, good naming
Version control: use git + GitHub
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
Avoid premature software engineering
Over versus under engineering
Goal is generating insights / moving in new spaces
Experimentation for intuitions and proofs of concepts
⇒ new ideas
As the path becomes clear: consolidation
is great for that
Heavy engineering too early freezes bad ideas
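To make the unit-testing item concrete, here is a small sketch of what a cheap test can look like once a piece of exploratory code starts to stabilize. The function `rescale` and its tests are hypothetical examples, not taken from the talk.

```python
import unittest


def rescale(values):
    """Linearly rescale values to the [0, 1] range."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError(f"cannot rescale constant data: all values equal {low!r}")
    return [(v - low) / (high - low) for v in values]


class TestRescale(unittest.TestCase):
    def test_bounds(self):
        self.assertEqual(rescale([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_is_rejected(self):
        # The edge case that tends to break silently in exploratory code
        with self.assertRaises(ValueError):
            rescale([3, 3, 3])


def run_tests():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRescale)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Two short tests like these cost little, which fits the advice above: consolidate as the path becomes clear rather than engineering heavily from the start.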
2 Building software for science
The point of view of the developer
Libraries are what enable us to scale:
Abstractions reduce cognitive load
Code reuse gets us further
2 Examples of such libraries
scikit-learn
Make research in machine-learning
models and algorithms usable by people
who do not understand them
Challenges:
Variety of that space
Statistical concepts coding concepts
nilearn
Make it easy to answer neuroimaging
problems with them
Challenges: Onboarding technology-averse users
2 Tools that reduce cognitive overload
It’s a design problem
Jonathan Ive, an industrial designer, is #4 at Apple
Code different.
2 Some API design principles for the scipy stack
Consistency, consistency, consistency
  np.save(file, obj) versus pickle.dump(obj, file)
  fmin(..., maxiter=10) versus lsq_linear(..., max_iter=10)
  Inconsistency creates cognitive overload
Functions are easier to understand than classes
  Objects have hidden states
  Objects have no universal interface, entry point, output
A library should hinge on a small number of concepts
  How much do usage patterns carry over across the library?
Common data containers make the ecosystem stronger
  Facilitates working with multiple libraries together
  Easier to get up to speed with a given library
Each function should have one and only one purpose
  Avoid changes of behavior depending on input type
Code for interfaces, but don’t overdo duck typing
  Interfaces define objects
  Incompatible behaviors lead to bugs (eg np.matrix)
Properties are for impedance matching
  Properties obfuscate the data model of the object
  Properties can create hidden compute costs
Shallow is better than deep
  Objects are understood by their surface
  Composition creates cognitive overload
Error messages matter
  Explain the problem
  Print the offending value
Be Pythonic
  Avoid syntax hacks
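The "Error messages matter" principle (explain the problem, print the offending value) can be sketched as follows. The helper `check_n_components` and its parameter names are hypothetical, chosen only for illustration; they are not an API of any scipy-stack library.

```python
def check_n_components(n_components, n_features):
    """Validate a hypothetical parameter, with errors that help the user."""
    if isinstance(n_components, bool) or not isinstance(n_components, int):
        raise TypeError(
            "n_components must be an integer, "
            f"got {n_components!r} of type {type(n_components).__name__}"
        )
    if not 1 <= n_components <= n_features:
        # Explain the constraint and show the offending value
        raise ValueError(
            f"n_components must be between 1 and n_features={n_features}, "
            f"got n_components={n_components!r}"
        )
    return n_components
```

Compared with a bare `ValueError("invalid parameter")`, the message tells the user which argument is wrong, what was expected, and what was actually passed.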
2 Scikit-learn API
Scikit-learn cheat sheet
Fit and predict
>>> estimator = Estimator(param1=param1)
>>> estimator.fit(X_train, y_train)
>>> y_test = estimator.predict(X_test)
Transform data
>>> X_red = estimator.transform(X_test)
The estimator is a “contract”
(slightly more elaborate than above)
It has created an ecosystem of packages
Based on duck-typing, not inheritance
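The duck-typing point can be illustrated without scikit-learn itself. Below, `MeanRegressor` is a hypothetical toy estimator that inherits from nothing, and `holdout_error` is a hypothetical helper that relies only on the `fit` and `predict` methods; the real scikit-learn contract is, as noted above, slightly more elaborate.

```python
class MeanRegressor:
    """Toy estimator: always predicts the mean of the training targets."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage  # hyper-parameter, set at construction

    def fit(self, X, y):
        # State learned from data is stored with a trailing underscore
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def holdout_error(estimator, X, y, n_train):
    """Mean absolute error on held-out samples; needs only fit and predict."""
    estimator.fit(X[:n_train], y[:n_train])
    predictions = estimator.predict(X[n_train:])
    errors = [abs(p - t) for p, t in zip(predictions, y[n_train:])]
    return sum(errors) / len(errors)


X = [[0], [1], [2], [3]]
y = [1.0, 3.0, 2.0, 4.0]
error = holdout_error(MeanRegressor(), X, y, n_train=2)
```

Because `holdout_error` never checks the estimator's class, any object honoring the same two-method contract can be passed in, which is what lets third-party packages plug into the ecosystem.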
2 numpy arrays
ndarray
Abstraction over pointers & operation
Contract: the memory layout
IMHO, gone too far in number of methods (163)
The array protocol makes it easy to quack like an array
PS: The ecosystem needs categorical dtypes in numpy
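The idea that the contract is the memory layout can be sketched in pure Python: a view is a flat buffer plus a shape and strides. `StridedView` is a hypothetical teaching class, with strides counted in elements rather than bytes and no bounds checking; it is a simplification, not how numpy is implemented.

```python
class StridedView:
    """A 2D view on a flat buffer, described only by shape and strides."""

    def __init__(self, buffer, shape, strides, offset=0):
        self.buffer = buffer
        self.shape = shape
        self.strides = strides  # in elements, not bytes (a simplification)
        self.offset = offset

    def __getitem__(self, index):
        i, j = index
        position = self.offset + i * self.strides[0] + j * self.strides[1]
        return self.buffer[position]

    def transpose(self):
        # No data is copied: only the description of the layout changes
        return StridedView(self.buffer, self.shape[::-1],
                           self.strides[::-1], self.offset)

    def tolist(self):
        rows, cols = self.shape
        return [[self[i, j] for j in range(cols)] for i in range(rows)]


data = list(range(6))                                # one flat block of memory
a = StridedView(data, shape=(2, 3), strides=(3, 1))  # row-major 2 x 3 view
b = a.transpose()                                    # 3 x 2 view, same buffer
```

Here `a.tolist()` gives `[[0, 1, 2], [3, 4, 5]]` and `b.tolist()` gives `[[0, 3], [1, 4], [2, 5]]`, while both views share the same underlying list.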
2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
User flow on the scikit-learn website:
Examples
User flow on the nilearn website:
Examples
Sphinx-gallery: compiling scripts in an examples gallery
reStructuredText formatting
Capturing outputs
Links to function docs
+ Creates Jupyter notebooks
Insert links to examples
containing a function
2 Building great documentation
Focus on explaining concepts (hint: write plain English)
Less is more: prioritize, avoid redundancy
Code examples must be short (link to full tutorial examples)
Links everywhere: users will land at the wrong place
Teach with the docs
Plan for maintenance of docs:
Continuous integration
Check links
Run examples
Doctests
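The doctest item in the list above can be sketched with the standard library alone. The function `center` and the helper `check_docstring` are hypothetical; a project would more typically let its test runner or documentation build collect doctests.

```python
import doctest


def center(values):
    """Subtract the mean from each value.

    >>> center([1, 2, 3])
    [-1.0, 0.0, 1.0]
    """
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def check_docstring(func):
    """Run the examples in func's docstring; return (failures, tries)."""
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.failures, runner.tries
```

Running such a check in continuous integration means the short example in the documentation cannot silently drift away from the code's behavior.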
2 Reusable science
scikit-learn is the new machine-learning textbook
nilearn is the new neuroimaging review article
Experiments reproduced
at each commit
eg: brain reading
nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html
Resource intensive CI:
Data ⇒ Fight for good open data
Computation ⇒ Find good algorithms and tradeoffs
Forces us to distill the literature (as a review)
Package development consolidates
science and moves it outside the lab
3 An ecosystem
A bird’s eye view on scientific packages
3 Packages of the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10000), log-log axes. Labelled packages: simplejson #1, six #2, setuptools #3, numpy #49, scikit-learn #110, joblib #431, nilearn #2877]
A small number of packages
are used by many
1/f distribution, preferential attachment
nilearn relies on scikit-learn & joblib that rely on numpy...
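To give a feel for what a 1/f distribution implies, the sketch below assumes an idealized law in which downloads are exactly proportional to 1/rank. This is an assumption for illustration, not a fit to the PyPI data in the figure, and `zipf_share` is a hypothetical helper.

```python
def zipf_share(top, n_packages):
    """Fraction of all downloads going to the `top` best-ranked packages,
    assuming downloads proportional to 1 / rank (an idealization)."""
    weights = [1.0 / rank for rank in range(1, n_packages + 1)]
    return sum(weights[:top]) / sum(weights)


# Share of the 10 most downloaded packages among 10000
share = zipf_share(top=10, n_packages=10_000)
```

Under this idealized law the top 10 packages out of 10000, that is 0.1% of them, receive roughly 30% of all downloads, which illustrates how a small number of packages can be used by many.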
3 Standing on the shoulders of maintainers
May 31st: pip broken
https://github.com/pypa/setuptools/pull/1043
Left-pad:
How left-padding strings broke the Internet
A JavaScript package for left-padding strings was removed from Node’s package manager, breaking all the websites that depended on it.
3 Dependencies
Beyond installation, a challenge is to ensure package
versions play well together: correctness of the code
Breakage of backward compatibility
yields irreconcilable dependencies
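A minimal sketch of how a backward-incompatible release can make dependencies irreconcilable: each package accepts a half-open range of versions of a shared dependency, and installation is possible only if the ranges intersect. All package names and version numbers below are hypothetical.

```python
def intersect(range_a, range_b):
    """Intersect two half-open version ranges [low, high); None if empty."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return (low, high) if low < high else None


# Hypothetical requirements on a shared dependency "corelib",
# with versions written as (major, minor) tuples
needs_of_app = ((1, 0), (2, 0))     # corelib >= 1.0, < 2.0
needs_of_tool = ((1, 4), (2, 0))    # corelib >= 1.4, < 2.0
needs_of_plugin = ((2, 0), (3, 0))  # corelib >= 2.0, < 3.0, after an API break

compatible = intersect(needs_of_app, needs_of_tool)
conflict = intersect(needs_of_app, needs_of_plugin)
```

Here `compatible` is `((1, 4), (2, 0))`, so some version satisfies both packages, while `conflict` is `None`: no single version of the dependency can serve the application and the plugin at once.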
3 Dependencies and their upgrade
It’s a fact: users hate upgrading
If it ain’t broken, don’t fix it
even if it is, apparently
3 Declaring undependence?
Monolithic packages with no dependencies...
But:
Scaling is hard
Complexity grows as square of codebase size
[Woodfield 1979]
User support grows with userbase size
3 Core software is infrastructure
Everybody uses it every day
In industry, education, & research
It needs maintenance
Like roads (or OpenSSL, to prevent Heartbleed)
Central infrastructure packages are “boring”
They are understaffed and underfunded
References: “Roads and Bridges”, Ford Foundation report
Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw
@GaelVaroquaux
Coding for science and innovation
New science
High value of bringing new methods to a field
⇒ Enable domain-specialists
Rapid iteration, but with automation & consolidation
Software tools
Scientists are limited by cognitive load
⇒ Design of API and documentation in libraries
Libraries make science reproducible and reusable
An ecosystem
Central packages hold the ecosystem together
Thanks to: the scipy community

More Related Content

PDF
Succeeding in academia despite doing good_software
PDF
On the code of data science
PDF
Open Source Scientific Software
PDF
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
PDF
Scikit-learn: the state of the union 2016
PDF
Building a cutting-edge data processing environment on a budget
PDF
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
PPTX
Binary Analysis - Luxembourg
Succeeding in academia despite doing good_software
On the code of data science
Open Source Scientific Software
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn: the state of the union 2016
Building a cutting-edge data processing environment on a budget
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Binary Analysis - Luxembourg

Similar to Coding for science and innovation (20)

PDF
Better neuroimaging data processing: driven by evidence, open communities, an...
PDF
Computational practices for reproducible science
PDF
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
PPTX
Germany 20180424 v8
PPTX
Chabot 20180404 v5
PDF
Data Science: Notes and Toolkits
PPTX
Uidp 20180404 v6
PDF
STEAM Workshops with Binder and JupyterHub
PDF
2014-10-10-SBC361-Reproducible research
PDF
2013 10-30-sbc361-reproducible designsandsustainablesoftware
PDF
2020_02_21 «Teaching Informatics to All: a European perspective»
PPTX
Ntegra 20180523 v10 copy.pptx
PDF
OAI7 Research Objects
PPTX
Computational Thinking - a 4 step approach and a new pedagogy
PPTX
AIDR2019 - standards - tools - incentives - what does it take to enable data ...
PPTX
Computational Thinking in the Workforce and Next Generation Science Standards...
PPTX
Code sharing at MediaEval
PDF
Driving Data and Cognitive Sciences Curriculum at the Nexus of Society, Polic...
PPTX
Session 5 coding handson Tensorflow
PDF
AI/ML as an empirical science
Better neuroimaging data processing: driven by evidence, open communities, an...
Computational practices for reproducible science
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Germany 20180424 v8
Chabot 20180404 v5
Data Science: Notes and Toolkits
Uidp 20180404 v6
STEAM Workshops with Binder and JupyterHub
2014-10-10-SBC361-Reproducible research
2013 10-30-sbc361-reproducible designsandsustainablesoftware
2020_02_21 «Teaching Informatics to All: a European perspective»
Ntegra 20180523 v10 copy.pptx
OAI7 Research Objects
Computational Thinking - a 4 step approach and a new pedagogy
AIDR2019 - standards - tools - incentives - what does it take to enable data ...
Computational Thinking in the Workforce and Next Generation Science Standards...
Code sharing at MediaEval
Driving Data and Cognitive Sciences Curriculum at the Nexus of Society, Polic...
Session 5 coding handson Tensorflow
AI/ML as an empirical science
Ad

More from Gael Varoquaux (20)

PDF
Evaluating machine learning models and their diagnostic value
PDF
Measuring mental health with machine learning and brain imaging
PDF
Machine learning with missing values
PDF
Dirty data science machine learning on non-curated data
PDF
Representation learning in limited-data settings
PDF
Functional-connectome biomarkers to meet clinical needs?
PDF
Atlases of cognition with large-scale human brain mapping
PDF
Similarity encoding for learning on dirty categorical variables
PDF
Machine learning for functional connectomes
PDF
Towards psychoinformatics with machine learning and brain imaging
PDF
Simple representations for learning: factorizations and similarities
PDF
A tutorial on Machine Learning, with illustrations for MR imaging
PDF
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
PDF
Scientist meets web dev: how Python became the language of data
PDF
Machine learning and cognitive neuroimaging: new tools can answer new questions
PDF
Social-sparsity brain decoders: faster spatial sparsity
PDF
Inter-site autism biomarkers from resting state fMRI
PDF
Brain maps from machine learning? Spatial regularizations
PDF
Scikit-learn for easy machine learning: the vision, the tool, and the project
PDF
Simple big data, in Python
Evaluating machine learning models and their diagnostic value
Measuring mental health with machine learning and brain imaging
Machine learning with missing values
Dirty data science machine learning on non-curated data
Representation learning in limited-data settings
Functional-connectome biomarkers to meet clinical needs?
Atlases of cognition with large-scale human brain mapping
Similarity encoding for learning on dirty categorical variables
Machine learning for functional connectomes
Towards psychoinformatics with machine learning and brain imaging
Simple representations for learning: factorizations and similarities
A tutorial on Machine Learning, with illustrations for MR imaging
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Scientist meets web dev: how Python became the language of data
Machine learning and cognitive neuroimaging: new tools can answer new questions
Social-sparsity brain decoders: faster spatial sparsity
Inter-site autism biomarkers from resting state fMRI
Brain maps from machine learning? Spatial regularizations
Scikit-learn for easy machine learning: the vision, the tool, and the project
Simple big data, in Python
Ad

Recently uploaded (20)

PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Empathic Computing: Creating Shared Understanding
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Cloud computing and distributed systems.
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Per capita expenditure prediction using model stacking based on satellite ima...
Empathic Computing: Creating Shared Understanding
Spectral efficient network and resource selection model in 5G networks
Building Integrated photovoltaic BIPV_UPV.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
NewMind AI Weekly Chronicles - August'25-Week II
MIND Revenue Release Quarter 2 2025 Press Release
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Unlocking AI with Model Context Protocol (MCP)
Chapter 3 Spatial Domain Image Processing.pdf
Cloud computing and distributed systems.
Dropbox Q2 2025 Financial Results & Investor Presentation
Encapsulation_ Review paper, used for researhc scholars
Review of recent advances in non-invasive hemoglobin estimation
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...

Coding for science and innovation

  • 1. Coding for science and innovation Ga¨el Varoquaux to change the world!
  • 2. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science G Varoquaux 2
  • 3. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Nuclear physics Fluid dynamics Chemistry G Varoquaux 2
  • 4. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Psychology G Varoquaux 2
  • 5. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Psychology Marketting Data science: using data to acquire insights G Varoquaux 2
  • 6. Science The process of discovering knowledge and mechanisms “Science is not a political construct or a belief sys- tem. Scientific progress depends on openness, trans- parency, and the free flow of ideas and people.” — Dr. Rush Holt, CEO of AAAS, testimony to the House Committee on Science, Space, and Tech- nology, Feb 8, 2017 G Varoquaux 3
  • 7. Science The process of discovering knowledge and mechanisms Science helps shaping society Growth in a time of debt [Reinhart & Rogoff 2010]: Wrong conclusions due to flawed Excel processing ⇒ Public debt blamed for financial crisis (Osborne UK MP) Autism and vaccines: forged study: [Wakefield et al, Lancet 1998] ⇒ Drop in vaccination, measles outbreak Loss of trust in science is very costly G Varoquaux 3
  • 8. Innovation Putting the right technology to the right use G Varoquaux 4
  • 9. Innovation Putting the right technology to the right use Light blub: Invented ∼ 1835 by Lindsay Extra progress: vaccum pumps (Swan ∼ 1880) Economics: availability of electric power ⇒ Edison’s company G Varoquaux 4
  • 10. Innovation Putting the right technology to the right use Light blub: Invented ∼ 1835 by Lindsay Extra progress: vaccum pumps (Swan ∼ 1880) Economics: availability of electric power ⇒ Edison’s company Outbox: company digitizing physical mail But citizens aren’t the USPS customers, junk mailers are ⇒ No cooperation from USPS, Outbox dies Power balances drive innovation as much as technology G Varoquaux 4
  • 11. Coding for science and innovation: Computing is the new electricity: a driver for change With new data sources, it reaches beyond physics & engineering G Varoquaux 5
  • 12. Coding for science and innovation: 1 Coding as a scientist 2 Building software for science 3 An ecosystem G Varoquaux 6
  • 13. 1 Coding as a scientist G Varoquaux 7
  • 14. 1 Data in brain sciences The mental world cognition, emotions autism, depression Historically studied via verbal interactions Psychology G Varoquaux 8
  • 15. 1 Data in brain sciences The mental world cognition, emotions autism, depression Historically studied via verbal interactions The brain an organ: neurons, firing Imaging brain activity Quantitative data G Varoquaux 8
  • 16. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] Comparing the brain activity of many subjects Supervised machine learning to discriminate Autism G Varoquaux 9
  • 17. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks Unsupervised feature learning complex model fit to 1Tb data G Varoquaux 9
  • 18. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections Information geometry, Lie algebra... G Varoquaux 9
  • 19. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn G Varoquaux 9
  • 20. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn Limits to impact: Cannot outperform clinicians that define Autism/Control Psychiatrists unhappy with current blurry definition But not ready to accept black-box algorithmic definition G Varoquaux 9
  • 21. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn Limits to impact: Cannot outperform clinicians that define Autism/Control Psychiatrists unhappy with current blurry definition But not ready to accept black-box algorithmic definition Lots of moving parts Practitioners need to make the tools theirs G Varoquaux 9
  • 22. 1 A quest for trust: reproducible research “if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do” — V. Stodden, The scientific method in practice Computational reproducibility: Automate everything Control the environment G Varoquaux 10
  • 23. 1 Automate everything Just a simple matter of programming G Varoquaux 11
  • 24. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay G Varoquaux 11
  • 25. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay Mayavi Reflexivity between dialogs and objects Record mode G Varoquaux 11
  • 26. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay Jupyter, and its widgets: Exploring the space between interaction and code G Varoquaux 11
  • 27. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge G Varoquaux 12
  • 28. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge Estimating the reproducibility of psychological science [Science 2015] 36% of effects replicate Reasons: Statistical challenges — analysis degrees of freedom Weak incentives — winner’s curse in publication Seldom computational reproducibility G Varoquaux 12
  • 29. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge Estimating the reproducibility of psychological science [Science 2015] 36% of effects replicate Reasons: Statistical challenges — analysis degrees of freedom Weak incentives — winner’s curse in publication Seldom computational reproducibility I think that reproducibility is a misnomer. What matters is that operations be verifiable or reusable. G Varoquaux 12
  • 30. In practice, the best way to improve research is to use the right (conceptual) tools. G Varoquaux 13
  • 31. 1 Managing complexity In practice, the best way to improve research is to use the right (conceptual) tools. The everyday roadblock is cognitive load Machine learning, brain anatomy, psychology R, Python, shell scripts Funding agencies, reviewer 3, courting VCs G Varoquaux 14
  • 32. Coding as a scientist Final code should be auditable, ideally reusable Tension between interactive computing & automating Main enemy: cognitive overload G Varoquaux 15
  • 33. Coding as a scientist Final code should be auditable, ideally reusable Tension between interactive computing & automating Main enemy: cognitive overload In the industry Reusable Verifiable? Not for Silicon Valley, but in insurance, healthcare, banking... Moving data-scientist code to production? Software projects going over budget? G Varoquaux 15
  • 34. Code quality in exploratory work Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... G Varoquaux 16
  • 35. Code quality in exploratory work [Arrow: increasing cost] Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering G Varoquaux 16
  • 36. Code quality in exploratory work [Arrow: increasing cost] Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering Over versus under engineering Goal is generating insights / moving in new spaces Experimentation for intuitions and proofs of concepts ⇒ new ideas As the path becomes clear: consolidation is great for that Heavy engineering too early freezes bad ideas G Varoquaux 16
  • 37. 2 Building software for science The point of view of the developer Libraries are what enables us to scale: Abstractions reduce cognitive load Code reuse gets us further G Varoquaux 17
  • 38. 2 Examples of such libraries scikit-learn Make research in machine-learning models and algorithms usable to people who do not understand them nilearn Make it easy to answer neuroimaging problems with them G Varoquaux 18
  • 39. 2 Examples of such libraries scikit-learn Make research in machine-learning models and algorithms usable to people who do not understand them Challenges: Variety of that space Statistical concepts coding concepts nilearn Make it easy to answer neuroimaging problems with them Challenges: Onboarding technology-averse users G Varoquaux 18
  • 40. 2 Tools that reduce cognitive overload It’s a design problem G Varoquaux 19
  • 41. 2 Tools that reduce cognitive overload Jonathan Ive, an industrial designer, is #4 at Apple Code different. G Varoquaux 20
  • 42. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 21
  • 43. 2 Some API design principles for the scipy stack Consistency, consistency, consistency np.save(file, obj) vs pickle.dump(obj, file); fmin(..., maxiter=10) vs lsq_linear(..., max_iter=10) Creates cognitive overload Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 22
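
The argument-order mismatch between `np.save(file, obj)` and `pickle.dump(obj, file)` can be sketched with the standard library alone. The `save` helper below is hypothetical, written only for this illustration: it copies the `(file, obj)` order of `np.save` and has to flip the arguments internally. The last part shows what happens when a user carries the `np.save` habit over to `pickle.dump`.

```python
import io
import pickle


def save(file, obj):
    """Hypothetical helper using the (file, obj) order of np.save."""
    # pickle.dump expects the opposite order: (obj, file).
    pickle.dump(obj, file)


buffer = io.BytesIO()
save(buffer, {"maxiter": 10})
buffer.seek(0)
restored = pickle.load(buffer)

# Using the np.save argument order with pickle.dump fails, because the
# second argument (a dict here) is not a writable file object.
swapped_error = None
try:
    pickle.dump(io.BytesIO(), {"maxiter": 10})
except TypeError as exc:
    swapped_error = exc
```

Neither order is wrong in itself. The cost comes from having to remember which library uses which.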
  • 44. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes Objects have hidden states, Objects have no universal interface, entry point, output A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 23
  • 45. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts How much do usage patterns carry over across the library? Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 24
  • 46. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Facilitates working with multiple libraries together Easier to get up to speed with a given library Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 25
  • 47. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Change of behavior depending on input type Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 26
  • 48. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Interfaces define objects Incompatible behaviors lead to bugs (eg np.matrix) Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 27
  • 49. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Properties obfuscate the data model of the object Properties can create hidden compute costs Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 28
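
A minimal sketch of the "hidden compute costs" point, assuming a hypothetical `Dataset` container that is not taken from any library. The `mean` property reads like a stored attribute, yet every access walks the whole list. The `n_mean_calls` counter exists only to make that cost visible.

```python
class Dataset:
    """Hypothetical container, used only for illustration."""

    def __init__(self, values):
        self.values = list(values)
        self.n_mean_calls = 0  # counts how often the mean is recomputed

    @property
    def mean(self):
        # Looks like a cheap attribute access, but loops over all values.
        self.n_mean_calls += 1
        return sum(self.values) / len(self.values)


data = Dataset(range(1000))
# The property is re-evaluated on every iteration of the comprehension.
centered = [v - data.mean for v in data.values[:5]]
```

After this runs, `data.n_mean_calls` is 5: the full mean was recomputed once per element, which a reader of the calling code cannot tell from the syntax.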
  • 50. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Objects are understood by their surface Composition creates cognitive overload Error messages matter Be Pythonic G Varoquaux 29
  • 51. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Explain the problem Print the offending value Be Pythonic G Varoquaux 30
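
The two pieces of advice on error messages, explain the problem and print the offending value, might look like the following sketch. The function name `check_n_components` and its parameters are assumptions made for this example, not an existing library API.

```python
def check_n_components(n_components, n_features):
    """Hypothetical validation helper, for illustration only."""
    if not 1 <= n_components <= n_features:
        # State the constraint (the problem) and echo the value received.
        raise ValueError(
            f"n_components must be between 1 and n_features={n_features}, "
            f"got n_components={n_components!r}"
        )
    return n_components


message = None
try:
    check_n_components(20, n_features=10)
except ValueError as exc:
    message = str(exc)
```

A bare "invalid parameter" would force the user to open the source code. Here the message alone says what was expected and what was passed.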
  • 52. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic Avoid syntax hacks G Varoquaux 31
  • 53. 2 Scikit-learn API Scikit-learn cheat sheet Scikit-learn Fit and predict >>> estimator = Estimator(param1=param1) >>> estimator.fit(X_train, y_train) >>> y_test = estimator.predict(X_test) Transform data >>> X_red = estimator.transform(X_test) G Varoquaux 32
  • 54. 2 Scikit-learn API Scikit-learn cheat sheet Scikit-learn Fit and predict >>> estimator = Estimator(param1=param1) >>> estimator.fit(X_train, y_train) >>> y_test = estimator.predict(X_test) Transform data >>> X_red = estimator.transform(X_test) The estimator is a “contract” (slightly more elaborate than above) It has created an ecosystem of packages Based on duck-typing, not inheritance G Varoquaux 32
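
The idea of a contract based on duck typing can be sketched in plain Python. The `MajorityClassifier` and `evaluate` names below are hypothetical and do not come from scikit-learn. As the slide notes, the real contract is more elaborate, so this toy class shows only the `fit`/`predict` part and is not claimed to satisfy the full scikit-learn estimator interface.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label.

    It inherits from no base class; it only provides fit and predict.
    """

    def __init__(self, default=None):
        self.default = default

    def fit(self, X, y):
        counts = Counter(y)
        # A trailing underscore marks an attribute learned during fit.
        self.majority_ = counts.most_common(1)[0][0] if counts else self.default
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def evaluate(estimator, X_train, y_train, X_test, y_test):
    """Accuracy of any object that offers fit and predict."""
    y_pred = estimator.fit(X_train, y_train).predict(X_test)
    return sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)


accuracy = evaluate(
    MajorityClassifier(),
    X_train=[[0], [1], [2]], y_train=["a", "a", "b"],
    X_test=[[3], [4]], y_test=["a", "b"],
)
```

`evaluate` never checks the type of `estimator`. Any third-party object with the same two methods would work, which is how a shared contract lets separate packages interoperate.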
  • 55. 2 numpy arrays [Figure: digits stored in memory, viewed as an array] ndarray Abstraction over pointers & operation Contract: the memory layout IMHO, gone too far in number of methods (163) The array protocol makes it easy to quack like an array PS: The ecosystem needs categorical dtypes in numpy G Varoquaux 33
  • 56. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn G Varoquaux 34
  • 57. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn User flow on the scikit-learn website: Examples G Varoquaux 34
  • 58. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn User flow on the nilearn website: Examples G Varoquaux 34
  • 59. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery G Varoquaux 34
  • 60. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery Restructured text formatting Capturing outputs Links to function docs +Creates Jupyter notebooks G Varoquaux 34
  • 61. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery Insert links to examples containing a function G Varoquaux 34
  • 62. 2 Building great documentation Focus on explaining concepts (hint: write plain English) Less is more: prioritize, avoid redundancy Code examples must be short (link to full tutorial examples) Links everywhere: users will land at the wrong place Teach with the docs Plan for maintenance of docs: Continuous integration Check links Runs examples Doctests G Varoquaux 35
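
One way to keep documentation examples from rotting is to run them as doctests. The sketch below uses only the standard `doctest` module. The `left_pad` function is a hypothetical stand-in for any documented function, and the explicit parser and runner calls are one possible setup, not the configuration used by any project named in the slides.

```python
import doctest


def left_pad(text, width, fill=" "):
    """Pad ``text`` on the left up to ``width`` characters.

    >>> left_pad("42", 5, "0")
    '00042'
    >>> left_pad("hello", 3)
    'hello'
    """
    return fill * max(0, width - len(text)) + text


# Extract the examples from the docstring and execute them.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    left_pad.__doc__, {"left_pad": left_pad}, "left_pad", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # TestResults(failed=..., attempted=...)
```

In continuous integration the same check is usually run over whole modules, for instance with `python -m doctest` on a file, so that a change in behavior shows up as a failing documentation example.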
  • 63. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html G Varoquaux 36
  • 64. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html Resource intensive CI: Data ⇒ Fight for good open data Computation ⇒ Find good algorithms and tradeoffs Forces us to distill the literature (as a review) G Varoquaux 36
  • 65. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html Package development consolidates science and moves it outside the lab G Varoquaux 36
  • 66. 3 An ecosystem A bird’s eye view on scientific packages G Varoquaux 37
  • 67. 3 Packages of the Python ecosystem [Figure: number of PyPI downloads versus package rank, log-log axes] A small number of packages are used by many 1/f distribution, preferential attachment G Varoquaux 38
  • 68. 3 Packages of the Python ecosystem [Figure: number of PyPI downloads versus package rank, log-log axes; labeled packages: simplejson #1, six #2, setuptools #3, numpy #49, scikit-learn #110, joblib #431, nilearn #2877] A small number of packages are used by many 1/f distribution, preferential attachment nilearn relies on scikit-learn & joblib that rely on numpy... G Varoquaux 38
  • 69. 3 Standing on the shoulders of maintainers May 31st: pip broken https://github.com/pypa/setuptools/pull/1043 Left-pad: How left-padding strings broke the Internet A JavaScript package for left padding strings was removed from node’s package manager, breaking all the websites that depended on it. G Varoquaux 39
  • 70. 3 Dependencies Beyond installation, a challenge is to ensure package versions play well together: correctness of the code Breakage of backward compatibility yields irreconcilable dependencies G Varoquaux 40
  • 71. 3 Dependencies and their upgrade It’s a fact: users hate upgrading If it ain’t broken, don’t fix it even if it is, apparently G Varoquaux 41
  • 72. 3 Declaring undependence? Monolithic packages with no dependencies... But: Scaling is hard Complexity grows as square of codebase size [Woodfield 1979] User support grows with userbase size G Varoquaux 42
  • 73. 3 Core software is infrastructure Everybody uses it everyday In industry, education, & research G Varoquaux 43
  • 74. 3 Core software is infrastructure Everybody uses it everyday In industry, education, & research It needs maintenance Like roads (or OpenSSL, to prevent Heartbleed) Central infrastructure packages are “boring” They are understaffed and underfunded References: “Roads and Bridges” Ford Foundation report Excellent talk by Heather Miller https://www.youtube.com/watch?v=17yy5BwIiTw G Varoquaux 43
  • 75. @GaelVaroquaux Coding for science and innovation New science High value of bringing new methods to a field ⇒ Enable domain-specialists Rapid iteration, but with automation & consolidation Software tools Scientists are limited by cognitive load ⇒ Design of API and documentation in libraries Libraries make science reproducible and reusable An ecosystem Central packages hold the ecosystem together Thanks to: the scipy community