Challenges and Guidelines for Reproducible Research
and Interactive Education with Jupyter Notebook
Shih-Cheng Huang, Niema Moshiri, Michael Reich, Peter W. Rose
Introduction
Jupyter Notebooks have the potential to make research more
reproducible. However, in practice, many notebooks fall short of this
promise. Here we identify challenges and propose guidelines to
organize, document, and deploy notebooks to increase reproducibility
and reusability. These guidelines also apply to instructional materials.
Jupyter Notebooks are extensively used in research and education.
We recently organized a workshop at UC San Diego and invited
students, postdocs, research scientists, and faculty to share
experiences and identify challenges of using Jupyter notebooks. The
workshop covered the use of Jupyter in many disciplines, including
astrophysics, bioinformatics, data science, genomics, medicine, and
structural bioinformatics, as well as classroom use with hundreds of
students and publication of reproducible science. Here we present a
summary of our findings in the form of guidelines.
Tell a Story
•Like any good story, a notebook should have a beginning, a middle, and an end
  • Beginning
    • Introduce the topic and the aims of the notebook
  • Middle
    • Explain the steps of the workflow
    • Use Markdown to split the notebook into sections
  • End
    • Interpret the results
    • What trends do tables, plots, or figures show?
•Write the notebook for an audience
  • New users may need instructions on how to run a notebook
  • Educational materials need background and step-by-step explanations
Capture Entire Workflow
•Show how you got from initial data to final results
•Avoid manual steps; use a notebook for data preparation
•Split the workflow into steps, e.g., data preparation, data analysis, data visualization
•Show intermediate results so users can follow the data flow
•Challenges
  • How to combine multiple notebooks into a complex workflow
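A minimal sketch of the workflow guideline (split the work into data preparation, data analysis, and data visualization, and show intermediate results) might look like the following. The input rows, function names, and text-based "plot" are hypothetical and serve only as illustration. In a notebook, each step would typically sit in its own cell.

```python
from statistics import mean

# Hypothetical raw input: "group,value" rows, one with a missing value.
RAW_ROWS = ["a,1.0", "b,", "a,3.0", "b,4.0"]


def prepare(rows):
    """Step 1 (data preparation): parse rows and drop those with missing values."""
    records = []
    for row in rows:
        group, value = row.split(",")
        if value:
            records.append((group, float(value)))
    return records


def analyze(records):
    """Step 2 (data analysis): compute the mean value per group."""
    groups = {}
    for group, value in records:
        groups.setdefault(group, []).append(value)
    return {group: mean(values) for group, values in groups.items()}


def visualize(summary):
    """Step 3 (data visualization): render a simple text bar chart."""
    return "\n".join(
        f"{group}: {'#' * round(value)} ({value:.1f})"
        for group, value in sorted(summary.items())
    )


# Show the intermediate result of every step so readers can follow the data flow.
records = prepare(RAW_ROWS)
print("prepared:", records)
summary = analyze(records)
print("analyzed:", summary)
print(visualize(summary))
```

Because no step depends on manual edits, rerunning the cells from the raw input reproduces the final output.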
Avoid Copy & Paste
•Split common functions into separate files and import them into notebooks
•Challenges
  • Need tools to refactor notebooks
  • Find and extract common code among related notebooks
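As a rough illustration of moving shared code out of notebooks, the sketch below places one function in a module and imports it. The module name `analysis_utils` and the function `normalize` are hypothetical. The module is written to a scratch directory only to keep the example self-contained; in a real project the file would live in the repository next to the notebooks.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source of the hypothetical shared module analysis_utils.py.
MODULE_SOURCE = '''
def normalize(values):
    """Scale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
'''

# Write the module to a scratch directory and make that directory importable.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "analysis_utils.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# Every notebook that needs the function imports it instead of pasting it.
from analysis_utils import normalize

print(normalize([2, 4, 6]))
```

A fix made in `analysis_utils.py` then reaches every notebook that imports it, which pasted copies do not allow.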
Remove Clutter
•Use Markdown to organize a notebook into sections
•Split long notebooks into a series of notebooks
  • Keep a top-level notebook with links to the individual notebooks
•Avoid long cells
  • Split text and code into cells
  • One cell -> one paragraph or one task (e.g., create a plot)
•Modularize code by defining functions or classes
•Reuse code by importing functions or classes
•Put low-level documentation in code comments
•Challenges
  • Need to be able to collapse sections of text or code to hide low-level details, e.g., setup for a plot
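One way to keep a top-level notebook with links to the individual notebooks is to generate the Markdown for its index cell. The sketch below assumes hypothetical file names and a hypothetical helper `notebook_index`. It relies on relative Markdown links in a notebook cell opening the linked notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def notebook_index(notebook_paths):
    """Return Markdown for an index cell that links to each notebook in order."""
    lines = ["# Table of Contents", ""]
    for number, path in enumerate(notebook_paths, start=1):
        # Derive a readable title from the file name.
        title = PurePosixPath(path).stem.replace("_", " ").replace("-", " ")
        # Percent-encode the path so names with spaces still form valid links.
        lines.append(f"{number}. [{title}]({quote(path)})")
    return "\n".join(lines)


# Hypothetical series of notebooks, one per workflow step.
NOTEBOOKS = ["data-preparation.ipynb", "data-analysis.ipynb", "data-visualization.ipynb"]
print(notebook_index(NOTEBOOKS))
```

The printed text can be pasted into a Markdown cell of the top-level notebook.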
Make it Reproducible
•Specify version numbers of dependencies
•Always specify the source/location of the data
•Keep a copy of the raw data if possible
•Make copies of data in a notebook to avoid corrupting datasets when running cells out of order
•Specify random number seeds
•Before saving a notebook, rerun it from top to bottom to ensure a linear execution order
•Challenges
  • Need functionality to freeze the current state and execution order
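Three of the reproducibility guidelines can be sketched with the Python standard library alone: setting a random seed, working on a copy of the raw data, and checking for a linear execution order. The data and helper names below are hypothetical. The order check assumes the notebook has been loaded as a dictionary in the nbformat 4 layout, where each code cell carries an `execution_count`.

```python
import copy
import random

SEED = 42  # fixed seed, stated once near the top of the notebook


def sample_scores(n, seed=SEED):
    """Draw n pseudo-random numbers from an explicitly seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Work on a copy so that re-running a cell never alters the raw data.
raw_data = {"scores": [3, 1, 2]}  # hypothetical raw dataset
working = copy.deepcopy(raw_data)
working["scores"].sort()


def has_linear_execution_order(notebook):
    """True if the code cells were executed as 1, 2, 3, ... from top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook["cells"]
        if cell.get("cell_type") == "code"
    ]
    return counts == list(range(1, len(counts) + 1))


# Hypothetical notebooks reduced to the fields the check needs.
rerun = {"cells": [{"cell_type": "code", "execution_count": 1},
                   {"cell_type": "markdown"},
                   {"cell_type": "code", "execution_count": 2}]}
out_of_order = {"cells": [{"cell_type": "code", "execution_count": 5},
                          {"cell_type": "code", "execution_count": 3}]}

print(sample_scores(3) == sample_scores(3))      # True
print(raw_data["scores"], working["scores"])     # [3, 1, 2] [1, 2, 3]
print(has_linear_execution_order(rerun))         # True
print(has_linear_execution_order(out_of_order))  # False
```

The order check also fails for notebooks that contain unexecuted code cells, which is usually the desired outcome before saving.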
Use Version Control
•Keep your notebooks under version control, e.g., on GitHub
•Describe the content of the repository in README files
•Specify a license to encourage and enable use by others
•Structure your repository
  • See, for example, http://drivendata.github.io/cookiecutter-data-science/
•Challenges
  • Each time a notebook is run, the ipynb file is modified
  • Need a "diff" tool for notebooks
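A common way to limit spurious changes to the ipynb file is to clear outputs and execution counts before committing. Dedicated tools such as nbstripout and nbdime exist for stripping and diffing notebooks. The sketch below is not their implementation. It is a simplified illustration that assumes the notebook is available as a dictionary in the nbformat 4 JSON layout; `strip_outputs` is a hypothetical helper.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical executed notebook with one Markdown cell and one code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

cleaned = strip_outputs(example)
# Stable key order and indentation keep successive versions easy to compare.
print(json.dumps(cleaned, indent=1, sort_keys=True))
```

With outputs removed, rerunning a notebook without editing its source leaves the committed file unchanged.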
Share It!
•Use open source projects as your dependencies
•Add a liberal open source license (e.g., MIT, Apache 2) to your repository
•Use Nbviewer to provide static views of your executed notebook: https://nbviewer.jupyter.org/
•Use Binder to provide a zero-install environment to run your notebooks in the cloud: https://mybinder.org/
•Create a Docker image of your environment: https://docs.docker.com/
•Challenges
  • Data-intensive applications
  • Compute-intensive applications
  • Special hardware requirements, e.g., GPU
  • Multi-step workflows
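Binder builds the environment from configuration files in the repository, for example a `requirements.txt`. As a hedged sketch, the snippet below generates pinned entries for such a file from the packages installed locally. The helper `pinned_requirements` and the package list are hypothetical, and the `lookup` parameter exists only so the function can be exercised without depending on the local environment.

```python
from importlib import metadata


def pinned_requirements(packages, lookup=metadata.version):
    """Return requirements.txt content with each package pinned to one version."""
    lines = []
    for name in sorted(packages):
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Flag missing packages instead of guessing a version.
            lines.append(f"# {name}: not installed in this environment")
    return "\n".join(lines) + "\n"


# Hypothetical dependency list; replace with the packages the notebooks import.
print(pinned_requirements(["pip", "setuptools"]), end="")
```

Writing the returned text to `requirements.txt` at the repository root gives Binder and Docker builds the same pinned versions.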
Acknowledgements
Amanda Birmingham, Ilkay Altintas, Rob Knight, Tiago Leao, Nathan Mih, Mai Nguyen,
Shweta Purawat, Brin Rosenthal, Adam Rule, Britton Smith, Shuai Tang, Guorong Xu
Image credit: 1. http://higher-ed.us/wp-content/uploads/2017/12/copy-and-paste-pictures-10-unbelievable-why-a-blogger-should-never-webmasters-nigeria.jpg, 2. https://blog.prototypr.io/meet-overflow-9b2d926b6093, 3. http://romeo.landinez.co/workflow/rapid-workflow-protoype.html, 4. https://productivitysteps.files.wordpress.com/2016/09/clutter.jpg
We are crowdsourcing a Jupyter Guide for
reproducible research. Please help us at:
https://github.com/sbl-sdsc/jupyter-guide
