Reproducibility
in Scientific Data Analysis
Samuel Lampa @smllmp
PhD Student
Pharmaceutical Bioinformatics at pharmb.io
with Assoc. Prof. Ola Spjuth @ola_spjuth
@ Dept. of Pharm. Biosci. / Uppsala University
Farmbio BioScience Seminar – Dec 16 2016
Structure of this talk
Reproducibility in Scientific Data Analysis …
● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?
What is it?
“it” = reproducibility in scientific data analysis
reproducible ≠ replicable
reproducible ≠ correct
Why is it important?
● Data generation is increasingly automated
→ The focus shifts more and more to data analysis
● Culture of replicability not (yet) as established
in computational as in classical disciplines
● “it is the only thing that an investigator can
guarantee about a study”
simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important
Why is it a problem?
wet lab data analysis?
● Complexity of computing environment
– Software versions, Data versions ...
● More black box components
● Assumptions on computing
environment often left out
● Manual steps often left out
What can we do about it?
Utopia: Infrastructure for all data and
computations to be inspected and re-run
with other data and parameters by anyone
But: We can’t wait for that
In the meantime: Even small steps towards
reproducibility will help. Start today!
General themes
Know exactly what data and results mean
Know exactly how results were obtained
Be able to get same result independently
More concretely ...
Know exactly what data and results mean
– Open standards, Ontologies, Data formats
Know exactly how results were obtained
– Keeping track of manual steps, parameters, versions of
software and data ...
– Version control
– Automation (scripts)
Be able to get same result independently
– code, data, and scripts … make it all available!
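
One small step in this direction can be sketched as a script that writes a provenance record next to each result. This is a minimal sketch, not a method from the talk: the function names (`sha256_of_file`, `write_provenance`), the record fields, and the JSON layout are all assumptions chosen for illustration.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(input_path, output_path, parameters, record_path):
    """Write a JSON record of how output_path was obtained (hypothetical layout)."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        # Software version: only the interpreter is recorded in this sketch.
        "python_version": platform.python_version(),
        # Parameters exactly as passed to the analysis.
        "parameters": parameters,
        # Data versions, identified by content checksum rather than file name.
        "input": {"path": input_path, "sha256": sha256_of_file(input_path)},
        "output": {"path": output_path, "sha256": sha256_of_file(output_path)},
    }
    with open(record_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

Calling `write_provenance` at the end of an analysis script, and keeping the record under version control together with the script, ties each result to the data, parameters, and interpreter version that produced it.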
Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for
Reproducible Computational Research. PLoS Comput Biol.
2013;9(10):1-4. dx.doi.org/10.1371/journal.pcbi.1003285
FAIR Principles
for data and metadata
F - Findable
A - Accessible
I - Interoperable
R - Reusable
Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al.
The FAIR Guiding Principles for scientific data management and
stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18.
What does pharmb.io do about it?
● Open data, open source, open standards
Promoting and using as much as possible
● BioImg.org
Store Virtual Machines & Containers
● Semantic Data Technologies
Machine readability - Avoiding ambiguity
● Re-runnable computational experiments
Via workflows, containers, infrastructure as code
O’Boyle NM, Guha R, Willighagen EL, et al.
Open Data, Open Source and Open Standards in chemistry: The
Blue Obelisk five years on. J Cheminform. 2011;3(10):1-16.
doi:10.1186/1758-2946-3-37
BioImg.org
Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O.
BioImg.org: A catalog of virtual machine images for the life sciences.
Bioinform Biol Insights. 2015;9:125-128. doi:10.4137/BBI.S28636.
Martin Dahlö
Semantic Data Technologies
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R,
Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable
biomedical data management. J Biomed Sem. Submitted.
Re-runnable experiments
via containers
(and infrastructure as code)
Marco Capuccini
github.com/kubenow/KubeNow
github.com/mcapuccini/SparkNow
Re-runnable experiments
via workflows
Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in
drug discovery with flow-based programming design principles.
J Cheminform. 2016;8(1):67. doi:10.1186/s13321-016-0179-6.
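
The idea of a re-runnable workflow can be sketched in a few lines. This is a generic illustration, not the flow-based design described in the paper cited above: `Task`, `run_workflow`, and the two example steps are hypothetical names, and the rule "skip a step when its output file already exists" is a simplifying assumption.

```python
import os


class Task:
    """One workflow step: an action that turns input files into one output file."""

    def __init__(self, name, inputs, output, action):
        self.name = name
        self.inputs = inputs
        self.output = output
        self.action = action

    def is_done(self):
        # Simplifying assumption: an existing output file means the step is done.
        return os.path.exists(self.output)


def run_workflow(tasks):
    """Run tasks in the given order and return the names of those executed."""
    executed = []
    for task in tasks:
        if task.is_done():
            continue
        missing = [path for path in task.inputs if not os.path.exists(path)]
        if missing:
            raise FileNotFoundError(f"{task.name}: missing inputs {missing}")
        task.action(task.inputs, task.output)
        executed.append(task.name)
    return executed


def strip_blank_lines(inputs, output):
    """Example step: copy the first input, dropping blank lines."""
    with open(inputs[0], encoding="utf-8") as src, \
            open(output, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(line)


def count_lines(inputs, output):
    """Example step: write the number of lines in the first input."""
    with open(inputs[0], encoding="utf-8") as src:
        line_count = sum(1 for _ in src)
    with open(output, "w", encoding="utf-8") as dst:
        dst.write(f"{line_count}\n")
```

Because every step declares its inputs and output, the whole chain can be re-run by anyone from the raw data with a single call to `run_workflow`, and no manual step is left undocumented.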
Thank you
pharmb.io
