Tools for: 
Open-Source 
Open-Data 
Rob L Davidson about.me/rob.davidson
The problem 
reproducibility.cs.arizona.edu 
• 515 papers (429 conf, 86 journal) 
• <30% reproducible
The Cause 
• Stodden 2010 
– 638 registrants at NIPS 
• 30% share code 
• 20% share data 
http://web.stanford.edu/~vcs/papers/SMPRCS2010.pdf
Publishers must provide! 
Hosting 
Curating 
Citations for everything: 
data, tools + workflows
Tools for Reproducibility 
• Data: GigaDB 
• Images: OMERO 
• Workflows 
– Galaxy 
– Executable Docs 
– VMs
GigaDB 
github.com/gigascience/gigadb-cogini
Hosting all data
Hosting all research objects
Impact for research objects 
• Host 
• Curate 
• Share 
• Cite - DOI
Even more accessible, transparent data? 
Hosting image data with OMERO
Hosting Images 
• Image LIMS 
• Web embedding 
– View online, no 
need for software 
• Full res 
• Link all images to 
publication 
– No cherry picking 
http://www.openmicroscopy.org/site/products/omero
Cyber-Centipedes! Phenotyping 
NO
Accessible Cyber-Centipede images 
OMERO: providing 
access to imaging data 
View, filter, measure raw 
images with direct links 
from journal article. 
See all image data, not 
just cherry picked 
examples. 
Download and reprocess.
OMERO: Adding value
The alternative... 
...look but don't touch
Workflows 
1. Galaxy 
galaxyproject.org
galaxy.cbiit.cuhk.edu.hk
Implement workflows in a community-accepted 
format 
http://galaxyproject.org 
Open source 
Over 45,000 main 
Galaxy server users 
Over 1,000 papers 
citing Galaxy use 
Over 55 Galaxy 
servers deployed
Implement workflows in an intuitive format 
Tool list 
Tool parameterisation 
Results panel 
Copyright NBAF-B 2013
Visualising Workflows
Birmingham Metabo-Galaxy Workflow
Birmingham Metabo-Galaxy 
Tools wrapped in Python and XML 
User sees web form (easy!) 
Data stored centrally (secure!) 
Work done centrally (easy update)
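As a rough illustration of the "tools wrapped in Python and XML" idea: the Python half of such a tool is typically a small command-line script, and the XML half tells Galaxy how to build a web form and a command line for it. The sketch below shows only a hypothetical Python half. The row-scaling task, the function names and the `--input`/`--output` flags are assumptions made for this example, not part of the Birmingham Metabo-Galaxy tools.

```python
import argparse
import csv
import io


def scale_rows(rows):
    """Scale each row of numbers so that it sums to 1.0."""
    scaled = []
    for row in rows:
        total = sum(row)
        # Leave all-zero rows untouched to avoid dividing by zero.
        scaled.append([value / total for value in row] if total else list(row))
    return scaled


def run(in_handle, out_handle):
    """Read tab-separated numbers, scale each row, write tab-separated output."""
    reader = csv.reader(in_handle, delimiter="\t")
    rows = [[float(value) for value in line] for line in reader if line]
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerows(scale_rows(rows))


def main(argv=None):
    """Command-line entry point; a wrapper would supply the two file paths."""
    parser = argparse.ArgumentParser(description="Scale each row to sum to 1.")
    parser.add_argument("--input", required=True)   # hypothetical flag name
    parser.add_argument("--output", required=True)  # hypothetical flag name
    args = parser.parse_args(argv)
    with open(args.input, newline="") as src, open(args.output, "w", newline="") as dst:
        run(src, dst)


# In-memory demonstration, so the sketch runs without any files on disk.
demo_output = io.StringIO()
run(io.StringIO("1\t3\n2\t2\n"), demo_output)
print(demo_output.getvalue(), end="")
```

Because the script only reads one file and writes another, the user never needs to see it: the web form collects the input dataset, and the server runs the script centrally.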
Hosting Workflows 
1) Test data 
2) Software files 
3) Instructions 
+ Galaxy implementation
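One way to make a hosted bundle of test data, software files and instructions checkable is to publish a checksum manifest alongside it. The sketch below is an assumption about how that could be done with the Python standard library; the file names are hypothetical and it does not describe how GigaDB itself stores or verifies files.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            relative = os.path.relpath(full_path, root).replace(os.sep, "/")
            manifest[relative] = sha256_of(full_path)
    return manifest


def changed_files(root, manifest):
    """Return the recorded paths that are now missing or have different content."""
    current = build_manifest(root)
    return sorted(path for path, digest in manifest.items() if current.get(path) != digest)


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as bundle:
    os.makedirs(os.path.join(bundle, "test_data"))
    with open(os.path.join(bundle, "test_data", "reads.txt"), "w") as handle:
        handle.write("ACGT\n")
    with open(os.path.join(bundle, "README.txt"), "w") as handle:
        handle.write("1) fetch test data 2) install software 3) run pipeline\n")
    recorded = build_manifest(bundle)
    print(sorted(recorded))
    print(changed_files(bundle, recorded))
```

Anyone who downloads the bundle can rerun `changed_files` against the published manifest before trying to reproduce the results.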
Can we reproduce results? SOAPdenovo2 S. aureus pipeline
Workflows 
2. Executable Docs
Open lab books, dynamic documents 
• Facilitate reuse and sharing with tools like Knitr, Sweave and IPython Notebook 
• Working towards executable papers…
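The core idea behind these dynamic documents is that every number in the text is recomputed from the data each time the document is built. The sketch below illustrates that idea in plain Python with the standard library. It is not the API of Knitr, Sweave or IPython Notebook, and the measurements and report wording are hypothetical.

```python
import statistics
from string import Template

# Hypothetical measurements, standing in for a real dataset.
MEASUREMENTS = [4.0, 5.0, 6.0, 9.0]

# The prose lives next to the code; the placeholders are filled at build time.
REPORT = Template("We measured $n samples; the mean was $mean and the median was $median.")


def render_report(values):
    """Recompute every number quoted in the text from the data itself."""
    return REPORT.substitute(
        n=len(values),
        mean=f"{statistics.mean(values):.2f}",
        median=f"{statistics.median(values):.2f}",
    )


print(render_report(MEASUREMENTS))
```

If the data change, rerunning the document updates the sentence, so the text and the analysis cannot drift apart.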
Some testimonials for Knitr 
Authors (Wolfgang Huber) 
“I do all my projects in Knitr. Having the textual 
explanation, the associated code and the results all in one 
place really increases productivity, and helps explaining 
my analyses to colleagues, or even just to my future self.” 
Reviewers (Christophe Pouzat) 
“It took me a couple of hours to get the data, the few 
custom developed routines, the “vignette” and to 
REPRODUCE EXACTLY the analysis presented in the 
manuscript. With few more hours, I was able to modify the 
authors’ code to change their Fig. 4. In addition to making 
the presented research trustworthy, the reproducible 
research paradigm definitely makes the reviewer’s job 
much more fun!”
Workflow accessibility: 
VMs
Why VMs? 
• OS settings 
• Dependencies 
– Versions 
– e.g. python! 
• Data + Code linked 
• Download or run in 
cloud
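A VM freezes the whole environment, but even without one it helps to record which OS, Python version and package versions produced a result. The sketch below is a minimal example of such a record using only the standard library (Python 3.8 or later). The function names are hypothetical, and it is a lighter-weight complement to a VM rather than a replacement for one.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect OS, Python and installed-package versions as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


def version_mismatches(recorded, current):
    """Map each package whose version changed to (recorded, current) versions."""
    return {
        name: (version, current.get(name))
        for name, version in recorded.items()
        if current.get(name) != version
    }


snapshot = environment_snapshot()
# The snapshot is JSON-serialisable, so it can be shipped alongside data and code.
print(json.dumps({"os": snapshot["os"], "python": snapshot["python"]}, indent=2))
```

Comparing a recorded snapshot with the current one via `version_mismatches` shows at a glance which dependencies have drifted since the analysis was run.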
VMs in GigaDB
Summary
Share data in GigaDB 
Share all images in GigaDB 
– View images via OMERO 
Share code in GigaDB! 
Share pipeline using: 
Executable docs! 
Galaxy! 
VMs!
Improve 
reproducibility! 
Give us data, papers 
& pipelines* 
Contact us: 
scott@gigasciencejournal.com 
editorial@gigasciencejournal.com 
database@gigasciencejournal.com 
* APCs currently generously covered by BGI until 2015 
www.gigasciencejournal.com
Thanks to: 
Team: Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC) 
Our collaborators: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford) 
Case study: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST) 
Funding from: CBIIT 
@gigascience 
facebook.com/GigaScience 
blogs.biomedcentral.com/gigablog/ 
www.gigadb.org 
galaxy.cbiit.cuhk.edu.hk 
www.gigasciencejournal.com
