Metagenomic Data Provenance and 
Management using the ISA infrastructure 
overview, implementation patterns & software tools 
Alejandra Gonzalez-Beltran, PhD
Eamonn Maguire
alejandra.gonzalezbeltran@oerc.ox.ac.uk
eamonn.maguire@oerc.ox.ac.uk
Metagenomics Bioinformatics, 
EMBL-EBI, Hinxton, UK 
September 2014 
University of Oxford e-Research Centre, UK
Experimental 
Metadata 
Roadmap 
link to analysis platforms 
submission to public 
repositories 
data publication
Experimental Metadata
Notes in lab notebooks (information for humans)
Spreadsheets & tables
RDF statements (information for machines)
It is all about structuring experimental information to make it available to computers and software agents to enable:
• provenance tracking
• assessment and evaluation
• accountability, reliability, trust, evidence
• conservation, preservation, storage, archiving and mining
http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png
The community
A growing ecosystem of over 30 public and internal resources using
the ISA metadata tracking framework (ISA-Tab and/or tools) to
facilitate standards-compliant collection, curation, management and
reuse of investigations in an increasingly diverse set of life science
domains, including:
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• also used by communities working to build a library of cellular signatures
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
The format
Why ISA format and Tools? 
[Figure: an investigation links related studies; each study has associated assays; assays hold pointers to data file names/locations, with external files in native or other formats]
investigation: high level concept to link related studies
study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.
assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
ISA metadata specifications:
• workflow and process orientated
• compatible with checklist enforcement
• compatible with external vocabulary resources
• compatible by design with existing schemas
[Figure example: sources H1 and H2 (H. Sapiens, aged 33 or 35 years) yield samples H1.sample1, H1.sample2 and H2.sample1; a Labeling step produces labeled extracts such as H1.sample1.labeled and H2.sample1.labeled; data files are h1-s1.cel, h1-s2.cel and h2-s1.cel. Formats shown: MAGE-Tab, Pride-xml, SRA-xml]
Essentials about ISA syntax 
• 3 types of files 
• Investigation file: at most one (think executive summary)
–Why? general study description 
–How? methods / protocol declaration 
–How? variable declarations (factors and response variable) 
–Who? contact and affiliation information 
• Study File: true table (think sorting, filtering) 
–What? Listing all biological materials collected over the study course. 
• Assay File: true table (think sorting, filtering) 
–Results! Listing all data files collected by a given assay 
–n files, as many as there are assay types declared
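The three file types can be told apart programmatically. The following is a minimal sketch, assuming the usual ISA-Tab naming convention in which investigation, study and assay files start with i_, s_ and a_ respectively; the file names and the helper functions are hypothetical and only serve as an illustration.

```python
from pathlib import PurePath

# Assumed naming convention: i_ = investigation, s_ = study, a_ = assay.
PREFIX_TO_KIND = {"i_": "investigation", "s_": "study", "a_": "assay"}


def classify_isatab_files(filenames):
    """Group file names by ISA-Tab file type, based on their prefix."""
    groups = {"investigation": [], "study": [], "assay": [], "other": []}
    for name in filenames:
        base = PurePath(name).name
        if base.endswith(".txt"):
            kind = PREFIX_TO_KIND.get(base[:2], "other")
        else:
            kind = "other"  # e.g. raw data files referenced by an assay table
        groups[kind].append(base)
    return groups


def has_at_most_one_investigation(groups):
    """The slide states that there is at most one investigation file."""
    return len(groups["investigation"]) <= 1


# Hypothetical archive content, for illustration only.
example = classify_isatab_files([
    "i_investigation.txt",
    "s_study.txt",
    "a_metagenome.txt",
    "a_metatranscriptome.txt",
    "reads_1.sff",
])
```

Here the two a_ files correspond to two declared assay types, in line with "n files, as many as there are assay types declared".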
Essentials about ISA syntax
• Material Transformations:
– Inputs and Outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name).
– Material Node attributes: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]
– Protocol (Process) attributes: Parameter Value[…], Performer (operator effect), Date (day effect)
– Data File Nodes: Raw Data File, Derived Data File
[Figure: Material, then Protocol/Process, then Material, then Data]
Basic coding patterns
Essentials about ISA syntax
–Branching events: Tabular Representation
[Figure: one Source Material (human volunteer 1) branches into three Sample Materials: peripheral blood, muscle biopsy and liver biopsy]
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
volunteer 1 | Homo sapiens | sample collection | heparinated tube, room temperature | volunteer 1 - sample1 | peripheral blood
volunteer 1 | Homo sapiens | sample collection | liquid nitrogen | volunteer 1 - sample2 | muscle
volunteer 1 | Homo sapiens | sample collection | liquid nitrogen | volunteer 1 - sample3 | liver
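A branching event shows up in the table as the same Source Name repeated on several rows, each with a different Sample Name. As a rough sketch (not part of the ISA tools), the rows of the slide can be read as tab-delimited text with Python's csv module and grouped by source; the function name samples_per_source is made up for this example.

```python
import csv
import io
from collections import defaultdict

HEADER = ["Source Name", "Characteristics[organism]", "Protocol REF",
          "Parameter Value[storage condition]", "Sample Name",
          "Characteristics[organ]"]
ROWS = [
    ["volunteer 1", "Homo sapiens", "sample collection",
     "heparinated tube, room temperature", "volunteer 1 - sample1",
     "peripheral blood"],
    ["volunteer 1", "Homo sapiens", "sample collection",
     "liquid nitrogen", "volunteer 1 - sample2", "muscle"],
    ["volunteer 1", "Homo sapiens", "sample collection",
     "liquid nitrogen", "volunteer 1 - sample3", "liver"],
]
# ISA-Tab tables are tab-delimited text.
STUDY_TABLE = "\n".join("\t".join(cells) for cells in [HEADER] + ROWS)


def samples_per_source(table_text):
    """Map each Source Name to the list of Sample Names derived from it."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["Source Name"]].append(row["Sample Name"])
    return dict(grouped)


# A source with more than one sample is a branching point.
branching = {source: samples
             for source, samples in samples_per_source(STUDY_TABLE).items()
             if len(samples) > 1}
```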
Essentials about ISA syntax
–Pooling events: Tabular Representation
[Figure: three Source Materials (animal 1, animal 2, animal 3) are pooled into one Sample Material: salivary glands]
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
animal 1 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
animal 2 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
animal 3 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
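Pooling is the mirror image of branching: several Source Names share one Sample Name. The sketch below, with a hypothetical helper sources_per_sample, takes the (Source Name, Sample Name) pairs from the slide and reports every sample that has more than one source.

```python
from collections import defaultdict

# (Source Name, Sample Name) pairs taken from the pooling table.
PAIRS = [
    ("animal 1", "pool1"),
    ("animal 2", "pool1"),
    ("animal 3", "pool1"),
]


def sources_per_sample(pairs):
    """Map each Sample Name to the set of Source Names it was derived from."""
    grouped = defaultdict(set)
    for source, sample in pairs:
        grouped[sample].add(source)
    return grouped


# A sample with more than one source is a pool.
pools = {sample: sorted(sources)
         for sample, sources in sources_per_sample(PAIRS).items()
         if len(sources) > 1}
```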
Essentials about ISA syntax
Tagging with Terminologies
• Implicit column order matters:
Source Name | Characteristics[organism] | Factor Value[compound agent] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration] | Factor Value[washout period] | Factor Value[duration] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
individual1 | human
• ISA tools (ISAcreator - ISAconfigurator) provide Ontology term selection and term tagging facilities to help users.
Source Name | Characteristics[organism] | Term Source REF | Term Accession Number | Characteristics[duration] | Unit | Term Source REF | Term Accession Number | Factor Value[compound (htppt://purl] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwerw ta | aspirin | CHEBI | 1231354
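Because Unit, Term Source REF and Term Accession Number columns carry no name of their own, a reader has to attach them to the column that precedes them, which is why column order matters. A minimal sketch of that rule follows; it is an illustration, not the parser used by the ISA tools, and the unit accession is a placeholder because the slide does not show a real one.

```python
TERM_COLUMNS = ("Term Source REF", "Term Accession Number")


def attach_qualifiers(header, row):
    """Attach Unit and term columns to the attribute column preceding them."""
    annotated = []
    target = None  # entry that the next term columns apply to
    for name, value in zip(header, row):
        if name == "Unit" and annotated:
            # A unit qualifies the previous attribute; later term
            # columns then describe the unit itself.
            target = {"value": value}
            annotated[-1]["unit"] = target
        elif name in TERM_COLUMNS and target is not None:
            target[name] = value
        else:
            target = {"column": name, "value": value}
            annotated.append(target)
    return annotated


header = ["Source Name", "Characteristics[organism]", "Term Source REF",
          "Term Accession Number", "Characteristics[duration]", "Unit",
          "Term Source REF", "Term Accession Number"]
# "UO:placeholder" is a stand-in accession, not a real identifier.
row = ["individual1", "Homo sapiens", "NCBITax", "9606",
       "12", "week", "UO", "UO:placeholder"]

result = attach_qualifiers(header, row)
```

With this reading, NCBITax 9606 qualifies the organism, while UO qualifies the unit "week" of the duration, not the duration value itself.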
Experimental design and workflows
Parallel group design
source: http://dx.doi.org/10.1016/S1569-9056(02)00115-X; figure 1
Essentials about ISA syntax
Representing interventions and treatments
• expressing treatments as sets of factor levels
• example: treatment is a tadalafil supplementation
• Factors will be ‘compound’, ‘dose’ and ‘duration’
• (what?, how much?, how long for?)
Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration]
volunteer 1 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
volunteer 2 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
volunteer 3 | Homo sapiens | treatment | placebo | 20 mg/day | 12 weeks
• Implicit column order matters but this is independent from the ISA syntax specification
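Treating a treatment as a set of factor levels means that subjects with the same (compound, dose, duration) combination belong to the same treatment group. The sketch below groups the three volunteers of the slide that way; the dictionary keys and the function treatment_groups are assumptions made for this example.

```python
from collections import defaultdict

FACTORS = ("compound", "dose", "duration")

# Factor values copied from the table; the keys are shortened for brevity.
SUBJECTS = [
    {"Source Name": "volunteer 1", "compound": "tadalafil",
     "dose": "250 mg/day", "duration": "12 weeks"},
    {"Source Name": "volunteer 2", "compound": "tadalafil",
     "dose": "250 mg/day", "duration": "12 weeks"},
    {"Source Name": "volunteer 3", "compound": "placebo",
     "dose": "20 mg/day", "duration": "12 weeks"},
]


def treatment_groups(records, factors=FACTORS):
    """Group subjects by their combination of factor levels."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[factor] for factor in factors)
        groups[key].append(record["Source Name"])
    return dict(groups)


groups = treatment_groups(SUBJECTS)
```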
Cross-over design 
source: Roberts et al. Journal of the International Society of Sports Nutrition 2007 4:25 doi:10.1186/1550-2783-4-25
08/26/13 
Cross-over design
source: doi:10.1371/journal.pone.0037479
Treatment declaration
Assays NMR
The software suite
ISA configurations 
Available from: 
http://isa-tools.org/configurations.html 
https://github.com/ISA-tools/Configuration-Files 
• Assembling workflow archetypes 
• Setting annotation requirements 
–for compliance with database schemas (SRA, MAGE, PRIDE) 
–for compliance with community based requirements (MIAME, 
MIAPE, MIMS, MIxS, …) 
• Guide users 
–Provide pre-assembled templates 
–Specify vocabulary support 
ISAconfigurator: Supporting tool 
https://github.com/ISA-tools/ISAconfigurator
ISA configurations 
• Minimum information about any (x) sequence (MIxS) Guidelines as 
issued by Genomic Standards Consortium 
• ENA-GSC-MIxS checklist XML document: 
–based on MIxS guidelines 
–augmented with a number of regular expressions to further validate/ 
regularize input 
–fixing a number of units used to report measurement 
–issued July 2013 (version 3.0), July 2014 (version 4.0) 
• SRA 1.5 schema requirements (mandatory information and required 
terminology, e.g. Library Selection or Library Strategy) 
• All this information is used to derive ISA MIxS configurations allowing all 
those annotation requirements to be embedded in spreadsheet tables
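The idea of a checklist augmented with regular expressions can be pictured as follows. This is only a sketch: the field names and patterns below are invented for illustration and are not the expressions of the ENA-GSC-MIxS checklist.

```python
import re

# Hypothetical field names and patterns, NOT taken from the real checklist.
CHECKS = {
    "collection date": re.compile(r"\d{4}(-\d{2}(-\d{2})?)?"),
    "depth": re.compile(r"\d+(\.\d+)? m"),
}


def validate(record):
    """Return a list of problems found in one row of annotations."""
    problems = []
    for field, pattern in CHECKS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not pattern.fullmatch(value):
            problems.append(f"{field}: unexpected format {value!r}")
    return problems


ok = validate({"collection date": "2014-09-01", "depth": "5 m"})
bad = validate({"collection date": "Sept 2014"})
```

Fixing the unit inside the pattern (here "m") mirrors the point about the checklist fixing the units used to report measurements.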
ISAconfigurator Tables
Things to bear in mind with NGS data
Important considerations for managing data and submitting to public repositories
–be aware of supported file formats
• FASTA, FASTQ, SFF, ...
–be aware of the need to demultiplex reads
–the SRA schema evolves and updates are needed
• e.g. Study replaced by Project
• Updates to the ISAconverter
• Mapping from ISA is straightforward, as the change brings in a number of elements that ISA already supports
Tools for creating ISA-Tab documents 
ISAcreator
Java desktop application
Developed to be a user-friendly way to enter standards-compliant metadata. It has many features, including a data entry wizard and an import utility.
ISAcreator features: automatic template generation
ISAcreator Wizard: automatic template generation
Prerequisites and conditions of use:
-supports factorial design experiments, meaning sets of discrete factor levels combined together to define a treatment
 2x2 factorial design, as in 2 compounds and 2 time points
 2x2x3 factorial design, as in 2 compounds, 2 time points and 3 doses
-assumes one sample collection event (all samples collected at sacrifice time)
-supports some but not all currently available assay types
-supports fractional factorial design
-supports unbalanced factor group population sizes (ethical considerations for high dose toxic exposures)
-automatically generates sample identifiers, human-readable and meaningful labels and, if requested, barcodes
-does not support crossover designs, which have to be coded manually
-does not support sample collection timeline management (under development)
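A full factorial design is the Cartesian product of the factor levels, so a 2x2x3 design yields 12 treatment groups. A minimal sketch with placeholder level names (not the wizard's own code):

```python
from itertools import product

# Placeholder factor levels for a 2x2x3 design.
FACTOR_LEVELS = {
    "compound": ["compound A", "compound B"],
    "time point": ["day 1", "day 7"],
    "dose": ["low", "medium", "high"],
}


def full_factorial(factor_levels):
    """Enumerate every combination of factor levels as a treatment group."""
    names = list(factor_levels)
    return [dict(zip(names, combo))
            for combo in product(*factor_levels.values())]


treatments = full_factorial(FACTOR_LEVELS)
```

A fractional factorial design would keep only a chosen subset of these combinations, and unbalanced designs would assign different numbers of subjects to each group.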
Importing your own spreadsheet:
Mapping to third party table
ISAcreator features: visualizing experimental workflows 
Work completed during investigation of new approach for creation of glyphs with use of taxonomy for 
guidance. See Maguire et al, Taxonomy-Based Glyph Design – with a Case Study on Visualizing 
Workflows of Biological Experiments, IEEE Transactions on Visualization and Computer Graphics, 2012 
OntoMaton: a BioPortal powered 
Ontology widget for Google Spreadsheets 
Maguire et al, 2013 
Bioinformatics 
Tools for creating ISA-Tab documents 
http://www.slideshare.net/proccaserra/ontomaton-icbo2013alternative-ordertwv3 
http://isatools.wordpress.com/2012/07/13/introducing-ontomaton-ontology-search-tagging-for-google-spreadsheets/
Potential Issues and known hurdles 
• The problem of conflicting versions 
–especially high when working with big consortia 
–distributed, decentralised groups of users 
• Lack of version control and history 
• Absence of collaborative features 
–Looking for new solutions while retaining the features
[Figure: logo equation with three components, one of them LOV]
BioPortal meets Google Spreadsheet
Searching and Tagging 
Templates: 
https://drive.google.com/templates?type=spreadsheets&q=ontomaton
Risa - ISA-Tab manipulation for analysis in R
• Risa R package
• R package available since BioConductor 2.11
http://www.bioconductor.org/packages/release/bioc/html/Risa.html
• Functionality for parsing ISA-Tab datasets into R objects, saving and updating them.
• It bridges the ISA-Tab metadata to analysis pipelines of specific assay types, by building objects for use in other R packages downstream
– currently considering mass spectrometry (xcms package, xcmsSet) and DNA microarray (Biobase package, ExpressionSet)
[Figure: numbered workflow steps covering Experiment Design, Collect Samples, Run Assays and Analysis, with SAMPLE 1 to SAMPLE 11 linked to data files; example: Arabidopsis thaliana treatment groups (70%, 90%, 100%)]
http://isatools.wordpress.com/2013/065/158/isacreator-available-in-genomespace/
http://isatools.wordpress.com/2013/065/168/isacreator-available-in-genomespace/
http://isatools.wordpress.com/2013/065/178/isacreator-available-in-genomespace/
Submission Tool 
https://github.com/ISA-tools/ISAcreator/wiki/ENASubmissionTool 
Pre-requirements: 
– registration to ENA/EBI Metagenomics 
– data upload by one of the methods provided by ENA 
http://www.ebi.ac.uk/ena/about/sra_data_upload 
[Figure: submission workflow: ISA-Tab creation, ISA-Tab validation, ISA-Tab to SRA conversion (SRA-xml schema), submission to ENA]
http://gigasciencejournal.com 
http://gigadb.org/dataset/100035
• New open-access, online-only publication for descriptions of scientifically valuable datasets 
• Only content type: Data Descriptor, narrative + structured parts 
• Initially focused on the life, environmental and biomedical sciences 
• Data Descriptor will be complementary to traditional research journals and data repositories 
• Designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery 
www.nature.com/scientificdata
Data Descriptors served by Scientific Data 
Narrative Section
A brief article-like document with:
•Title
•Abstract
•Background & Summary
•Methods
•Technical Validation
•Usage Notes
•Figures & Tables
•References
Structured Section
Detailed descriptions of the experimental procedures used to produce the data
•Following community-defined minimum information requirements
• for a level of detail sufficient to reproduce the experiments
•Using ontologies & controlled vocabularies
• To maximise consistency of the descriptions
www.nature.com/scientificdata
Training Material 
http://isa-tools.org/training.html
Hands-on Material 
• Software: 
–ISAcreator 1.7.8 (see pre-release) 
–ISAconfigurator 1.6 
• Configurations: 
–ISA-ENA-MIxS Configuration 
–default MultiAssay Configuration 
• ISA-Tab formatted datasets 
–BII-S-3: Western Channel Water Samples metagenome and metatranscriptome
–BII-S-7: Human gut microbiome targeted gene survey
• Google Templates and OntoMaton
• ISA mapping file
The Exemplar Datasets
• BII-S-3: Metagenome and Metatranscriptome on 454
• BII-S-7: Targeted Gene Survey (16S rRNA) on 454
 Submitted to ENA via ISAcreator: ERP000133
Experimental 
Metadata 
Roadmap 
link to analysis platforms 
submission to public 
repositories 
data publication
EBI teams
funders
Thanks for your attention! 
Questions? 
You can email us... 
isatools@googlegroups.com 
View our websites 
View our Git repo & contribute 
http://github.com/ISA-tools 
View our blog 
http://isatools.wordpress.com 
Follow us on Twitter 
@isatools

Metagenomic Data Provenance and Management using the ISA infrastructure --- overview, implementation patterns & software tools

  • 1. Metagenomic Data Provenance and Management using the ISA infrastructure overview, implementation patterns & software tools Alejandra ! Gonzalez-Beltran, PhD Eamonn ! Maguire ! alejandra.gonzalezbeltran@oerc.ox.ac.uk eamonn.maguire@oerc.ox.ac.uk ! ! Metagenomics Bioinformatics, EMBL-EBI, Hinxton, UK September 2014 University of Oxford e-Research Centre, UK
  • 4. Experimental Metadata Roadmap link to analysis platforms
  • 5. Experimental Metadata Roadmap link to analysis platforms submission to public repositories
  • 6. Experimental Metadata Roadmap link to analysis platforms submission to public repositories
  • 7. Experimental Metadata Roadmap link to analysis platforms submission to public repositories data publication
  • 8. Experimental Metadata Notes in lab notebooks (information for humans) Spreadsheets & tables RDF statements (information for machines) It is all about structuring experimental information to make it available to computers and software agents to enable: 8 ! provenance tracking assessment and evaluation accountability, reliability, trust, evidence conservation, preservation, storage, archiving and mining
  • 9. 9
  • 12. 12 A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including: ! • stem cell discovery • system biology • transcriptomics • toxicogenomics • also by communities working to build a library of cellular signatures ! • environmental health • environmental genomics • metabolomics • metagenomics • nanotechnology • proteomics
  • 14. Why ISA format and Tools? investigation assay(s) assay(s) pointers to data file names/location external files in native or other for-mats data data investigation high level concept to link related studies study the central unit, containing information on the subject under study, its characteristics and any treatments applied. a study has associated assays assay test performed either on material taken from the sub-ject or on the whole initial subject, which produce quali-tative or quantitative meas-urements (data) H. Sapiens H. Sapiens H. Sapiens H. Sapiens 33 Years H1 H1 H2 35 35 33 Years Years Years ISA metadata specifications: ! • workflow and process orientated • compatible with checklist enforcement • compatible with external vocabulary resources • compatible by design with existing schemas ! H1.sample1 H1.sample2 H2.sample1 Labeling Labeling H1.sample1.labeled H2.sample1.labeled h1-s1.cel h1-s2.cel h2-s1.cel H1 H2 H1.sample1 H1.sample2 H2.sample1 Labeling Labeling H1.sample1.labeled H2.sample1.labeled h1-s1.cel h1-s2.cel h2-s1.cel H. Sapiens 35 Years MAGE-Tab Pride-xml SRA-xml
  • 15. Essentials about ISA syntax 15 • 3 types of files • Investigation file: at max 1 (think executive summary) –Why? general study description –How? methods / protocol declaration –How? variable declarations (factors and response variable) –Who? contact and affiliation information • Study File: true table (think sorting, filtering) –What? Listing all biological materials collected over the study course. • Assay File: true table (think sorting, filtering) –Results! Listing all data files collected by a given assay –n files, as many as there are assay types declared
  • 16. Essentials about ISA syntax • Material Transformations: – Input and Outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name.) Material Node Characteristics[…] Factor Value[…] (independent variables) Material Type Comment[…] Parameter Value ! […] Performer (operator effect) Date (day effect) Material Protocol Process Data File Node ! DATA Derived Data File Raw Data File ! DATA ! Material 16
  • 18. Essentials about ISA syntax –Branching events: Tabular Representation Sample Material muscle biopsy liver biopsy human volunter 1 Source Name Characteris0c s[organism] Protocol REF Parameter Value[storage condi0on] Sample Name Characteris0cs[organ] volunteer 1 Homo sapiens sample collec8on heparinated tube, room temperature volunteer 1 -­‐ sample1 peripheral blood volunteer 1 Homo sapiens sample collec8on liquid nitrogen volunteer 1 -­‐ sample2 muscle volunteer 1 Homo sapiens sample collec8on liquid nitrogen volunteer 1 -­‐ sample3 liver Source Material peripheral blood 18
  • 19. Essentials about ISA syntax –Pooling events: Tabular Representation Source Name Characteris0c s[organism] Protocol REF Parameter Value[storage condi0on] Sample Material Sample Name Characteris0cs[organ] animal 1 Mus musculus sample collec8on heparinated tube, room temperature pool1 salivary gland animal 2 Mus musculus sample collec8on heparinated tube, room temperature pool1 salivary gland animal 3 Mus musculus sample collec8on heparinated tube, room temperature pool1 salivary gland animal 1 animal 2 animal 3 Source Material salivary glands 19
  • 20. Essentials about ISA syntax Tagging with Terminologies • Implicit column order matters: ! ! ! ! ! ! • ISA tools (ISAcreator - ISAconfigurator) provide Ontology term selection and term tagging facilities to help users. Source Name Characteris0cs [organism] Factor Value[comp ound agent] Factor Value[per turba0on agent] Factor Value[dose] Factor Value[dura 0on] Factor Value[was hout period Factor Value[dura 0on] Factor Value[perturba0o n agent] Factor Value[dose] Factor Value[dura0on] individual1 human Source Name Characteris0cs [organism] Term Source REF Term Accession Number Characteris0c s[dura0on] Unit Term Source REF Term Accession Number Factor Value[compound (htppt://purl] Term Source REF Term Accession Number individual1 Homo sapiens NCBITax 9606 12 week UO UO:wwerw ta aspirin CHEBI 1231354 20
  • 22. Parallel group design source: hOp://dx.doi.org/10.1016/S1569-­‐9056(02)00115-­‐X; figure 1 22
  • 23. Essentials about ISA syntax Representing interventions and treatments ! • expressing treatments as sets of factor levels • examples: treatment is a tadalafil supplementation • Factors will be ‘compound’, ‘dose’ and duration • (what?, how much?, how long for?) ! Characteris0c Factor ! Source Name s[organism] Protocol REF Value[compoun Factor Value[dose] Factor Value[dura0on] d] ! volunteer 1 Homo sapiens treatment tadalafil 250 mg/day 12 weeks ! volunteer 2 Homo sapiens treatment tadalafil 250 mg/day 12 weeks ! volunteer 3 Homo sapiens treatment placebo 20 mg/day 12 weeks ! • Implicit column order matters but this is independent from the ISA syntax specification
  • 24. Cross-over design 24 source: Roberts et al. Journal of the International Society of Sports Nutrition 2007 4:25 doi:10.1186/1550-2783-4-25
  • 25. 08/26/13 Cross-over design 25 10.1371/journal.pone.0037479
  • 26. 08/26/13 Cross-over design 26 ! Treatment declaration
  • 27. 08/26/13 Cross-over design 27 10.1371/journal.pone.0037479
  • 33. 1
  • 34. ISA configurations Available from: http://isa-tools.org/configurations.html https://github.com/ISA-tools/Configuration-Files • Assembling workflow archetypes • Setting annotation requirements –for compliance with database schemas (SRA, MAGE, PRIDE) –for compliance with community based requirements (MIAME, MIAPE, MIMS, MIxS, …) • Guide users –Provide pre-assembled templates –Specify vocabulary support ISAconfigurator: Supporting tool https://github.com/ISA-tools/ISAconfigurator
  • 35. ISA configurations Available from: http://isa-tools.org/configurations.html https://github.com/ISA-tools/Configuration-Files • Minimum information about any (x) sequence (MIxS) Guidelines as issued by Genomic Standards Consortium • ENA-GSC-MIxS checklist XML document: –based on MIxS guidelines –augmented with a number of regular expressions to further validate/ regularize input –fixing a number of units used to report measurement –issued July 2013 (version 3.0), July 2014 (version 4.0) • SRA 1.5 schema requirements (mandatory information and required terminology, e.g. Library Selection or Library Strategy) • All this information is used to derive ISA MIxS configurations allowing all those annotation requirements to be embedded in spreadsheet tables
  • 38. Things to bear in mind with NGS data. Important considerations for managing data and submitting to public repositories: – be aware of the supported file formats • FASTA, FASTQ, SFF, … – be aware of the need to demultiplex reads – the SRA schema evolves and updates are needed • e.g. Study replaced by Project • updates to the ISAconverter • mapping from ISA is straightforward, as it brings in a number of elements ISA already supported
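As a reminder of what demultiplexing involves, the sketch below splits FASTQ records into per-sample bins by matching a barcode at the start of each read. It is a simplified illustration under stated assumptions: barcodes sit at the 5' end, matching is exact, and the barcode-to-sample mapping and function name are invented. Production tools also handle mismatches, paired reads and index reads.

```python
from collections import defaultdict


def demultiplex(lines, barcodes):
    """Group 4-line FASTQ records by leading barcode (exact match only)."""
    bins = defaultdict(list)
    # zip over one shared iterator yields consecutive groups of four lines.
    for header, seq, _plus, qual in zip(*[iter(lines)] * 4):
        header, seq, qual = header.strip(), seq.strip(), qual.strip()
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                # Trim the barcode from both the bases and the quality string.
                n = len(barcode)
                bins[sample].append((header, seq[n:], qual[n:]))
                break
        else:
            bins["unassigned"].append((header, seq, qual))
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
fastq = [
    "@read1", "ACGTGGGTTT", "+", "IIIIIIIIII",
    "@read2", "TGCAAACCCG", "+", "IIIIIIIIII",
    "@read3", "GGGGAACCCG", "+", "IIIIIIIIII",
]

result = demultiplex(fastq, barcodes)
for sample, reads in result.items():
    print(sample, reads)
```

Each per-sample bin then corresponds to one data file that can be referenced from the assay table.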
  • 39. Tools for creating ISA-Tab documents: ISAcreator
  • 40. ISAcreator: Java desktop application. Developed to be a user-friendly way to enter standards-compliant metadata. It has many features, and these are just some of them; there is also a data entry wizard and an import utility.
  • 41. ISAcreator features: automatic template generation
  • 42. ISAcreator wizard: automatic template generation. Prerequisites and conditions of use: – supports factorial design experiments, meaning sets of discrete factor levels combined together to define a treatment (2x2 factorial design as in 2 compounds and 2 time points; 2x2x3 factorial design as in 2 compounds, 2 time points and 3 doses) – assumes one sample collection event (all samples collected at sacrifice time) – supports some but not all currently available assay types – supports fractional factorial design – supports unbalanced factor group population sizes (ethical considerations for high-dose toxic exposures) – automatically generates sample identifiers, human-readable and meaningful labels and, if requested, barcodes – does not support ‘crossover design’, which has to be coded manually – does not support sample collection timeline management (under development)
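A full factorial design is simply the Cartesian product of the factor levels, which the following Python sketch enumerates before assigning sample identifiers. This is not the wizard's actual algorithm: the factor levels, the identifier scheme (`G01.S01`) and the function names are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical 2x2x3 design: 2 compounds, 2 time points, 3 doses.
factors = {
    "compound": ["compound A", "compound B"],
    "time point": ["24 h", "48 h"],
    "dose": ["low", "medium", "high"],
}


def study_groups(factors):
    """Enumerate every combination of factor levels (full factorial design)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def sample_ids(groups, default_size=3, sizes=None):
    """Generate identifiers; `sizes` maps a group number to a custom size,
    which allows unbalanced group populations."""
    sizes = sizes or {}
    ids = []
    for g, _group in enumerate(groups, start=1):
        for s in range(1, sizes.get(g, default_size) + 1):
            ids.append(f"G{g:02d}.S{s:02d}")
    return ids


groups = study_groups(factors)
ids = sample_ids(groups, default_size=3, sizes={12: 2})  # group 12 has only 2 subjects

print(len(groups), "groups,", len(ids), "samples")
print(groups[0])
print(ids[0], ids[-1])
```

A fractional factorial design would keep only a subset of `groups`, and a cross-over design cannot be expressed this way because each subject passes through several treatments in sequence.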
  • 43. Importing your own spreadsheet: Mapping to third party table
  • 44. ISAcreator features: visualizing experimental workflows. Work completed while investigating a new approach to creating glyphs, using a taxonomy for guidance. See Maguire et al., Taxonomy-Based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments, IEEE Transactions on Visualization and Computer Graphics, 2012
  • 45. Tools for creating ISA-Tab documents. OntoMaton: a BioPortal-powered ontology widget for Google Spreadsheets (Maguire et al., 2013, Bioinformatics) http://www.slideshare.net/proccaserra/ontomaton-icbo2013alternative-ordertwv3 http://isatools.wordpress.com/2012/07/13/introducing-ontomaton-ontology-search-tagging-for-google-spreadsheets/
  • 46. Potential issues and known hurdles • The problem of conflicting versions – especially high when working with big consortia – distributed, decentralised groups of users • Lack of version control and history • Absence of collaborative features – Looking for new solutions while retaining the features [Figure: logos combined as an equation, including LOV]
  • 47. BioPortal meets Google Spreadsheet
  • 48. Searching and Tagging. Templates: https://drive.google.com/templates?type=spreadsheets&q=ontomaton
  • 53. Risa: ISA-Tab manipulation for analysis in R • Risa R package
  • 54. • R package available since Bioconductor 2.11: http://www.bioconductor.org/packages/release/bioc/html/Risa.html • Functionality for parsing ISA-Tab datasets into R objects, saving and updating them • It bridges the ISA-Tab metadata to analysis pipelines for specific assay types, by building objects for use in other R packages downstream – currently considering mass spectrometry (xcms package, xcmsSet) and DNA microarray (Biobase package, ExpressionSet) [Figure: workflow from experiment design through sample collection and assay runs to analysis, with Arabidopsis thaliana samples (SAMPLE 1 to SAMPLE 11) linked to data files and to treatment groups 70%, 90%, 100%]
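Risa itself is an R package, and its API is not reproduced here. The Python sketch below only illustrates the general idea of bridging metadata to analysis: read a tab-delimited, ISA-Tab-style study table and group sample names by a factor value, so that a downstream analysis can use the grouping. The table content, the factor name `treatment` and the function names are made up for the example, and the table is simplified (a real study file has further columns such as Protocol REF).

```python
import csv
import io
from collections import defaultdict

# A tiny, invented study table in ISA-Tab-like layout (tab-delimited).
TABLE_ROWS = [
    ["Source Name", "Characteristics[organism]", "Sample Name", "Factor Value[treatment]"],
    ["plant 1", "Arabidopsis thaliana", "SAMPLE1", "70%"],
    ["plant 2", "Arabidopsis thaliana", "SAMPLE2", "70%"],
    ["plant 3", "Arabidopsis thaliana", "SAMPLE3", "90%"],
    ["plant 4", "Arabidopsis thaliana", "SAMPLE4", "100%"],
]
STUDY_TABLE = "\n".join("\t".join(row) for row in TABLE_ROWS)


def load_samples(text):
    """Parse the tab-delimited table into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def group_by_factor(rows, factor):
    """Map each level of the given factor to the sample names that carry it."""
    column = f"Factor Value[{factor}]"
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["Sample Name"])
    return dict(groups)


rows = load_samples(STUDY_TABLE)
treatment_groups = group_by_factor(rows, "treatment")
print(treatment_groups)
```

The resulting mapping plays the role of the phenotype or class information that an analysis package needs alongside the raw data files.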
  • 60. Prerequisites: – registration with ENA/EBI Metagenomics – data upload by one of the methods provided by ENA: http://www.ebi.ac.uk/ena/about/sra_data_upload
  • 67. ISA-Tab creation → ISA-Tab validation → ISA-Tab to SRA conversion (SRA-xml schema) → Submission to ENA
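The conversion step can be pictured as turning table rows into XML. The Python sketch below builds a small sample document with the standard library. It is not the ISAconverter's logic: the element names are modelled on SRA sample XML, but the fragment is simplified, has not been checked against the SRA schema, and the input row, the `taxon id` key and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

FACTOR_PREFIX = "Factor Value["


def samples_to_xml(rows):
    """Build a simplified SAMPLE_SET document from ISA-Tab-style rows."""
    root = ET.Element("SAMPLE_SET")
    for row in rows:
        sample = ET.SubElement(root, "SAMPLE", alias=row["Sample Name"])
        name = ET.SubElement(sample, "SAMPLE_NAME")
        ET.SubElement(name, "TAXON_ID").text = row["taxon id"]
        ET.SubElement(name, "SCIENTIFIC_NAME").text = row["Characteristics[organism]"]
        attributes = ET.SubElement(sample, "SAMPLE_ATTRIBUTES")
        for column, value in row.items():
            # Each factor value column becomes one tag/value attribute pair.
            if column.startswith(FACTOR_PREFIX):
                attribute = ET.SubElement(attributes, "SAMPLE_ATTRIBUTE")
                ET.SubElement(attribute, "TAG").text = column[len(FACTOR_PREFIX):-1]
                ET.SubElement(attribute, "VALUE").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical input row, reusing the factor names from the earlier treatment example.
rows = [
    {
        "Sample Name": "volunteer1.sample1",
        "Characteristics[organism]": "Homo sapiens",
        "taxon id": "9606",
        "Factor Value[compound]": "tadalafil",
        "Factor Value[dose]": "250 mg/day",
    }
]

document = samples_to_xml(rows)
print(document)
```

Validating the ISA-Tab before this step matters because anything missing from the table cannot be recovered during conversion and would only surface as a rejected submission.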
  • 73. • New open-access, online-only publication for descriptions of scientifically valuable datasets • Only content type: the Data Descriptor, with narrative and structured parts • Initially focused on the life, environmental and biomedical sciences • Data Descriptors will be complementary to traditional research journals and data repositories • Designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery www.nature.com/scientificdata
  • 74. Data Descriptors served by Scientific Data. Narrative section: a brief article-like document with: • Title • Abstract • Background & Summary • Methods • Technical Validation • Usage Notes • Figures & Tables • References. Structured section: detailed descriptions of the experimental procedures used to produce the data • following community-defined minimum information requirements, for a level of detail sufficient to reproduce the experiments • using ontologies and controlled vocabularies, to maximise consistency of the descriptions. www.nature.com/scientificdata
  • 76. Training Material: http://isa-tools.org/training.html
  • 77. Hands-on Material: http://isa-tools.org/training.html • Software: – ISAcreator 1.7.8 (see pre-release) – ISAconfigurator 1.6 • Configurations: – ISA-ENA-MIxS Configuration – default MultiAssay Configuration • ISA-Tab formatted datasets: – BII-S-3: Western Channel Water Samples metagenome and metatranscriptome – BII-S-7: Human gut microbiome targeted gene survey • Google Templates and OntoMaton • ISA mapping file
  • 78. The Exemplar Datasets • BII-S-3: Metagenome and Metatranscriptome on 454
  • 79. The Exemplar Datasets • BII-S-7: Targeted Gene Survey (16S rRNA) on 454 • Submitted to ENA via ISAcreator: ERP000133
  • 80. Experimental Metadata Roadmap link to analysis platforms submission to public repositories data publication
  • 82. Thanks for your attention! Questions? Email us: isatools@googlegroups.com • View our websites • View our Git repo and contribute: http://github.com/ISA-tools • View our blog: http://isatools.wordpress.com • Follow us on Twitter: @isatools