data - science - life
Considerations & Challenges in
Building an End-to-End
Microbiome Workflow
Presenters
Dr Craig McAnulla
Senior Consultant Bioinformatics at Eagle Genomics
Craig earned his Ph.D. in Microbiology at the University of Warwick, then
completed his postdoctoral research at the John Innes Centre before joining EBI
and developing their bioinformatics capability. Dr McAnulla has extensive
experience in developing large-scale pipelines for analyzing microbiome
datasets. His skills cover both software and biology, and he often provides deep
insights to industry experts on these topics.
Dr Raminderpal Singh
Vice President Business Development at Eagle Genomics
Raminderpal earned his PhD in the semiconductor industry, where he spent a
decade building advanced technologies for semiconductor modelling and
data mining. In 2012, Dr Singh took the helm as the Business Development
Executive for IBM Research’s genomics program, where he developed and led
the execution of the go-to-market and business plan for the IBM Watson
Genomics program. In addition to many published papers, Dr Singh has two
books and twelve patents.
Please Tweet your questions
@eaglegen or enter them
into GoTo Webinar
Agenda
1. Considerations
2. Challenges
3. Industrialized Microbiome Workflow & Collaboration Environment
4. Deployment of a Collaboration System
1. Considerations
Considerations for successful microbiome data management and analysis:
1. Intrinsic factors
2. The relatively early maturity of the field
3. Calculation of business value
1. Considerations - Intrinsic
• Microbiomes are complex and diverse biological systems
• Many components: organisms, genes, pathways, metabolites, etc.
• More DNA/genes than the human genome
• Taxonomic diversity
• Many sources of error
• Appropriate study design and metadata collection are crucial.
1. Considerations - Maturity
• Changing sequencing technologies
• Move from amplicon-based to shotgun
• Rapidly increasing data volumes
• Limited reference data
• Fast-moving software development
• Future data re-analysis with improved methods must be considered.
1. Considerations - Business value
• Commercial exploitation of the microbiome is in its infancy and still at an exploratory phase.
• Many research findings remain controversial, and some early claims about the gut microbiome have not
been replicated.
• The true potential of microbiome data is still to be tapped, and novel experimental designs are
being developed to maximize the value generated.
2. Challenges
The analysis-at-scale of microbiomics data faces several challenges:
• Metadata management
• Compute resources necessary for storage and analysis
• Statistical techniques
Metadata Management
• Crucial to analysis
• Often done poorly
• Needs to encompass all relevant information
• Check for batch effects
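The metadata points above (completeness and batch effects) can be illustrated with a minimal sketch. It assumes sample metadata is held as simple records; the field names (`sample_id`, `group`, `extraction_batch`) and the values are hypothetical and not taken from the slides. It flags records with missing fields and batches that contain only one study group, where a batch effect could not be separated from a group effect.

```python
from collections import Counter, defaultdict

# Hypothetical sample metadata: field names and values are illustrative only.
samples = [
    {"sample_id": "S1", "group": "case", "extraction_batch": "B1"},
    {"sample_id": "S2", "group": "case", "extraction_batch": "B1"},
    {"sample_id": "S3", "group": "control", "extraction_batch": "B2"},
    {"sample_id": "S4", "group": "control", "extraction_batch": "B2"},
    {"sample_id": "S5", "group": "case", "extraction_batch": "B3"},
    {"sample_id": "S6", "group": "control", "extraction_batch": "B3"},
]

# Assumed minimum set of fields every record should carry.
REQUIRED_FIELDS = ("sample_id", "group", "extraction_batch")


def missing_fields(records, required=REQUIRED_FIELDS):
    """Return {record index: [missing field names]} for incomplete records."""
    problems = {}
    for i, rec in enumerate(records):
        missing = [f for f in required if not rec.get(f)]
        if missing:
            problems[i] = missing
    return problems


def confounded_batches(records, batch_key="extraction_batch", group_key="group"):
    """Return batches whose samples all belong to a single study group."""
    groups_per_batch = defaultdict(Counter)
    for rec in records:
        groups_per_batch[rec[batch_key]][rec[group_key]] += 1
    return sorted(b for b, counts in groups_per_batch.items() if len(counts) == 1)


print(missing_fields(samples))      # {}
print(confounded_batches(samples))  # ['B1', 'B2']
```

In this toy data, batches B1 and B2 each hold a single group, so any difference between them could come from the batch as much as from the biology; B3 mixes both groups and is not flagged.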
2. Challenges
Data Storage and Compute Resources
• Shotgun metagenomics example study: 1.7 billion paired-end Illumina reads, 266.5 Gb.
• Bigger data volume than the human genome at 30x coverage.
• Study size is increasing
How to store this?
• Analysis selection is critical
• Analyses MUST work with these data volumes
• e.g. Kraken, HUMAnN2
• Compute resources
• e.g. Kraken: multicore machine, > 100 GB
• Solution: the cloud
• AWS, Azure
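The comparison with a 30x human genome can be checked with a few lines of arithmetic. This is a rough sketch under two assumptions that the slide does not state: that "Gb" means gigabases, and that the haploid human genome is about 3.1 billion bases. The implied read length is only meaningful if the 1.7 billion figure counts individual reads rather than read pairs.

```python
# Figures quoted on the slide.
reads = 1.7e9            # paired-end Illumina reads
study_bases = 266.5e9    # 266.5 Gb, read here as gigabases (assumption)

# Assumptions not stated in the source.
human_genome_bases = 3.1e9   # approximate haploid human genome size
coverage = 30

human_30x_bases = human_genome_bases * coverage
ratio = study_bases / human_30x_bases
mean_read_length = study_bases / reads

print(f"Human genome at 30x: {human_30x_bases / 1e9:.0f} Gb")
print(f"Study / human 30x:   {ratio:.1f}x")
print(f"Implied mean read length: {mean_read_length:.0f} bases")
```

Under these assumptions the example study is a little under three times the size of a 30x human genome, which is consistent with the slide's claim.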
2. Challenges
Statistical Techniques
• Data normalization
• rarefaction, normalization, log transformation?
• Statistical techniques
• parametric, nonparametric?
• Confusion
• No correct answer
• Depends on the data
• function vs taxonomy
• shotgun vs amplicon
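To make the normalization options above concrete, here is a minimal sketch of three of them applied to one sample: total-sum scaling to relative abundance, rarefaction to a fixed depth, and a log transformation with a pseudocount. The taxon counts are hypothetical, and these are plain illustrations rather than the methods used by any particular tool; as the slide notes, which one is appropriate depends on the data.

```python
import math
import random

# Hypothetical taxon counts for one sample (illustrative values only).
counts = {"Bacteroides": 600, "Prevotella": 300, "Faecalibacterium": 90, "Akkermansia": 10}


def relative_abundance(counts):
    """Total-sum scaling: divide each count by the sample total."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def rarefy(counts, depth, seed=0):
    """Subsample reads without replacement down to a fixed depth."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand counts into one label per read, then draw `depth` of them.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    picked = random.Random(seed).sample(pool, depth)
    rarefied = {taxon: 0 for taxon in counts}
    for taxon in picked:
        rarefied[taxon] += 1
    return rarefied


def log_transform(counts, pseudocount=1):
    """log10(count + pseudocount); the pseudocount keeps zero counts finite."""
    return {taxon: math.log10(n + pseudocount) for taxon, n in counts.items()}


print(relative_abundance(counts))
print(rarefy(counts, depth=100))
print(log_transform(counts))
```

Rarefaction discards reads to equalize depth across samples, while scaling and log transformation keep all reads but change the scale, which is one reason the choice interacts with the downstream statistical test.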
3. Industrialized Microbiome Workflow & Collaboration Environment
• Eagle Genomics leads the industry in configuring and deploying best-in-class
microbiome workflow solutions.
• These run large-scale parallelized pipelines cost-efficiently, with seamless
connectivity between the source data and the scientists.
“Unilever’s digital data program now processes genetic sequences twenty
times faster - without incurring higher compute costs. In addition, its robust
architecture supports ten times as many scientists, all working simultaneously.”
Pete Keeley, eScience Technical Lead - R&D IT at Unilever
www.eaglegenomics.com
Data Collaboration Architecture: solving the metadata problem
[Diagram. Component labels as extracted; the connections between components are not recoverable.]
• Microbiome data files (amplicon or shotgun)
• Metadata: eaglecatalog (implements security model)
• Query/filtering: eaglecatalog
• Data sharing
• CROs deposit data
• ELN experiment metadata
• Integration with omics (RNASeq, proteomics)
• Federated data & results integration
• Taxonomy/function analysis
• Reports
• R, QIIME
• Population cohorts
• Visualization
• Cloud storage
• eaglecatalog
• Collaborative environment for partners
• Feedback to experimental design
• Cost of experiment
4. Deployment of a Collaboration System
Eagle Genomics Proposal
Accelerating metadata adoption and scale-up
• A new rapid deployment offering for microbiome R&D organizations looking to
cost-efficiently build world-class integrated workflows.
• Setup of an initial collaboration system for your team.
• 1 to 4 weeks
• Enables you to self-load your data into the catalog, leading to a standardized and
streamlined workflow that is ready for future scale-out
• Manages internal data, partner data, or public data
BENEFITS

Current Situation: Data in silos, not reused
Desired State: Easily searchable, discoverable, accessible and intuitive portal by researchers to all data and results
Negative Consequences of Non-action: Significant researcher time spent in “data wrangling”. Breakage pending, based on projected bottleneck
Positive Business Impact, using Eagle software: Improved time and ability to insight through radically improved productivity of researchers

Current Situation: No correlation between experiments
Desired State: Quick and easy exploratory analyses across experiments and data
Negative Consequences of Non-action: Missed opportunities for insights. Failure to leverage data as an asset
Positive Business Impact, using Eagle software: Faster identification and generation of successful product innovation and insight, leading to reduced time to market

Current Situation: No unified view of data
Desired State: Data and analysis results stored in a unified catalog, with standardized & repeatable access using best-in-class technology
Negative Consequences of Non-action: Data (internal, external) and the associated learning not available or exploited as necessary
Positive Business Impact, using Eagle software: Competitive advantage through cutting edge research, driving innovation in the product pipeline

Current Situation: Lack of governance and data provenance
Desired State: Secure authenticated access with relevant data provenance of data sets; simple data sharing with vendors/collaborators
Negative Consequences of Non-action: Costly, inefficient and limited data sharing with vendors; risk of inappropriate data sharing
Positive Business Impact, using Eagle software: Efficient, compliant, documented, evidenced, and enforced data sharing process; effective (time & cost) collaboration between vendors and internal teams

Current Situation: Data is in effect archived
Desired State: Catalog integrated in the scientific workflow, used continuously and keeping pace with innovation in the industry
Negative Consequences of Non-action: Missed insights; siloed research activities, with high barriers to collaboration
Positive Business Impact, using Eagle software: Systematic data sharing and re-use, driving faster unique cross-functional insights and reducing waste in spend
In Summary
Considerations & Challenges in Building an End-to-End Microbiome Workflow
For more info, contact:
Raminderpal Singh
raminderpal.singh@eaglegenomics.com
