MIT Licensed
Customized for NCBI Staff in collaboration with the
bioinformatics community.
Version 1.0
Federated Data Access
In Four Acts.
Scenes
Act 1: Virus Characterization and Discovery (x2!)
Act 2: Genome Graphs
Act 3: Annotation of Haplotypes (and Graphs)
Act 4: Indexing Data for Federated Discovery on any
Platform, Anywhere in the World
Act 5: Epilogue: Metadata Matters
Disclosures
The ideas and material presented here are not necessarily the views of NCBI, NLM, NIH or any other federal
agency. Some of the underlying work was supported by the Intramural Research Program of the NLM and
various other federal agencies.
Ben Busby is contracted to NCBI through Medical Science and Computing (MSC).
Ben Busby is also an advisor to:
● Johns Hopkins
● Ariel Precision Medicine
● Deloitte
through Mountain Genomics, a company headquartered in Pittsburgh, PA.
Overview
As the volume of publicly available genomic data expands, it is becoming increasingly clear that the SRA and other
public repositories play an ever more important role as stewards, allowing researchers to leverage the
statistical and discovery power represented in huge arrays of datasets. That said, these repositories must not
simply become ‘bags of data’; they must be indexed repositories where data can be found, approximately assessed for quality,
mixed with other data, and, most importantly, used by researchers to ask fundamental biological and biomedical
questions.
Nearly as important, we must provision reproducible workflows, so that investigators can go from finding data to
answering questions as simply and quickly as possible. These workflows will range in complexity from basic BLAST
searching, variant calling and annotation, and transcript counting to building genome graphs and discovering novel
viruses and back-spliced RNA. Throughout, we must be cognisant that because building a simple GUI
for these types of analysis is impractical and likely impossible, the onus is on us to teach fledgling biological data
scientists to apply computational tools to their existing biological questions.
Federation of Sequencing Data in the Cloud:
Four Illustrative Topic Areas
1. Virus Characterization and Discovery
Generation of a cloud-based indexing system to allow investigators to identify data sets of interest based on taxonomic, gene and protein domain profiles.
2. Genome Graphs
Generation of simple, usable systems that compare an individual patient or organism not to another individual (or a small group of individuals) but to an entire community, dramatically compress data, and immediately find “toxic paths”.
3. Annotation of Haplotypes (and Graphs)
Annotation of haplotypes, instead of graphs, to allow investigators to query complex disease.
4. Indexing Data for Federated Discovery on any Platform, Anywhere in the World
Flexible presentation (API) of viral protein domains, host-pathogen interactions and eventually graph loops as proof-of-principle data federation.
Containerized attribute indexing and graph genomes for federated data access
Virus Characterization
and Discovery
We have extracted known viruses, heretofore unknown family members, sequences identifiable as viral by the protein domains
that decorate their proteins, and the true genomic “dark matter”, and built a prototype index for viral signatures across the vast
metadata space within SRA (~90,000-150,000 datasets). This is an example of how a researcher can reach across a massive
data repository with a very simple bioinformatic tool and extract dramatic results, except that in this case we have pre-built this
index for researchers working on almost any kind of virus, including noroviruses, hemorrhagic fever-causing viruses, and
bacteriophages, the most abundant organisms on earth.
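To make the idea of a pre-built index concrete, here is a minimal sketch of an inverted index that maps a signature (a taxon or a protein domain) to the datasets that carry it. It illustrates the general technique only: the accessions, signature labels, and the function names `build_index` and `find_datasets` are hypothetical and are not the prototype's actual schema or API.

```python
from collections import defaultdict

# Hypothetical per-dataset signature profiles; accessions and labels are made up.
PROFILES = {
    "RUN_A": {"taxon:Norovirus", "domain:RdRP"},
    "RUN_B": {"domain:RdRP", "domain:capsid"},
    "RUN_C": {"taxon:Caudovirales", "domain:terminase"},
}


def build_index(profiles):
    """Invert dataset -> signatures into signature -> set of datasets."""
    index = defaultdict(set)
    for dataset, signatures in profiles.items():
        for sig in signatures:
            index[sig].add(dataset)
    return dict(index)


def find_datasets(index, required):
    """Return the sorted datasets that carry every signature in `required`."""
    required = list(required)
    if not required:
        return []
    hits = set(index.get(required[0], set()))
    for sig in required[1:]:
        # Intersect so that only datasets matching all signatures survive.
        hits &= index.get(sig, set())
    return sorted(hits)


INDEX = build_index(PROFILES)
```

Because the index is built once, a query such as `find_datasets(INDEX, ["domain:RdRP", "domain:capsid"])` costs a few set intersections rather than a scan of every dataset.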
Genome Graphs
We will initially leverage standard workflows for annotation of primary datasets
mapped to linear genomes in cloud infrastructure, but will likely transition to
genome graphs over the next ten years. This will allow us to compress
reads, phase automatically, perform reference-guided assembly on complex
genomes, and detect “toxic paths” automatically. Most importantly, this will
allow us to see complex genotype-phenotype interactions easily.
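As a rough illustration of why graphs help with compression and path detection, the sketch below stores haplotypes as paths through a toy sequence graph instead of as full strings. Everything in it is an assumption made for illustration: the node sequences, the haplotype names, and in particular the reading of a “toxic path” as any path that traverses a node flagged as deleterious, which the slides do not define.

```python
# Toy sequence graph: node id -> sequence. Nodes 2 and 3 are alternative alleles.
NODES = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
EDGES = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Haplotypes are stored as paths (lists of node ids) rather than full strings,
# so shared sequence is stored only once in NODES.
PATHS = {"hap_ref": [1, 2, 4], "hap_alt": [1, 3, 4]}

# Hypothetical annotation: node 3 is flagged as deleterious.
FLAGGED_NODES = {3}


def is_valid_path(path):
    """A path is valid if every node exists and every consecutive pair is an edge."""
    return all(n in NODES for n in path) and all(
        (a, b) in EDGES for a, b in zip(path, path[1:])
    )


def spell(path):
    """Reconstruct the linear sequence that a path spells out."""
    if not is_valid_path(path):
        raise ValueError(f"not a path in the graph: {path}")
    return "".join(NODES[n] for n in path)


def flagged_paths(paths, flagged):
    """Return the sorted names of paths that traverse any flagged node."""
    return sorted(name for name, path in paths.items() if flagged.intersection(path))
```

Here each haplotype is three node ids instead of nine bases, and checking for a flagged node is a set lookup on the path rather than a sequence search.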
Evan Biederstedt, Alex Dilthey, many others
Erik Garrison, many others
Jordan Izenga, many others
Justin Zook, many others
Fritz Sedlazeck, many others
Eric Dawson, Fernanda Forterre, many others
https://github.com/NCBI-Codeathons/SWIGG
Jason Chin, Alex Gener, many others
github.com/ncbi-codeathons/virus_graphs
Mike Tisza, Alexis Norris, many others
Medical Annotations of Graphs and Haplotypes
Lon Phan, Sara Kalla, many others
Haplocravat2
Mike Smallegan, Kyle Moad, Matt Hynes-Grace, Dina Mikdadi, many others
Vince Carey, John Didion, many others
Eric Dawson, Fernanda Forterre, many others
A Brief Proposal for Graph-Based Sequence
Compression and Phasing (personal opinion)
[Slide images not preserved; surviving caption: "(no longer available on CD-ROM)"]
Indexing Data for Federated Discovery on
any Platform, Anywhere in the World!
Christiam Camacho, Sej Modha, Alex Efremov, Joan Marti-Carreras
The Importance of Metadata
If biologists leverage the data indices or pipelines for the illustrative subject areas listed here, without metadata, they will simply
be doing more indexing and cataloging. For maximum utility, the primary data must be accompanied by rich metadata that puts
the samples in question in context, not just in terms of their technical origins, but also their biological origins; for example, their
disease state, tissue type, and the precise geographic location, not of the sequencing, but where the individual organism lives.
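One way to picture the distinction between technical and biological origins is a sample record that keeps the two kinds of field apart and reports which biological context is missing. This is a sketch under stated assumptions: the class `SampleMetadata`, its field names, and the helper `missing_biological_context` are hypothetical and do not correspond to any existing SRA or BioSample schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SampleMetadata:
    accession: str
    # Technical origins: how and where the sample was sequenced.
    platform: Optional[str] = None
    sequencing_center: Optional[str] = None
    # Biological origins: what the sample is and where the organism lives.
    disease_state: Optional[str] = None
    tissue_type: Optional[str] = None
    organism_location: Optional[Tuple[float, float]] = None  # (latitude, longitude)


# Hypothetical choice of which fields count as biological context.
BIOLOGICAL_FIELDS = ("disease_state", "tissue_type", "organism_location")


def missing_biological_context(record: SampleMetadata) -> List[str]:
    """List the biological-origin fields that have not been filled in."""
    return [name for name in BIOLOGICAL_FIELDS if getattr(record, name) is None]
```

A record with only `platform` and `tissue_type` set would report `disease_state` and `organism_location` as missing, which is the kind of gap that limits reuse of otherwise well-indexed data.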
