Containerized attribute indexing and graph genomes for federated data access

MIT Licensed
Customized for NCBI Staff in collaboration with the
bioinformatics community.
Version 1.0
Federated Data Access
In Four Acts.

Scenes
Act 1: Virus Characterization and Discovery (x2!)
Act 2: Genome Graphs
Act 3: Annotation of Haplotypes (and Graphs)
Act 4: Indexing Data for Federated Discovery on any
Platform, Anywhere in the World
Act 5: Epilogue: Metadata Matters

Disclosures
The ideas and material presented here are not necessarily the views of NCBI, NLM, NIH or any other federal
agency. Some of the underlying work was supported by the Intramural Research Program of the NLM and
various other federal agencies.
Ben Busby is contracted to NCBI through Medical Science and Computing (MSC).
Ben Busby is also an advisor to:
● Johns Hopkins
● Ariel Precision Medicine
● Deloitte
through Mountain Genomics, a company headquartered in Pittsburgh, PA.

Overview
As the volume of publicly available genomic data expands, it is becoming increasingly clear that the SRA and other public repositories play an ever more important role as stewards, allowing researchers to leverage the statistical and discovery power represented in huge arrays of datasets. That said, these repositories must not simply become ‘bags of data’; they must be indexed repositories where data can be found, approximately assessed for quality, mixed with other data, and most importantly, used by researchers to ask fundamental biological and biomedical questions.
Nearly as important, we must provision reproducible workflows, such that investigators can go from finding data to answering questions as simply and quickly as possible. These workflows will range in complexity from basic BLAST searching, variant calling and annotation, and transcript counting to building genome graphs and discovering novel viruses and back-spliced RNA. While doing all this, we must be cognisant that, because building a simple GUI for these types of analysis is impractical and likely impossible, the onus is on us to teach fledgling biological data scientists to apply computational tools to their existing biological questions.

Federation of Sequencing Data in the Cloud:
Four Illustrative Topic Areas
1. Virus Characterization and Discovery
Generation of a cloud-based indexing system to allow investigators to identify data sets of interest based on taxonomic, gene and protein domain profiles.
2. Genome Graphs
Generation of simple, usable systems that compare an individual patient or organism not to another individual or small group of individuals, but to an entire community; that dramatically compress data; and that immediately find “toxic paths”.
3. Annotation of Haplotypes (and Graphs)
Annotation of haplotypes, instead of graphs, to allow investigators to query complex disease.
4. Indexing Data for Federated Discovery on any Platform, Anywhere in the World
Flexible presentation (API) of viral protein domains, host-pathogen interactions and eventually graph loops as proof-of-principle data federation.

Virus Characterization and Discovery
We have extracted known viruses, heretofore unknown family members, sequences identifiable as viral by the protein domains that decorate their proteins, and the true genomic “dark matter”, and built a prototype index for viral signatures across the vast metadata space within SRA (~90,000-150,000 datasets). This is an example of how a researcher can reach across a massive data repository with a very simple bioinformatic tool and extract dramatic results, except that in this case we have pre-built the index for researchers working on almost any kind of virus, including noroviruses, hemorrhagic fever-causing viruses, and bacteriophages, the most abundant organisms on earth.
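To make the idea of a pre-built signature index concrete, here is a minimal sketch of an inverted index from signatures to datasets. It is an illustration only, not the prototype described above: the run names (`RUN_A`, ...), the `taxon:`/`domain:` label scheme, and the function names are all hypothetical.

```python
from collections import defaultdict

def build_index(profiles):
    """Invert {dataset: {signatures}} into {signature: {datasets}}."""
    index = defaultdict(set)
    for dataset, signatures in profiles.items():
        for signature in signatures:
            index[signature].add(dataset)
    return index

def query(index, *signatures):
    """Return the datasets that carry every requested signature."""
    if not signatures:
        return set()
    hits = [index.get(s, set()) for s in signatures]
    return set.intersection(*hits)

# Hypothetical per-dataset profiles (placeholder names, not real accessions).
profiles = {
    "RUN_A": {"taxon:norovirus", "domain:RdRP"},
    "RUN_B": {"domain:RdRP", "domain:capsid"},
    "RUN_C": {"taxon:bacteriophage", "domain:capsid"},
}

index = build_index(profiles)
print(sorted(query(index, "domain:RdRP")))
print(sorted(query(index, "domain:RdRP", "domain:capsid")))
```

Because the index is built once, each query is a set lookup and intersection rather than a scan over every dataset, which is the point of indexing ahead of time.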

Genome Graphs
Initially leveraging standard workflows for annotation of primary datasets in cloud infrastructure mapped to linear genomes, we will likely transition to genome graphs over the next ten years. This will allow us to compress reads, do phasing automatically, do reference-guided assembly on complex genomes, and detect “toxic paths” automatically. Most importantly, this will allow us to see complex genotype-phenotype interactions easily.
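As a rough illustration of what a genome graph buys us, the toy sketch below stores a single-base variant as a "bubble" of two alternative nodes. A haplotype is then just a list of node ids, which is the compression idea, and checking a haplotype against flagged nodes is a simple lookup. Everything here is an assumption for illustration: the sequences, the node ids, and in particular treating a “toxic path” as any path that visits a flagged node.

```python
# Toy variation graph: node id -> sequence segment.
nodes = {1: "ACG", 2: "T", 3: "A", 4: "GGC"}
# Directed edges: node id -> successor node ids. Nodes 2 and 3 form a bubble.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}

def spell(path):
    """Reconstruct the linear sequence spelled by a path of node ids."""
    return "".join(nodes[n] for n in path)

def all_paths(start, end):
    """Enumerate every path from start to end (fine for a small acyclic graph)."""
    stack, found = [[start]], []
    while stack:
        path = stack.pop()
        if path[-1] == end:
            found.append(path)
            continue
        for successor in edges[path[-1]]:
            stack.append(path + [successor])
    return found

def visits_flagged(path, flagged_nodes):
    """True if the path passes through any node flagged as of concern."""
    return any(n in flagged_nodes for n in path)

flagged = {3}  # hypothetical: the 'A' allele is flagged
for path in sorted(all_paths(1, 4)):
    print(path, spell(path), visits_flagged(path, flagged))
```

Real genome graphs contain cycles and far more paths than can be enumerated, so production tools use indexed path representations rather than exhaustive traversal.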

Evan Biederstedt, Alex Dilthey, many others

Eric Dawson, Fernanda Forterre, many others

https://github.com/NCBI-Codeathons/SWIGG
Jason Chin, Alex Gener, many others

github.com/ncbi-codeathons/virus_graphs
Mike Tisza, Alexis Norris, many others

Medical Annotations of Graphs and Haplotypes
Lon Phan, Sara Kalla, many others

Haplocravat2
Mike Smallegan, Kyle Moad, Matt Hynes-Grace, Dina Mikdadi, many others

Vince Carey, John Didion, many others

A Brief Proposal for Graph-Based Sequence Compression and Phasing (personal opinion)
[Figure: pictorial equation; the result is captioned “(no longer available on CD-ROM)”]

Indexing Data for Federated Discovery on any Platform, Anywhere in the World!
Christiam Camacho, Sej Modha, Alex Efremov, Joan Marti-Carreras

The Importance of Metadata
If biologists use the data indices or pipelines for the illustrative subject areas listed here without metadata, they will simply be doing more indexing and cataloging. For maximum utility, the primary data must be accompanied by rich metadata that puts the samples in question in context, not just in terms of their technical origins, but also their biological origins: for example, their disease state, tissue type, and precise geographic location, meaning not where the sequencing was done, but where the individual organism lives.
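One way to picture “rich metadata” is as a record whose biological-context fields can be checked for completeness before a sample is indexed. The sketch below is an assumption-laden illustration: the class name, field names, and example values are hypothetical and do not reflect any actual SRA schema.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SampleMetadata:
    accession: str                            # placeholder identifier
    disease_state: Optional[str] = None
    tissue_type: Optional[str] = None
    organism_location: Optional[str] = None   # where the organism lives, not where it was sequenced

def missing_context(record):
    """List the fields that were left unset (None) on a record."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]

record = SampleMetadata(accession="RUN_A", tissue_type="liver")
print(missing_context(record))
```

A check like this could gate submission or simply rank datasets by how much biological context they carry.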
