A Gen3 Perspective of Disparate Data:
From Pipelines in Data Commons to AI in Data Ecosystems
Robert L. Grossman
Center for Translational Data Science
University of Chicago
March 12, 2019
San Francisco
Molecular Medicine Tri-Conference
1. Disparate Data in a World of Data
Commons and Data Ecosystems
Data Clouds (2010-2025) — Projects:
• Data objects in clouds
• Execute bioinformatics pipelines using workflow languages and Docker repositories
Data Commons (2015-2030) — Communities:
• Expose APIs for access to object and structured data
• Expose data models
• Harmonize data within a commons
Data Ecosystems (2018-2030) — Multiple Communities:
• Build an ecosystem of apps across commons & resources
• Harmonize data across commons
• Support ML/AI across commons & resources
Today, I'll talk about the transition from data commons to data ecosystems.
[Diagram: genomic data, imaging data, etc.* flow into data objects in cloud storage (AWS, GCP, private genomics clouds) and clinical data flows into structured data in databases; data curation & management, data exploration, and data analysis support genomic analysis pipelines (Dockstore), a clinical research data warehouse, and research discoveries. *also imaging data, proteomics data, etc.]
Today, I'll talk about supporting both data objects & structured data in data commons & ecosystems.
2. Building Gen3 Data Commons over the
Data Commons Framework Services
Narrow Middle Design (aka End-to-End Design Principle)
[Diagram: the narrow-middle (hourglass) design — bioinformaticians curating and submitting data, and researchers analyzing data and making discoveries, work through container-based workspaces, ML/AI apps, and notebooks, connected through a data commons that runs over data clouds.]
Compare: Saltzer, J.H., Reed, D.P. and Clark, D.D., 1984. End-to-end arguments in system design. ACM Transactions on Computer Systems 2(4), pp. 277-288.
[Diagram: the same architecture — genomic data, imaging data, proteomics data, etc. as data objects in cloud storage (AWS, GCP, private genomics clouds) and clinical data as structured data in databases, with data curation & management, data exploration, and data analysis supporting genomic analysis, clinical research, and research discoveries — now labeled as a data commons / data ecosystem built on DCFS standards.]
We have updated Gen3.org
Video 1
A Gen3 Data Commons Platform in Six Steps
1. Define a data model (a minimal sketch follows below).
2. Use the Gen3 software to auto-generate the data commons and associated API.
3. Import data into the commons using the Gen3 import application.
4. Use Gen3 to explore your data and create synthetic cohorts.
5. Use platforms such as Terra, Seven Bridges, Galaxy, etc. to analyze the synthetic cohorts.
6. Develop your own container-based workflows, applications and Jupyter Notebooks.
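To make step 1 concrete, here is a minimal, hedged sketch of the kind of node definition a data model contains and how a record can be validated against it. The node name and fields are illustrative only (real Gen3 data dictionaries are YAML files with additional structure such as links and system properties); the Python jsonschema package is used for the validation.

```python
# Illustrative sketch only: a simplified, JSON-Schema-style node definition of the
# kind a data dictionary contains. Field names here are made up for the example.
from jsonschema import validate

subject_node = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "subject",
    "type": "object",
    "required": ["submitter_id", "gender"],
    "properties": {
        "submitter_id": {"type": "string"},
        "gender": {"enum": ["female", "male", "unknown"]},
        "age_at_enrollment": {"type": "integer", "minimum": 0},
    },
}

record = {"submitter_id": "subject-001", "gender": "female", "age_at_enrollment": 54}
validate(instance=record, schema=subject_node)  # raises ValidationError if the record does not conform
print("record conforms to the subject node schema")
```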
Will be released in 2Q19
(Selected)
[Diagram: multiple data commons, each with its own data model (Data Model 1 ... Data Model 9).]
1. What are the minimum data access services for object and structured data?
2. What are the minimum data model services?
3. What are the minimum services for identity and access management to support a passport-type system?
Will be released in 2Q19
3. Setting Up and Operating a Data Commons
or Data Ecosystem
1. The Data Commons Framework Services (DCFS) is a set of software services for setting up and operating a data commons and cloud-based resources.
2. The DCFS is designed to support multiple data commons, knowledge bases, and applications as part of a data ecosystem.
3. It is used to help operate the NCI Cancer Research Data Commons (CRDC), NHLBI DataSTAGE, NHGRI AnVIL, and the NIAID Data Hub pilot.
4. The implementation is based on the open-source Gen3 software platform.
[Diagram: the narrow-middle design again — bioinformaticians curating and submitting data, and researchers analyzing data and making discoveries, work through container-based workspaces, ML/AI apps, and notebooks, connected through a data commons running over data clouds.]
[Diagram: multiple data commons, each with its own data model (Data Model 1 ... Data Model 9).]
1. Data commons and resources expose APIs for access to data and resources (a client-side sketch follows below).
2. Data commons expose their data models through APIs.
3. Data models include references to third-party ontologies and other authorities.
4. Authentication and authorization systems can interoperate.
5. 2Q19: Structured data can be serialized, versioned, exported, processed and imported.
Will be released in 2Q19
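As a client-side illustration of points 1 and 2 above, the sketch below resolves a data object by GUID and downloads the exposed data model over HTTP. The host, the GUID, and the endpoint paths are assumptions (they follow common Gen3 deployments but may differ per commons).

```python
# A minimal sketch of using commons APIs: resolve a data object by GUID and fetch
# the exposed data model. URLs and the GUID below are placeholders, not real endpoints.
import requests

COMMONS = "https://example-commons.org"  # hypothetical commons URL
GUID = "dg.ABCD/00000000-0000-0000-0000-000000000000"  # placeholder identifier

# 1. Data access API: look up a data object record (locations, size, hashes) by GUID.
obj = requests.get(f"{COMMONS}/index/{GUID}", timeout=30)
obj.raise_for_status()
print(obj.json().get("urls"))  # e.g. s3:// and gs:// URLs for the same object

# 2. Data model API: download the data dictionary exposed by the commons.
dictionary = requests.get(f"{COMMONS}/api/v0/submission/_dictionary/_all", timeout=30)
dictionary.raise_for_status()
print(sorted(dictionary.json().keys())[:5])  # first few node names in the data model
```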
Video 2
A Gen3 Data Commons Platform in Six Steps
1. Define a data model.
2. Use the Gen3 software to auto-generate the data commons and associated API.
3. Import data into the commons using the Gen3 import application.
4. Use Gen3 to explore your data and create synthetic cohorts.
5. Use platforms such as Terra, Seven Bridges, Galaxy, etc. to analyze the synthetic cohorts.
6. Develop your own container-based workflows, applications and Jupyter Notebooks.
Next steps:
1. Build data commons over the hosted Data Commons Framework Services.
2. Interoperate your data commons with other DCFS-compliant data commons.
Data Commons Framework Services (DCFS) Roadmap
2019
• DCFS services hosted by the
University of Chicago using a
Common Services Operations
Center (CSOC)
• You can build your own data
commons over the hosted DCFS
• Six production data commons
will be working with GA4GH to
standardize DCFS
2020
• Third parties can build data
commons by standing up an entire
stack including their own DCFS
• You can build your own data
commons using DCFS hosted by
the UChicago CSOC
• We expect a third party to host
DCFS and support data commons
over it
• CSOCs can interoperate
• First draft of GA4GH standard
Gen3.org dcf.gen3.org
4. Managing Structured Data in Data
Commons and Data Ecosystems
Linking Structured Clinical Data with Genomic Data
Object data – CRAM/BAM genomic data files, DICOM image files, anything stored in cloud object storage systems (AWS S3, GCP GCS).
Clinical data / graph data / core data / structured data – data that are harmonized to a data model and searchable using the data model and related APIs. Gen3 uses a graph data model as the logical model and PostgreSQL as the database.
[Diagram: data objects stored with GUIDs in one or more clouds; clinical data and other structured data stored in a database; data objects and clinical data linked in the data model.]
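A toy sketch of the linkage described above: a clinical node and a file node are plain records, the file node carries the GUID of the object in cloud storage, and an edge in the graph ties the two together. Node types, field names, and the GUID are illustrative, not Gen3's actual storage layout.

```python
# Toy illustration of linking structured clinical data to a data object by GUID.
subject = {
    "node_type": "subject",
    "submitter_id": "subject-001",
    "gender": "female",
}

sequencing_file = {
    "node_type": "submitted_aligned_reads",
    "submitter_id": "subject-001-wgs-bam",
    "object_id": "dg.ABCD/11111111-1111-1111-1111-111111111111",  # GUID resolving to the S3/GCS copies
    "file_size": 123456789,
    "md5sum": "0123456789abcdef0123456789abcdef",
}

# Edge in the graph data model: the file node belongs to the subject node.
edges = [("submitted_aligned_reads:subject-001-wgs-bam", "subject:subject-001")]

# A query layer over this graph (GraphQL in Gen3) can then answer questions like
# "give me the GUIDs of all aligned-reads files for female subjects".
files_for_female_subjects = [
    sequencing_file["object_id"]
    for src, dst in edges
    if dst == f"subject:{subject['submitter_id']}" and subject["gender"] == "female"
]
print(files_for_female_subjects)
```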
…, but what do we do for structured data?
• Within a data commons, we can use ETL tools, databases, NoSQL databases, data warehouses, etc.
• But what if we have 25 data commons that want to interoperate?
[Diagram: many data commons, each with its own data model (Data Model 1 ... Data Model 9).]
Will be released in 2Q19
Requirement | Approach | Gen3 Services
1. Make the data FAIR | Data objects are assigned GUIDs & metadata and placed in multiple clouds | IndexD, Fence, metadata services via Sheepdog and Peregrine (also part of DCF services)
2. Express the pipelines in a workflow language and make them FAIR | We support the Common Workflow Language | We support Dockstore, CWL & cwltool, use object services to manage CWL files, soon Cromwell
3. Encapsulate the code and tools | We encapsulate code in virtual machines & containers | We use Kubernetes, Docker, Dockstore and WES
4. Link data and code | Use notebooks | We support Jupyter notebooks and JupyterHub
5. Make structured data portable | ??? | ???
5. Portable Formats for Biomedical Data
Life Cycle of Clinical Data (Structured Data)
[Diagram: the clinical (structured) data life cycle —
• Initial upload and small changes to the schema, against an initial data model.
• Harmonization to a harmonized data model (with respect to ontologies, NCIt, etc.).
• New data requiring an updated data model.
• Data used by another project, requiring a new data model.
• A subset of data extracted from the main system as a synthetic cohort and imported into an analysis system.
• 2nd, 3rd, etc. data releases, with continuous creation of synthetic cohorts.
• 4th, 5th data releases with a new data model.
• Platform refreshed, with data and metadata migrated.
Legend: blue = schema change, green = data change, red = platform change.]
What is the Portable Format for Biomedical Data (PFB)?
● PFB is an Avro-based serialization format with a specific schema to import, export and evolve biomedical data.
● PFB specifies metadata and data in one file. Metadata includes the data dictionary, ontology references & relations between nodes.
● PFB is:
○ Portable: supporting import & export.
○ Extensible: data model changes, versioning, backward and forward compatibility.
○ Efficient: binary format.
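A minimal sketch of the single-file idea using the fastavro library: the schema, file-level metadata (here, an ontology reference), and the records travel together in one Avro container. The schema and metadata keys are illustrative and are not the actual PFB schema.

```python
# Sketch of the "metadata and data in one file" pattern with plain Avro (not real PFB).
import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "subject",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "gender", "type": "string"},
        {"name": "age_at_enrollment", "type": ["null", "int"], "default": None},
    ],
})

records = [
    {"submitter_id": "subject-001", "gender": "female", "age_at_enrollment": 54},
    {"submitter_id": "subject-002", "gender": "male", "age_at_enrollment": None},
]

with open("subjects.avro", "wb") as fo:
    # File-level metadata rides along with the data; Avro also embeds the schema itself.
    fastavro.writer(fo, schema, records, metadata={"ontology_reference": "NCIt"})

with open("subjects.avro", "rb") as fo:
    reader = fastavro.reader(fo)
    print(reader.metadata.get("ontology_reference"))  # "NCIt"
    print(reader.writer_schema["name"])               # "subject"
    print(list(reader))                               # the records themselves
```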
Why Avro?
Feature | Avro | Protobuf
Self-describing | ✓ | ✗
Schema evolution | ✓ | ✓
Dynamic schema | ✓ | Partially; needs recompilation
No need to compile | ✓ | ✗
Hadoop support | ✓, built-in | ✓, third-party libraries
JSON schema | ✓ | ✗, special IDL for schema
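The schema-evolution row can be illustrated with plain Avro: records written under an older schema are readable with a newer schema that adds a defaulted field, and no code generation or recompilation is involved because schemas are plain JSON. The field names below are made up.

```python
# Sketch of Avro schema evolution: read v1 data with a v2 reader schema.
import io
import fastavro

schema_v1 = fastavro.parse_schema({
    "type": "record", "name": "subject",
    "fields": [{"name": "submitter_id", "type": "string"}],
})

schema_v2 = fastavro.parse_schema({
    "type": "record", "name": "subject",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "vital_status", "type": "string", "default": "unknown"},  # new field
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, schema_v1, [{"submitter_id": "subject-001"}])

buf.seek(0)
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'submitter_id': 'subject-001', 'vital_status': 'unknown'}
```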
PFB Performance (preliminary results)
• KidsFirst dictionary (JSON): 0.21 MB
• PostgreSQL database: 277 MB
• JSON load time: 10 minutes
• PostgreSQL → PFB export takes 25 seconds
• Schema-only PFB: 0.08 MB
• Schema + data PFB: 38 MB; with compression: 9.7 MB (about 29 times smaller than the PostgreSQL database)
• PFB → PostgreSQL load time: 1 minute
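The PostgreSQL → PFB direction can be sketched as a streaming export: rows are pulled from the database with a server-side cursor and serialized into a compressed Avro container. Table names, column names, and connection settings below are assumptions, and real PFB export also carries the data dictionary and ontology references, which this sketch omits.

```python
# Sketch of a bulk export of structured data from PostgreSQL into an Avro container.
import fastavro
import psycopg2

schema = fastavro.parse_schema({
    "type": "record", "name": "subject",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "gender", "type": ["null", "string"], "default": None},
    ],
})

conn = psycopg2.connect("dbname=commons user=commons")   # assumed connection settings
with conn, conn.cursor(name="subject_export") as cur:    # named cursor: streams rows server-side
    cur.execute("SELECT submitter_id, gender FROM subject")
    rows = ({"submitter_id": sid, "gender": gender} for sid, gender in cur)
    with open("subjects.avro", "wb") as fo:
        fastavro.writer(fo, schema, rows, codec="deflate")  # compressed, as in the numbers above
conn.close()
```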
PFB simplifies the management of structured data in data ecosystems:
• PFB is much smaller and much faster for bulk import and export.
• PFB files contain data models and pointers to third-party ontologies and authorities.
• PFB files can be versioned, managed as data objects in clouds, and accessed via FAIR services.
# of nodes in data model | Sheepdog (sec) | PFB import (sec) | PFB export (sec)
10 | 14.75 | 3.25 | 3.25
100 | 121.25 | 3.5 | 5.5
1000 | 1209.75 | 13 | 11
10000 | 13349.25 | 92 | 69.75
Portable Format for Biomedical Data (PFB)
• PFB is an application-independent and system-independent serialization format for importing and exporting: 1) schema and other metadata, 2) pointers to third-party ontologies and authorities, and 3) data.
• PFB services can export to JSON.
[Diagram: application or commons 1 → PFB file → application or commons 2 (which can be the same application or commons 1); applications or services can process the PFB file in between; the PFB file can be managed as a data object with FAIR services.]
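One way an intermediate application or service might process a PFB-style file, per the diagram above: read the records, keep only those in a synthetic cohort, and write a new file with the embedded schema unchanged. The file names and the predicate are illustrative; production tooling for the real format (e.g. the pypfb package) handles the full PFB schema.

```python
# Sketch of processing a PFB-style Avro file between export and import.
import fastavro

def filter_cohort(in_path: str, out_path: str, predicate) -> int:
    """Copy records that satisfy `predicate` from one Avro file to another."""
    with open(in_path, "rb") as fin:
        reader = fastavro.reader(fin)
        schema = reader.writer_schema          # reuse the embedded schema unchanged
        cohort = [r for r in reader if predicate(r)]
    with open(out_path, "wb") as fout:
        fastavro.writer(fout, schema, cohort)
    return len(cohort)

# e.g. keep only subjects enrolled at age 50 or older (illustrative field name):
# n = filter_cohort("export_from_commons1.avro", "cohort_for_commons2.avro",
#                   lambda r: (r.get("age_at_enrollment") or 0) >= 50)
# print(f"wrote {n} records")
```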
Requirement | Approach | Gen3 Services
1. Make the data FAIR | Data objects are assigned GUIDs & metadata and placed in multiple clouds | IndexD, Fence, metadata services via Sheepdog and Peregrine (also part of DCF services)
2. Express the pipelines in a workflow language and make them FAIR | We support the Common Workflow Language | We support Dockstore, CWL & cwltool, use object services to manage CWL files, soon Cromwell
3. Encapsulate the code and tools | We encapsulate code in virtual machines & containers | We use Kubernetes, Docker, Dockstore and WES
4. Link data and code | Use notebooks | We support Jupyter notebooks and JupyterHub
5. Make structured data portable | Make the data self-describing | Import & export PFB
For more information:
• Review: Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing
and Sharing Genomic Data, Trends in Genetics 35 (2019) pp. 223-234,
https://doi.org/10.1016/j.tig.2018.12.006. See also https://arxiv.org/abs/1809.01699
• To learn about data ecosystems: Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The
Cancer Journal: The Journal of Principles and Practice of Oncology, May/June 2018, Volume 24
Number 3, pages 122-126 doi: 10.1097/PPO.0000000000000318.
• To learn more about data commons: Robert L. Grossman, et al., A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608
• To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared
vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The
GDC was developed using Bionimbus Gen2.
• To learn more about BloodPAC: Grossman, R. L., et al. "Collaborating to compete: Blood Profiling Atlas in Cancer (BloodPAC) Consortium." Clinical Pharmacology & Therapeutics (2017). BloodPAC was developed using the GDC Community Edition (CE), aka Bionimbus Gen3.
• To learn more about large-scale, secure, compliant, cloud-based computing environments for biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1.
@BobGrossman