A Gen3 Perspective of Disparate Data:
From Pipelines in Data Commons to AI in Data Ecosystems
Robert L. Grossman
Center for Translational Data Science
University of Chicago
March 12, 2019
San Francisco
Molecular Medicine Tri-Conference
1. Disparate Data in a World of Data
Commons and Data Ecosystems
Data Clouds (2010-2025) — Projects:
• Data objects in clouds
• Execute bioinformatics pipelines using workflow languages and Docker repositories
Data Commons (2015-2030) — Communities:
• Expose APIs for access to object and structured data
• Expose data models
• Harmonize data within a commons
Data Ecosystems (2018-2030) — Multiple Communities:
• Build an ecosystem of apps across commons & resources
• Harmonize data across commons
• Support ML/AI across commons & resources
Today, I'll talk about the transition from data commons to data ecosystems.
[Diagram: genomic data, imaging data, etc.* flow into data objects in cloud storage (AWS, GCP, private genomics clouds) and clinical data flows into structured data in databases; data curation & management, data exploration, and data analysis support genomic analysis pipelines (Dockstore), a clinical research data warehouse, and research discoveries. *also imaging data, proteomics data, etc.]
Today, I'll talk about supporting both data objects & structured data in data commons & ecosystems.
2. Building Gen3 Data Commons over the
Data Commons Framework Services
Narrow Middle Design (aka End-to-End Design Principle)
[Diagram: the narrow-middle (hourglass) design — bioinformaticians curating and submitting data, and researchers analyzing data and making discoveries, work through container-based workspaces, ML/AI apps, and notebooks, connected through a data commons that runs over data clouds.]
Compare: Saltzer, J.H., Reed, D.P. and Clark, D.D., 1984. End-to-end arguments in system design. ACM Transactions on Computer Systems 2(4), pp. 277-288.
[Diagram: the same architecture — genomic data, imaging data, proteomics data, etc. as data objects in cloud storage (AWS, GCP, private genomics clouds) and clinical data as structured data in databases, with data curation & management, data exploration, and data analysis supporting genomic analysis, clinical research, and research discoveries — now labeled as a data commons / data ecosystem built on DCFS standards.]
We have updated Gen3.org
Video 1
A Gen3 Data Commons Platform in Six Steps
1. Define a data model (a minimal sketch follows below).
2. Use the Gen3 software to auto-generate the data commons and associated API.
3. Import data into the commons using the Gen3 import application.
4. Use Gen3 to explore your data and create synthetic cohorts.
5. Use platforms such as Terra, Seven Bridges, Galaxy, etc. to analyze the synthetic cohorts.
6. Develop your own container-based workflows, applications and Jupyter Notebooks.
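To make step 1 concrete, here is a minimal, hedged sketch of the kind of node definition a data model contains and how a record can be validated against it. The node name and fields are illustrative only (real Gen3 data dictionaries are YAML files with additional structure such as links and system properties); the Python jsonschema package is used for the validation.

```python
# Illustrative sketch only: a simplified, JSON-Schema-style node definition of the
# kind a data dictionary contains. Field names here are made up for the example.
from jsonschema import validate

subject_node = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "subject",
    "type": "object",
    "required": ["submitter_id", "gender"],
    "properties": {
        "submitter_id": {"type": "string"},
        "gender": {"enum": ["female", "male", "unknown"]},
        "age_at_enrollment": {"type": "integer", "minimum": 0},
    },
}

record = {"submitter_id": "subject-001", "gender": "female", "age_at_enrollment": 54}
validate(instance=record, schema=subject_node)  # raises ValidationError if the record does not conform
print("record conforms to the subject node schema")
```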
Will be released in 2Q19
(Selected)
[Diagram: multiple data commons, each with its own data model (Data Model 1 ... Data Model 9).]
1. What are the minimum data access services for object and structured data?
2. What are the minimum data model services?
3. What are the minimum services for identity and access management to support a passport-type system?
Will be released in 2Q19
3. Setting Up and Operating a Data Commons
or Data Ecosystem
1. The Data Commons Framework Services (DCFS) is a set of software services for setting up and operating a data commons and cloud-based resources.
2. The DCFS is designed to support multiple data commons, knowledge bases, and applications as part of a data ecosystem.
3. It is used to help operate the NCI Cancer Research Data Commons (CRDC), NHLBI DataSTAGE, NHGRI AnVIL, and the NIAID Data Hub pilot.
4. The implementation is based on the open-source Gen3 software platform.
[Diagram: the narrow-middle design again — bioinformaticians curating and submitting data, and researchers analyzing data and making discoveries, work through container-based workspaces, ML/AI apps, and notebooks, connected through a data commons running over data clouds.]
[Diagram: multiple data commons, each with its own data model (Data Model 1 ... Data Model 9).]
1. Data commons and resources expose APIs for access to data and resources (a client-side sketch follows below).
2. Data commons expose their data models through APIs.
3. Data models include references to third-party ontologies and other authorities.
4. Authentication and authorization systems can interoperate.
5. 2Q19: Structured data can be serialized, versioned, exported, processed and imported.
Will be released in 2Q19
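As a client-side illustration of points 1 and 2 above, the sketch below resolves a data object by GUID and downloads the exposed data model over HTTP. The host, the GUID, and the endpoint paths are assumptions (they follow common Gen3 deployments but may differ per commons).

```python
# A minimal sketch of using commons APIs: resolve a data object by GUID and fetch
# the exposed data model. URLs and the GUID below are placeholders, not real endpoints.
import requests

COMMONS = "https://example-commons.org"  # hypothetical commons URL
GUID = "dg.ABCD/00000000-0000-0000-0000-000000000000"  # placeholder identifier

# 1. Data access API: look up a data object record (locations, size, hashes) by GUID.
obj = requests.get(f"{COMMONS}/index/{GUID}", timeout=30)
obj.raise_for_status()
print(obj.json().get("urls"))  # e.g. s3:// and gs:// URLs for the same object

# 2. Data model API: download the data dictionary exposed by the commons.
dictionary = requests.get(f"{COMMONS}/api/v0/submission/_dictionary/_all", timeout=30)
dictionary.raise_for_status()
print(sorted(dictionary.json().keys())[:5])  # first few node names in the data model
```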
Video 2
A Gen3 Data Commons Platform in Six Steps
1. Define a data model.
2. Use the Gen3 software to auto-generate the data commons and associated API.
3. Import data into the commons using the Gen3 import application.
4. Use Gen3 to explore your data and create synthetic cohorts.
5. Use platforms such as Terra, Seven Bridges, Galaxy, etc. to analyze the synthetic cohorts.
6. Develop your own container-based workflows, applications and Jupyter Notebooks.
Next steps:
1. Build data commons over the hosted Data Commons Framework Services.
2. Interoperate your data commons with other DCFS-compliant data commons.
Data Commons Framework Services (DCFS) Roadmap
2019
• DCFS services hosted by the
University of Chicago using a
Common Services Operations
Center (CSOC)
• You can build your own data
commons over the hosted DCFS
• Six production data commons
will be working with GA4GH to
standardize DCFS
2020
• Third parties can build data
commons by standing up an entire
stack including their own DCFS
• You can build your own data
commons using DCFS hosted by
the UChicago CSOC
• We expect a third party to host
DCFS and support data commons
over it
• CSOCs can interoperate
• First draft of GA4GH standard
Gen3.org dcf.gen3.org
4. Managing Structured Data in Data
Commons and Data Ecosystems
Linking Structured Clinical Data with Genomic Data
Object data – CRAM/BAM genomic data files, DICOM image files, anything stored in cloud object storage systems (AWS S3, GCP GCS).
Clinical data / graph data / core data / structured data – data that are harmonized to a data model and searchable using the data model and related APIs. Gen3 uses a graph data model as the logical model and PostgreSQL as the database.
[Diagram: data objects stored with GUIDs in one or more clouds; clinical data and other structured data stored in a database; data objects and clinical data linked in the data model.]
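A toy sketch of the linkage described above: a clinical node and a file node are plain records, the file node carries the GUID of the object in cloud storage, and an edge in the graph ties the two together. Node types, field names, and the GUID are illustrative, not Gen3's actual storage layout.

```python
# Toy illustration of linking structured clinical data to a data object by GUID.
subject = {
    "node_type": "subject",
    "submitter_id": "subject-001",
    "gender": "female",
}

sequencing_file = {
    "node_type": "submitted_aligned_reads",
    "submitter_id": "subject-001-wgs-bam",
    "object_id": "dg.ABCD/11111111-1111-1111-1111-111111111111",  # GUID resolving to the S3/GCS copies
    "file_size": 123456789,
    "md5sum": "0123456789abcdef0123456789abcdef",
}

# Edge in the graph data model: the file node belongs to the subject node.
edges = [("submitted_aligned_reads:subject-001-wgs-bam", "subject:subject-001")]

# A query layer over this graph (GraphQL in Gen3) can then answer questions like
# "give me the GUIDs of all aligned-reads files for female subjects".
files_for_female_subjects = [
    sequencing_file["object_id"]
    for src, dst in edges
    if dst == f"subject:{subject['submitter_id']}" and subject["gender"] == "female"
]
print(files_for_female_subjects)
```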
…, but what do we do for structured data?
• Within a data commons, we can use ETL tools, databases, NoSQL databases, data warehouses, etc.
• But what if we have 25 data commons that want to interoperate?
[Diagram: many data commons, each with its own data model (Data Model 1 ... Data Model 9).]
Will be released in 2Q19
Requirement | Approach | Gen3 Services
1. Make the data FAIR | Data objects are assigned GUIDs & metadata and placed in multiple clouds | IndexD, Fence, metadata services via Sheepdog and Peregrine (also part of DCF services)
2. Express the pipelines in a workflow language and make them FAIR | We support the Common Workflow Language | We support Dockstore, CWL & cwltool, use object services to manage CWL files, soon Cromwell
3. Encapsulate the code and tools | We encapsulate code in virtual machines & containers | We use Kubernetes, Docker, Dockstore and WES
4. Link data and code | Use notebooks | We support Jupyter notebooks and JupyterHub
5. Make structured data portable | ??? | ???
5. Portable Formats for Biomedical Data
Life Cycle of Clinical Data (Structured Data)
[Diagram: the clinical (structured) data life cycle —
• Initial upload and small changes to the schema, against an initial data model.
• Harmonization to a harmonized data model (with respect to ontologies, NCIt, etc.).
• New data requiring an updated data model.
• Data used by another project, requiring a new data model.
• A subset of data extracted from the main system as a synthetic cohort and imported into an analysis system.
• 2nd, 3rd, etc. data releases, with continuous creation of synthetic cohorts.
• 4th, 5th data releases with a new data model.
• Platform refreshed, with data and metadata migrated.
Legend: blue = schema change, green = data change, red = platform change.]
What is the Portable Format for Biomedical Data (PFB)?
● PFB is an Avro-based serialization format with a specific schema to import, export and evolve biomedical data.
● PFB specifies metadata and data in one file. Metadata includes the data dictionary, ontology references & relations between nodes.
● PFB is:
○ Portable: supporting import & export.
○ Extensible: data model changes, versioning, backward and forward compatibility.
○ Efficient: binary format.
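A minimal sketch of the single-file idea using the fastavro library: the schema, file-level metadata (here, an ontology reference), and the records travel together in one Avro container. The schema and metadata keys are illustrative and are not the actual PFB schema.

```python
# Sketch of the "metadata and data in one file" pattern with plain Avro (not real PFB).
import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "subject",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "gender", "type": "string"},
        {"name": "age_at_enrollment", "type": ["null", "int"], "default": None},
    ],
})

records = [
    {"submitter_id": "subject-001", "gender": "female", "age_at_enrollment": 54},
    {"submitter_id": "subject-002", "gender": "male", "age_at_enrollment": None},
]

with open("subjects.avro", "wb") as fo:
    # File-level metadata rides along with the data; Avro also embeds the schema itself.
    fastavro.writer(fo, schema, records, metadata={"ontology_reference": "NCIt"})

with open("subjects.avro", "rb") as fo:
    reader = fastavro.reader(fo)
    print(reader.metadata.get("ontology_reference"))  # "NCIt"
    print(reader.writer_schema["name"])               # "subject"
    print(list(reader))                               # the records themselves
```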
Why Avro?
Feature | Avro | Protobuf
Self-describing | ✓ | ✗
Schema evolution | ✓ | ✓
Dynamic schema | ✓ | Partially; needs recompilation
No need to compile | ✓ | ✗
Hadoop support | ✓, built-in | ✓, third-party libraries
JSON schema | ✓ | ✗, special IDL for schema
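The schema-evolution row can be illustrated with plain Avro: records written under an older schema are readable with a newer schema that adds a defaulted field, and no code generation or recompilation is involved because schemas are plain JSON. The field names below are made up.

```python
# Sketch of Avro schema evolution: read v1 data with a v2 reader schema.
import io
import fastavro

schema_v1 = fastavro.parse_schema({
    "type": "record", "name": "subject",
    "fields": [{"name": "submitter_id", "type": "string"}],
})

schema_v2 = fastavro.parse_schema({
    "type": "record", "name": "subject",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "vital_status", "type": "string", "default": "unknown"},  # new field
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, schema_v1, [{"submitter_id": "subject-001"}])

buf.seek(0)
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'submitter_id': 'subject-001', 'vital_status': 'unknown'}
```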
PFB Performance (preliminary results)
• KidsFirst dictionary (JSON): 0.21 MB
• PostgreSQL database: 277 MB
• JSON load time: 10 minutes
• PostgreSQL → PFB export takes 25 seconds
• Schema-only PFB: 0.08 MB
• Schema + data PFB: 38 MB; with compression: 9.7 MB (about 29 times smaller than the PostgreSQL database)
• PFB → PostgreSQL load time: 1 minute
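The PostgreSQL → PFB direction can be sketched as a streaming export: rows are pulled from the database with a server-side cursor and serialized into a compressed Avro container. Table names, column names, and connection settings below are assumptions, and real PFB export also carries the data dictionary and ontology references, which this sketch omits.

```python
# Sketch of a bulk export of structured data from PostgreSQL into an Avro container.
import fastavro
import psycopg2

schema = fastavro.parse_schema({
    "type": "record", "name": "subject",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "gender", "type": ["null", "string"], "default": None},
    ],
})

conn = psycopg2.connect("dbname=commons user=commons")   # assumed connection settings
with conn, conn.cursor(name="subject_export") as cur:    # named cursor: streams rows server-side
    cur.execute("SELECT submitter_id, gender FROM subject")
    rows = ({"submitter_id": sid, "gender": gender} for sid, gender in cur)
    with open("subjects.avro", "wb") as fo:
        fastavro.writer(fo, schema, rows, codec="deflate")  # compressed, as in the numbers above
conn.close()
```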
PFB simplifies the management of structured data in data ecosystems:
• PFB is much smaller and much faster for bulk import and export.
• PFB files contain data models and pointers to third-party ontologies and authorities.
• PFB files can be versioned, managed as data objects in clouds, and accessed via FAIR services.
# of nodes in data model | Sheepdog (sec) | PFB import (sec) | PFB export (sec)
10 | 14.75 | 3.25 | 3.25
100 | 121.25 | 3.5 | 5.5
1000 | 1209.75 | 13 | 11
10000 | 13349.25 | 92 | 69.75
Portable Format for Biomedical Data (PFB)
• PFB is an application-independent and system-independent serialization format for importing and exporting: 1) schema and other metadata, 2) pointers to third-party ontologies and authorities, and 3) data.
• PFB services can export to JSON.
[Diagram: application or commons 1 → PFB file → application or commons 2 (which can be the same application or commons 1); applications or services can process the PFB file in between; the PFB file can be managed as a data object with FAIR services.]
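One way an intermediate application or service might process a PFB-style file, per the diagram above: read the records, keep only those in a synthetic cohort, and write a new file with the embedded schema unchanged. The file names and the predicate are illustrative; production tooling for the real format (e.g. the pypfb package) handles the full PFB schema.

```python
# Sketch of processing a PFB-style Avro file between export and import.
import fastavro

def filter_cohort(in_path: str, out_path: str, predicate) -> int:
    """Copy records that satisfy `predicate` from one Avro file to another."""
    with open(in_path, "rb") as fin:
        reader = fastavro.reader(fin)
        schema = reader.writer_schema          # reuse the embedded schema unchanged
        cohort = [r for r in reader if predicate(r)]
    with open(out_path, "wb") as fout:
        fastavro.writer(fout, schema, cohort)
    return len(cohort)

# e.g. keep only subjects enrolled at age 50 or older (illustrative field name):
# n = filter_cohort("export_from_commons1.avro", "cohort_for_commons2.avro",
#                   lambda r: (r.get("age_at_enrollment") or 0) >= 50)
# print(f"wrote {n} records")
```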
Requirement | Approach | Gen3 Services
1. Make the data FAIR | Data objects are assigned GUIDs & metadata and placed in multiple clouds | IndexD, Fence, metadata services via Sheepdog and Peregrine (also part of DCF services)
2. Express the pipelines in a workflow language and make them FAIR | We support the Common Workflow Language | We support Dockstore, CWL & cwltool, use object services to manage CWL files, soon Cromwell
3. Encapsulate the code and tools | We encapsulate code in virtual machines & containers | We use Kubernetes, Docker, Dockstore and WES
4. Link data and code | Use notebooks | We support Jupyter notebooks and JupyterHub
5. Make structured data portable | Make the data self-describing | Import & export PFB
For more information:
• Review: Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing
and Sharing Genomic Data, Trends in Genetics 35 (2019) pp. 223-234,
https://doi.org/10.1016/j.tig.2018.12.006. See also https://arxiv.org/abs/1809.01699
• To learn about data ecosystems: Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The
Cancer Journal: The Journal of Principles and Practice of Oncology, May/June 2018, Volume 24
Number 3, pages 122-126 doi: 10.1097/PPO.0000000000000318.
• To learn more about data commons: Robert L. Grossman, et al., A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608
• To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared
vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The
GDC was developed using Bionimbus Gen2.
• To learn more about BloodPAC: Grossman, R. L., et al. "Collaborating to compete: Blood Profiling Atlas in Cancer (BloodPAC) Consortium." Clinical Pharmacology & Therapeutics (2017). BloodPAC was developed using the GDC Community Edition (CE), aka Bionimbus Gen3.
• To learn more about large-scale, secure, compliant, cloud-based computing environments for biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1.
@BobGrossman