NIH Data Commons
NIH Data Storage Summit
October 20, 2017
Vivien Bonazzi Ph.D.
Senior Advisor for Data Science (NIH/OD)
Project Leader for the NIH Data Commons
What’s driving the need for a
Data Commons?
Challenges with the current state of data
 Generating large volumes of biomedical data
 Cheap to generate, costly to store on local servers
 Multiple copies of the same data in different locations
 Building data resources that cannot be easily found by others
 Data resources are not connected to each other and cannot
share data or tools
 No standards and guidelines on how to share and access data
Convergence of factors
 Increasing recognition of the need to support data sharing
 Availability of digital technologies and infrastructures that
support Data at scale
 Cloud: data storage, compute and sharing
 FAIR – Findable Accessible Interoperable Reusable
 Understanding that data is a valuable resource that needs to be
sustained
https://gds.nih.gov/
Went into effect January 25, 2015
NCI guidance:
http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
Requires public sharing of genomic data sets
NIH Data Summit - The NIH Data Commons
Findable
Accessible
Interoperable
Reusable
DATA has VALUE
DATA is CENTRAL to the Digital Economy
a signal of the coming Digital Economy
Scientific digital assets
Data
Software
Workflows
Documentation
Journal Articles
Organizations will be defined by their digital assets
The most successful organizations of the
future will be those that can
leverage their digital assets and transform
them into a digital enterprise
Data Commons
Enabling data driven science
Enable investigators to leverage all possible data and
tools in the effort to accelerate biomedical discoveries,
therapies and cures
by
driving the development of data infrastructure and data
science capabilities through collaborative research and
robust engineering
Developing a Data Commons
 Treats products of research – data, methods, tools,
papers etc. as digital objects
 For this presentation: Data = Digital Objects
 These digital objects exist in a shared virtual space
 Find, Deposit, Manage, Share, and Reuse data,
software, metadata and workflows
 Digital object compliance through FAIR principles:
 Findable
 Accessible (and usable)
 Interoperable
 Reusable
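The digital-object framing above can be pictured in code. The sketch below is illustrative only — the class, fields, and FAIR check are hypothetical, not an NIH specification: each research product gets a globally unique identifier, metadata for findability, a resolvable location, and a declared format.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class DigitalObject:
    """A research product (data set, tool, workflow, paper) treated as a digital object."""
    name: str
    location: str                                  # resolvable URL -> Accessible
    checksum: str                                  # integrity check -> trustworthy Reuse
    metadata: dict = field(default_factory=dict)   # indexed terms -> Findable
    mime_type: str = "application/octet-stream"    # declared format -> Interoperable
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))  # stable identifier

def is_fair(obj: DigitalObject) -> bool:
    """Minimal (toy) FAIR check: identifier, metadata, location, and format all present."""
    return bool(obj.guid and obj.metadata and obj.location and obj.mime_type)

# Hypothetical example object
sample = DigitalObject(
    name="TOPMed sample alignment",
    location="https://example-cloud/topmed/sample1.cram",
    checksum="sha256:0f1e2d",
    metadata={"study": "TOPMed", "data_type": "WGS alignment"},
)
assert is_fair(sample)
```

A real Commons would of course back these fields with community metadata standards and a resolvable identifier service rather than bare UUIDs.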
The Data Commons
is a platform
that allows transactions to occur
on FAIR data at scale
The Data Commons Platform
Compute Platform: Cloud
Services: APIs, Containers, Indexing
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
FAIR
App store/User Interface/Portal
PaaS
SaaS
IaaS
Other Data Commons
Data Commons Engagement
US Government Agencies & EU groups
Interoperability with other Commons
 Common goals – democratizing, collaborating & sharing data
 Reuse of currently available open source tools which support
interoperability
 GA4GH, UCSC, GDC, NYGC
 May 2017 BioIT Commons Session
 Shared open standard APIs for data access and computing
 Ability to deploy and compute across multiple cloud environments
 Docker containers – Dockstore/Docker registry
 Workflows management, sharing and deployment
 Discoverability (indexing) objects across cloud commons
 Global Unique identifiers
 Common user authentication system
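One way to picture the GUID-plus-indexing idea in the list above: a shared index maps a stable identifier to replicas in each cloud, and a client resolves the identifier to whichever copy sits in its own environment. Everything here — the index, the GUID scheme, the bucket URLs — is a hypothetical sketch in the spirit of the shared data-access APIs mentioned above, not the actual Commons API.

```python
from typing import Optional

# Toy replica index: one GUID, copies in multiple clouds (all names hypothetical).
INDEX = {
    "guid:topmed-0001": [
        {"cloud": "aws", "url": "s3://nih-commons/topmed/0001.cram"},
        {"cloud": "gcp", "url": "gs://nih-commons/topmed/0001.cram"},
    ],
}

def resolve(guid: str, preferred_cloud: Optional[str] = None) -> str:
    """Return a concrete URL for a GUID, preferring a replica in the caller's cloud."""
    replicas = INDEX.get(guid)
    if not replicas:
        raise KeyError(f"unknown GUID: {guid}")
    if preferred_cloud:
        for replica in replicas:
            if replica["cloud"] == preferred_cloud:
                return replica["url"]
    return replicas[0]["url"]  # fall back to any available replica

# A workflow running on GCP gets the GCP copy; no cross-cloud transfer needed.
assert resolve("guid:topmed-0001", "gcp").startswith("gs://")
```

The design point is that the identifier, not the URL, is what gets cited and shared — the physical location can change or multiply without breaking anything downstream.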
The Good News
 Considerable agreement about the general approaches to
be taken
 Many people are already addressing many of the problems:
 Data architectures/platforms
 Automated/semi-automated data access/authentication protocols
 Common metadata standards and templates
 Open tools and software
 Instantiation and initial metrics of Findability, Accessibility,
Interoperability, and Reusability
 Relationships/agreements with Cloud Service Providers that leverage
their interest in hosting NIH data
 Moving data to the cloud and operating in a cloud environment
The Challenges
 A need to “Bring it all Together” – Community endorsement of:
 Metadata standards/tools/approaches
 Crosswalks between equivalent terms/ontologies
 Robust, shared approaches to data access/authentication
 Best practices that will enable existing data to become FAIR and will
guide generation of future datasets
 Rapidly evolving field makes approaches/tools/etc subject to
change – approaches need to be adaptable
 Effort is required to adapt data to community standards and move
data to the cloud
 How much does that cost and how long does it take?
 Lack of interoperability between cloud providers
The Challenges
 Making data FAIR comes with a cost
 How much does it actually cost?
 How can we minimize the cost?
 How do we determine whether any one set of data warrants the
expense?
 What is the value added to the data by making it FAIR?
 What new science can be achieved?
 How can new derived data or new computational approaches be
added to the dataset to enrich it?
 What are the limitations of FAIRness from dataset to dataset?
Development of a
NIH Data Commons Pilot
NIH Data Commons Pilot
allows access, use and sharing
of large, high value NIH data
in the cloud
NIH Data Commons Pilot
NIH Data Commons Structure
26
Cloud
Services: APIs, Containers, GUIDs, Indexing, Search,
Auth
ACCESS
Scientific analysis tools/workflows
Data
“Reference” Data Sets
TOPMed, GTEx, MODs
FAIR
App store/User Interface/Portal/Workspace
PaaS
SaaS
IaaS
Operationalizing
the NIH Data Commons Pilot
NIH Data Commons Pilot : Implementation
Storage, NIH Marketplace, Metrics and Costs
Leveraging and extending relationships established as part of BD2K
to provide access to cloud storage and compute
Supplements: TOPMed, GTEx, MODs groups
Prepare (and move) data sets to the cloud for storage, access and
scientific use
Work collaboratively with the OT awardees to build towards data access
Data Commons OT Solicitation: Other Transaction
ROA: Research Opportunity Announcement
Developing the fundamental FAIR computational components to
support access, use and sharing of the 3 data sets above
NIH Data Commons Pilot Consortium
 Establishing a new NIH Marketplace
 access to a sustainable cloud infrastructure for data science at NIH
 Over the next 18 months, NIH will establish its own NIH Cloud Marketplace
 Data Commons Pilot Consortium awardees will be able to acquire cloud storage and compute
services
 Enable ICs to easily acquire cloud storage and compute services from commercial
cloud providers, resellers, and integrators
 Building on existing relationship with CSPs
 Led by CIT with input from Multi-IC working group
Storage, NIH Marketplace, Metrics and Costs
 Assessment and Evaluation
 What are the costs associated with cloud storage and usage?
 What are the business best practices?
 How should costs be paid?
 Who should pay them?
 How should highly used data be managed vs less used data?
 Are data producers supportive of this model?
 Are users (of all experience levels) able to access and use data effectively?
 How will we know if the Data Commons Pilot is successful?
 How to adjust to changing needs?
Storage, NIH Marketplace, Metrics and Costs
Supplements to 3 Test Data Set Groups
 Administrative Supplements to TOPMed, GTEx and MODs
 PIs for each data set were requested to review the OT (ROA) and
determine appropriate ways to interact
 Prepare (and move) data sets to the cloud for storage, access
and scientific use
 Make community workflows and cloud based tools of popular
analysis pipelines from the 3 datasets accessible
 Facilitate discovery and interpretation of the association of
human and model organism genotypes and phenotypes
NIH Data Commons: OT ROA
 Key Capabilities – modular components
 Development of Community Supported FAIR Guidelines and Metrics
 Global Unique Identifiers (GUID) for FAIR biomedical data
 Open Standard APIs (interoperability & connectivity)
 Cloud Agnostic Architecture and Frameworks
 Cloud User Workspaces
 Research Ethics, Privacy, and Security (AUTH)
 Indexing and Search
 Scientific Use cases
 Training, Outreach, Coordination
 Stage 1: 180-day window
 Develop MVPs (Minimum Viable Products)
 Demonstrations of the Data Commons and its components
 Have one copy of each test data set in each cloud provider
 Understanding of the process required to achieve this
 Draft version of a single standard access control system
 Be able to access and use the data through the access control system
 Able to use a variety of analysis tools and pipelines on the 3 data sets in the
cloud – (driven by scientific use cases)
 Have a rudimentary ability to query across test data sets
 Display phenotype, expression and variant data aligned with a specific gene or
genomic location
 Display model organism orthologs for a given set of human genes
 Draft FAIR guidelines and metrics
 Understand how each of the computational components that support the ability
to access data fit together and what standards are needed
 Written plans of how and why these demonstrations should be extended into a full
Pilot
NIH Data Commons Pilot: Outcomes
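One Stage 1 demonstration above — displaying model-organism orthologs for a given set of human genes — could look like the following toy lookup. The gene symbols are real, but the index and function are invented for illustration; a real query would run against the indexed MOD and human test data sets in the cloud.

```python
# Toy ortholog index across human and model-organism test data (data hypothetical).
ORTHOLOGS = {
    "TP53":  {"mouse": "Trp53", "zebrafish": "tp53"},
    "BRCA2": {"mouse": "Brca2", "zebrafish": "brca2"},
}

def model_orthologs(human_genes):
    """For each human gene, return known model-organism orthologs (empty if unindexed)."""
    return {gene: ORTHOLOGS.get(gene, {}) for gene in human_genes}

result = model_orthologs(["TP53", "BRCA2"])
assert result["TP53"]["mouse"] == "Trp53"
```

Even this trivial version shows why cross-data-set queries need the shared GUIDs, APIs, and metadata standards listed among the key capabilities: without them, each data set would require its own bespoke lookup.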
 Stage 2: 4 year period
 To extend and fully implement the Data Commons Pilot based on the
design strategies and capabilities developed as part of Stage 1
 Review of MVP/demonstrations and written plans from Stage 1
 Goals and Milestones with clear and specific outcomes
 Evaluate, negotiate, and revise terms of existing awards
 Award additional OTs
NIH Data Commons Pilot: Outcomes
Acknowledgments
DPCPSI: Jim Anderson, Betsy Wilder, Vivien Bonazzi, Marie Nierras, Rachel Britt,
Sonyka Ngosso, Lora Kutkat, Kristi Faulk, Jen Lewis, Kate Nicholson,
Chris Darby, Tonya Scott
NHLBI: Gary Gibbons, Alastair Thomson, Teresa Marquette, Jeff Snyder,
Melissa Garcia, Maarten Lerkes, Ann Gawalt, Cashell Jaquish,
George Papanicolaou
NHGRI: Eric Green, Valentina di Francesco, Ajay Pillai, Simona Volpi, Ken Wiley
NIAID: Nick Weber
CIT: Andrea Norris
NLM: Patti Brennan
NCBI: Steve Sherry
Stay in Touch
QR Business Card
LinkedIn
@Vivien.Bonazzi
Slideshare
Blog
(Coming soon!)

Editor's Notes

  • #2: Current snapshot of Commons status
  • #20: Development of FAIR-ness metrics
  • #25: The Data Commons is a federated way to provide access to and sharing of large, high-value NIH data. The purpose of a cloud-based Data Commons is to make large data sets accessible and usable by the broader community. Having one copy of a large data set in the cloud means it is accessible to many researchers, who do not need to copy it from NCBI (or other repositories) to the cloud every time they want to use it. One copy of a large data set in the cloud, accessed many times by researchers who pay only to compute on that data, is more cost- and time-effective than moving the same large data set to the cloud repeatedly. A cloud-based Data Commons becomes much more powerful when community-based, standardized methods and systems are adopted. These standards govern how data and tools interact with each other and with the computing environment they sit within (cloud or other), and how data and tools are made accessible to the user. Standards specifically relate to the FAIR guidelines, APIs for accessing data, workflows and tools, and Docker containers for deploying tools to the cloud. Standards are what enable a federated Commons: they create the basic ground rules and common language for interactions in the system.
  • #27: The Data Commons Framework describes the ecosystem that the OT solicitation is building towards. Each of the key capabilities described in the OT has a major role in the development of the ecosystem.
  • #31: Governance of the Commons can be found on slide XX
  • #34: The purpose of this slide is to give a sense that providing access to the data requires a series of modular, reusable components. I won't describe each KC, but I want to convey that there are modular components that fit together to permit access.
  • #37: Multi-IC Working Group co-chairs for the Data Commons Pilot: Gary Gibbons, Eric Green, Patti Brennan, Jim Anderson, Andrea Norris