NIH Data Commons
NIH Data Storage Summit
October 20, 2017
Vivien Bonazzi Ph.D.
Senior Advisor for Data Science (NIH/OD)
Project Leader for the NIH Data Commons
What’s driving the need for a
Data Commons?
Challenges with the current state of data
 Generating large volumes of biomedical data
 Cheap to generate, costly to store on local servers
 Multiple copies of the same data in different locations
 Building data resources that cannot be easily found by others
 Data resources are not connected to each other and cannot
share data or tools
 No standards and guidelines on how to share and access data
Convergence of factors
 Increasing recognition of the need to support data sharing
 Availability of digital technologies and infrastructures that
support Data at scale
 Cloud: data storage, compute and sharing
 FAIR – Findable Accessible Interoperable Reusable
 Understanding that data is a valuable resource that needs to be
sustained
https://gds.nih.gov/
Went into effect January 25, 2015
NCI guidance:
http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
Requires public sharing of genomic data sets
NIH Data Summit - The NIH Data Commons
Findable
Accessible
Interoperable
Reusable
DATA has VALUE
DATA is CENTRAL to the Digital Economy
a signal of the coming Digital Economy
Scientific digital assets
Data
Software
Workflows
Documentation
Journal Articles
Organizations will be defined by their digital assets
The most successful organizations of the
future will be those that can
leverage their digital assets and transform
them into a digital enterprise
Data Commons
Enabling data driven science
Enable investigators to leverage all possible data and
tools in the effort to accelerate biomedical discoveries,
therapies and cures
by
driving the development of data infrastructure and data
science capabilities through collaborative research and
robust engineering
Developing a Data Commons
 Treats products of research – data, methods, tools,
papers etc. as digital objects
 For this presentation: Data = Digital Objects
 These digital objects exist in a shared virtual space
 Find, Deposit, Manage, Share, and Reuse data,
software, metadata and workflows
 Digital object compliance through FAIR principles:
 Findable
 Accessible (and usable)
 Interoperable
 Reusable
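The digital-object framing above can be pictured in code. The sketch below is illustrative only — the class, fields, and FAIR check are hypothetical, not an NIH specification: each research product gets a globally unique identifier, metadata for findability, a resolvable location, and a declared format.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class DigitalObject:
    """A research product (data set, tool, workflow, paper) treated as a digital object."""
    name: str
    location: str                                  # resolvable URL -> Accessible
    checksum: str                                  # integrity check -> trustworthy Reuse
    metadata: dict = field(default_factory=dict)   # indexed terms -> Findable
    mime_type: str = "application/octet-stream"    # declared format -> Interoperable
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))  # stable identifier

def is_fair(obj: DigitalObject) -> bool:
    """Minimal (toy) FAIR check: identifier, metadata, location, and format all present."""
    return bool(obj.guid and obj.metadata and obj.location and obj.mime_type)

# Hypothetical example object
sample = DigitalObject(
    name="TOPMed sample alignment",
    location="https://example-cloud/topmed/sample1.cram",
    checksum="sha256:0f1e2d",
    metadata={"study": "TOPMed", "data_type": "WGS alignment"},
)
assert is_fair(sample)
```

A real Commons would of course back these fields with community metadata standards and a resolvable identifier service rather than bare UUIDs.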
The Data Commons
is a platform
that allows transactions to occur
on FAIR data at scale
The Data Commons Platform
Compute Platform: Cloud
Services: APIs, Containers, Indexing
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
FAIR
App store/User Interface/Portal
PaaS
SaaS
IaaS
Other Data Commons
Data Commons Engagement
US Government Agencies & EU groups
Interoperability with other Commons
 Common goals – democratizing, collaborating & sharing data
 Reuse of currently available open source tools which support
interoperability
 GA4GH, UCSC, GDC, NYGC
 May 2017 BioIT Commons Session
 Shared open standard APIs for data access and computing
 Ability to deploy and compute across multiple cloud environments
 Docker containers – Dockstore/Docker registry
 Workflows management, sharing and deployment
 Discoverability (indexing) objects across cloud commons
 Global Unique identifiers
 Common user authentication system
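One way to picture the GUID-plus-indexing idea in the list above: a shared index maps a stable identifier to replicas in each cloud, and a client resolves the identifier to whichever copy sits in its own environment. Everything here — the index, the GUID scheme, the bucket URLs — is a hypothetical sketch in the spirit of the shared data-access APIs mentioned above, not the actual Commons API.

```python
from typing import Optional

# Toy replica index: one GUID, copies in multiple clouds (all names hypothetical).
INDEX = {
    "guid:topmed-0001": [
        {"cloud": "aws", "url": "s3://nih-commons/topmed/0001.cram"},
        {"cloud": "gcp", "url": "gs://nih-commons/topmed/0001.cram"},
    ],
}

def resolve(guid: str, preferred_cloud: Optional[str] = None) -> str:
    """Return a concrete URL for a GUID, preferring a replica in the caller's cloud."""
    replicas = INDEX.get(guid)
    if not replicas:
        raise KeyError(f"unknown GUID: {guid}")
    if preferred_cloud:
        for replica in replicas:
            if replica["cloud"] == preferred_cloud:
                return replica["url"]
    return replicas[0]["url"]  # fall back to any available replica

# A workflow running on GCP gets the GCP copy; no cross-cloud transfer needed.
assert resolve("guid:topmed-0001", "gcp").startswith("gs://")
```

The design point is that the identifier, not the URL, is what gets cited and shared — the physical location can change or multiply without breaking anything downstream.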
The Good News
 Considerable agreement about the general approaches to
be taken
 Many people are already addressing many of the problems:
 Data architectures/platforms
 Automated/semi-automated data access/authentication protocols
 Common metadata standards and templates
 Open tools and software
 Instantiation and initial metrics of Findability, Accessibility,
Interoperability, and Reusability
 Relationships/agreements with Cloud Service Providers that leverage
their interest in hosting NIH data
 Moving data to the cloud and operating in a cloud environment
The Challenges
 A need to “Bring it all Together” – Community endorsement of:
 Metadata standards/tools/approaches
 Crosswalks between equivalent terms/ontologies
 Robust, shared approaches to data access/authentication
 Best practices that will enable existing data to become FAIR and will
guide generation of future datasets
 Rapidly evolving field makes approaches/tools/etc subject to
change – approaches need to be adaptable
 Effort is required to adapt data to community standards and move
data to the cloud
 How much does that cost and how long does it take?
 Lack of interoperability between cloud providers
The Challenges
 Making data FAIR comes with a cost
 How much does it actually cost?
 How can we minimize the cost?
 How do we determine whether any one set of data warrants the
expense?
 What is the value added to the data by making it FAIR?
 What new science can be achieved?
 How can new derived data or new computational approaches be
added to the dataset to enrich it?
 What are the limitations of FAIRness from dataset to dataset?
Development of a
NIH Data Commons Pilot
NIH Data Commons Pilot
allows access, use and sharing
of large, high value NIH data
in the cloud
NIH Data Commons Pilot
NIH Data Commons Structure
26
Cloud
Services: APIs, Containers, GUIDs, Indexing, Search,
Auth
ACCESS
Scientific analysis tools/workflows
Data
“Reference” Data Sets
TOPMed, GTEx, MODs
FAIR
App store/User Interface/Portal/Workspace
PaaS
SaaS
IaaS
Operationalizing
the NIH Data Commons Pilot
NIH Data Commons Pilot : Implementation
Storage, NIH Marketplace, Metrics and Costs
Leveraging and extending relationships established as part of BD2K
to provide access to cloud storage and compute
Supplements: TOPMed, GTEx, MODs groups
Prepare (and move) data sets to the cloud for storage, access and
scientific use
Work collaboratively with the OT awardees to build towards data access
Data Commons OT Solicitation: Other Transaction
ROA: Research Opportunity Announcement
Developing the fundamental FAIR computational components to
support access, use and sharing of the 3 data sets above
NIH Data Commons Pilot Consortium
 Establishing a new NIH Marketplace
 access to a sustainable cloud infrastructure for data science at NIH
 Over the next 18 months, NIH will establish its own NIH Cloud Marketplace
 Data Commons Pilot Consortium awardees will be able to acquire cloud storage and compute
services
 Enable ICs to easily acquire cloud storage and compute services from commercial
cloud providers, resellers, and integrators
 Building on existing relationship with CSPs
 Led by CIT with input from Multi-IC working group
Storage, NIH Marketplace, Metrics and Costs
 Assessment and Evaluation
 What are the costs associated with cloud storage and usage?
 What are the business best practices?
 How should costs be paid?
 Who should pay them?
 How should highly used data be managed vs less used data?
 Are data producers supportive of this model?
 Are users (of all experience levels) able to access and use data effectively?
 How will we know if the Data Commons Pilot is successful?
 How to adjust to changing needs?
Storage, NIH Marketplace, Metrics and Costs
Supplements to 3 Test Data Set Groups
 Administrative Supplements to TOPMed, GTEx and MODs
 PIs for each data set were requested to review the OT (ROA) and
determine appropriate ways to interact
 Prepare (and move) data sets to the cloud for storage, access
and scientific use
 Make community workflows and cloud based tools of popular
analysis pipelines from the 3 datasets accessible
 Facilitate discovery and interpretation of the association of
human and model organism genotypes and phenotypes
NIH Data Commons: OT ROA
 Key Capabilities – modular components
 Development of Community Supported FAIR Guidelines and Metrics
 Global Unique Identifiers (GUID) for FAIR biomedical data
 Open Standard APIs (interoperability & connectivity)
 Cloud Agnostic Architecture and Frameworks
 Cloud User Workspaces
 Research Ethics, Privacy, and Security (AUTH)
 Indexing and Search
 Scientific Use cases
 Training, Outreach, Coordination
 Stage 1: 180-day window
 Develop MVPs (Minimum Viable Products)
 Demonstrations of the Data Commons and its components
 Have one copy of each test data set in each cloud provider
 Understanding of the process required to achieve this
 Draft version of a single standard access control system
 Be able to access and use the data through the access control system
 Able to use a variety of analysis tools and pipelines on the 3 data sets in the
cloud – (driven by scientific use cases)
 Have a rudimentary ability to query across test data sets
 Display phenotype, expression and variant data aligned with a specific gene or
genomic location
 Display model organism orthologs for a given set of human genes
 Draft FAIR guidelines and metrics
 Understand how each of the computational components that support the ability
to access data fit together and what standards are needed
 Written plans of how and why these demonstrations should be extended into a full
Pilot
NIH Data Commons Pilot: Outcomes
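One Stage 1 demonstration above — displaying model-organism orthologs for a given set of human genes — could look like the following toy lookup. The gene symbols are real, but the index and function are invented for illustration; a real query would run against the indexed MOD and human test data sets in the cloud.

```python
# Toy ortholog index across human and model-organism test data (data hypothetical).
ORTHOLOGS = {
    "TP53":  {"mouse": "Trp53", "zebrafish": "tp53"},
    "BRCA2": {"mouse": "Brca2", "zebrafish": "brca2"},
}

def model_orthologs(human_genes):
    """For each human gene, return known model-organism orthologs (empty if unindexed)."""
    return {gene: ORTHOLOGS.get(gene, {}) for gene in human_genes}

result = model_orthologs(["TP53", "BRCA2"])
assert result["TP53"]["mouse"] == "Trp53"
```

Even this trivial version shows why cross-data-set queries need the shared GUIDs, APIs, and metadata standards listed among the key capabilities: without them, each data set would require its own bespoke lookup.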
 Stage 2: 4 year period
 To extend and fully implement the Data Commons Pilot based on the
design strategies and capabilities developed as part of Stage 1
 Review of MVP/demonstrations and written plans from Stage 1
 Goals and Milestones with clear and specific outcomes
 Evaluate, negotiate, and revise terms of existing awards
 Award additional OTs
NIH Data Commons Pilot: Outcomes
Acknowledgments
DPCPSI: Jim Anderson, Betsy Wilder, Vivien Bonazzi, Marie Nierras, Rachel Britt,
Sonyka Ngosso, Lora Kutkat, Kristi Faulk, Jen Lewis, Kate Nicholson,
Chris Darby, Tonya Scott
NHLBI: Gary Gibbons, Alastair Thomson, Teresa Marquette, Jeff Snyder,
Melissa Garcia, Maarten Lerkes, Ann Gawalt, Cashell Jaquish,
George Papanicolaou
NHGRI: Eric Green, Valentina di Francesco, Ajay Pillai, Simona Volpi, Ken Wiley
NIAID: Nick Weber
CIT: Andrea Norris
NLM: Patti Brennan
NCBI: Steve Sherry
Stay in Touch
QR Business Card
LinkedIn
@Vivien.Bonazzi
Slideshare
Blog
(Coming soon!)

Editor's Notes

  • #2: Current snapshot of Commons status
  • #20: Development of FAIR-ness metrics
  • #25: The Data Commons is a federated way to provide access to and sharing of large, high-value NIH data. The purpose of a cloud-based Data Commons is to make large data sets accessible and usable by the broader community. Having one copy of a large data set in the cloud means it is accessible to many researchers, who do not need to copy it from NCBI (or other repositories) to the cloud every time they want to use it. One copy of a large data set in the cloud, accessed many times by researchers who pay only to compute on that data, is more cost- and time-effective than moving the same large data set to the cloud repeatedly. A cloud-based Data Commons becomes much more powerful when community-based, standardized methods and systems are adopted. These standards govern how data and tools interact with each other and with the computing environment they sit within (cloud or other), and how data and tools are made accessible to the user. Standards specifically relate to the FAIR guidelines, APIs for accessing data, workflows and tools, and Docker containers for deploying tools to the cloud. Standards are what enable a federated Commons: they create the basic ground rules and common language for interactions in the system.
  • #27: The Data Commons Framework describes the ecosystem that the OT solicitation is building towards. Each of the key capabilities described in the OT has a major role in the development of the ecosystem.
  • #31: Governance of the Commons can be found on slide XX
  • #34: The purpose of this slide is to give a sense that providing access to the data requires a series of modular, reusable components. I won't describe each KC, but I want to convey that there are modular components that fit together to permit access.
  • #37: Multi-IC Working Group co-chairs for the Data Commons Pilot: Gary Gibbons, Eric Green, Patti Brennan, Jim Anderson, Andrea Norris