BD2K & the Commons @ NIH
Vivien Bonazzi, Ph.D.
Senior Advisor for Data Science Technologies
Office of Data Science (ADDS)
National Institutes of Health
BD2K and the Commons: ELIXIR All Hands
A Digital Story
NIH Data
US Government Memo - Increasing Access to
Results of Federally Funded Scientific Research
In Feb 2013 the US OSTP issued a memo calling for
all US Federal Agencies to make digital assets
from federally funded research available
OSTP - Office of Science and Technology Policy at the White House
Public Access to Data Memo:
http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
US Government Memo - Increasing Access to
Results of Federally Funded Scientific Research
Each agency’s public access plan shall:
Maximize access, by the general public and without
charge, to digitally formatted scientific data
created with Federal funds while:
i) protecting confidentiality and personal privacy
ii) recognizing proprietary interests, business confidential information, and intellectual
property rights and avoiding significant negative impact on intellectual property
rights, innovation, and U.S. competitiveness, and
iii) preserving the balance between the relative value of long-term preservation and
access and the associated cost and administrative burden.
NIH Response
 In response to the
incredible growth of large
biomedical (digital)
datasets, the Director of
NIH established a special
Data and Informatics
Working Group (DIWG)
http://acd.od.nih.gov/diwg.htm
NIH Response
Establish new data science research and training programs
Fulfilling the recommendation of the ACD WG report
Big Data to Knowledge (BD2K) - 2013
http://datascience.nih.gov/bd2k
Establish a new position:
NIH Associate Director of Data Science
(ADDS)
Phil Bourne – 2014
CHAPTER 3
BD2K – Big Data to Knowledge
 Expanding training programs in data science
 Finding and Sharing Data & Software through Indexes
 Targeted Software tools and methods
 Data wrangling
 Privacy & security of data
 Data repurposing
 Applications of metadata
 Advancing Big Data methods, tools and applications
 BD2K Centers of Excellence
https://datascience.nih.gov/bd2k/funded-programs
To enable biomedical research as
a digital enterprise through which
new discoveries are made and
knowledge generated by
maximizing community
engagement and productivity.
NIH ADDS Mission Statement
To use data science
to foster an
Open Digital Ecosystem
that will accelerate
efficient, cost-effective
biomedical research
to enhance health, lengthen
life, and reduce illness and
disability
Enabling digital Ecosystems
via a Commons & BD2K
Leveraging BD2K efforts
Harnessing e-infrastructures
- Public-private partnerships & Interagency collaborations
Collaborating with external communities
Commons : Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable
and flexible digital technologies
In collaboration with global communities
What are the PRINCIPLES of a Commons?
 Supports a digital biomedical ecosystem
 Treats products of research – data, software, methods,
papers etc. as digital objects
 Digital objects exist in a shared virtual space
Find, Deposit, Manage, Share and Reuse data,
software, metadata and workflows
 Digital objects need to conform to FAIR principles:
 Findable
 Accessible (and usable)
 Interoperable
 Reusable
Developing a Commons Framework
 Exploits new scalable computing technologies - Cloud
 Making digital objects FAIR:
 Indexable/Findable, Accessible & Usable, Interoperable,
Reproducible
 Simplifies access, sharing and interoperability of digital objects
such as data, software, metadata and workflows
 Provides physical or logical access to digital objects
 Provides understanding and accounting of usage patterns
 Is potentially more cost effective given digital growth
 Gives currency to digital objects and the people who develop and
support them
Commons Framework (layered stack, from the slide diagram):
 Compute Platform: Cloud or SC Facilities
 Services: APIs, Containers, Indexing
 Software: Services & Tools (scientific analysis tools/workflows)
 Data: “Reference” Data Sets; User-defined data
 Digital Object Compliance (spans all layers)
 App store/User Interface
https://datascience.nih.gov/commons
Commons Framework (the same stack, annotated with cloud service models):
 IaaS – Compute Platform: Cloud or SC Facilities
 PaaS – Services: APIs, Containers, Indexing
 SaaS – Software: Services & Tools (scientific analysis tools/workflows); App store/User Interface
 Data: “Reference” Data Sets; User-defined data
 Digital Object Compliance (spans all layers)
https://datascience.nih.gov/commons
Commons: Digital Object Compliance
 Attributes of digital research objects in the Commons
Initial Phase
 Unique digital object identifiers resolvable to the original authoritative source
 Machine readable
 A minimal set of searchable metadata
 Physically available in a cloud based Commons provider
 Clear access rules (especially important for human subjects data)
 An entry (with metadata) in one or more indices
Future Phases
 Standard, community based unique digital object identifiers
 Conform to community approved standard metadata and ontologies for
enhanced searching
 Digital objects accessible via open standard APIs
 Are physically and logically available to the Commons
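The “Initial Phase” attributes above can be sketched as a simple checklist. This is a minimal, illustrative sketch only: the field names below are assumptions for the example, not an NIH specification.

```python
# Hypothetical sketch of the "Initial Phase" compliance attributes.
# Field names are illustrative stand-ins, not an NIH specification.
REQUIRED_FIELDS = {
    "identifier",    # unique ID, resolvable to the authoritative source
    "title",         # part of the minimal searchable metadata
    "description",
    "access_rules",  # clear access rules (key for human subjects data)
    "location",      # cloud-based Commons provider hosting the object
    "indexed_in",    # one or more indices carrying a metadata entry
}

def is_commons_compliant(record: dict) -> bool:
    """Return True if a digital-object record carries the minimal
    searchable metadata and placement attributes listed above."""
    return REQUIRED_FIELDS.issubset(record) and all(
        record[field] for field in REQUIRED_FIELDS
    )

example = {
    "identifier": "doi:10.0000/example",
    "title": "HMP1 16S rRNA profiles",
    "description": "Microbiome reference data set",
    "access_rules": "open",
    "location": "aws:s3://example-bucket",
    "indexed_in": ["bioCADDIE"],
}
```

A record missing any of these fields (or carrying an empty value) would fail the check, which is the spirit of the compliance list even if the real attribute set differs.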
Towards Data Commons
Co-locate data, storage and computing infrastructure with commonly used tools for accessing, analyzing and sharing data, to create an open, interoperable resource for the research community.
NIH Commons PILOTS
Current Commons Pilots
 Explore feasibility of the Commons framework
 Provide data objects to populate the Commons
 Facilitate collaboration and interoperability
 Provide access to cloud (IaaS) and PaaS/SaaS via credits
 Connecting credits to NIH Grants
 Making large and/or high-value NIH-funded data sets and tools accessible in the cloud
 Developing Data & Software Indexing methods
 Leveraging BD2K efforts (bioCADDIE et al.)
 Collaborating with external groups
Other Commons Activities
 Testing cloud environments to enable access, sharing, use and reuse of large data sets and accompanying tools
 The Cancer Genome Atlas (TCGA) - NCI
 Human Microbiome Project (HMP) - NIAID
 Providing portals to view representation and analysis of large data sets (Genomic Data Commons – NCI)
Commons Framework Pilots
 Exploring feasibility of the Commons framework using
the BD2K Centers, MODs, and HMP groups
 Facilitating connectivity, interoperability and access to
digital objects
 Providing digital research objects to populate the
Commons
 Enabling biomedical science to happen more easily and robustly
 Connecting biology use cases with data science
Commons Framework Pilots
BD2K Centers, MODs and HMP
Mapping to the Commons framework: the pilots sit in the PaaS and SaaS layers of the Commons framework stack (Compute Platform; Services; Software; Data; Digital Object Compliance; App store/User Interface).
Does your work map to the Commons framework?
Good
Bad
Ugly
How does it enable science?
Using robust computational methods
Enable biomedical use cases
Commons Framework Pilots
BD2K Centers, MODs, HMP
Commons Framework Pilots (PI | parent grant’s IC | project description)

TOGA (NIBIB)
• Cloud-hosted data publication system
• Allows the automatic creation and publication of data in a personalized data repository

MUSEN (NIAID)
• Smart APIs – improved handling of metadata within APIs
• Ontological support for metadata within an API
• Improving smart API discoverability: a registry of APIs

HAN (NIGMS)
• Docker container hub for the BD2K community
• Docker containers for genomic analysis applications and pipelines
• Benchmarking, evaluation & best practices

COOPER/KOHANE (NHGRI)
• Cloud-based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PICI)
• Docker containers for CCD tools available in AWS

HAUSSLER (NHGRI)
• Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations
• (GA4GH) API: being able to query this data and metadata

OHNO-MACHADO (NHLBI)
• Development of an ecosystem for repeatable science
• Easy reuse of data AND software; tracking of provenance
• Use of container technologies for software and data reuse

STERNBERG (NHGRI)
• Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites
• An API to provide programmatic access to the relevant papers in PMC

WHITE (NHGRI)
• The entire HMP1 data set made accessible on AWS
• Analysis tools for microbiome data in AWS

WESTERFIELD (NHGRI)
• Development of a common data model for the MODs
• Development of APIs for accessing data across the MODs
 More specifically, from a Data Science perspective:
 Open standards for APIs and Docker containers
 Docker registry and best practices
 Improved metadata handling in APIs
 Data Object registry and indexing
 Reusing what is currently available
 bioCADDIE and schema.org
 Publication
 Preprint server with links to all digital objects
Commons Framework Pilots
BD2K Centers, MODs, HMP
 Example of a biomedical Use Case:
 Develop a common gene model for all the MODs
 Develop an open, well-structured, reusable and documented API that can be used across the MOD data
 Why?
• To be able to query a human gene against all MOD orthologs
• Improved understanding of health and disease states
• Improved understanding of genome structure & organization
Commons Framework Pilots
BD2K Centers, MODs, HMP
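As a rough illustration of the use case above: a common gene model would let a single lookup return orthologs for a human gene across the MODs. The table and gene symbols below are illustrative stand-ins for the shared data model and API, not curated assignments.

```python
# Illustrative stand-in for a common gene model spanning the MODs.
# In practice this table would be served by the shared MOD API.
MOD_ORTHOLOGS = {
    "TP53": {
        "mouse (MGI)": "Trp53",
        "zebrafish (ZFIN)": "tp53",
        "fly (FlyBase)": "p53",
    },
}

def orthologs_for(human_gene: str) -> dict:
    """Return MOD orthologs for a human gene symbol, or an empty
    dict when the gene is not in the (toy) common gene model."""
    return MOD_ORTHOLOGS.get(human_gene.upper(), {})
```

The point of the sketch is the shape of the query, not the data: one well-documented call against a common model, instead of a bespoke query per MOD.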
The purpose of the Commons Framework is to support
BOTH
Biological use cases + Data Science methods
To allow biological research to happen at scale
Commons Framework Pilots
BD2K Centers, MODs, HMP
Commons Credits Model
The Cloud Credits Model
From the slide diagram: NIH provides credits to an investigator, who uses them in the Commons with a Cloud Provider (A, B, …) or an HPC provider. Providers are tied into the Commons through Commons compliance/conformance and an index that enables search.
Drivers of the Cloud Credits Model
 Scalability
 Exploiting new computing models
 Potential cost effectiveness
 Simplified sharing of digital objects
Cloud computing supports many of these
objectives
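A minimal sketch of the credits flow described above, under assumed rules: NIH issues credits to an investigator, credits can be redeemed only with Commons-conformant providers, and “pay as you go” means spending stops when the balance runs out. The class and provider names are hypothetical.

```python
# Toy ledger for the Cloud Credits Model. Rules are assumptions
# drawn from the slides, not a published NIH accounting scheme.
class CreditAccount:
    def __init__(self, credits: int):
        self.credits = credits  # balance issued by NIH

    def redeem(self, provider: str, amount: int,
               conformant_providers: set) -> bool:
        """Spend credits at a provider. Fails if the provider is not
        Commons-conformant or the balance is insufficient
        ('pay as you go' - when credits stop, the work stops)."""
        if provider not in conformant_providers or amount > self.credits:
            return False
        self.credits -= amount
        return True
```

This also captures the “persistence” caveat raised later: once the balance is spent, nothing in the model keeps the hosted objects alive.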
Mapping pilots to the Commons framework: the Cloud Credits Model (CCM) spans the IaaS, PaaS and SaaS layers of the Commons framework stack.
 Supports simplified data sharing by driving science into publicly
accessible computing environments that still provide for
investigator level access control
 Scalable for the needs of the scientific community for the next 5
years
 Democratize access to data and computational tools
 Cost effective
 Competitive marketplace for biomedical computing services
 Reduces redundancy
 Uses resources efficiently
Advantages of this Model
 Novelty:
Never been tried, so we don’t have data about likelihood of success
 Cost Models:
Assumes stable or declining prices among providers
True for the last several years, but we can’t guarantee that it will continue,
particularly if there is significant consolidation in industry
 Service Providers:
Assumes that providers are willing to make the investment to become
conformant
Market research suggests 3-5 providers within 2-3 months of launch
 Persistence:
 The model is ‘Pay As You Go’ which means if you stop paying it stops
going
 Giving investigators an unprecedented level of control over what lives (or
dies) in the Commons
Potential Disadvantages of this Model
Cloud Commons
Reference Data Sets
Data Sets in a Cloud Commons
 Making High Value and/or High Volume NIH funded data
sets available in a cloud commons
 Co-location of large datasets and compute power enables access, use, reuse and sharing of data and tools
 Data must adhere to FAIR/Commons compliance principles
 Helps “seed” the Commons with FAIR/Commons-compliant data
 Provides indexable test data sets for bioCADDIE (and other indexing efforts)
Mapping pilots to the Commons framework: large, high-value NIH-defined data sets populate the “Reference” Data Sets part of the Data layer in the Commons framework stack.
Data Sets in the Cloud Commons
 Preliminary possible data sets:
 GTEx (Genotype-Tissue Expression)
 LINCS (Library of Integrated Network-Based Cellular Signatures)
 Model Organism Databases (MODs)
 UniProt
 Neuroimaging resource (NITRC)
 Radiology Image Share
 Epigenomics
 GenPort
 The Cancer Genome Atlas (TCGA) – this data set is currently housed at the GDC, but there are plans to move it to AWS and Google
 BTRIS data – NIH Clinical Center
 NIAID AIDS data
 dbGaP
 GEO
Mapping pilots to the Commons framework: community-defined data sets populate the user-defined data part of the Data layer in the Commons framework stack.
Data Sets in a Cloud Commons: Opportunities
 Ability to share data more easily
 Ability to access and compute on data more easily
 Reduced costs:
 Cost is paid by NIH, not the individual PI
 Avoids repeated uploads of the same data sets
 FAIR/Commons compliance of data sets
Data Sets in a Cloud Commons: Challenges
 Supporting sensitive (human) data in commercial clouds
 Updating, versioning, maintaining
 Consents for data
 Can be very strict and valid across only one data set
 Analysis across data sets may be constrained by consents
 Optimizing for cloud environments: performance
 Incentivizing data (and tool) generators to move and maintain their data in the cloud
 Data peering across clouds
 Commercial clouds are resistant: “cylinders of excellence”
 Peering and virtualization of services
Making things Findable
Indexing & Search methods
Commons Pilots: Search & Index
 Indexing and searching digital objects in a Commons
 Leveraging indexing methods within BD2K: bioCADDIE, other approaches within BD2K, schema.org
 Coexisting efforts
Mapping pilots to the Commons framework: BD2K indexing efforts (e.g. bioCADDIE, schema.org) provide the Indexing service in the Commons framework stack.
What is bioCADDIE?
biomedical and healthCAre
Data Discovery Index Ecosystem
 University of California San Diego
 PI Lucila Ohno-Machado
 Development of a prototype of Data Discovery Index (DDI)
 Aims – “PubMed” for Data
1. Help users find shared data
2. Build a prototype data discovery index
3. Evaluate requirements for next phase
Ecosystem components for finding data (from the slide diagram):
 Policies: criteria for inclusion, sustainability
 Standards: metadata, data
 Identifiers: reuse of existing ID-issuing services
 Metadata: minimal set; guidelines for mapping; accessibility information; provenance
 Search engine: connection to other engines, repositories, data sets
Commons Pilots
 Leveraging Schema.org
 Marking up a biomedical resource using schema.org
 Flexible and scalable
 Developing a bioschema.org approach
 Helps drive a community standard for reuse by other
groups
 Harnesses the power of search engines to find digital objects
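As a sketch of the markup approach above: schema.org defines a Dataset type, and a bioschema.org-style annotation could be emitted as JSON-LD for search engines to index. The helper function and example values below are illustrative assumptions; only the schema.org vocabulary itself is taken as given.

```python
import json

# Minimal schema.org Dataset markup as JSON-LD - the kind of
# annotation a bioschema.org profile would standardize for
# biomedical resources. Values here are illustrative.
def dataset_jsonld(name: str, description: str, url: str) -> str:
    """Serialize minimal schema.org Dataset markup as JSON-LD."""
    return json.dumps(
        {
            "@context": "https://schema.org",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "url": url,
        },
        indent=2,
    )
```

Embedding the resulting JSON-LD in a resource’s landing page is what lets general-purpose search engines find the digital object, which is the “harnessing search engines” point above.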
Commons : Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable
and flexible digital technologies
In collaboration with global communities
Thank you
 ADDS Office
Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso
 NCBI: George Komatsoulis
 NHGRI: Valentina di Francesco, Kevin Lee
 CIT: Debbie Sinmao, Andrea Norris, Stacy Charland
 Trans NIH BD2K Executive Committee & Working groups
 NCI: Warren Kibbe, Tony Kerlavage, Lou Staudt, Tanja Davidsen, Ian Fore
 NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan
 Many biomedical researchers, cloud providers, IT professionals
The end

More Related Content

PPTX
NIH Data Commons - Note: Presentation has animations
PPTX
NIH Data Summit - The NIH Data Commons
PPTX
Data Commons Garvan - 2016
PPTX
EMBL Australian Bioinformatics Resource AHM - Data Commons
PPTX
Bonazzi commons bd2 k ahm 2016 v2
PPTX
Bonazzi data commons nhgri council feb 2017
PPTX
Data commons bonazzi bd2 k fundamentals of science feb 2017
PPTX
NDS Relevant Update from the NIH Data Science (ADDS) Office
NIH Data Commons - Note: Presentation has animations
NIH Data Summit - The NIH Data Commons
Data Commons Garvan - 2016
EMBL Australian Bioinformatics Resource AHM - Data Commons
Bonazzi commons bd2 k ahm 2016 v2
Bonazzi data commons nhgri council feb 2017
Data commons bonazzi bd2 k fundamentals of science feb 2017
NDS Relevant Update from the NIH Data Science (ADDS) Office

What's hot (19)

PPTX
Komatsoulis internet2 executive track
PPTX
Komatsoulis internet2 global forum 2015
PDF
Mobile Data Analytics
PPT
A Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert
PDF
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
PDF
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
PDF
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
PPT
Smart Geo. Guido Satta (Maggio 2015)
PDF
Linked Building (Energy) Data
PDF
Key Technology Trends for Big Data in Europe
PDF
Big data Mining Using Very-Large-Scale Data Processing Platforms
PPT
Opportunities and Challenges for International Cooperation Around Big Data
PDF
Big Data Systems: Past, Present & (Possibly) Future with @techmilind
 
PPTX
Paving the way to open and interoperable research data service workflows Prog...
PDF
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
PPTX
Research data management & planning: an introduction
PDF
EDF2013: Invited Talk Julie Marguerite: Big data: a new world of opportunitie...
PDF
A Survey on Big Data Mining Challenges
Komatsoulis internet2 executive track
Komatsoulis internet2 global forum 2015
Mobile Data Analytics
A Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Smart Geo. Guido Satta (Maggio 2015)
Linked Building (Energy) Data
Key Technology Trends for Big Data in Europe
Big data Mining Using Very-Large-Scale Data Processing Platforms
Opportunities and Challenges for International Cooperation Around Big Data
Big Data Systems: Past, Present & (Possibly) Future with @techmilind
 
Paving the way to open and interoperable research data service workflows Prog...
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
Research data management & planning: an introduction
EDF2013: Invited Talk Julie Marguerite: Big data: a new world of opportunitie...
A Survey on Big Data Mining Challenges
Ad

Similar to BD2K and the Commons : ELIXR All Hands (20)

PPTX
The Commons: Leveraging the Power of the Cloud for Big Data
PPTX
The NIH Data Commons - BD2K All Hands Meeting 2015
PDF
What is Data Commons and How Can Your Organization Build One?
PPTX
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
PPTX
Data Harmonization for a Molecularly Driven Health System
PDF
Tag.bio: Self Service Data Mesh Platform
PDF
A Data Biosphere for Biomedical Research
PPTX
Data Harmonization for a Molecularly Driven Health System
PPTX
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
PPTX
Big Data as a Catalyst for Collaboration & Innovation
PPTX
Can’t Pay, Won’t Pay, Don’t Pay: Delivering open science, a Digital Research...
PDF
Toward a FAIR Biomedical Data Ecosystem
PDF
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
PPTX
The NIH Commons: A Cloud-based Training Environment
PPTX
FAIRy stories: the FAIR Data principles in theory and in practice
PPTX
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
PPTX
Sharing Big Data - Bob Jones
PPTX
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
PPTX
Advancing Biomedical Knowledge Reuse with FAIR
PPT
Advancing Science In A Collaborative Web 20 World
The Commons: Leveraging the Power of the Cloud for Big Data
The NIH Data Commons - BD2K All Hands Meeting 2015
What is Data Commons and How Can Your Organization Build One?
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
Data Harmonization for a Molecularly Driven Health System
Tag.bio: Self Service Data Mesh Platform
A Data Biosphere for Biomedical Research
Data Harmonization for a Molecularly Driven Health System
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Big Data as a Catalyst for Collaboration & Innovation
Can’t Pay, Won’t Pay, Don’t Pay: Delivering open science, a Digital Research...
Toward a FAIR Biomedical Data Ecosystem
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
The NIH Commons: A Cloud-based Training Environment
FAIRy stories: the FAIR Data principles in theory and in practice
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Sharing Big Data - Bob Jones
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
Advancing Biomedical Knowledge Reuse with FAIR
Advancing Science In A Collaborative Web 20 World
Ad

Recently uploaded (20)

PDF
The scientific heritage No 166 (166) (2025)
PDF
Placing the Near-Earth Object Impact Probability in Context
PDF
CHAPTER 3 Cell Structures and Their Functions Lecture Outline.pdf
PDF
Assessment of environmental effects of quarrying in Kitengela subcountyof Kaj...
PDF
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
PDF
Looking into the jet cone of the neutrino-associated very high-energy blazar ...
PPTX
POULTRY PRODUCTION AND MANAGEMENTNNN.pptx
PDF
Sciences of Europe No 170 (2025)
PPTX
neck nodes and dissection types and lymph nodes levels
PDF
An interstellar mission to test astrophysical black holes
PDF
. Radiology Case Scenariosssssssssssssss
PPTX
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
PPTX
C1 cut-Methane and it's Derivatives.pptx
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
PDF
lecture 2026 of Sjogren's syndrome l .pdf
PDF
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
PPT
6.1 High Risk New Born. Padetric health ppt
PPTX
Introduction to Cardiovascular system_structure and functions-1
PPTX
2Systematics of Living Organisms t-.pptx
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
The scientific heritage No 166 (166) (2025)
Placing the Near-Earth Object Impact Probability in Context
CHAPTER 3 Cell Structures and Their Functions Lecture Outline.pdf
Assessment of environmental effects of quarrying in Kitengela subcountyof Kaj...
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
Looking into the jet cone of the neutrino-associated very high-energy blazar ...
POULTRY PRODUCTION AND MANAGEMENTNNN.pptx
Sciences of Europe No 170 (2025)
neck nodes and dissection types and lymph nodes levels
An interstellar mission to test astrophysical black holes
. Radiology Case Scenariosssssssssssssss
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
C1 cut-Methane and it's Derivatives.pptx
7. General Toxicologyfor clinical phrmacy.pptx
lecture 2026 of Sjogren's syndrome l .pdf
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
6.1 High Risk New Born. Padetric health ppt
Introduction to Cardiovascular system_structure and functions-1
2Systematics of Living Organisms t-.pptx
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...

BD2K and the Commons : ELIXR All Hands

  • 1. BD2K & the Commons @ NIH Vivien Bonazzi, Ph.D. Senior Advisor for Data Science Technologies Office of Data Science (ADDS) National Institutes of Health
  • 11. US Government Memo - Increasing Access to Results of Federally Funded Scientific Research In Feb 2013 the US OSTP issued a memo calling for all US Federal Agencies to make digital assets from federally funded research available OSTP - Office of Science Technology Policy at the White House Public Access to Data Memohttp://www.whitehouse.gov/sites/default/files/microsites/ostp/ ostp_public_access_memo_2013.pdf
  • 12. US Government Memo - Increasing Access to Results of Federally Funded Scientific Research Each agency’s public access plan shall: Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds while: i) protecting confidentiality and personal privacy ii) recognizing proprietary interests, business confidential information, and intellectual property rights and avoiding significant negative impact on intellectual property rights, innovation, and U.S. competitiveness, and iii) preserving the balance between the relative value of long-term preservation and access and the associated cost and administrative burden.
  • 13. NIH Response  In response to the incredible growth of large biomedical (digital) datasets, the Director of NIH established a special Data and Informatics Working Group (DIWG) http://acd.od.nih.gov/diwg.htm
  • 14. NIH Response Establish new data science research and training programs Fulfilling the recommendation of the ACD WG report Big Data to Knowledge (BD2K) - 2013 http://datascience.nih.gov/bd2k Establish a new position: NIH Associate Director of Data Science (ADDS) Phil Bourne – 2014
  • 16. BD2K – Big Data to Knowledge  Expanding training programs in data science  Find and Sharing Data & Software though Indexes  Targeted Software tools and methods  Data wrangling  Privacy security of data  Data repurposing  Applications of metadata  Advance Big methods, tools and applications  BD2K Centers of Excellence) https://datascience.nih.gov/bd2k/funded-programs
  • 17. To enable biomedical research as a digital enterprise through which new discoveries are made and knowledge generated by maximizing community engagement and productivity.
  • 18. NIH ADDS Mission Statement To use data science to foster an Open Digital Ecosystem that will accelerate efficient, cost-effective biomedical research to enhance health, lengthen life, and reduce illness and disability
  • 19. Enabling digital Ecosystems via a Commons & BD2K Leveraging BD2K efforts Harnessing e-infrastructures - Public-private partnerships & Interagency collaborations Collaborating with external communities
  • 20. Commons : Achieving a Balance Biomedical Use Cases + Data Science + e-infrastructures Supporting open biomedical science using robust, scalable and flexible digital technologies In collaboration with global communities
  • 21. What are the PRINCIPLES of a Commons?  Supports a digital biomedical ecosystem  Treats products of research – data, software, methods, papers etc. as digital objects  Digital objects exist in a shared virtual space Find, Deposit, Manage, Share and Reuse data, software, metadata and workflows  Digital objects need to conform to FAIR principles:  Findable  Accessible (and usable)  Interoperable  Reusable
  • 22. Developing a Commons Framework  Exploits new scalable computing technologies - Cloud  Making digital objects : FAIR  Indexable/Findable, Accessible & Usable, Interoperable, Reproducible  Simplifies access, sharing and interoperability of digital objects such as data, software, metadata and workflows  Provides physical or logical access to digital objects  Provides understanding and accounting of usage patterns  Is potentially more cost effective given digital growth  Gives currency to digital objects and the people who develop and support them
  • 23. Commons Framework Compute Platform: Cloud or SC Facilities Services: APIs, Containers, Indexing, Software: Services & Tools scientific analysis tools/workflows Data “Reference” Data Sets User defined data DigitalObjectCompliance App store/User Interface https://datascience.nih.gov/commons
  • 24. Commons Framework Compute Platform: Cloud or SC Facilities Services: APIs, Containers, Indexing, Software: Services & Tools scientific analysis tools/workflows Data “Reference” Data Sets User defined data DigitalObjectCompliance App store/User Interface IaaS PaaS SaaS https://datascience.nih.gov/commons
  • 25. Commons: Digital Object Compliance  Attributes of digital research objects in the Commons Initial Phase  Unique digital object identifiers of resolvable to original authoritative source  Machine readable  A minimal set of searchable metadata  Physically available in a cloud based Commons provider  Clear access rules (especially important for human subjects data)  An entry (with metadata) in one or more indices Future Phases  Standard, community based unique digital object identifiers  Conform to community approved standard metadata and ontologies for enhanced searching  Digital objects accessible via open standard APIs  Are physically and logical available to the commons
  • 27. Towards Data Commons’ co-locate data, storage and computing infrastructure with commonly used tools for accessing, analyzing, sharing data to create an open interoperable resource for the research community.
  • 29. Current Commons Pilots  Explore feasibility of the Commons framework  Provide data objects to populate the Commons  Facilitate collaboration and interoperability  Provide access to cloud (IaaS) and PaaS/SaaS via credits  Connecting credits to NIH Grants  Making large and/or high value NIH funded data sets and tool accessible in the cloud  Developing Data & Software Indexing methods  Leveraging BD2K efforts bioCADDIE et al  Collaborating with external groups
  • 30. Other Commons Activities  Testing cloud environments to enable access, sharing. use and reuse of large data sets and accompanying tools  The Cancer Genome Atlas (TCGA) - NCI  Human Microbiome Project (HMP) - NIAID  Providing a portals to view representation and analysis of large data sets (Genomic Data Commons – NCI)
  • 32.  Exploring feasibility of the Commons framework using the BD2K Centers, MODs, and HMP groups  Facilitating connectivity, interoperability and access to digital objects  Providing digital research objects to populate the Commons  Enable biomedical science to happen more easily and robustly  Connecting biology use cases with data science Commons Framework Pilots BD2K Centers, MODs, HMP
  • 33. BD2K Centers, MODS and HMP Compute Platform: Cloud or HPC Services: APIs, Containers, Indexing, Software: Services & Tools scientific analysis tools/workflows Data “Reference” Data Sets User defined data DigitalObjectCompliance App store/User Interface Mapping to the Commons framework: Commons Framework Pilots PaaS SaaS
  • 34. Does your work map to the Commons framework? Good Bad Ugly How does it enable science? Using robust computational methods Enable biomedical use cases Commons Framework Pilots BD2K Centers, MODs, HMP
  • 35. Commons Framework Pilots PI Parent grant’s IC Project description TOGA NIBIB • Cloud-hosted data publication system • Allows the automatic creation and publication of data a personalized data repository MUSEN NIAID • Smart APIs – improved handling for metadata within APIs • Ontological support for metadata within an API • Improving smart API discoverability: a registry of APIs HAN NIGMS • Docker container hub for BD2K community • Docker containers for genomic analysis applications and pipelines • Benchmark, Evaluation & best practices COOPER/KOHANE NHGRI • Cloud based authenticated API access and exchange of causal modeling data , tools + genomic and phenomic data (PICI) • Docker containers for CCD tools available in AWS HAUSSLER NHGRI • Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations • (GA4GH) API : being able to query this data and metadata Ohno-Machado NHLBI • Development of an ecosystem for repeatable science • easy reuse of data AND software; tracking of provenance. • Use of container technologies for software and data reuse. Sternberg NHGRI • Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites. • An API to provide programmatic access to the relevant papers in PMC White NHGRI • The entire HMP1 data set made accessible on AWS • Analysis tools for microbiome data in AWS Westerfield NHGRI • Development of a common data model for the MODs • Development of APIs accessing data across the MODs
  • 36.  More specifically, from a data science perspective:  Open standards for APIs and Docker containers  Docker registry and best practices  Improved metadata handling in APIs  Data Object registry and indexing  Reusing what is currently available  bioCADDIE and schema.org  Publication  Preprint server with links to all digital objects Commons Framework Pilots BD2K Centers, MODs, HMP
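The data-object registry and indexing items above can be sketched in a few lines: a content-derived checksum gives each digital object a stable identifier that an index can key on. Everything here (function names, the registry shape, the bucket path) is illustrative, not an actual Commons service.

```python
# Sketch of content-based digital-object registration (names and paths are
# illustrative, not an actual Commons service).
import hashlib

def object_digest(payload: bytes) -> str:
    """Stable, content-derived identifier: sha256 over the object's bytes."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()

# A registry maps the digest to where the object lives, so an index can
# de-duplicate and resolve objects regardless of which cloud holds them.
registry = {}
digest = object_digest(b"example digital object")
registry[digest] = {"location": "s3://example-bucket/obj"}  # hypothetical path
print(digest)
```

Because the identifier is derived from the bytes themselves, the same object uploaded to two clouds resolves to one registry entry.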
  • 37.  Example of a biomedical use case:  Develop a common gene model for all the MODs  Develop an open, well-structured, reusable and documented API that can be used across the MOD data  Why? • To be able to query a human gene against all MOD orthologs • Improved understanding of health and disease states • Improved understanding of genome structure & organization Commons Framework Pilots BD2K Centers, MODs, HMP
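The common gene model idea above can be illustrated with a minimal sketch: one record shape shared across MODs, queried by human gene symbol. The data structures, symbols and identifiers below are toy values for illustration, not the actual MOD schemas or APIs.

```python
# Toy sketch of a common gene model across MODs (illustrative values only).
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    symbol: str   # gene symbol in the source MOD
    mod: str      # source database, e.g. "MGI", "ZFIN"
    gene_id: str  # MOD-local identifier (placeholder values below)

# Ortholog table keyed by human gene symbol.
ORTHOLOGS = {
    "TP53": [
        Gene("Trp53", "MGI", "MGI:0000000"),
        Gene("tp53", "ZFIN", "ZDB-GENE-0000000"),
    ],
}

def orthologs_of(human_symbol: str) -> list:
    """Query a human gene against all recorded MOD orthologs."""
    return ORTHOLOGS.get(human_symbol, [])

for g in orthologs_of("TP53"):
    print(f"{g.mod}: {g.symbol} ({g.gene_id})")
```

The point of the shared record shape is that one query surface works across every MOD, which is exactly what a cross-MOD API would expose.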
  • 38. The purpose of the Commons Framework is to support BOTH Biological use cases + Data Science methods To allow biological research to happen at scale Commons Framework Pilots BD2K Centers, MODs, HMP
  • 40. The Cloud Credits Model: NIH provides credits to the Investigator; the Investigator uses the credits in the Commons with Cloud Provider A, Cloud Provider B or an HPC Provider; an index enables search across the Commons; Commons compliance and Commons conformance apply to participating services
  • 41. Drivers of the Cloud Credits Model  Scalability  Exploiting new computing models  Potential cost effectiveness  Simplified sharing of digital objects Cloud computing supports many of these objectives
  • 42. Mapping the Cloud Credits Model (CCM) to the Commons framework: Compute Platform (Cloud or HPC); Services (APIs, containers, indexing); Software (services & tools, scientific analysis tools/workflows); Data (“Reference” data sets, user-defined data); Digital Object Compliance; App store / User Interface; IaaS, PaaS and SaaS layers
  • 43.  Supports simplified data sharing by driving science into publicly accessible computing environments that still provide investigator-level access control  Scalable to the needs of the scientific community for the next 5 years  Democratizes access to data and computational tools  Cost effective  Competitive marketplace for biomedical computing services  Reduces redundancy  Uses resources efficiently Advantages of this Model
  • 44.  Novelty: never been tried, so we don’t have data about the likelihood of success  Cost models: assumes stable or declining prices among providers; true for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in the industry  Service providers: assumes that providers are willing to make the investment to become conformant; market research suggests 3-5 providers within 2-3 months of launch  Persistence:  The model is ‘Pay As You Go’, which means if you stop paying, it stops going  Gives investigators an unprecedented level of control over what lives (or dies) in the Commons Potential Disadvantages of this Model
  • 46. Data Sets in a Cloud Commons  Making high-value and/or high-volume NIH-funded data sets available in a cloud commons  Co-location of large datasets and compute power enables access, use, reuse and sharing of data and tools  Data must adhere to FAIR/Commons compliance principles  Helps “seed” the Commons with FAIR/Commons-compliant data  Provides indexable test data sets for bioCADDIE (and other indexing efforts)
  • 47. Mapping pilots to the Commons framework – large, high-value data sets (NIH-defined data sets): Compute Platform (Cloud or HPC); Services (APIs, containers, indexing); Software (services & tools, scientific analysis tools/workflows); Data (“Reference” data sets, user-defined data); Digital Object Compliance; App store / User Interface
  • 48. Data Sets in the Cloud Commons  Preliminary possible data sets:  GTEx (Genotype-Tissue Expression)  LINCS (Library of Integrated Network-Based Cellular Signatures)  Model Organism Databases (MODs)  UniProt  Neuroimaging resource (NITRC)  Radiology Image Share  Epigenomics  GenPort  The Cancer Genome Atlas (TCGA) – this data set is currently housed at the GDC, but there ARE plans to move it to AWS and Google  BTRIS data – NIH Clinical Center  NIAID AIDS data  dbGaP  GEO
  • 49. Mapping pilots to the Commons framework – community-defined data sets: Compute Platform (Cloud or HPC); Services (APIs, containers, indexing); Software (services & tools, scientific analysis tools/workflows); Data (“Reference” data sets, user-defined data); Digital Object Compliance; App store / User Interface
  • 50. Data Sets in a Cloud Commons: Opportunities  Ability to share data more easily  Ability to access and compute on data more easily  Reduced costs:  Costs are paid by NIH, not the individual PI  Stops repeated uploads of the same data sets  FAIR/Commons compliance of data sets
  • 51. Data Sets in a Cloud Commons: Challenges  Supporting sensitive (human) data in commercial clouds  Updating, versioning, maintaining  Consents for data  Can be very strict and valid across only one data set  Analysis across data sets may be constrained by consents  Optimizing for cloud environments: performance  Incentivizing data (and tool) generators to move and maintain their data in the cloud  Data peering across clouds  Commercial clouds are resistant: “cylinders of excellence”  Peering and virtualization of services
  • 53. Commons Pilots: Search & Index  Indexing and searching digital objects in a Commons  Leveraging indexing methods within BD2K: bioCADDIE and others  Approaches beyond BD2K: schema.org  Coexisting efforts
  • 54. Mapping pilots to the Commons framework – indexing & searching (BD2K indexing, e.g. bioCADDIE, others, schema.org): Compute Platform (Cloud or HPC); Services (APIs, containers, indexing); Software (services & tools, scientific analysis tools/workflows); Data (“Reference” data sets, user-defined data); Digital Object Compliance; App store / User Interface
  • 55. What is bioCADDIE? biomedical and healthCAre Data Discovery Index Ecosystem  University of California San Diego  PI: Lucila Ohno-Machado  Development of a prototype Data Discovery Index (DDI)  Aims – a “PubMed” for data: 1. Help users find shared data 2. Build a prototype data discovery index 3. Evaluate requirements for the next phase
  • 56. Ecosystem components for finding data:  Policies: criteria for inclusion, sustainability  Standards: metadata, data  Identifiers: reuse of existing ID-issuing services  Metadata: minimal set; guidelines for mapping, accessibility information, provenance  Search engine: connection to other engines, repositories, data sets
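The “metadata: minimal set” component above can be sketched as a simple completeness check on a dataset record before it is accepted into a discovery index. The required field names here are assumptions for illustration, not the bioCADDIE specification.

```python
# Sketch of a minimal-metadata completeness check for a data discovery
# index; the required field names are assumptions, not a real spec.
REQUIRED_FIELDS = {"identifier", "title", "description", "repository", "access_url"}

def missing_fields(record: dict) -> set:
    """Return the required metadata fields absent from a dataset record."""
    return REQUIRED_FIELDS - record.keys()

record = {
    "identifier": "example-dataset-001",  # hypothetical identifier
    "title": "Example RNA-seq dataset",
    "description": "Illustrative record only.",
    "repository": "GEO",
}
print(missing_fields(record))  # the record still lacks an access_url
```

A check like this is the enforcement point for the “minimal set” guideline: records that cannot answer where the data lives or how to access it never reach the search engine.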
  • 57. Commons Pilots  Leveraging Schema.org  Marking up a biomedical resource using schema.org  Flexible and scalable  Developing a bioschema.org approach  Helps drive a community standard for reuse by other groups  Harnesses the power of search engines to find digital objects
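Marking up a biomedical resource with schema.org amounts to embedding structured JSON-LD in the resource’s web page, which is what lets general search engines treat it as a findable digital object. A minimal sketch, with illustrative values only (not a real dataset record), might look like:

```python
# Sketch of schema.org Dataset markup serialized as JSON-LD; all values
# are illustrative placeholders, not a real dataset record.
import json

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example microbiome 16S rRNA profiles",
    "description": "Illustrative dataset markup only.",
    "identifier": "https://example.org/datasets/123",  # hypothetical URL
    "keywords": ["microbiome", "16S rRNA"],
}

# This string would be embedded in the page inside a
# <script type="application/ld+json"> element.
jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```

A community profile in the bioschema.org spirit would then constrain which properties are required and how life-science-specific fields are expressed, so that the markup is consistent enough for cross-resource indexing.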
  • 58. Commons: Achieving a Balance Biomedical Use Cases + Data Science + e-infrastructures Supporting open biomedical science using robust, scalable and flexible digital technologies In collaboration with global communities
  • 59. Thank you  ADDS Office: Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso  NCBI: George Komatsoulis  NHGRI: Valentina di Francesco, Kevin Lee  CIT: Debbie Sinmao, Andrea Norris, Stacy Charland  Trans-NIH BD2K Executive Committee & Working Groups  NCI: Warren Kibbe, Tony Kerlavage, Lou Staudt, Tanja Davidsen, Ian Fore  NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan  Many biomedical researchers, cloud providers, IT professionals

Editor's Notes

  • #8: There is not enough funding for every researcher to house all the data they need. Analyzing the data is more expensive than producing it. It can take weeks to download large datasets.
  • #12: OSTP – Office of Science and Technology Policy https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
  • #13: OSTP – Office of Science and Technology Policy https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
  • #56: The ultimate objective is to provide the community with a fully functional DDI integrated into the digital commons.