Data Science &
BD2K Update
Philip Bourne, PhD, FACMI
Associate Director for Data Science
Advisory Committee to the NIH Director
June 10, 2016
http://datascience.nih.gov
Slides: http://www.slideshare.net/pebourne
Data Science Agenda
• What problems are we trying to solve?
• What are the solutions we are exploring?
• How does BD2K facilitate those solutions?
What Problems Are We
Trying to Solve?
• Data are extensive, complex and growing
• Data are in silos while science transcends
those silos
• Data are expensive to maintain and share
while demands for sharing are increasing
• There is an insufficient workforce with the
needed data analytical skills
• A collective (trans-NIH) solution is needed
Quantifying the Problem
• Big Data
– Total data from NIH-funded research currently
estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is expected
to grow by 10 PB this year
• Dark Data
– Only 12% of data described in published papers is in
recognized archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on
maintaining data archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
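The percentages on this slide follow directly from the quoted volumes. A minimal sketch of that arithmetic is below; the variable names are illustrative, and the figures are the slide's own estimates, not independent measurements.

```python
# Figures quoted on the slide; variable names are illustrative assumptions.
total_pb = 650        # estimated data from NIH-funded research, in PB
ncbi_pb = 20          # portion held at NCBI/NLM, in PB
ncbi_growth_pb = 10   # expected NCBI/NLM growth this year, in PB
archived_pct = 12     # percent of published data in recognized archives

# Share of all NIH-funded data that sits at NCBI/NLM (the slide rounds to 3%).
ncbi_share_pct = 100 * ncbi_pb / total_pb

# Whatever is not in a recognized archive is counted as dark data.
dark_pct = 100 - archived_pct

# Expected one-year growth relative to the current NCBI/NLM holdings.
ncbi_growth_rate_pct = 100 * ncbi_growth_pb / ncbi_pb

print(f"NCBI/NLM share of total: {ncbi_share_pct:.1f}%")
print(f"Dark data: {dark_pct}%")
print(f"NCBI/NLM growth this year: {ncbi_growth_rate_pct:.0f}%")
```

On these numbers, the NCBI/NLM holdings alone would grow by half in a single year.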
*Note: Award data confirmed as of 03/2016. Some repositories are funded by hybrid mechanisms (e.g., grants-contracts, IAA-contracts, etc.)
Biomedical Digital Data
Repository Survey by Institute
and Center (IC)
• A leadership meeting in late 2015 requested a
survey of IC approaches and plans for data
repositories
• Responses received from 18 ICs
• Clear challenges were identified
The Major Challenge
Encountered When Considering
Repository Funding
• Cost (6 ICs)
• Lack of Expertise (Within and Outside NIH) (3 ICs)
• Utility (3 ICs)
• Lack of Trans-NIH Guidance or Best Practices (2 ICs)
• Sustainability (6 ICs)
• Redundancy (4 ICs)
* Some ICs identified multiple challenges equally
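As a quick consistency check on the survey counts, the sketch below tallies the challenges reported on this slide. The dictionary layout and variable names are illustrative; the counts and the figure of 18 responding ICs come from the slides.

```python
# Number of ICs naming each challenge as their major one (from the slide).
challenges = {
    "Cost": 6,
    "Lack of Expertise (Within and Outside NIH)": 3,
    "Utility": 3,
    "Lack of Trans-NIH Guidance or Best Practices": 2,
    "Sustainability": 6,
    "Redundancy": 4,
}
responding_ics = 18  # responses received, per the survey slide

# Total mentions exceed the number of respondents because
# some ICs named more than one challenge as equally important.
total_mentions = sum(challenges.values())

# Rank challenges from most to least frequently named.
ranked = sorted(challenges.items(), key=lambda item: item[1], reverse=True)

print(f"{total_mentions} mentions from {responding_ics} responding ICs")
for name, count in ranked:
    print(f"{count:>2}  {name}")
```

The total number of mentions is larger than the number of responses, which is consistent with the footnote that some ICs identified several challenges equally.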
What Solutions Are We
Exploring?
The Commons is one solution that
leverages the experiences in
cloud-based computing and is
being enabled by BD2K research
Examples of Cloud Based
Initiatives
5 PB
40TB AWS
The Commons – The Internet of
Data
• Findable
• Accessible
• Interoperable
• Reusable
* http://www.ncbi.nlm.nih.gov/pubmed/26978244
The Commons offers a path forward to integrate
these discrete cloud-based initiatives using BD2K
developments to make data FAIR*
The internet started as discrete networks that
merged - the same could happen with data
Use Case:
Aggregate integrated data offers
the potential for new insights into
rare diseases …
As we get more precise, every disease becomes a rare disease
Diffuse Intrinsic Pontine
Gliomas (DIPG): In need of a
new data-driven approach
• Occur in about 1 in 100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy is ineffective
and radiotherapy offers only
transient benefit
From Adam Resnick
Timeline of Genomic Studies
in DIPG
• Landmark studies
identify histone mutations
as recurrent driver
mutations in DIPG ~2012
• Almost 3 years later, in
largely the same
datasets, but partially
expanded, the same two
groups and two others
identify ACVR1
mutations as a
secondary, co-occurring
mutation
From Adam Resnick
Hypothesis: The Commons
would have revealed ACVR1
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1 mutations
• If only large-scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year × 3 years = 180
children’s lives (who likely succumbed to
the disease during that time) could have
been impacted if only data were FAIR
From Adam Resnick
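The estimate on this slide is simple arithmetic over the quoted patient numbers. A minimal sketch follows, with illustrative variable names; the inputs are the slide's approximate figures, and the result is only as reliable as the hypothesis it rests on.

```python
# Approximate figures quoted on the slide; names are illustrative.
dipg_patients_per_year = 300   # ~300 DIPG patients a year
acvr1_patients_per_year = 60   # ~60 predicted to carry ACVR1 mutations
years_of_delay = 3             # ~2012 to the later ACVR1 reports

# Fraction of DIPG patients predicted to carry an ACVR1 mutation.
acvr1_fraction = acvr1_patients_per_year / dipg_patients_per_year

# Patients who might have been affected had the finding come earlier.
patients_potentially_impacted = acvr1_patients_per_year * years_of_delay

print(f"Predicted ACVR1 fraction: {acvr1_fraction:.0%}")
print(f"Patients potentially impacted: {patients_potentially_impacted}")
```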
A 3 Year BD2K Sponsored
Commons Pilot is Under Way
– Questions to be addressed:
• Does the ability to compute across very large
datasets lead to new discoveries?
• Are data and analytics more easily located and
shared and does this improve productivity?
• Is there an advantage to having the results of those
large calculations also available?
• Is research more reproducible?
• Is this environment more cost-effective than what
we do now?
Another use case…
Let’s review the Commons pilot
using the Model Organism
Databases (MODs) as an
example …
Example of the Problem:
The Model Organism Databases (MODs)
• Highly curated and
valuable data
• Siloed / Not
interoperable
• Cumbersome to
compute over all the
data
• Costly to maintain as
individual resources
NHGRI & NHLBI
Shared Research Objects
NCBI Intramural
Hybrid
Extramural
NCBI/NLM Existing Coordinating Centers CIT ICs
Step 1: Data & Analytics
Moved to the Commons
• Moved as Commons compliant shared
research objects, including:
– Identifiers
– Minimal metadata standards
CF
MOD Data
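To make "Commons compliant shared research objects" more concrete, here is a minimal sketch of an object that carries an identifier and a small set of required metadata. The slides do not specify the standard, so the class, the field names in `REQUIRED_METADATA`, and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical minimal metadata fields; the actual standard
# is not given in the slides.
REQUIRED_METADATA = ("title", "creator", "license")


@dataclass
class ResearchObject:
    """A shared data set or analytic tool, described by an ID and metadata."""
    identifier: str                              # persistent identifier (assumed format)
    metadata: dict = field(default_factory=dict)

    def missing_metadata(self):
        """Return the required metadata fields that are absent or empty."""
        return [key for key in REQUIRED_METADATA if not self.metadata.get(key)]

    def is_compliant(self):
        """Compliant means: has an identifier and all required metadata."""
        return bool(self.identifier) and not self.missing_metadata()


complete = ResearchObject(
    identifier="example:mod-0001",
    metadata={"title": "Gene expression set", "creator": "Example MOD", "license": "CC0"},
)
incomplete = ResearchObject(
    identifier="example:mod-0002",
    metadata={"title": "Phenotype annotations"},
)

print(complete.is_compliant())        # True
print(incomplete.missing_metadata())  # ['creator', 'license']
```

Keeping the compliance check this small reflects the slide's emphasis on minimal metadata: the bar for entry is low, and richer description can be layered on later.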
BD2K is Providing Those
Metadata Standards
NCBI/NLM Existing Coordinating Centers CIT ICs
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
App store/User Interface
Step 2: Layers of Software &
Services Added
Shared Research Objects
NCBI Intramural
Hybrid
Extramural
DataMed is a “Find” Service
Developed by BD2K
MOD Data indexed
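A "Find" service depends on some form of index over the metadata of shared objects. The sketch below is a toy inverted index that illustrates the general idea only; it is not DataMed's implementation, and the function names, record identifiers, and descriptions are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercase word in a description to the IDs that contain it."""
    index = defaultdict(set)
    for identifier, description in records.items():
        for token in description.lower().split():
            index[token].add(identifier)
    return index


def find(index, query):
    """Return the IDs whose descriptions contain every word in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical metadata descriptions for three model organism data sets.
records = {
    "example:mod-0001": "zebrafish gene expression",
    "example:mod-0002": "mouse gene phenotype",
    "example:mod-0003": "yeast protein interaction",
}
index = build_index(records)

print(sorted(find(index, "gene")))        # ['example:mod-0001', 'example:mod-0002']
print(sorted(find(index, "mouse gene")))  # ['example:mod-0002']
```

A single query reaches across data sets from different organisms, which is the point of indexing siloed resources in one place.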
NCBI/NLM Existing Coordinating Centers CIT ICs
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
App store/User Interface
Step 3: Commons Content Shared
While Maintaining Autonomous
Views
CF
IC IC IC CF
Shared Research Objects
NCBI Intramural
Hybrid
Extramural
MOD Data
View
BD2K Commons Pilot
Timeline
Project Year 1: FY 2015 (Oct 2015 – Sep 2016)
Project Year 2: FY 2016 (Oct 2016 – Sep 2017)
Project Year 3: FY 2017 (Oct 2017 – Sep 2018)
Step 0 Step 1 Step 2 Step 3
Step 0: Initiation
• Finalize conformance requirements
• Arrange initial providers
Step 1: ~5 Initial Projects including Common Fund (We Are Here)
Step 2: ~50 projects
Step 3: Evaluation & Next Steps
The Major Challenge
Encountered When Considering
Repository Funding
• Cost (6 ICs)
• Lack of Expertise (Within and Outside NIH) (3 ICs)
• Utility (3 ICs)
• Lack of Trans-NIH Guidance or Best Practices (2 ICs)
• Sustainability (6 ICs)
• Redundancy (4 ICs)
* Some ICs identified multiple challenges equally
16 T32/T15 Predoctoral Training Programs
21 Postdoctoral and Faculty Career Awards
Enhancing Diversity
• Focus on low-resourced institutions
– Supports curriculum and faculty development
– Supports research experiences for
undergraduates
• Builds partnerships with BD2K Centers
Improving Data Science
Skills Among all Biomedical
Scientists
24 awards
1 award
The Role of BD2K
1. Commons
– Resource
Indexing
– Standards
– Cloud & HPC
– Sustainability
2. Data Science
Research
– Centers
– Software
Analysis &
Methods
3. Training & Workforce Development
NIH…
Turning Discovery Into Health
philip.bourne@nih.gov
https://datascience.nih.gov/
Data Science Events at NIH
• Pi Day
• 2016 Lecture by Carlos Bustamante
• Poster Session with Pies
• PiCo Lightning Talks
• Pi Day Scholars: outreach to high schools
• Workshop: Reproducible Research
• Lecture Series: Distinguished and Frontiers in Data Science
• Data Science Courses
• Machine learning
• Hackathons



Editor's Notes

  • #4: Holdren memo; GDS policy.
  • #5: $1.25bn per year to capture all data. Even after a significant effort at reduction, intramural data are spread across more than 60 data centers; imagine the extramural situation.
  • #6: 30-40 of the 777 obligated awards were co-funded rather than funded by a single IC. The supplemental slides provide a further breakdown. Say "it cannot be business as usual; we must transition to a new model."
  • #8: Utility – see as a value proposition.
  • #10: Goal is to be able to easily draw upon data across these initiatives. Internet of data will see these individual initiatives merge.
  • #16: Other federal agencies are adopting this model, and the EU is investing 6 bn euros based on this model.
  • #19: We are currently in the pilot phase with small amounts of data/analytics being migrated.
  • #20: CEDAR is developing metadata templates. SCC coordinates standards to address such questions as “Is there a standard for data type x?” Working groups are spontaneous efforts between the centers to develop and share developments, including standards.
  • #23: The Commons enables data sharing across ICs but also enables them to have an autonomous view on the data and services they provide.
  • #25: Utility – see as a value proposition.
  • #26: The biomedical data science pipeline is centered around the T32/T15 Predoctoral Training Programs. We are funding 16, each with 6 trainees, across the country. Over a quarter of the training budget is going to this program. Postdocs and beyond specialize in biomedical data science through mentored protected time for career development.
  • #28: Educational resource discovery index uses data science to help s learn about data science. Automatically discovers training resources using information extraction (this is an international collaboration). Organizes training resources through manual curation and resource modeling. Personalizes recommendations through predictive modeling (future work).