Building (and traveling) the data-brick road:
A report from the front lines of data integration
Melissa Haendel, PhD
NIH Data Commons Kickoff
2017-12-06
Chasm of semantic despair
The data-brick road is a means…
…not a destination
Reuse: the last (and hardest) mile
Compliance
F A I R
VS
FAIR FAIR
Reuse requires (semantic) integration
Underappreciated obstacles to integration?
FAIR-TLC
“Traceability, Licensing, and Connectedness -- OH MY!”
Traceable Licensed Connected
bit.ly/fair-tlc
Traceable
• Evidence
• Provenance
• Attribution
Licensed
• Clearly stated
• Comprehensive and non-negotiated
• Accessible
• Avoid restrictions on kinds of (re)use
• Avoid restrictions on who may (re)use
Connected
• Data models
• Ontologies
• Concept alignment
• Common Data Elements
• Identifiers in service of above
Traceability and licensure go hand-in-hand: licensing is often a (convoluted) credit hack
Why do we persist?
A vision for the future of scholarship
Thomas Markello
Dong Chen
Justin Y. Kwan
Iren Horkayne-Szakaly
Alan Morrison
Olga Simakova
Irina Maric
Jay Lozier
Andrew R. Cullinane
Tatjana Kilo
Lynn Meister
Kourosh Pakzad
Sanjay Chainani
Roxanne Fischer
Camilo Toro
James G. White
David Adams
Cornelius Boerkoel
William A. Gahl
Cynthia J. Tifft
Meral Gunay-Aygun
Hans Goeble
Karen Balbach
Nadine Pfeifer
Sandra Werner
Christian Linden
Melissa Haendel
Peter Robinson
Chris Mungall
Sebastian Kohler
Cindy Smith
Nicole Vasilevsky
Sandra Dolken
Elizabeth Lee
Amanda Links
Will Bone
Murat Sincan
Damian Smedley
Jules Jacobson
Nicole Washington
Elise Flynn
Sebastian Kohler
Orion Buske
Marta Girdea
Michael Brudno
Jeremy Band
Melissa Haendel
David Adams
David Draper
Bailey Gallinger
Joie Davis
Nicole Vasilevsky
Heather Trang
Rena Godfrey
Gretchen Golas
Catherine Groden
Michele Nehrebecky
Ariane Soldatos
Elise Valkanas
Colleen Wahl
Lynne Wolfe
Johannes Grosse
Attila Braun
David Varga-Szabo
Niklas Beyersdorf
Boris Schneider
Lutz Zeitlmann
Petra Hanke
Patricia Schropp
Silke Mühlstedt
Carolin Zorn
Michael Huber
Carolin Schmittwolf
Wolfgang Jagla
Philipp Yu
Thomas Kerkau
Harald Schulze
Michael Nehls
Bernhard Nieswandt
Clinicians/care team, Pathologists, Ontologists, Informaticians, Curators, Basic Research
The translational workforce:
It takes a village to solve disease
GEO dataset
Gemma DRG
Genes differentially expressed:
~8,000 gene comparisons
Genes significantly expressed or unchanged:
~13,000 gene comparisons
Incongruous results
Evidence and provenance
Importance of raw data alignment
(never mind that the gene IDs had to be mapped from strings)
Increased: 1,640
Decreased: 1,110
Differential: 2,920
Increased: 4,264
Decreased: 3,833
Differential: 8,133
Both resources recorded 95% confidence intervals for significance
…how many of these data are truly reusable?
Openness is assumed, but …
Most licenses are vague, non-standard, or missing
(n = 51 DBs)
Reusabledata.org
Identifiers are the invisible bedrock of all scientific inquiry; the more complex the question, the greater the reliance on ID hygiene
[Figure: required harmonization (Identifiers; Identifiers & Metadata; Identifiers & Metadata & Models) plotted against question complexity (How many? What? Why? How?), with FAIR marked on the plot]
Identifier Reality: Not all IDs created equal
We need systems that accommodate the heterogeneity
[Figure: Identifier Maturity (Persistent, Ephemeral, Non-existent) versus Scholarly Output Maturity (Traditional, Non-Traditional); labeled regions: Literature, Genomic resources, Wild west of identifier tumbleweed]
(way) beyond “linkrot”: Pain points in identifier tech/standards
• Versioning and content evolution, AKA “content drift”
• Identifier surrogacy / granularity
• Ambiguous equivalence
• Distribution (& replication) of content over multiple providers
bit.ly/evidence-of-identifier-pain
Distribution across providers
Ambiguous equivalence case 2: Eye of the beholder
Tangible, actionable community best practice on identifiers for data integration
doi:10.1371/journal.pbio.2001414 (bit.ly/id21c-plosbio)
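One concrete piece of identifier hygiene is to store identifiers as prefix:local-ID pairs (CURIEs) and to expand them through a single shared prefix map. The sketch below is illustrative only and is not taken from the cited paper: the function names, the two-entry prefix map, and the local IDs used with it are assumptions.

```python
# Hypothetical prefix map for illustration; a real project would load a
# shared, versioned registry instead of hard-coding entries.
PREFIX_MAP = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}


def expand(curie: str) -> str:
    """Expand a CURIE such as 'HP:0000001' to a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in PREFIX_MAP:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIX_MAP[prefix] + local_id


def contract(iri: str) -> str:
    """Contract an IRI to a CURIE, preferring the longest matching namespace."""
    matches = [(ns, p) for p, ns in PREFIX_MAP.items() if iri.startswith(ns)]
    if not matches:
        raise KeyError(f"no known prefix for: {iri!r}")
    namespace, prefix = max(matches, key=lambda m: len(m[0]))
    return f"{prefix}:{iri[len(namespace):]}"
```

Rejecting unknown prefixes loudly, rather than passing strings through unchanged, is a deliberate choice here: silent pass-through is one way ambiguous identifiers leak into integrated data.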
What integrators are aiming to do is non-trivial
Ambiguous equivalence case 4: Post-hoc harmonization
Ambiguous equivalence case 3: Fuzzy Match on xrefs/content
How are these 11 records for “Ehlers Danlos Syndrome” related to each other?
Narrow synonym? Broad? Exact? Child? Parent?
Bayesian models like k-BOOM can help
Mungall, doi:10.1101/048843
bit.ly/xref-wildwest
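To make the problem concrete, here is a naive baseline, not k-BOOM: it treats every cross-reference as an exact equivalence and groups records into connected components with union-find. That is precisely the assumption the slide questions, so an oversized group is best read as a flag for review rather than as a merged concept. The function name and the record IDs are made up for illustration.

```python
from collections import defaultdict


def xref_cliques(xrefs):
    """Group record IDs into connected components of the xref graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in xrefs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Made-up record IDs from three hypothetical sources.
example_xrefs = [
    ("SRC_A:1", "SRC_B:7"),
    ("SRC_B:7", "SRC_C:3"),
    ("SRC_A:2", "SRC_C:9"),
]
cliques = xref_cliques(example_xrefs)
```

A probabilistic approach would instead weigh each xref as evidence for one of several relations (exact, broader, narrower) before deciding what to merge.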
Challenges in propagating knowledge:
Different sources associate phenotypic information with different aspects of the genotype
fgf8ati282a/ti282a;shhatbx392/+[TL]; cdkn1caMO3-cdkn1ca
Mysm1<tm1a>/Mysm1<tm1a>[C57BL/6]
daf-2(e1370)
ATP1A3(NM_152296.3) [c.946G>A, p.Gly316Ser]
tin(ABD/346)
Includes gene knockdowns
Includes genetic background
Includes multiple alleles
Single allele w/zygosity
allele
variant/mutation
gene
Transverse spirally arranged myofibrils are almost completely absent.
Dystonia 12
Decomposition of complex concepts allows interoperability
“Palmoplantar hyperkeratosis” (Human phenotype) =
increased (PATO)
Stratum corneum layer of skin, Autopod (Uberon)
keratinization (GO)
Species neutral ontologies, homologous concepts
“Ulcerated paws” (Mouse phenotype) =
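The decomposition idea can be sketched by representing each phenotype as a set of species-neutral components and measuring their overlap. The human components follow the slide; the mouse decomposition is a hypothetical assumption added only for illustration, and exact-label Jaccard overlap stands in for the ontology-aware similarity measures a real system would likely use.

```python
# Human components follow the slide; the mouse decomposition below is an
# assumption made only to illustrate the comparison.
HUMAN_PALMOPLANTAR_HYPERKERATOSIS = frozenset({
    "increased",                      # quality (PATO)
    "keratinization",                 # process (GO)
    "stratum corneum layer of skin",  # anatomy (Uberon)
    "autopod",                        # anatomy (Uberon)
})
MOUSE_ULCERATED_PAWS = frozenset({
    "ulcerated",  # assumed quality
    "skin",       # assumed anatomy
    "autopod",    # anatomy (Uberon); the paw is the mouse autopod
})


def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


shared = HUMAN_PALMOPLANTAR_HYPERKERATOSIS & MOUSE_ULCERATED_PAWS
score = jaccard(HUMAN_PALMOPLANTAR_HYPERKERATOSIS, MOUSE_ULCERATED_PAWS)
```

Neither free-text label ("Palmoplantar hyperkeratosis", "Ulcerated paws") shares a word with the other, yet the decomposed forms meet at the species-neutral anatomy term.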
Need for a comprehensive and connected picture across resources
Goldilocks approach to harmonizing data dissonance
Genes + Environment = Phenotypes
We need interoperability not only for the types of things in our data….
G-P or D (disease)
• causes
• contributes to
• is risk factor for
• protects against
• correlates with
• is marker for
• modulates
• involved in
• increases susceptibility to
G-G (kind of)
• regulates
• negatively regulates (inhibits)
• positively regulates (activates)
• directly regulates
• interacts with
• co-localizes with
• co-expressed with
P/D - P/D
• part of
• results in
• co-occurs with
• correlates with
• hallmark of (P->D)
E-P
• contributes to (E->P)
• influences (E->P)
• exacerbates (E->P)
• manifest in (P->E)
G-E (kind of)
• expressed in
• expressed during
• contains
• inactivated by
…the relationships and their evidence must also be captured
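A minimal sketch of what capturing a relationship together with its evidence could look like: a record type that refuses to exist without a recognized predicate, an evidence statement, and a source. The class name, field names, and example values are assumptions for illustration; only the predicate list is copied from the gene-to-phenotype/disease column above.

```python
from dataclasses import dataclass

# Predicates copied from the "G-P or D (disease)" list on the slide.
G_TO_P_PREDICATES = frozenset({
    "causes", "contributes to", "is risk factor for", "protects against",
    "correlates with", "is marker for", "modulates", "involved in",
    "increases susceptibility to",
})


@dataclass(frozen=True)
class Association:
    subject: str    # e.g. a gene identifier
    predicate: str  # must be one of G_TO_P_PREDICATES
    obj: str        # e.g. a phenotype or disease identifier
    evidence: str   # how the claim was established
    source: str     # where the claim came from (provenance)

    def __post_init__(self):
        if self.predicate not in G_TO_P_PREDICATES:
            raise ValueError(f"unknown predicate: {self.predicate!r}")
        if not self.evidence or not self.source:
            raise ValueError("evidence and source are required")


# Placeholder identifiers and values, not real data.
example = Association(
    subject="GENE:0001",
    predicate="is risk factor for",
    obj="DISEASE:0042",
    evidence="hypothetical case-control study",
    source="hypothetical-db release 1",
)
```

Keeping "is risk factor for" distinct from "causes" at the data level is the point: collapsing both into a generic "associated with" discards exactly the information the slide says must be kept.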
Data Integrators have deep knowledge to help data providers birth interoperable data
Thank you, Julie McMurry, for her illustrative gifts
Thank you for helping build the data-brick road
Melissa Haendel, PhD
@ontowonka
