Describing Scientific Datasets:
The HCLS Community Profile
Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics)
Stanford University
World Wide Web Consortium (W3C)
• The W3C is the main international standards
organization for the World Wide Web.
• Over 400 member organizations work together
in the W3C to develop standards for
the World Wide Web.
@micheldumontier::CEDAR:Jan 2015
The Semantic Web
is the new global web of knowledge
It involves standards for publishing, sharing and querying
facts, expert knowledge and services
It is a scalable approach to the
discovery of independently formulated
and distributed knowledge
Resource Description Framework
• It’s a language to represent knowledge
– Logic-based formalism -> automated reasoning
– graph-like properties -> data analysis
• Good for
– Describing things in terms of types, attributes and relations
– Integrating data from different sources
– Sharing the data (W3C standard)
– Reusing what is available, developing what you need,
and contributing back to the web of data.
drugbank:DB00586 rdf:type drugbank_vocabulary:Drug
drugbank:DB00586 rdfs:label "Diclofenac [drugbank:DB00586]"
drugbank:DB00586 drugbank_vocabulary:targets drugbank:290
drugbank:290 rdf:type drugbank_vocabulary:Target
drugbank:290 rdfs:label "Prostaglandin G/H synthase 2 [drugbank_target:290]"
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank:>
PREFIX drugbank_vocabulary: <http://bio2rdf.org/drugbank_vocabulary:>
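A minimal Python sketch of how the prefixed names on this slide expand to full IRIs. The prefix table follows the PREFIX declarations above, assuming the standard W3C and Bio2RDF namespace IRIs; the helper name `expand` is an assumption made for illustration, not part of any library.

```python
# Prefix table taken from the PREFIX declarations on the slide.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "drugbank": "http://bio2rdf.org/drugbank:",
    "drugbank_vocabulary": "http://bio2rdf.org/drugbank_vocabulary:",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'rdf:type' to a full IRI (hypothetical helper)."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local_name


# One statement from the slide, written as a (subject, predicate, object) tuple.
statement = ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug")
expanded_statement = tuple(expand(term) for term in statement)
```

Each statement is a subject, predicate, object triple; expanding the prefixes gives globally unique identifiers that other datasets can refer to.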
The linked data network expands
with every reference
DrugBank
drugbank:DB00586 rdf:type drugbank_vocabulary:Drug
drugbank:DB00586 rdfs:label "diclofenac [drugbank:DB00586]"
PharmGKB
pharmgkb:PA449293 rdf:type pharmgkb_vocabulary:Drug
pharmgkb:PA449293 rdfs:label "diclofenac [pharmgkb:PA449293]"
pharmgkb:PA449293 pharmgkb_vocabulary:x-drugbank drugbank:DB00586
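A small Python sketch of what the cross-reference buys us, assuming the link runs from the PharmGKB record to the DrugBank record via pharmgkb_vocabulary:x-drugbank. The triples are plain tuples and the helper `objects` is hypothetical; a real deployment would use a triple store and SPARQL.

```python
# Statements from the two datasets, as (subject, predicate, object) tuples.
drugbank_triples = [
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", "diclofenac [drugbank:DB00586]"),
]
pharmgkb_triples = [
    ("pharmgkb:PA449293", "rdf:type", "pharmgkb_vocabulary:Drug"),
    ("pharmgkb:PA449293", "rdfs:label", "diclofenac [pharmgkb:PA449293]"),
    ("pharmgkb:PA449293", "pharmgkb_vocabulary:x-drugbank", "drugbank:DB00586"),
]


def objects(triples, subject, predicate):
    """Return all objects of statements matching subject and predicate (hypothetical helper)."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


# Merging is just concatenation, because identifiers are shared across datasets.
merged = drugbank_triples + pharmgkb_triples

# Follow the cross-reference from PharmGKB into DrugBank and read the label there.
linked_records = objects(merged, "pharmgkb:PA449293", "pharmgkb_vocabulary:x-drugbank")
linked_labels = [
    label
    for record in linked_records
    for label in objects(merged, record, "rdfs:label")
]
```

Because both datasets use the same identifier for the DrugBank record, no mapping table is needed to traverse from one to the other.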
We are building a massive network of linked open data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Linked Data for the Life Sciences
• Free and open source
• Leverages Semantic Web standards
• 10B+ interlinked statements from 30+
conventional and high value datasets
• Partnerships with EBI, SIB, NCBI, DBCLS, NCBO,
OpenPHACTS, and many others
Chemicals, drugs and formulations
Genomes, genes, proteins and domains
Interactions, complexes and pathways
Animal models and phenotypes
Diseases, genetic markers and treatments
Terminologies and publications
Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Michel Dumontier:
Bio2RDF Release 2: Improved Coverage, Interoperability and
Provenance of Life Science Linked Data. ESWC 2013: 200-212
Semantic Web
for Health Care and Life Sciences Interest Group (HCLS)
• Mission: to develop, advocate for, and support the use of Semantic
Web technologies across health care, life sciences, clinical research and
translational medicine.
• Since 2001. 86 members from 29 organizations.
• Chairs: Michel Dumontier and Charlie Mead
• Objectives:
– Develop high-level and architectural vocabularies.
– Implement proof-of-concept demonstrations and industry-ready
code.
– Document guidelines to accelerate the adoption of the technology.
– Disseminate information about the group's work at government,
industry and academic events, and by participating in community
initiatives.
Challenge: Working with Web Data
• Datasets often have inadequate descriptions, so we don’t know
what they are about or how they were constructed.
• Datasets change over time, but often don’t come with
versioning information.
• Datasets may have been constructed using other data, but it’s not
clear which version of that data was used or whether it
was modified.
• Data may be available in a variety of formats.
• There may be multiple copies of data from different
providers, but it’s unclear whether they are exact copies or
derivatives.
Data registries aren’t in sync
– Identifiers.org, Bio2RDF.org, BioSharing.org, etc.
– May cover only some data
elements, i.e. be incomplete
– May be out of date, and there is no easy way to
exchange data descriptions
– May contain conflicting information, with no indication of the
sources used
No single vocabulary provides all key
metadata fields
Key Use Cases
1. Dataset Identification, Description, Licensing and
Provenance
2. Dataset Discovery (via Catalog)
3. Exchange of Dataset Descriptions
4. Dataset Linking
5. Content Summary
6. Monitoring of Dataset Changes
Objective
• Develop a guidance note for reusing existing
vocabularies to describe datasets with RDF
– Mandatory, recommended, optional descriptors
– Identifiers
– Versioning
– Attribution
– Provenance
– Content summarization
• Recommend vocabulary-linked attributes and
value sets
• Provide a reference editor and validator
Dublin Core Metadata Initiative
✓ Widely used
✓ Broadly applicable
– Documents
– Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time
associated with an event in the
lifecycle of the resource.”
DCAT: Data Catalog Vocabulary
✓ Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
VoID: Vocabulary of Interlinked
Datasets
✓ Metadata carried with data
– Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
We compiled a list of metadata fields
used across the community,
then surveyed over 20 vocabularies to see whether they
provided relevant metadata elements or value sets,
and produced a large spreadsheet that maps metadata needs
to existing vocabularies
Dataset
“A collection of data, available for access or
download in one or more formats”
– DCAT
Included Vocabularies
Three-Component Metadata Model:
description – version – distribution
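The three levels can be sketched as nested Python data classes that serialize to triples. This is only an illustration under stated assumptions: the class and field names, the example identifiers under ex:, and the choice of dct:isVersionOf, pav:version, dcat:distribution, dct:format and dcat:downloadURL as linking properties are assumptions for the sketch, not a copy of the specification tables.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Distribution:
    """A concrete file or service for one version (hypothetical class)."""
    iri: str
    media_format: str
    download_url: str


@dataclass
class Version:
    """One release of the dataset (hypothetical class)."""
    iri: str
    version: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Summary:
    """Version-independent description of the dataset (hypothetical class)."""
    iri: str
    title: str
    versions: List[Version] = field(default_factory=list)


def to_triples(summary: Summary) -> List[Tuple[str, str, str]]:
    """Flatten the three levels into (subject, predicate, object) tuples."""
    triples = [(summary.iri, "dct:title", summary.title)]
    for version in summary.versions:
        triples.append((version.iri, "dct:isVersionOf", summary.iri))
        triples.append((version.iri, "pav:version", version.version))
        for dist in version.distributions:
            triples.append((version.iri, "dcat:distribution", dist.iri))
            triples.append((dist.iri, "dct:format", dist.media_format))
            triples.append((dist.iri, "dcat:downloadURL", dist.download_url))
    return triples


# Example identifiers are made up for illustration.
example = Summary(
    iri="ex:drugbank",
    title="DrugBank",
    versions=[
        Version(
            iri="ex:drugbank-v4",
            version="4",
            distributions=[
                Distribution("ex:drugbank-v4-rdf", "text/turtle", "http://example.org/drugbank.ttl")
            ],
        )
    ],
)
example_triples = to_triples(example)
```

Keeping the levels separate means a new release adds a Version and its Distributions without touching the summary description.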
Example of Use
61 metadata elements
Metadata element, description, and
example of use
Metadata Specification
constrained property:value pairs
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
RFC 2119 [RFC2119].
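One way the RFC 2119 keywords could drive an automated check is sketched below in Python: a missing MUST property yields an error, a missing SHOULD property a warning, and a missing MAY property an informational message. The requirement levels assigned to each property here are illustrative assumptions, not the normative table, and `validate` is a hypothetical function rather than the group's validator.

```python
# Illustrative requirement levels only; consult the specification for the real ones.
REQUIREMENTS = {
    "dct:title": "MUST",
    "dct:description": "MUST",
    "dct:license": "SHOULD",
    "dcat:keyword": "MAY",
}

# Report severity for a missing property at each requirement level.
SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "information"}


def validate(description: dict) -> list:
    """Return (severity, property, message) tuples for missing properties."""
    reports = []
    for prop, level in REQUIREMENTS.items():
        if not description.get(prop):
            message = f"{level} property {prop} is missing"
            reports.append((SEVERITY[level], prop, message))
    return reports


# A made-up description with a title and a license but nothing else.
example_description = {
    "dct:title": "DrugBank",
    "dct:license": "http://example.org/license",
}
example_reports = validate(example_description)
```

Separating the requirement table from the checking logic lets the same function validate different description levels by swapping the table.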
Description
• Identifiers
• Title
• Description
• Homepage
• License
• Language
• Keywords
• Concepts and vocabularies used
• Standards
• Publication
Attribution
• Simple Model
– Individuals are related to roles using specific
properties
e.g. dct:creator, pav:createdBy, pav:curatedBy
• Expandable Model
– Individuals are related to roles and dates through an
associated object
– PROV, VIVO
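The difference between the two attribution models can be sketched in Python. The simple model needs one triple per person; the expandable model introduces an intermediate node that carries the agent, the role and a date. The PROV terms prov:qualifiedAttribution, prov:agent and prov:hadRole are real, but attaching the date with dct:date, the ex: identifiers and the role-to-property table are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Attribution:
    """Who did what, and when (hypothetical class)."""
    agent: str
    role: str
    date: str


# Assumed mapping from a role name to a direct property for the simple model.
ROLE_TO_PROPERTY = {"creator": "pav:createdBy", "curator": "pav:curatedBy"}


def simple_triples(dataset, attributions):
    """Simple model: the property itself encodes the role; dates are dropped."""
    return [(dataset, ROLE_TO_PROPERTY[a.role], a.agent) for a in attributions]


def expanded_triples(dataset, attributions):
    """Expandable model: an intermediate node carries agent, role and date."""
    triples = []
    for index, a in enumerate(attributions):
        node = f"_:attribution{index}"
        triples += [
            (dataset, "prov:qualifiedAttribution", node),
            (node, "prov:agent", a.agent),
            (node, "prov:hadRole", f"ex:{a.role}"),
            (node, "dct:date", a.date),  # assumption: date attached with dct:date
        ]
    return triples


# Made-up people and dates for illustration.
people = [
    Attribution("ex:alice", "creator", "2014-06-01"),
    Attribution("ex:bob", "curator", "2014-09-15"),
]
simple = simple_triples("ex:dataset", people)
expanded = expanded_triples("ex:dataset", people)
```

The simple model is easier to query, while the expandable model keeps information, such as dates, that a single property cannot express.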
Provenance and Change
• Version number
• Source
• Provenance: retrieved from, derived from,
created with
• Frequency of change
Availability
• Format
• Download URL
• Landing page
• SPARQL endpoint
RDF Dataset Statistics
Basic Statistics
• # of triples
• # of typed entities
• # of distinct subjects
• # of distinct predicates
• # of distinct objects
• # of classes
• # of literals
Enhanced Statistics
• Classes + #
• Properties + triples
• Subject Types + # Property + triples
• Object Types + # Property + triples
• Literals + # Property + triples
• Dataset-Dataset links
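The basic statistics can be computed directly from a list of triples, as in this Python sketch. It assumes triples are (subject, predicate, object) tuples of strings, that literals are written with a leading double quote, and that literals count towards distinct objects; the function name and the dictionary keys are made up for illustration. In practice these counts would usually come from SPARQL queries against the dataset.

```python
RDF_TYPE = "rdf:type"


def is_literal(term: str) -> bool:
    """Assumption for this sketch: literals are strings starting with a double quote."""
    return term.startswith('"')


def basic_statistics(triples):
    """Compute the basic statistics listed on the slide for a list of triples."""
    unique = set(triples)
    objects = {o for (_, _, o) in unique}
    return {
        "triples": len(unique),
        "typed_entities": len({s for (s, p, _) in unique if p == RDF_TYPE}),
        "distinct_subjects": len({s for (s, _, _) in unique}),
        "distinct_predicates": len({p for (_, p, _) in unique}),
        "distinct_objects": len(objects),
        "classes": len({o for (_, p, o) in unique if p == RDF_TYPE}),
        "literals": len({o for o in objects if is_literal(o)}),
    }


# The DrugBank example from the earlier slide.
example_triples = [
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", '"Diclofenac [drugbank:DB00586]"'),
    ("drugbank:DB00586", "drugbank_vocabulary:targets", "drugbank:290"),
    ("drugbank:290", "rdf:type", "drugbank_vocabulary:Target"),
    ("drugbank:290", "rdfs:label", '"Prostaglandin G/H synthase 2 [drugbank_target:290]"'),
]
example_stats = basic_statistics(example_triples)
```

The enhanced statistics follow the same pattern, grouping the counts by class or property instead of computing one total.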
Application scenarios
VoID Editor
Validator
New version using ShEx in development
Towards Semantic Interoperability
dumontierlab.com
michel.dumontier@stanford.edu
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier
HCLS:
http://www.w3.org/blog/hcls/
Mailing list:
http://lists.w3.org/Archives/Public/public-semweb-lifesci/
Editors’ Draft:
http://tiny.cc/hcls-datadesc-ed
W3C Interest Group Note:
http://tiny.cc/hcls-datadesc
Special thanks to Alasdair Gray, Scott Marshall, Joachim Baran
Thanks to all other contributors to the HCLS note

W3C HCLS Dataset Description Guidelines


Editor's Notes

  • #9: The Bio2RDF project transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery.
  • #17: We reuse several properties
  • #35: Dataset description creator. Generates an outline description through a web form and lets you see the generated content.
  • #36: Given a dataset description, does it conform to the OPS guidelines? Generates error (red) and warning (orange) reports: errors for MUST properties, warnings for SHOULD properties, and information for MAY properties.