HDL - Towards a Harmonized Dataset Model for Open Data Portals
Ahmad Assaf, Raphaël Troncy and Aline Senart
@ahmadaassaf
PROFILES 15 – 2nd International Workshop on Dataset PROFIling & fEderated Search for Linked Data, 1st June 2015
Open Data/Linked Open Data
• Open Data (OD) is data that can be easily discovered, accessed, reused and redistributed by anyone [Davies et al. 2014]
• Open Data should be placed in the public domain under liberal terms of use and made available in electronic formats that are non-proprietary and machine-readable.
• Linked Open Data (LOD) refers to semantically rich, linked and machine-readable open data.
• Open Data has major benefits for citizens, businesses, societies and governments.
Metadata
Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage information resources. It supports:
• Data discovery, exploration and reuse
• Organization & identification
• Archiving & preservation
Data Portals/Data Management Systems
• Data Portals (Catalogs) are the entry points for discovering published datasets
• Data Portals are curated collections of dataset metadata that provide a set of discovery and integration services.
• Data Portals can be public, like datahub.io and publicdata.eu, or private, like enigma.io and quandl.com
• Portals are built on top of Data Management Systems (DMS) like CKAN, DKAN and Socrata
Why a Harmonized Model?
• Exploring/discovering datasets for (re)use
• Defining a “minimal” set of information needed to build a “profile”
• Building tools that will automatically generate/validate metadata models
Dataset Models - DCAT
• The Data Catalog Vocabulary (DCAT)✝ is a W3C recommendation to facilitate interoperability between data catalogs on the web
• DCAT is an RDF vocabulary with three main classes: dcat:Catalog, dcat:Dataset and dcat:Distribution
• DCAT Profiles [extensions built upon DCAT]
  • DCAT-AP✝✝ defines a minimal set of properties that should be included in a dataset's profile by specifying mandatory and optional properties
  • The Asset Description Metadata Schema (ADMS)✝✝✝ is used to semantically describe assets (code lists, taxonomies, vocabularies)
✝ http://w3.org/TR/vocab-dcat/
✝✝ https://joinup.ec.europa.eu/asset/dcat_application_profile/description
✝✝✝ http://www.w3.org/TR/vocab-adms/
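
To make the three DCAT classes concrete, here is a minimal sketch that nests them as JSON-LD-style Python dictionaries: a catalog lists datasets, and each dataset lists its distributions. The dcat: and dct: terms are taken from the DCAT vocabulary; the example.org identifiers, the helper function names and the sample values are hypothetical and only serve the illustration.

```python
import json

# Namespace prefixes for the DCAT and Dublin Core terms used below.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def make_distribution(dist_id, download_url, media_type):
    """A dcat:Distribution: one concrete, downloadable form of a dataset."""
    return {
        "@id": dist_id,
        "@type": "dcat:Distribution",
        "dcat:downloadURL": {"@id": download_url},
        "dcat:mediaType": media_type,
    }


def make_dataset(dataset_id, title, distributions):
    """A dcat:Dataset: the abstract dataset, linked to its distributions."""
    return {
        "@id": dataset_id,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:distribution": distributions,
    }


def make_catalog(catalog_id, title, datasets):
    """A dcat:Catalog: the curated collection that lists datasets."""
    return {
        "@context": CONTEXT,
        "@id": catalog_id,
        "@type": "dcat:Catalog",
        "dct:title": title,
        "dcat:dataset": datasets,
    }


# Hypothetical example values (example.org is a placeholder domain).
csv_dump = make_distribution(
    "http://example.org/dataset/buses/csv",
    "http://example.org/files/buses.csv",
    "text/csv",
)
buses = make_dataset("http://example.org/dataset/buses", "Bus stops", [csv_dump])
catalog = make_catalog("http://example.org/catalog", "Example portal", [buses])

print(json.dumps(catalog, indent=2))
```

A profile such as DCAT-AP would then constrain which of these properties are mandatory and which are optional; the sketch does not attempt to model that.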
Dataset Models - VoID✝
• RDF vocabulary for interlinked datasets
• In addition to describing datasets, VoID describes the links between datasets
• VoID's main terms are the classes void:Dataset and void:Linkset and the property void:subset
• A linkset in VoID is a subclass of a dataset, used for storing triples that express the interlinking relationship between datasets
✝ http://www.w3.org/TR/void/
Dataset Models – CKAN✝/DKAN✝✝
• The data model describes a set of entities (dataset, resource, group, tag)
• Allows additional information to be added via arbitrary key/value “extra” fields
• The core metadata is restricted and expressed as a JSON file
• Supports Linked Data and RDF by providing a complete and functional mapping of its model to LD formats
• CKAN supports descriptions of vocabularies
• DKAN is a Drupal-based DMS
✝ http://ckan.org/
✝✝ http://demo.getdkan.com/
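
As an illustration of the split between core metadata and “extra” key/value fields, the sketch below uses a hypothetical dataset record shaped like CKAN's JSON dataset representation (core fields at the top level, resources, tags and groups as lists, extras as a list of key/value pairs). The record contents and the helper name split_core_and_extras are assumptions made for this example, not part of CKAN.

```python
# Hypothetical record; field values are invented for illustration.
dataset = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "maintainer_email": "maintainer@example.org",
    "resources": [
        {"name": "Full dump", "url": "http://example.org/data.csv", "format": "CSV"},
    ],
    "tags": [{"name": "transport"}],
    "groups": [{"name": "lodcloud"}],
    "extras": [
        {"key": "bbox-east-long", "value": "34.9"},
        {"key": "frequency-of-update", "value": "monthly"},
    ],
}


def split_core_and_extras(record):
    """Separate the fixed core fields from the free-form extras.

    Returns (core, extras) where extras is flattened to a plain dict.
    """
    extras = {item["key"]: item["value"] for item in record.get("extras", [])}
    core = {key: value for key, value in record.items() if key != "extras"}
    return core, extras


core, extras = split_core_and_extras(dataset)
print(sorted(core))
print(extras)
```

Flattening the extras into a dictionary makes it easier to compare them against a harmonized model, since the keys in “extras” are not constrained by the core schema.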
Dataset Models - Continued
Project Open Data (POD)✝✝✝
• Online collection of best practices and case studies to help data publishers
• POD data model is based on DCAT
• Similarly to DCAT-AP, POD defines three types of metadata elements: Required, Required-If and Expanded (optional)
• Metadata extensions using elements from the “Expanded” fields
Socrata✝
• Commercial platform to streamline data publishing, management, analysis and reuse.
• The model is designed specifically to represent tabular data
• The model covers a basic set of metadata properties and has good support for geospatial data
Schema.org✝✝
• A collection of schemas used to mark up HTML pages with structured data
• Covers many domains. We are interested in the Dataset schema, although we also use various properties from schemas like organizations, authors, etc.
✝ http://socrata.com/
✝✝ http://schema.org/
✝✝✝ https://project-open-data.cio.gov/
Ballmer effect, anyone?
https://xkcd.com/323/
Metadata Classification – Information Groups
• Organization: Clustering or curation solely based on associations with specific administration parties
• Resource: Actual raw data that can be downloaded or accessed directly, e.g. JSON, CSV, SPARQL endpoint
• Tag: Descriptive knowledge about the dataset contents and structure. This can range from simple textual tags to semantically rich controlled terms
• Group: Organizational units that share common semantics. They can be seen as a cluster or curation based on shared themes/categories
Metadata Classification – Information Types
Dataset metadata is classified into the following information types:
• General Information: title, description, id
• Ownership Information: author, maintainer_email
• Provenance Information: version, creation_date, update_date
• Access Information: URL, license_title, license_id
• Geospatial Information: bbox, layers
• Temporal Information: coverage_from, coverage_to
• Statistical Information: max_value, uniques, average
• Quality Information: rating, availability, freshness
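
The classification can be sketched as a lookup table from field names to information types. The field names below are the examples given on the slide; the function name group_by_information_type and the "unclassified" bucket for unknown fields are assumptions added for this illustration.

```python
# Example fields per information type, as listed on the slide.
INFORMATION_TYPES = {
    "general": ["title", "description", "id"],
    "ownership": ["author", "maintainer_email"],
    "provenance": ["version", "creation_date", "update_date"],
    "access": ["URL", "license_title", "license_id"],
    "geospatial": ["bbox", "layers"],
    "temporal": ["coverage_from", "coverage_to"],
    "statistical": ["max_value", "uniques", "average"],
    "quality": ["rating", "availability", "freshness"],
}

# Reverse index: field name -> information type.
FIELD_TO_TYPE = {
    field: info_type
    for info_type, fields in INFORMATION_TYPES.items()
    for field in fields
}


def group_by_information_type(metadata):
    """Group a flat metadata dict by information type.

    Fields that are not in the table land in an "unclassified" bucket
    (an assumption of this sketch, not part of the classification).
    """
    grouped = {}
    for field, value in metadata.items():
        info_type = FIELD_TO_TYPE.get(field, "unclassified")
        grouped.setdefault(info_type, {})[field] = value
    return grouped


print(group_by_information_type({"title": "Bus stops", "bbox": "1,2,3,4", "foo": 1}))
```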
Harmonization Process
• Examine the model or vocabulary specification and documentation
• Examine existing datasets using these models
• Examine the source code of the DMS
1. Map the information groups [resource, tag, group, organization]
2. Map the information types [general, ownership, provenance, etc.]
Mapping Information Types
Example: where the maintainer's contact information lives in each model
• CKAN: maintainer_email
• DKAN: maintainer_email
• POD: ContactPoint -> hasEmail
• Schema.org: CreativeWork:producer -> Person:email
• VoID: void:Dataset -> dct:creator -> foaf:Person:givenName
• DCAT: dcat:Dataset -> dct:creator -> foaf:Person:givenName
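
A mapping like this can be sketched as one path per model, plus a helper that walks a nested record along that path. The path segments are copied from the slide; treating each segment as a key in a nested dictionary is an assumption of this sketch (real serializations of these models differ), and the names MAINTAINER_PATHS, resolve and maintainer_contact are hypothetical.

```python
# One path per model, using the segments listed on the slide.
MAINTAINER_PATHS = {
    "CKAN": ("maintainer_email",),
    "DKAN": ("maintainer_email",),
    "POD": ("ContactPoint", "hasEmail"),
    "Schema.org": ("CreativeWork:producer", "Person:email"),
    "VoID": ("void:Dataset", "dct:creator", "foaf:Person:givenName"),
    "DCAT": ("dcat:Dataset", "dct:creator", "foaf:Person:givenName"),
}


def resolve(record, path):
    """Follow path through nested dicts; return None if a step is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def maintainer_contact(model, record):
    """Look up the harmonized maintainer field in a record of the given model."""
    return resolve(record, MAINTAINER_PATHS[model])


# Hypothetical records for two of the models.
ckan_record = {"maintainer_email": "maintainer@example.org"}
pod_record = {"ContactPoint": {"hasEmail": "mailto:maintainer@example.org"}}

print(maintainer_contact("CKAN", ckan_record))
print(maintainer_contact("POD", pod_record))
```

Keeping the paths in a table means a new model can be added without changing the lookup code.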
Extra Information
• Examining the models, we noticed an abundance of information filled in “extras” fields
• Using Roomba we generated aggregation reports to inspect those extras on LOD Cloud✝ and OpenAfrica✝✝
  1. extras>value:extras>name (extra field names and values)
  2. resources>resource_type:resources>name (types describing resources)
• 53% of the datasets in OpenAfrica have additional geospatial information attached (spatial-reference-system, spatial harvester, bbox-east-long, bbox-north-long, bbox-south-long, bbox-west-long)
• 16% of the datasets have additional provenance and ownership information (frequency-of-update, dataset-reference-date)
✝ http://datahub.io/group/lodcloud
✝✝ http://africaopendata.org/
https://github.com/ahmadassaf/opendata-checker/tree/master/model
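
The kind of figure reported above (the share of datasets carrying a given extra field) can be computed with a simple aggregation. The sketch below is not Roomba's implementation; it assumes dataset records whose extras are a list of key/value pairs, and the function name extras_coverage and the sample records are hypothetical.

```python
from collections import Counter


def extras_coverage(datasets):
    """Return {extra key: percentage of datasets that carry that key}."""
    total = len(datasets)
    if total == 0:
        return {}
    counts = Counter()
    for record in datasets:
        # Use a set so a key repeated within one dataset is counted once.
        keys = {item["key"] for item in record.get("extras", [])}
        counts.update(keys)
    return {key: 100.0 * n / total for key, n in counts.items()}


# Hypothetical portal content: four datasets, two with a bounding box.
sample = [
    {"name": "a", "extras": [{"key": "bbox-east-long", "value": "34.9"}]},
    {"name": "b", "extras": [{"key": "bbox-east-long", "value": "12.1"},
                             {"key": "frequency-of-update", "value": "monthly"}]},
    {"name": "c", "extras": []},
    {"name": "d"},
]

for key, share in sorted(extras_coverage(sample).items()):
    print(f"{key}: {share:.0f}% of datasets")
```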
https://xkcd.com/927/
Questions?
Ahmad Assaf
http://ahmadassaf.com/
@ahmadaassaf
http://github.com/ahmadassaf


Editor's Notes

  • #7: An asset is something that can be opened and read using familiar desktop software, as opposed to raw data that needs to be processed.
  • #8: The interlinking is modelled by a linkset (void:Linkset). A linkset in VoID is a subclass of a dataset, used for storing triples that express the interlinking relationship between datasets. In each interlinking triple, the subject is a resource hosted in one dataset and the object is a resource hosted in another dataset. This modelling enables a flexible and powerful way to talk in great detail about the interlinking between two datasets, such as how many links exist, which kinds of links (e.g. owl:sameAs or foaf:knows) are present, or who claims these statements.
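
The linkset idea in note #8 can be sketched in a few lines: select the triples whose subject belongs to one dataset and whose object belongs to another, then count them per link predicate. Deciding dataset membership by URI prefix is a simplifying assumption of this sketch, and the function name linkset_summary and the example URIs are hypothetical.

```python
from collections import Counter


def linkset_summary(triples, source_prefix, target_prefix):
    """Summarise the links from one dataset to another.

    A triple counts as a link when its subject URI starts with
    source_prefix and its object URI starts with target_prefix.
    """
    per_predicate = Counter()
    for subject, predicate, obj in triples:
        if subject.startswith(source_prefix) and obj.startswith(target_prefix):
            per_predicate[predicate] += 1
    return {
        "total_links": sum(per_predicate.values()),
        "links_by_predicate": dict(per_predicate),
    }


# Hypothetical triples; the last one stays inside dataset A, so it is not a link.
triples = [
    ("http://a.example.org/r1", "owl:sameAs", "http://b.example.org/x1"),
    ("http://a.example.org/r2", "owl:sameAs", "http://b.example.org/x2"),
    ("http://a.example.org/r1", "foaf:knows", "http://b.example.org/x3"),
    ("http://a.example.org/r1", "rdfs:label", "http://a.example.org/r9"),
]

print(linkset_summary(triples, "http://a.example.org/", "http://b.example.org/"))
```

The summary corresponds to the questions the note mentions: how many links exist and which kinds of links are present.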