HDL - Towards A Harmonized Dataset Model for Open Data Portals

Ahmad Assaf, Raphaël Troncy and Aline Senart
@ahmadaassaf
PROFILES'15: 2nd International Workshop on Dataset PROFIling & fEderated Search for Linked Data, 1st June 2015

Open Data/Linked Open Data
• Open Data (OD) is data that can be easily discovered, accessed, reused and redistributed by anyone [Davies et al. 2014]
• Open Data should be placed in the public domain under liberal terms of use and be available in electronic formats that are non-proprietary and machine-readable
• Linked Open Data (LOD) refers to semantically rich, linked and machine-readable open data
• Open Data has major benefits for citizens, businesses, societies and governments

Metadata
Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage information resources. It supports:
• Data discovery, exploration and reuse
• Organization & identification
• Archiving & preservation

Data Portals/Data Management Systems
• Data Portals (Catalogs) are the entry points to discover published datasets
• Data Portals are curated collections of dataset metadata that provide a set of discovery and integration services
• Data Portals can be public, like datahub.io and publicdata.eu, or private, like enigma.io and quandl.com
• Portals are built on top of Data Management Systems (DMS) like CKAN, DKAN and Socrata

Why a Harmonized Model?
• Exploring/discovering datasets for (re)use
• Defining a “minimal” set of information needed to build a “profile”
• Building tools that will automatically generate/validate metadata models

Dataset Models - DCAT
• The Data Catalog Vocabulary (DCAT)✝ is a W3C Recommendation that facilitates interoperability between data catalogs on the web
• DCAT is an RDF vocabulary with three main classes: dcat:Catalog, dcat:Dataset and dcat:Distribution
• DCAT profiles are extensions built upon DCAT:
  • DCAT-AP✝✝ defines a minimal set of properties that should be included in a dataset's profile by specifying mandatory and optional properties
  • The Asset Description Metadata Schema (ADMS)✝✝✝ is used to semantically describe assets (code lists, taxonomies, vocabularies)
✝ http://w3.org/TR/vocab-dcat/
✝✝ https://joinup.ec.europa.eu/asset/dcat_application_profile/description
✝✝✝ http://www.w3.org/TR/vocab-adms/
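As a rough illustration of how the three DCAT classes relate, the sketch below builds a catalog, a dataset and a distribution as JSON-LD-style Python dictionaries. The `dcat:` and `dct:` terms come from the vocabularies themselves; every `example.org` identifier, title and file name is a hypothetical placeholder.

```python
import json

# Namespace prefixes for DCAT and Dublin Core Terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# A dataset with a single downloadable distribution (all values are made up).
dataset = {
    "@context": CONTEXT,
    "@id": "http://example.org/dataset/air-quality",
    "@type": "dcat:Dataset",
    "dct:title": "Air quality measurements",
    "dct:description": "Hourly readings from hypothetical monitoring stations.",
    "dcat:distribution": [
        {
            "@id": "http://example.org/dataset/air-quality/csv",
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "http://example.org/files/air-quality.csv"},
            "dcat:mediaType": "text/csv",
        }
    ],
}

# The catalog only references the dataset by its identifier.
catalog = {
    "@context": CONTEXT,
    "@id": "http://example.org/catalog",
    "@type": "dcat:Catalog",
    "dct:title": "Example catalog",
    "dcat:dataset": [{"@id": dataset["@id"]}],
}

print(json.dumps(catalog, indent=2))
```

A profile such as DCAT-AP would then constrain which of these properties are mandatory and which are optional.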

Dataset Models - VoID✝
• RDF vocabulary for interlinked datasets
• In addition to describing datasets, VoID describes the links between datasets
• VoID defines two main classes, void:Dataset and void:Linkset, together with the void:subset property for relating a dataset to its parts
• A linkset in VoID is a subclass of a dataset, used for storing triples that express the interlinking relationship between datasets
✝ http://www.w3.org/TR/void/
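To make the linkset idea concrete, here is a minimal sketch that stores a VoID description as plain (subject, predicate, object) tuples and reads back which datasets each linkset connects. The `void:` terms are from the vocabulary; the `ex:` identifiers and the helper `linked_pairs` are hypothetical and only serve the illustration.

```python
# A tiny VoID description as (subject, predicate, object) tuples.
triples = [
    ("ex:DBpedia", "rdf:type", "void:Dataset"),
    ("ex:Geonames", "rdf:type", "void:Dataset"),
    ("ex:DBpedia2Geonames", "rdf:type", "void:Linkset"),
    ("ex:DBpedia2Geonames", "void:subjectsTarget", "ex:DBpedia"),
    ("ex:DBpedia2Geonames", "void:objectsTarget", "ex:Geonames"),
    ("ex:DBpedia2Geonames", "void:linkPredicate", "owl:sameAs"),
    ("ex:DBpedia", "void:subset", "ex:DBpedia2Geonames"),
]


def linked_pairs(triples):
    """Return (source, target, link predicate) for every void:Linkset found."""
    linksets = {s for s, p, o in triples if p == "rdf:type" and o == "void:Linkset"}
    pairs = []
    for linkset in sorted(linksets):
        # Collect the properties attached to this linkset.
        props = {p: o for s, p, o in triples if s == linkset}
        pairs.append((
            props.get("void:subjectsTarget"),
            props.get("void:objectsTarget"),
            props.get("void:linkPredicate"),
        ))
    return pairs


print(linked_pairs(triples))
```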

Dataset Models – CKAN✝/DKAN✝✝
• The data model describes a set of entities (dataset, resource, group, tag)
• Allows additional information to be added via “extra” arbitrary key/value fields
• The core metadata is a restricted set of fields, serialized as a JSON file
• Supports Linked Data and RDF by providing a complete and functional mapping of its model to LD formats
• CKAN supports descriptions of vocabularies
• DKAN is a Drupal-based DMS
✝ http://ckan.org/
✝✝ http://demo.getdkan.com/
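The split between core fields and free-form extras can be sketched as follows, assuming a CKAN-style dataset record in which `extras` is a list of `{"key": ..., "value": ...}` objects. The sample record and the helper `split_extras` are hypothetical.

```python
# A hypothetical CKAN-style dataset record (values are made up).
package = {
    "name": "air-quality",
    "title": "Air quality measurements",
    "maintainer_email": "data@example.org",
    "tags": [{"name": "environment"}],
    "resources": [{"url": "http://example.org/files/air-quality.csv", "format": "CSV"}],
    "extras": [
        {"key": "frequency-of-update", "value": "hourly"},
        {"key": "bbox-east-long", "value": "7.3"},
    ],
}


def split_extras(package):
    """Separate the core fields from the arbitrary key/value extras."""
    core = {k: v for k, v in package.items() if k != "extras"}
    extras = {item["key"]: item["value"] for item in package.get("extras", [])}
    return core, extras


core, extras = split_extras(package)
print(sorted(core))
print(extras)
```

Because the extras are unconstrained, two portals can encode the same information under different keys, which is part of what a harmonized model has to absorb.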

Dataset Models - Continued
Project Open Data (POD)✝✝✝
• Online collection of best practices and case studies to help data publishers
• The POD data model is based on DCAT
• Similarly to DCAT-AP, POD defines three types of metadata elements: Required, Required-If and Expanded (optional)
• Metadata can be extended using elements from the “Expanded” fields
Socrata✝
• Commercial platform to streamline data publishing, management, analysis and reuse
• The model is designed specifically to represent tabular data
• The model covers a basic set of metadata properties and has good support for geospatial data
Schema.org✝✝
• A collection of schemas used to mark up HTML pages with structured data
• Covers many domains; we are interested in the Dataset schema, although we also use various properties from other schemas such as organizations and authors
✝ http://socrata.com/
✝✝ http://schema.org/
✝✝✝ https://project-open-data.cio.gov/

Ballmer effect, anyone?
https://xkcd.com/323/

Metadata Classification – Information Groups
• Organization: clustering or curation based solely on associations with specific administrative parties
• Resource: actual raw data that can be downloaded or accessed directly, e.g. JSON, CSV, SPARQL endpoint
• Tag: descriptive knowledge about the dataset contents and structure; this can range from simple textual tags to semantically rich controlled terms
• Group: organizational units that share common semantics; they can be seen as a cluster or curation based on shared themes/categories

Metadata Classification – Information Types
Dataset Metadata
• General Information: title, description, id
• Ownership Information: author, maintainer_email
• Provenance Information: version, creation_date, update_date
• Access Information: URL, license_title, license_id
• Geospatial Information: bbox, layers
• Temporal Information: coverage_from, coverage_to
• Statistical Information: max_value, uniques, average
• Quality Information: rating, availability, freshness
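One way to picture this classification is a lookup table from field names to information types. The sketch below uses only the example fields listed on this slide; the `classify` helper and the sample record are hypothetical, and real portals would need a much larger table.

```python
# Example fields per information type, taken from the slide above.
INFORMATION_TYPES = {
    "general": {"title", "description", "id"},
    "ownership": {"author", "maintainer_email"},
    "provenance": {"version", "creation_date", "update_date"},
    "access": {"url", "license_title", "license_id"},
    "geospatial": {"bbox", "layers"},
    "temporal": {"coverage_from", "coverage_to"},
    "statistical": {"max_value", "uniques", "average"},
    "quality": {"rating", "availability", "freshness"},
}


def classify(record):
    """Group a flat metadata record by information type."""
    grouped = {}
    for field, value in record.items():
        info_type = next(
            (t for t, fields in INFORMATION_TYPES.items() if field.lower() in fields),
            "unclassified",  # anything the table does not know about
        )
        grouped.setdefault(info_type, {})[field] = value
    return grouped


sample = {"title": "Air quality", "bbox": [0, 0, 1, 1], "custom_field": 1}
print(classify(sample))
```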

Harmonization Process
• Examine the model or vocabulary specification and documentation
• Examine existing datasets using these models
• Examine the source code for DMS
1. Map the information groups [resource, tag, group, organization]
2. Map the information types [general, ownership, provenance, etc.]

Mapping Information Types
CKAN: maintainer_email
DKAN: maintainer_email
POD: ContactPoint -> hasEmail
Schema.org: CreativeWork:producer -> Person:email
VoID: void:Dataset -> dct:creator -> foaf:Person:givenName
DCAT: dcat:Dataset -> dct:creator -> foaf:Person:givenName
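A mapping like the one above can be expressed as a per-model path into the source record. The following is a minimal sketch for the JSON-based models only; the lower-case `contactPoint` key for POD, the `mailto:` value, and the helper names are assumptions made for illustration, not part of the slides.

```python
# Path to the maintainer email in each JSON-based model (POD key casing is assumed).
MAINTAINER_EMAIL_PATHS = {
    "ckan": ("maintainer_email",),
    "dkan": ("maintainer_email",),
    "pod": ("contactPoint", "hasEmail"),
}


def resolve(record, path):
    """Follow a sequence of keys through nested dicts; return None if any is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def maintainer_email(record, model):
    """Look up the maintainer email of a record described in the given model."""
    return resolve(record, MAINTAINER_EMAIL_PATHS[model])


ckan_record = {"maintainer_email": "data@example.org"}
pod_record = {"contactPoint": {"fn": "Data Team", "hasEmail": "mailto:data@example.org"}}

print(maintainer_email(ckan_record, "ckan"))
print(maintainer_email(pod_record, "pod"))
```

The RDF-based models (VoID, DCAT, Schema.org) would need a graph traversal rather than a dictionary path, which this sketch leaves out.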

Extra Information
• Examining the models, we noticed an abundance of information stored in “extras” fields
• Using Roomba, we generated aggregation reports to inspect those extras on the LOD Cloud✝ and OpenAfrica✝✝:
  1. extras>value:extras>name (extra field names and values)
  2. resources>resource_type:resources>name (types describing resources)
• 53% of the datasets in OpenAfrica have additional geospatial information attached (spatial-reference-system, spatial harvester, bbox-east-long, bbox-north-long, bbox-south-long, bbox-west-long)
• 16% of the datasets have additional provenance and ownership information (frequency-of-update, dataset-reference-date)
✝ http://datahub.io/group/lodcloud
✝✝ http://africaopendata.org/
https://github.com/ahmadassaf/opendata-checker/tree/master/model
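The kind of aggregation behind these percentages can be approximated in a few lines. This is not Roomba's implementation, only a sketch that assumes CKAN-style records with an `extras` list of key/value objects; the sample datasets and function names are hypothetical.

```python
from collections import Counter

# Geospatial extras keys mentioned on the slide.
GEO_KEYS = {
    "spatial-reference-system",
    "bbox-east-long",
    "bbox-north-long",
    "bbox-south-long",
    "bbox-west-long",
}


def extras_keys(dataset):
    """Return the set of extras keys used by one dataset."""
    return {item["key"] for item in dataset.get("extras", [])}


def key_frequencies(datasets):
    """Count in how many datasets each extras key appears."""
    counts = Counter()
    for dataset in datasets:
        counts.update(extras_keys(dataset))
    return counts


def share_with_any(datasets, keys):
    """Percentage of datasets that carry at least one of the given extras keys."""
    if not datasets:
        return 0.0
    hits = sum(1 for dataset in datasets if keys & extras_keys(dataset))
    return 100.0 * hits / len(datasets)


# Four made-up datasets, two of which carry geospatial extras.
datasets = [
    {"extras": [{"key": "bbox-east-long", "value": "7.3"}]},
    {"extras": [{"key": "frequency-of-update", "value": "monthly"}]},
    {"extras": [{"key": "spatial-reference-system", "value": "EPSG:4326"},
                {"key": "bbox-east-long", "value": "1.0"}]},
    {},
]

print(key_frequencies(datasets).most_common())
print(share_with_any(datasets, GEO_KEYS))
```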

https://xkcd.com/927/

Questions?
Ahmad Assaf
http://ahmadassaf.com/
@ahmadaassaf
http://github.com/ahmadassaf
