What's up LOD Cloud
Observing The State of Linked Open Data Cloud Metadata
Ahmad Assaf, Raphaël Troncy and Aline Senart
LDQ15 – 2nd Workshop on Linked Data Quality, 1st June 2015
@ahmadaassaf
What's up LOD Cloud - Observing The State of Linked Open Data Cloud Metadata
LOD Cloud
• The LOD Cloud is a mine of data
• The heterogeneity of sources reflects directly on the data quality
• Finding useful datasets without prior knowledge is increasingly difficult
Demonstrate the state of LOD Cloud metadata by running Roomba on the LOD Cloud accessible through datahub.io
Dataset Metadata
• Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage information resources
• Data portals are curated collections of dataset metadata providing a set of discovery and integration services
• We divided the metadata information into four main types:
  General information, e.g. title, description
  Ownership information, e.g. author, maintainer_email
  Provenance information, e.g. creation_date, version
  Access information, e.g. license_title, license_id
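The four metadata types can be illustrated with a small sketch. This is not Roomba's code: the `METADATA_TYPES` table and the `missing_by_type` helper are hypothetical names, and only the example fields named on the slide are listed (the actual profile presumably covers more).

```python
# Hypothetical grouping of dataset fields into the four metadata types
# named on the slide. Only the slide's example fields are included.
METADATA_TYPES = {
    "general": ["title", "description"],
    "ownership": ["author", "maintainer_email"],
    "provenance": ["creation_date", "version"],
    "access": ["license_title", "license_id"],
}


def missing_by_type(dataset):
    """Return, per metadata type, the fields that are absent or empty."""
    report = {}
    for info_type, fields in METADATA_TYPES.items():
        missing = [field for field in fields if not dataset.get(field)]
        if missing:
            report[info_type] = missing
    return report


# Made-up dataset record, for illustration only.
example = {
    "title": "Example dataset",
    "description": "",
    "author": "Jane Doe",
    "license_id": "cc-by",
}

print(missing_by_type(example))
```

Treating an empty string the same as an absent key mirrors the "missing or undefined" wording used later in the slides.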
Metadata Information Groups
• Organization: clustering or curation solely based on associations with specific administration parties
• Resource: actual raw data that can be downloaded or accessed directly, e.g. JSON, CSV, SPARQL endpoint
• Tag: descriptive knowledge about the dataset contents and structure. This can range from simple textual tags to semantically rich controlled terms
• Group: organizational units that share common semantics. They can be seen as a cluster or curation based on shared themes/categories
Roomba - An Extensible Framework to Validate and Build Dataset Profiles
Roomba addresses the challenges of automatic validation and generation of descriptive dataset profiles
https://github.com/ahmadassaf/opendata-checker/
Metadata Errors for Information Groups
• 41% of ownership information is missing or undefined
• Resources have the poorest metadata health across information groups:
  - 64% general metadata
  - 100% access metadata
  - 80% provenance metadata
Top Metadata Errors
• 19% of these errors can be fixed automatically
• 33.33% can be fixed automatically by tools plugged into the data publishing workflow
Top Metadata Errors
[Figure: % of resources (0 to 100) per information field]
Top Metadata Errors
• 25% of the datasets' access information is not clean (mainly due to unavailability)
• 68% of the resources' access data can be fixed automatically
• 31.27% of the resources were not reachable
• 63.17% of the resources don’t have a valid resource_type
• Creating an aggregate report for resources > format:title:
  - 62.16% of the datasets have defined SPARQL endpoints using the api/sparql resource format
  - 92.27% provided RDF example links
  - 56.3% provided direct links to downloadable RDF dumps
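An aggregate report over resource formats could be computed roughly as follows. This is a minimal sketch, not Roomba's implementation: `format_report` is a hypothetical helper, the dataset records are made up, and the assumption is that each dataset carries a `resources` list whose entries have a `format` field (`example/rdf+xml` is used here purely as an illustrative format value).

```python
from collections import Counter


def format_report(datasets):
    """Percentage of datasets having at least one resource of each format."""
    counts = Counter()
    for dataset in datasets:
        # Use a set so a format is counted once per dataset, not per resource.
        formats = {
            (resource.get("format") or "").strip().lower()
            for resource in dataset.get("resources", [])
        }
        formats.discard("")  # ignore resources without a usable format
        counts.update(formats)
    total = len(datasets)
    if total == 0:
        return {}
    return {fmt: round(100.0 * n / total, 2) for fmt, n in counts.items()}


# Made-up records, for illustration only.
datasets = [
    {"name": "a", "resources": [{"format": "api/sparql"}, {"format": "example/rdf+xml"}]},
    {"name": "b", "resources": [{"format": "example/rdf+xml"}]},
    {"name": "c", "resources": [{"format": "API/SPARQL"}, {"format": None}]},
    {"name": "d", "resources": []},
]

report = format_report(datasets)
print(report)
```

Counting per dataset rather than per resource matches the way the figures above are phrased ("% of the datasets have defined SPARQL endpoints").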
Metadata File Types Errors
Metadata Errors
• The noisiest part of the access metadata is the license information
• 16.6% of the datasets don’t have license_title and license_id
• 54.44% of the datasets don’t have license_url
• 51.35% don’t have a maintainer
• 55.21% of the datasets are missing maintainer_email, and 15.06% are missing author_email
• 80% of the provenance information is missing or undefined
• The only manual field in provenance information, “version”, is missing from 60.23% of the datasets
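Percentages like the ones above can be derived with a simple per-field count. The sketch below is illustrative only: `percent_missing` is a hypothetical helper, the sample records are made up, and "missing or undefined" is assumed to mean an absent key or an empty value.

```python
def percent_missing(datasets, field):
    """Percentage of datasets where `field` is absent or empty."""
    if not datasets:
        return 0.0
    empty_values = (None, "", [], {})
    missing = sum(1 for dataset in datasets if dataset.get(field) in empty_values)
    return round(100.0 * missing / len(datasets), 2)


# Made-up records, for illustration only.
sample = [
    {"license_id": "cc-by", "license_url": "http://example.org/license"},
    {"license_id": "cc-by", "license_url": ""},
    {"license_id": None},
    {"maintainer": "Jane Doe"},
]

print(percent_missing(sample, "license_url"))
print(percent_missing(sample, "license_id"))
```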
Enriched Profiles
• 1.87% of the resources have an incorrect mimetype defined
• 4.82% of the resources have incorrect size values
• 47.49% of the datasets' license information has been normalized via the manual license mapping file ✝
✝ https://github.com/ahmadassaf/opendata-checker/blob/master/util/licenseMappings.json
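License normalization through a manual mapping can be sketched as below. The structure of `LICENSE_MAPPINGS` is an assumption made for illustration and may not match the actual licenseMappings.json file; `normalize_license` is likewise a hypothetical helper, not Roomba's function.

```python
# Hypothetical mapping entries; the real licenseMappings.json may be
# structured differently and contains its own set of licenses.
LICENSE_MAPPINGS = [
    {
        "id": "cc-by",
        "title": "Creative Commons Attribution",
        "aliases": ["cc-by", "cc by", "creative commons attribution"],
    },
    {
        "id": "odc-pddl",
        "title": "Open Data Commons Public Domain Dedication and License",
        "aliases": ["odc-pddl", "pddl"],
    },
]


def normalize_license(dataset):
    """Return a copy of `dataset` with license_id/license_title normalized.

    The dataset is returned unchanged when no alias matches.
    """
    for value in (dataset.get("license_id"), dataset.get("license_title")):
        key = (value or "").strip().lower()
        if not key:
            continue
        for entry in LICENSE_MAPPINGS:
            if key in entry["aliases"]:
                return {
                    **dataset,
                    "license_id": entry["id"],
                    "license_title": entry["title"],
                }
    return dataset


print(normalize_license({"name": "a", "license_title": "CC BY"}))
print(normalize_license({"name": "b", "license_title": "Some custom terms"}))
```

Matching on a lowercased alias list keeps the mapping easy to maintain by hand, which fits the "manual license mapping file" described on the slide.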
Questions?
Ahmad Assaf
http://ahmadassaf.com
@ahmadaassaf
http://github.com/ahmadassaf

Editor's Notes

  • #3: The 259 datasets contain a total of 1068 resources.
  • #12: 211 (63.17%) resources do not have a valid resource_type, 112 (33.53%) are files, 8 (2.39%) are metadata and one (0.3%) is of the example and documentation types
  • #14: Roomba enhanced the access information by ~26%