Open Data Quality: dimensions, metrics, assessment and improvement
Amrapali Zaveri
Workshop on Data Quality Management in Wikidata, Jan 18, 2019
@amrapaliz
An increasing number of discoveries are made using other people’s data
Garbage in, garbage out
Biggest Challenge: Poor Data Quality
*http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Data Quality Assessment
Dimensions & Metrics
Systematic Literature Review
• 30 core articles: Conference - 21, Journal - 8, Masters Thesis - 1
• 18 Dimensions
• 69 Metrics
LDQ Dimensions & Metrics
Quality assessment for linked data: A survey. A Zaveri, A Rula, A Maurino, R Pietrobon, J Lehmann, S Auer. Semantic Web 7 (1), 63-93
300+ citations
LDQ Dimensions & Metrics
• Data Quality: commonly conceived as a multi-dimensional construct with a popular definition ‘fitness for use’*.
• Dimension: a characteristic of a dataset.
• Metric (or indicator): a procedure for measuring an information quality dimension.
*Juran et al., The Quality Control Handbook, 1974
LDQ Assessment Goal
Fix data quality issues in given sets of (semantic) data.
Such quality issues may
• be in source datasets (e.g., inaccurate or wrong data items, outdated data items)
• result from imperfections of a data integration process (e.g., data items that have been incorrectly linked with each other)
• reveal themselves only after the data integration (e.g., duplicates, inconsistencies)
Data cleaning may be relevant both for original datasets before combining/integrating and for datasets resulting from an integration.
Source: http://www.ida.liu.se/research/semanticweb/events/SemDataMgmtTutorial-Part7-Cleaning.pdf
18 LDQ Dimensions
LDQ Dimensions - Accessibility dimensions & metrics
• Availability - extent to which data (or some portion of it) is present, obtainable and ready for use
  • accessibility of the SPARQL endpoint and the server
  • dereferenceability of the URI
• Interlinking - degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources
  • detection of the existence and usage of external URIs
  • detection of all local in-links or back-links: all triples from a dataset that have the resource’s URI as the object
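The two interlinking metrics can be sketched in Python over an in-memory list of triples. This is only an illustration under stated assumptions, not the survey's implementation: the toy triples, the `LOCAL_NS` namespace and the function names are hypothetical, and external URIs are only looked for in the object position.

```python
from urllib.parse import urlparse

# Hypothetical local namespace and toy triples (subject, predicate, object).
LOCAL_NS = "http://example.org/"
TRIPLES = [
    ("http://example.org/Berlin",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Berlin"),
    ("http://example.org/Germany",
     "http://example.org/capital",
     "http://example.org/Berlin"),
    ("http://example.org/Berlin",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Berlin"),
]


def is_uri(term):
    """Treat a term as a URI if it has an http(s) scheme and a host."""
    parsed = urlparse(term)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def external_links(triples, local_ns):
    """Triples whose object is a URI outside the local namespace."""
    return [t for t in triples if is_uri(t[2]) and not t[2].startswith(local_ns)]


def in_links(triples, resource_uri):
    """Triples that have the resource's URI as the object (back-links)."""
    return [t for t in triples if t[2] == resource_uri]


print(len(external_links(TRIPLES, LOCAL_NS)))               # 1
print(len(in_links(TRIPLES, "http://example.org/Berlin")))  # 1
```

On a real dataset the triples would come from an RDF parser or a SPARQL query rather than a hard-coded list.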
LDQ Dimensions - Intrinsic dimensions & metrics
• Syntactic Validity - degree to which an RDF document conforms to the specification of the serialization format
  • detecting syntax errors using (i) validators, (ii) crowdsourcing
  • detecting syntactically inaccurate values by (i) use of explicit definition of the allowed values for a datatype, (ii) syntactic rules (type of characters allowed and/or the pattern of literal values)
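A syntactic rule on literal values can be sketched as one regular expression per datatype. The rules, the sample literals and the function name below are hypothetical, and the patterns are deliberately simplified (for example, the real xsd:date lexical space also allows time zones).

```python
import re

# Hypothetical, simplified rules: one pattern per datatype.
RULES = {
    "xsd:date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "xsd:integer": re.compile(r"[+-]?\d+"),
}


def invalid_literals(literals, rules):
    """Return the (value, datatype) pairs whose value breaks its datatype's rule."""
    bad = []
    for value, datatype in literals:
        rule = rules.get(datatype)
        # Datatypes without a rule are skipped rather than flagged.
        if rule is not None and rule.fullmatch(value) is None:
            bad.append((value, datatype))
    return bad


# Toy literals, two of which violate their rule.
LITERALS = [
    ("2019-01-18", "xsd:date"),
    ("18/01/2019", "xsd:date"),
    ("42", "xsd:integer"),
    ("4 2", "xsd:integer"),
]

bad = invalid_literals(LITERALS, RULES)
print(bad)                           # [('18/01/2019', 'xsd:date'), ('4 2', 'xsd:integer')]
print(1 - len(bad) / len(LITERALS))  # 0.5, the share of syntactically valid literals
```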
LDQ Dimensions - Intrinsic dimensions & metrics
• Completeness
  • Schema - ontology completeness
  • Property - missing values for a specific property
  • Population - percentage of all real-world objects of a particular type that are represented in the dataset
  • Interlinking - degree to which instances in the dataset are interlinked
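Property and population completeness both reduce to simple ratios. The sketch below assumes records are plain dictionaries and that a reference list of real-world objects is available; the function names and all values are made up for illustration.

```python
def property_completeness(records, prop):
    """Share of records that have a non-empty value for `prop`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(prop) not in (None, ""))
    return filled / len(records)


def population_completeness(represented, reference):
    """Share of reference (real-world) objects that appear in the dataset."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(set(represented) & reference) / len(reference)


# Toy records: two of four have a value for "population".
RECORDS = [
    {"name": "A", "population": 100},
    {"name": "B", "population": None},
    {"name": "C"},
    {"name": "D", "population": 250},
]

print(property_completeness(RECORDS, "population"))                      # 0.5
print(population_completeness(["A", "B", "C", "D"], ["A", "B", "C", "D", "E"]))  # 0.8
```

In practice the hard part of population completeness is obtaining a trustworthy reference set, not computing the ratio.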
Data Quality Assessment
Tools
RDFUnit: RDF Unit-Testing Suite
http://aksw.org/Projects/RDFUnit.html
Syntactic
Semantic
Consiste
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing linked data quality assessment. M Acosta, A Zaveri, E Simperl, D Kontokostas, S Auer, J Lehmann. ISWC 2013
Luzzu: QA for LOD
http://eis-bonn.github.io/Luzzu/index.html
1. Metric
2. Assess
3. Clean
4. Store
5. Rank
LDQ Assessment Tools — LODLaundromat
http://lodlaundromat.org/
LDQ Beyond Data — Mapping Quality
Dimou et al. Assessing and Refining Mappings to RDF to Improve Dataset Quality. ISWC 2015.
https://github.com/RMLio/RML-Validator
LDQ - ShEx Validation
https://www.w3.org/2013/ShEx/Examples/
Data Quality Assessment
Improvement
Data Quality Improvement
• Root cause analysis

• Iterative process

• Ensure high data and metadata quality
W3C Data Quality Vocabulary
https://www.w3.org/TR/vocab-dqv/
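As a rough illustration of what the vocabulary is for, a metric result can be published as a `dqv:QualityMeasurement`. The Python sketch below only builds a Turtle string; the `ex:` namespace, the resource names and the value are hypothetical, while `dqv:computedOn`, `dqv:isMeasurementOf` and `dqv:value` are terms defined by DQV.

```python
# Prefixes: dqv and xsd are the published namespaces, ex is a placeholder.
PREFIXES = (
    "@prefix dqv: <http://www.w3.org/ns/dqv#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
    "@prefix ex: <http://example.org/> .\n"
)


def measurement_turtle(name, dataset, metric, value):
    """Serialize one quality measurement as Turtle (names are local to ex:)."""
    return (
        f"ex:{name} a dqv:QualityMeasurement ;\n"
        f"    dqv:computedOn ex:{dataset} ;\n"
        f"    dqv:isMeasurementOf ex:{metric} ;\n"
        f'    dqv:value "{value}"^^xsd:double .\n'
    )


# Hypothetical measurement: a completeness score of 0.5 on a toy dataset.
ttl = PREFIXES + "\n" + measurement_turtle(
    "measurement1", "myDataset", "propertyCompleteness", 0.5
)
print(ttl)
```

A production pipeline would more likely build the graph with an RDF library than with string formatting; the string form is used here only to keep the example dependency-free.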
Poor quality (meta)data hampers research
Of 400 AI papers*:
• 6% shared code
• 30% shared test data
• 54% shared pseudocode
*http://science.sciencemag.org/content/359/6377/725
Lambin et al. Radiother Oncol. 2013. 109(1):159-64. doi: 10.1016/j.radonc.2013.07.007
If we are ever to realize the full potential of the content we create, then we must find ways to reduce the barriers to publishing, finding and reusing that content in a responsible manner.
FAIR Principles
http://www.nature.com/articles/sdata201618
Principles to enhance the value of all digital resources: data, images, software, web services, repositories, …
Developed and endorsed by researchers, publishers, funding agencies, industry partners.
Improving the FAIRness of digital resources will increase their quality and their potential and ease for reuse.
Thank you!

Questions?