Profiling Linked (Open) Data
Blerina Spahiu
Department of Computer Science, Systems and Communication,
University of Milan - Bicocca
blerina.spahiu@disco.unimib.it
Supervisors: Andrea Maurino, Matteo Palmonari
Tutor: Prof. Flavio De Paoli
Outline
 The research background
 The research plan
 Preliminary results
 Conclusions and Future Work
Data profiling definition
- Where shall I begin, please your Majesty?
- Begin at the beginning - the King said gravely.
Lewis Carroll in Alice’s Adventures in Wonderland
The process of evaluating data quality is called data profiling and typically
involves gathering several aggregated data statistics which constitute the data
profile.
Encyclopedia of Database Systems, June 2014
Linked Open Data Cloud
- 1014 datasets
- 188 million triples
- 7 topical categories
- 80% of linking properties are owl:sameAs
- 98.22% of datasets use the RDF vocabulary
- 7.85% of datasets provide licensing information
- etc.
Linked Open Data Cloud
What types of resources are described in a dataset?
How are they described?
How well connected are the datasets in the LOD cloud?
What are their topics?
Are data described as prescribed by the ontology?
Why do we need data profiling?
"…because prevention is better than cure"
 Data quality assessment
 Query optimization
 Ontology / Data integration
 Data analytics
 Complex schema discovery
 Topical discovery
 Data visualization
State of the art

| Tool | Goal | Input | Output | Automatization | Scalability | Availability | License | Tutorial |
|------|------|-------|--------|----------------|-------------|--------------|---------|----------|
| Roomba (Assaf et al., 2015) | Generate descriptive dataset profiles | Query portal APIs for available metadata | Quality assessment of metadata | — | — | Code on GitHub | Open source | — |
| LODStats (Auer et al., 2012) | Comprehensive statistics about RDF | RDF | 32 statistical criteria on schema and data level | — | — | Only the demo | — | Demo |
| ExpLOD (Khatchadourian S. and Consens M. P., 2010) | Supports exploring summaries of RDF usage and interlinking among datasets | RDF dataset, the BL (bisimulation label) schema and the neighborhoods to consider | Summaries can be viewed and explored in an interactive graphical way and exported in a variety of formats | — | — | — | — | — |
| RDFStats (Langegger A. and Wöß W., 2010) | Generation of different statistics | RDF dataset | Histograms for value distributions, classes/properties/datatypes | Semi-automatic | Yes | — | Apache | Yes |
| ProLOD++ (Böhm et al., 2010) | Computes different profiling, mining or cleansing tasks | RDF dataset | Statistics about properties, classes, etc.; information about uniqueness and keyness | Automatic | — | Demo | — | — |
Data Profiling Tools Survey
Profiling challenges
 The results of data profiling are computationally complex to discover
 Different and new data management architectures and frameworks have
emerged
 Linked Open Data are heterogeneous data
• Syntactic Heterogeneity
(Different formats, query languages)
• Schematic Heterogeneity
(Different encoding schemas)
• Semantic Heterogeneity
(Different vocabularies, semantic overlap of terms)
 Unified view of data profiling as a field
 Unifying framework for its tasks
Objectives
 Develop automatic approaches
 Generate new statistics and knowledge patterns to provide a dataset summary
and inspect its quality
• Apply data mining techniques to extract useful knowledge from large datasets
• Implement different approaches for outlier detection
 Algorithms to overcome the challenges of profiling Linked Open Data
• Parallel calculation of statistics and pattern extraction in LOD (see the sketch after this list)
• Data mining techniques to deal with high dimensionality
 Topical information extraction and classification
 Develop a methodology for performing profiling tasks
• A deep literature study to classify and formalize profiling tasks
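The speaker notes mention MapReduce as a way to parallelize the calculation of statistics. As a hedged illustration only, here is a minimal map/reduce-style sketch in Python that counts predicate usage across partitions of an RDF dump in parallel; the chunk contents and prefixes are hypothetical:

```python
from collections import Counter
from multiprocessing import Pool

def count_properties(triples_chunk):
    """Map step: count predicate usage within one partition of the data."""
    return Counter(p for _, p, _ in triples_chunk)

def merge_counts(partials):
    """Reduce step: merge the per-partition counters into a global profile."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    # Hypothetical partitions of an RDF dump, e.g. one chunk per file.
    chunks = [
        [("s1", "rdf:type", "o1"), ("s1", "foaf:name", "o2")],
        [("s2", "rdf:type", "o3")],
    ]
    with Pool() as pool:
        print(merge_counts(pool.map(count_properties, chunks)))
    # Counter({'rdf:type': 2, 'foaf:name': 1})
```

The same pattern extends to other per-triple statistics, since the map step is embarrassingly parallel and the reduce step is an associative merge.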
Methodology Used
[Diagram: the methodology relates the three forms of heterogeneity (syntactic, schematic, semantic) to the profiling tasks of topical discovery, data quality and data understanding.]
Work already done (1)
Profiling of Italian Public Administration websites
• Decrees 33 and 150
Profiling of Italian PAs
 Benchmark of PAs
• Geographical distribution (country wide)
• Type of PAs (region, municipality, county)
• Size (number of inhabitants)
 Compliance Index (a plausible weighted form is sketched below)
• Completeness
• Accuracy
• Timeliness
 Profiling websites in terms of compliance
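The exact compliance-index formula does not survive in this transcript; the speaker notes only say that the three criteria are normalized and that completeness receives a greater weight. A plausible weighted-average reconstruction, with hypothetical weights w_C, w_A, w_T, is:

```latex
% Hypothetical reconstruction: only the normalization of C, A, T and the
% larger weight on completeness (C) are stated in the speaker notes.
CI = \frac{w_C \, C + w_A \, A + w_T \, T}{w_C + w_A + w_T},
\qquad w_C > w_A, \; w_C > w_T, \qquad C, A, T \in [0, 1]
```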
“Data Profiling, the moment of truth”
 The average index of compliance for the selected
• Italian Regions is 0.488 (50% has an index lower than the mean).
• Italian Provinces is 0.561 (more than 50% has an index lower than the mean).
• Italian Municipalities is 0.462 (more than 50% has an index lower than the mean).
 Regions
Veneto has the highest score (0.839)
Campania has the lowest score (0.043)
 Provinces
Bergamo, in the Lombardia Region, has the highest score (0.759)
Massa Carrara, in the Toscana Region, has the lowest score (0.266)
 Municipalities
Voghera (Lombardia Region) has the
highest score (0.759)
Ozegna (Piemonte Region) has the
lowest score (0.164)
Works already done (2)
 Facilitating queries for similar dataset discovery
 Speeding up data searches
 Trends and best practices of a particular domain can be identified
To what extent can topical classification be automated?
Data Corpus and Feature Set

| Category | Datasets | % |
|----------|----------|---|
| Government | 183 | 18.05 |
| Publications | 96 | 9.47 |
| Life sciences | 83 | 8.19 |
| User generated content | 48 | 4.73 |
| Cross domain | 41 | 4.04 |
| Media | 22 | 2.17 |
| Geographic | 21 | 2.07 |
| Social Web | 520 | 51.28 |
 Data corpus (1014 datasets) extracted in April 2014 from Schmachtenberg et al.
 Vocabulary Usage (1439)
 Class URIs (914)
 Property URIs (2333)
 Local Class Names (1041)
 Local Property Names (2493)
 Text from rdfs:label (1440)
 Top Level Domain (55)
 In and Out Degree (2)
(The local-name heuristic and the label TF-IDF construction are sketched after this list.)
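The speaker notes describe the LCN/LPN features as derived by a simple regular expression that looks for #, : and / within a URI and keeps the local part, and the LAB features as TF-IDF over tokenized rdfs:label values. Two minimal sketches under stated assumptions (the URIs and label strings are made up, and the document-frequency cutoffs are relaxed for the toy data):

```python
import re

def local_name(uri: str) -> str:
    """Return the local name of a class/property URI by splitting
    at '#', '/' and ':' and keeping the last fragment."""
    return re.split(r"[#/:]", uri)[-1]

# Both hypothetical URIs map to the same LCN attribute "Country",
# which is why LCN yields more shared attributes than Curi.
assert local_name("http://example.org/vocab1#Country") == "Country"
assert local_name("http://example.org/vocab2/Country") == "Country"
```

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per dataset: its concatenated rdfs:label values (toy data).
label_docs = [
    "gene protein sequence annotation",
    "city population municipality mayor",
    "band album genre release date",
]

vectorizer = TfidfVectorizer(
    lowercase=True,                     # labels are lower-cased
    token_pattern=r"(?u)\b\w{3,25}\b",  # keep tokens of length 3..25
    min_df=1, max_df=1.0,               # real setup: tokens in >=10 and <=200 datasets
)
X_lab = vectorizer.fit_transform(label_docs)
print(X_lab.shape)  # (number of datasets, number of surviving tokens)
```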
Experimental Setup
 Classification approaches
 K-Nearest Neighbor
 J-48
 Naïve Bayes
 Two normalization strategies
 Binary (bin)
 Relative term occurrences (rto)
 Three sampling techniques
 No sampling
 Down sampling
 Up sampling
(A scikit-learn analogue of this setup is sketched below.)
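The experiments were run in Weka; the following is only a rough scikit-learn analogue of the setup, shown on hypothetical toy data. The CART tree standing in for J48 and the Bernoulli variant of Naïve Bayes are assumptions, and the sampling strategies are omitted for brevity:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the corpus: raw term counts per dataset, 8 categories.
rng = np.random.default_rng(0)
X_counts = rng.integers(0, 20, size=(400, 50)).astype(float)
y = np.arange(400) % 8  # balanced labels, so stratified 10-fold CV is valid

# The two normalization strategies from the slide.
X_bin = (X_counts > 0).astype(float)                    # binary (bin)
X_rto = X_counts / X_counts.sum(axis=1, keepdims=True)  # relative term occurrences (rto)

classifiers = {
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "decision tree (J48 stand-in)": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": BernoulliNB(),
}
for norm, X in (("bin", X_bin), ("rto", X_rto)):
    for name, clf in classifiers.items():
        acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
        print(f"{norm} / {name}: mean 10-fold accuracy = {acc:.3f}")
```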
Results on Combined Feature Sets
 Our model reaches an accuracy of 81.62%
Confusion Matrix
 Publications are confused with government and with life sciences
 User generated content is confused with social networking
Works already done (3)
 ABSTAT is a framework which can be used to summarize linked datasets and at the same time to provide statistics about them
• A summary consists of Abstract Knowledge Patterns (AKPs) of the form <subjectType, predicate, objectType> (a toy extraction sketch follows the table below)
• Can help users compare two datasets
• Helps detect accuracy errors in the data, e.g. the counterintuitive AKP <dbo:Band, dbo:genre, dbo:Band>
• The domain or the range is unspecified for 585 properties in the DBpedia ontology; AKPs such as those below for dbo:governmentType can suggest candidates
| SubjectType | Property | ObjectType |
|-------------|----------|------------|
| http://dbpedia.org/ontology/Town | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/Country |
| http://dbpedia.org/ontology/City | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/Legislature |
| http://dbpedia.org/ontology/Settlement | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/Settlement |
| http://dbpedia.org/ontology/Country | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/PoliticalParty |
| http://dbpedia.org/ontology/Village | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/MilitaryConflict |
| http://dbpedia.org/ontology/Organization | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/City |
| http://dbpedia.org/ontology/AdministrativeRegion | http://dbpedia.org/ontology/governmentType | — |
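A toy sketch of AKP extraction, assuming the minimal type of each resource has already been computed (in ABSTAT minimal types come from the ontology's subclass hierarchy; the triples and type assignments below are illustrative):

```python
from collections import Counter

# Illustrative minimal-type assignments and triples.
minimal_type = {
    "dbr:Nirvana": "dbo:Band",
    "dbr:Grunge":  "dbo:MusicGenre",
}
triples = [("dbr:Nirvana", "dbo:genre", "dbr:Grunge")]

def abstract_patterns(triples, minimal_type):
    """Count <subjectType, predicate, objectType> patterns over the triples."""
    akps = Counter()
    for s, p, o in triples:
        s_type = minimal_type.get(s, "owl:Thing")  # fallback for untyped resources
        o_type = minimal_type.get(o, "owl:Thing")
        akps[(s_type, p, o_type)] += 1
    return akps

print(abstract_patterns(triples, minimal_type))
# Counter({('dbo:Band', 'dbo:genre', 'dbo:MusicGenre'): 1})
```

A pattern whose subject or object type looks wrong, such as <dbo:Band, dbo:genre, dbo:Band>, is then a candidate accuracy error.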
Evaluation Plan
 Where a gold standard exists, validate in terms of precision, recall and F-measure (defined below)
 It is difficult to evaluate the validity of the proposed approach, so we plan to:
• Assess how these statistics and summaries improve the performance of actual profiling tasks
• Have humans evaluate the validity of the summarization in terms of relatedness and informativeness
• Provide users with a list of statistics and ask which, in their opinion, is most important for their use case
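For reference, the standard definitions of the metrics named above, in terms of true positives (TP), false positives (FP) and false negatives (FN):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
```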
Conclusions and Future Work
 The topical classification approach yields an accuracy of 82%; enriching it
with other features, such as linkage coverage, may improve it further
• Currently each dataset has only one topic; for some datasets multi-label
classification would be more appropriate
• A classifier chain could handle the multi-label formulation
• Because of the heavy imbalance of the data, a two-stage classifier
might help
 Enrich the ABSTAT framework with other statistics and apply it to other
kinds of data, such as Microdata
 Investigate how well ABSTAT summaries support dataset exploration and
understanding
Publications
• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An Empirical Evaluation of Italian Local Public Administration. ECIS eGOV Workshop at the Twenty-Second European Conference on Information Systems, Tel Aviv, 2014
• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An Empirical Evaluation of Italian Local Public Administration. Information Polity Journal, pp. 263-275, 2014
• M. Palmonari, A. Rula, R. Porrini, A. Maurino, B. Spahiu, V. Ferme – ABSTAT: Linked Data Summaries with Abstraction and STATistics. The Semantic Web: ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015
• R. Meusel, B. Spahiu, C. Bizer, H. Paulheim – Towards Automatic Classification of LOD Datasets. LDOW Workshop co-located with the 24th International World Wide Web Conference (WWW 2015), Firenze, May 19, 2015
• B. Spahiu – Profiling the Linked (Open) Data. Doctoral Consortium at ISWC 2015
• C. Xie, D. Ritze, B. Spahiu, H. Cai – Instance-based Property Matching in the Linked Open Data Environment. Ontology Matching Workshop co-located with the 14th International Semantic Web Conference, 2015, Bethlehem, Pennsylvania, USA
Thank you for your attention!
Editor's Notes
• #5: According to the definition we found in the Encyclopedia of Database Systems, the process of evaluating data quality is called data profiling and typically involves gathering several aggregated data statistics which constitute the data profile. So we can say that data profiling refers to the activity of creating small but informative summaries of a dataset, and it is a cardinal activity when facing an unfamiliar dataset.
• #6: In the current state of the LOD cloud we have 1014 datasets, 188 million triples and overall 7 categories; 80% of the linking properties are owl:sameAs, 98.22% of datasets use the RDF vocabulary, around 8% of the datasets provide licensing information, etc. Even with this huge amount of data, we still miss information which might be hidden or not easy to understand.
• #7: With this amount of data we still ask questions: What types of resources are described in a dataset? How are they described? How well connected are the datasets in the LOD cloud? What are their topics? Are data described as prescribed by the ontology?
• #8: Why do we need profiling? The need to profile a new or unfamiliar dataset arises in many situations, in general to prepare for some subsequent task. Data quality assessment: probably the most typical use case for profiling is preparing for a data cleansing process. Profiling reveals data errors, such as inconsistent formatting, missing values or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset. Query optimization: basic profiling is performed by most database systems to support query optimization with statistics about tables and columns. These profiling results can be used to estimate, for example, the cost of a query plan. Ontology and data integration: often the datasets to be integrated are somewhat unfamiliar, and the integration expert wants to explore them first. How large is the database? Are there dependencies between tables and among databases? Ontologies published on the Web, even for datasets in similar domains, can have differences. Data profiling techniques can help in understanding the overlap between ontologies and in the process of ontology creation, maintenance and integration. Data analytics: almost any statistical analysis or data mining run is preceded by a profiling step that helps the analyst get a first impression of the data at hand. Complex schema discovery: schema complexity leads to difficulties in understanding and accessing the data. Schema summaries provide users a concise overview of the entire schema despite its complexity. Topical classification: finding the features that best represent the topic(s) of a given dataset can help not only the topical classification of the dataset but also the understanding of the semantics of the information found in the data. Data visualization: profiling techniques can support data visualization tools in visualizing large multidimensional datasets by displaying only a small and concise summary of the most relevant and important features.
• #9: In the table we show an overview of the current state-of-the-art tools and techniques. Roomba is a framework which validates and generates descriptive dataset profiles. One of the portal API endpoints is queried, and information about a dataset's title, description, maintainer email, and update and creation dates is extracted. From the extracted metadata they are able to extract resources associated with that dataset. If the dataset contains many instances, only a sample of 10% is used for the validation of the extracted metadata. There is no information about automatization or scalability; there is no interface, only the code is available on GitHub, and no tutorials are provided. LODStats is used to gather comprehensive statistics about RDF data. It takes an RDF file as input and generates 32 different statistical criteria on schema and data level. The statistics usually cover class/property usage, out-links, in-links, etc. There exists only a demo of how the statistics look, and no information about availability, scalability or license. ExpLOD supports exploring summaries of RDF usage and interlinking among datasets. As input it takes the dataset, the BL schema (a bisimulation-label graph constructed from the RDF usage of each node, considering also its neighbors) and the neighborhoods to consider. The summaries produced can be viewed and explored in an interactive graphical environment and can also be exported in a variety of formats (including RDF). There is no information about automatization, scalability, license or tutorials. RDFStats is used to generate statistics about RDF datasets: the generator creates histograms for each combination of class c, property p (used in the extension of c), and type range t of property values. Because a property value can be a (plain or typed) literal, blank node, or URI, it is necessary to generate different histograms for each of the occurring types. It is a semi-automatic tool, because it has to be configured manually. There is no information about the scalability.
• #10: This table contains only aggregated information about profiling tools; in my research I performed a deeper survey, identifying for each tool, where possible, each statistical criterion it can calculate.
• #11: While data profiling research has proposed several solutions, as the literature shows, it still presents several challenges. First, the results of data profiling are computationally complex to discover: discovering key candidates or dependencies usually involves some sorting step for each considered column. Second, different and new data management architectures and frameworks have emerged, including distributed systems and multi-core or main-memory based servers. Third, traditional profiling tasks cannot be applied to Linked Data due to their heterogeneity. Heterogeneity can appear in different forms, such as different formats or query languages (syntactic heterogeneity). Linked Open Data can be represented in different formats and stored in different storage architectures, and the data encoding schemes may vary; this is referred to as schematic heterogeneity. Datasets published as LOD might use different vocabularies to describe synonymous terms (semantic heterogeneity). Although much effort has been made in the past, what is still missing in the state of the art is a unified view of data profiling as a field and a unified framework for its tasks.
• #13: Our intent is to develop automatic approaches and to generate new statistics that are not covered by current state-of-the-art techniques. While much effort has been made in the past, as described in the state of the art, the statistics that are generated are limited to basic ones such as the number of triples, the number of classes/properties used in a dataset, the datatypes or sameAs links used, etc. Datasets hold much more interesting information which might be hidden but could be useful for the consumer of the dataset. As data profiling refers to the activity of providing useful descriptive information, new techniques for extracting this hidden information should be developed, so applying data mining techniques to extract useful knowledge from large datasets should be investigated. Different data mining techniques, such as association rule mining, can be used to discover and extract patterns and dependencies in a dataset. These patterns might provide useful information, especially to detect errors and inconsistencies in spatial data (consistency quality dimension). Other algorithms, such as the Apriori algorithm, can be used to find sets of items within a dataset that satisfy a user-defined minimum support threshold, to support summarization. Implementations of different approaches for outlier detection (distance-, deviation- or depth-based, evolutionary techniques, etc.) could provide insight into abnormalities in the underlying data. As the number of published datasets is increasing, the need to adapt and optimize profiling techniques to support huge amounts of data is also high. A good approach when dealing with large datasets is to improve profiling performance by running the calculation of statistics and pattern extraction in parallel; MapReduce can help with this profiling challenge. We also plan to adapt data mining techniques to deal with high-dimensional data such as Linked Open Data. A deep literature study of the tools currently used to profile Linked Open Data has been undertaken: we analyzed existing tools in terms of the goal they are used for, techniques, input, output, approach, automatization, license, etc., with the aim of having a complete view of the existing approaches and techniques for profiling. This in-depth study will also help us with the third contribution, creating a general methodology for each of the profiling tasks. We also intend to provide the topical information of a dataset as part of our profile.
• #16: As a first step we measured the value of Linked Open Data by profiling the data published as Open Data by Italian Public Administrations, calculating a compliance index that considers three quality dimensions of the published data: completeness, accuracy and timeliness. Public administrations are compelled by law number 33 of the "Decreto sviluppo" to publish all their data following a strict schema, where the information should be placed under a section of their website called "Amministrazione Trasparente", reachable from the PA's homepage, with information about various topics of administration. Finally, legislative decree 150 requires public administrations to publish documents in an open format that corresponds to 1 star on the 5-star scale proposed by Berners-Lee (2006). The law requires each PA to have 69 sections on its website.
• #17: We have a benchmark of 50 public administration websites considering 3 factors: geographical distribution (country-wide), type of PA (regions, counties and municipalities) and size (number of inhabitants). We then calculated the compliance index considering three criteria: the completeness of the website, its accuracy and its timeliness. After normalizing all the values and giving a greater weight to completeness, we calculated the compliance index with the formula shown on the slide.
• #18: After analyzing all the results calculated with the above formula, we profiled Italian PAs in terms of completeness, accuracy and timeliness, taking into consideration the sampling criteria. All these statistics give an overview of the compliance index for PAs across Italy. As the results show, the average compliance index for regions is 0.488, where more than half have an index lower than the mean. The average index is somewhat greater for provinces, but still 50% of the provinces have a lower index than the mean, while the compliance index is even lower for municipalities. The highest compliance index is for the region of Veneto with 0.839, while Campania has the lowest score, 0.043. Looking at the compliance degree by type of PA, as shown in the table, regions in the centre of Italy seem to have a higher index than those in the south. For provinces, the highest compliance is in the north, while the centre and south are roughly the same. Considering the number of inhabitants, PAs with many inhabitants seem to be the ones that comply most with the law. For more details, see the publication of this work.
• #19: To what extent can the topical classification of new LOD datasets be automated for upcoming versions of the LOD cloud diagram, using machine learning techniques and the existing annotations as supervision? Besides creating upcoming versions of the LOD cloud diagram, the automatic topical classification of LOD datasets can be interesting for other purposes as well. Facilitating queries for similar dataset discovery: agents navigating the Web of Linked Data should know the topical domain of the datasets they discover by following links, in order to judge whether the datasets might be useful for their use case at hand. Speeding up data searches: knowing the domain of a dataset can not only facilitate queries but also speed up data searches, as agents then know what data they are looking for. Furthermore, it is interesting to analyze characteristics of datasets grouped by topical domain, so that trends and best practices that exist in a particular topical domain can be identified. These are the main motivations for building, to the best of our knowledge, the first automatic approach to classify LOD datasets into the topical categories used by the LOD cloud diagram.
• #20: For this purpose we used the latest version of the LOD cloud, crawled in April 2014, containing 1014 different LOD datasets describing around 8 million resources. The datasets were manually classified into one of the following categories: media, government, publications, life sciences, geographic, social networking, user generated content and cross domain. As shown in the table, the LOD cloud is dominated by datasets belonging to the category social networking (48%), followed by government (18%) and publications (13%). The categories media and geographic are each represented by fewer than 25 datasets within the whole corpus. For each of the datasets, we created the following eight feature sets based on the crawled data. Vocabulary Usage (VOC): as many vocabularies target a specific topical domain, e.g. bibliographic information, we assume that the vocabularies used by a dataset might be a helpful indicator for determining the topical category of the dataset. Thus we determine the vocabulary of all terms that are used as predicates or as the object of a type statement within each dataset. Altogether we identified 1439 different vocabularies being used by the datasets. Class URIs (Curi): as a more fine-grained feature, the rdfs: and owl: classes used to describe entities within a dataset might provide useful information to determine the topical category of the dataset. Thus we extracted all the classes that are used by at least two different datasets, resulting in 914 attributes for this feature set. Property URIs (Puri): besides the class information of an entity, information about which properties are used to describe the entity can be helpful. For example, it might make a difference if a person is described with foaf:knows statements or if her professional affiliation is provided. To leverage this information, we collected all properties used within the crawled data by at least two datasets. This feature set consists of 2333 attributes. Local Class Names (LCN): different vocabularies might contain synonymous (or at least closely related) terms that share the same local name and only differ in their namespace, e.g. foaf:Person and dbpedia:Person. Creating correspondences between similar classes from different vocabularies reduces the diversity of features but might increase the number of attributes used by more than one dataset. As we lack correspondences between all the vocabularies, we bypass this by using only the local names of the type URIs, meaning vocab1:Country and vocab2:Country are mapped to the same attribute. We used a simple regular expression to determine the local class name, checking for #, : and / within the type object. By focusing only on the local part of a class name, we increase the number of classes used by more than one dataset in comparison to Curi, and thus generated 1041 attributes for the LCN feature set. Local Property Names (LPN): using the same assumption as for the LCN feature set, we also extracted the local name of each property used by a dataset. This results in treating vocab1:name and vocab2:name as a single property. We used the same heuristic as for the LCN feature set and generated 2493 different local property names that are used by more than one dataset, an increase in the number of attributes in comparison to the Puri feature set. Text from rdfs:label (LAB): besides the vocabulary-level features, the names of the described entities might also indicate the topical domain of a dataset. We thus extracted all values of rdfs:label properties, lower-cased them and tokenized the values at space characters. We further excluded tokens shorter than 3 and longer than 25 characters. Afterwards we calculated the TF-IDF value for each token, excluding tokens that appeared in fewer than 10 or more than 200 datasets in order to reduce the influence of noise. This resulted in a feature set consisting of 1440 attributes. Top-Level Domain (TLD): another feature which might help to assign datasets to topical categories is the top-level domain of the dataset. For instance, government data is often hosted under the gov top-level domain, whereas library data is more likely found under edu or org top-level domains. We restrict ourselves to top-level domains rather than public suffixes. In and Out Degree: in addition to the vocabulary-based and textual features, the number of outgoing RDF links to other datasets and incoming RDF links from other datasets could also provide useful information for classifying datasets. These features give a hint about the density of the linkage of a dataset, as well as the way the dataset is interconnected within the whole LOD cloud ecosystem. We were able to create all features (except LAB) for 1001 datasets. As only 470 datasets provide rdfs:label, we only used these datasets for evaluating the utility of the LAB feature set.
• #21: We evaluated the following three classification techniques on the task of assigning topical categories to LOD datasets. K-Nearest Neighbor: classification models make use of the similarity between new cases and known cases to predict the class for the new case. A case is classified by the majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors as measured by the distance function. In our experiments we used k equal to 5. J48 Decision Tree: a decision tree is a flowchart-like tree structure which is built top-down from a root node and involves partitioning steps that divide the data into subsets containing instances with similar values. For our experiments we used the Weka implementation of the C4.5 decision tree. We learn a pruned tree, using a confidence threshold of 0.25 with a minimum of 2 instances per leaf. Naïve Bayes: as a last classification method we used Naïve Bayes, which uses joint probabilities of some evidence to estimate the probability of some event. Although this classifier is based on the assumption that all features are independent, which is violated in many use cases, NB has been shown to work well in practice. In order to evaluate the performance of the selected classifiers we use 10-fold cross-validation and report the average accuracy. As the total number of occurrences of vocabularies and terms is heavily influenced by the distribution of entities within the crawl for each dataset, we apply two different normalization strategies to the values of the vocabulary-level features VOC, Curi, Puri, LCN and LPN. On the one hand, we created a binary version (bin), where the feature vectors of each feature set consist of 0s and 1s indicating absence and presence of the vocabulary or term; the second version, relative term occurrence (rto), captures the fraction of the vocabulary or term usage for each dataset. As the number of datasets per category, as shown in the table, is not equally distributed within the LOD cloud, which might influence the performance of the classification model, we also explore the effect of balancing the training data using three different sampling approaches: (1) no sampling, applying our model to the data without any sampling technique; (2) down-sampling the number of datasets used for training until each category is represented by the same number of datasets, equal to the number of datasets within the smallest category; and (3) up-sampling the datasets for each category until each category is represented by at least the number of datasets of the largest category. While down-sampling reduces the chance of overfitting a model towards the larger classes, it might also remove valuable information from the training set, as examples are removed and not taken into account for learning the model. Up-sampling ensures that all possible examples are taken into account and no information is lost for training, but creating the same entity many times can result in emphasizing those particular data points. For example, a neighborhood-based classifier might look at the 5 nearest neighbors, which could then be one and the same data point, effectively looking only at the nearest neighbor.
• #22: For our second set of experiments, we combined the available attributes from the different feature sets and again trained our classification models using the three described algorithms. As before, we generated a binary and a relative term occurrence version of the vocabulary-based feature sets. In addition we created a second set (binary and relative term occurrence) where we omit the attributes from the LAB feature set, as we wanted to measure the influence of this particular set of attributes, which is available for less than half of the datasets. Furthermore we created a combined set of the attributes from the previous section. We can observe that when selecting a larger set of attributes, our model is able to reach a slightly higher accuracy of 81.62% than when using just the attributes of a single feature set (80.59%, LCNbin).
• #23: In the following we look at the confusion matrix of the best-performing approach: Naïve Bayes trained on the attributes of the NoLABbin feature set using up-sampling. On the left side of the matrix we list the predictions of the learned model, while the header names the actual category of the dataset. As observed in the table, there are three kinds of errors which occur more than ten times. The most common confusion occurs for the publications domain, where a larger number of datasets are predicted to belong to the government domain. A reason for this is that government datasets often contain metadata about government statistics which are represented using the same vocabularies and terms (e.g. skos:Concept) that are also used in the publications domain. This makes it challenging for a vocabulary-based classifier to tell those two categories apart. Another example is the classification of the dataset of the Ministry of Culture in Spain, which was manually assigned the publications label whereas our model predicts government; this turns out to be a borderline case in the gold standard. A similarly frequent problem is the prediction of life sciences for datasets in the publications category. This can be observed for the ns.nature publication website, which describes the publications in Nature. Those publications are often in the life sciences field, which makes the labeling in the gold standard a borderline case. The third most common confusion occurs between the user generated content and the social networking domains. The problem here is the shared use of similar vocabularies such as FOAF. At the same time, labeling a dataset as either one of the two is often not so simple: a social networking dataset should focus on the presentation of people and their interrelations, while user generated content should have a stronger focus on the content. Datasets from personal blogs, such as wordpress.com, however, can convey both aspects. Due to the labeling rule these datasets are labeled as user generated content, but our approach frequently classifies them as social networking. While we can observe some true classification errors, many of the mistakes made by our approach actually point at datasets which are difficult to classify and which are rather borderline cases between two categories.
• #24: ABSTAT is a framework which can be used to summarize linked datasets and at the same time provide statistics about them. The summary consists of Abstract Knowledge Patterns (AKPs) of the form <subjectType, predicate, objectType>, which represent the occurrence of triples <sub, pred, obj> in the data such that subjectType is a minimal type of sub and objectType is a minimal type of obj. The ABSTAT summaries can help users compare in which of two datasets a concept is described with richer and more diverse properties, and also help detect errors in the data by extracting counterintuitive AKPs, e.g. <dbo:Band, dbo:genre, dbo:Band>. ABSTAT can also be used to fix the domain and range information for properties: either the domain or the range is unspecified for 585 properties in the DBpedia ontology, and AKPs can help in determining at least one domain and one range for the unspecified properties. For example, for the property http://dbpedia.org/ontology/governmentType in DBpedia we have no information about the domain; with our approach we can derive 7 different AKPs, meaning that we can derive 7 domains for this property. ABSTAT can also be used to detect missing values, datatype diversity, etc.
• #25: Evaluating the validity of the proposed approach and its results is very difficult, as in the field of LOD profiling there is no gold standard, so comparison with other approaches is hard. For this reason, we want to further explore how these new statistics or summaries improve the performance of current profiling techniques and tools, e.g. how profiling tasks can improve full-text search. To evaluate the validity of the proposed profiling techniques for summarizing datasets, as pattern discovery is not trivial, humans will evaluate the validity of the summarization in terms of relatedness and informativeness. We intend to provide users a list of statistics and ask them which, in their opinion, is the most important to support profiling of Linked Open Data. The evaluation of the performance of profiling tasks is very difficult and remains an open issue on which I am currently working.
• #27: To sum up, in this work we investigated to what extent the topical classification of new LOD datasets can be automated using machine learning techniques. Our experiments indicate that vocabulary-level features are a good indicator of the topical domain, yielding an accuracy of around 82%. The analysis of the limitations of our approach, i.e. the cases where the automatic classification deviates from the manually assigned label, points to a problem of the categorization approach currently used for the LOD cloud: all datasets are labeled with exactly one topical category, although sometimes two or more categories would be equally appropriate. Thus the LOD dataset classification task might be more suitably formulated as a multi-label classification problem. A particular challenge of the classification is the heavy imbalance of the dataset categories, with roughly half of the datasets belonging to the social networking domain. A two-stage approach might help, in which a first classifier tries to separate the largest category from the rest, while a second classifier then tries to make a prediction for the remaining classes. When regarding the problem as a multi-label problem, the corresponding approach would be classifier chains, which make a prediction for one class after the other, taking the predictions of the earlier classifiers into account as features for the remaining classifications. In our experiments, RDF links have not been exploited beyond dataset in- and out-degree; link-based classification techniques might further exploit the content of a dataset. Also, adding more features to the model, such as linkage information, might help achieve higher accuracy.