Profiling Linked (Open) Data
Blerina Spahiu
Department of Computer Science, Systems and Communication,
University of Milan - Bicocca
blerina.spahiu@disco.unimib.it
Supervisors: Andrea Maurino, Matteo Palmonari
Tutor: Prof. Flavio De Paoli
Outline
 The research background
 The research plan
 Preliminary results
 Conclusions and Future Work
Data profiling definition
- Where shall I begin, please your Majesty?
- Begin at the beginning - the King said gravely.
Lewis Carroll in Alice’s Adventures in Wonderland
The process of evaluating data quality is called data profiling and typically
involves gathering several aggregated data statistics which constitute the data
profile.
Encyclopedia of Database Systems, June 2014
Linked Open Data Cloud
- 1014 datasets
- 188 million triples
- 7 topical categories
- 80% of linking properties are owl:sameAs
- 98.22% of datasets use the RDF vocabulary
- 7.85% of datasets provide licensing information
- etc.
Linked Open Data Cloud
What types of resources are described in a dataset?
How are they described?
How well connected are the datasets in the LOD cloud?
What are their topics?
Are data described as prescribed by the ontology?
Why do we need data profiling?
"…because prevention is better than cure"
 Data quality assessment
 Query optimization
 Ontology / Data integration
 Data analytics
 Complex schema discovery
 Topical discovery
 Data visualization
State of the art

| Tool | Goal | Input | Output | Automatization | Scalability | Availability | License | Tutorial |
|------|------|-------|--------|----------------|-------------|--------------|---------|----------|
| Roomba (Assaf et al., 2015) | Generate descriptive dataset profiles | Query portal APIs for available metadata | Quality assessment of metadata | — | — | Code on GitHub | Open source | — |
| LODStats (Auer et al., 2012) | Comprehensive statistics about RDF | RDF | 32 statistical criteria on schema and data level | — | — | Only the demo | — | Demo |
| ExpLOD (Khatchadourian S. and Consens M. P., 2010) | Supports exploring summaries of RDF usage and interlinking among datasets | RDF dataset, the BL (bisimulation label) schema and the neighborhoods to consider | Summaries can be viewed and explored in an interactive graphical way and exported in a variety of formats | — | — | — | — | — |
| RDFStats (Langegger A. and Wöß W., 2010) | Generation of different statistics | RDF dataset | Histograms for value distributions, classes/properties/datatypes | Semi-automatic | Yes | — | Apache | Yes |
| ProLOD++ (Böhm et al., 2010) | Computes different profiling, mining or cleansing tasks | RDF dataset | Statistics about properties, classes, etc.; information about uniqueness and keyness | Automatic | — | Demo | — | — |
Data Profiling Tools Survey
Profiling challenges
 The results of data profiling are computationally complex to discover
 Different and new data management architectures and frameworks have
emerged
 Linked Open Data are heterogeneous data
• Syntactic Heterogeneity
(Different formats, query languages)
• Schematic Heterogeneity
(Different encoding schemas)
• Semantic Heterogeneity
(Different vocabularies, semantic overlap of terms)
 Unified view of data profiling as a field
 Unifying framework for its tasks
Objectives
 Develop automatic approaches
 Generate new statistics and knowledge patterns to provide a dataset summary
and inspect its quality
• Apply data mining techniques to extract useful knowledge from large datasets
• Implement different approaches for outlier detection
 Algorithms to overcome the challenges of profiling Linked Open Data
• Parallel calculation of statistics and pattern extraction in LOD (see the sketch after this list)
• Data mining techniques to deal with high dimensionality
 Topical information extraction and classification
 Develop a methodology for performing profiling tasks
• A deep literature study to classify and formalize profiling tasks
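The speaker notes mention MapReduce as a way to parallelize the calculation of statistics. As a hedged illustration only, here is a minimal map/reduce-style sketch in Python that counts predicate usage across partitions of an RDF dump in parallel; the chunk contents and prefixes are hypothetical:

```python
from collections import Counter
from multiprocessing import Pool

def count_properties(triples_chunk):
    """Map step: count predicate usage within one partition of the data."""
    return Counter(p for _, p, _ in triples_chunk)

def merge_counts(partials):
    """Reduce step: merge the per-partition counters into a global profile."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    # Hypothetical partitions of an RDF dump, e.g. one chunk per file.
    chunks = [
        [("s1", "rdf:type", "o1"), ("s1", "foaf:name", "o2")],
        [("s2", "rdf:type", "o3")],
    ]
    with Pool() as pool:
        print(merge_counts(pool.map(count_properties, chunks)))
    # Counter({'rdf:type': 2, 'foaf:name': 1})
```

The same pattern extends to other per-triple statistics, since the map step is embarrassingly parallel and the reduce step is an associative merge.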
Methodology Used
[Diagram: the methodology relates the three forms of heterogeneity (syntactic, schematic, semantic) to the profiling tasks of topical discovery, data quality and data understanding.]
Work already done (1)
Profiling of Italian Public Administration websites
• Decrees 33 and 150
Profiling of Italian PAs
 Benchmark of PAs
• Geographical distribution (country wide)
• Type of PAs (region, municipality, county)
• Size (number of inhabitants)
 Compliance Index (a plausible weighted form is sketched below)
• Completeness
• Accuracy
• Timeliness
 Profiling websites in terms of compliance
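The exact compliance-index formula does not survive in this transcript; the speaker notes only say that the three criteria are normalized and that completeness receives a greater weight. A plausible weighted-average reconstruction, with hypothetical weights w_C, w_A, w_T, is:

```latex
% Hypothetical reconstruction: only the normalization of C, A, T and the
% larger weight on completeness (C) are stated in the speaker notes.
CI = \frac{w_C \, C + w_A \, A + w_T \, T}{w_C + w_A + w_T},
\qquad w_C > w_A, \; w_C > w_T, \qquad C, A, T \in [0, 1]
```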
“Data Profiling, the moment of truth”
 The average index of compliance for the selected
• Italian Regions is 0.488 (50% has an index lower than the mean).
• Italian Provinces is 0.561 (more than 50% has an index lower than the mean).
• Italian Municipalities is 0.462 (more than 50% has an index lower than the mean).
 Regions
Veneto has the highest score (0.839)
Campania has the lowest score (0.043)
 Provinces
Bergamo, in the Lombardia Region, has the highest score (0.759)
Massa Carrara, in the Toscana Region, has the lowest score (0.266)
 Municipalities
Voghera (Lombardia Region) has the
highest score (0.759)
Ozegna (Piemonte Region) has the
lowest score (0.164)
Works already done (2)
 Facilitating queries for similar dataset discovery
 Speeding up data searches
 Trends and best practices of a particular domain can be identified
To what extent can topical classification be automated?
Data Corpus and Feature Set

| Category | Datasets | % |
|----------|----------|---|
| Government | 183 | 18.05 |
| Publications | 96 | 9.47 |
| Life sciences | 83 | 8.19 |
| User generated content | 48 | 4.73 |
| Cross domain | 41 | 4.04 |
| Media | 22 | 2.17 |
| Geographic | 21 | 2.07 |
| Social Web | 520 | 51.28 |
 Data corpus (1014 datasets) extracted in April 2014 from Schmachtenberg et al.
 Vocabulary Usage (1439)
 Class URIs (914)
 Property URIs (2333)
 Local Class Names (1041)
 Local Property Names (2493)
 Text from rdfs:label (1440)
 Top Level Domain (55)
 In and Out Degree (2)
(The local-name heuristic and the label TF-IDF construction are sketched after this list.)
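The speaker notes describe the LCN/LPN features as derived by a simple regular expression that looks for #, : and / within a URI and keeps the local part, and the LAB features as TF-IDF over tokenized rdfs:label values. Two minimal sketches under stated assumptions (the URIs and label strings are made up, and the document-frequency cutoffs are relaxed for the toy data):

```python
import re

def local_name(uri: str) -> str:
    """Return the local name of a class/property URI by splitting
    at '#', '/' and ':' and keeping the last fragment."""
    return re.split(r"[#/:]", uri)[-1]

# Both hypothetical URIs map to the same LCN attribute "Country",
# which is why LCN yields more shared attributes than Curi.
assert local_name("http://example.org/vocab1#Country") == "Country"
assert local_name("http://example.org/vocab2/Country") == "Country"
```

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per dataset: its concatenated rdfs:label values (toy data).
label_docs = [
    "gene protein sequence annotation",
    "city population municipality mayor",
    "band album genre release date",
]

vectorizer = TfidfVectorizer(
    lowercase=True,                     # labels are lower-cased
    token_pattern=r"(?u)\b\w{3,25}\b",  # keep tokens of length 3..25
    min_df=1, max_df=1.0,               # real setup: tokens in >=10 and <=200 datasets
)
X_lab = vectorizer.fit_transform(label_docs)
print(X_lab.shape)  # (number of datasets, number of surviving tokens)
```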
Experimental Setup
 Classification approaches
 K-Nearest Neighbor
 J-48
 Naïve Bayes
 Two normalization strategies
 Binary (bin)
 Relative term occurrences (rto)
 Three sampling techniques
 No sampling
 Down sampling
 Up sampling
(A scikit-learn analogue of this setup is sketched below.)
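The experiments were run in Weka; the following is only a rough scikit-learn analogue of the setup, shown on hypothetical toy data. The CART tree standing in for J48 and the Bernoulli variant of Naïve Bayes are assumptions, and the sampling strategies are omitted for brevity:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the corpus: raw term counts per dataset, 8 categories.
rng = np.random.default_rng(0)
X_counts = rng.integers(0, 20, size=(400, 50)).astype(float)
y = np.arange(400) % 8  # balanced labels, so stratified 10-fold CV is valid

# The two normalization strategies from the slide.
X_bin = (X_counts > 0).astype(float)                    # binary (bin)
X_rto = X_counts / X_counts.sum(axis=1, keepdims=True)  # relative term occurrences (rto)

classifiers = {
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "decision tree (J48 stand-in)": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": BernoulliNB(),
}
for norm, X in (("bin", X_bin), ("rto", X_rto)):
    for name, clf in classifiers.items():
        acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
        print(f"{norm} / {name}: mean 10-fold accuracy = {acc:.3f}")
```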
Results on Combined Feature Sets
 Our model reaches an accuracy of 81.62%
Confusion Matrix
 Publications are confused with government and with life sciences
 User generated content is confused with social networking
Works already done (3)
 ABSTAT is a framework which can be used to summarize linked datasets and at the same time to provide statistics about them
• A summary consists of Abstract Knowledge Patterns (AKPs) of the form <subjectType, predicate, objectType> (a toy extraction sketch follows the table below)
• Can help users compare two datasets
• Helps detect accuracy errors in the data, e.g. the counterintuitive AKP <dbo:Band, dbo:genre, dbo:Band>
• The domain or the range is unspecified for 585 properties in the DBpedia ontology; AKPs such as those below for dbo:governmentType can suggest candidates
| SubjectType | Property | ObjectType |
|-------------|----------|------------|
| http://dbpedia.org/ontology/Town | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/Country |
| http://dbpedia.org/ontology/City | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/Legislature |
| http://dbpedia.org/ontology/Settlement | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/Settlement |
| http://dbpedia.org/ontology/Country | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/PoliticalParty |
| http://dbpedia.org/ontology/Village | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/MilitaryConflict |
| http://dbpedia.org/ontology/Organization | http://dbpedia.org/ontology/governmentType | http://dbpedia.org/ontology/City |
| http://dbpedia.org/ontology/AdministrativeRegion | http://dbpedia.org/ontology/governmentType | — |
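A toy sketch of AKP extraction, assuming the minimal type of each resource has already been computed (in ABSTAT minimal types come from the ontology's subclass hierarchy; the triples and type assignments below are illustrative):

```python
from collections import Counter

# Illustrative minimal-type assignments and triples.
minimal_type = {
    "dbr:Nirvana": "dbo:Band",
    "dbr:Grunge":  "dbo:MusicGenre",
}
triples = [("dbr:Nirvana", "dbo:genre", "dbr:Grunge")]

def abstract_patterns(triples, minimal_type):
    """Count <subjectType, predicate, objectType> patterns over the triples."""
    akps = Counter()
    for s, p, o in triples:
        s_type = minimal_type.get(s, "owl:Thing")  # fallback for untyped resources
        o_type = minimal_type.get(o, "owl:Thing")
        akps[(s_type, p, o_type)] += 1
    return akps

print(abstract_patterns(triples, minimal_type))
# Counter({('dbo:Band', 'dbo:genre', 'dbo:MusicGenre'): 1})
```

A pattern whose subject or object type looks wrong, such as <dbo:Band, dbo:genre, dbo:Band>, is then a candidate accuracy error.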
Evaluation Plan
 Where a gold standard exists, validate in terms of precision, recall and F-measure (defined below)
 It is difficult to evaluate the validity of the proposed approach, so we plan to:
• Assess how these statistics and summaries improve the performance of actual profiling tasks
• Have humans evaluate the validity of the summarization in terms of relatedness and informativeness
• Provide users with a list of statistics and ask which, in their opinion, is most important for their use case
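For reference, the standard definitions of the metrics named above, in terms of true positives (TP), false positives (FP) and false negatives (FN):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
```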
Conclusions and Future Work
 The topical classification approach yields an accuracy of 82%; enriching it
with other features, such as linkage coverage, may improve it further
• Currently each dataset has only one topic; for some datasets multi-label
classification would be more appropriate
• A classifier chain could handle the multi-label formulation
• Because of the heavy imbalance of the data, a two-stage classifier
might help
 Enrich the ABSTAT framework with other statistics and apply it to other
kinds of data, such as Microdata
 Investigate how well ABSTAT summaries support dataset exploration and
understanding
Publications
• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An Empirical Evaluation of Italian Local Public Administration. ECIS eGOV Workshop at the Twenty-Second European Conference on Information Systems, Tel Aviv, 2014
• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An Empirical Evaluation of Italian Local Public Administration. Information Polity Journal, pp. 263-275, 2014
• M. Palmonari, A. Rula, R. Porrini, A. Maurino, B. Spahiu, V. Ferme – ABSTAT: Linked Data Summaries with Abstraction and STATistics. The Semantic Web: ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015
• R. Meusel, B. Spahiu, C. Bizer, H. Paulheim – Towards Automatic Classification of LOD Datasets. LDOW Workshop co-located with the 24th International World Wide Web Conference (WWW 2015), Firenze, May 19, 2015
• B. Spahiu – Profiling the Linked (Open) Data. Doctoral Consortium at ISWC 2015
• C. Xie, D. Ritze, B. Spahiu, H. Cai – Instance-based Property Matching in the Linked Open Data Environment. Ontology Matching Workshop co-located with the 14th International Semantic Web Conference, 2015, Bethlehem, Pennsylvania, USA
Thank you for your attention!
Editor's Notes
• #5: According to the definition we found in the Encyclopedia of Database Systems, the process of evaluating data quality is called data profiling and typically involves gathering several aggregated data statistics which constitute the data profile. So we can say that data profiling refers to the activity of creating small but informative summaries of a dataset, and it is a cardinal activity when facing an unfamiliar dataset.
• #6: In the current state of the LOD cloud we have 1014 datasets, 188 million triples and overall 7 categories; 80% of the linking properties are owl:sameAs, 98.22% of datasets use the RDF vocabulary, around 8% of the datasets provide licensing information, etc. Even with this huge amount of data, we still miss information which might be hidden or not easy to understand.
• #7: With this amount of data we still ask questions: What types of resources are described in a dataset? How are they described? How well connected are the datasets in the LOD cloud? What are their topics? Are data described as prescribed by the ontology?
• #8: Why do we need profiling? The need to profile a new or unfamiliar dataset arises in many situations, in general to prepare for some subsequent task. Data quality assessment: probably the most typical use case for profiling is preparing for a data cleansing process. Profiling reveals data errors, such as inconsistent formatting, missing values or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset. Query optimization: basic profiling is performed by most database systems to support query optimization with statistics about tables and columns. These profiling results can be used to estimate, for example, the cost of a query plan. Ontology and data integration: often the datasets to be integrated are somewhat unfamiliar, and the integration expert wants to explore them first. How large is the database? Are there dependencies between tables and among databases? Ontologies published on the Web, even for datasets in similar domains, can have differences. Data profiling techniques can help in understanding the overlap between ontologies and in the process of ontology creation, maintenance and integration. Data analytics: almost any statistical analysis or data mining run is preceded by a profiling step that helps the analyst get a first impression of the data at hand. Complex schema discovery: schema complexity leads to difficulties in understanding and accessing the data. Schema summaries provide users a concise overview of the entire schema despite its complexity. Topical classification: finding the features that best represent the topic(s) of a given dataset can help not only the topical classification of the dataset but also the understanding of the semantics of the information found in the data. Data visualization: profiling techniques can support data visualization tools in visualizing large multidimensional datasets by displaying only a small and concise summary of the most relevant and important features.
• #9: In the table we show an overview of the current state-of-the-art tools and techniques. Roomba is a framework which validates and generates descriptive dataset profiles. One of the portal API endpoints is queried, and information about a dataset's title, description, maintainer email, and update and creation dates is extracted. From the extracted metadata they are able to extract resources associated with that dataset. If the dataset contains many instances, only a sample of 10% is used for the validation of the extracted metadata. There is no information about automatization or scalability; there is no interface, only the code is available on GitHub, and no tutorials are provided. LODStats is used to gather comprehensive statistics about RDF data. It takes an RDF file as input and generates 32 different statistical criteria on schema and data level. The statistics usually cover class/property usage, out-links, in-links, etc. There exists only a demo of how the statistics look, and no information about availability, scalability or license. ExpLOD supports exploring summaries of RDF usage and interlinking among datasets. As input it takes the dataset, the BL schema (a bisimulation-label graph constructed from the RDF usage of each node, considering also its neighbors) and the neighborhoods to consider. The summaries produced can be viewed and explored in an interactive graphical environment and can also be exported in a variety of formats (including RDF). There is no information about automatization, scalability, license or tutorials. RDFStats is used to generate statistics about RDF datasets: the generator creates histograms for each combination of class c, property p (used in the extension of c), and type range t of property values. Because a property value can be a (plain or typed) literal, blank node, or URI, it is necessary to generate different histograms for each of the occurring types. It is a semi-automatic tool, because it has to be configured manually. There is no information about the scalability.
• #10: This table contains only aggregated information about profiling tools; in my research I performed a deeper survey, identifying for each tool, where possible, each statistical criterion it can calculate.
• #11: While data profiling research has proposed several solutions, as the literature shows, it still presents several challenges. First, the results of data profiling are computationally complex to discover: discovering key candidates or dependencies usually involves some sorting step for each considered column. Second, different and new data management architectures and frameworks have emerged, including distributed systems and multi-core or main-memory based servers. Third, traditional profiling tasks cannot be applied to Linked Data due to their heterogeneity. Heterogeneity can appear in different forms, such as different formats or query languages (syntactic heterogeneity). Linked Open Data can be represented in different formats and stored in different storage architectures, and the data encoding schemes may vary; this is referred to as schematic heterogeneity. Datasets published as LOD might use different vocabularies to describe synonymous terms (semantic heterogeneity). Although much effort has been made in the past, what is still missing in the state of the art is a unified view of data profiling as a field and a unified framework for its tasks.
• #13: Our intent is to develop automatic approaches and to generate new statistics that are not covered by current state-of-the-art techniques. While much effort has been made in the past, as described in the state of the art, the statistics that are generated are limited to basic ones such as the number of triples, the number of classes/properties used in a dataset, the datatypes or sameAs links used, etc. Datasets hold much more interesting information which might be hidden but could be useful for the consumer of the dataset. As data profiling refers to the activity of providing useful descriptive information, new techniques for extracting this hidden information should be developed, so applying data mining techniques to extract useful knowledge from large datasets should be investigated. Different data mining techniques, such as association rule mining, can be used to discover and extract patterns and dependencies in a dataset. These patterns might provide useful information, especially to detect errors and inconsistencies in spatial data (consistency quality dimension). Other algorithms, such as the Apriori algorithm, can be used to find sets of items within a dataset that satisfy a user-defined minimum support threshold, to support summarization. Implementations of different approaches for outlier detection (distance-, deviation- or depth-based, evolutionary techniques, etc.) could provide insight into abnormalities in the underlying data. As the number of published datasets is increasing, the need to adapt and optimize profiling techniques to support huge amounts of data is also high. A good approach when dealing with large datasets is to improve profiling performance by running the calculation of statistics and pattern extraction in parallel; MapReduce can help with this profiling challenge. We also plan to adapt data mining techniques to deal with high-dimensional data such as Linked Open Data. A deep literature study of the tools currently used to profile Linked Open Data has been undertaken: we analyzed existing tools in terms of the goal they are used for, techniques, input, output, approach, automatization, license, etc., with the aim of having a complete view of the existing approaches and techniques for profiling. This in-depth study will also help us with the third contribution, creating a general methodology for each of the profiling tasks. We also intend to provide the topical information of a dataset as part of our profile.
• #16: As a first step we measured the value of Linked Open Data by profiling the data published as Open Data by Italian Public Administrations, calculating a compliance index that considers three quality dimensions of the published data: completeness, accuracy and timeliness. Public administrations are compelled by law number 33 of the "Decreto sviluppo" to publish all their data following a strict schema, where the information should be placed under a section of their website called "Amministrazione Trasparente", reachable from the PA's homepage, with information about various topics of administration. Finally, legislative decree 150 requires public administrations to publish documents in an open format that corresponds to 1 star on the 5-star scale proposed by Berners-Lee (2006). The law requires each PA to have 69 sections on its website.
• #17: We have a benchmark of 50 public administration websites considering 3 factors: geographical distribution (country-wide), type of PA (regions, counties and municipalities) and size (number of inhabitants). We then calculated the compliance index considering three criteria: the completeness of the website, its accuracy and its timeliness. After normalizing all the values and giving a greater weight to completeness, we calculated the compliance index with the formula shown on the slide.
• #18: After analyzing all the results calculated with the above formula, we profiled Italian PAs in terms of completeness, accuracy and timeliness, taking into consideration the sampling criteria. All these statistics give an overview of the compliance index for PAs across Italy. As the results show, the average compliance index for regions is 0.488, where more than half have an index lower than the mean. The average index is somewhat greater for provinces, but still 50% of the provinces have a lower index than the mean, while the compliance index is even lower for municipalities. The highest compliance index is for the region of Veneto with 0.839, while Campania has the lowest score, 0.043. Looking at the compliance degree by type of PA, as shown in the table, regions in the centre of Italy seem to have a higher index than those in the south. For provinces, the highest compliance is in the north, while the centre and south are roughly the same. Considering the number of inhabitants, PAs with many inhabitants seem to be the ones that comply most with the law. For more details, see the publication of this work.
• #19: To what extent can the topical classification of new LOD datasets be automated for upcoming versions of the LOD cloud diagram, using machine learning techniques and the existing annotations as supervision? Besides creating upcoming versions of the LOD cloud diagram, the automatic topical classification of LOD datasets can be interesting for other purposes as well. Facilitating queries for similar dataset discovery: agents navigating the Web of Linked Data should know the topical domain of the datasets they discover by following links, in order to judge whether the datasets might be useful for their use case at hand. Speeding up data searches: knowing the domain of a dataset can not only facilitate queries but also speed up data searches, as agents then know what data they are looking for. Furthermore, it is interesting to analyze characteristics of datasets grouped by topical domain, so that trends and best practices that exist in a particular topical domain can be identified. These are the main motivations for building, to the best of our knowledge, the first automatic approach to classify LOD datasets into the topical categories used by the LOD cloud diagram.
• #20: For this purpose we used the latest version of the LOD cloud, crawled in April 2014, containing 1014 different LOD datasets describing around 8 million resources. The datasets were manually classified into one of the following categories: media, government, publications, life sciences, geographic, social networking, user generated content and cross domain. As shown in the table, the LOD cloud is dominated by datasets belonging to the category social networking (48%), followed by government (18%) and publications (13%). The categories media and geographic are each represented by fewer than 25 datasets within the whole corpus. For each of the datasets, we created the following eight feature sets based on the crawled data. Vocabulary Usage (VOC): as many vocabularies target a specific topical domain, e.g. bibliographic information, we assume that the vocabularies used by a dataset might be a helpful indicator for determining the topical category of the dataset. Thus we determine the vocabulary of all terms that are used as predicates or as the object of a type statement within each dataset. Altogether we identified 1439 different vocabularies being used by the datasets. Class URIs (Curi): as a more fine-grained feature, the rdfs: and owl: classes used to describe entities within a dataset might provide useful information to determine the topical category of the dataset. Thus we extracted all the classes that are used by at least two different datasets, resulting in 914 attributes for this feature set. Property URIs (Puri): besides the class information of an entity, information about which properties are used to describe the entity can be helpful. For example, it might make a difference if a person is described with foaf:knows statements or if her professional affiliation is provided. To leverage this information, we collected all properties used within the crawled data by at least two datasets. This feature set consists of 2333 attributes. Local Class Names (LCN): different vocabularies might contain synonymous (or at least closely related) terms that share the same local name and only differ in their namespace, e.g. foaf:Person and dbpedia:Person. Creating correspondences between similar classes from different vocabularies reduces the diversity of features but might increase the number of attributes used by more than one dataset. As we lack correspondences between all the vocabularies, we bypass this by using only the local names of the type URIs, meaning vocab1:Country and vocab2:Country are mapped to the same attribute. We used a simple regular expression to determine the local class name, checking for #, : and / within the type object. By focusing only on the local part of a class name, we increase the number of classes used by more than one dataset in comparison to Curi, and thus generated 1041 attributes for the LCN feature set. Local Property Names (LPN): using the same assumption as for the LCN feature set, we also extracted the local name of each property used by a dataset. This results in treating vocab1:name and vocab2:name as a single property. We used the same heuristic as for the LCN feature set and generated 2493 different local property names that are used by more than one dataset, an increase in the number of attributes in comparison to the Puri feature set. Text from rdfs:label (LAB): besides the vocabulary-level features, the names of the described entities might also indicate the topical domain of a dataset. We thus extracted all values of rdfs:label properties, lower-cased them and tokenized the values at space characters. We further excluded tokens shorter than 3 and longer than 25 characters. Afterwards we calculated the TF-IDF value for each token, excluding tokens that appeared in fewer than 10 or more than 200 datasets in order to reduce the influence of noise. This resulted in a feature set consisting of 1440 attributes. Top-Level Domain (TLD): another feature which might help to assign datasets to topical categories is the top-level domain of the dataset. For instance, government data is often hosted under the gov top-level domain, whereas library data is more likely found under edu or org top-level domains. We restrict ourselves to top-level domains rather than public suffixes. In and Out Degree: in addition to the vocabulary-based and textual features, the number of outgoing RDF links to other datasets and incoming RDF links from other datasets could also provide useful information for classifying datasets. These features give a hint about the density of the linkage of a dataset, as well as the way the dataset is interconnected within the whole LOD cloud ecosystem. We were able to create all features (except LAB) for 1001 datasets. As only 470 datasets provide rdfs:label, we only used these datasets for evaluating the utility of the LAB feature set.
• #21: We evaluated the following three classification techniques on the task of assigning topical categories to LOD datasets. K-Nearest Neighbor: classification models make use of the similarity between new cases and known cases to predict the class for the new case. A case is classified by the majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors as measured by the distance function. In our experiments we used k equal to 5. J48 Decision Tree: a decision tree is a flowchart-like tree structure which is built top-down from a root node and involves partitioning steps that divide the data into subsets containing instances with similar values. For our experiments we used the Weka implementation of the C4.5 decision tree. We learn a pruned tree, using a confidence threshold of 0.25 with a minimum of 2 instances per leaf. Naïve Bayes: as a last classification method we used Naïve Bayes, which uses joint probabilities of some evidence to estimate the probability of some event. Although this classifier is based on the assumption that all features are independent, which is violated in many use cases, NB has been shown to work well in practice. In order to evaluate the performance of the selected classifiers we use 10-fold cross-validation and report the average accuracy. As the total number of occurrences of vocabularies and terms is heavily influenced by the distribution of entities within the crawl for each dataset, we apply two different normalization strategies to the values of the vocabulary-level features VOC, Curi, Puri, LCN and LPN. On the one hand, we created a binary version (bin), where the feature vectors of each feature set consist of 0s and 1s indicating absence and presence of the vocabulary or term; the second version, relative term occurrence (rto), captures the fraction of the vocabulary or term usage for each dataset. As the number of datasets per category, as shown in the table, is not equally distributed within the LOD cloud, which might influence the performance of the classification model, we also explore the effect of balancing the training data using three different sampling approaches: (1) no sampling, applying our model to the data without any sampling technique; (2) down-sampling the number of datasets used for training until each category is represented by the same number of datasets, equal to the number of datasets within the smallest category; and (3) up-sampling the datasets for each category until each category is represented by at least the number of datasets of the largest category. While down-sampling reduces the chance of overfitting a model towards the larger classes, it might also remove valuable information from the training set, as examples are removed and not taken into account for learning the model. Up-sampling ensures that all possible examples are taken into account and no information is lost for training, but creating the same entity many times can result in emphasizing those particular data points. For example, a neighborhood-based classifier might look at the 5 nearest neighbors, which could then be one and the same data point, effectively looking only at the nearest neighbor.
• #22: For our second set of experiments, we combined the available attributes from the different feature sets and again trained our classification models using the three described algorithms. As before, we generated a binary and a relative term occurrence version of the vocabulary-based feature sets. In addition we created a second set (binary and relative term occurrence) where we omit the attributes from the LAB feature set, as we wanted to measure the influence of this particular set of attributes, which is available for less than half of the datasets. Furthermore we created a combined set of the attributes from the previous section. We can observe that when selecting a larger set of attributes, our model is able to reach a slightly higher accuracy of 81.62% than when using just the attributes of a single feature set (80.59%, LCNbin).
• #23: In the following we look at the confusion matrix of the best-performing approach: Naïve Bayes trained on the attributes of the NoLABbin feature set using up-sampling. On the left side of the matrix we list the predictions of the learned model, while the header names the actual category of the dataset. As observed in the table, there are three kinds of errors which occur more than ten times. The most common confusion occurs for the publications domain, where a larger number of datasets are predicted to belong to the government domain. A reason for this is that government datasets often contain metadata about government statistics which are represented using the same vocabularies and terms (e.g. skos:Concept) that are also used in the publications domain. This makes it challenging for a vocabulary-based classifier to tell those two categories apart. Another example is the classification of the dataset of the Ministry of Culture in Spain, which was manually assigned the publications label whereas our model predicts government; this turns out to be a borderline case in the gold standard. A similarly frequent problem is the prediction of life sciences for datasets in the publications category. This can be observed for the ns.nature publication website, which describes the publications in Nature. Those publications are often in the life sciences field, which makes the labeling in the gold standard a borderline case. The third most common confusion occurs between the user generated content and the social networking domains. The problem here is the shared use of similar vocabularies such as FOAF. At the same time, labeling a dataset as either one of the two is often not so simple: a social networking dataset should focus on the presentation of people and their interrelations, while user generated content should have a stronger focus on the content. Datasets from personal blogs, such as wordpress.com, however, can convey both aspects. Due to the labeling rule these datasets are labeled as user generated content, but our approach frequently classifies them as social networking. While we can observe some true classification errors, many of the mistakes made by our approach actually point at datasets which are difficult to classify and which are rather borderline cases between two categories.
• #24: ABSTAT is a framework which can be used to summarize linked datasets and at the same time provide statistics about them. The summary consists of Abstract Knowledge Patterns (AKPs) of the form <subjectType, predicate, objectType>, which represent the occurrence of triples <sub, pred, obj> in the data such that subjectType is a minimal type of sub and objectType is a minimal type of obj. The ABSTAT summaries can help users compare in which of two datasets a concept is described with richer and more diverse properties, and also help detect errors in the data by extracting counterintuitive AKPs, e.g. <dbo:Band, dbo:genre, dbo:Band>. ABSTAT can also be used to fix the domain and range information for properties: either the domain or the range is unspecified for 585 properties in the DBpedia ontology, and AKPs can help in determining at least one domain and one range for the unspecified properties. For example, for the property http://dbpedia.org/ontology/governmentType in DBpedia we have no information about the domain; with our approach we can derive 7 different AKPs, meaning that we can derive 7 domains for this property. ABSTAT can also be used to detect missing values, datatype diversity, etc.
• #25: Evaluating the validity of the proposed approach and its results is very difficult, as in the field of LOD profiling there is no gold standard, so comparison with other approaches is hard. For this reason, we want to further explore how these new statistics or summaries improve the performance of current profiling techniques and tools, e.g. how profiling tasks can improve full-text search. To evaluate the validity of the proposed profiling techniques for summarizing datasets, as pattern discovery is not trivial, humans will evaluate the validity of the summarization in terms of relatedness and informativeness. We intend to provide users a list of statistics and ask them which, in their opinion, is the most important to support profiling of Linked Open Data. The evaluation of the performance of profiling tasks is very difficult and remains an open issue on which I am currently working.
• #27: To sum up, in this work we investigated to what extent the topical classification of new LOD datasets can be automated using machine learning techniques. Our experiments indicate that vocabulary-level features are a good indicator of the topical domain, yielding an accuracy of around 82%. The analysis of the limitations of our approach, i.e. the cases where the automatic classification deviates from the manually assigned label, points to a problem of the categorization approach currently used for the LOD cloud: all datasets are labeled with exactly one topical category, although sometimes two or more categories would be equally appropriate. Thus the LOD dataset classification task might be more suitably formulated as a multi-label classification problem. A particular challenge of the classification is the heavy imbalance of the dataset categories, with roughly half of the datasets belonging to the social networking domain. A two-stage approach might help, in which a first classifier tries to separate the largest category from the rest, while a second classifier then tries to make a prediction for the remaining classes. When regarding the problem as a multi-label problem, the corresponding approach would be classifier chains, which make a prediction for one class after the other, taking the predictions of the earlier classifiers into account as features for the remaining classifications. In our experiments, RDF links have not been exploited beyond dataset in- and out-degree; link-based classification techniques might further exploit the content of a dataset. Also, adding more features to the model, such as linkage information, might help achieve higher accuracy.