How redundant is it? – An empirical analysis on linked datasets 
Honghan Wu¹, Boris Villazon-Terrazas², Jeff Z. Pan¹ and José Manuel Gómez Pérez²
¹ University of Aberdeen, UK
² iSOCO, Spain
20/10/2014
2 
Content
• What is data redundancy with linked data?
• Why is it of special interest to linked data consumption?
• Linked Data redundancy categorisation
• How to analyse it?
• Dataset selection & the results
• Conclusion
3 
What is data redundancy in LD?
• Data Redundancy
  – [Database systems] Same piece of data in multiple places
  – [Information theory] Wasted "space" used to transmit certain data
• (In this work) Linked Data Redundancy
  – Wasted "space" to represent certain meaning (represented in certain semantics)
  – Duplication-free
4 
Why is it of special interest to LD consumption?
• Bad Redundancy & Good Redundancy
  – Bad for exchange: storage, transmission
  – Good for inference computation
• Relevant consumption tasks
  – Hosting/Sharing
  – Query Answering (SPARQL)
  – Ontology Based Data Access
  – Reasoning
Redundancy in Linked Data
• Redundancy Categorisation for RDF Data
• Redundancies caused by the “Linked” nature
6 
RDF Redundancies vs. Succinct Representations
[Rule based] A. K. Joshi, P. Hitzler, and G. Dong. Logical linked data compression. In The Semantic Web: Semantics and Big Data, pages 170–184. Springer, 2013.
[HDT] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias. Binary RDF representation for publication and exchange (HDT). Web Semant., 19:22–41, Mar. 2013.
[WaterFowl] O. Curé, G. Blin, D. Revuz, and D. C. Faye. WaterFowl: A compact, self-indexed and inference-enabled immutable RDF store. In The Semantic Web: Trends and Challenges, pages 302–316. Springer, 2014.
J. Z. Pan, J. M. Gómez-Pérez, Y. Ren, H. Wu, H. Wang, and M. Zhu. Graph pattern based RDF data compression. In Proc. of the 4th Joint International Semantic Technology Conference (JIST), 2014. (To appear)
7 
Semantic Redundancy
Rule Representation
- DL Axioms (T-Box)
- Other semantics (graph pattern substitution)
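To make the T-Box case concrete, here is a minimal sketch of how a subclass axiom can make an A-Box type triple derivable, and therefore removable. The toy triples, the class names and the function `derivable_type_triples` are assumptions made for illustration and do not come from the slides; the sketch applies each axiom only once and does not compute a transitive closure.

```python
# Hypothetical toy data: (subject, predicate, object) triples.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

tbox = {("Student", SUBCLASS, "Person")}
abox = {
    ("alice", TYPE, "Student"),
    ("alice", TYPE, "Person"),  # follows from the axiom and the triple above
    ("bob", TYPE, "Person"),    # not derivable: bob is not asserted as a Student
}


def derivable_type_triples(abox, tbox):
    """Return A-Box type triples implied by another type triple plus one subclass axiom."""
    supers = {}
    for sub, _, sup in tbox:
        supers.setdefault(sub, set()).add(sup)

    redundant = set()
    for s, p, o in abox:
        if p != TYPE:
            continue
        for sup in supers.get(o, ()):
            candidate = (s, TYPE, sup)
            if candidate in abox:
                redundant.add(candidate)
    return redundant


redundant = derivable_type_triples(abox, tbox)
share = len(redundant) / len(abox)  # share of removable triples in this toy A-Box
print(redundant, share)
```

The share computed at the end is only a toy measure for this example and is not claimed to be the redundancy ratio used in the analysis.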
8 
Syntactic Redundancy
Concise syntax
- RDF abbreviation & striping syntax
- Intra-structure & inter-structure
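As an illustration of the abbreviation idea, the sketch below groups triples by subject, in the spirit of the predicate lists that abbreviated RDF syntaxes allow, and counts how many term occurrences are saved. The sample triples and all function names are hypothetical, and the counting scheme is a simplification assumed here rather than the measure used in the analysis.

```python
from collections import defaultdict

# Hypothetical sample data in flat, one-triple-per-statement form.
triples = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:mbox", "<mailto:alice@example.org>"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", '"Bob"'),
]


def group_by_subject(triples):
    """Collect the (predicate, object) pairs of each subject under one entry."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))
    return dict(grouped)


def term_occurrences_flat(triples):
    """Every triple writes out its subject, predicate and object."""
    return 3 * len(triples)


def term_occurrences_grouped(grouped):
    """Each subject is written once, followed by its predicate-object pairs."""
    return sum(1 + 2 * len(pairs) for pairs in grouped.values())


grouped = group_by_subject(triples)
flat = term_occurrences_flat(triples)
compact = term_occurrences_grouped(grouped)
print(flat, compact, flat - compact)
```

The saving grows with the number of triples that share a subject, which is why entity-centric datasets tend to benefit from this kind of grouping.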
9 
Symbolic Redundancy
• http://xmlns.com/foaf/0.1/name
  – 31 bytes in ASCII
URI | ID (4 bytes)
… | …
http://xmlns.com/foaf/0.1/name | 128
… | …
Fewer bytes for basic data units
- (Fixed-length) Dictionary based
- (Variable-length) Huffman coding
- Predictive encoding
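The fixed-length dictionary approach can be sketched as follows: every distinct term is mapped to an integer ID, and each triple is then stored as three 4-byte integers. The sample triples and the function names `build_dictionary` and `encode` are assumptions for illustration; a real store also has to persist the dictionary itself, which this sketch leaves out of the size comparison.

```python
import struct

# Hypothetical sample triples; literals are kept as plain strings for brevity.
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob"),
]


def build_dictionary(triples):
    """Assign each distinct term an integer ID in first-seen order."""
    ids = {}
    for triple in triples:
        for term in triple:
            ids.setdefault(term, len(ids))
    return ids


def encode(triples, ids):
    """Pack every term as a fixed-length 4-byte unsigned integer (big-endian)."""
    return b"".join(
        struct.pack(">III", *(ids[term] for term in triple)) for triple in triples
    )


ids = build_dictionary(triples)
encoded = encode(triples, ids)
raw_bytes = sum(len(term.encode("ascii")) for triple in triples for term in triple)
print(len(encoded), raw_bytes)
```

Variable-length schemes such as Huffman coding would instead give shorter codes to frequent terms; the fixed-length variant is shown because it is the simplest to decode.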
10 
Semantic Redundancy Caused by “Linked” Nature
• Vocabulary Linkage
  – Reuse of other vocabularies: more rules
  – Lower redundancy ratio: more triples derivable
  – More redundancy: co-occurrence triples removable
• Instance Linkage
  – sameAs linkages
  – Bring in new assertions (e.g., type assertions)
  – Bring in new axioms
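The instance-linkage case can be illustrated with a small sketch in which a sameAs link pulls a type assertion from an external dataset into the local one. The datasets, the prefixes `ex:` and `ext:`, and the function `types_via_same_as` are hypothetical; the sketch follows links in one direction and over one hop only.

```python
SAME_AS = "owl:sameAs"
TYPE = "rdf:type"

# Hypothetical local dataset with one link to an external resource.
local = {
    ("ex:aberdeen", TYPE, "ex:City"),
    ("ex:aberdeen", SAME_AS, "ext:Aberdeen"),
}
# Hypothetical external dataset describing the linked resource.
external = {
    ("ext:Aberdeen", TYPE, "ext:PopulatedPlace"),
}


def types_via_same_as(local, external):
    """Return type assertions that local subjects gain through sameAs links."""
    external_types = {}
    for s, p, o in external:
        if p == TYPE:
            external_types.setdefault(s, set()).add(o)

    new_assertions = set()
    for s, p, o in local:
        if p != SAME_AS:
            continue
        for cls in external_types.get(o, ()):
            triple = (s, TYPE, cls)
            if triple not in local:
                new_assertions.add(triple)
    return new_assertions


new_assertions = types_via_same_as(local, external)
print(new_assertions)
```

Each imported assertion can in turn interact with axioms about the external class, which is how linkage changes the amount of derivable and removable data.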
How to analyse?
• Two-dimensional analysis
• Methodology
• Metrics
12 
Two-dimensional analysis
Columns: RDF Redundancy Dimension. Rows: Linked Semantic Dimension.
                             | Semantic | Syntactic | Symbolic
A-Box                        | ✔        | ✔         |
A-Box & T-Box: No Linkage    | ✔        | -         | -
A-Box & T-Box: T-Box Reuse   | ✔        | -         | -
A-Box & T-Box: A-Box Linkage |          | -         | -
13 
Methodology: EDP Summarisation
14 
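Virtually Materialised A-Box: expanded EDP
A-Box: A1(o1), B1(o1), A2(o2), B2(o2), R(o1, o2)
T-Box: A1 ⊆ A, A2 ⊆ A, B1 ⊆ B, B2 ⊆ B
[Diagram: EDP node "A1, B1" (1 entity) linked by R (1:1) to EDP node "A2, B2" (1 entity); each node is expanded with the classes A, B]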
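The expansion on this slide can be approximated with the sketch below, which uses the slide's own A-Box and T-Box. The slides do not define how an EDP is built, so the sketch assumes, purely for illustration, that a pattern is the set of classes asserted for an entity; the relation R is left out, and the names `expand` and `patterns` are hypothetical.

```python
from collections import Counter

# Class assertions from the slide's A-Box: A1(o1), B1(o1), A2(o2), B2(o2).
abox_types = {"o1": {"A1", "B1"}, "o2": {"A2", "B2"}}
# Subclass axioms from the slide's T-Box, stored as subclass -> superclass.
tbox = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}


def expand(classes, tbox):
    """Add the direct superclass of each class (the toy T-Box has no longer chains)."""
    return set(classes) | {tbox[c] for c in classes if c in tbox}


def patterns(types):
    """Count how many entities share each class set."""
    return Counter(frozenset(classes) for classes in types.values())


before = patterns(abox_types)
after = patterns({entity: expand(classes, tbox) for entity, classes in abox_types.items()})
print(before)
print(after)
```

The expanded class sets are computed on the fly and never written back into the A-Box, which is one way to read "virtually materialised".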
Linked Dataset Analysis Results
• Dataset Selection & Summary
• Analysis Results
16 
Dataset Selection and Summary 
LOD 2011
17 
A-Box Only: Semantic Redundancies 
– Redundant Triples 
– Semantic redundancy ratio, i.e. 
– # Graph Patterns used to substitute redundant triples
18 
A-Box Only: Syntactic Redundancies 
– the redundant resource occurrences of inter-structural redundancies
– the syntactic redundancy ratio, i.e.
19 
A-Box & T-Box: No Linkage 
DBLP2013: SWRC ontology 
Ordnance Survey: official published OS ontology 
1.7% 
184% 
108% 
4.7%
20 
A-Box & T-Box: No Linkage 
The first 3 datasets reuse the FOAF ontology
– the number of directly used terms from reused T-Box 
– the number of applicable axioms from (materialised) reused T-Box 
26.9% 
4% 
45.4% 
1.3%
21 
Conclusion
• LOD redundancies are heterogeneous & huge
• Vocabulary linkage might lead to a huge number of derivable triples
• Redundancy-aware techniques are in demand
22 
Redundancy-aware Consumption
• Compression: different redundancies might need different techniques
• For Data Access: (high inter-structure redundancy) skewed entity distributions over EDPs -> efficient access?
• OBDA/Reasoning: A-Box redundancy = fewer T-Box axioms
• Data Publisher: should be aware of the consequences of reusing
Thanks! Q & A

Redundancy analysis on linked data #cold2014 #ISWC2014
