Redundancy analysis on linked data #cold2014 #ISWC2014

How redundant is it? – An empirical analysis on linked datasets
Honghan Wu¹, Boris Villazon-Terrazas², Jeff Z. Pan¹ and José Manuel Gómez Pérez²
¹ University of Aberdeen, UK
² iSOCO, Spain
20/10/2014

Content
• What is data redundancy in linked data?
• Why is it of special interest to linked data consumption?
• Linked Data redundancy categorisation
• How to analyse it?
• Dataset selection & the results
• Conclusion

What is data redundancy in LD?
• Data Redundancy
  – [Database systems] The same piece of data stored in multiple places
  – [Information theory] Wasted "space" used to transmit certain data
• (In this work) Linked Data Redundancy
  – Wasted “space” used to represent a certain meaning (under certain semantics)
  – Duplication-free

Why is it of special interest to LD consumption?
• Bad Redundancy & Good Redundancy
  – Bad for exchange: storage, transmission
  – Good for inference computation
• Relevant consumption tasks
  – Hosting/Sharing
  – Query Answering (SPARQL)
  – Ontology Based Data Access
  – Reasoning

Redundancy in Linked Data
• Redundancy Categorisation for RDF Data
• Redundancies caused by the “Linked” nature

RDF Redundancies vs. Succinct Representations
[Rule based] A. K. Joshi, P. Hitzler, and G. Dong. Logical linked data compression. In The Semantic Web: Semantics and Big Data, pages 170–184. Springer, 2013.
[HDT] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias. Binary RDF representation for publication and exchange (HDT). Web Semant., 19:22–41, Mar. 2013.
[WaterFowl] O. Curé, G. Blin, D. Revuz, and D. C. Faye. WaterFowl: A compact, self-indexed and inference-enabled immutable RDF store. In The Semantic Web: Trends and Challenges, pages 302–316. Springer, 2014.
J. Z. Pan, J. M. Gómez-Pérez, Y. Ren, H. Wu, H. Wang, and M. Zhu. Graph pattern based RDF data compression. In Proc. of the 4th Joint International Semantic Technology Conference (JIST), 2014. (To appear)

Semantic redundancy
Rule Representation
- DL Axioms (T-Box)
- Other semantics (graph pattern substitution)
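
As a rough illustration of semantic redundancy under T-Box axioms, the Python sketch below flags type assertions that can be derived from other assertions through subclass axioms. The class names, individuals and the helper `redundant_type_assertions` are hypothetical, and the sketch assumes an acyclic subclass hierarchy; it is not the analysis tool used in this work.

```python
def superclasses(cls, subclass_of):
    """Return all strict superclasses of cls (transitive closure)."""
    seen, stack = set(), [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def redundant_type_assertions(types, subclass_of):
    """Return (entity, class) assertions derivable from a more specific
    class asserted for the same entity."""
    redundant = set()
    for entity, classes in types.items():
        for cls in classes:
            for sup in superclasses(cls, subclass_of):
                if sup in classes:
                    redundant.add((entity, sup))
    return redundant


# Hypothetical T-Box: Student ⊆ Person, Person ⊆ Agent
subclass_of = {"Student": {"Person"}, "Person": {"Agent"}}
# Hypothetical A-Box type assertions
types = {
    "alice": {"Student", "Person", "Agent"},
    "bob": {"Person"},
}

redundant = redundant_type_assertions(types, subclass_of)
print(sorted(redundant))
```

Under these assumptions, removing the flagged assertions loses no meaning as long as the T-Box is kept, because a reasoner can derive them again.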

Syntactic Redundancy
Concise syntax
- RDF abbreviation & striping syntax
- Intra-structure & Inter-structure
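
One way to picture syntactic redundancy is to count how many resource occurrences a flat, one-triple-per-line serialisation needs compared with a subject-grouped one (in the spirit of Turtle's `;` abbreviation). The sketch below is only an illustration: the triples and the helper names are hypothetical, and it does not reproduce the metrics used in this work.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object)
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", '"Bob"'),
]


def group_by_subject(triples):
    """Collect the (predicate, object) pairs of each subject."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))
    return grouped


def flat_term_count(triples):
    """Every triple writes its subject, predicate and object."""
    return 3 * len(triples)


def grouped_term_count(triples):
    """Each subject is written once, followed by its predicate-object pairs."""
    return sum(1 + 2 * len(pairs) for pairs in group_by_subject(triples).values())


def render_grouped(triples):
    """Render a Turtle-like text in which subjects are not repeated."""
    blocks = []
    for s, pairs in group_by_subject(triples).items():
        body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
        blocks.append(f"{s} {body} .")
    return "\n".join(blocks)


saved = flat_term_count(triples) - grouped_term_count(triples)
print(render_grouped(triples))
print("term occurrences saved:", saved)
```

The saving grows with the number of triples that share a subject, which is why the choice of syntax matters for exchange.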

Symbolic Redundancy
• http://xmlns.com/foaf/0.1/name
  – 31 bytes in ASCII
URI                               ID (4 bytes)
…                                 …
http://xmlns.com/foaf/0.1/name    128
…                                 …
Fewer bytes for basic data units
- (Fixed-length) Dictionary based
- (Variable-length) Huffman coding
- Predictive encoding
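
The fixed-length dictionary idea can be sketched in a few lines of Python: each distinct term is mapped to an integer ID, and every occurrence is then stored as a 4-byte unsigned integer. The example triples and the helper names `build_dictionary` and `encode` are assumptions made for illustration; real systems such as HDT use more elaborate dictionaries.

```python
import struct


def build_dictionary(triples):
    """Assign each distinct term a numeric ID in first-seen order."""
    term_to_id = {}
    for triple in triples:
        for term in triple:
            term_to_id.setdefault(term, len(term_to_id))
    return term_to_id


def encode(triples, term_to_id):
    """Pack every term of every triple as a 4-byte big-endian unsigned int."""
    return b"".join(
        struct.pack(">III", *(term_to_id[term] for term in triple))
        for triple in triples
    )


FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
# Hypothetical triples that repeat the same predicate URI
triples = [
    ("http://example.org/alice", FOAF_NAME, "Alice"),
    ("http://example.org/bob", FOAF_NAME, "Bob"),
]

term_to_id = build_dictionary(triples)
encoded = encode(triples, term_to_id)
raw_bytes = sum(len(term.encode("utf-8")) for triple in triples for term in triple)

print("raw term bytes:", raw_bytes)
print("encoded triple bytes:", len(encoded))
```

The dictionary itself still has to be stored once, so the saving comes from terms that occur many times, such as frequently used predicate URIs.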

Semantic Redundancy Caused by “Linked” Nature
• Vocabulary Linkage
  – Reuse of other vocabularies: more rules
  – Lower redundancy ratio: more triples derivable
  – More redundancy: co-occurrence triples removable
• Instance Linkage
  – sameAs linkages
  – Bring in new assertions (e.g., type assertions)
  – Bring in new axioms
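
To make the instance linkage point concrete, the sketch below merges entities connected by sameAs links and lets each entity inherit the type assertions of its equivalents. The identifiers, the class names and the helper `propagate_types` are hypothetical, and the union-find merge is just one simple way to handle sameAs; it is not claimed to be the approach taken in this work.

```python
def find(parent, x):
    """Return the representative of x's equivalence class (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def propagate_types(types, same_as):
    """Give every entity the union of the types of its sameAs-equivalents."""
    parent = {}
    for a, b in same_as:
        root_a, root_b = find(parent, a), find(parent, b)
        parent[root_a] = root_b
    merged = {}
    for entity, classes in types.items():
        merged.setdefault(find(parent, entity), set()).update(classes)
    entities = set(types) | set(parent)
    return {e: set(merged.get(find(parent, e), set())) for e in entities}


# Hypothetical type assertions from two datasets
types = {
    "ds1:aberdeen": {"ds1:City"},
    "ds2:aberdeen": {"ds2:Place"},
}
# Hypothetical sameAs link between the two datasets
same_as = [("ds1:aberdeen", "ds2:aberdeen")]

expanded = propagate_types(types, same_as)
new_assertions = sum(len(expanded[e]) - len(types.get(e, ())) for e in expanded)
print(expanded)
print("type assertions brought in by linkage:", new_assertions)
```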

How to analyse it?
• Two dimension analysis
• Methodology
• Metrics

Two dimension analysis
RDF Redundancy Dimension (columns): Semantic, Syntactic, Symbolic
Linked Semantic Dimension (rows):
A-Box: ✔ ✔
A-Box & T-Box, No Linkage: ✔ - -
A-Box & T-Box, T-Box Reuse: ✔ - -
A-Box & T-Box, A-Box Linkage: - -

Methodology: EDP Summarisation

Virtually Materialised A-Box: expanded EDP
A-Box: A1(o1), B1(o1), A2(o2), B2(o2), R(o1, o2)
T-Box: A1⊆A, A2⊆A, B1⊆B, B2⊆B
[Figure: EDP graph with nodes "A1, B1 (1)" and "A2, B2 (1)" connected by R (1:1); each node is expanded with the inferred classes A, B]
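
The example on this slide can be replayed with a small Python sketch. It treats an EDP simply as the set of classes asserted for an entity, which is an assumption made for illustration since the slides do not spell out the definition, and it expands each pattern with the superclasses implied by the T-Box instead of materialising the inferred triples. The helper names `expand` and `summarise` are hypothetical.

```python
from collections import defaultdict

# A-Box and T-Box from the slide
type_assertions = [("o1", "A1"), ("o1", "B1"), ("o2", "A2"), ("o2", "B2")]
relation_assertions = [("o1", "R", "o2")]
subclass_of = {"A1": {"A"}, "A2": {"A"}, "B1": {"B"}, "B2": {"B"}}


def expand(classes, subclass_of):
    """Close a set of classes under the subclass axioms."""
    closed, stack = set(classes), list(classes)
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return frozenset(closed)


def summarise(type_assertions, relation_assertions, subclass_of=None):
    """Group entities by class pattern; count entities per pattern and
    relation instances between patterns."""
    classes = defaultdict(set)
    for entity, cls in type_assertions:
        classes[entity].add(cls)
    pattern_of = {
        e: expand(c, subclass_of) if subclass_of else frozenset(c)
        for e, c in classes.items()
    }
    nodes, edges = defaultdict(int), defaultdict(int)
    for pattern in pattern_of.values():
        nodes[pattern] += 1
    for s, r, o in relation_assertions:
        key = (pattern_of.get(s, frozenset()), r, pattern_of.get(o, frozenset()))
        edges[key] += 1
    return dict(nodes), dict(edges)


plain_nodes, plain_edges = summarise(type_assertions, relation_assertions)
expanded_nodes, expanded_edges = summarise(
    type_assertions, relation_assertions, subclass_of
)
for pattern, count in expanded_nodes.items():
    print(sorted(pattern), count)
```

In this sketch the expansion works on the summary rather than on individual triples, so the inferred classes A and B are added once per pattern instead of once per entity.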

Linked Dataset Analysis Results
• Dataset Selection & Summary
• Analysis Results

Dataset Selection and Summary
LOD 2011

A-Box Only: Semantic Redundancies
– Redundant Triples
– Semantic redundancy ratio, i.e.
– # Graph Patterns used to substitute redundant triples

A-Box Only: Syntactic Redundancies
– the redundant resource occurrences of inter-structural redundancies
– the syntactic redundancy ratio, i.e.

A-Box & T-Box: No Linkage
DBLP2013: SWRC ontology
Ordnance Survey: officially published OS ontology
[Chart; labelled values: 1.7%, 184%, 108%, 4.7%]

A-Box & T-Box: No Linkage
The first 3 datasets reuse the FOAF ontology
– the number of directly used terms from the reused T-Box
– the number of applicable axioms from the (materialised) reused T-Box
[Chart; labelled values: 26.9%, 4%, 45.4%, 1.3%]

Conclusion
• LOD redundancies are heterogeneous & huge
• Vocabulary linkage might lead to a huge number of derivable triples
• Redundancy-aware techniques are needed

Redundancy-aware Consumption
• Compression: different redundancies might need different techniques
• For Data Access: (high inter-structure redundancy) skewed entity distributions over EDPs -> efficient access?
• OBDA/Reasoning: A-Box redundancy = fewer T-Box axioms
• Data Publisher: should be aware of the consequences of vocabulary reuse
