Mining and Managing Large-scale Linked Open Data

Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
GVDB, Nörten-Hardenberg, May 25, 2016
Thanks to: Chifumi Nishioka, Renata Dividino, Thomas Gottron,
and many more …

Team Knowledge Discovery
• Ansgar Scherp
• Ahmed Saleh
• Chifumi Nishioka
• Falk Böschen
• Mohammad Abdel-Qader
• Till Blume
• Anke Koslowski (Secretariat)
• Henrik Schmidt (Engineer)
• Lukas Galke
• Florian Mai

Linked Open Data (LOD) Cloud
• Publishing and interlinking data on the web
• Different quality, purpose, and sources
• Using the Resource Description Framework (RDF)
World Wide Web        LOD Cloud
Documents             Data
Hyperlinks via <a>    Typed Links
HTML                  RDF
Addresses (URIs)      Addresses (URIs)

Relevance of Linked Data?

1000+ Datasets, 50+ Billion Triples
[Figure: LOD cloud diagram, Linked Data from May ‘07 to August ‘14. Domains: Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, Life Sciences, Social Networking]
Source: http://lod-cloud.net

LOD on One Slide: Example Graph
RDF Triple
• Subject: biglynx:matt-briggs
• Predicate: rdf:type
• Object: foaf:Person
Fully qualified URI using vocabulary prefixes:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix biglynx: <http://biglynx.co.uk/people/> .

[Figure: the example graph around biglynx:matt-briggs with its rdf:type edge to foaf:Person, extended by a second rdf:type edge to biglynx:Director; further edges elided (…)]

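[Figure: full example graph with the annotations Entity, Types, and Properties. Nodes: biglynx:matt-briggs, biglynx:dave-smith, foaf:Person, biglynx:Director, dp:London, _1:point, and the literals “-0.118” and “51.509”. Edges: rdf:type, foaf:knows, foaf:based_near, ex:loc, wgs84:lat, wgs84:long; further edges elided (…)]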

Motivation for the SchemEX Index
• Single entry point to query the LOD cloud
• Search for data sources containing entities like
– ‘Persons, who are Politicians and Actors’
– ‘Research data sets’
– ‘Scientific publications’
Query
SELECT ?x
FROM …
WHERE {
?x rdf:type ex:Actor .
?x rdf:type ex:Politician . }
[Figure: the query is posed to the index, which points to the matching data sources]

Input Data for SchemEX
• Quads: <subject> <predicate> <object> <context>
• Example:
<http://biglynx.co.uk/people/matt-briggs>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>
<http://biglynx.co.uk/people/matt-briggs.rdf>
[Figure: the triple biglynx:matt-briggs rdf:type foaf:Person inside its context <http://biglynx.co.uk/people/matt-briggs.rdf>, shown as Dataset 𝑋 in the LOD Cloud]
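
To make the quad format concrete, here is a minimal Python sketch that splits one such line into subject, predicate, object, and context. The function name `parse_quad` and the regular expression are assumptions made for illustration; the sketch handles only terms written as `<...>` URIs, not literals or blank nodes, and it is not the parser used by SchemEX.

```python
import re

# Matches a line of four <...> terms, optionally terminated by a dot.
# Literals and blank nodes are deliberately not covered by this sketch.
QUAD_PATTERN = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$"
)


def parse_quad(line):
    """Return (subject, predicate, object, context), or None if the line does not match."""
    match = QUAD_PATTERN.match(line)
    return match.groups() if match else None


# The example quad from the slide, written on a single line.
example = (
    "<http://biglynx.co.uk/people/matt-briggs> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://biglynx.co.uk/people/matt-briggs.rdf> ."
)
quad = parse_quad(example)
subject, predicate, obj, context = quad
```

The fourth component, the context, is what a schema-level index can later use to point back to the data source.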

SchemEX Idea
• Schema-level index SchemEX
– Assign RDF entities to graph patterns
– Map graph patterns to data sources (context)
– Defined over entities, but store the context
• Construction of schema-level index
– Stream-based for scalability
– Stratified bi-simulation for detecting patterns
– Little loss of accuracy
[KGS+12]
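
The idea of mapping graph patterns to data sources can be illustrated with a strongly simplified Python sketch. It uses only one kind of pattern, the set of rdf:type values of an entity, and ignores properties and bi-simulation entirely. The function names `build_type_index` and `lookup`, the prefixed names, and the context names are hypothetical; this is not the SchemEX implementation.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # prefixed names are used for brevity


def build_type_index(quads):
    """Map each type set to the contexts that contain an entity with exactly that type set."""
    types_of = defaultdict(set)     # entity -> its types
    contexts_of = defaultdict(set)  # entity -> contexts of its type statements
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            types_of[s].add(o)
            contexts_of[s].add(c)
    index = defaultdict(set)
    for entity, types in types_of.items():
        index[frozenset(types)] |= contexts_of[entity]
    return index


def lookup(index, required_types):
    """Return all contexts holding entities that have at least the required types."""
    required = set(required_types)
    sources = set()
    for type_set, contexts in index.items():
        if required <= type_set:
            sources |= contexts
    return sources


# Hypothetical input data: one entity is both actor and politician.
quads = [
    ("ex:alice", RDF_TYPE, "ex:Actor", "source-a.rdf"),
    ("ex:alice", RDF_TYPE, "ex:Politician", "source-a.rdf"),
    ("ex:bob", RDF_TYPE, "ex:Actor", "source-b.rdf"),
]
index = build_type_index(quads)
sources = lookup(index, ["ex:Actor", "ex:Politician"])
```

The lookup corresponds to the query for persons who are politicians and actors: it returns data sources, not the entities themselves.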

Building the Index from a Stream
• Stream of quads coming from a LD crawler
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
[Figure: FiFo cache over the incoming quads; cached entities are assigned to the classes C1, C2, and C3]
+ Reasonable accuracy at cache size of 50k
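
Why a bounded cache trades accuracy for memory can be seen in a small Python sketch. It assumes a FIFO cache of entities: when the cache is full, the oldest entity is evicted and its type set, as seen so far, is written to the index. The function `index_stream` and its data layout are illustrative assumptions, restricted to rdf:type statements.

```python
from collections import OrderedDict, defaultdict


def index_stream(quads, cache_size):
    """Build a type-set index from a quad stream using a FIFO cache of entities."""
    cache = OrderedDict()     # entity -> (types, contexts), oldest entity first
    index = defaultdict(set)  # type set -> contexts

    def flush(types, contexts):
        index[frozenset(types)] |= contexts

    for s, p, o, c in quads:
        if p != "rdf:type":
            continue
        if s not in cache:
            if len(cache) >= cache_size:
                # Evict the entity that entered the cache first.
                _, (old_types, old_contexts) = cache.popitem(last=False)
                flush(old_types, old_contexts)
            cache[s] = (set(), set())
        types, contexts = cache[s]
        types.add(o)
        contexts.add(c)

    # Write out whatever is still cached at the end of the stream.
    for types, contexts in cache.values():
        flush(types, contexts)
    return index


# Hypothetical stream: the two statements about ex:alice are interrupted by ex:bob.
stream = [
    ("ex:alice", "rdf:type", "ex:Actor", "source-a.rdf"),
    ("ex:bob", "rdf:type", "foaf:Person", "source-b.rdf"),
    ("ex:alice", "rdf:type", "ex:Politician", "source-a.rdf"),
]
large_cache = index_stream(stream, cache_size=2)
small_cache = index_stream(stream, cache_size=1)
```

With a cache of size 2, ex:alice is indexed with both types. With a cache of size 1 she is evicted too early and ends up under two incomplete type sets, which is the kind of accuracy loss a larger window reduces.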

Full BTC 2011 Data Set: 2.17 Bn Triples
Cache size: 50 k
Winner BTC’11
+ Linear runtime with respect to number of triples
+ Memory consumption scales with window size

[Figure: search interface inspired by Google, showing a result list with examples and options for generalization and specialization of the query] [GSK+13]

LODatio Under the Hood
[Figure: architecture diagram with the labelled components SPARQL, Query translation, Retrieve, Data Sources, Rank, Snippets, Count, Generalize, Specialize, and Select]
• Hybrid database with off-the-shelf components

LOD on One Slide: Recap
[Figure: the example graph around biglynx:matt-briggs, with the entity's Type Set (TS) and Property Set (PS) highlighted]

Information theoretic analyses of LOD
• How much information is encoded in TS and PS?
• How much information is encoded in one of them, once the other (TS or PS) is known?
• To which degree are TS and PS redundant?
• Example: 20% of pay-level domains (PLDs) do not need TS (6% for PS)
[GKS15]
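
The three questions correspond to entropy, conditional entropy, and mutual information. The following Python sketch computes them from observed (TS, PS) pairs, one pair per entity. The function names and the toy data are assumptions for illustration, not the analysis code behind [GKS15].

```python
import math
from collections import Counter


def entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of the observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conditional_entropy(pairs):
    """H(Y | X) for (x, y) observations, computed as H(X, Y) - H(X)."""
    return entropy(pairs) - entropy([x for x, _ in pairs])


def mutual_information(pairs):
    """I(X; Y) = H(X) + H(Y) - H(X, Y): the information shared by X and Y."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)


# Hypothetical entities, each described by a (type set, property set) pair.
person = (frozenset({"foaf:Person"}), frozenset({"foaf:knows"}))
actor = (frozenset({"ex:Actor"}), frozenset({"ex:playsIn"}))
entities = [person, person, actor, actor]

h_ts = entropy([ts for ts, _ in entities])
h_ps_given_ts = conditional_entropy(entities)
shared = mutual_information(entities)
```

In this toy data the property set is fully determined by the type set, so H(PS | TS) is zero and the two are completely redundant.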

Käfer et al.’s Temporal Analysis of LOD
• 29 weekly LOD snapshots of ~100 million triples
• Still running since May 2012 (now 200+ weeks)
• Data on the cloud changes a lot
• But vocabularies defining RDF types and properties are highly static, e.g., RDF, FOAF
[Figure: LOD cloud ~2012 next to LOD cloud ~2014, labelled “Changes?”]
[Käfer et al., 2013] T. Käfer, A. Abdelrahman, J. Umbrich, P. O'Byrne, A. Hogan: Observing Linked Data Dynamics. ESWC 2013: 213-227

But: Do Changes Occur in PS and TS?
• Analysis: expected conditional entropy over time
• 𝐻(𝑃𝑆|𝑇𝑆 = 𝑡𝑠): entropy of 𝑃𝑆 given that 𝑇𝑆 is known
• Observation: types become less important
• Changes in the use of TS and PS?!
[Figure: plots of 𝐻(𝑃𝑆|𝑇𝑆=𝑡𝑠) and 𝐻(𝑇𝑆|𝑃𝑆=𝑝𝑠) over time]

Changes over Time
• Extended characteristic sets: ECS = PS ∪ TS [Neumann and Moerkotte, 2011]
[Figure: # of ECS per week; Avg.: 83.898 ECS per week] [DSG+13]
• Avg. 73% of ECS re-occur next week (orange)
• Avg. 35% of ECS remain unchanged (blue)
• Avg. 20% of entity sets of ECS change / week
[Neumann and Moerkotte, 2011] Thomas Neumann, Guido Moerkotte: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. ICDE 2011: 984-994
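
As a rough illustration of the ECS statistics, the Python sketch below derives one ECS per entity from a set of triples and measures how many distinct ECS of one snapshot re-occur in the next. The function names, the prefixed identifiers, and the choice to count rdf:type itself as a property are assumptions; the sketch is not the analysis pipeline of [DSG+13].

```python
RDF_TYPE = "rdf:type"


def extended_characteristic_sets(triples):
    """Return {entity: ECS}, where ECS is the union of the entity's properties and types."""
    ecs = {}
    for s, p, o in triples:
        items = ecs.setdefault(s, set())
        items.add(p)      # property set (PS)
        if p == RDF_TYPE:
            items.add(o)  # type set (TS)
    return {entity: frozenset(items) for entity, items in ecs.items()}


def reoccurrence_rate(ecs_first, ecs_second):
    """Share of distinct ECS in the first snapshot that also occur in the second."""
    first = set(ecs_first.values())
    second = set(ecs_second.values())
    return len(first & second) / len(first) if first else 0.0


# Two hypothetical weekly snapshots.
week1 = [
    ("ex:a", RDF_TYPE, "foaf:Person"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:c", RDF_TYPE, "ex:Actor"),
]
week2 = [
    ("ex:a", RDF_TYPE, "foaf:Person"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:c", RDF_TYPE, "ex:Director"),
]
rate = reoccurrence_rate(
    extended_characteristic_sets(week1),
    extended_characteristic_sets(week2),
)
```

Here one of the two ECS from the first week is found again in the second week, giving a re-occurrence rate of 0.5.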

Temporal Dynamics of the Entities?
• Notion of entity motivated by ECS: entity is a set of triples 𝑋 sharing the same subject URI 𝑠 (w.l.o.g.)
• Example:
– 1 entity
– 4 triples
• Useful to keep LOD caches up-to-date?
• Can we predict when LOD sources will change?

Dynamics Function Θ
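• Definition of Θ over change rate function $c(X_t)$ [DGS+14]
$\Theta_{t_i}^{t_j}(X) = \Theta(X_{t_j}) - \Theta(X_{t_i}) = \int_{t_i}^{t_j} c(X_t)\,\mathrm{d}t$
• Approximation as step function over changes
$\Theta_{t_i}^{t_j}(X) \approx \sum_{k=i+1}^{j} \delta(X_{t_{k-1}}, X_{t_k})$
• Θ is monotone and non-negative
[Figure: change rate c and dynamics Θ of X over time, between $t_i$ and $t_j$]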

Update Strategies for LOD Sources
• Apply strategies for keeping caches of WWW documents up-to-date to the maintenance of LOD caches
• Assumptions
– LOD is fetched from various sources
– Sources are scored and prioritized based on the strategy
– Data of a source is fetched only when the operation can be executed entirely
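
The assumptions above can be sketched as a simple scheduler in Python: sources are ranked by a strategy-specific score, and a source is fetched only if its whole data still fits into the remaining budget. The function `schedule_updates`, the tuple layout, and the decision to skip a source that does not fit and continue with the next one are assumptions for illustration.

```python
def schedule_updates(sources, budget):
    """Pick sources to update, highest score first, within a fixed budget.

    sources: iterable of (name, score, size) tuples, where size is the cost of
    fetching the complete source (for example, its number of triples).
    """
    selected = []
    remaining = budget
    for name, score, size in sorted(sources, key=lambda s: s[1], reverse=True):
        # A source is fetched only if it can be fetched entirely.
        if size <= remaining:
            selected.append(name)
            remaining -= size
    return selected


# Hypothetical sources with scores assigned by some update strategy.
sources = [
    ("source-a", 0.9, 60),
    ("source-b", 0.5, 50),
    ("source-c", 0.1, 30),
]
plan = schedule_updates(sources, budget=100)
```

In this example source-b is skipped because only 40 units of budget remain after source-a, while the smaller source-c still fits.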

Scheduling Update Strategies
a) HTTP Header [Dividino et al., 2014a]
b) Age or Last Visited [Dasdan et al., 2009, Cho and Garcia-Molina, 2000]
c) PageRank [Page et al., 1999, Boldi et al., 2004, Baeza-Yates et al., 2005]
d) LOD Source Size
e) Change Ratio [Douglis et al., 1997, Cho et al., 2002, Tan et al., 2007]
f) Change Rate [Olston et al., 2002, Ntoulas et al., 2004, Dividino et al., 2013]
g) History Information: Dynamics [Dividino et al., 2014b]
We borrow strategies developed for the WWW and
metrics for data change analysis in the LOD cloud.

e) Change Ratio
• Captures the change frequency of the data (freshness)
• Percentage of data items in the cache that are up-to-date
[Figure: ranking over time between $t_i$ and $t_j$, from sources which changed (most) down to sources that did not change or changed less]

f) Change Rate
• Data from sources which are less similar to their previous update (snapshot) should be updated first
• Comparison of two RDF data sets
– 𝑋 : Set of triple statements
– 𝛿 : Numeric expression (distance) in $[0, \infty)$
• Example: $\delta_{Jaccard}(X_1, X_2) = 1 - \frac{|X_1 \cap X_2|}{|X_1 \cup X_2|}$
[Figure: 𝛿 over time between $t_i$ and $t_j$]
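
The Jaccard distance between two snapshots can be computed directly on sets of triples. The Python sketch below is a minimal version; the function name `jaccard_distance`, the treatment of two empty snapshots as identical, and the toy triples are assumptions.

```python
def jaccard_distance(x1, x2):
    """Return 1 - |X1 ∩ X2| / |X1 ∪ X2| for two collections of triples."""
    x1, x2 = set(x1), set(x2)
    union = x1 | x2
    if not union:
        return 0.0  # assumption: two empty snapshots count as identical
    return 1.0 - len(x1 & x2) / len(union)


# Two hypothetical snapshots of the same source: one triple was replaced.
previous = {
    ("ex:a", "rdf:type", "foaf:Person"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "ex:loc", "ex:London"),
}
current = {
    ("ex:a", "rdf:type", "foaf:Person"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "ex:loc", "ex:Kiel"),
}
distance = jaccard_distance(previous, current)
```

Two triples are shared and the union holds four, so the distance is 0.5. Sources with a larger distance to their previous snapshot would be ranked higher by the change rate strategy.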

g) History Information: Dynamics
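• Data from sources which evolve most in a given period of time should be updated first
• Uses both history information and change rate
$\Theta(X_{t_j}) - \Theta(X_{t_i}) = \int_{t_i}^{t_j} c(X_t)\,\mathrm{d}t \approx \sum_{k=i+1}^{j} \delta(X_{t_{k-1}}, X_{t_k})$
[Figure: change rate c and dynamics Θ of X over time, between $t_i$ and $t_j$]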
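
The step-function approximation is a sum of distances between consecutive snapshots. The Python sketch below takes the distance 𝛿 as a parameter; the function name `dynamics` and the example distance (number of added plus removed triples) are assumptions chosen to keep the arithmetic easy to follow.

```python
def dynamics(snapshots, delta):
    """Approximate the dynamics as the sum of delta over consecutive snapshot pairs."""
    return sum(delta(a, b) for a, b in zip(snapshots, snapshots[1:]))


def added_plus_removed(x1, x2):
    """Example distance (an assumption): size of the symmetric difference."""
    return len(set(x1) ^ set(x2))


# Hypothetical history of one source; triples are abbreviated to integers.
history = [{1, 2}, {1, 2, 3}, {1, 2, 3}, {3}]
theta = dynamics(history, added_plus_removed)

# Ranking hypothetical sources by their dynamics, most dynamic first.
histories = {
    "source-a": history,
    "source-b": [{1}, {1}, {1}, {1}],
}
ranking = sorted(
    histories,
    key=lambda name: dynamics(histories[name], added_plus_removed),
    reverse=True,
)
```

The consecutive distances are 1, 0, and 2, so the dynamics of the first source is 3. Because every summand is non-negative, the value can only grow as more snapshots are added.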

Evaluation
• Idea: simulate limited computational resources (network bandwidth, computation time)
[Figure: timeline from $t_i$ to $t_{i+1}$, labelled 100%]

Evaluation: Single Step Update
[Figure: timeline from $t_i$ (100%) to $t_{i+1}$, with a single update at 5%, 15%, 40%, 60%, 75%, or 95%]

Evaluation: Iterative Updates
[Figure: timeline from $t_i$ (100%) over $t_{i+1}$ and $t_{i+2}$ and beyond, with an update at each step at 5%, 15%, 40%, 60%, 75%, or 95%]

Dataset
• Dynamic Linked Data Observatory
• Weekly snapshots, 14 M triples
– 154 snapshots (approx. 3 years)
– 590 data sources (PLD)

Top 10 largest data sources       Average size
dbpedia.org                       3,406,364.5
edgarwrap.ontologycentral.com     982,631.0
dbtune.org                        864,107.6
dbtropes.org                      787,299.9
data.linkedct.org                 498,986.3
aims.fao.org                      416,708.9
www.legislation.gov.uk            399,601.6
kent.zpr.fer.hr                   387,034.8
identi.ca                         278,316.2
webenemasuno.linkeddata.es        250,557.9

Metrics: Precision & Recall
• Precision: portion of cached data that is actually up-to-date
• Recall: portion of data in the LOD cloud that is identical to the cached data
[Figure: overlap between the cached data and the actual data on the LOD cloud (w.r.t. the 590 sources considered)]
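
Both metrics can be computed on sets of triples. The Python sketch below is a minimal version; the function name `precision_recall`, the convention of returning 1.0 for an empty cache or an empty cloud, and the abbreviated triples are assumptions.

```python
def precision_recall(cached, actual):
    """Compare cached triples with the triples currently on the LOD cloud."""
    cached, actual = set(cached), set(actual)
    correct = cached & actual  # cached triples that are still valid
    precision = len(correct) / len(cached) if cached else 1.0
    recall = len(correct) / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical data: t3 and t4 are outdated, t5 is missing from the cache.
cached = {"t1", "t2", "t3", "t4"}
actual = {"t1", "t2", "t5"}
precision, recall = precision_recall(cached, actual)
```

Two of the four cached triples are still valid (precision 0.5), and two of the three triples on the cloud are present in the cache (recall of about 0.67).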

Results: Single Step Update
[Figure: results for a single step update from $t_i$ (100%) to $t_j$, at 5%, 15%, 40%, 60%, 75%, and 95%]

Results: Iterative Updates
[Figure: results for iterative updates starting from $t_i$ (100%), at 5%, 15%, 40%, 60%, 75%, and 95% per step]

Results: Summary
→ Best strategies: those which capture the change behaviour over time
→ Especially for low relative bandwidth

Dynamics Function Θ: Revisited
[Figure: change rate c of X over time, between $t_i$ and $t_j$]
• Can we predict when LOD sources will change?
• Notion of dynamics to compute periodicities!
• Dynamics as vector of changes: $\langle \delta(X_{t_1}, X_{t_2}), \ldots, \delta(X_{t_{N-1}}, X_{t_N}) \rangle$

Temporal Clustering of Entities
• Dynamics as vector: $\langle \delta(X_{t_1}, X_{t_2}), \ldots, \delta(X_{t_{N-1}}, X_{t_N}) \rangle$
• Clustering with k-means++ to find patterns [NS15]
• 165 snapshots
• 65,044 entities
• 7 patterns (after optimizing 𝑘)
[Figure: change (log scale) over time]

Periodicity of Entity Dynamics
• Examples: < 0, 3, 2, 0, 3, 2, 0 >, < 1, 2, 1, 2, 1, 2 >
• Convolution-based algorithm [Elfeky et al., 2005]
• Entities of legislation.gov.uk found in several clusters (C1, C3, C4, C5, C6)
• No changes (CS): 77.29%
• CS: entities from w3.org and ontologydesignpatterns.org

Cluster   # of entities   Most likely periodicity
C1        12,982          66
C2        168             23
C3        35              1
C4        12              1
C5        1               1
C6        1,541           56
C7        30              37
CS        50,725

[Elfeky et al., 2005] Mohamed G. Elfeky, Walid G. Aref, Ahmed K. Elmagarmid: Periodicity Detection in Time Series Databases. IEEE Trans. Knowl. Data Eng. 17(7): 875-887 (2005)
Application Areas: More than One!
• Searching for LOD sources [GSK+13, KGS+12]
• Strategies for updating data caches [DGS15]
• Programming queries against LOD [SSS12]
• Recommending LOD vocabularies [SGS16]
→ Foundation for Future Data-driven Applications

Summary: KDD in Social Media & DL
How to deal with the vast amount of content related to
research and innovation?
• H2020 INSO-4 project, duration: 04/2016-03/2019
• Data mining & visualization tools enabling information
professionals to deal with large corpora
• Website: http://www.moving-project.eu/
New

Got Interested?
Knowledge Discovery at ZBW
Contact me!
Prof. Dr. Ansgar Scherp
• Email: a.scherp@zbw.eu
• Twitter: https://twitter.com/ansgarscherp
• Slideshare: http://de.slideshare.net/ascherp
• KD-Website:
http://www.zbw.eu/en/research/knowledge-discovery/
http://www.kd.informatik.uni-kiel.de/en/

References
[DGS15] R. Dividino, T. Gottron, A. Scherp: Strategies for Efficiently Keeping Local
Linked Open Data Caches Up-To-Date. International Semantic Web Conference (2)
2015: 356-373
[DGS+14] R. Dividino, T. Gottron, A. Scherp, G. Gröner: From Changes to Dynamics:
Dynamics Analysis of Linked Open Data Sources. PROFILES@ESWC 2014
[GKS15] T. Gottron, M. Knauf, A. Scherp: Analysis of schema structures in the Linked
Open Data graph based on unique subject URIs, pay-level domains, and vocabulary
usage. Distributed and Parallel Databases 33(4): 515-553 (2015)
[DSG+13] R. Dividino, A. Scherp, G. Gröner, T. Gottron: Change-a-LOD: Does the
Schema on the Linked Data Cloud Change or Not? COLD 2013
[GSK+13] T. Gottron, A. Scherp, B. Krayer, A. Peters: LODatio: using a schema-level
index to support users in finding relevant sources of linked data. K-CAP 2013: 105-108
[KGS+12] M. Konrath, T. Gottron, S. Staab, A. Scherp: SchemEX - Efficient construction
of a data catalogue by stream-based indexing of linked data. J. Web Sem. 16: 52-58
(2012)
[NS15] C. Nishioka, A. Scherp: Temporal Patterns and Periodicity of Entity Dynamics in
the Linked Open Data Cloud. K-CAP 2015
[SGS16] J. Schaible, T. Gottron, A. Scherp: TermPicker: Enabling the Reuse of
Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud. ESWC,
Springer, 2016
[SSS12] S. Scheglmann, A. Scherp, S. Staab: Declarative Representation of
Programming Access to Ontologies. ESWC 2012: 659-673

a) HTTP Header
• Data from sources which have been changed
since the last update should be updated first
HTTP Response
HEADER
…
Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
CONTENT
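
Checking the Last-Modified header against the time of the last cache update can be sketched in Python with the standard library. The function name `changed_since` is an assumption, and the sketch leaves out the HTTP request itself; it only compares the header value with a timezone-aware timestamp.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(last_modified_header, last_update):
    """Return True if the source reports a modification after our last update.

    last_update must be a timezone-aware datetime.
    """
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_update


# The header value from the slide, compared with two hypothetical update times.
header = "Tue, 15 Nov 1994 12:45:26 GMT"
stale = changed_since(header, datetime(1994, 11, 1, tzinfo=timezone.utc))
fresh = changed_since(header, datetime(1994, 12, 1, tzinfo=timezone.utc))
```

A cache last updated on 1 November 1994 is stale with respect to this header, while one updated on 1 December 1994 is not. The strategy depends on sources reporting the header reliably.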

b) Age or Last Visited
• Time elapsed from last
update (the difference
between query time and
last update time)
• It guarantees that every
source is updated after a
period
[Figure: ranking from sources whose last update was longest ago down to sources that have been updated recently]

c) PageRank and d) Source Size
• PageRank captures popularity/
importance of the LOD source
• Data from sources with the highest
PageRank is updated first
• LOD source size: data from the
biggest/smallest LOD sources
should be updated first
[Figure: ranking from sources with higher PR down to sources with lower PR]
