Experiments with evolving RDF

Sławek Staworko
(joint work with Peter Buneman)
University of Edinburgh

Preservation of evolving data
Version 1: (Tom, has, cat), (cat, eats, tuna)
Version 2: (Tom, has, cat), (cat, dies, Apr 1)
Version 3: (Tom, has, dog), (dog, eats, dog food)
…
Archive
• Version retrieval
• Timeline queries
• Storage space efficiency

Approaches to data preservation
• Store all versions
• Store the original database and log the changes
• Hybrid approach of the above two
  • store the initial and every 10th version
  • log the changes for the intermediate versions
• Annotation-based approach
  • never delete data but annotate its validity with time intervals

Annotation of RDF
Version 1: (Tom, has, cat), (cat, eats, tuna)
Version 2: (Tom, has, cat), (cat, dies, Apr 1)
Version 3: (Tom, has, dog), (dog, eats, dog food)
Archive:
(Tom, has, cat) [1–2]
(cat, eats, tuna) [1–1]
(cat, dies, Apr 1) [2–2]
(Tom, has, dog) [3–]
(dog, eats, dog food) [3–]
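The two archive operations named earlier, version retrieval and timeline queries, can be sketched over such an annotated archive. This is a minimal illustration only: the tuple layout, the use of None for a still-open interval such as [3–], and the function names retrieve_version and timeline are assumptions made for the example, not part of the original work.

```python
# Each entry: (subject, predicate, object, first_version, last_version).
# last_version is None while the interval is still open (assumed encoding).
ARCHIVE = [
    ("Tom", "has", "cat", 1, 2),
    ("cat", "eats", "tuna", 1, 1),
    ("cat", "dies", "Apr 1", 2, 2),
    ("Tom", "has", "dog", 3, None),
    ("dog", "eats", "dog food", 3, None),
]


def retrieve_version(archive, version):
    """Version retrieval: the set of triples valid in the given version."""
    return {
        (s, p, o)
        for s, p, o, start, end in archive
        if start <= version and (end is None or version <= end)
    }


def timeline(archive, subject, predicate):
    """Timeline query: every object of (subject, predicate) with its interval."""
    return [
        (o, start, end)
        for s, p, o, start, end in archive
        if s == subject and p == predicate
    ]


if __name__ == "__main__":
    print(retrieve_version(ARCHIVE, 2))
    print(timeline(ARCHIVE, "Tom", "has"))
```

Both operations are a single pass over the archive here; a real store would index the intervals.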

What exactly is the input?
Snapshots = complete database instances
Delta = difference between two databases, expressed with two atomic operations: inserting a triple and deleting a triple
Snapshots:
Version 1: (Tom, has, cat), (cat, eats, tuna)
Version 2: (Tom, has, cat), (cat, dies, Apr 1)
Version 3: (Tom, has, dog), (dog, eats, dog food)
Deltas:
Version 1 → Version 2:
delete (cat, eats, tuna)
insert (cat, dies, Apr 1)
Version 2 → Version 3:
delete (Tom, has, cat)
insert (Tom, has, dog)
insert (dog, eats, dog food)
delete (cat, dies, Apr 1)
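When every node has a stable name, a delta between two snapshots is just a pair of set differences. The sketch below assumes exactly that (no blank nodes, identical names mean identical objects), which is the easy case; compute_delta and apply_delta are hypothetical helper names introduced for illustration.

```python
def compute_delta(old, new):
    """Delta from snapshot `old` to snapshot `new` as insert/delete operations."""
    deletes = [("delete", t) for t in sorted(old - new)]
    inserts = [("insert", t) for t in sorted(new - old)]
    return deletes + inserts


def apply_delta(snapshot, delta):
    """Replay a delta on a snapshot and return the resulting snapshot."""
    result = set(snapshot)
    for op, triple in delta:
        if op == "insert":
            result.add(triple)
        elif op == "delete":
            result.discard(triple)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


V1 = {("Tom", "has", "cat"), ("cat", "eats", "tuna")}
V2 = {("Tom", "has", "cat"), ("cat", "dies", "Apr 1")}

if __name__ == "__main__":
    for op, triple in compute_delta(V1, V2):
        print(op, triple)
```

Replaying the computed delta on the older snapshot reproduces the newer one, which is a convenient sanity check.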

Challenges in preserving evolving data with annotations
1. The task is relatively simple if deltas are known:
  • deleting a triple closes its interval
  • adding a triple opens a new interval
2. It gets complicated when only snapshots are given:
  • it boils down to computing deltas
  • main challenge: identify objects that are the same across versions of the database
Entity resolution problem:
which data objects represent the same entity across different versions;
a well-studied database problem in various settings
(from duplicate elimination to record matching)
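The simple case, where deltas are known, can be sketched as follows: a delete closes the triple's open interval and an insert opens a new one. The function name build_archive, the (triple, start, end) layout, and None for an open interval are assumptions of this sketch; it also assumes a triple is not deleted and re-inserted within the same delta.

```python
def build_archive(initial, deltas):
    """Build interval annotations from version 1 and a list of deltas.

    deltas[i] leads from version i+1 to version i+2.
    Returns a list of (triple, first_version, last_version); last_version
    is None for intervals that are still open.
    """
    open_since = {t: 1 for t in initial}  # triple -> version that opened it
    closed = []
    for version, delta in enumerate(deltas, start=2):
        for op, triple in delta:
            if op == "delete":
                # The triple was last valid in the previous version.
                start = open_since.pop(triple)
                closed.append((triple, start, version - 1))
            elif op == "insert":
                open_since[triple] = version
            else:
                raise ValueError(f"unknown operation: {op}")
    return closed + [(t, start, None) for t, start in open_since.items()]


V1 = {("Tom", "has", "cat"), ("cat", "eats", "tuna")}
DELTAS = [
    [("delete", ("cat", "eats", "tuna")), ("insert", ("cat", "dies", "Apr 1"))],
    [
        ("delete", ("Tom", "has", "cat")),
        ("insert", ("Tom", "has", "dog")),
        ("insert", ("dog", "eats", "dog food")),
        ("delete", ("cat", "dies", "Apr 1")),
    ],
]

if __name__ == "__main__":
    for entry in build_archive(V1, DELTAS):
        print(entry)
```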

Entity resolution and RDF
URI (Uniform Resource Identifier)
URIs are supposed to make things easy but…
• RDF also has blank nodes
• URIs don’t exactly solve the problem in the context of evolving/merged ontologies…
Two different RDF nodes need not represent different objects

Blank nodes
• LOD initiative frowns upon them
• Blank nodes are commonplace (and misused?)
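[Figure: two uses of blank nodes. Reification: the statement (Tom, has, cat) is represented by a blank node _b with subject, pred, and object edges, and (Peter, believes, _b). Complex number: a blank node _b with components 2.4 and -0.4.]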

Blank nodes (cont.)
1. Reification (Peter believes that Tom has a cat)
2. Data structures (complex types)
3. Anonymization (Tom has a pet)
Assumptions on reasonable use of blank nodes:
1. Represent concrete objects
2. The objects can be identified from the context

Deblanking
LISP-style encoding of the list of numbers [5,3,7]:
_b3: head 5, tail _b2
_b2: head 3, tail _b1
_b1: head 7, tail end
Round 1: _b1 is replaced by #(7,end)
Round 2: _b2 is replaced by #(3,7,end)
Round 3: _b3 is replaced by #(5,3,7,end)
Assumption: graph has no cycles consisting of blanks only
Assumption: identity of a blank node is determined by its contents
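The round-by-round replacement can be sketched as below, under the two stated assumptions. Several details are choices made for this illustration rather than the authors' method: blank nodes are strings starting with "_", a node's contents are taken to be its outgoing edges only, all nodes are strings, and the generated label format (for example #(head=7,tail=end)) is hypothetical.

```python
def is_blank(node):
    """Assumed convention: blank node identifiers start with an underscore."""
    return node.startswith("_")


def deblank(triples):
    """Replace blank nodes by content-derived labels, one round at a time.

    In each round, every blank node whose outgoing edges reach no blank node
    is renamed to a label built from those edges. Returns the rewritten
    triples and the number of rounds performed.
    """
    triples = set(triples)
    rounds = 0
    while True:
        outgoing = {}
        for s, p, o in triples:
            if is_blank(s):
                outgoing.setdefault(s, []).append((p, o))
        renaming = {
            b: "#(" + ",".join(f"{p}={o}" for p, o in sorted(edges)) + ")"
            for b, edges in outgoing.items()
            if not any(is_blank(o) for _, o in edges)
        }
        if not renaming:
            # Nothing left to resolve (or only blank cycles / edgeless blanks).
            return triples, rounds
        triples = {
            (renaming.get(s, s), p, renaming.get(o, o)) for s, p, o in triples
        }
        rounds += 1


LIST_573 = {
    ("_b3", "head", "5"), ("_b3", "tail", "_b2"),
    ("_b2", "head", "3"), ("_b2", "tail", "_b1"),
    ("_b1", "head", "7"), ("_b1", "tail", "end"),
}

if __name__ == "__main__":
    result, rounds = deblank(LIST_573)
    print(rounds)
    for t in sorted(result):
        print(t)
```

Because the label depends only on contents, two blank nodes with identical contents in different versions receive the same label, which is what makes them comparable across snapshots. If a cycle of blank nodes exists, the loop stops and leaves those nodes untouched.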

Experiments
• 10 versions of Experimental Factor Ontology (EFO)
data expressed in OWL
• 200k triples in the 1st version, 290k in the last
• On average 20k blank nodes in each version
• 920k triples overall (blank nodes are independent)
• many triples do not last more than 1 version

Experiment
Deblanking and life expectancy of an object
Round   Triples   Blanks   Life expectancy
0       921896    165935   2.55
1       358857    33253    6.39
2       348356    28150    6.57
3       339695    23502    6.88
4       330564    18862    7.10
5       318761    14763    7.24
6       311562    11021    7.39
7       304628    7299     7.54
8       297744    3622     7.83
9       285484    58       7.83
10      285334    2        7.83
11      285334    1        7.83
12      285334    0        7.83

Improving space efficiency
Lift common intervals to subject
Before: Peter: lives [1–10] Edinburgh; phone [1–10] +44 712 4567
After: Peter [1–10]: lives Edinburgh; phone +44 712 4567
Unchanged before and after: has [1–5] dog
• Intervals moved from all but 33.7k triples (of total 285k)
• Number of subjects with histories is 34.3k
• Total number of intervals is reduced from 285k to 60k
• The size of the index is reduced by almost 80%
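One way to sketch the lifting step is shown below. The slides do not spell out the exact rule, so this example assumes that the most frequent interval among a subject's triples moves to the subject and that triples with a different interval keep their own; lift_intervals and the data layout are hypothetical, and attaching the dog triple to Peter is an assumption made to have a non-lifted interval in the example.

```python
from collections import Counter, defaultdict


def lift_intervals(annotated):
    """Move each subject's most common interval from its triples to the subject.

    annotated: list of (subject, predicate, object, interval).
    Returns (subject_intervals, triples); a triple's interval is None when it
    is inherited from its subject.
    """
    by_subject = defaultdict(list)
    for s, p, o, interval in annotated:
        by_subject[s].append((p, o, interval))

    subject_intervals = {}
    triples = []
    for s, edges in by_subject.items():
        common = Counter(i for _, _, i in edges).most_common(1)[0][0]
        subject_intervals[s] = common
        for p, o, interval in edges:
            triples.append((s, p, o, None if interval == common else interval))
    return subject_intervals, triples


ANNOTATED = [
    ("Peter", "lives", "Edinburgh", (1, 10)),
    ("Peter", "phone", "+44 712 4567", (1, 10)),
    ("Peter", "has", "dog", (1, 5)),
]

if __name__ == "__main__":
    subjects, triples = lift_intervals(ANNOTATED)
    print(subjects)
    for t in triples:
        print(t)
```

In this toy input three stored intervals become two: one on the subject and one on the triple that deviates from it.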

Future:
• Bisimulation
• Nested RDF

Conclusions
• Annotation offers an attractive way of representing
an evolving RDF dataset (need for nested RDF?)
• Evolution of data may require more complex atomic
operations. For instance, vocabulary evolution:
adding, splitting, merging classes. (can
bisimulation help here?)
