DSNotify: Handling Broken Links in the Web of Data

                          Niko P. Popitsch                                            Bernhard Haslhofer
                    niko.popitsch@univie.ac.at                                  bernhard.haslhofer@univie.ac.at
                           University of Vienna, Department of Distributed and Multimedia Systems
                                            Liebiggasse 4/3-4, 1010 Vienna, Austria




ABSTRACT                                                                    the network leading to the practical unavailability of infor-
The Web of Data has emerged as a way of exposing struc-                     mation [15, 2, 18, 19, 22].
tured linked data on the Web. It builds on the central build-                  Recently, Linked Data has been proposed as an approach
ing blocks of the Web (URIs, HTTP) and benefits from its                     for exposing structured data by means of common Web
simplicity and wide-spread adoption. It does, however, also                 technologies such as dereferencable URIs, HTTP, and RDF.
inherit the unresolved issues such as the broken link prob-                 Links between resources play a central role in this Web of
lem. Broken links constitute a major challenge for actors                   Data as they connect semantically related data. Meanwhile
consuming Linked Data as they require them to deal with                     an estimated number of 4.7 billion RDF triples and 142 mil-
reduced accessibility of data. We believe that the broken                   lion RDF links (cf. [6]) are exposed on the Web by numerous
link problem is a major threat to the whole Web of Data                     data sources from different domains. DBpedia [7], the struc-
idea and that both Linked Data consumers and providers                      tured representation of Wikipedia, is the most prominent
will require solutions that deal with this problem. Since no                one. Web agents can easily retrieve these data by derefer-
general solutions for fixing such links in the Web of Data                   encing URIs via HTTP and process the returned RDF data
have emerged, we make three contributions in this                           in their own application context. In the example shown in
direction: first, we provide a concise definition of the broken             Figure 1, an institution has linked a resource representing a
link problem and a comprehensive analysis of existing                       band in their local data set with the corresponding resource
approaches. Second, we present DSNotify, a generic                          in DBpedia in order to publish a combination of these data
framework able to assist human and machine actors in fixing                 on its Web portal.
broken links. It uses heuristic feature comparison and employs
a time-interval-based blocking technique for the underlying
instance matching problem. Third, we derived benchmark
datasets from knowledge bases such as DBpedia and
evaluated the effectiveness of our approach with respect to the
broken link problem. Our results show the feasibility of a
time-interval-based blocking approach for systems that aim
at detecting and fixing broken links in the Web of Data.
                                                                            [Figure 1 diagram. Source Resource:
                                                                            http://example.com/bands/OliverBlack (representation:
                                                                            foaf:name "Oliver Black"). Link: rdfs:seeAlso. Target Resource:
                                                                            http://dbpedia.org/resource/Oliver_Black (representation:
                                                                            dbpprop:abstract "Oliver Black is a Canadian rock group ...").]
                                                                            Figure 1: Sample link to a DBpedia resource
Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous; H.3.3 [                            The Linked Data approach builds on the Architecture of
Information Systems]: Information Search and Retrieval                      the World Wide Web [16] and inherits the technical benefits
                                                                            such as simplicity and wide-spread adoption but also the
                                                                            unsolved problems such as broken links. In the course of
General Terms                                                               time, resources and their representations can be removed
Algorithms, Theory, Experimentation, Measurement                            completely or “moved” to another URI meaning that they
                                                                            are published under a different HTTP URI. In case of the
1.     INTRODUCTION                                                         above example, the band eventually changed its name and
  Data integrity on the Web is not given because URI ref-                   the title of their Wikipedia entry to “Townline”, with the
erences of links between resources are not as cool 1 as they                result that the corresponding DBpedia entry moved from its
are supposed to be. Resources may be removed, moved, or                     previous URI to http://dbpedia.org/resource/Townline.
updated over time leading to broken links. These constitute                    In the Linked Data context, we informally speak of links
a major problem for consumers of Web data, both human                       pointing from one resource (source) to another resource (tar-
and machine actors, as they interrupt navigational paths in                 get). Such a link is broken when the representations of the
                                                                            target cannot be accessed anymore. However, we consider
1
    See http://www.w3.org/Provider/Style/URI                                a link also as broken when the representations of the target
Copyright is held by the International World Wide Web Conference
                                                                            resource were updated in such a way that they underwent a
Committee (IW3C2). Distribution of these papers is limited to classroom use,   change in meaning that the link creator did not have in mind.
and personal use by others.                                                    When regarding the changes in recent DBpedia releases,
WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA.                  the broken link problem becomes evident: analogous to [7],
ACM 978-1-60558-799-8/10/04.
we analyzed the instances of common DBpedia classes in the      2.1    Definition of Broken Links and Events
snapshots 3.2 (October 2008) and 3.3 (May 2009), identified        We distinguish two types of broken links that differ in their
the events that occurred between these versions and             characteristics and in how they can be detected and
categorized them into move, remove, and create events. The results    fixed: structurally and semantically broken links.
in Table 1 show that within a period of about seven months
the DBpedia data space has grown and was considerably re-       Structurally broken links. Formally, we define structurally
organized. Different from other data sources, DBpedia has        broken (binary) links as follows:
the great advantage that it records move events in so-called
redirect links that are derived from redirection pages. These      Definition 1 (Broken Link). Let R and D be the set
are automatically created in the Wikipedia when articles are    of resources and resource representations respectively and
renamed.                                                        P(A) be the powerset of an arbitrary set A.
 Class          Ins. 3.2   Ins. 3.3   MV       RM       CR
                                                                Further let $\delta_t : R \to P(D)$ be a dereferencing function
 Person          213,016    244,621   2,841   20,561   49,325   returning the set of representations of a given resource at a
 Place           247,508    318,017   2,209    2,430   70,730   given time t.
 Organisation     76,343    105,827   2,020    1,242   28,706   Now we can define a (binary) link as a pair
 Work            189,725    213,231   4,097    6,558   25,967
                                                                $l = (r_{source}, r_{target})$ with $r_{source}, r_{target} \in R$.
Table 1: Changes between two DBpedia releases.                  Such a link is called structurally broken if
Ins. 3.2 and Ins. 3.3 denote the number of instances            $\delta_{t-\Delta}(r_{target}) \neq \emptyset \land \delta_t(r_{target}) = \emptyset$.
of a certain DBpedia class in the respective release
                                                                   That is, a link (as depicted for example in Figure 1) is
data sets, MV the moved, RM the removed, and
                                                                considered structurally broken if its target resource had rep-
CR the number of created resources.
                                                                resentations that are not retrievable anymore2 . In the re-
                                                                mainder of this paper, we will refer to structurally broken
   If humans encounter broken links caused by a move event,     links simply as broken links if not stated otherwise.
they can search the data source or the Web for the new lo-
cation of the target resource. However, for machine agents
broken links can lead to serious processing errors or misin-
                                                                Semantically broken links. Apart from structurally bro-
                                                                ken links, we also consider a link as broken if the human
terpretation of resources when they do not implement ap-
                                                                interpretation (the meaning) of the representations of its
propriate fallback mechanisms. If, for instance, the link in
                                                                target resource differs from the one intended by the link au-
Figure 1 breaks and the target resource becomes unavailable
                                                                thor. In a quantitative analysis of Wikipedia articles that
due to a remove or move event, the referenced biography
                                                                changed their meaning over time [13], the authors found out
information cannot be provided anymore. If the resource
                                                                that only a small number of articles (6 out of a test set
representations are updated and undergo a change in mean-
                                                                of 100 articles) underwent minor or major changes in their
ing, the Web portal could encounter the problem of exposing
                                                                meaning. However, we do not think that these results can
semantically invalid information.
                                                                be generalized for arbitrary data sources.
   While the detection of broken links on the Web is sup-
                                                                  In contrast to structurally broken links, semantically bro-
ported by a number of tools, there are only few approaches
                                                                ken links are much harder to detect and fix by machine ac-
for automatically fixing them [19]. Techniques for dealing
                                                                tors. But they are fixable by human actors that may, in a
with the broken link problem in the Web of Data do not
                                                                semi-automatic process, report such events to a system that
exist yet. The current approach is to rely on the HTTP
                                                                then forwards these events to subscribed actors. We have
404 Not Found response and assume that data-consuming
                                                                foreseen this in our system (see Section 3).
actors can deal with it. We consider this as insufficient and
propose DSNotify, which we informally introduced in [12],
as a possible solution. DSNotify is a generic change detec-     Events. Having defined links and broken links we can now
tion framework for Linked Data sources that informs data-       define events that occur when datasets are modified, possi-
consuming actors about the various types of events (create,     bly leading to broken links:
remove, move, update) that can occur in data sources.
   Our contributions are: (i) we formalize the broken-link         Definition 2 (Event). Let E be the set of all events
problem in the context of the Web of Data and provide a         and e ∈ E be a quadruple e = (r1 , r2 , τ, t), where r1 ∈ R and
comprehensive analysis of existing solution strategies (Sec-    r2 ∈ R ∪ {∅} are resources affected by the event,
tion 2), (ii) we present DSNotify, focusing on its core algo-   τ ∈ {created, removed, updated, moved} is the type of the
rithms for handling the underlying instance matching prob-      event and t is the time when the event took place. Further
lem (Section 3), and (iii) we have derived benchmark data       let L ⊆ E be a set of detected events.
sets from the ISLab Instance Matching Benchmark [11] and           Then we can assert that, ∀r ∈ R :
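from the DBpedia knowledge base and evaluate the                $\delta_{t-\Delta}(r) = \emptyset \land \delta_t(r) \neq \emptyset \implies L \leftarrow L \cup \{(r, \emptyset, \text{created}, t)\}$.
effectiveness of the DSNotify approach (Section 4).             $\delta_{t-\Delta}(r) \neq \emptyset \land \delta_t(r) = \emptyset \implies L \leftarrow L \cup \{(r, \emptyset, \text{removed}, t)\}$.
                                                                $\delta_{t-\Delta}(r) \neq \delta_t(r) \implies L \leftarrow L \cup \{(r, \emptyset, \text{updated}, t)\}$.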
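The three event rules of Definition 2 can be illustrated with a minimal sketch. It assumes that the dereferencing function is available as two snapshots, each a dictionary from a resource URI to its set of representations. The names `Snapshot`, `detect_events`, and the `ex:` URIs are hypothetical and not part of DSNotify. As a simplifying assumption, the three rules are treated as mutually exclusive, so a created or removed resource is not additionally reported as updated.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# A snapshot plays the role of the dereferencing function delta at one time:
# it maps a resource URI to the set of its representations.
Snapshot = Dict[str, FrozenSet[str]]
# Events follow the quadruple (r1, r2, type, time) of Definition 2.
Event = Tuple[str, Optional[str], str, int]


def detect_events(before: Snapshot, after: Snapshot, t: int) -> List[Event]:
    """Compare delta_{t-Delta} (before) with delta_t (after) for every resource."""
    events: List[Event] = []
    for r in sorted(set(before) | set(after)):
        old = before.get(r, frozenset())
        new = after.get(r, frozenset())
        if not old and new:
            events.append((r, None, "created", t))
        elif old and not new:
            events.append((r, None, "removed", t))
        elif old != new:
            events.append((r, None, "updated", t))
    return events


before = {"ex:a": frozenset({"v1"}), "ex:b": frozenset({"v1"})}
after = {"ex:a": frozenset({"v2"}), "ex:c": frozenset({"v1"})}
events = detect_events(before, after, t=1)
```

In this example `ex:a` is reported as updated, `ex:b` as removed, and `ex:c` as created.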

2.   THE BROKEN LINK PROBLEM                                      Note the analogy between the definition of broken links
   In the previous section, we informally described the bro-    and removed events: whenever the representations of a link
ken link problem and its possible consequences. In this sec-    2
                                                                  Note that our definitions do not consider a link as broken
tion we want to contribute a formal definition of a broken       if only some of the representations of the target resource are
link in the context of Linked Data and discuss existing so-     not retrievable anymore. We consider clarifications of this
lution strategies for dealing with this problem.                issue as a topic for further research.
target are removed, the corresponding links are broken. Now                                                       Broken link type
                                                                       Solution Strategy                  Class   A      B       C
we have defined create, remove and update events, but what
                                                                       Ignoring the Problem                 -
about the event type “moved ”? In fact, it is not possible             Embedded Links                       p
to speak about moved resources considering only the pre-               Relative References                  p
vious definitions. Although there is no concept of resource             Indirection                          p
location in RDF, it exists in the Web of Data as it relies             Versioned and Static Collections     p
on dereferencable HTTP URIs. For this reason, we define a               Regular Updates                      p
                                                                       Redundancy                           p
weak equality relation between resources in the Web of Data
                                                                       Dynamic Links                        a
based on a similarity relation between its representations             Notification                          c
and build on that to define move events:                                Detect and Correct                   c
                                                                       Manual Edit/Search                   c
   Definition 3 (Move Event).
Let $\sigma : P(D) \times P(D) \to [0, 1]$ be a similarity function
between two sets of resource representations. Further let
$\Theta \in [0, 1]$ be a threshold value.
  We define the maximum similarity of a resource
$r_{old} \in \{r \in R \mid \delta_t(r_{old}) = \emptyset\}$ with any other resource
$r_{new} \in R \setminus \{r_{old}\}$ as
$sim^{max}_{r_{old}} \equiv \max(\sigma(\delta_{t-\Delta}(r_{old}), \delta_t(r_{new})))$.
Now we can assert that:
$\exists!\, r_{new} \in R \mid \delta_{t-\Delta}(r_{new}) = \emptyset \land \Theta < \sigma(\delta_{t-\Delta}(r_{old}), \delta_t(r_{new})) = sim^{max}_{r_{old}}$
$\implies L \leftarrow L \cup \{(r_{new}, r_{old}, \text{moved}, t)\}$.

                                                                   Table 2: Solution strategies for the broken link
                                                                   problem. The strategies are classified according to
                                                                   Ashman [2]: preventative methods (p) that try to avoid
                                                                   broken links in the first place, adaptive methods (a)
                                                                   that create links dynamically thereby avoiding
                                                                   broken links, and corrective methods (c) that try to fix
                                                                   broken links. Symbols: potentially fixes/avoids such
                                                                   broken links ( ), does not fix/avoid such broken
                                                                   links ( ), partly fixes/avoids such broken links ( )
  Thus, we consider a resource as moved from one HTTP
URI to another when the resource with the “old” URI was
removed, the resource with the “new” URI was created and           and the target resource is referenced using e.g., an HTTP
the representations retrieved from the “old” URI are very          URI reference. This method preserves link integrity when
similar3 to the representations retrieved from the “new” URI.      the source resource of a link is moved (type A).
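The move rule of Definition 3 can be sketched as follows. This is an illustration under stated assumptions rather than the DSNotify implementation: each representation set is modeled as a set of tokens, and the similarity function is a Jaccard coefficient, which is one possible choice since the definition leaves the similarity function open. The names `jaccard`, `detect_move`, and the `ex:` URIs are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Reps = FrozenSet[str]  # assumption: a representation set modeled as tokens


def jaccard(a: Reps, b: Reps) -> float:
    """Illustrative similarity function with values in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_move(r_old: str, before: Dict[str, Reps], after: Dict[str, Reps],
                threshold: float,
                sigma: Callable[[Reps, Reps], float] = jaccard
                ) -> Optional[Tuple[str, str, str]]:
    """Return (r_new, r_old, 'moved') if exactly one newly created resource
    attains the maximum similarity and that maximum exceeds the threshold."""
    old_reps = before.get(r_old, frozenset())
    # Similarity of the removed resource to every other resource at time t.
    scores = {r: sigma(old_reps, reps) for r, reps in after.items() if r != r_old}
    if not scores:
        return None
    sim_max = max(scores.values())
    # Only resources without representations at t - Delta count as move targets.
    best = [r for r, s in scores.items() if s == sim_max and not before.get(r)]
    if sim_max > threshold and len(best) == 1:
        return (best[0], r_old, "moved")
    return None


before = {"ex:Oliver_Black": frozenset({"oliver", "black", "canadian", "rock", "group"})}
after = {
    "ex:Townline": frozenset({"townline", "canadian", "rock", "group"}),
    "ex:Other": frozenset({"jazz", "trio"}),
}
move = detect_move("ex:Oliver_Black", before, after, threshold=0.4)
```

Here the similarity between the old and the new representation set is 3/6 = 0.5, which exceeds the threshold of 0.4, so a move to `ex:Townline` is reported. If two new resources tied for the maximum, no event would be issued.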

2.2    Solution Strategies                                         Relative References. Relative references prevent broken
  In consequence of Definition 1, we further provide a more         links in some cases (e.g., when a whole resource collection
informal definition of link integrity:                              is moved). But neither does this method avoid broken links
                                                                   due to removed resources (type C ), nor does it hinder links
  Definition 4 (Link Integrity). Link integrity is a               with external sources/targets (i.e., absolute references) from
qualitative measure for the reliability that a link leads to the   breaking.
representations of a resource that were intended by the au-
thor of the link.                                                  Indirection5 . Introducing a layer of indirection allows con-
                                                                   tent providers to keep links to their Web resources up to
  Methods to preserve link integrity have a long history in
                                                                   date. Aliases refer to the location of a resource and special
hypertext research. We have analyzed existing approaches
                                                                   services translate between an alias and its referred location.
in detail, building to a great part on Ashman’s work [2] and
                                                                   Moving or removing a resource requires an update in these
extending it. In the following, we categorize broken links by
                                                                   service’s translation tables. Uniform Resource Names were
the events that caused their breaking:
                                                                   proposed for such an indirection strategy, PURLs and DOIs
type A: links broken because source resources were moved4 ,
                                                                   are two well known examples [1]. Permalinks use a similar
type B: links broken because target resources were moved
                                                                   strategy, although the translation step is performed by the
and type C: links broken because source or target resources
                                                                   content repository itself and not by a special (possibly cen-
were removed. The identified solution strategies are sum-
                                                                   tral) service. Another special case on the Web is the use
marized in Table 2 and discussed in the following:
                                                                   of small (“gravestone”) pages that reside at the locations of
                                                                   moved or removed resources and indicate what happened to
Ignoring the Problem. Although this can hardly be called           the resource (e.g., by automatically redirecting the HTTP
a “solution strategy”, it is the status-quo to simply ignore       request to the new location).
the problem of broken links and shift it to higher-level ap-          The main disadvantage of the indirection strategy is that
plications that process the data. As mentioned before, this        it depends on notifications (see below) for updating the
strategy is even less acceptable in the Web of Data.               translation tables. Furthermore it “. . . requires the cooper-
                                                                   ation of link creators to refer to the alias instead of to the
Embedded Links. The embedded link model [8] is the most            absolute address.” [2]. Another disadvantage is that
common way of modeling links on the Web. As in                     special “translation services” (PURL server, CMS, gravestone
HTML, the link is embedded in the source representation            pages) are required that introduce additional latency when
3
  Note that in the case that there are multiple possible move      accessing resources (e.g., two HTTP requests instead of one).
target resources with equal (maximum) similarity to the re-        Nevertheless, indirection is an increasingly popular method
moved resource rold , no event is issued (∃! should be read as     on the Web. This strategy prevents broken links of type A
“there exists exactly one”).
4                                                                  5
  Note that in our definitions we did not consider links of          The “Dereferencing (Aliasing) of End Points” and “For-
type A as we assumed an embedded link model for Linked             warding Mechanisms and Gravestones” categories from [2]
Data sources.                                                      are combined in this category.
and B and can also help with type C links, e.g., removed
PURLs result in an HTTP 410 (Gone) status code, which
allows an application to react accordingly (e.g., by removing
the links to this resource).

Versioned and Static Collections. In this approach, no
modifications/deletions of the considered resources are
allowed. This prevents broken links of types A-C within a
static collection (e.g., an archive); links with sources/targets
outside this collection can still break. Examples are HTML
links into the Internet Archive 6. Furthermore, semantically
broken links may be prevented when, e.g., linking to a certain
(unmodifiable) revision of a Wikipedia article.
                                                                    [Figure 2 diagram. Nodes: Dsrc, Dtgt, Application,
                                                                    DSNotify, User. Edge labels: links to, updates,
                                                                    consumes, monitors, notifies, event choice.]
                                                                    Figure 2: DSNotify Usage Scenario.
Regular Updates. This approach is based on predictable
updates to resource collections, so applications can easily         Manual Edit/Search. In this category we summarize
update their links to the new (predictable) resource                manual strategies for fixing broken links or re-finding missing
locations, avoiding broken links of types A-C.                      link targets. This “solution strategy” is currently arguably
                                                                    among the most popular ones on the Web. First, content
Redundancy. Redundant copies of resource representations            providers may manually update links in their contents (per-
are kept and a service forwards referrers to one of these           haps assisted by automatic link checking software like W3C’s
copies as long as at least one copy still exists. This approach     link checker). Second, users may re-find the target resources
is related to the versioning and indirection approaches. How-       of broken links e.g., by exploiting search engines or by man-
ever, such services can reasonably be applied only to highly        ual URI manipulations.
available, unmodifiable data (e.g., collections of scientific
documents). This method may prevent broken links of types
A-C, examples include LOCKSS [21] or RepWeb [23].                   3.    DSNOTIFY
                                                                       After having investigated possible solution strategies to
Dynamic Links. In this method, links are created dynami-            deal with the broken link problem, we now present our own
cally when required and are not stored, avoiding broken links       solution strategy. It is called DSNotify and is a generic
of types A-C. The links are created based on computations           change detection framework that assists human and machine
that may reflect the current state of the involved resources         actors in fixing broken links. It can be used as an add-on to
as well as other (external) parameters, i.e., such links may        existing applications that want to preserve link integrity in
be context-dependent. However, the automatic generation             the data sets that are under their control (detect and correct
of links is a non-trivial task and this solution strategy is not    strategy, see below). It can also be used to notify subscribed
applicable to many real-world problems.                             applications of changes in a set of data sources (notification
                                                                    strategy). Further, it can be set up as a service that
Notification. Here, clients are informed about the events           automatically forwards requests to new resource locations of
that may lead to broken links and all required information          moved resources (indirection strategy).
(e.g., new resource locations) is communicated to them so              DSNotify is not meant to be a service that monitors the
they can fix affected links. This strategy was for example            whole Linked Data space but rather as a light-weight com-
used in the Hyper-G system [17] where resource updates are          ponent that can be tailored to application-specific needs and
propagated between document servers using a p-flood algo-            detects modifications in selected Linked Data sources.
rithm. It is also the strategy of Silk and Triplify discussed          A typical usage scenario is illustrated in Figure 2: an ap-
in Section 5. This method resolves broken links caused by           plication hosts a data set Dsrc that contains links to a target
A-C but requires that the content provider can observe and          data set Dtgt . The application consumes and integrates data
communicate such events.                                            from both data sets, and provides a view on this data (such
                                                                    as a Website) consumed by Web users. The application uses
                                                                    DSNotify to monitor Dtgt as it has no control over this tar-
Detect and Correct. The solution for the broken link prob-          get data set. DSNotify notifies the application (and possibly
lem proposed in this work falls mainly into this category
                                                                    other subscribed actors) about the events occurring in Dtgt
that Ashman describes in her work as: “. . . the application
                                                                    and the application can update and fix potentially broken
using the link first checks the validity of the endpoint refer-
                                                                    links in the source data set.
ence against the information, perhaps matching it with an
expected value. If the validity test fails, an attempt to correct   3.1       General Approach
the link by relocating it may be made . . . ” [2].
   Like all other solutions in this category (cf. Section 5), we       Our approach for fixing broken links is based on an
use heuristic methods to semi-automatically fix broken links         indexing infrastructure. A monitor periodically accesses the
of the types A-C.                                                   considered data sources (e.g., a Linked Data source), creates
                                                                    an item for each resource it encounters and extracts a
6
  For example, the URI http://web.archive.org/web/                   feature vector from this item’s representations. The feature
19971011064403/http://www.archive.org/index.html                    vector is stored together with the item’s URI in an item
references the 1997 version of the internet archive main            index (ii). The set of extracted features, their implementa-
page.                                                               tion, and their extractors are configurable. Feature vectors
are updated in the ii with every monitoring cycle resulting
in possible update events logged by DSNotify.
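One monitoring cycle over the item index can be sketched as follows, under simplifying assumptions: each resource has a single representation given as a property dictionary, and a feature vector is a tuple of the configured property values. The names `extract_features`, `monitoring_cycle`, and the `ex:` URI are hypothetical. Reporting a create event for a previously unseen item is also an assumption of this sketch.

```python
from typing import Dict, List, Tuple

FeatureVector = Tuple[str, ...]
Event = Tuple[str, str, int]  # (uri, event type, time)


def extract_features(representation: Dict[str, str],
                     features: List[str]) -> FeatureVector:
    """Pick the configured properties; missing ones become empty strings."""
    return tuple(representation.get(name, "") for name in features)


def monitoring_cycle(item_index: Dict[str, FeatureVector],
                     fetched: Dict[str, Dict[str, str]],
                     features: List[str], t: int) -> List[Event]:
    """Update the item index (ii) in place and return the logged events."""
    events: List[Event] = []
    for uri, representation in fetched.items():
        vector = extract_features(representation, features)
        if uri not in item_index:
            item_index[uri] = vector
            events.append((uri, "created", t))
        elif item_index[uri] != vector:
            # The stored feature vector changed: log a possible update event.
            item_index[uri] = vector
            events.append((uri, "updated", t))
    return events


ii: Dict[str, FeatureVector] = {}
features = ["foaf:name"]
first = monitoring_cycle(ii, {"ex:OliverBlack": {"foaf:name": "Oliver Black"}}, features, t=1)
second = monitoring_cycle(ii, {"ex:OliverBlack": {"foaf:name": "Townline"}}, features, t=2)
```

The first cycle indexes the item, and the second cycle detects the changed feature vector and logs an update.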
   Items corresponding to resources that are not found anymore
are moved to another index, the removed item index
(rii). After some timeout period, items are moved from the
rii to a third index called the archived item index (aii),
resulting in a remove event (cf. Definition 2).
   Items in the ii and the rii are periodically considered by a                                                                                                                                            t
housekeeper thread (a “move detector”) that compares their                         m2                                      h2                             m3                            h3
feature vectors and tries to identify possible successors for              H        F            B          H              F     B           H                F     B        H                  F
                                                                                                                                                                                                     B
                                                                               I   E                            I               E                 I                E              I             E
removed (real-world) items (cf. Definition 3). A plug-in                    G         D           C          G
                                                                                                                           D      C          G
                                                                                                                                                              D      C       G                      CD
                                                                                    A                                           A                                  A                                 A
heuristic is used for this comparison; in the default con-
figuration a vector space model acting on the extracted and                                             timeout                                                              timeout

weighted feature vectors is used. The similarity measures for
the features themselves are configurable; for character string features one could use, for instance, exact string matching or the Levenshtein distance. The similarity value calculated by the heuristic is compared against threshold values⁷ to determine whether an item is another item’s predecessor (resulting in a detected move event) or not (possibly resulting in a detected create event). The predecessors of the newly indexed items are moved to the aii and linked to the new corresponding items. This enables actors to query the DSNotify indices for actual locations of items.
   The events detected by DSNotify are stored in a central event log. This log as well as the indices can be accessed via various interfaces, such as a Java API, an XML-RPC interface, and an HTTP interface.

3.2    Time-interval-based Blocking
   The main task of DSNotify is the efficient and accurate matching of pairs of feature vectors representing the same real-world item at different points in time. As in record linkage and related problems (cf. Section 5), the number of such pairs grows quadratically with the number of considered items, resulting in unacceptable computational effort. The reduction in the number of comparisons is called blocking and various approaches have been proposed in the past [25].
   We have developed a time-interval-based blocking (TIBB) mechanism for an efficient and accurate reduction of the number of compared feature vector pairs. Our method includes only the feature vectors derived from items that were detected as being recently created or removed, blocking out the vast majority of the items in our indices. Reconsidering Definition 3, this means that we are allowing only small values for ∆. Thus, if x is the number of feature vectors stored in our system, n is the number of new items and r is the number of recently removed items with n + r ≤ x, then the number of comparisons in a single DSNotify housekeeping operation is n · r instead of x². It is intuitively clear that normally n and r are much smaller than x and therefore n · r ≪ x². The actual number of feature vector comparisons in a single housekeeper operation depends on the vitality of the monitored data source with respect to created, removed and moved items and on the frequency of housekeeping operations⁸. We have analyzed and confirmed this behavior in the evaluation of our system (see Section 4).

⁷ Note that using threshold values for the determination of non-matches, possible-matches and matches was already proposed by Fellegi and Sunter in 1969 [10].
⁸ As housekeeping and monitoring are separate operations in DSNotify, this number depends also on the monitoring frequency when lower than the housekeeping frequency.

3.3    Monitoring and Housekeeping
   The cooperation of monitor, housekeeper, and the indices is depicted in Figure 3. To simplify matters, we assume an empty dataset at the beginning. Then four items (A, B, C, D) are created before the initial DSNotify monitoring process starts at m_0. The four items are detected and added to the item index (ii). Then a new item E is created, item A is removed and the items B and C are “moved” to a new location becoming items F and G respectively. At m_1 the three items that are not found anymore by the monitor are “moved” to the removed item index (rii) and the new items are added to the ii. When the housekeeper is started for the first time at h_1, it acts on the current indices and compares the recent new items (E, F, G) with the recently removed items (B, C, A). It does not include the “old” item D in its feature vector comparisons. The housekeeper detects B as a predecessor of F and C as a predecessor of G, moves B and C to the archived item index (aii) and links them to their successors. Between m_1 and m_2 a new item is created (H), two items (F, D) are removed and the item E is “moved” to item I. The monitor updates the indices accordingly at m_2 and the subsequent housekeeping operation at h_2 tries to find predecessors of the items H and I. But before this operation, the housekeeper recognizes that the retention period of item A in rii is longer than the timeout period and moves it to the aii. The housekeeper then detects E as a predecessor of I, moves it also to the aii and links it to I. Between m_2 and m_3 no events take place and the indices remain untouched by the monitor. At h_3 the housekeeper recognizes the timeout of the items F and D and moves them to the aii leaving an empty rii.

Figure 3: Example timeline illustrating the main workflow of DSNotify. C_i, R_i and M_{i,j} denote create, remove and move events of items i and j. m_x and h_x denote monitoring and housekeeping operations respectively. The current index contents are shown in the grey boxes below the respective operation; the overall process is explained in the text.

3.4    Event Choices
   As mentioned before, a threshold value (upperThreshold) is used to decide whether two feature vectors are similar enough to assume their corresponding items as a predecessor/successor pair. Furthermore, DSNotify uses a second threshold value (lowerThreshold) to decide whether two fea-
ture vectors are similar enough to be even considered as such a pair (“possible move candidates”, pmc). When none of the feature vectors considered for a possible move operation are similar enough (i.e., > upperThreshold), DSNotify stores all considered pairs of feature vectors with similarity values > lowerThreshold in a so-called event choice object. Event choices are representations of decisions that have to be made outside of DSNotify, for example by human actors or by other machine actors that can resort to additional data/knowledge. These external actors may access the log of event choice objects and send back their decisions about which feature vector pair (if any) corresponds to a predecessor/successor pair. DSNotify then updates its indices accordingly and sends notifications to all subscribed actors. A detailed description of the overall housekeeping algorithm, the core of DSNotify, is presented in Algorithm 1.

  Data: Item indices ii, rii, aii
  Result: List of detected events L
  begin
     Move timed-out items from rii to aii;
     L ← ∅; PMC ← ∅;
     foreach ni ∈ ii.getRecentItems() do
         pmc ← ∅;
         foreach oi ∈ rii.getItems() do
             sim ← calculateSimilarity(oi, ni);
             if sim > lowerThreshold then
                 pmc ← pmc + {(oi, ni, sim)};
             end
         end
         if pmc = ∅ then
             L ← L + {(ni, ∅, create)};
         else
             PMC ← PMC + {pmc};
         end
     end
     foreach pmc ∈ PMC do
         if pmc ≠ ∅ then
             (oi_max, ni_max, sim_max) ← getElementWithMaxSim(pmc);
             if sim_max > upperThreshold then
                 L ← L + {(oi_max, ni_max, move)};
                 move oi_max to aii;
                 link oi_max to ni_max;
                 remove all elements from PMC where pmc.oi = oi_max;
             else
                 Issue an eventChoice for pmc;
             end
         end
     end
     return L;
  end
Algorithm 1: Central DSNotify housekeeping algorithm.

3.5   Item History and Event Log
   As discussed above, DSNotify incrementally constructs three central data structures during its operation: (i) an event log containing all events detected by the system, (ii) a log containing all unresolved event choices and (iii) a linked structure of feature vectors constituting a history of the respective items. This latter structure is stored in the indices maintained by DSNotify. All three data structures can be accessed in various ways by agents that make use of DSNotify for fixing broken links as further described in [12].
   As these data structures may grow indefinitely, a strategy for pruning them from time to time is required. Currently we rely on simple timeouts for removing old data items from these structures but this method can still result in unacceptable memory consumption when monitoring highly dynamic data sources. More advanced strategies are under consideration. Note that we consider particularly the feature vector history as a very valuable data structure as it allows ex post analysis of the evolution of items w.r.t. their location in a data set and the values of the indexed features.

4.     EVALUATION
   In the evaluation of our system we concentrated on two issues: first, we wanted to evaluate the system for its applicability for real-world Linked Data sources, and second, we wanted to analyze the influence of the housekeeping frequency on the overall effectiveness of the system.
   We evaluated our system with datasets that we call eventsets. An eventset is a temporally ordered set of events (cf. Definitions 2 and 3) that transforms a source into a target dataset. Thus, an eventset can be seen as the event history of a dataset. We have developed a simulator that can re-play such eventsets, interpreting the event timestamps with regard to a configurable duration of the whole simulation. Figure 4 depicts an overview of our evaluation approach.

Figure 4: The DSNotify evaluation approach. A simulator takes two datasets (D_src and D_tgt) and an eventset (E_test) as input and continuously updates a newly created observed dataset (D_obs). DSNotify monitors this dataset and creates a log of detected events (E_det). This log is compared to the eventset E_test to evaluate the system’s accuracy.

   All experiments were carried out on a system using two Intel Xeon CPUs with 2.66 GHz each and 8 GB of RAM. The used threshold values were 0.8 (upperThreshold) and 0.3 (lowerThreshold). We have created two types of eventsets from existing datasets for our evaluation: the iimb-eventsets and the dbpedia-eventset⁹.

4.1      The IIMB Eventsets
   The iimb-eventsets are derived from the ISLab Instance Matching Benchmark [11] which contains one (source) dataset containing 222 instances and 37 target datasets that vary in number and type of introduced modifications to the instance data. It is the goal of instance matching tools to match the resources in the source dataset with the resources in the respective target dataset by comparing their instance data. The benchmark contains an alignment file describing

⁹ All data sets are available at http://dsnotify.org/.
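
Algorithm 1 can be turned into a short executable sketch. The Python below is an illustration under stated assumptions rather than the DSNotify implementation: the `Item` class, the `housekeeping` function and its argument names are hypothetical, the timeout step that moves stale items from the rii to the aii is left out, and a weighted mean of normalized Levenshtein similarities stands in for the configurable vector space heuristic. The default thresholds are the values used in the evaluation (0.3 and 0.8).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # items are compared by identity, not by field values
class Item:
    uri: str
    features: dict  # feature name -> string value (hypothetical layout)
    successor: Optional["Item"] = None


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(old: Item, new: Item, weights: dict) -> float:
    """Assumed stand-in heuristic: weighted mean of per-feature similarities."""
    total = 0.0
    for name, weight in weights.items():
        a, b = old.features.get(name, ""), new.features.get(name, "")
        longest = max(len(a), len(b)) or 1
        total += weight * (1.0 - levenshtein(a, b) / longest)
    return total / sum(weights.values())


def housekeeping(recent_items, rii, aii, weights, lower=0.3, upper=0.8):
    """One housekeeping pass; returns (events, event_choices)."""
    events, choices, candidate_sets = [], [], []
    # Phase 1: collect possible move candidates (pmc) for every new item.
    for ni in recent_items:
        pmc = [(oi, ni, sim) for oi in rii
               if (sim := similarity(oi, ni, weights)) > lower]
        if pmc:
            candidate_sets.append(pmc)
        else:
            events.append((ni.uri, None, "create"))
    # Phase 2: accept the best candidate or defer the decision.
    archived = set()  # predecessors that were already matched
    for pmc in candidate_sets:
        pmc = [c for c in pmc if c[0] not in archived]
        if not pmc:
            continue
        oi, ni, sim = max(pmc, key=lambda c: c[2])
        if sim > upper:
            events.append((oi.uri, ni.uri, "move"))
            rii.remove(oi)
            aii.append(oi)
            oi.successor = ni  # link predecessor to its successor
            archived.add(oi)
        else:
            choices.append(pmc)  # to be resolved outside of this function
    return events, choices
```

Fed with the state at h_1 from Section 3.3 (new items E, F, G and removed items B, C, A), a call such as `housekeeping([E, F, G], rii, aii, {"label": 1.0})` would report a create event for E and move events for B/F and C/G, provided the feature values of each pair are similar enough to exceed the upper threshold. Item A would stay in the rii until its timeout expires.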
[Figure 5 plot: F1-measure (%) on the y-axis (0 to 100) against the iimb-eventset number (1 to 10) on the x-axis, one curve per housekeeping interval hki = 1s, 3s, 10s, 20s, 30s.]

Figure 5: Influence of the housekeeping interval (hki) on the F1-measure in the iimb-eventsets evaluations.

  Name                           Coverage     H       Hnorm
  tbox:cogito-Name                0.995      5.378    0.995
  tbox:cogito-first sentence      0.991      5.354    0.991
  tbox:cogito-tag *               0.986      1.084    0.201
  tbox:cogito-domain *            0.982      3.129    0.579
  tbox:wikipedia-name             0.333      1.801    0.333
  tbox:wikipedia-birthdate        0.225      1.217    0.225
  tbox:wikipedia-location         0.185      0.992    0.184
  tbox:wikipedia-birthplace       0.104      0.553    0.102
  Namespace prefix tbox: <http://islab.dico.unimi.it/iimb/tbox.owl#>

Table 3: Coverage, entropy and normalized entropy of all properties in the iimb datasets with a coverage > 10%. The selected properties are marked with an asterisk.
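
The H and Hnorm columns of Table 3 can be reproduced along the following lines. This is a sketch under assumptions: p_i is taken to be the relative frequency of the i-th distinct value of a property, and the normalization constant n is taken to be the number of resources in the dataset (222 for the iimb source dataset), which is consistent with the first table row, where 5.378 / ln(222) ≈ 0.995. The function names are illustrative, and how resources without a value are counted is not specified in the text.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of the observed value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def normalized_entropy(values, n):
    """Entropy divided by ln(n); n is assumed to be the number of resources."""
    return entropy(values) / math.log(n)
```

A property whose values are all distinct reaches the maximum normalized entropy of 1.0 when n equals the number of values, while a property with few distinct values (such as tbox:cogito-tag in Table 3) scores much lower.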
what resources correspond to each other that can be used to measure the effectiveness of such tools. We used this alignment information to derive 10 eventsets, corresponding to the first 10 iimb target datasets, each containing 222 move events. The first 10 iimb datasets introduce increasing numbers of value transformations like typographical errors to the instance data. We used random timestamps for the events (as this data is not available in this benchmark) that resulted in an equal distribution of events over the eventset duration.
   We have simulated these eventsets, monitored the changing dataset with DSNotify and measured precision and recall of the reported events with respect to the eventset information. For a useful feature selection we first calculated the entropy of the properties with a coverage > 10%, i.e., only properties were considered where at least 10% of the resources had instance values. The results are summarized in Table 3. As the goal of the evaluation was not to optimize the resulting precision/recall values but to analyze our blocking approach, we consequently chose the properties tbox:cogito-tag and tbox:cogito-domain for the evaluation because they have good coverage but comparatively small entropy in this dataset. We calculated the entropy as shown in Equation 1 and normalized it by dividing by ln(n).

   H(p) = -\sum_{i=1}^{n} p_i \ln(p_i)                                   (1)

DSNotify was configured to compare these properties using the Levenshtein distance and both properties contributed equally (weight = 1.0) to the corresponding feature vector comparison. The simulation was configured to run for 60 seconds, thus the monitored datasets changed with an average rate of 222/60 = 3.7 events/s.
   As stated before, the goal of this evaluation was to demonstrate the influence of the housekeeping frequency on the overall effectiveness of the system. For this, we repeated the experiment with varying housekeeping intervals of 1s, 3s, 10s, 20s, 30s (corresponding to an average rate of 3.7, 11.1, 37.0, 74.0, 111.0 events/housekeeping cycle) and calculated the F1-measure (the harmonic mean of precision and recall) for each dataset (Figure 5).

Results. The results clearly demonstrate the expected decrease in accuracy when increasing the length of the housekeeping intervals, as this leads to more feature vector comparisons and therefore more possibilities to make the wrong decisions. Furthermore, Figure 5 depicts the decreasing accuracy with the increasing dataset number. This is also expected as the benchmark introduces more value transformations with higher dataset numbers, although there are two outliers for the datasets 7 and 10.

4.2    The DBpedia Persondata Eventset
   In order to evaluate our approach with real-world data we have created a dbpedia-eventset that was derived from the person datasets of the DBpedia snapshots 3.2 and 3.3¹⁰. The raw persondata datasets contain 20,284 (version 3.2) and 29,498 (version 3.3) subjects typed as foaf:Person, each having the three properties foaf:name, foaf:surname and foaf:givenname. Naturally, these properties are very well suited to uniquely identify persons as also confirmed by their high entropy values (cf. Table 4). For the same reasons as already discussed for the iimb datasets an evaluation with only these properties would not clearly demonstrate our approach. Therefore we enriched both raw data sets with four properties (see Table 4) from the respective DBpedia Mapping-based Infobox Extraction datasets [7] with smaller coverage and entropy values.
   We derived the dbpedia-eventset by comparing both datasets for created, removed or updated resources. We retrieved the creation and removal dates for the events from Wikipedia as these data are not included in the DBpedia datasets. For the update events we used random dates. Furthermore, we used the DBpedia redirect dataset to identify and generate move events. This dataset contains redirection information derived from Wikipedia’s redirect pages that are automatically created when a Wikipedia article is renamed. The dates for these events were also retrieved from Wikipedia.
   The resulting eventset contained 3810 create, 230 remove, 4161 update and 179 move events, summing up to 8380 events¹¹. The histogram of the eventset depicted in Figure 6 shows a high peak in bin 14. About a quarter of all events

¹⁰ The snapshots contain a subset of all instances of type foaf:Person and can be downloaded from http://dbpedia.org/ (filename: persondata_en.nt).
¹¹ Another 5666 events were excluded from the eventset as they resulted from inaccuracies in the DBpedia datasets. For example there are some items in the 3.2 snapshot that are not part of the 3.3 snapshot but were not removed from Wikipedia (a prominent example is the resource http://dbpedia.org/resource/Tim_Berners-Lee). Furthermore several items from version 3.3 were not included in version 3.2 although the creation date of the corresponding Wikipedia article is before the creation date of the 3.2 snapshot. We decided to generally exclude such items.
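
The event rates quoted in Section 4.1 and the F1-measure follow from simple arithmetic, sketched below. The function names are illustrative. The comparison estimate additionally assumes that every move event contributes exactly one new and one recently removed item to a housekeeping cycle and that no unmatched items linger in the rii, so it is only a rough illustration of the n · r count from Section 3.2, not a measured value.

```python
def events_per_cycle(num_events: int, duration_s: float, interval_s: float) -> float:
    """Average number of events that fall into one housekeeping interval."""
    return num_events / duration_s * interval_s


def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def estimated_comparisons(moves_per_cycle: float) -> float:
    """n * r under the assumption n = r = number of moves in the cycle."""
    return moves_per_cycle ** 2


if __name__ == "__main__":
    # 222 move events replayed over a 60 second simulation
    for hki in (1, 3, 10, 20, 30):
        k = events_per_cycle(222, 60, hki)
        print(f"hki={hki:>2}s  events/cycle={k:6.1f}  "
              f"~comparisons={estimated_comparisons(k):9.1f}")
```

For the 222 move events replayed over 60 seconds this gives 3.7, 11.1, 37.0, 74.0 and 111.0 events per cycle, matching the rates in the text. Under the stated assumption even the 30s interval stays well below the 222² = 49,284 pairs of an unblocked comparison, while the growth of the estimate with the interval length is in line with the observation that longer intervals lead to more feature vector comparisons.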
[Figure plot: F1-measure (%) on the y-axis; series labels foaf:name, dsnotify:rdfhash, foaf:givenname, dbpedia:birthdate+dbpedia:birthplace.]

  Name                     Coverage      H             Hnorm
  foaf:name (d)            1.00/1.00     9.91/10.28    1.00/1.00
  foaf:surname (d)         1.00/1.00     9.11/9.25     0.92/0.90
  foaf:givenname (d)       1.00/1.00     8.23/8.52     0.83/0.83
  dbpedia:birthdate (d)    0.60/0.60     5.84/5.96     0.59/0.58
                                                                                                                                                                   q
          dbpedia:birthplace (o)                            0.48/0.47            4.24/4.32               0.43/0.42                                                 dbpedia:birthplace
                                                                                                                                                                                             q                                                                                        q




                                                                                                                                                 40
                                                                                                                                                                                                                   q
                                                                                                                                                                                                                                             q
                                                                                                                                                                                                                                                                                                     q
                                                                                                                                                                                                                                                            q
          dbpedia:height (d)                                0.10/0.08            0.65/0.51               0.07/0.05




                                                                                                                                                 30
                                                                                                                                                                                                                                                                                      q
                                                                                                                                                                                                                                  dbpedia:birthdate
          dbpedia:draftyear (d)                             0.01/0.01            0.06/0.05               0.01/0.01




                                                                                                                                                 20
                                                                                                                                                                                                                                                                                                     q
                                                                                                                                                                                   dbpedia:height




                                                                                                                                                 10
          Namespace prefix dbpedia: <http://dbpedia.org/ontology/>                                                                                                                dbpedia:draftyear




                                                                                                                                                 0
          Namespace prefix foaf: <http://xmlns.com/foaf/0.1/>
                                                                                                                                                                           1.6        1.8              2.0                 2.2        2.4            2.6            2.8        3.0            3.2            3.4

                                                                                                                                                                                                               Log10(average events/housekeeping cycle)
Table 4: Coverage, type, entropy and normalized
entropy of all properties in the enriched dbpedia
                                                                                                                        Figure 7: Influence of the number of events per
3.2/3.3 persondata sets. The selected properties
                                                                                                                        housekeeping cycle on the F1 -measure of detected
are written in bold. Symbols: object property (o),
                                                                                                                        move events in the dbpedia-eventset evaluation.
datatype property (d).
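
The F1-measure reported in Figure 7 is the harmonic mean of precision and recall. The following minimal Python sketch shows how these three measures can be computed from a set of reported and a set of actual move events. The function name, the representation of a move event as an (old URI, new URI) pair and the example URIs are illustrative assumptions and are not taken from the DSNotify implementation.

```python
def precision_recall_f1(reported, actual):
    """Return (precision, recall, F1) for reported vs. actual events."""
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    # Guard against empty sets to avoid a division by zero.
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical move events, written as (old URI, new URI) pairs.
actual_moves = {
    ("ex:a", "ex:a2"),
    ("ex:b", "ex:b2"),
    ("ex:c", "ex:c2"),
    ("ex:d", "ex:d2"),
}
reported_moves = {
    ("ex:a", "ex:a2"),  # correct
    ("ex:b", "ex:b2"),  # correct
    ("ex:c", "ex:x"),   # wrong target, counts as a false positive
}

p, r, f1 = precision_recall_f1(reported_moves, actual_moves)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy example two of the three reported moves are correct and two of the four actual moves are found, so precision is 2/3, recall is 1/2 and the F1-measure is 4/7.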
[Figure 6 plot: histogram with the number of events (0 to 2000) on the y-axis
and bins 1 to 20 on the x-axis.]

Figure 6: Histogram of the distribution of events in the dbpedia-eventset. A
bin corresponds to a time interval of about 11 days.

occurred within this time interval. We think that such event peaks are not
unusual in real-world data and are interested in how our application deals
with such situations.
   We re-played the eventsets, monitored the changing dataset with DSNotify
and measured precision and recall of the reported events with respect to the
eventset information (cf. Figure 4). We repeated the simulation seven times,
varying the average number of events per housekeeping interval, and
calculated the F1-measure of the reported move events12.
   For each simulation, DSNotify was configured to index only one of the six
selected properties in Table 4. To calculate the similarity between datatype
properties, we used the Levenshtein distance. For object properties we used a
simple similarity function that counted the number of common property values
(i.e., resources) in both resources that are compared and divided it by the
number of total values.
   Furthermore, we ran the simulations indexing only one cumulative
attribute, an rdf-hash. This hash function calculates an MD5 hash sum over
all string-serialized properties of a resource, and the corresponding
similarity function returns 1.0 if the hash sums are equal and 0.0 otherwise.
Thus this rdf-hash is sensitive to any modifications in a resource’s instance
data.
   Additionally, we evaluated a combination of the dbpedia birthdate and
birthplace properties, each contributing with equal weight to the weighted
feature vector. The coverage of resources that had values for at least one of
these attributes was 65% in the 3.2 snapshot and 62% in the 3.3 snapshot.

12
 We fixed the housekeeping period for this experiment to 30 s and varied the
simulation length from 3600 s to 56.25 s. Thus the event rates varied between
2.3 and 149.0 events/second, or 35.2 and 2250.1 events/housekeeping interval,
respectively. For these calculations we considered only move, remove and
create events (i.e., 4219 events) from the eventset, as only these influence
the accuracy of the algorithm.

Results. The results, depicted in Figure 7, show a fast saturation of the
F1-measure with a decreasing number of events per housekeeping cycle. This
clearly confirms the findings from our iimb evaluation. The accuracy of
DSNotify increases with increasing housekeeping frequencies or decreasing
event rates. From a pragmatic viewpoint, this means a tradeoff between the
costs for monitoring and housekeeping operations (computational effort,
network transmission costs, etc.) and accuracy. The curve for the simple
rdf-hash function is surprisingly good, stabilizing at about 80% for the
F1-measure. This can be attributed mainly to the high precision rates that
are expected from such a function. The curve for the combined properties
shows maximum values for the F1-measure of about 60%.
   The measured precision and recall rates are depicted in Figure 8. Both
measures show a decrease with increasing numbers of events per housekeeping
cycle. For the precision this can be observed mainly for low-entropy
properties, whereas the recall measures for all properties are affected.

[Figure 8 plots: two panels showing Precision (%) and Recall (%) on the
y-axis against Log10(average events/housekeeping cycle) on the x-axis; labeled
curves: dsnotify:rdfhash, foaf:name, foaf:givenname, dbpedia:birthplace,
dbpedia:birthdate+dbpedia:birthplace, dbpedia:birthdate, dbpedia:height and
dbpedia:draftyear.]

Figure 8: Influence of the number of events per housekeeping cycle on the
measured precision and recall of detected move events in the dbpedia-eventset
evaluation.

   It is, again, important to state that the goal of our evaluation was not
to maximize the accuracy of the system for these particular eventsets but
rather to reveal the characteristics of our time-interval-based blocking
approach. It shows that
we can achieve good results even for attributes with low entropy when
choosing an appropriate housekeeping frequency.

5.    RELATED WORK
   Besides the already cited works, our research is also closely related to
the areas of record linkage, a well-researched problem in the database
domain, and instance matching, which is related to ontology alignment and
schema matching.
   Record linkage13 is concerned with finding pairs of data records (from one
or multiple datasets) that refer to (describe) the same real-world entity
[25, 9]. This information is useful, e.g., for joining different relations or
for duplicate detection [9]. Record linkage is trivial where entity-unique
identifiers (such as ISBN numbers or OWL inverse-functional properties like
foaf:mbox) are available. When such additional identifiers are missing, tools
often rely on probabilistic distance metrics and machine learning methods
(e.g., HMMs, TF-IDF, SVM). A comprehensive survey of record linkage research
can be found in [9]. The instance matching problem is closely related to
record linkage but requires certain specific methods when dealing with
structural

RDF. Triplify also provides a Linked Data Update Log that groups updates to
an RDF model within a certain timespan into nested collections accessible via
HTTP. In principle, this solution provides a scalable approach for logging
events that occurred in a Linked Data source. However, this notification
approach requires clients to regularly poll this log and the data source to
capture all these events (if possible). Furthermore, the current
specification of the vocabulary16 used for describing the update events does
not contain moved or created events but only “Update” and “Deletion”.
   Peridot is a tool developed by IBM for automatically fixing links in the
Web. It is based on the patents [4, 5] and the basic idea is to calculate
fingerprints of Web documents and repair broken links based on their
similarity. DSNotify differs from this method in that we consider the
structured nature of Linked Data and support domain-specific, configurable
similarity heuristics on a property level, which allows more advanced
comparison methods. Furthermore, DSNotify introduces the described
time-interval-based blocking approach and also detects create, remove and
update events.
   In [19], Morishima et al. describe Link Integrity Management tools that
focus on fixing broken links in the Web
and logical heterogeneities as pointed out in [11].
                                                                   that occurred due to moved link targets (type B, cf. section
   Furthermore, our work is closely related to current work
                                                                   2.2). Similar to DSNotify, they have developed a tool called
in Semantic Web and in hypertext research:
                                                                   PageChaser that uses a heuristic approach to find missing
   In [20], Phelps and Wilensky propose so-called robust hy-
                                                                   resources based on indexed information (namely URIs, page
perlinks based on URI references that are decorated with a
                                                                   content and redirect information). An explorer component,
small14 lexical signature composed of terms extracted from
                                                                   which makes use of search engines, redirect information,
the referenced document. When a target document is not
                                                                   and so-called link authorities (Web pages containing well-
found, this lexical signature can be used to re-find the re-
                                                                   maintained links) is used to find possible URIs of moved
source using regular Web search engines. A disadvantage
                                                                   resources. They also provide a heuristics-based method to
of the robust hyperlink approach is that it requires existing
                                                                   calculate such link authority pages. A major difference to
URI references to be changed which is not the case with
                                                                   our approach is that PageChaser was built for fixing links in
our approach. Furthermore, it is unclear how to extend this
                                                                   the (human) Web exploiting some of its characteristics (like
method to non-textual resources whereas our feature vector
                                                                   locality or link authorities), while DSNotify aims at becom-
based approach could in principle be combined with most
                                                                   ing a general framework for assisting actors in fixing links
existing multimedia feature extraction solutions.
                                                                   based on domain-specific content features.
   In [14], an algorithm for record linkage (object consoli-
                                                                      In a recent paper, Van de Sompel et al. discuss a proto-
dation) based on inverse-functional properties is described.
                                                                   col for time-based content negotiation that can be used to
The approach groups instances with matching IFP values
                                                                   access archived representations (“Mementos”) of resources
together and determines canonical URIs for identification of
                                                                   [22]. By this, the protocol enables a kind of time-travel
the real-world entities described by those instances. Natu-
                                                                   when accessing archived resources (such archives would fall
rally, IFPs can be used efficiently for record linkage prob-
                                                                   into the Versioned and Static Collections category in Ta-
lems but are unfortunately unavailable in many real-world
                                                                   ble 2). DSNotify could be used to build such archives when
datasets. In DSNotify, IFPs can be exploited by simply us-
                                                                   a monitor implementation was used that would store not
ing them as a single feature in a feature vector.
                                                                   only a feature vector derived from a resource representation
   The Silk framework [24] aims mainly at the automatic,
                                                                   but also the representation data itself.
heuristics-based discovery of semantic relationships between
resources in the Web of Data. These heuristics may be
configured using an XML-based links specification language           6.      CONCLUSIONS AND FUTURE WORK
(Silk-LSL). In order to react on changes in the interlinked          We presented the broken link problem in the context of
datasets, the authors propose a new SOAP-based protocol            the Web of Data as a special case of the instance matching
(Web of Data - Link Maintenance Protocol, WOD-LMP 15 )             problem and showed the feasibility of a time-interval-based
for synchronizing and maintaining links between LD sources.        blocking approach for systems that aim at detecting and fix-
   Triplify [3] is a system that exposes data from relational      ing such broken links. Reconsidering the solution strategies
databases as Linked Data. It is based on mapping HTTP re-          for the broken link problem discussed in Section 2.2, we can
quests to RDBMS queries and publishing the result data as          state that DSNotify can actually be used in multiple ways,
13
   Record linkage is also known under many other names,            including: (1) to function as a detect and correct module in
 such as object identification, data cleaning, entity resolution,   an existing software, (2) as a standalone notification service
 coreference resolution or object consolidation [25, 14, 9].       that keeps subscribed clients informed about changes in a
14
   The authors found out that five terms are enough to              data source, (3) as an indirection service that automatically
 uniquely identify a Web resource in virtually all cases.          forwards requests for moved resources to their new location.
15
   http://www4.wiwiss.fu-berlin.de/bizer/silk/
                                                                   16
 wodlmp/                                                                http://triplify.org/vocabulary/update
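The indirection-service use case named in the conclusions can be illustrated with a minimal sketch. It assumes that detected move events are available as (old URI, new URI) pairs; the class and method names (IndirectionIndex, record_move, resolve) are hypothetical and are not part of DSNotify's published interfaces. The example URIs are the ones used for the renamed DBpedia band resource in the introduction.

```python
from typing import Dict


class IndirectionIndex:
    """Hypothetical lookup table built from detected 'moved' events."""

    def __init__(self) -> None:
        # Maps a former URI to the URI the resource was moved to.
        self._moved_to: Dict[str, str] = {}

    def record_move(self, old_uri: str, new_uri: str) -> None:
        # The new URI now hosts a resource, so any stale forwarding
        # entry that starts at it is dropped. This also keeps the
        # table free of cycles (e.g. a resource renamed back).
        self._moved_to.pop(new_uri, None)
        self._moved_to[old_uri] = new_uri

    def resolve(self, uri: str) -> str:
        # Follow the chain of recorded moves to the current location.
        # URIs without a recorded move are returned unchanged.
        while uri in self._moved_to:
            uri = self._moved_to[uri]
        return uri
```

A request for http://dbpedia.org/resource/Oliver_Black would, after recording its move, resolve to http://dbpedia.org/resource/Townline; a real service could answer such a request with an HTTP redirect to the resolved URI. How long forwarding entries are kept is a design choice this sketch leaves open.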
The various interfaces for accessing the data structures built by DSNotify should facilitate the integration with existing applications. Our approach is by design a semi-automatic solution that is capable of integrating human intelligence in the sense of human-based computation. We plan to further elaborate on this issue. However, DSNotify cannot “cure” the Web of Data from broken links. It may rather be used as an add-on for particular data providers that want to keep a high level of link integrity in their data.

The flexibility of our tool is founded in its generic nature and its customizability. Consequently, the development and evaluation of additional monitors, features and extractors, heuristics and indices is one part of our future work. In particular, we want to research feasible methods for automatic feature selection as well as for the determination of optimal monitoring/housekeeping periods, as these are the key parameters for achieving good accuracy with our tool. Further, we plan to analyze the applicability of DSNotify to other domains like the (document) Web or the file system.

7.    ACKNOWLEDGMENTS

The authors would like to thank Martin Romauch for his valuable comments. The work has partly been supported by the European Commission as part of the eContentplus program (EuropeanaConnect).

8.    REFERENCES

[1] W. Y. Arms. Uniform resource names: handles, purls, and digital object identifiers. Commun. ACM, 44(5):68, 2001.
[2] H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3), 2000.
[3] S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumüller. Triplify: light-weight linked data publication from relational databases. In WWW ’09, New York, NY, USA, 2009. ACM.
[4] M. Beynon and A. Flegg. Hypertext request integrity and user experience, 2004. US Patent 0267726A1.
[5] M. Beynon and A. Flegg. Guaranteeing hypertext link integrity, 2007. US Patent 7290131 B2.
[6] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 2009.
[7] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, July 2009.
[8] H. C. Davis. Referential integrity of links in open hypermedia systems. In Proceedings of the 9th ACM conference on Hypertext and hypermedia, pages 207–216, New York, NY, USA, 1998. ACM.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19(1):1–16, 2007.
[10] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[11] A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a benchmark for instance matching. In Ontology Matching (OM 2008), volume 431 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
[12] B. Haslhofer and N. Popitsch. DSNotify – detecting and fixing broken links in linked data sets. In 8th International Workshop on Web Semantics (WebS 09), co-located with DEXA 2009, 2009.
[13] M. Hepp, K. Siorpaes, and D. Bachlechner. Harvesting wiki consensus: Using Wikipedia entries as vocabulary for knowledge management. IEEE Internet Computing, 11(5):54–65, 2007.
[14] A. Hogan, A. Harth, and S. Decker. Performing object consolidation on the semantic web data graph. In Proceedings of the 1st I3: Identity, Identifiers, Identification Workshop, 2007.
[15] D. Ingham, S. Caughey, and M. Little. Fixing the “broken-link” problem: the W3Objects approach. Comput. Netw. ISDN Syst., 28(7-11):1255–1268, 1996.
[16] I. Jacobs and N. Walsh. Architecture of the World Wide Web, volume one. Technical report, W3C, December 2004. Retrieved May 7, 2009.
[17] F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84–104, 1995.
[18] S. Lawrence, D. M. Pennock, G. W. Flake, R. Krovetz, F. M. Coetzee, E. Glover, F. A. Nielsen, A. Kruger, and C. L. Giles. Persistence of web references in scientific research. Computer, 34(2):26–31, 2001.
[19] A. Morishima, A. Nakamizo, T. Iida, S. Sugimoto, and H. Kitagawa. Bringing your dead links back to life: a comprehensive approach and lessons learned. In HT ’09: Proceedings of the 20th ACM conference on Hypertext and hypermedia, pages 15–24, New York, NY, USA, 2009. ACM.
[20] T. A. Phelps and R. Wilensky. Robust hyperlinks cost just five words each. Technical Report UCB/CSD-00-1091, EECS Department, University of California, Berkeley, 2000.
[21] D. S. H. Rosenthal and V. Reich. Permanent web publishing. In ATEC ’00: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 129–140. USENIX Association, 2000.
[22] H. Van de Sompel, M. L. Nelson, R. Sanderson, L. L. Balakireva, S. Ainsworth, and H. Shankar. Memento: Time travel for the web. arXiv preprint, arXiv:0911.1112, November 2009.
[23] L. Veiga and P. Ferreira. Repweb: replicated web with referential integrity. In SAC ’03: Proceedings of the 2003 ACM symposium on Applied computing, pages 1206–1211, New York, NY, USA, 2003. ACM.
[24] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In 8th International Semantic Web Conference, 2009.
[25] W. E. Winkler. Overview of record linkage and current research directions. Technical report, U.S. Bureau of the Census, 2006.

090626cc tech-summit

  • 1. DSNotify: Handling Broken Links in the Web of Data Niko P. Popitsch Bernhard Haslhofer niko.popitsch@univie.ac.at bernhard.haslhofer@univie.ac.at University of Vienna, Department of Distributed and Multimedia Systems Liebiggasse 4/3-4, 1010 Vienna, Austria ABSTRACT the network leading to the practical unavailability of infor- The Web of Data has emerged as a way of exposing struc- mation [15, 2, 18, 19, 22]. tured linked data on the Web. It builds on the central build- Recently, Linked Data has been proposed as an approach ing blocks of the Web (URIs, HTTP) and benefits from its for exposing structured data by means of common Web simplicity and wide-spread adoption. It does, however, also technologies such as dereferencable URIs, HTTP, and RDF. inherit the unresolved issues such as the broken link prob- Links between resources play a central role in this Web of lem. Broken links constitute a major challenge for actors Data as they connect semantically related data. Meanwhile consuming Linked Data as they require them to deal with an estimated number of 4.7 billion RDF triples and 142 mil- reduced accessibility of data. We believe that the broken lion RDF links (cf. [6]) are exposed on the Web by numerous link problem is a major threat to the whole Web of Data data sources from different domains. DBpedia [7], the struc- idea and that both Linked Data consumers and providers tured representation of Wikipedia, is the most prominent will require solutions that deal with this problem. Since no one. Web agents can easily retrieve these data by derefer- general solutions for fixing such links in the Web of Data encing URIs via HTTP and process the returned RDF data have emerged, we make three contributions into this direc- in their own application context. 
In the example shown in tion: first, we provide a concise definition of the broken Figure 1, an institution has linked a resource representing a link problem and a comprehensive analysis of existing ap- band in their local data set with the corresponding resource proaches. Second, we present DSNotify, a generic frame- in DBpedia in order to publish a combination of these data work able to assist human and machine actors in fixing bro- on its Web portal. ken links. It uses heuristic feature comparison and employs Source Resource Link Target Resource a time-interval-based blocking technique for the underlying http://example.com/bands/OliverBlack rdfs:seeAlso http://dbpedia.org/resource/Oliver_Black instance matching problem. Third, we derived benchmark datasets from knowledge bases such as DBpedia and eval- dbpprop: abstract foaf:name uated the effectiveness of our approach with respect to the Representation Representation broken link problem. Our results show the feasibility of a Oliver Black Oliver Black is a Canadian rock group ... time-interval-based blocking approach for systems that aim at detecting and fixing broken links in the Web of Data. Figure 1: Sample link to a DBpedia resource Categories and Subject Descriptors H.4.m [Information Systems]: Miscellaneous; H.3.3 [ The Linked Data approach builds on the Architecture of Information Systems]: Information Search and Retrieval the World Wide Web [16] and inherits the technical benefits such as simplicity and wide-spread adoption but also the unsolved problems such as broken links. In the course of General Terms time, resources and their representations can be removed Algorithms, Theory, Experimentation, Measurement completely or “moved” to another URI meaning that they are published under a different HTTP URI. In case of the 1. 
INTRODUCTION above example, the band eventually changed its name and Data integrity on the Web is not given because URI ref- the title of their Wikipedia entry to “Townline”, with the erences of links between resources are not as cool 1 as they result that the corresponding DBpedia entry moved from its are supposed to be. Resources may be removed, moved, or previous URI to http://dbpedia.org/resource/Townline. updated over time leading to broken links. These constitute In the Linked Data context, we informally speak of links a major problem for consumers of Web data, both human pointing from one resource (source) to another resource (tar- and machine actors, as they interrupt navigational paths in get). Such a link is broken when the representations of the target cannot be accessed anymore. However, we consider 1 See http://www.w3.org/Provider/Style/URI a link also as broken when the representations of the target Copyright is held by the International World Wide Web Conference Com- resource were updated in such a way that they underwent a mittee (IW3C2). Distribution of these papers is limited to classroom use, change in meaning the link-creator had not in mind. and personal use by others. When regarding the changes in recent DBpedia releases, WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA. the broken link problem becomes evident: analogous to [7], ACM 978-1-60558-799-8/10/04.
we analyzed the instances of common DBpedia classes in the snapshots 3.2 (October 2008) and 3.3 (May 2009), identified the events that occurred between these versions and categorized them into move, remove, and create events. The results in Table 1 show that within a period of about seven months the DBpedia data space has grown and was considerably reorganized. Different from other data sources, DBpedia has the great advantage that it records move events in so-called redirect links that are derived from redirection pages. These are automatically created in the Wikipedia when articles are renamed.

Class          Ins. 3.2   Ins. 3.3   MV      RM       CR
Person         213,016    244,621    2,841   20,561   49,325
Place          247,508    318,017    2,209   2,430    70,730
Organisation   76,343     105,827    2,020   1,242    28,706
Work           189,725    213,231    4,097   6,558    25,967

Table 1: Changes between two DBpedia releases. Ins. 3.2 and Ins. 3.3 denote the number of instances of a certain DBpedia class in the respective release data sets, MV the moved, RM the removed, and CR the number of created resources.

If humans encounter broken links caused by a move event, they can search the data source or the Web for the new location of the target resource. However, for machine agents broken links can lead to serious processing errors or misinterpretation of resources when they do not implement appropriate fallback mechanisms. If, for instance, the link in Figure 1 breaks and the target resource becomes unavailable due to a remove or move event, the referenced biography information cannot be provided anymore. If the resource representations are updated and undergo a change in meaning, the Web portal could encounter the problem of exposing semantically invalid information.

While the detection of broken links on the Web is supported by a number of tools, there are only few approaches for automatically fixing them [19]. Techniques for dealing with the broken link problem in the Web of Data do not exist yet. The current approach is to rely on the HTTP 404 Not Found response and assume that data-consuming actors can deal with it. We consider this as insufficient and propose DSNotify, which we informally introduced in [12], as a possible solution. DSNotify is a generic change detection framework for Linked Data sources that informs data-consuming actors about the various types of events (create, remove, move, update) that can occur in data sources.

Our contributions are: (i) we formalize the broken-link problem in the context of the Web of Data and provide a comprehensive analysis of existing solution strategies (Section 2), (ii) we present DSNotify, focusing on its core algorithms for handling the underlying instance matching problem (Section 3), and (iii) we have derived benchmark data sets from the ISLab Instance Matching Benchmark [11] and from the DBpedia knowledge base and evaluate the effectiveness of the DSNotify approach (Section 4).

2. THE BROKEN LINK PROBLEM

In the previous section, we informally described the broken link problem and its possible consequences. In this section we want to contribute a formal definition of a broken link in the context of Linked Data and discuss existing solution strategies for dealing with this problem.

2.1 Definition of Broken Links and Events

We distinguish two types of broken links that differ in their characteristics and in the way how they can be detected and fixed: structurally and semantically broken links.

Structurally broken links. Formally, we define structurally broken (binary) links as follows:

Definition 1 (Broken Link). Let R and D be the set of resources and resource representations respectively and P(A) be the powerset of an arbitrary set A. Further let δ_t : R → P(D) be a dereferencing function returning the set of representations of a given resource at a given time t. Now we can define a (binary) link as a pair l = (r_source, r_target) with r_source ∧ r_target ∈ R. Such a link is called structurally broken if

    δ_{t−Δ}(r_target) ≠ ∅ ∧ δ_t(r_target) = ∅.

That is, a link (as depicted for example in Figure 1) is considered structurally broken if its target resource had representations that are not retrievable anymore². In the remainder of this paper, we will refer to structurally broken links simply as broken links if not stated otherwise.

Semantically broken links. Apart from structurally broken links, we also consider a link as broken if the human interpretation (the meaning) of the representations of its target resource differs from the one intended by the link author. In a quantitative analysis of Wikipedia articles that changed their meaning over time [13], the authors found out that only a small number of articles (6 out of a test set of 100 articles) underwent minor or major changes in their meaning. However, we do not think that these results can be generalized for arbitrary data sources.

In contrast to structurally broken links, semantically broken links are much harder to detect and fix by machine actors. But they are fixable by human actors that may, in a semi-automatic process, report such events to a system that then forwards these events to subscribed actors. We have foreseen this in our system (see Section 3).

Events. Having defined links and broken links we can now define events that occur when datasets are modified, possibly leading to broken links:

Definition 2 (Event). Let E be the set of all events and e ∈ E be a quadruple e = (r1, r2, τ, t), where r1 ∈ R and r2 ∈ R ∪ {∅} are resources affected by the event, τ ∈ {created, removed, updated, moved} is the type of the event and t is the time when the event took place. Further let L ⊆ E be a set of detected events. Then we can assert that, ∀r ∈ R:

    δ_{t−Δ}(r) = ∅ ∧ δ_t(r) ≠ ∅  ⟹  L ← L ∪ {(r, ∅, created, t)}.
    δ_{t−Δ}(r) ≠ ∅ ∧ δ_t(r) = ∅  ⟹  L ← L ∪ {(r, ∅, removed, t)}.
    δ_{t−Δ}(r) ≠ δ_t(r)          ⟹  L ← L ∪ {(r, ∅, updated, t)}.

Note the analogy between the definition of broken links and removed events: whenever the representations of a link

² Note that our definitions do not consider a link as broken if only some of the representations of the target resource are not retrievable anymore. We consider clarifications of this issue as a topic for further research.
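Definitions 1 and 2 can be read operationally. The following Python sketch is an illustration only and not part of DSNotify: it assumes that the dereferencing function δ at the two points in time is available as two dictionaries mapping a URI to its set of representations, and the names `Snapshot`, `deref`, `detect_events` and `is_structurally_broken` are hypothetical. As a simplification, each resource is reported at most once, so an updated event is emitted only when representations exist at both times and differ.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Assumption: a snapshot stands in for the dereferencing function delta at one
# point in time and maps each resource URI to its set of representations.
Snapshot = Dict[str, FrozenSet[str]]
Event = Tuple[str, Optional[str], str, int]  # (r1, r2, type, t) as in Definition 2


def deref(snapshot: Snapshot, uri: str) -> FrozenSet[str]:
    """Representations of `uri`, or the empty set if none are retrievable."""
    return snapshot.get(uri, frozenset())


def detect_events(before: Snapshot, after: Snapshot, t: int) -> List[Event]:
    """Classify every resource seen in either snapshot (Definition 2)."""
    events: List[Event] = []
    for uri in sorted(set(before) | set(after)):
        old, new = deref(before, uri), deref(after, uri)
        if not old and new:
            events.append((uri, None, "created", t))
        elif old and not new:
            events.append((uri, None, "removed", t))
        elif old != new:
            events.append((uri, None, "updated", t))
    return events


def is_structurally_broken(target: str, before: Snapshot, after: Snapshot) -> bool:
    """Definition 1: the target had representations that are no longer retrievable."""
    return bool(deref(before, target)) and not deref(after, target)


before = {"ex:a": frozenset({"v1"}), "ex:b": frozenset({"v1"})}
after = {"ex:a": frozenset({"v2"}), "ex:c": frozenset({"v1"})}
for event in detect_events(before, after, t=1):
    print(event)
```

In this toy data, a link pointing to `ex:b` is structurally broken, which mirrors the analogy between broken links and removed events.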
target are removed, the corresponding links are broken. Now we have defined create, remove and update events, but what about the event type “moved”? In fact, it is not possible to speak about moved resources considering only the previous definitions. Although there is no concept of resource location in RDF, it exists in the Web of Data as it relies on dereferencable HTTP URIs. For this reason, we define a weak equality relation between resources in the Web of Data based on a similarity relation between their representations and build on that to define move events:

Definition 3 (Move Event). Let σ : P(D) × P(D) → [0, 1] be a similarity function between two sets of resource representations. Further let Θ ∈ [0, 1] be a threshold value. We define the maximum similarity of a resource r_old ∈ {r ∈ R | δ_t(r_old) = ∅} with any other resource r_new ∈ R \ {r_old} as

    sim^max_{r_old} ≡ max(σ(δ_{t−Δ}(r_old), δ_t(r_new))).

Now we can assert that:

    ∃! r_new ∈ R | δ_{t−Δ}(r_new) = ∅ ∧ Θ < σ(δ_{t−Δ}(r_old), δ_t(r_new)) = sim^max_{r_old}  ⟹  L ← L ∪ {(r_new, r_old, moved, t)}.

Thus, we consider a resource as moved from one HTTP URI to another when the resource with the “old” URI was removed, the resource with the “new” URI was created and the representations retrieved from the “old” URI are very similar³ to the representations retrieved from the “new” URI.

2.2 Solution Strategies

In consequence of Definition 1, we further provide a more informal definition of link integrity:

Definition 4 (Link Integrity). Link integrity is a qualitative measure for the reliability that a link leads to the representations of a resource that were intended by the author of the link.

Methods to preserve link integrity have a long history in hypertext research. We have analyzed existing approaches in detail, building to a great part on Ashman’s work [2] and extending it. In the following, we categorize broken links by the events that caused their breaking: type A: links broken because source resources were moved⁴, type B: links broken because target resources were moved and type C: links broken because source or target resources were removed. The identified solution strategies are summarized in Table 2 and discussed in the following:

                                           Broken link type
Solution Strategy                  Class   A    B    C
Ignoring the Problem               -
Embedded Links                     p
Relative References                p
Indirection                        p
Versioned and Static Collections   p
Regular Updates                    p
Redundancy                         p
Dynamic Links                      a
Notification                       c
Detect and Correct                 c
Manual Edit/Search                 c

Table 2: Solution strategies for the broken link problem. The strategies are classified according to Ashman [2]: preventative methods (p) that try to avoid broken links in the first place, adaptive methods (a) that create links dynamically thereby avoiding broken links and corrective methods (c) that try to fix broken links. Symbols: potentially fixes/avoids such broken links ( ), does not fix/avoid such broken links ( ), partly fixes/avoids such broken links ( )

Ignoring the Problem. Although this can hardly be called a “solution strategy”, it is the status-quo to simply ignore the problem of broken links and shift it to higher-level applications that process the data. As mentioned before, this strategy is even less acceptable in the Web of Data.

Embedded Links. The embedded link model [8] is the most common way how links on the Web are modeled. As in HTML, the link is embedded in the source representation and the target resource is referenced using e.g., an HTTP URI reference. This method preserves link integrity when the source resource of a link is moved (type A).

Relative References. Relative references prevent broken links in some cases (e.g., when a whole resource collection is moved). But neither does this method avoid broken links due to removed resources (type C), nor does it hinder links with external sources/targets (i.e., absolute references) from breaking.

Indirection⁵. Introducing a layer of indirection allows content providers to keep links to their Web resources up to date. Aliases refer to the location of a resource and special services translate between an alias and its referred location. Moving or removing a resource requires an update in these services’ translation tables. Uniform Resource Names were proposed for such an indirection strategy; PURLs and DOIs are two well-known examples [1]. Permalinks use a similar strategy, although the translation step is performed by the content repository itself and not by a special (possibly central) service. Another special case on the Web is the use of small (“gravestone”) pages that reside at the locations of moved or removed resources and indicate what happened to the resource (e.g., by automatically redirecting the HTTP request to the new location).

The main disadvantage of the indirection strategy is that it depends on notifications (see below) for updating the translation tables. Furthermore it “. . . requires the cooperation of link creators to refer to the alias instead of to the absolute address.” [2]. Another disadvantage is that special “translation services” (PURL server, CMS, gravestone pages) are required that introduce additional latency when accessing resources (e.g., two HTTP requests instead of one). Nevertheless, indirection is an increasingly popular method on the Web. This strategy prevents broken links of type A

³ Note that in the case that there are multiple possible move target resources with equal (maximum) similarity to the removed resource r_old, no event is issued (∃! should be read as “there exists exactly one”).

⁴ Note that in our definitions we did not consider links of type A as we assumed an embedded link model for Linked Data sources.

⁵ The “Dereferencing (Aliasing) of End Points” and “Forwarding Mechanisms and Gravestones” categories from [2] are combined in this category.
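Returning to Definition 3, the move condition can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the DSNotify implementation: representations are modeled as sets of strings, the Jaccard coefficient stands in for the abstract similarity function σ, and the names `jaccard` and `detect_move` are hypothetical. In line with footnote 3, no move is reported when several newly created resources share the maximum similarity.

```python
from typing import Callable, Dict, FrozenSet, Optional

Representations = FrozenSet[str]
Snapshot = Dict[str, Representations]  # assumed stand-in for delta at one time


def jaccard(a: Representations, b: Representations) -> float:
    """Stand-in for sigma: overlap of two representation sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_move(
    r_old: str,
    before: Snapshot,
    after: Snapshot,
    sigma: Callable[[Representations, Representations], float],
    theta: float,
) -> Optional[str]:
    """Return the unique new URI that r_old moved to, or None (Definition 3)."""
    old_reps = before.get(r_old, frozenset())
    if not old_reps or after.get(r_old):
        return None  # r_old must have had representations and lost them
    sims = {
        uri: sigma(old_reps, reps)
        for uri, reps in after.items()
        if uri != r_old and reps
    }
    if not sims:
        return None
    sim_max = max(sims.values())
    # Candidates must be newly created, exceed theta and attain the maximum.
    candidates = [
        uri
        for uri, sim in sims.items()
        if sim == sim_max and sim > theta and not before.get(uri)
    ]
    return candidates[0] if len(candidates) == 1 else None


before = {"ex:old": frozenset({"name:Ada", "born:1815"}),
          "ex:other": frozenset({"name:Bob"})}
after = {"ex:new": frozenset({"name:Ada", "born:1815", "died:1852"}),
         "ex:other": frozenset({"name:Bob"})}
print(detect_move("ex:old", before, after, jaccard, theta=0.5))
```

The threshold value 0.5 in the usage line is arbitrary and chosen only for the toy data.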
and B and can also help with type C links, e.g., removed PURLs result in HTTP 410 status code (Gone), which allows an application to react accordingly (e.g., by removing the links to this resource).

Versioned and Static Collections. In this approach, no modifications/deletions of the considered resources are allowed. This prevents broken links of types A-C within a static collection (e.g., an archive), but links with sources/targets outside this collection can still break. Examples are HTML links into the Internet Archive⁶. Furthermore, semantically broken links may be prevented when e.g., linking to a certain (unmodifiable) revision of a Wikipedia article.

Regular Updates. This approach is based on predictable updates to resource collections, so applications can easily update their links to the new (predictable) resource locations, avoiding broken links of types A-C.

Redundancy. Redundant copies of resource representations are kept and a service forwards referrers to one of these copies as long as at least one copy still exists. This approach is related to the versioning and indirection approaches. However, such services can reasonably be applied only to highly available, unmodifiable data (e.g., collections of scientific documents). This method may prevent broken links of types A-C; examples include LOCKSS [21] or RepWeb [23].

Dynamic Links. In this method, links are created dynamically when required and are not stored, avoiding broken links of types A-C. The links are created based on computations that may reflect the current state of the involved resources as well as other (external) parameters, i.e., such links may be context-dependent. However, the automatic generation of links is a non-trivial task and this solution strategy is not applicable to many real-world problems.

Notification. Here, clients are informed about the events that may lead to broken links and all required information (e.g., new resource locations) is communicated to them so they can fix affected links. This strategy was for example used in the Hyper-G system [17] where resource updates are propagated between document servers using a p-flood algorithm. It is also the strategy of Silk and Triplify discussed in Section 5. This method resolves broken links caused by A-C but requires that the content provider can observe and communicate such events.

Detect and Correct. The solution for the broken link problem proposed in this work falls mainly into this category that Ashman describes in her work as: “. . . the application using the link first checks the validity of the endpoint reference against the information, perhaps matching it with an expected value. If the validity test fails, an attempt to correct the link by relocating it may be made . . . ” [2].

As all other solutions in this category (cf. Section 5), we use heuristic methods to semi-automatically fix broken links of the types A-C.

Manual Edit/Search. In this category we summarize manual strategies for fixing broken links or re-finding missing link targets. This “solution strategy” is currently arguably among the most popular ones on the Web. First, content providers may manually update links in their contents (perhaps assisted by automatic link checking software like W3C’s link checker). Second, users may re-find the target resources of broken links e.g., by exploiting search engines or by manual URI manipulations.

3. DSNOTIFY

After having investigated possible solution strategies to deal with the broken link problem, we now present our own solution strategy. It is called DSNotify and is a generic change detection framework that assists human and machine actors in fixing broken links. It can be used as an add-on to existing applications that want to preserve link integrity in the data sets that are under their control (detect and correct strategy, see below). It can also be used to notify subscribed applications of changes in a set of data sources (notification strategy). Further, it can be set up as a service that automatically forwards requests to new resource locations of moved resources (indirection strategy).

DSNotify is not meant to be a service that monitors the whole Linked Data space but rather a light-weight component that can be tailored to application-specific needs and detects modifications in selected Linked Data sources.

A typical usage scenario is illustrated in Figure 2: an application hosts a data set Dsrc that contains links to a target data set Dtgt. The application consumes and integrates data from both data sets, and provides a view on this data (such as a Website) consumed by Web users. The application uses DSNotify to monitor Dtgt as it has no control over this target data set. DSNotify notifies the application (and possibly other subscribed actors) about the events occurring in Dtgt and the application can update and fix potentially broken links in the source data set.

Figure 2: DSNotify Usage Scenario.

3.1 General Approach

Our approach for fixing broken links is based on an indexing infrastructure. A monitor periodically accesses considered data sources (e.g., a Linked Data source), creates an item for each resource it encounters and extracts a feature vector from this item’s representations. The feature vector is stored together with the item’s URI in an item index (ii). The set of extracted features, their implementation, and their extractors are configurable. Feature vectors

⁶ For example the URI http://web.archive.org/web/19971011064403/http://www.archive.org/index.html references the 1997 version of the internet archive main page.
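The indexing step described in Section 3.1 can be pictured with a small Python sketch. It is a simplified illustration under assumptions that are not taken from DSNotify: a representation is modeled as a flat mapping from property names to string values, the extractor table `EXTRACTORS` and the functions `extract_features` and `monitor_cycle` are hypothetical names, and the item index is a plain dictionary keyed by URI.

```python
from typing import Callable, Dict, List, Mapping

Representation = Mapping[str, str]   # assumed shape: property name -> value
FeatureVector = Dict[str, str]
Extractor = Callable[[Representation], str]

# Hypothetical, configurable set of feature extractors.
EXTRACTORS: Dict[str, Extractor] = {
    "name": lambda rep: rep.get("foaf:name", "").strip().lower(),
    "birthdate": lambda rep: rep.get("dbpedia:birthdate", ""),
}


def extract_features(
    rep: Representation, extractors: Mapping[str, Extractor] = EXTRACTORS
) -> FeatureVector:
    """Apply every configured extractor to one representation."""
    return {name: extract(rep) for name, extract in extractors.items()}


def monitor_cycle(
    source: Mapping[str, Representation],
    item_index: Dict[str, FeatureVector],
    extractors: Mapping[str, Extractor] = EXTRACTORS,
) -> List[str]:
    """Index all resources found in `source`; return the URIs seen for the first time."""
    new_uris: List[str] = []
    for uri, rep in source.items():
        if uri not in item_index:
            new_uris.append(uri)
        item_index[uri] = extract_features(rep, extractors)
    return new_uris


ii: Dict[str, FeatureVector] = {}
source = {"ex:ada": {"foaf:name": " Ada Lovelace ", "dbpedia:birthdate": "1815-12-10"}}
print(monitor_cycle(source, ii), ii)
```

Keeping the extractors in a table separate from the monitoring loop is one way to reflect the configurability mentioned above; other designs are equally possible.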
are updated in the ii with every monitoring cycle resulting in possible update events logged by DSNotify.

Items corresponding to resources that are not found anymore are moved to another index, the removed item index (rii). After some timeout period, items are moved from the rii to a third index called the archived item index (aii) resulting in a remove event (cf. Definition 2).

Items in the ii and the rii are periodically considered by a housekeeper thread (a “move detector”) that compares their feature vectors and tries to identify possible successors for removed (real-world) items (cf. Definition 3). A plug-in heuristic is used for this comparison; in the default configuration a vector space model acting on the extracted and weighted feature vectors is used. The similarity measures for the features themselves are configurable; for character string features one could use, for instance, exact string matching or the Levenshtein distance. The similarity value calculated by the heuristic is compared against threshold values⁷ to determine whether an item is another item’s predecessor (resulting in a detected move event) or not (possibly resulting in a detected create event). The predecessors of the newly indexed items are moved to the aii and linked to the new corresponding items. This enables actors to query the DSNotify indices for actual locations of items.

The events detected by DSNotify are stored in a central event log. This log as well as the indices can be accessed via various interfaces, such as a Java API, an XML-RPC interface, and an HTTP interface.

3.2 Time-interval-based Blocking

The main task of DSNotify is the efficient and accurate matching of pairs of feature vectors representing the same real-world item at different points in time. As in record linkage and related problems (cf. Section 5), the number of such pairs grows quadratically with the number of considered items resulting in unacceptable computational effort. The reduction in the number of comparisons is called blocking and various approaches have been proposed in the past [25].

We have developed a time-interval-based blocking (TIBB) mechanism for an efficient and accurate reduction of the number of compared feature vector pairs. Our method includes only the feature vectors derived from items that were detected as being recently created or removed, blocking out the vast majority of the items in our indices. Reconsidering Definition 3, this means that we are allowing only small values for Δ. Thus, if x is the number of feature vectors stored in our system, n is the number of new items and r is the number of recently removed items with n + r ≤ x, then the number of comparisons in a single DSNotify housekeeping operation is n · r instead of x². It is intuitively clear that normally n and r are much smaller than x and therefore n · r ≪ x². The actual number of feature vector comparisons in a single housekeeper operation depends on the vitality of the monitored data source with respect to created, removed and moved items and on the frequency of housekeeping operations⁸. We have analyzed and confirmed this behavior in the evaluation of our system (see Section 4).

3.3 Monitoring and Housekeeping

The cooperation of monitor, housekeeper, and the indices is depicted in Figure 3. To simplify matters, we assume an empty dataset at the beginning. Then four items (A, B, C, D) are created before the initial DSNotify monitoring process starts at m0. The four items are detected and added to the item index (ii). Then a new item E is created, item A is removed and the items B and C are “moved” to a new location becoming items F and G respectively. At m1 the three items that are not found anymore by the monitor are “moved” to the removed item index (rii) and the new item is added to the ii. When the housekeeper is started for the first time at h1, it acts on the current indices and compares the recent new items (E, F, G) with the recently removed items (B, C, A). It does not include the “old” item D in its feature vector comparisons. The housekeeper detects B as a predecessor of F and C as a predecessor of G, moves B and C to the archived item index (aii) and links them to their successors. Between m1 and m2 a new item is created (H), two items (F, D) are removed and the item E is “moved” to item I. The monitor updates the indices accordingly at m2 and the subsequent housekeeping operation at h2 tries to find predecessors of the items H and I. But before this operation, the housekeeper recognizes that the retention period of item A in rii is longer than the timeout period and moves it to the aii. The housekeeper then detects E as a predecessor of I, moves it also to the aii and links it to I. Between m2 and m3 no events take place and the indices remain untouched by the monitor. At h3 the housekeeper recognizes the timeout of the items F and D and moves them to the aii leaving an empty rii.

Figure 3: Example timeline illustrating the main workflow of DSNotify. C_i, R_i and M_i,j denote create, remove and move events of items i and j. m_x and h_x denote monitoring and housekeeping operations respectively. The current index contents (ii, rii, aii) are shown in the grey boxes below the respective operation; the overall process is explained in the text.

3.4 Event Choices

As mentioned before, a threshold value (upperThreshold) is used to decide whether two feature vectors are similar enough to assume their corresponding items as a predecessor/successor pair. Furthermore, DSNotify uses a second threshold value (lowerThreshold) to decide whether two feature vectors are similar enough to be even considered as such a pair (“possible move candidates”, pmc). When none of the feature vectors considered for a possible move operation are similar enough (i.e., > upperThreshold), DSNotify stores all considered pairs of feature vectors with similarity values > lowerThreshold in a so-called event choice object. Event choices are representations of decisions that have to be made outside of DSNotify, for example by human actors or by other machine actors that can resort to additional data/knowledge. These external actors may access the log of event choice objects and send back their decisions about what feature vector pair (if any) corresponds to a predecessor/successor pair. DSNotify will now update its indices accordingly and send notifications to all subscribed actors. A detailed description of the overall housekeeping algorithm, the core of DSNotify, is presented in Algorithm 1.

    Data: Item indices ii, rii, aii
    Result: List of detected events L
    begin
        Move timed-out items from rii to aii;
        L ← ∅; PMC ← ∅;
        foreach ni ∈ ii.getRecentItems() do
            pmc ← ∅;
            foreach oi ∈ rii.getItems() do
                sim ← calculateSimilarity(oi, ni);
                if sim > lowerThreshold then
                    pmc ← pmc + {(oi, ni, sim)};
                end
            end
            if pmc = ∅ then
                L ← L + {(ni, ∅, create)};
            else
                PMC ← PMC + {pmc};
            end
        end
        foreach pmc ∈ PMC do
            if pmc ≠ ∅ then
                (oi_max, ni_max, sim_max) ← getElementWithMaxSim(pmc);
                if sim_max > upperThreshold then
                    L ← L + {(oi_max, ni_max, move)};
                    move oi_max to aii;
                    link oi_max to ni_max;
                    remove all elements from PMC where pmc.oi = oi_max;
                else
                    Issue an eventChoice for pmc;
                end
            end
        end
        return L;
    end

Algorithm 1: Central DSNotify housekeeping algorithm.

3.5 Item History and Event Log

As discussed above, DSNotify incrementally constructs three central data structures during its operation: (i) an event log containing all events detected by the system, (ii) a log containing all unresolved event choices and (iii) a linked structure of feature vectors constituting a history of the respective items. This latter structure is stored in the indices maintained by DSNotify. All three data structures can be accessed in various ways by agents that make use of DSNotify for fixing broken links as further described in [12].

As these data structures may grow indefinitely, a strategy for pruning them from time to time is required. Currently we rely on simple timeouts for removing old data items from these structures but this method can still result in unacceptable memory consumption when monitoring highly dynamic data sources. More advanced strategies are under consideration. Note that we consider particularly the feature vector history as a very valuable data structure as it allows ex post analysis of the evolution of items w.r.t. their location in a data set and the values of the indexed features.

4. EVALUATION

In the evaluation of our system we concentrated on two issues: first, we wanted to evaluate the system for its applicability for real-world Linked Data sources, and second, we wanted to analyze the influence of the housekeeping frequency on the overall effectiveness of the system.

We evaluated our system with datasets that we call eventsets. An eventset is a time-ordered set of events (cf. Definitions 2 and 3) that transforms a source into a target dataset. Thus, an eventset can be seen as the event history of a dataset. We have developed a simulator that can re-play such eventsets, interpreting the event timestamps with regard to a configurable duration of the whole simulation. Figure 4 depicts an overview of our evaluation approach.

Figure 4: The DSNotify evaluation approach. A simulator takes two datasets (Dsrc and Dtgt) and an eventset (Etest) as input and continuously updates a newly created observed dataset (Dobs). DSNotify monitors this dataset and creates a log of detected events (Edet). This log is compared to the eventset Etest to evaluate the system’s accuracy.

All experiments were carried out on a system using two Intel Xeon CPUs with 2.66 GHz each and 8 GB of RAM. The threshold values used were 0.8 (upperThreshold) and 0.3 (lowerThreshold). We have created two types of eventsets from existing datasets for our evaluation: the iimb-eventsets and the dbpedia-eventset⁹.

4.1 The IIMB Eventsets

The iimb-eventsets are derived from the ISLab Instance Matching Benchmark [11] which contains one (source) dataset containing 222 instances and 37 target datasets that vary in number and type of introduced modifications to the instance data. It is the goal of instance matching tools to match the resources in the source dataset with the resources in the respective target dataset by comparing their instance data. The benchmark contains an alignment file describing

⁷ Note that using threshold values for the determination of non-matches, possible-matches and matches was already proposed by Fellegi and Sunter in 1969 [10].

⁸ As housekeeping and monitoring are separate operations in DSNotify, this number depends also on the monitoring frequency when lower than the housekeeping frequency.

⁹ All data sets are available at http://dsnotify.org/.
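The housekeeping procedure of Algorithm 1, with the thresholds 0.8 and 0.3 reported in Section 4, can be sketched in Python as follows. This is an illustrative approximation and not the DSNotify code: the indices are plain dictionaries, the timeout handling of the first step is omitted, events carry no timestamp, and `shared_value_ratio` is a hypothetical stand-in for the configurable similarity heuristic (the paper's default is a vector space model over weighted features). All function and parameter names are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

FeatureVector = Dict[str, str]
Candidate = Tuple[str, str, float]       # (old item, new item, similarity)
Event = Tuple[str, Optional[str], str]   # tuple layout follows Algorithm 1


def shared_value_ratio(a: FeatureVector, b: FeatureVector) -> float:
    """Stand-in similarity: fraction of features with identical values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)


def housekeeping(
    recent_items: Dict[str, FeatureVector],    # recently added to the ii
    removed_items: Dict[str, FeatureVector],   # the rii, modified in place
    archived_items: Dict[str, FeatureVector],  # the aii, modified in place
    successors: Dict[str, str],                # links old item -> new item
    similarity: Callable[[FeatureVector, FeatureVector], float] = shared_value_ratio,
    upper: float = 0.8,
    lower: float = 0.3,
) -> Tuple[List[Event], List[List[Candidate]]]:
    events: List[Event] = []
    event_choices: List[List[Candidate]] = []
    all_candidates: List[List[Candidate]] = []

    # Only n * r comparisons: recent items against recently removed items.
    for ni, new_fv in recent_items.items():
        pmc = [(oi, ni, similarity(old_fv, new_fv)) for oi, old_fv in removed_items.items()]
        pmc = [c for c in pmc if c[2] > lower]
        if not pmc:
            events.append((ni, None, "create"))
        else:
            all_candidates.append(pmc)

    for pmc in all_candidates:
        # Drop candidates whose old item has already been matched and archived.
        pmc = [c for c in pmc if c[0] in removed_items]
        if not pmc:
            continue
        oi, ni, sim = max(pmc, key=lambda c: c[2])
        if sim > upper:
            events.append((oi, ni, "move"))
            archived_items[oi] = removed_items.pop(oi)
            successors[oi] = ni
        else:
            event_choices.append(pmc)  # decision is left to an external actor
    return events, event_choices


rii = {"B": {"name": "x", "tag": "1"}, "A": {"name": "q", "tag": "9"}}
recent = {"F": {"name": "x", "tag": "1"}, "E": {"name": "z", "tag": "7"},
          "G": {"name": "q", "tag": "0"}}
aii: Dict[str, FeatureVector] = {}
links: Dict[str, str] = {}
print(housekeeping(recent, rii, aii, links))
```

In the toy data, F is matched to B as a move, E is reported as created, and the pair (A, G) falls between the two thresholds and therefore ends up in an event choice.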
what resources correspond to each other that can be used to measure the effectiveness of such tools. We used this alignment information to derive 10 eventsets, corresponding to the first 10 iimb target datasets, each containing 222 move events. The first 10 iimb datasets introduce increasing numbers of value transformations like typographical errors to the instance data. We used random timestamps for the events (as this data is not available in this benchmark) that resulted in an equal distribution of events over the eventset duration.

We have simulated these eventsets, monitored the changing dataset with DSNotify and measured precision and recall of the reported events with respect to the eventset information. For a useful feature selection we first calculated the entropy of the properties with a coverage > 10%, i.e., only properties were considered where at least 10% of the resources had instance values. The results are summarized in Table 3. As the goal of the evaluation was not to optimize the resulting precision/recall values but to analyze our blocking approach, we consequently chose the properties tbox:cogito-tag and tbox:cogito-domain for the evaluation because they have good coverage but comparatively small entropy in this dataset. We calculated the entropy as shown in Equation 1 and normalized it by dividing by ln(n).

    H(p) = − Σ_{i=1}^{n} p_i ln(p_i)        (1)

Name                         Coverage   H       Hnorm
tbox:cogito-Name             0.995      5.378   0.995
tbox:cogito-first sentence   0.991      5.354   0.991
tbox:cogito-tag              0.986      1.084   0.201
tbox:cogito-domain           0.982      3.129   0.579
tbox:wikipedia-name          0.333      1.801   0.333
tbox:wikipedia-birthdate     0.225      1.217   0.225
tbox:wikipedia-location      0.185      0.992   0.184
tbox:wikipedia-birthplace    0.104      0.553   0.102
Namespace prefix tbox: <http://islab.dico.unimi.it/iimb/tbox.owl#>

Table 3: Coverage, entropy and normalized entropy of all properties in the iimb datasets with a coverage > 10%. The selected properties are written in bold.

DSNotify was configured to compare these properties using the Levenshtein distance and both properties contributed equally (weight = 1.0) to the corresponding feature vector comparison. The simulation was configured to run for 60 seconds, thus the monitored datasets changed with an average rate of 222/60 = 3.7 events/s.

As stated before, the goal of this evaluation was to demonstrate the influence of the housekeeping frequency on the overall effectiveness of the system. For this, we repeated the experiment with varying housekeeping intervals of 1s, 3s, 10s, 20s, 30s (corresponding to an average rate of 3.7, 11.1, 37.0, 74.0, 111.0 events/housekeeping cycle) and calculated the F1-measure (the harmonic mean of precision and recall) for each dataset (Figure 5).

Figure 5: Influence of the housekeeping interval (hki) on the F1-measure in the iimb-eventsets evaluations. (x-axis: iimb-eventset number, 1 to 10; y-axis: F1-measure (%); one curve each for hki = 1s, 3s, 10s, 20s, 30s.)

Results. The results clearly demonstrate the expected decrease in accuracy when increasing the length of the housekeeping intervals, as this leads to more feature vector comparisons and therefore more possibilities to make the wrong decisions. Furthermore, Figure 5 depicts the decreasing accuracy with the increasing dataset number. This is also expected as the benchmark introduces more value transformations with higher dataset numbers, although there are two outliers for the datasets 7 and 10.

4.2 The DBpedia Persondata Eventset

In order to evaluate our approach with real-world data we have created a dbpedia-eventset that was derived from the person datasets of the DBpedia snapshots 3.2 and 3.3¹⁰. The raw persondata datasets contain 20,284 (version 3.2) and 29,498 (version 3.3) subjects typed as foaf:Person each having three properties foaf:name, foaf:surname and foaf:givenname. Naturally, these properties are very well suited to uniquely identify persons as also confirmed by their high entropy values (cf. Table 4). For the same reasons as already discussed for the iimb datasets an evaluation with only these properties would not clearly demonstrate our approach. Therefore we enriched both raw data sets with four properties (see Table 4) from the respective DBpedia Mapping-based Infobox Extraction datasets [7] with smaller coverage and entropy values.

We derived the dbpedia-eventset by comparing both datasets for created, removed or updated resources. We retrieved the creation and removal dates for the events from Wikipedia as these data are not included in the DBpedia datasets. For the update events we used random dates. Furthermore, we used the DBpedia redirect dataset to identify and generate move events. This dataset contains redirection information derived from Wikipedia’s redirect pages that are automatically created when a Wikipedia article is renamed. The dates for these events were also retrieved from Wikipedia.

The resulting eventset contained 3810 create, 230 remove, 4161 update and 179 move events, summing up to 8380 events¹¹. The histogram of the eventset depicted in Figure 6 shows a high peak in bin 14. About a quarter of all events

¹⁰ The snapshots contain a subset of all instances of type foaf:Person and can be downloaded from http://dbpedia.org/ (filename: persondata_en.nt).

¹¹ Another 5666 events were excluded from the eventset as they resulted from inaccuracies in the DBpedia datasets. For example there are some items in the 3.2 snapshot that are not part of the 3.3 snapshot but were not removed from Wikipedia (a prominent example is the resource http://dbpedia.org/resource/Tim_Berners-Lee). Furthermore several items from version 3.3 were not included in version 3.2 although the creation date of the corresponding Wikipedia article is before the creation date of the 3.2 snapshot. We decided to generally exclude such items.
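The coverage and entropy columns of Table 3 can be illustrated with a short Python sketch of Equation 1. The paper does not spell out how p_i is estimated, so the sketch makes an assumption: p_i is the number of resources carrying the i-th distinct value divided by n, where n is the total number of resources, and resources without a value contribute nothing to the sum. The function name `property_statistics` and the single-valued input format are hypothetical. Under this assumption, a made-up property with 221 distinct values among 222 resources yields H ≈ 5.378 and Hnorm ≈ 0.995, the same figures as the tbox:cogito-Name row.

```python
import math
from collections import Counter
from typing import Optional, Sequence, Tuple


def property_statistics(values: Sequence[Optional[str]]) -> Tuple[float, float, float]:
    """Return (coverage, H, H_norm) for one property.

    `values` holds one entry per resource; None marks a missing value.
    Assumption: p_i = count_i / n with n = number of resources.
    """
    n = len(values)
    if n == 0:
        return 0.0, 0.0, 0.0
    present = [v for v in values if v is not None]
    counts = Counter(present)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    normalized = entropy / math.log(n) if n > 1 else 0.0
    return len(present) / n, entropy, normalized


# Hypothetical property: 221 distinct values among 222 resources.
values = [f"value-{i}" for i in range(221)] + [None]
coverage, h, h_norm = property_statistics(values)
print(round(coverage, 3), round(h, 3), round(h_norm, 3))
```

A property with few distinct values, such as a coarse tag, gives a much lower entropy at similar coverage, which is the reason given above for selecting tbox:cogito-tag and tbox:cogito-domain.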
Name                      Coverage    H            Hnorm
foaf:name (d) *           1.00/1.00   9.91/10.28   1.00/1.00
foaf:surname (d)          1.00/1.00   9.11/9.25    0.92/0.90
foaf:givenname (d) *      1.00/1.00   8.23/8.52    0.83/0.83
dbpedia:birthdate (d) *   0.60/0.60   5.84/5.96    0.59/0.58
dbpedia:birthplace (o) *  0.48/0.47   4.24/4.32    0.43/0.42
dbpedia:height (d) *      0.10/0.08   0.65/0.51    0.07/0.05
dbpedia:draftyear (d) *   0.01/0.01   0.06/0.05    0.01/0.01
Namespace prefix dbpedia: <http://dbpedia.org/ontology/>
Namespace prefix foaf: <http://xmlns.com/foaf/0.1/>

Table 4: Coverage, type, entropy and normalized entropy of all properties in the enriched dbpedia 3.2/3.3 persondata sets. The selected properties are marked with an asterisk. Symbols: object property (o), datatype property (d).

Figure 6: Histogram of the distribution of events in the dbpedia-eventset. A bin corresponds to a time interval of about 11 days. [Plot omitted; x-axis: Bins (1 to 20), y-axis: Events.]

occurred within this time interval. We think that such event peaks are not unusual in real-world data and are interested in how our application deals with such situations.

We re-played the eventsets, monitored the changing dataset with DSNotify and measured precision and recall of the reported events with respect to the eventset information (cf. Figure 4). We repeated the simulation seven times, varying the number of average events per housekeeping interval, and calculated the F1-measure of the reported move events¹².

For each simulation, DSNotify was configured to index only one of the six selected properties in Table 4. To calculate the similarity between datatype properties, we used the Levenshtein distance. For object properties we used a simple similarity function that counted the number of common property values (i.e., resources) in both resources that are compared and divided it by the number of total values.

Furthermore, we ran the simulations indexing only one cumulative attribute, an rdf-hash. This hash function calculates an MD5 hashsum over all string-serialized properties of a resource and the corresponding similarity function returns 1.0 if the hash-sums are equal or 0.0 otherwise. Thus this rdf-hash is sensitive to any modification in a resource's instance data.

Additionally we evaluated a combination of the dbpedia birthdate and birthplace properties, each contributing with equal weight to the weighted feature vector. The coverage of resources that had values for at least one of these attributes was 65% in the 3.2 snapshot and 62% in the 3.3 snapshot.

¹² We fixed the housekeeping period for this experiment to 30s and varied the simulation length from 3600 to 56.25s. Thus the event rates varied between 2.3 and 149.0 events/second or 35.2 and 2250.1 events/housekeeping interval respectively. For these calculations we considered only move, remove and create events (i.e., 4219 events) from the eventset as only these influence the accuracy of the algorithm.

Results. The results, depicted in Figure 7, show a fast saturation of the F1-measure with a decreasing number of events per housekeeping cycle. This clearly confirms the findings from our iimb evaluation. The accuracy of DSNotify increases with increasing housekeeping frequencies or decreasing event rates. From a pragmatic viewpoint, this means a tradeoff between the costs for monitoring and housekeeping operations (computational effort, network transmission costs, etc.) and accuracy. The curve for the simple rdf-hash function is surprisingly good, stabilizing at about 80% for the F1-measure. This can be attributed mainly to the high precision rates that are expected from such a function. The curve for the combined properties shows maximum values for the F1-measure of about 60%.

Figure 7: Influence of the number of events per housekeeping cycle on the F1-measure of detected move events in the dbpedia-eventset evaluation. [Plot omitted; x-axis: Log10(average events/housekeeping cycle), y-axis: F1-measure (%); one curve each for dsnotify:rdfhash, foaf:name, foaf:givenname, dbpedia:birthdate+dbpedia:birthplace, dbpedia:birthdate, dbpedia:birthplace, dbpedia:height and dbpedia:draftyear.]

The measured precision and recall rates are depicted in Figure 8. Both measures show a decrease with increasing numbers of events per housekeeping cycle. For the precision this can be observed mainly for low-entropy properties whereas the recall measures for all properties are affected.

Figure 8: Influence of the number of events per housekeeping cycle on the measured precision and recall of detected move events in the dbpedia-eventset evaluation. [Plots omitted; two panels, Precision (%) and Recall (%), each over Log10(average events/housekeeping cycle), with the same curves as in Figure 7.]

It is, again, important to state that our evaluation did not aim to maximize the accuracy of the system for these particular eventsets but rather to reveal the characteristics of our time-interval-based blocking approach. It shows that we can achieve good results even for attributes with low entropy when choosing an appropriate housekeeping frequency.

5. RELATED WORK

Besides the already cited works, our research is also closely related to the areas of record linkage, a well-researched problem in the database domain, and instance matching, which is related to ontology alignment and schema matching.

Record linkage¹³ is concerned with finding pairs of data records (from one or multiple datasets) that refer to (describe) the same real-world entity [25, 9]. This information is useful, e.g., for joining different relations or for duplicate detection [9]. Record linkage is trivial where entity-unique identifiers (such as ISBN numbers or OWL inverse-functional properties like foaf:mbox) are available. When such additional identifiers are missing, tools often rely on probabilistic distance metrics and machine learning methods (e.g., HMMs, TF-IDF, SVMs). A comprehensive survey of record linkage research can be found in [9]. The instance matching problem is closely related to record linkage but requires certain specific methods when dealing with structural and logical heterogeneities as pointed out in [11].

¹³ Record linkage is also known under many other names, such as object identification, data cleaning, entity resolution, coreference resolution or object consolidation [25, 14, 9].

Furthermore, our work is closely related to current work in Semantic Web and in hypertext research:

In [20], Phelps and Wilensky propose so-called robust hyperlinks based on URI references that are decorated with a small¹⁴ lexical signature composed of terms extracted from the referenced document. When a target document is not found, this lexical signature can be used to re-find the resource using regular Web search engines. A disadvantage of the robust hyperlink approach is that it requires existing URI references to be changed, which is not the case with our approach. Furthermore, it is unclear how to extend this method to non-textual resources whereas our feature vector based approach could in principle be combined with most existing multimedia feature extraction solutions.

¹⁴ The authors found that five terms are enough to uniquely identify a Web resource in virtually all cases.

In [14], an algorithm for record linkage (object consolidation) based on inverse-functional properties is described. The approach groups instances with matching IFP values together and determines canonical URIs for identification of the real-world entities described by those instances. Naturally, IFPs can be used efficiently for record linkage problems but are unfortunately unavailable in many real-world datasets. In DSNotify, IFPs can be exploited by simply using them as a single feature in a feature vector.

The Silk framework [24] aims mainly at the automatic, heuristics-based discovery of semantic relationships between resources in the Web of Data. These heuristics may be configured using an XML-based links specification language (Silk-LSL). In order to react to changes in the interlinked datasets, the authors propose a new SOAP-based protocol (Web of Data - Link Maintenance Protocol, WOD-LMP¹⁵) for synchronizing and maintaining links between LD sources.

¹⁵ http://www4.wiwiss.fu-berlin.de/bizer/silk/wodlmp/

Triplify [3] is a system that exposes data from relational databases as Linked Data. It is based on mapping HTTP requests to RDBMS queries and publishing the result data as RDF. Triplify also provides a Linked Data Update Log that groups updates to an RDF model within a certain timespan into nested collections accessible via HTTP. In principle, this solution provides a scalable approach for logging events that occurred in a Linked Data source. However, this notification approach requires clients to regularly poll this log and the data source to capture all these events (if possible). Furthermore, the current specification of the vocabulary¹⁶ used for describing the update events does not contain moved or created events but only “Update” and “Deletion”.

¹⁶ http://triplify.org/vocabulary/update

Peridot is a tool developed by IBM for automatically fixing links in the Web. It is based on the patents [4, 5] and the basic idea is to calculate fingerprints of Web documents and repair broken links based on their similarity. The method differs from DSNotify in that we consider the structured nature of Linked Data and support domain-specific, configurable similarity heuristics on a property level, which allows more advanced comparison methods. Furthermore, DSNotify introduces the described time-interval-based blocking approach and also detects create, remove and update events.

In [19], Morishima et al. describe Link Integrity Management tools that focus on fixing broken links in the Web that occurred due to moved link targets (type B, cf. Section 2.2). Similar to DSNotify, they have developed a tool called PageChaser that uses a heuristic approach to find missing resources based on indexed information (namely URIs, page content and redirect information). An explorer component, which makes use of search engines, redirect information, and so-called link authorities (Web pages containing well-maintained links), is used to find possible URIs of moved resources. They also provide a heuristics-based method to calculate such link authority pages. A major difference to our approach is that PageChaser was built for fixing links in the (human) Web exploiting some of its characteristics (like locality or link authorities), while DSNotify aims at becoming a general framework for assisting actors in fixing links based on domain-specific content features.

In a recent paper, Van de Sompel et al. discuss a protocol for time-based content negotiation that can be used to access archived representations (“Mementos”) of resources [22]. By this, the protocol enables a kind of time-travel when accessing archived resources (such archives would fall into the Versioned and Static Collections category in Table 2). DSNotify could be used to build such archives if a monitor implementation were used that stored not only a feature vector derived from a resource representation but also the representation data itself.

6. CONCLUSIONS AND FUTURE WORK

We presented the broken link problem in the context of the Web of Data as a special case of the instance matching problem and showed the feasibility of a time-interval-based blocking approach for systems that aim at detecting and fixing such broken links. Reconsidering the solution strategies for the broken link problem discussed in Section 2.2, we can state that DSNotify can actually be used in multiple ways, including: (1) as a detect and correct module in existing software, (2) as a standalone notification service that keeps subscribed clients informed about changes in a data source, (3) as an indirection service that automatically forwards requests for moved resources to their new location.
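The third usage mode, an indirection service that forwards requests for moved resources, can be illustrated with a small sketch. It is not DSNotify's actual interface: the class IndirectionIndex and its methods record_move and resolve are hypothetical names chosen for this example, and the example.org URIs are placeholders. The sketch assumes that move events arrive as (old URI, new URI) pairs in the order in which they were detected.

```python
from typing import Dict


class IndirectionIndex:
    """Hypothetical lookup table built from detected move events."""

    def __init__(self) -> None:
        # Maps the old URI of a moved resource to the URI it moved to.
        self._moved_to: Dict[str, str] = {}

    def record_move(self, old_uri: str, new_uri: str) -> None:
        """Store one detected move event (old location -> new location)."""
        # new_uri now hosts a live resource, so an older forwarding entry
        # that starts at new_uri is stale. Dropping it keeps the table
        # free of cycles, e.g. when a resource is moved back.
        self._moved_to.pop(new_uri, None)
        self._moved_to[old_uri] = new_uri

    def resolve(self, uri: str) -> str:
        """Follow recorded moves from `uri` to the latest known location.

        Returns `uri` unchanged if no move was recorded for it.
        """
        current = uri
        while current in self._moved_to:
            current = self._moved_to[current]
        return current


index = IndirectionIndex()
index.record_move("http://example.org/a", "http://example.org/b")
index.record_move("http://example.org/b", "http://example.org/c")
print(index.resolve("http://example.org/a"))  # prints http://example.org/c
```

A service built around such a table could, for instance, answer a request for an old URI with an HTTP redirect to the resolved location; how the underlying move events are detected is the subject of the preceding sections.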
The various interfaces for accessing the data structures built by DSNotify should facilitate the integration with existing applications. Our approach is by design a semi-automatic solution that is capable of integrating human intelligence in the sense of human-based computation. We plan to further elaborate on this issue. However, DSNotify cannot “cure” the Web of Data of broken links. It may rather be used as an add-on for particular data providers that want to keep a high level of link integrity in their data.

The flexibility of our tool is rooted in its generic nature and its customizability. Consequently, the development and evaluation of additional monitors, features and extractors, heuristics and indices is one part of our future work. In particular we want to research feasible methods for automatic feature selection as well as for the determination of optimal monitoring/housekeeping periods as these are the key parameters for achieving good accuracy with our tool. Further, we plan to analyze the applicability of DSNotify to other domains like the (document) Web or the file system.

7. ACKNOWLEDGMENTS

The authors would like to thank Martin Romauch for his valuable comments. The work has partly been supported by the European Commission as part of the eContentplus program (EuropeanaConnect).

8. REFERENCES

[1] W. Y. Arms. Uniform resource names: handles, purls, and digital object identifiers. Commun. ACM, 44(5):68, 2001.
[2] H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3), 2000.
[3] S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumüller. Triplify: light-weight linked data publication from relational databases. In WWW ’09, New York, NY, USA, 2009. ACM.
[4] M. Beynon and A. Flegg. Hypertext request integrity and user experience, 2004. US Patent 0267726A1.
[5] M. Beynon and A. Flegg. Guaranteeing hypertext link integrity, 2007. US Patent 7290131 B2.
[6] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 2009.
[7] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, July 2009.
[8] H. C. Davis. Referential integrity of links in open hypermedia systems. In Proceedings of the 9th ACM conference on Hypertext and hypermedia, pages 207–216, New York, NY, USA, 1998. ACM.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19(1):1–16, 2007.
[10] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[11] A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a benchmark for instance matching. In Ontology Matching (OM 2008), volume 431 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
[12] B. Haslhofer and N. Popitsch. DSNotify – detecting and fixing broken links in linked data sets. In 8th International Workshop on Web Semantics (WebS 09), co-located with DEXA 2009, 2009.
[13] M. Hepp, K. Siorpaes, and D. Bachlechner. Harvesting wiki consensus: Using Wikipedia entries as vocabulary for knowledge management. IEEE Internet Computing, 11(5):54–65, 2007.
[14] A. Hogan, A. Harth, and S. Decker. Performing object consolidation on the semantic web data graph. In Proceedings of the 1st I3: Identity, Identifiers, Identification Workshop, 2007.
[15] D. Ingham, S. Caughey, and M. Little. Fixing the “broken-link” problem: the W3Objects approach. Comput. Netw. ISDN Syst., 28(7-11):1255–1268, 1996.
[16] I. Jacobs and N. Walsh. Architecture of the World Wide Web, volume one. Technical report, W3C, December 2004. Retrieved May 7, 2009.
[17] F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84–104, 1995.
[18] S. Lawrence, D. M. Pennock, G. W. Flake, R. Krovetz, F. M. Coetzee, E. Glover, F. A. Nielsen, A. Kruger, and C. L. Giles. Persistence of web references in scientific research. Computer, 34(2):26–31, 2001.
[19] A. Morishima, A. Nakamizo, T. Iida, S. Sugimoto, and H. Kitagawa. Bringing your dead links back to life: a comprehensive approach and lessons learned. In HT ’09: Proceedings of the 20th ACM conference on Hypertext and hypermedia, pages 15–24, New York, NY, USA, 2009. ACM.
[20] T. A. Phelps and R. Wilensky. Robust hyperlinks cost just five words each. Technical Report UCB/CSD-00-1091, EECS Department, University of California, Berkeley, 2000.
[21] D. S. H. Rosenthal and V. Reich. Permanent web publishing. In ATEC ’00: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 129–140. USENIX Association, 2000.
[22] H. Van de Sompel, M. L. Nelson, R. Sanderson, L. L. Balakireva, S. Ainsworth, and H. Shankar. Memento: Time travel for the web. arXiv preprint arXiv:0911.1112, November 2009.
[23] L. Veiga and P. Ferreira. Repweb: replicated web with referential integrity. In SAC ’03: Proceedings of the 2003 ACM symposium on Applied computing, pages 1206–1211, New York, NY, USA, 2003. ACM.
[24] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In 8th International Semantic Web Conference, 2009.
[25] W. E. Winkler. Overview of record linkage and current research directions. Technical report, U.S. Bureau of the Census, 2006.