Crowdsourcing-enabled Linked Data management architecture

A semantically enabled architecture for
crowdsourced Linked Data management
Elena Simperl,1 Maribel Acosta,1 Barry Norton2
1Institute AIFB, Karlsruhe Institute of Technology, Germany
2Ontotext AD, Bulgaria
Institute of Applied Informatics and Formal Description Methods (AIFB)

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association www.kit.edu

Background: What is Linked Data?
Linked Data: set of best practices
to publish and connect structured
data on the Web.
URIs to identify entities and
concepts in the world
HTTP to access and retrieve
resources and descriptions of
these resources
RDF as generic graph-based data
model to structure and link data
Taken together Linked Data is
said to form a ‘cloud’ of shared
references and vocabularies.
Query language: SPARQL.

http://linkeddata.org/faq
07.06.2012, CrowdSearch 2012 - A semantically enabled architecture for crowdsourced Linked Data management
Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB)

Background: Why Linked Data?
Data.gov & public sector information: more
transparency and accountability in
governance

BBC & media: added value of
content through interlinking

Google, Yahoo, Bing & schema.org:
enhanced search


Outline

1 • Motivation

2 • Our Approach

3 • Extensions to VoID and SPARQL

4 • Crowdsourced query processing tasks

5 • Advantages

6 • Challenges


1. Motivation
User Query: Give me the German names of all commercial
airports in Baden-Württemberg, ordered by their most
informative description.

„Retrieve the labels in German of commercial airports located
in Baden-Württemberg, ordered by the better human-readable
description of the airport given in the comment“.

This query cannot be optimally answered automatically:
Incorrect/missing classification of entities (e.g. classification as
airports instead of commercial airports).
Missing information in data sets (e.g. German labels).
It is not possible to optimally perform subjective operations (e.g.
comparisons of pictures or NL comments).

1. Motivation
„Retrieve the labels in German of commercial airports
located in Baden-Württemberg, ordered by the better human-
readable description of the airport given in the comment“.

In order to answer the query as intended:
Classification of airports as commercial airports.
Identity resolution of places (Baden-Württemberg).
Translation of the labels of the airports.
Ordering of the comments by a subjective comparison.


1. Motivation
„Retrieve the labels in German of commercial airports
located in Baden-Württemberg, ordered by the better human-
readable description of the airport given in the comment“.

SPARQL Query:
SELECT ?label WHERE {
  ?x a metar:CommercialHubAirport ;      # 1 Classification
     rdfs:label ?label ;
     rdfs:comment ?comment .
  ?x geonames:parentFeature ?z .         # 2 Identity Resolution
  ?z owl:sameAs <http://dbpedia.org/resource/Baden-Wuerttemberg> .
  FILTER (LANG(?label) = "de")           # 3 Missing Information
} ORDER BY CROWD(?comment, "Better description of %x")   # 4 Ordering


1. Motivation: Our Aim
A SPARQL query engine able to process queries using a
seamless combination of automatic query processing and
crowdsourcing.

[Architecture diagram. Labels: Query, Results; Mediator; SPARQL query engine (Query parsing, Query optimization, Query execution); Crowdsourced query processing (Task design, UI generation); four Wrappers.]


2. Our Approach

Parser

Decomposes the input query.

Selects the data sets that should be
accessed to produce answers.

Rewrites the query into the internal
structures.


2. Our Approach

Optimizer

DB statistics and crowdsourcing
statistics: estimated time to completion
and other information about the
performance (quality, cost) of the crowd.

Traditional database optimization
techniques are implemented.

Determines which parts of the query
should be solved by human input: VoID
and SPARQL extensions.

Generates logical and physical plans.
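
As a rough illustration of how the optimizer might decide which parts of a query need human input, the sketch below splits a query's triple patterns into an automatic group and a crowd group. The tuple encoding of patterns, the function name partition_patterns, and the routing rule (a type assertion goes to the crowd when its class is a declared crowd class) are assumptions made for this example, not the optimizer described in the slides.

```python
# Hypothetical sketch: route triple patterns to the crowd or to automatic processing.
# Patterns are (subject, predicate, object) tuples of strings; this encoding is assumed.

RDF_TYPE = "rdf:type"

def partition_patterns(patterns, crowd_classes):
    """Split patterns into (automatic, crowd) lists.

    A pattern goes to the crowd when it is a type assertion whose class
    appears in crowd_classes (an assumed, simplified routing rule).
    """
    automatic, crowd = [], []
    for s, p, o in patterns:
        if p == RDF_TYPE and o in crowd_classes:
            crowd.append((s, p, o))
        else:
            automatic.append((s, p, o))
    return automatic, crowd

patterns = [
    ("?x", "rdf:type", "metar:CommercialHubAirport"),
    ("?x", "rdfs:label", "?label"),
    ("?x", "geonames:parentFeature", "?z"),
]
automatic, crowd = partition_patterns(patterns, {"metar:CommercialHubAirport"})
```

A real optimizer would also weigh the crowd statistics (time, quality, cost) before choosing a plan; this sketch covers only the routing decision.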

2. Our Approach

Executor

Implements physical operators.

Invokes crowdsourcing component:
Creates tasks.
Generates UI.

Infers facts automatically.

Executes query against Linked Data:
computational tasks.

Incorporates results from the human
input.

3. Extensions to VoID and SPARQL
The RDF-based schema used to describe data sets is VoID
(Vocabulary of Interlinked Datasets).

Common VoID terms: void:Dataset,
void:inDataset, void:Linkset, void:linkPredicate,
void:target.

VoID extensions:
Automatic interlinking of data sets
CrowdClass
CrowdProperty

Automatic interlinking of data sets

Example - Specification of Data Sets:

:METAR rdf:type void:Dataset .
:Geonames rdf:type void:Dataset .

:METAR2Geonames rdf:type void:Linkset ;
    void:linkPredicate owl:sameAs ;
    void:target :METAR ;
    void:target :Geonames .

[Diagram: METAR and Geonames connected by owl:sameAs links.]



CrowdClass
- Specifies which entities of a data set can be crowdsourced.
- All subclasses of a crowdClass are also (implicitly) defined
as crowdsourced entities.

Example:
metar:Airport void:inDataset :METAR .
metar:CommercialHubAirport void:inDataset :METAR ;
    rdfs:subClassOf metar:Airport .
metar:Airport rdf:type void:crowdClass .
metar:CommercialHubAirport rdf:type void:crowdClass .
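
The implicit rule for subclasses can be sketched as a small fixpoint computation. The dictionary encoding of rdfs:subClassOf statements, the function name expand_crowd_classes, and the metar:Station superclass in the sample data are assumptions for illustration only.

```python
# Hypothetical sketch: expand declared crowd classes to include all their subclasses.

def expand_crowd_classes(declared, subclass_of):
    """Return declared crowd classes plus every direct or indirect subclass.

    subclass_of maps a class to the set of its direct superclasses
    (mirroring rdfs:subClassOf statements).
    """
    result = set(declared)
    changed = True
    while changed:  # iterate until no new subclass is discovered
        changed = False
        for cls, supers in subclass_of.items():
            if cls not in result and supers & result:
                result.add(cls)
                changed = True
    return result

# Sample hierarchy (assumed): CommercialHubAirport < Airport < Station.
subclass_of = {
    "metar:CommercialHubAirport": {"metar:Airport"},
    "metar:Airport": {"metar:Station"},
}
crowd_classes = expand_crowd_classes({"metar:Airport"}, subclass_of)
```

Here only metar:Airport is declared, yet metar:CommercialHubAirport ends up in the result, while the superclass metar:Station does not.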


RDF data can be queried using the language SPARQL.

Common SPARQL operators: join, union, optional,
filter, order by.

Properties related to general ontology languages such as
OWL are treated as extensions of SPARQL operators,
and are modeled in our architecture as tasks.


4. Tasks

Formal, declarative description of the data and
tasks using SPARQL patterns as a basis for the
automatic design of HITs.

Identity resolution

Missing information

Ontological classification

Ordering (new operator)


4.1. Ontological Classification
It is not always possible to automatically infer classification
from the properties.
Example: Retrieve the names (labels) of METAR stations that
correspond to commercial airports.

SELECT ?label WHERE {
?station a metar:CommercialHubAirport;
rdfs:label ?label .}

Input: {?station a metar:Station;
rdfs:label ?label;
wgs84:lat ?lat;
wgs84:long ?long}

Output: {?station a ?type.
?type rdfs:subClassOf metar:Station}
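
To make the Input and Output patterns more concrete, the following sketch turns one set of input bindings into a multiple-choice task and converts a worker's answer back into a triple. Everything here is hypothetical: the function names, the dictionary layout of a HIT, and the placeholder station metar:StationA with its label and coordinates are invented for the example and do not come from the METAR data set.

```python
# Hypothetical sketch: build a classification HIT from query bindings.

def build_classification_hit(binding, candidate_types):
    """Return a dict describing a multiple-choice task for one station."""
    question = (
        f"Which type best describes '{binding['label']}' "
        f"(lat {binding['lat']}, long {binding['long']})?"
    )
    return {
        "subject": binding["station"],
        "question": question,
        "options": sorted(candidate_types),
    }

def answer_to_triple(hit, chosen_type):
    """Turn a worker's answer into an (s, p, o) triple matching the Output pattern."""
    if chosen_type not in hit["options"]:
        raise ValueError("answer is not one of the offered types")
    return (hit["subject"], "rdf:type", chosen_type)

# Placeholder bindings (invented values, for illustration only).
hit = build_classification_hit(
    {"station": "metar:StationA", "label": "Example Airport", "lat": 48.0, "long": 9.0},
    {"metar:CommercialHubAirport", "metar:Airport"},
)
triple = answer_to_triple(hit, "metar:CommercialHubAirport")
```

The candidate types would presumably be the subclasses of metar:Station, as the Output pattern requires.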

4.2. Ordering
Orderings defined via less straightforward built-ins; for
instance, the ordering of pictorial representations of entities.
SPARQL extension: ORDER BY CROWD
Example: Retrieve all airports and their pictures, with the pictures
ordered by how representative each image is of the given airport.

SELECT ?airport ?picture WHERE {
?airport a metar:Airport;
foaf:depiction ?picture .
} ORDER BY CROWD(?picture,
"Most representative image for %airport")

Input: {?airport foaf:depiction ?x, ?y}

Output: {{(?x ?y) a rdf:List} UNION {(?y ?x) a rdf:List}}
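
Each crowd answer to the task above is a single pairwise preference, so the engine still has to combine many such answers into one ordering. The slides do not say how this is done; the sketch below uses simple win counting as a stand-in aggregation rule. The function name order_by_crowd and the sample picture identifiers are assumptions.

```python
# Hypothetical sketch: aggregate pairwise crowd preferences by counting wins.
from collections import Counter

def order_by_crowd(items, pairwise_answers):
    """Order items from most to least preferred by number of pairwise wins.

    pairwise_answers is a list of (preferred, other) tuples, one per worker
    answer. Ties are broken by the items' original position.
    """
    wins = Counter(preferred for preferred, _ in pairwise_answers)
    return sorted(items, key=lambda item: (-wins[item], items.index(item)))

pictures = ["pic1", "pic2", "pic3"]
answers = [("pic2", "pic1"), ("pic2", "pic3"), ("pic3", "pic1"), ("pic2", "pic1")]
ranking = order_by_crowd(pictures, answers)
```

Win counting ignores worker reliability and contradictory answers, both of which a production aggregation step would likely need to handle.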


4.3. Computational tasks expressed as
SPARQL queries

Transitive relations inferred automatically, without
requiring human intervention.

Implementation of restrictions in SPIN.

Identity Resolution
CONSTRUCT {
  ?a owl:sameAs ?c .
} WHERE {
  ?a owl:sameAs ?b .
  ?b owl:sameAs ?c .
}

Classification
CONSTRUCT {
  ?a a ?b.
  ?b rdfs:subClassOf ?c.
} WHERE {
  ?a rdfs:subClassOf ?c.
  ?b rdfs:subClassOf ?b1.
  ?b1 rdfs:subClassOf ?c.
}

Ordering
CONSTRUCT {
  {(?a ?b) a rdf:List .}
} WHERE {
  (?a ?x) a rdf:List .
  (?x ?b) a rdf:List .
}
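
The identity resolution rule, applied repeatedly, amounts to a transitive closure over owl:sameAs pairs. A minimal sketch of that fixpoint in plain Python follows; the pair encoding, the function name same_as_closure, and the sample identifiers are assumptions, and the slides themselves implement such rules in SPIN rather than in application code.

```python
# Hypothetical sketch: transitive closure of sameAs pairs without a triple store.

def same_as_closure(pairs):
    """Apply (a sameAs b), (b sameAs c) => (a sameAs c) until nothing new appears."""
    closure = set(pairs)
    while True:
        inferred = {
            (a, c)
            for a, b in closure
            for b2, c in closure
            if b == b2
        }
        if inferred <= closure:  # fixpoint reached
            return closure
        closure |= inferred

# Sample identifiers (assumed) for the same region in three data sets.
facts = {
    ("metar:BW", "geonames:BW"),
    ("geonames:BW", "dbpedia:Baden-Wuerttemberg"),
}
closure = same_as_closure(facts)
```

The sketch covers transitivity only, as the CONSTRUCT rule does; symmetry of owl:sameAs would need an additional rule.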


5. Advantages
Declarative description of the data allows the query to be
decomposed.

Generation of the UIs automatically.

Generation of human tasks on-the-fly and adjustment of
the design of the task.

Automatic consistency check of results by reasoning
against validating ontology.


6. Challenges
Appropriate level of granularity in HIT design for specific
SPARQL constructs.
Caching
Naively we can materialise HIT results into datasets.
How to deal with partial coverage and dynamic datasets.

Optimal user interfaces of graph-like content.

Pricing and workers’ assignment.
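
The naive caching option mentioned above can be sketched as a lookup table keyed by task type and subject, so that an answered HIT is not posted again. The class name HitCache, its methods, and the fake_crowd stand-in are hypothetical, and the sketch deliberately leaves out partial coverage and invalidation for dynamic data sets, which the slide lists as open problems.

```python
# Hypothetical sketch: materialise HIT results so repeated tasks are not re-posted.

class HitCache:
    """Stores crowd answers keyed by (task type, subject)."""

    def __init__(self):
        self._answers = {}
        self.posted = 0  # counts how often the crowd was actually asked

    def resolve(self, task_type, subject, ask_crowd):
        key = (task_type, subject)
        if key not in self._answers:
            self._answers[key] = ask_crowd(subject)
            self.posted += 1
        return self._answers[key]

def fake_crowd(subject):
    """Stand-in for posting a real HIT; always returns the same class."""
    return "metar:CommercialHubAirport"

cache = HitCache()
first = cache.resolve("classification", "metar:StationA", fake_crowd)
second = cache.resolve("classification", "metar:StationA", fake_crowd)
```

The second call is served from the cache, so the crowd is asked only once.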


QUESTIONS

