Crossing the Vocabulary Gap for Querying Complex and Heterogeneous Databases

© Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Crossing the Vocabulary Gap for Querying Complex and Heterogeneous Databases:
A Distributional-Compositional Semantics Perspective
André Freitas, Sean O’Riain, Edward Curry
DEOS 2013, Oxford, UK

Big Data
• Big Data: a more complete data-based picture of the world.

Growing Schema Size
• From 10s-100s of attributes to 1,000s-1,000,000s of attributes.
• Heterogeneous, complex and large-scale databases.
• Very large and dynamic "schemas".

Growing Semantic Heterogeneity
• Multiple perspectives (conceptualizations) of reality.
• Ambiguity, vagueness, inconsistency.

Problem
• Structured queries are still the primary way to query databases.

Structured query
[Figure: query construction time (low to high) against schema size and heterogeneity (low to high), from 10s-100s of attributes to 10^3-10^6 attributes.]

Vocabulary Problem for Databases
Who is the daughter of Bill Clinton married to?
Schema-agnostic queries
Possible representations

Who is the daughter of Bill Clinton married to?
Semantic Gap
Lexical-level
Abstraction-level
Structural-level
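
The lexical side of this gap can be sketched with a toy Entity-Attribute-Value store. The triples and predicate names (`child`, `spouse`) are illustrative assumptions, not taken from the slides; the point is only that an exact-match lookup fails when the user's word ("daughter") differs from the word used in the data.

```python
# Toy EAV triples; the predicate names are illustrative assumptions.
TRIPLES = [
    ("Bill_Clinton", "child", "Chelsea_Clinton"),
    ("Chelsea_Clinton", "spouse", "Marc_Mezvinsky"),
]


def exact_lookup(entity, attribute):
    """Return values whose attribute name matches the query term exactly."""
    return [v for (e, a, v) in TRIPLES if e == entity and a == attribute]


# The user's vocabulary does not match the dataset's vocabulary.
with_user_term = exact_lookup("Bill_Clinton", "daughter")
# The lookup only succeeds once the user already knows the schema term.
with_schema_term = exact_lookup("Bill_Clinton", "child")
```

A structured query language behaves like `exact_lookup`: the user has to know that the dataset says `child` before the query returns anything.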


Solution: Schema-agnostic queries
Lexical-level
Abstraction-level
Structural-level
Distributional Semantics
Compositional Semantics
Based on the statistical analysis of large unstructured corpora
Query Processing and Planning
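
One way to picture how the two ingredients combine is a step-by-step walk over the data graph. The sketch below is not the authors' algorithm: the graph, the `RELATEDNESS` scores and the `threshold` parameter are hypothetical stand-ins. It starts from an entity, then at each step follows the predicate most related to the next query term.

```python
# Toy data graph: entity -> {predicate: value}. Illustrative only.
GRAPH = {
    "Bill_Clinton": {
        "child": "Chelsea_Clinton",
        "party": "Democratic_Party",
        "almaMater": "Yale_Law_School",
    },
    "Chelsea_Clinton": {
        "spouse": "Marc_Mezvinsky",
        "almaMater": "Stanford_University",
    },
}

# Hypothetical relatedness scores; a real system would derive these
# from corpus statistics rather than a hand-written table.
RELATEDNESS = {
    ("daughter", "child"): 0.9,
    ("daughter", "party"): 0.1,
    ("daughter", "almaMater"): 0.2,
    ("married", "spouse"): 0.9,
    ("married", "almaMater"): 0.1,
}


def resolve(pivot, terms, threshold=0.5):
    """Follow, for each query term, the most related predicate from the current node."""
    node, path = pivot, []
    for term in terms:
        candidates = GRAPH.get(node, {})
        if not candidates:
            return None
        best = max(candidates, key=lambda p: RELATEDNESS.get((term, p), 0.0))
        if RELATEDNESS.get((term, best), 0.0) < threshold:
            return None  # no predicate is related enough to this term
        path.append(best)
        node = candidates[best]
    return node, path


answer = resolve("Bill_Clinton", ["daughter", "married"])
```

For the example question, `answer` holds the final entity together with the predicate path that was followed, so the user never has to name `child` or `spouse` directly.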


Core Elements of the Proposed Approach
• Hybrid database/IR/QA model.
• Ranked query results.
• Existing IR approaches: traditional Vector Space Models (VSMs) were not able to:
  (i) capture the structure of the data;
  (ii) support precise and comprehensive semantic matching.
• A VSM supporting these two requirements was formulated: Ƭ-Space.
• Ranking function based on a distributional semantic relatedness measure.
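
A minimal sketch of ranking by distributional relatedness follows. It uses cosine similarity over hand-made co-occurrence vectors as a stand-in measure; the slides do not specify the measure, and the context words, counts and predicate names are all hypothetical.

```python
import math

# Hypothetical co-occurrence counts against four context words:
# ("family", "wedding", "school", "politics").
VECTORS = {
    "daughter": (9, 2, 3, 0),
    "married": (4, 9, 0, 0),
    "child": (8, 1, 4, 0),
    "spouse": (5, 9, 0, 1),
    "almaMater": (0, 0, 9, 1),
    "party": (0, 1, 0, 9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_term, candidates):
    """Return (score, candidate) pairs, most related candidate first."""
    q = VECTORS[query_term]
    return sorted(((cosine(q, VECTORS[c]), c) for c in candidates), reverse=True)


PREDICATES = ["child", "spouse", "almaMater", "party"]
ranked_for_daughter = rank("daughter", PREDICATES)
ranked_for_married = rank("married", PREDICATES)
```

Because the output is a ranked list rather than a single exact match, the user sees the best candidate predicates first, which is what makes the results "ranked" in the IR sense.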

Does it work?
• DBpedia 3.7 + YAGO.
• 102 natural language queries (QALD 2011).
Entity-Attribute-Value (EAV) dataset:
  45,767 predicates
  5,556,492 classes
  9,434,677 instances

Selected Publications
André Freitas, Edward Curry, João Gabriel Oliveira, João C. Pereira da Silva, Sean O'Riain. Querying the Semantic Web using Semantic Relatedness: A Vocabulary Independent Approach. Data & Knowledge Engineering (DKE) Journal, 2013. (Article)

André Freitas, Fabricio de Faria, Sean O'Riain, Edward Curry. Answering Natural Language Queries over Linked Data Graphs: A Distributional Semantics Approach. In Proceedings of the 36th Annual ACM SIGIR Conference, Dublin, Ireland, 2013. (Demonstration Paper in Proceedings)

André Freitas, Edward Curry, João Gabriel Oliveira, Sean O'Riain. Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches and Trends. IEEE Internet Computing, Special Issue on Internet-Scale Data, 2012. (Article)

André Freitas, Edward Curry, João Gabriel Oliveira, Sean O'Riain. A Distributional Structured Semantic Space for Querying RDF Graph Data. International Journal of Semantic Computing (IJSC), 2012. (Article)

http://treo.deri.ie
