Keyword proximity search in xml trees andrada astefanoaie - presentation

Keyword Proximity Search
in XML Trees

Andrada Astefanoaie
XML and Database Systems
SS 2010

Outline
I. Introduction
II. Framework
III. Algorithms: Indexed XML Data
IV. Processing Unindexed XML Data
V. Experimental Evaluation
VI. Overview


Keyword Search

Keyword search

user-friendly information discovery technique

extensively studied for text documents.

Keyword proximity search

well-suited to XML documents


Notation

XML document: a directed tree with labels. Each node v:

- is labeled with λ(v), a tag

- has an id, id(v), given as a 4-tuple:
start and end correspond to the first and the final times the node v is
visited in a depth-first traversal of the XML tree,
depth is the depth of the node from the root of the tree.

- if v is a leaf, it has a string value val(v) that contains a list of keywords

Keyword query: a set of keywords k1, ..., km.
It returns a compact representation of the set of trees that connect
the nodes that contain the keywords.
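
The start, end, and depth components described above can be computed in one depth-first traversal. The following is a minimal sketch, not the authors' code: the dictionary-based tree, the function names, and the convention that the counter advances on both entry and exit are assumptions made for illustration, and only three components of the id are computed.

```python
def number_nodes(tree, root):
    """Assign (start, end, depth) to every node.

    tree: dict mapping a node to the ordered list of its children
    (a hypothetical representation chosen for this sketch).
    """
    ids = {}
    clock = 0

    def visit(node, depth):
        nonlocal clock
        clock += 1                 # first time the node is visited
        start = clock
        for child in tree.get(node, []):
            visit(child, depth + 1)
        clock += 1                 # final time the node is visited
        ids[node] = (start, clock, depth)

    visit(root, 0)
    return ids


def is_ancestor(ids, u, v):
    """u is a proper ancestor of v iff u's interval encloses v's."""
    start_u, end_u, _ = ids[u]
    start_v, end_v, _ = ids[v]
    return start_u < start_v and end_v < end_u


# A small hypothetical tree, not the one from the slides.
tree = {"r": ["c1"], "c1": ["s1", "s2"], "s1": ["p1"]}
ids = number_nodes(tree, "r")
```

With such intervals, an ancestor test needs only two integer comparisons, which is what makes this kind of numbering convenient for LCA-style processing.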


[Figure: example XML tree with nodes r, c1, s1 to s3, p1 to p6, and a1 to a10]



Definition

Minimum connecting tree (MCT) of nodes v1, ..., vm of a tree → the minimum-size subtree that
connects v1, ..., vm.

Root of the MCT → the lowest common ancestor (LCA) of the nodes v1, ..., vm.
Examples:
[Figures: MCTs for the query “Tom, Harry” and MCTs for the query “Tom, Dick, Harry”, each drawn on the example tree]
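
The definition can be made concrete when each node is represented by its root-to-node path. The sketch below is illustrative only: the function name is hypothetical, the parent-child relations in the sample paths are assumed (the figure's edges are not recoverable here), and the MCT is returned as a set of edges.

```python
def lca_and_mct(paths):
    """Return (lca, edges) for nodes given as root-to-node paths.

    paths: list of tuples of node ids, all starting at the same root.
    """
    # The LCA is the last node of the longest common prefix of all paths.
    prefix_len = 0
    for column in zip(*paths):
        if len(set(column)) != 1:
            break
        prefix_len += 1
    lca = paths[0][prefix_len - 1]

    # The MCT is the union of the path segments from the LCA down to each node.
    edges = set()
    for path in paths:
        tail = path[prefix_len - 1:]
        edges.update(zip(tail, tail[1:]))
    return lca, edges


# Hypothetical paths; the nesting of the nodes is assumed for illustration.
a1 = ("r", "c1", "s1", "p1", "a1")
a3 = ("r", "c1", "s1", "p2", "a3")
a5 = ("r", "c1", "s2", "p3", "a5")

lca2, mct2 = lca_and_mct([a1, a3])
lca3, mct3 = lca_and_mct([a1, a3, a5])
```

Under these assumed paths, adding a third node from another subtree moves the LCA up from s1 to c1 and doubles the number of edges in the MCT.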



DMCT

Let v1, ..., vm ∈ T.
Distance MCT (DMCT) TD = d(TM) of the MCT TM of nodes v1, ..., vm → the minimum node-labeled
and edge-labeled tree such that:
- TD contains v1, ..., vm
- TD contains the LCAs u1, ..., uk of any pair of nodes (vi, vj), where vi, vj ∈ {v1, ..., vm}, i ≠ j
- there is an edge labeled with l between any two distinct nodes n, n' ∈ {v1, ..., vm, u1, ..., uk} if there is a
path of length l from n' to n in TM and the path does not contain any node
n'' ∈ {v1, ..., vm, u1, ..., uk} other than n and n'.
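
A DMCT can be derived directly from root-to-node paths by keeping only the keyword nodes and their pairwise LCAs and labeling each edge with the length of the path it replaces. This is a minimal sketch under assumptions: the function names and the sample paths are hypothetical, and the input paths are assumed to be distinct and to share a root.

```python
from itertools import combinations


def common_prefix_len(p, q):
    """Length of the longest common prefix of two paths."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n


def dmct(paths):
    """Return (nodes, edges); edges maps (parent, child) to a path length."""
    # Nodes of the DMCT: the keyword nodes plus the LCA of every pair.
    kept = {p[-1] for p in paths}
    for p, q in combinations(paths, 2):
        kept.add(p[common_prefix_len(p, q) - 1])

    # Connect consecutive kept nodes on each path; the label is the
    # number of tree edges between them.
    edges = {}
    for p in paths:
        on_path = [(i, n) for i, n in enumerate(p) if n in kept]
        for (i, parent), (j, child) in zip(on_path, on_path[1:]):
            edges[(parent, child)] = j - i
    return kept, edges


# Hypothetical paths; the nesting of the nodes is assumed for illustration.
paths = [
    ("r", "c1", "s1", "p1", "a1"),
    ("r", "c1", "s1", "p2", "a3"),
    ("r", "c1", "s2", "p3", "a5"),
]
nodes, edges = dmct(paths)
```

The intermediate nodes p1, p2, p3, and s2 disappear from the result; only their contribution to the edge labels remains.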



GDMCT

A Grouped DMCT (GDMCT) of a tree T is a labeled tree where edges are labeled with numbers and nodes
are labeled with lists of node ids from T.

A DMCT D belongs to a GDMCT G (D ∈ G) if D and G are isomorphic. Assuming that f is the mapping of the nodes of D
to the nodes of G, which induces a corresponding mapping, also called f, of the edges of D to
the edges of G, the following must hold:
- if nD is a node of D, nG is a node of G, and f(nD) = nG, then the label of nG contains the id of nD;
- if eD is an edge of D, eG is an edge of G, and f(eD) = eG, then the label of eD and the label of eG
are the same number.
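
One way to picture the grouping is to compute a canonical shape for each DMCT (its structure plus edge labels) and to collect, position by position, the node ids of all DMCTs that share that shape. The sketch below is a simplification and not the paper's procedure: the names are hypothetical, the DMCT encoding (a root plus a dictionary of labeled children) is assumed, and it ignores which keyword each node matches.

```python
def canonical(node, children):
    """Return (shape, ids) for the subtree rooted at node.

    children: dict mapping a node to a list of (edge_label, child).
    shape is a nested tuple of edge labels; ids lists the node ids in
    the same canonical order, so equal shapes align position by position.
    """
    subtrees = sorted(
        (label,) + canonical(child, children)
        for label, child in children.get(node, [])
    )
    shape = tuple((label, sub_shape) for label, sub_shape, _ in subtrees)
    ids = [node] + [i for _, _, sub_ids in subtrees for i in sub_ids]
    return shape, ids


def group_dmcts(dmcts):
    """Group DMCTs with equal shape; each position keeps a list of node ids."""
    groups = {}
    for root, children in dmcts:
        shape, ids = canonical(root, children)
        labels = groups.setdefault(shape, [[] for _ in ids])
        for slot, node_id in zip(labels, ids):
            if node_id not in slot:
                slot.append(node_id)
    return groups


# Three hypothetical DMCTs: two share a shape, the third does not.
dmcts = [
    ("s1", {"s1": [(2, "a1"), (2, "a3")]}),
    ("s2", {"s2": [(2, "a5"), (2, "a6")]}),
    ("c1", {"c1": [(3, "a1"), (3, "a5")]}),
]
groups = group_dmcts(dmcts)
```

Here the first two DMCTs collapse into one grouped tree whose root is labeled with the list [s1, s2], while the third stays in a group of its own because its edge labels differ.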


Problems

Problem 1: All GDMCTs Problem

Query           K    Result
“Tom, Harry”    5    [figure]
                3    [figure]


Problems

Problem 2: Lowest GDMCTs Problem

Query           K    Result
“Tom, Harry”    5    [figure]
                3    [figure]


All GDMCTs: Nested Loop Algorithm

The nested loops algorithm (NL) for the case of indexed XML data operates over separate lists
of nodes, L(k), one for each query keyword, k, to identify the GDMCTs whose sizes are no
more than the user-provided threshold, K.

Examples of some entries in the master index for our tree: [figure]

Master index: an inverted index, stored as a hash table, that maps each keyword k to its list L(k).

Each node n has a path-id
(the list of node ids along the path from the root of T to n).
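
A master index of this kind can be sketched as a dictionary from keywords to lists of path-ids. The code below is an illustration under assumptions: the tree and value dictionaries, the function name, and whitespace tokenization are all hypothetical choices, not details taken from the paper.

```python
from collections import defaultdict


def build_master_index(tree, root, values):
    """Map each keyword to L(k), the path-ids of the nodes containing it.

    tree:   dict mapping a node to the ordered list of its children
    values: dict mapping a node to its string value (keywords separated
            by whitespace, an assumption of this sketch)
    """
    index = defaultdict(list)

    def visit(node, path):
        path = path + (node,)              # path-id: root-to-node id list
        # dict.fromkeys drops repeated keywords while keeping their order.
        for keyword in dict.fromkeys(values.get(node, "").split()):
            index[keyword].append(path)
        for child in tree.get(node, []):
            visit(child, path)

    visit(root, ())
    return dict(index)


# A small hypothetical document.
tree = {"r": ["c1"], "c1": ["s1"], "s1": ["a1", "a2"]}
values = {"a1": "Tom", "a2": "Harry Tom"}
index = build_master_index(tree, "r", values)
```

Because the traversal is depth-first, each list L(k) comes out in document order, which is the order a list-merging algorithm would typically expect.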


The NL algorithm:
- checks all combinations of nodes from the keyword lists;
- for each combination, computes an MCT (minimum connecting tree);
- merges the resulting MCT into the list of result GDMCTs, if its size is within the
user-specified threshold.
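
The first two steps and the threshold test can be sketched as follows. This is a hedged illustration, not the paper's NL code: the function names are hypothetical, the size of an MCT is assumed to be its number of edges, and the final merge into GDMCTs is left out, so the function returns the qualifying MCTs only.

```python
from itertools import product


def mct_root_and_size(paths):
    """Return (lca, number of edges) of the MCT of the given path-ids."""
    prefix_len = 0
    for column in zip(*paths):
        if len(set(column)) != 1:
            break
        prefix_len += 1
    edges = set()
    for path in paths:
        tail = path[prefix_len - 1:]
        edges.update(zip(tail, tail[1:]))
    return paths[0][prefix_len - 1], len(edges)


def nested_loops(index, keywords, K):
    """Check every combination of nodes, one per keyword list."""
    results = []
    for combo in product(*(index[k] for k in keywords)):
        lca, size = mct_root_and_size(combo)
        if size <= K:                      # keep MCTs within the threshold
            results.append((lca, tuple(p[-1] for p in combo), size))
    return results


# Hypothetical keyword lists L(k) holding path-ids.
index = {
    "Tom": [("r", "s1", "a1"), ("r", "s2", "a3")],
    "Harry": [("r", "s1", "a2"), ("r", "s2", "a4")],
}
close = nested_loops(index, ["Tom", "Harry"], K=2)
everything = nested_loops(index, ["Tom", "Harry"], K=4)
```

The call to `product` makes the cost visible: the number of combinations is the product of the list lengths, whatever the threshold.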


For example, for the query “Tom, Harry” and K=3,
NL examines the 12 node pairs, computes 12 MCTs,
determines that 2 of them meet the threshold K, and returns 2 GDMCTs.


Inefficiency:

- NL checks all the combinations of nodes
from the keyword lists.

- The grouping of the results into GDMCTs is
not tightly integrated with the algorithm,
and a lookup to the array R is required for
each relevant MCT found.


Stack-Based Algorithm
Index Structure and Algorithm.

The stack-based algorithm for computing GDMCTs on indexed XML data operates over lists of
nodes, two for each query keyword.

Indexing: for each keyword k, the master index contains 2 lists:

o L(k), of the nodes of T that contain k, and
o Ld(k), of the ancestors of nodes in L(k).
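
Given L(k) as a list of path-ids, the second list can be derived from the prefixes of those paths. This is a minimal sketch with a hypothetical function name; it assumes that Ld(k) holds proper ancestors only and that first-seen order is acceptable.

```python
def ancestor_list(path_ids):
    """Derive Ld(k) from L(k): the proper ancestors of the nodes in L(k).

    Each ancestor is itself returned as a path-id, without duplicates,
    in the order in which it is first encountered.
    """
    seen = {}
    for path in path_ids:
        # Every proper prefix of a path-id identifies one ancestor.
        for length in range(1, len(path)):
            seen.setdefault(path[:length], None)
    return list(seen)


# Hypothetical L(k) for some keyword k.
L_k = [("r", "c1", "s1", "a1"), ("r", "c1", "s2", "a5")]
Ld_k = ancestor_list(L_k)
```

Shared ancestors such as r and c1 appear once, so Ld(k) is usually much shorter than the sum of the path lengths in L(k).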



For example the entries for Tom, Dick and Harry are:



This is the high-level description of the SA.

It describes how the selected list of nodes is traversed in a depth-first manner and how the
nodes are pushed onto and popped from the stack.


Novel part of the SA algorithm: the processing and bookkeeping performed at each stack
operation.



Functions that are called from
POP(S)


Illustrative Example

Query: “Tom, Harry”
K=3

Master index lists:

The intersection of the lists:


Master index lists: Intersection of the La

K=3

Some of the initial stack states of the execution of the Stack Algorithm:
[Figure: stack states 1, 2, and 3]


K=3

[Figure: stack states 4, 5, and 6]


K=3

[Figure: stack states 7, 8, and 9]

Entries from the lists continue being examined, and new GDMCTs are created and
pruned until all the answers are output.

...


Lowest GDMCTs: Stack-Based Algorithm

The key observation is that once we output the GDMCTs of a node u, none of the ancestors of u
in the stack can be LCAs of returned GDMCTs; hence, we can remove all of them from the stack!
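
The effect of this observation can be illustrated without the stack, as a post-filter over already computed results. Note that this swaps in a different technique: it is not the in-stack pruning of the SA, only a sketch of which results survive, and the function name and result encoding are hypothetical.

```python
def lowest_only(results):
    """Keep results whose LCA is not an ancestor of another result's LCA.

    results: list of (lca_path, payload), where lca_path is the
    root-to-LCA path-id (an encoding assumed for this sketch).
    """
    lca_paths = {path for path, _ in results}

    def has_descendant_lca(path):
        # A proper descendant has a longer path-id that starts with path.
        return any(
            len(other) > len(path) and other[:len(path)] == path
            for other in lca_paths
        )

    return [(p, x) for p, x in results if not has_descendant_lca(p)]


# Hypothetical results: g2 is rooted at an ancestor of the other two roots.
results = [
    (("r", "c1", "s1"), "g1"),
    (("r", "c1"), "g2"),
    (("r", "c1", "s2"), "g3"),
]
lowest = lowest_only(results)
```

The stack-based variant reaches the same answer earlier, because it discards the ancestors as soon as a descendant produces output instead of computing their GDMCTs first.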

Specifically, we can add the following lines after line 5:


LCAs: Stack-Based Algorithms
The Stack Algorithm can also be easily modified to solve the All LCAs Problem and the Lowest
LCAs Problem, where the user is not interested in the GDMCTs, but only in the LCA nodes.

o First, Merge(.) could be simplified: no merging of GDMCTs would need to be done, and

line 33 could be replaced by:

o Second, we can output an LCA early when the first GDMCT (with all keywords) is
computed for that node (in Procedure CreateNewGDMCTs(.)), instead of waiting until the
node is popped from the stack.


Complexity Analysis
Total number of GDMCTs

Worst case: the number of DMCTs and of GDMCTs is exponential in the number of keywords.

Under reasonable assumptions, the worst-case number of GDMCTs is smaller than that of
DMCTs

Complexity of Finding Isomorphic GDMCTs

Given the canonical representation presented in this chapter, one can linearize the GDMCTs in
an XML-like nested representation with start and end tags, obtained from the node
annotations.

Theorem 1. The time complexity of SA is

$O\left(L \cdot K \cdot \left(\sum_{i=1}^{m} L(k_i)\right)^{2}\right)$


Processing Unindexed XML Data
Both the NL Algorithm and the SA have adaptations to work without index lists by doing a single
pass over the data tree.
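
To show what a single pass without index lists can look like, the sketch below collects the keyword nodes and their path-ids while streaming through a document with Python's `xml.etree.ElementTree.iterparse`. It is not the streaming Stack Algorithm itself; the function name, the integer node ids, and the whitespace tokenization of element text are assumptions of this illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_keyword_paths(xml_bytes, keywords):
    """One pass over the document; returns keyword -> list of path-ids."""
    hits = {k: [] for k in keywords}
    stack = []        # ids of the elements that are currently open
    next_id = 0       # ids are assigned in document order (assumption)
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            stack.append(next_id)
            next_id += 1
        else:
            # On "end" the element's text is complete and the stack
            # holds exactly the path from the root to this element.
            words = (elem.text or "").split()
            for k in keywords:
                if k in words:
                    hits[k].append(tuple(stack))
            stack.pop()
            elem.clear()   # release memory held by the finished element
    return hits


doc = b"<r><p><a>Tom</a><a>Harry</a></p><p><a>Tom Jones</a></p></r>"
hits = stream_keyword_paths(doc, ["Tom", "Harry"])
```

The stack of open elements plays the role of the path-id here, so no preprocessing or stored index is needed before the pass begins.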

The streaming version of the Stack Algorithm requires the following changes to the Stack
Algorithm SA(k1, ..., km, K):


Experimental Evaluation

Parameters affecting the performance of the presented algorithms:

1) the value of K denoting the threshold,
2) the number m of keywords,
3) the size of the data set.

Tests show that the algorithms based on the Stack Algorithm usually outperform the
Nested Loops algorithms on both indexed and unindexed data.


Overview

Two main problems were presented:
1) identifying and presenting in a compact manner all MCTs which explain how the keywords are
connected
2) identifying only MCTs whose root is not an ancestor of the root of another MCT.

Solutions were presented for two settings:

1) when the XML data has been preprocessed and relevant indices have been constructed
- Nested Loop Algorithm
- Stack Algorithm
2) when the XML data has not been preprocessed, i.e., the XML data can only be processed
sequentially.

The benefits of the algorithms are shown by the Experimental Evaluation.

Resource

Name: Keyword Proximity Search in XML Trees

Authors: Vangelis Hristidis, Nick Koudas,
Yannis Papakonstantinou, and Divesh Srivastava

Publication: IEEE Transactions on Knowledge and Data Engineering,
Vol. 18, No. 4, April 2006


Thank you!
