Keyword Proximity Search
in XML Trees


                    Andrada Astefanoaie
                     XML and Database Systems
                                      SS 2010
Outline
I.   Introduction
II.  Framework
III. Algorithms: Indexed XML Data
IV.  Processing Unindexed XML Data
V.   Experimental Evaluation
VI.  Overview
Introduction - Framework - Algorithms:Indexed XML Data – Processing Unindexed XML Data - Experimental Evaluation - Overview



Keyword Search

Keyword search
- user-friendly information discovery technique
- extensively studied for text documents

Keyword proximity search
- well suited to XML documents

Notation

XML document: a directed tree with labels. Each node v
- is labeled with λ(v), a tag
- has an id id(v), a 4-tuple whose components include start, end and depth: start and end correspond to the first and the final times the node is visited in a depth-first traversal of the XML tree, and depth is the depth of the node from the root of the tree
- if v is a leaf, has a string value val(v) that contains a list of keywords

Keyword query: a set of keywords k1, ..., km. It returns a compact representation of the set of trees that connect the nodes that contain the keywords.
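
The start, end and depth components can be assigned in a single depth-first pass. The Python sketch below shows one possible numbering, assuming a shared counter that is incremented both on entering and on leaving a node. `Node`, `number_tree` and `is_ancestor` are hypothetical names, and the fourth component of id(v) is left out because it is not described here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str                      # the tag, λ(v)
    value: str = ""               # val(v), only meaningful for leaves
    children: list = field(default_factory=list)
    start: int = 0
    end: int = 0
    depth: int = 0

def number_tree(root):
    """Assign start, end and depth to every node in one depth-first pass."""
    counter = 0

    def visit(node, depth):
        nonlocal counter
        counter += 1
        node.start, node.depth = counter, depth   # first visit
        for child in node.children:
            visit(child, depth + 1)
        counter += 1
        node.end = counter                        # final visit

    visit(root, 0)

def is_ancestor(u, v):
    """u is a proper ancestor of v iff u's interval strictly encloses v's."""
    return u.start < v.start and v.end < u.end

# Tiny made-up document: one paper with two authors.
tom = Node("author", "Tom")
harry = Node("author", "Harry")
paper = Node("paper", children=[tom, harry])
number_tree(paper)
print((paper.start, paper.end, paper.depth))  # (1, 6, 0)
print((tom.start, tom.end, tom.depth))        # (2, 3, 1)
```

With such intervals, an ancestor test needs only two integer comparisons, which is why this style of id is convenient for stack-based processing.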

Notation

[Figure: the example XML tree. Levels from the root down: r; c1; s1, s2, s3; p1 to p6; a1 to a10]

Notation

Definition

Minimum connecting tree (MCT) of nodes v1, ..., vm of a tree → the minimum-size subtree that connects v1, ..., vm.

Root of the MCT → the lowest common ancestor (LCA) of the nodes v1, ..., vm.

Examples:
[Figure: the example tree shown twice, with the MCTs for the query “Tom, Harry” and the MCTs for the query “Tom, Dick, Harry”]
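
The two definitions can be made concrete with a short Python sketch. It assumes the tree is given as a child-to-parent map and that the size of an MCT is counted in edges; both are assumptions of this example. The `PARENT` map is a made-up tree in the style of the running example, not a copy of the figure, and `path_from_root`, `lca` and `mct` are hypothetical helper names.

```python
# Hypothetical tree given as child -> parent (shape made up for illustration).
PARENT = {
    "c1": "r",
    "s1": "c1", "s2": "c1",
    "p1": "s1", "p2": "s1", "p3": "s2",
    "a1": "p1", "a2": "p1", "a3": "p2", "a5": "p3",
}

def path_from_root(parent, node):
    """List of node ids from the root down to `node`."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def lca(parent, nodes):
    """Lowest common ancestor: last node shared by all root-to-node paths."""
    paths = [path_from_root(parent, n) for n in nodes]
    common = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common

def mct(parent, nodes):
    """Return (root, edges) of the minimum connecting tree of `nodes`."""
    root = lca(parent, nodes)
    edges = set()
    for n in nodes:
        while n != root:              # walk up until the LCA is reached
            edges.add((parent[n], n))
            n = parent[n]
    return root, edges

root, edges = mct(PARENT, ["a1", "a3"])
print(root, len(edges))  # s1 4
```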

Notation

DMCT

Let v1, ..., vm ∈ T.
Distance MCT (DMCT) TD = d(TM) of the MCT TM of nodes v1, ..., vm → the minimum node-labeled and edge-labeled tree such that:
- TD contains v1, ..., vm
- TD contains the LCAs u1, ..., uk of any pair of nodes (vi, vj), where vi, vj ∈ {v1, ..., vm}, i ≠ j
- there is an edge labeled with l between any two distinct nodes n, n' ∈ {v1, ..., vm, u1, ..., uk} if there is a path of length l from n' to n in TM and the path does not contain any node n'' ∈ {u1, ..., uk} other than n and n'
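
A DMCT can be derived from the parent map by keeping only the keyword nodes and their pairwise LCAs, then labeling each edge with the number of tree edges it replaces. The following Python sketch illustrates that reading of the definition; the `PARENT` tree is made up, and `dmct` and its helpers are hypothetical names rather than the paper's procedure.

```python
from itertools import combinations

# Hypothetical tree given as child -> parent (shape made up for illustration).
PARENT = {
    "c1": "r",
    "s1": "c1", "s2": "c1",
    "p1": "s1", "p2": "s1", "p3": "s2",
    "a1": "p1", "a2": "p1", "a3": "p2", "a5": "p3",
}

def path_from_root(parent, node):
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def lca(parent, nodes):
    paths = [path_from_root(parent, n) for n in nodes]
    common = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common

def dmct(parent, nodes):
    """Return (root, {(upper, lower): path length}) for the keyword nodes."""
    # Nodes that survive: the keyword nodes and all pairwise LCAs.
    kept = set(nodes) | {lca(parent, pair) for pair in combinations(nodes, 2)}
    root = lca(parent, nodes)
    edges = {}
    for n in kept - {root}:
        up, length = parent[n], 1
        while up not in kept:         # skip nodes that are not kept
            up, length = parent[up], length + 1
        edges[(up, n)] = length
    return root, edges

root, edges = dmct(PARENT, ["a1", "a3", "a5"])
print(root)                   # c1
print(sorted(edges.items()))
# [(('c1', 'a5'), 3), (('c1', 's1'), 1), (('s1', 'a1'), 2), (('s1', 'a3'), 2)]
```

In this example p1, p2, p3 and s2 disappear from the result, and their contribution survives only as edge labels.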

Notation

GDMCT

A Grouped DMCT (GDMCT) of a tree T is a labeled tree where edges are labeled with numbers and nodes are labeled with lists of node ids from T.

A DMCT D belongs to a GDMCT G (D ∈ G) if D and G are isomorphic. Assuming that f is the mapping of the nodes of D to the nodes of G, which induces a corresponding mapping, also called f, of the edges of D to the edges of G, the following must hold:
- if nD is a node of D, nG is a node of G and f(nD) = nG, then the label of nG contains the id of nD
- if eD is an edge of D, eG is an edge of G and f(eD) = eG, then the label of eD and the label of eG are the same number
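
One plausible way to group DMCTs is to compute an id-free canonical shape for each tree and collect, per position in that shape, the ids of all DMCTs that share it. The Python sketch below assumes a DMCT is a nested tuple `(node_id, [(edge_label, child), ...])`; this representation and the names `shape`, `ids_in_canonical_order` and `group_dmcts` are assumptions of the example, not the paper's data structures.

```python
from collections import defaultdict

def shape(tree):
    """Canonical id-free form: sorted tuple of (edge label, child shape)."""
    _, children = tree
    return tuple(sorted((label, shape(child)) for label, child in children))

def ids_in_canonical_order(tree):
    """Node ids listed in the same order in which `shape` sorts children."""
    node_id, children = tree
    ordered = sorted(children, key=lambda c: (c[0], shape(c[1])))
    ids = [node_id]
    for _, child in ordered:
        ids.extend(ids_in_canonical_order(child))
    return ids

def group_dmcts(dmcts):
    """Map each shape to one list of node ids per node position."""
    rows = defaultdict(list)
    for tree in dmcts:
        rows[shape(tree)].append(ids_in_canonical_order(tree))
    return {s: [sorted(set(col)) for col in zip(*r)] for s, r in rows.items()}

# Three made-up DMCTs: two with edge labels (1, 1), one with (2, 2).
dmcts = [
    ("p1", [(1, ("a1", [])), (1, ("a2", []))]),
    ("p2", [(1, ("a3", [])), (1, ("a4", []))]),
    ("s1", [(2, ("a1", [])), (2, ("a3", []))]),
]
groups = group_dmcts(dmcts)
print(groups[((1, ()), (1, ()))])  # [['p1', 'p2'], ['a1', 'a3'], ['a2', 'a4']]
```

The two DMCTs rooted at p1 and p2 collapse into one group because they are isomorphic with equal edge labels, while the DMCT rooted at s1 forms a group of its own.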

Problems

Problem 1 : All GDMCTs Problem

Query                                  K                                        Result

“Tom, Harry”                           5




                                       3

Problems


Problem 2 : Lowest GDMCTs Problem


Query                                  K                                        Result

“Tom, Harry”                           5




                                       3

All GDMCTs: Nested Loop Algorithm

The nested loops algorithm (NL) for the case of indexed XML data operates over separate lists of nodes, L(k), one for each query keyword, k, to identify the GDMCTs whose sizes are no more than the user-provided threshold, K.

Master index: an inverted index, kept as a hash table, that maps each keyword k to its list L(k). Each node n has a path-id (the list of node ids along the path from the root of T to n).

[Figure: examples of some entries in the master index for our tree]

All GDMCTs: Nested Loop Algorithm

The NL algorithm:
1) checks all combinations of nodes from the keyword lists,
2) for each combination, computes an MCT (minimum connecting tree),
3) merges the resulting MCT into the list of result GDMCTs if its size is within the user-specified threshold.
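
The three steps can be sketched in Python as follows. The sketch stops at the MCTs that pass the threshold and omits the merge into GDMCTs; it assumes MCT size is counted in edges and uses a made-up tree and made-up keyword lists. `nested_loop` and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical tree (child -> parent) and keyword lists L(k).
PARENT = {
    "c1": "r",
    "s1": "c1", "s2": "c1",
    "p1": "s1", "p2": "s1", "p3": "s2",
    "a1": "p1", "a2": "p1", "a3": "p2", "a5": "p3",
}
LISTS = {"Tom": ["a1", "a3"], "Harry": ["a2", "a5"]}

def path_from_root(parent, node):
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def mct(parent, nodes):
    """Return (LCA, number of edges) of the minimum connecting tree."""
    paths = [path_from_root(parent, n) for n in nodes]
    root = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        root = level[0]
    edges = set()
    for n in nodes:
        while n != root:
            edges.add((parent[n], n))
            n = parent[n]
    return root, len(edges)

def nested_loop(parent, lists, keywords, K):
    """Check every combination of nodes, one node per keyword list."""
    results = []
    for combo in product(*(lists[k] for k in keywords)):
        root, size = mct(parent, combo)
        if size <= K:                 # keep only MCTs within the threshold
            results.append((combo, root, size))
    return results

print(nested_loop(PARENT, LISTS, ["Tom", "Harry"], 4))
# [(('a1', 'a2'), 'p1', 2), (('a3', 'a2'), 's1', 4)]
```

All four combinations are examined even though only two qualify, which is the cost the stack-based approach is meant to avoid.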

All GDMCTs: Nested Loop Algorithm

For example, for the query “Tom, Harry” and K = 3, NL examines the 12 node pairs (12 MCTs), determines that 2 of them meet the threshold K, and returns 2 GDMCTs.

All GDMCTs: Nested Loop Algorithm

Inefficiency:
- NL checks all the combinations of nodes from the keyword lists.
- The grouping of the results into GDMCTs is not tightly integrated with the algorithm, and a lookup to the array R is required for each relevant MCT found.

All GDMCTs: Stack-Based Algorithm
Index Structure and Algorithm.

The stack-based algorithm for computing GDMCTs on indexed XML data operates over lists of nodes, two for each query keyword.

Indexing by keyword: for each keyword k, the master index contains 2 lists:
o L(k) of the nodes of T that contain k and
o Ld(k) of the ancestors of nodes in L(k).
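
A minimal Python sketch of building the two lists per keyword is given below. It assumes the tree is available as a child-to-parent map and that each leaf comes with its list of keywords; the tree, the keyword assignment and the name `build_master_index` are made up for illustration, and the lists are sorted alphabetically only to make the output deterministic.

```python
# Hypothetical tree (child -> parent) and leaf keywords.
PARENT = {
    "c1": "r",
    "s1": "c1", "s2": "c1",
    "p1": "s1", "p2": "s1", "p3": "s2",
    "a1": "p1", "a2": "p1", "a3": "p2", "a5": "p3",
}
KEYWORDS_OF = {"a1": ["Tom"], "a2": ["Harry"], "a3": ["Tom"], "a5": ["Harry"]}

def build_master_index(parent, keywords_of):
    """Map each keyword k to its lists L(k) and Ld(k)."""
    index = {}
    for node, words in keywords_of.items():
        for k in words:
            entry = index.setdefault(k, {"L": set(), "Ld": set()})
            entry["L"].add(node)      # node contains k
            up = node
            while up in parent:       # every ancestor goes into Ld(k)
                up = parent[up]
                entry["Ld"].add(up)
    return {k: {name: sorted(ids) for name, ids in e.items()}
            for k, e in index.items()}

index = build_master_index(PARENT, KEYWORDS_OF)
print(index["Tom"]["L"])   # ['a1', 'a3']
print(index["Tom"]["Ld"])  # ['c1', 'p1', 'p2', 'r', 's1']
```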

All GDMCTs: Stack-Based Algorithm
Index Structure and Algorithm.

For example the entries for Tom, Dick and Harry are:

All GDMCTs: Stack-Based Algorithm
Index Structure and Algorithm.

The high-level description of the SA describes how the selected list of nodes is traversed in a depth-first manner and how the nodes are pushed onto and popped from the stack.

All GDMCTs: Stack-Based Algorithm
Index Structure and Algorithm.

The novel part of the SA is the processing and bookkeeping performed at each stack operation.

All GDMCTs: Stack-Based Algorithm
Index Structure and Algorithm.

Functions that are called from POP(S)

All GDMCTs: Stack-Based Algorithm
Illustrative Example

Query: “Tom, Harry”
K=3

Master index lists:




The intersection of the lists:

All GDMCTs: Stack-Based Algorithm
Illustrative Example

Some of the initial stack states of the execution of the Stack Algorithm:
[Figures: stack states 1 to 9, shown next to the master index lists and the intersection of the lists]

Entries from the lists continue being examined, and new GDMCTs are created and pruned, until all the answers are output.

Lowest GDMCTs: Stack-Based Algorithm




The key observation is that once we output the GDMCTs of a node u, none of the ancestors of u
in the stack can be LCAs of returned GDMCTs; hence, we can remove all of them from the stack!
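
The "lowest" condition itself can be illustrated independently of the stack: a result root qualifies only if no other result root lies below it. The Python sketch below applies this as a post-filter over a set of candidate LCAs. It is not the stack modification described here, only an illustration of the condition it enforces; the tree and the name `lowest_only` are hypothetical.

```python
# Hypothetical tree given as child -> parent.
PARENT = {
    "c1": "r",
    "s1": "c1", "s2": "c1",
    "p1": "s1", "p2": "s1", "p3": "s2",
}

def lowest_only(parent, roots):
    """Keep the roots that are not ancestors of another root."""
    roots = set(roots)
    has_root_below = set()
    for r in roots:
        up = r
        while up in parent:           # mark every ancestor that is also a root
            up = parent[up]
            if up in roots:
                has_root_below.add(up)
    return sorted(roots - has_root_below)

print(lowest_only(PARENT, ["p1", "s1", "c1"]))  # ['p1']
print(lowest_only(PARENT, ["p1", "p3", "c1"]))  # ['p1', 'p3']
```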

Specifically, we can add the following lines after line 5:

LCAs: Stack-Based Algorithms
The Stack Algorithm can also be easily modified to solve the All LCAs Problem and the Lowest
LCAs Problem, where the user is not interested in the GDMCTs, but only in the LCA nodes.



    o First, Merge(.) could be simplified, no merging of GDMCTs would need to be done, and

    line 33 could be replaced by:




    o Second, we can output an LCA early when the first GDMCT (with all keywords) is
    computed for that node (in Procedure CreateNewGDMCTs(.)), instead of waiting until the
    node is popped from the stack.

Complexity Analysis

Total number of GDMCTs

Worst case: the number of DMCTs and of GDMCTs is exponential in the number of keywords.

Under reasonable assumptions, the worst-case number of GDMCTs is smaller than that of DMCTs.

Complexity of Finding Isomorphic GDMCTs

Given the canonical representation presented in this chapter, one can linearize the GDMCTs in an XML-like nested representation with start and end tags, obtained from the node annotations.


Theorem 1. The time complexity of SA is

                                     O( L  K  (i 1 L(ki ) ) 2 )
                                                              m

Processing Unindexed XML Data
Both the NL Algorithm and the SA have adaptations to work without index lists by doing a single
pass over the data tree.
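
To show what a single sequential pass without an index can look like, the Python sketch below streams a document with the standard library's `xml.etree.ElementTree.iterparse` and records, for each query keyword, the path-ids of the elements whose text contains it. It only covers the scanning step, not the streaming Stack Algorithm; numbering nodes in order of first visit and the name `scan_keywords` are assumptions of the example.

```python
import io
import xml.etree.ElementTree as ET

def scan_keywords(xml_text, keywords):
    """One pass over the document; returns keyword -> list of path-ids."""
    hits = {k: [] for k in keywords}
    path = []                         # ids from the root to the current node
    counter = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            counter += 1              # id = order of first visit
            path.append(counter)
        else:
            words = (elem.text or "").split()
            for k in keywords:
                if k in words:
                    hits[k].append(tuple(path))
            path.pop()
            elem.clear()              # release the finished subtree
    return hits

doc = "<c><p><a>Tom</a><a>Harry</a></p><p><a>Tom</a></p></c>"
print(scan_keywords(doc, ["Tom", "Harry"]))
# {'Tom': [(1, 2, 3), (1, 5, 6)], 'Harry': [(1, 2, 4)]}
```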


The streaming version of the Stack Algorithm makes the following changes to the Stack Algorithm SA(k1, ..., km, K):

Experimental Evaluation

Parameters affecting the performance of the presented algorithms:
            1) the value of K denoting the threshold,
            2) the number m of keywords,
            3) the size of the data set.

Tests show that the algorithms based on the Stack Algorithm usually perform better than the Nested Loops algorithms on both indexed and unindexed data.

Overview

Two main problems were presented:
    1) identifying and presenting in a compact manner all MCTs which explain how the keywords are connected,
    2) identifying only MCTs whose root is not an ancestor of the root of another MCT.

Solutions were presented for two settings:
    1) when the XML data has been preprocessed and relevant indices have been constructed
                   - Nested Loop Algorithm
                   - Stack Algorithm
    2) when the XML data has not been preprocessed, i.e., the XML data can only be processed sequentially.

The benefits of the algorithms are shown by the experimental evaluation.
Resource

Name: Keyword Proximity Search in XML Trees
Authors: Vangelis Hristidis, Nick Koudas, Yannis Papakonstantinou and Divesh Srivastava
Publication: IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 4, April 2006




Thank you!

More Related Content

PDF
Getting Started With Scala
PDF
Harold Boley: RuleML/Grailog: The Rule Metalogic Visualized with Generalized ...
PPT
JavaYDL9
PPT
PPTX
Hummingbird - Open Source for Small Satellites - GSAW 2012
PPTX
Class Diagram Uml
PDF
Music workflow4
PDF
Summer Training In Dotnet
Getting Started With Scala
Harold Boley: RuleML/Grailog: The Rule Metalogic Visualized with Generalized ...
JavaYDL9
Hummingbird - Open Source for Small Satellites - GSAW 2012
Class Diagram Uml
Music workflow4
Summer Training In Dotnet

Viewers also liked (19)

PPTX
Keyword-based Search and Exploration on Databases (SIGMOD 2011)
PDF
Interactive Query and Search for your Big Data
PPTX
Presentation
PPT
Structured Document Search and Retrieval
PPTX
Information retrival system and PageRank algorithm
PDF
Naive Bayesian Text Classifier Event Models
PPTX
E-Learning Baseline, UCL
PDF
Text classification & sentiment analysis
PDF
Fundraising Tips - SCMM
PDF
5ภาษาอังกฤษ
PPT
23 Per Olav Vandvik - Kunnskapsbasert praksis i praksis: Hva slags verktøy tr...
PDF
tut0000021-hevery
PDF
Issue 116 obesity in adults PKU
DOCX
Matematika Matriks (Created by AkangCyber)
PPTX
Proyecto de regulacion aduanera
PPTX
Struktur dan fungsi organ tumbuhan ii
PDF
Ruby on Railsではじめるrspecテスト
PPS
HIV TO SELF-DESTRUCT. M.I.T. RESEARCH BY TIFFANY AMARIUTA
DOCX
Francuska deklaracija o pravima čoveka i gradjanina iz 1789
Keyword-based Search and Exploration on Databases (SIGMOD 2011)
Interactive Query and Search for your Big Data
Presentation
Structured Document Search and Retrieval
Information retrival system and PageRank algorithm
Naive Bayesian Text Classifier Event Models
E-Learning Baseline, UCL
Text classification & sentiment analysis
Fundraising Tips - SCMM
5ภาษาอังกฤษ
23 Per Olav Vandvik - Kunnskapsbasert praksis i praksis: Hva slags verktøy tr...
tut0000021-hevery
Issue 116 obesity in adults PKU
Matematika Matriks (Created by AkangCyber)
Proyecto de regulacion aduanera
Struktur dan fungsi organ tumbuhan ii
Ruby on Railsではじめるrspecテスト
HIV TO SELF-DESTRUCT. M.I.T. RESEARCH BY TIFFANY AMARIUTA
Francuska deklaracija o pravima čoveka i gradjanina iz 1789
Ad

Similar to Keyword proximity search in xml trees andrada astefanoaie - presentation (20)

PDF
Bitmap Indexes for Relational XML Twig Query Processing
DOCX
A survey of xml tree patterns
PDF
Semantic Semi-Structured Documents of Least Edit Distance (LED) Calculation f...
PDF
PPTX
BGOUG 2012 - XML Index Strategies
PPTX
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
PDF
Pattern Matching Part One: Suffix Trees
PDF
Lgm pakdd2011 public
PDF
A novel approach towards developing a statistical dependent and rank
PDF
Lise Getoor, "
PDF
Adaptive XML Tree Mining on Evolving Data Streams
DOCX
A survey of xml tree patterns
PPTX
Development of a new indexing technique for XML document retrieval
PPTX
Application of tries
PDF
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
PDF
Dmss2011 public
Bitmap Indexes for Relational XML Twig Query Processing
A survey of xml tree patterns
Semantic Semi-Structured Documents of Least Edit Distance (LED) Calculation f...
BGOUG 2012 - XML Index Strategies
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Pattern Matching Part One: Suffix Trees
Lgm pakdd2011 public
A novel approach towards developing a statistical dependent and rank
Lise Getoor, "
Adaptive XML Tree Mining on Evolving Data Streams
A survey of xml tree patterns
Development of a new indexing technique for XML document retrieval
Application of tries
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
Dmss2011 public
Ad

Recently uploaded (20)

PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
Institutional Correction lecture only . . .
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PPTX
Lesson notes of climatology university.
PDF
Computing-Curriculum for Schools in Ghana
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
STATICS OF THE RIGID BODIES Hibbelers.pdf
Institutional Correction lecture only . . .
Microbial disease of the cardiovascular and lymphatic systems
Anesthesia in Laparoscopic Surgery in India
human mycosis Human fungal infections are called human mycosis..pptx
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Pharmacology of Heart Failure /Pharmacotherapy of CHF
2.FourierTransform-ShortQuestionswithAnswers.pdf
FourierSeries-QuestionsWithAnswers(Part-A).pdf
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
VCE English Exam - Section C Student Revision Booklet
Final Presentation General Medicine 03-08-2024.pptx
O5-L3 Freight Transport Ops (International) V1.pdf
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Lesson notes of climatology university.
Computing-Curriculum for Schools in Ghana
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape

Keyword proximity search in xml trees andrada astefanoaie - presentation

  • 1. Keyword Proximity Search in XML Trees Andrada Astefanoaie XML and Database Systems SS 2010
  • 2. Outline I. Introduction II. Framework III. Algorithms:Indexed XML Data Keyword Proximity Search IV. Processing Unindexed XML DATA in XML Trees V. Experimental Evaluation VI. Overview
  • 3. Introduction - Framework - Algorithms:Indexed XML Data – Processing Unindexed XML Data - Experimental Evaluation - Overview Keyword Search Keyword Proximity Search in XML Trees Keyword search user-friendly information discovery technique extensively studied for text documents. Keyword proximity search well-suited to XML documents
  • 4. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview Notation Keyword Proximity Search in XML Trees XML DOCUMENT directed tree with labeles - labled with λ(v), a tag - 4-tuple: id(v) start and end correspond to the first and the final times the node is v visited in a depth-first traversal of the XML tree, depth is the depth of the node from the root of the tree. - if v is a leaf, it has a string value val(v) that contains a list of keywords set of keywords k1,. . . , km. keyword query returns a compact representation of the set of trees that connect the nodes that contain the keywords
  • 5. Notation. [Figure: example XML tree with nodes r, c1, s1 to s3, p1 to p6, and a1 to a10.]
  • 6. Notation. Definition: the minimum connecting tree (MCT) of nodes v1, ..., vm of a tree is the minimum-size subtree that connects v1, ..., vm. The root of the MCT is the lowest common ancestor (LCA) of the nodes v1, ..., vm. Examples: [Figures: MCTs for the query “Tom, Harry” and MCTs for the query “Tom, Dick, Harry”, drawn on the example tree.]
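The MCT and LCA definitions above can be made concrete with a short sketch. It assumes the tree is given as a hypothetical `parent` dictionary (child to parent, root mapped to `None`); the function names and the toy tree in the usage note are illustrative and are not taken from the slides.

```python
def path_from_root(parent, v):
    """Return the list of nodes from the root down to v."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def lca(parent, nodes):
    """Lowest common ancestor: last node shared by all root-to-node paths."""
    common = None
    for level in zip(*(path_from_root(parent, v) for v in nodes)):
        if all(x == level[0] for x in level):
            common = level[0]
        else:
            break
    return common


def mct(parent, nodes):
    """Return (root, edges) of the minimum connecting tree of `nodes`."""
    root = lca(parent, nodes)
    edges = set()
    for v in nodes:
        # Walk up from each node until the LCA is reached.
        while v != root:
            edges.add((parent[v], v))
            v = parent[v]
    return root, edges
```

For a toy tree such as `{"r": None, "c": "r", "s1": "c", "s2": "c", "p1": "s1", "p2": "s1", "p3": "s2"}`, calling `mct(parent, ["p1", "p2"])` yields a tree rooted at `s1` with two edges, while `mct(parent, ["p1", "p3"])` is rooted at `c` and has four edges.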
  • 7. Notation: DMCT. Let v1, ..., vm ∈ T. The Distance MCT (DMCT) TD = d(TM) of the MCT TM of nodes v1, ..., vm is the minimum node-labeled and edge-labeled tree such that: (1) TD contains v1, ..., vm; (2) TD contains the LCAs u1, ..., uk of any pair of nodes (vi, vj), where vi, vj ∈ {v1, ..., vm} and i ≠ j; (3) there is an edge labeled with l between any two distinct nodes n, n’ ∈ {v1, ..., vm, u1, ..., uk} if there is a path of length l from n’ to n in TM and the path does not contain any node n’’ ∈ {u1, ..., uk} other than n and n’.
  • 8. Notation: GDMCT. A Grouped DMCT (GDMCT) of a tree T is a labeled tree where edges are labeled with numbers and nodes are labeled with lists of node ids from T. A DMCT D belongs to a GDMCT G (D ∈ G) if D and G are isomorphic. Assuming that f is the mapping of the nodes of D to the nodes of G, which induces a corresponding mapping, also called f, of the edges of D to the edges of G, the following must hold: (1) if nD is a node of D, nG is a node of G, and f(nD) = nG, then the label of nG contains the id of nD; (2) if eD is an edge of D, eG is an edge of G, and f(eD) = eG, then the label of eD and the label of eG are the same number.
  • 9. Problems. Problem 1: All GDMCTs Problem. [Table with columns Query, K, and Result for the query “Tom, Harry”; entries 5 and 3; result figures not reproduced.]
  • 10. Problems. Problem 2: Lowest GDMCTs Problem. [Table with columns Query, K, and Result for the query “Tom, Harry”; entries 5 and 3; result figures not reproduced.]
  • 11. All GDMCTs: Nested Loop Algorithm. The nested loops algorithm (NL) for the case of indexed XML data operates over separate lists of nodes, L(k), one for each query keyword, k, to identify the GDMCTs whose sizes are no more than the user-provided threshold, K. Master index: inverted index; a hash table; list L(k); each node n has a path-id (the list of node ids along the path from the root of T to n). [Figure: examples of some entries in the master index for our tree.]
  • 12. All GDMCTs: Nested Loop Algorithm. NL checks all combinations of nodes from the keyword lists; for each combination, it computes an MCT (minimum connecting tree); it merges the resulting MCT into the list of result GDMCTs if its size is within the user-specified threshold.
  • 13. All GDMCTs: Nested Loop Algorithm. For example, for the query “Tom, Harry” and K = 3, NL examines the 12 node pairs (12 MCTs), determines that 2 of them meet the threshold K, and returns 2 GDMCTs: [Figure: the two resulting GDMCTs.]
  • 14. All GDMCTs: Nested Loop Algorithm. Inefficiency: (1) NL checks all the combinations of nodes from the keyword lists; (2) the grouping of the results into GDMCTs is not tightly integrated with the algorithm, and a lookup to the array R is required for each relevant MCT found.
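The nested loops idea from the preceding slides can be sketched as below. This is only an illustration under stated assumptions: each list entry is a path-id given as a tuple of node ids from the root to the node, the size of an MCT is taken to be its number of edges, and the final grouping of MCTs into GDMCTs is left out. The names `mct_size` and `nested_loops` are hypothetical.

```python
from itertools import product


def mct_size(path_ids):
    """Return (lca, number of edges) for the MCT of the given path-ids."""
    # The common prefix of all paths ends at the LCA.
    prefix = 0
    for level in zip(*path_ids):
        if len(set(level)) == 1:
            prefix += 1
        else:
            break
    lca = path_ids[0][prefix - 1]
    edges = set()
    for path in path_ids:
        # Edges on the way from the LCA down to the node.
        for parent, child in zip(path[prefix - 1:], path[prefix:]):
            edges.add((parent, child))
    return lca, len(edges)


def nested_loops(lists, K):
    """Check every combination of one node per keyword list; keep MCTs of size <= K."""
    results = []
    for combo in product(*lists):
        lca, size = mct_size(combo)
        if size <= K:
            results.append((lca, tuple(path[-1] for path in combo), size))
    return results
```

Because `product` enumerates every combination, the work grows with the product of the list lengths, which is the inefficiency the slide points out.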
  • 15. All GDMCTs: Stack-Based Algorithm. Index Structure and Algorithm. The stack-based algorithm for computing GDMCTs on indexed XML data operates over lists of nodes, two for each query keyword. Indexing: for each keyword k, the master index contains two lists: L(k), the nodes of T that contain k, and Ld(k), the ancestors of nodes in L(k).
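A minimal sketch of how the two per-keyword lists could be built, assuming a hypothetical `parent` dictionary (child to parent, root mapped to `None`) and a hypothetical `keywords_at` dictionary mapping each node to the set of keywords in its value. The ordering and storage format of the real master index are not specified here, so plain Python lists and sets are used.

```python
def build_index(parent, keywords_at):
    """Return (L, Ld): nodes containing each keyword, and their ancestors."""
    L, Ld = {}, {}
    for node, words in keywords_at.items():
        for k in words:
            L.setdefault(k, []).append(node)
            ancestors = Ld.setdefault(k, set())
            a = parent[node]
            while a is not None:
                ancestors.add(a)
                a = parent[a]
    return L, Ld
```

Intersecting the ancestor sets of all query keywords, for example `Ld["Tom"] & Ld["Harry"]`, gives the nodes that are ancestors of at least one match for every keyword.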
  • 16. All GDMCTs: Stack-Based Algorithm. Index Structure and Algorithm. For example, the entries for Tom, Dick, and Harry are: [Figure: master index entries.]
  • 17. All GDMCTs: Stack-Based Algorithm. Index Structure and Algorithm. This is the high-level description of the SA. It describes how the selected list of nodes is traversed in a depth-first manner and how the nodes are pushed onto and popped from the stack.
  • 18. All GDMCTs: Stack-Based Algorithm. Index Structure and Algorithm. The novel part of the SA algorithm is the processing and bookkeeping performed at each stack operation.
  • 19. All GDMCTs: Stack-Based Algorithm. Index Structure and Algorithm. Functions that are called from POP(S).
  • 20. All GDMCTs: Stack-Based Algorithm. Illustrative Example. Query: “Tom, Harry”, K = 3. Master index lists: [Figure.] The intersection of the lists: [Figure.]
  • 21. All GDMCTs: Stack-Based Algorithm. Illustrative Example. Query: “Tom, Harry”, K = 3. Master index lists; intersection of the La. Some of the initial stack states of the execution of the Stack Algorithm: [Figures: stack states 1 to 3.]
  • 22. All GDMCTs: Stack-Based Algorithm. Illustrative Example. Query: “Tom, Harry”, K = 3. Master index lists; intersection of the La. Some of the initial stack states of the execution of the Stack Algorithm: [Figures: stack states 4 to 6.]
  • 23. All GDMCTs: Stack-Based Algorithm. Illustrative Example. Query: “Tom, Harry”, K = 3. Master index lists; intersection of the La. Some of the initial stack states of the execution of the Stack Algorithm: [Figures: stack states 7 to 9.] Entries from the lists continue being examined, and new GDMCTs are created and pruned until all the answers are output.
  • 24. Lowest GDMCTs: Stack-Based Algorithm. The key observation is that once we output the GDMCTs of a node u, none of the ancestors of u in the stack can be LCAs of returned GDMCTs; hence, we can remove all of them from the stack. Specifically, we can add the following lines after line 5: [Figure: added pseudocode lines.]
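The "lowest" condition itself (keep a root only if it is not an ancestor of another result root) can be illustrated with a simple post-filter. This is not the stack modification the slide refers to; it is a plainly different, quadratic check shown only to make the condition concrete. It assumes each candidate root comes with a hypothetical `(start, end)` interval from a depth-first numbering.

```python
def lowest_only(intervals):
    """Keep candidate roots that are not ancestors of any other candidate.

    intervals: dict mapping node id -> (start, end) from a depth-first numbering.
    """
    def is_ancestor(u, v):
        u_start, u_end = intervals[u]
        v_start, v_end = intervals[v]
        return u_start < v_start and v_end < u_end

    return [
        u for u in intervals
        if not any(is_ancestor(u, v) for v in intervals if v != u)
    ]
```

The stack-based approach avoids this pairwise comparison by discarding ancestors as soon as a descendant's results are output.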
  • 25. LCAs: Stack-Based Algorithms. The Stack Algorithm can also be easily modified to solve the All LCAs Problem and the Lowest LCAs Problem, where the user is not interested in the GDMCTs, but only in the LCA nodes. First, Merge(.) could be simplified: no merging of GDMCTs would need to be done, and line 33 could be replaced by: [Figure: replacement line.] Second, we can output an LCA early, when the first GDMCT (with all keywords) is computed for that node (in Procedure CreateNewGDMCTs(.)), instead of waiting until the node is popped from the stack.
  • 26. Complexity Analysis. Total number of GDMCTs: in the worst case, the number of DMCTs and of GDMCTs is exponential in the number of keywords. Under reasonable assumptions, the worst-case number of GDMCTs is smaller than that of DMCTs. Complexity of finding isomorphic GDMCTs: given the canonical representation presented in this chapter, one can linearize the GDMCTs in an XML-like nested representation with start and end tags, obtained from the node annotations. Theorem 1. The time complexity of SA is $O\big(L \cdot K \cdot (\sum_{i=1}^{m} L(k_i))^2\big)$.
  • 27. Processing Unindexed XML Data. Both the NL Algorithm and the SA have adaptations that work without index lists by doing a single pass over the data tree. The streaming version of the Stack Algorithm requires the following changes to the Stack Algorithm SA(k1, ..., km, K): [Figure: modified pseudocode.]
  • 28. Experimental Evaluation. Parameters affecting the performance of the presented algorithms: 1) the value of the threshold K, 2) the number m of keywords, 3) the size of the data set. Tests show that the algorithms based on the Stack Algorithm usually perform better than the Nested Loops algorithms on both indexed and unindexed data.
  • 29. Overview. Two main problems were presented: 1) identifying and presenting in a compact manner all MCTs, which explain how the keywords are connected; 2) identifying only the MCTs whose root is not an ancestor of the root of another MCT. Solutions were presented for two settings: 1) when the XML data has been preprocessed and relevant indices have been constructed (Nested Loop Algorithm and Stack Algorithm); 2) when the XML data has not been preprocessed, i.e., the XML data can only be processed sequentially. The benefits of the algorithms are shown by the experimental evaluation.
  • 30. Resource. Name: Keyword Proximity Search in XML Trees. Authors: Vangelis Hristidis, Nick Koudas, Yannis Papakonstantinou, and Divesh Srivastava. Publication: IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 4, April 2006.
  • 31. Keyword Proximity Search in XML Trees Thank you!