March 30, 2011, The 5th International Workshop on
Data-Mining and Statistical Science@Osaka University	


    Kernel-based Similarity Search
     in Massive Graph Databases
          with Wavelet Trees

               Yasuo Tabei
        (JST ERATO Minato Project)
         Joint work with Koji Tsuda (AIST)
Outline	
n  Introduction
  l  Recent development of graph databases
  l  Needs for graph similarity search

  l  Bag-of-words representation of a graph

  l  Semi-conjunctive query

n  Method
  l    Scalable similarity search with wavelet trees
n  Experiments
  l    Use a large-scale graph dataset
  l    25 million chemical compounds
Graphs are everywhere	



[Figure: examples of graphs: gene co-expression network, RNA 2D structure, protein 3D structure, chemical compound, social network]
Graph Similarity Search	
n  Retrievegraphs similar to the query
n  Large databases
  l    More than 20 million chemical compounds in
        PubChem database
n  Bag-of-words    representation of graphs
  l    WL procedure (NIPS 2009)
n  Why    not use document retrieval methods?
  l  Inverted index
  l  Not that easy (explained later)
Weisfeiler-Lehman Procedure (NIPS,09)	
n Convert
         a graph into a set of words
 (bag-of-words)	
 i) Make a label set of adjacent
    A	
                     vertices ex) {E,A,D}
        B	
  E	
  ii) Sort     ex) A,D,E
                  iii) Add the vertex label as a prefix
    D	
                              ex) B,A,D,E
    A	
                  iv) Map the label sequence to a
        R	
   E	
     unique value
                               ex) B,A,D,E R
    D	
                  v) Assign the value as the new
 Bag-of-words
 {A,B,D,E,…,R,…}	
 vertex label
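The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is given as a label dictionary plus an adjacency dictionary, the bag-of-words is treated as a set, and the new labels are generated as `L0`, `L1`, … by a shared `codebook` (all of these names are hypothetical, not taken from the talk or the gWT software).

```python
def wl_iteration(labels, adjacency, codebook):
    """One Weisfeiler-Lehman relabeling step.

    labels:    dict vertex -> current label
    adjacency: dict vertex -> list of neighbor vertices
    codebook:  dict shared by all graphs, label sequence -> new label
    """
    new_labels = {}
    for v, label in labels.items():
        # i) collect the labels of adjacent vertices, ii) sort them
        neighbor_labels = sorted(labels[u] for u in adjacency[v])
        # iii) add the vertex's own label as a prefix
        signature = (label, *neighbor_labels)
        # iv) map the label sequence to a unique value
        if signature not in codebook:
            codebook[signature] = f"L{len(codebook)}"  # hypothetical naming scheme
        # v) assign the value as the new vertex label
        new_labels[v] = codebook[signature]
    return new_labels


def wl_bag_of_words(labels, adjacency, iterations, codebook):
    """Collect the labels seen over all iterations as a set of words."""
    words = set(labels.values())
    for _ in range(iterations):
        labels = wl_iteration(labels, adjacency, codebook)
        words |= set(labels.values())
    return words


# Vertex 0 is labeled B and has neighbors labeled A, D, E, as in the slide.
labels = {0: "B", 1: "A", 2: "D", 3: "E"}
adjacency = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
codebook = {}
words = wl_bag_of_words(labels, adjacency, 1, codebook)
```

Sharing one `codebook` across all graphs is what makes the same neighborhood receive the same word in every graph, so that bag-of-words from different graphs are comparable.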
Search by cosine similarity	
n  Identify
          all graphs in the database whose
  cosine is larger than a threshold 1-ε
                                        | Wi !Q |
                 Wi s.t K N (Wi ,Q) =                ! 1" !
                                        | Wi | | Q |

   l    Wi, Q: bag-of-words of graphs
n  The above solution can be relaxed as
  follows,
 If K (Q,W ) ! 1" ! , then
         N

                                             |Q |
                   (1! ! )2 | Q |!| W |!
                                           (1! ! )2
   l    Can be used for fast search
Semi-conjunctive query	
n Cosine   query can be relaxed to the following
form
              Wi s.t | Wi !Q |" k
  l  The number of common words between
      two bag-of-words Wi and Q
  Ex) |Wi Q|=|(A,C,E,F,H) (A,E,I,J,L)|
                 =|(A,E)|=2
  l  k=(1-ε)2|Q|
  l No false negatives

  l False positives can easily be filtered out by
  cosine calculations
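A minimal sketch of the two-stage search described above: first filter by the semi-conjunctive condition, then remove false positives with the exact cosine. It assumes bag-of-words are non-empty Python sets and that the cosine uses a square-root normalization; the function names and the toy database are hypothetical.

```python
import math


def cosine(w, q):
    """Normalized set kernel |W ∩ Q| / sqrt(|W| |Q|); assumes non-empty sets."""
    return len(w & q) / math.sqrt(len(w) * len(q))


def semi_conjunctive_threshold(q, eps):
    """k = (1 - eps)^2 |Q|, the minimum number of shared words."""
    return (1 - eps) ** 2 * len(q)


def search(database, q, eps):
    """Return indices of bag-of-words whose cosine with q is at least 1 - eps."""
    k = semi_conjunctive_threshold(q, eps)
    # Stage 1: semi-conjunctive query (no false negatives, some false positives).
    candidates = [i for i, w in enumerate(database) if len(w & q) >= k]
    # Stage 2: filter false positives with the exact cosine.
    return [i for i in candidates if cosine(database[i], q) >= 1 - eps]


database = [
    {"A", "C", "E", "F", "H"},   # shares A and E with the query
    {"A", "E", "I", "J", "L"},   # identical to the query
    {"X", "Y"},                  # shares nothing
]
query = {"A", "E", "I", "J", "L"}

loose = search(database, query, eps=0.7)   # [0, 1]
strict = search(database, query, eps=0.5)  # [1]
```

With `eps=0.5` the first graph passes the semi-conjunctive stage (2 shared words, k = 1.25) but is rejected by the cosine check (0.4 < 0.5), which is exactly the kind of false positive the second stage exists to remove.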
Inverted index	
n  Innatural language processing, inverted
   index has been used to solve semi-
   conjunctive query	

Inverted Index	
         	
           	
   n  Associative  map
  	
                 	
    l    key    word
  	
          	
  	
            	
         l    value graph identifiers
  	
          	
  	
                 	
                                       including a word
Bottom-up search	
  i) Look the index up with the query bag-of-words
  ii) Aggregate all the lists of graph indices
  iii) Sort
  iv) Scan
Ex) Query: (A,C,E)
  Aggregation: (2,8,13,15,8,10,16,4,9,13,14)
  Sort:        (2,4,8,8,9,10,13,13,14,15,16)
  k=
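The bottom-up procedure can be sketched as follows. The posting lists below are a hypothetical split of the aggregated array shown in the slide (the slide does not say which ids belong to which word), and `k=2` is an assumed value chosen only for illustration.

```python
def bottom_up_search(index, query, k):
    """Semi-conjunctive query on an inverted index: look up, aggregate, sort, scan."""
    # i) look up each query word, ii) aggregate the posting lists
    aggregated = []
    for word in query:
        aggregated.extend(index.get(word, []))
    # iii) sort so that equal graph ids become adjacent
    aggregated.sort()
    # iv) scan runs of equal ids; a run of length >= k is a hit
    hits = []
    run_start = 0
    for pos in range(1, len(aggregated) + 1):
        if pos == len(aggregated) or aggregated[pos] != aggregated[run_start]:
            if pos - run_start >= k:
                hits.append(aggregated[run_start])
            run_start = pos
    return hits


# Hypothetical posting lists whose concatenation matches the slide's example.
index = {
    "A": [2, 8, 13, 15],
    "C": [8, 10, 16],
    "E": [4, 9, 13, 14],
}
hits = bottom_up_search(index, ["A", "C", "E"], k=2)  # [8, 13]
```

The cost is dominated by the sort over the aggregated array, which is why long queries over unspecific words make this approach slow.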
Search time of inverted index
         on 25 million graphs	
n Searchtime of inverted index is not so different from
that of sequencial scan	
                                              40 sec	
                                              38 sec
Why?	
                           n  Each    word is not
        Query:(A,C,E)	
          specific enough
                Aggregation	
 Query contains 1000s
                             n 
(2,8,13,15,8,10,16,4,9,13,14)	
 of words
                Sort	
       n  Aggregated array is

(2,4,8,8,9,10,13,13,14,15,16)	
 VERY long
                             n  Sorting takes O(ClogC)
              C	
                in time
Overview of our method 	
n  Top-down   search in a tree over the series of
    graphs
n  Huge memory, if tree is implemented with
    pointers
n  Wavelet Tree: Succinct data structure
n  The smaller the similarity threshold is, the
    quicker the algorithm finishes
     •  Not the case in inverted index
Binary tree over graphs	
                                      n  leaf       graph
                  [1,8]	
           0	
                        n  node       interval
                            1	
        [1,4]	
             [5,8]	
                                      n        Each node is identified
      0	
        1	
      0	
      1	
          by a bit string (v={01})
                                            n  At the leaves, the graph
    [1,2]	
    [3,4]	
   [5,6]	
  [7,8]	
                                                indices correspond to
 0	
     1	
 0	
 1	
0	
 1	
 0	
        1	
                                                int(v)+1
  1	
 2	
 3	
 4	
 5	
 6	
 7	
 8	
{000}	
 001}	
      {     {010}	
 {100}	
 {110}	
                  {011}	
 {101}	
   {111}	
    (e.g., int(010)+1=2+1=3)
Summarization of bag-of-words	
•  Represent a bag-of-words as a bit array
Ex) Wi=(1,3,4,7,8)
    position: 1 2 3 4 5 6 7 8
    xi=(1,0,1,1,0,0,1,1)

•  Take the disjunction of all bit arrays in the interval of a node v
Ex) For an interval [1,4]
    x1=(0,1,0,0,0,0,1,0)
    x2=(1,0,1,1,0,0,0,0)
    x3=(1,0,0,0,0,0,1,1)
    x4=(1,0,0,0,0,0,0,1)
    yv = x1 ∨ x2 ∨ x3 ∨ x4 = (1,1,1,1,0,0,1,1)
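The two examples above can be reproduced with a short sketch. It assumes words are numbered from 1 and bit arrays are plain Python lists; `to_bit_array` and `summarize` are hypothetical helper names.

```python
def to_bit_array(words, num_words):
    """Bit j (1-based) is 1 if word j occurs in the bag-of-words."""
    return [1 if j in words else 0 for j in range(1, num_words + 1)]


def summarize(bit_arrays):
    """Bitwise disjunction (OR) of the bit arrays of all graphs in an interval."""
    return [int(any(column)) for column in zip(*bit_arrays)]


x_i = to_bit_array({1, 3, 4, 7, 8}, 8)   # [1, 0, 1, 1, 0, 0, 1, 1]

x1 = [0, 1, 0, 0, 0, 0, 1, 0]
x2 = [1, 0, 1, 1, 0, 0, 0, 0]
x3 = [1, 0, 0, 0, 0, 0, 1, 1]
x4 = [1, 0, 0, 0, 0, 0, 0, 1]
y_v = summarize([x1, x2, x3, x4])        # [1, 1, 1, 1, 0, 0, 1, 1]
```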
Binary tree over graphs	
•  Assign to each node v a bit array yv
•  yv: bit array
  -  i-th bit is 1 if any graph in the interval has the corresponding word
[Figure: the binary tree over graphs 1 to 8, annotated with yv]
  Root:    [1,8] yv=111111
  Level 1: [1,4] yv=110111   [5,8] yv=101101
  Level 2: [1,2] yv=010101   [3,4] yv=110100   [5,6] yv=100100   [7,8] yv=001101
  Leaves:  1 {000}  2 {001}  3 {010}  4 {011}  5 {100}  6 {101}  7 {110}  8 {111}
Top-down traversal	
•  Q: bag-of-words of a query
•  Perform top-down traversal
•  Prune the search space if $\sum_{j \in Q} y_v[j] < k$
•  The larger k is, the smaller the search space is
[Figure: the binary tree over graphs 1 to 8, annotated with yv]
  Root:    [1,8] yv=111111
  Level 1: [1,4] yv=110111   [5,8] yv=101101
  Level 2: [1,2] yv=010101   [3,4] yv=110100   [5,6] yv=100100   [7,8] yv=001101
  Leaves:  1 {000}  2 {001}  3 {010}  4 {011}  5 {100}  6 {101}  7 {110}  8 {111}
Huge Memory	
n  Time   is O(τm) : Very fast
  l  τ: the number of traversed node
  l  m: the number of bag-of-words in a query

n  Space   is O(Mnlogn) bit
  l  M: the number of unique words
  l  n: the number of graphs
Wavelet Tree! (SODA,03)	
n  Replace yv    in each node v by a rank
  dictionary
  l    explained in next slides
n  Implement  a tree without using pointers
n  Only 60% memory overhead compared to
    the inverted index (Vigna,08)
n  Access to the summary information in any
    internal node
Rank dictionary (Raman,02)	
n  Give    bit array B[1,n] the following operation:
   l    rankc(B,i): return the number of c   {0,1} in B[1…i]	



Ex) B=0110011100	
                        i 1 2 3 4 5 6 7 8 9 10
           rank1(B,8)=5 0 1 1 0 0 1 1 1 0 0
           rank0(B,5)=3	
 0 1 1 0 0 1 1 1 0 0
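The rank operation itself is easy to state in code. The sketch below is a naive linear-time version, meant only to pin down the definition and the 1-based indexing used in the slide; it is not the constant-time structure.

```python
def rank(bits, c, i):
    """Number of occurrences of c in bits[1..i] (1-based, inclusive)."""
    return sum(1 for b in bits[:i] if b == c)


B = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
ones_up_to_8 = rank(B, 1, 8)    # 5
zeros_up_to_5 = rank(B, 0, 5)   # 3
```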
Implementation of rank dictionary	
                           n  Divide the bit array B into
B	
                        large blocks of length l=log2n
                             RL=Ranks of large blocks
RL	
                       n  Divide each large block to
                           small blocks of length s=logn/2
                             Rs=Ranks of small blocks
RS	
                            relative to the large block
               rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank)	
                                 Time:O(1)
                                 Memory: n +o(n) bits
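The two-level layout can be sketched as below. This is a simplified illustration, not the implementation used in the talk: block lengths are rounded so that a large block is a whole number of small blocks, logarithms are taken base 2, and the remaining rank inside a small block is counted by scanning at most s bits instead of using a lookup table, so the query here is O(log n) rather than strictly O(1). The class and attribute names are hypothetical.

```python
import math


class RankDictionary:
    """Two-level rank index over a bit array (a simplified sketch)."""

    def __init__(self, bits):
        self.bits = list(bits)
        log_n = max(2, math.ceil(math.log2(max(len(self.bits), 2))))
        self.s = log_n // 2                                # small block, about (log n)/2
        self.l = self.s * max(1, log_n * log_n // self.s)  # large block, about log^2 n
        self.RL = []  # number of 1s before each large block
        self.RS = []  # number of 1s before each small block, within its large block
        total = in_large = 0
        for pos in range(0, len(self.bits), self.s):
            if pos % self.l == 0:        # a new large block starts here
                self.RL.append(total)
                in_large = 0
            self.RS.append(in_large)
            ones = sum(self.bits[pos:pos + self.s])
            total += ones
            in_large += ones

    def rank(self, c, i):
        """Number of occurrences of c in bits[1..i] (1-based, inclusive)."""
        i = max(0, min(i, len(self.bits)))
        if i == 0:
            return 0
        block = (i - 1) // self.s        # small block containing bit i
        start = block * self.s
        ones = self.RL[start // self.l] + self.RS[block] + sum(self.bits[start:i])
        return ones if c == 1 else i - ones


rd = RankDictionary([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])
r1 = rd.rank(1, 8)   # 5
r0 = rd.rank(0, 5)   # 3
```

Only rank1 needs to be stored, since rank0(B,i) = i - rank1(B,i).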
Restricted inverted index	
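•  Concatenate graph ids for words in the root
•  Restrict the inverted index to the interval [sv,tv] of a node v
[Figure: inverted index for words A, B, C, D and its restriction to the two children of the root; "|" separates the ids of consecutive words]
  [1,8]  Aroot  = 1 3 6 8 | 2 5 7 | 1 2 7 | 4 5
  ids ≤4 go to the left child, ids >4 go to the right child
  [1,4]  Aleft  = 1 3 | 2 | 1 2 | 4
  [5,8]  Aright = 6 8 | 5 7 | 7 | 5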
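A small sketch of the restriction step, using the posting lists that can be read off the figure (A: 1 3 6 8, B: 2 5 7, C: 1 2 7, D: 4 5). The helper names `restrict` and `concatenate` are hypothetical.

```python
def restrict(postings, lo, hi):
    """Keep, for every word, only the graph ids inside the interval [lo, hi]."""
    return {word: [g for g in ids if lo <= g <= hi] for word, ids in postings.items()}


def concatenate(postings):
    """Concatenate the posting lists in word order into one array."""
    return [g for word in sorted(postings) for g in postings[word]]


postings = {"A": [1, 3, 6, 8], "B": [2, 5, 7], "C": [1, 2, 7], "D": [4, 5]}

a_root = concatenate(postings)                    # [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
a_left = concatenate(restrict(postings, 1, 4))    # [1, 3, 2, 1, 2, 4]
a_right = concatenate(restrict(postings, 5, 8))   # [6, 8, 5, 7, 7, 5]
```

Note that the restriction keeps the relative order of ids within each word, which is the property the later rank-based interval computation relies on.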
Whole structure of
          restricted inverted index	
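[Figure: restricted inverted indices at every level of the tree; each node lists the words it contains and the concatenated graph ids]
  [1,8]  A B C D   1 3 6 8 2 5 7 1 2 7 4 5

  [1,4]  A B C D   1 3 2 1 2 4
  [5,8]  A B C D   6 8 5 7 7 5

  [1,2]  A B C     1 2 1 2
  [3,4]  A D       3 4
  [5,6]  A B D     6 5 5
  [7,8]  A B C     8 7 7

  Leaves:
  1: A C   (1 1)
  2: B C   (2 2)
  3: A     (3)
  4: D     (4)
  5: B D   (5 5)
  6: A     (6)
  7: B C   (7 7)
  8: A     (8)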
Similarity search	
n  To
    retrieve graphs similar to a query
  Q=(A,C), the tree is traversed in the top-
  down manner.

          [1,8]    A	
         C	
          Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                       4	
           >4	
 [1,4]       A	
  C	
        [5,8]   A	
   C	
  Aleft     1 3 2 1 2 4      Aright 6 8 5 7 7 5
Similarity search	
n    To retrieve graphs similar to a query Q=(A,C), the tree
      is traversed in a top-down manner.
n  Observation
      l    To perform top-down traversal, only intervals
            of words in each node are necessary

                    A [1,4]	
    C [8,10]	
              Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                             4	
              >4	
            A [1,2]	
C [4,5]	
             A[1,2]	
 C[5,5]	
       Aleft 1 3 2 1 2 4            Aright 6 8 5 7 7 5
Similarity search	
n  Replace
          restricted inverted index Av in
  each node v with a bit array bv.
   l    bv[i]=1 if Av[i] goes to the right child


                	
 0	
 0	
 1	
1	
 0	
 1	
 1	
 0	
 0	
 1	
 0	
 1	
            Aroot 1 3 6 8 2 5 7 1 2 7 4 5
                             0	
                       1	

 bleft	
     0	
 1	
 0	
0	
 0	
 1	
      bright	
 0	
 1	
 0	
1	
 1	
 0	
 Aleft       1 3 2 1 2 4                 Aright 6 8 5 7 7 5
Similarity search	
n  Intervals       of child nodes can be computed by
      rank operations
l    sleft(v),j=rank0(bv,svj-1)+1,tleft(v),j=rank0(bv,tvj)
l    sright(v),j =rank1(bv,svj-1)+1,tright(v),j=rank1(bv,tvj)
                                               C [8,10]	
      Ex)
                    	
 0	
 0	
 1	
1	
 0	
 1	
 1	
 0	
 0	
 1	
 0	
 1	
                Aroot 1 3 6 8 2 5 7 1 2 7 4 5
rank0(broot,8-1)+1=4,                                            rank1(broot,8-1)+1=5,
rank0(broot,10)=5	
                                              rank1(broot,10)=5	
                        C [4,5]	
                               C [5,5]	
      bleft	
    0	
 1	
 0	
0	
 0	
 1	
       bright	
 0	
 1	
 0	
1	
 1	
 0	
      Aleft      1 3 2 1 2 4                  Aright 6 8 5 7 7 5
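The interval formulas can be checked directly against the example. The sketch uses a naive linear-time `rank` for clarity (a rank dictionary would answer the same queries faster); `child_interval` is a hypothetical helper name.

```python
def rank(bits, c, i):
    """Number of occurrences of c in bits[1..i] (1-based, inclusive)."""
    return sum(1 for b in bits[:i] if b == c)


def child_interval(bits, s, t, c):
    """Map the interval [s, t] of a node to its child: c=0 is left, c=1 is right."""
    return rank(bits, c, s - 1) + 1, rank(bits, c, t)


b_root = [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]

c_left = child_interval(b_root, 8, 10, 0)    # (4, 5)
c_right = child_interval(b_root, 8, 10, 1)   # (5, 5)
a_left = child_interval(b_root, 1, 4, 0)     # (1, 2)
a_right = child_interval(b_root, 1, 4, 1)    # (1, 2)
```

When a word has no occurrence in a child, the returned interval has its start greater than its end, which is how an empty interval shows up in this representation.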
Wavelet Tree	
n Wavelet tree can be obtained to replace the restricted
Inverted indices with bit arrays
n Wavelet tree consists of bit arrays bv and initial
intervals Croot.

     Croot	
      A	
    B	
   C	
  D	
               0 0 1 1 0 1 1 0 0 1 0 1



         0 1 0 0 0 1             0 1 0 1 1 0



    0 1 0 1          0 1       1 0 0       1 0 0
Wavelet Tree	
n Graphids can be recovered from bit strings
 on the path from the root to leaves	
  Croot	
 A	
       B	
   C	
   D	
                 0 0 1 1 0 1 1 0 0 1 0 1

                     0	
                   1	

        0 1 0 0 0 1                0 1 0 1 1 0

        0	
           1	
           0	
           1	

  0 1 0 1                  0 1   1 0 0            1 0 0

  0	
      1	
        0	
 1	
    0	
 1	
          0	
 1	
 000    001          010 011 100 101             110 111
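Putting the pieces together, the following sketch builds the bit arrays from the root array and answers a semi-conjunctive query using only rank operations and word intervals. It is an illustration under several assumptions, not the gWT implementation: the tree is stored with nested dictionaries rather than pointer-free, `rank` is the naive linear-time version, and the number of graphs is a power of two so that the root-to-leaf bit string v satisfies id = int(v)+1.

```python
def rank(bits, c, i):
    """Number of occurrences of c in bits[1..i] (1-based, inclusive)."""
    return sum(1 for b in bits[:i] if b == c)


def build(ids, lo, hi):
    """Bit arrays for the node covering graph ids lo..hi; leaves store nothing."""
    if lo == hi:
        return None
    mid = (lo + hi) // 2
    bits = [1 if g > mid else 0 for g in ids]   # 1 means "goes to the right child"
    left = build([g for g in ids if g <= mid], lo, mid)
    right = build([g for g in ids if g > mid], mid + 1, hi)
    return {"bits": bits, "children": (left, right)}


def search(node, intervals, k, depth, path=0):
    """Graph ids that contain at least k of the query words."""
    alive = [(s, t) for s, t in intervals if s <= t]   # words present below this node
    if len(alive) < k:
        return []                                      # prune
    if depth == 0:
        return [path + 1]                              # id = int(v) + 1
    hits = []
    for c in (0, 1):
        bits = node["bits"]
        child = [(rank(bits, c, s - 1) + 1, rank(bits, c, t)) for s, t in alive]
        hits += search(node["children"][c], child, k, depth - 1, path * 2 + c)
    return hits


a_root = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
c_root = {"A": (1, 4), "B": (5, 7), "C": (8, 10), "D": (11, 12)}
tree = build(a_root, 1, 8)

query = [c_root["A"], c_root["C"]]
both = search(tree, query, k=2, depth=3)     # [1]
either = search(tree, query, k=1, depth=3)   # [1, 2, 3, 6, 7, 8]
```

The number of query words with a non-empty interval in a node plays the role of the summary count used for pruning, so no separate summary bit array has to be stored.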
Memory	
n  (1+α)Nlogn          + MlogN bits
        Bit arrays bv	
 Initial intervals Croot	
      l  N: the number of all words in the database
      l  n: the number of graphs

      l  α: overhead for rank dictionary (α=0.6)

n    For inverted index, Nlogn bits
n    About 60% overhead to inverted index!!
Experiments	
n  25 million chemical compounds from
    PubChem database
n  Use search time and memory as
    evaluation measures
n  Compare our method gWT to
   l  inverted index
   l  sequential scan implemented in G-Hash

       (Wang et al, 2009)
Search time on 25 million graphs	
[Figure: search time on 25 million graphs; bars labeled 40 sec, 38 sec, 8 sec, 3 sec, and 2 sec]
Memory usage
Overhead of rank dictionary
Construction time
Related work	
•  A lot of methods have been proposed so far.
 1.gIndex [Yan et al., 04]
 2.Graph Grep [Shasha et al., 07]
 3.Tree+Delta [Zhao et al., 07]
 4.TreePi [Zhang et al., 07]
 5.Gstring [Jiang et al., 07]
 6.FG-Index [Cheng et al., 07]
 7.GDIndex [Williams et al., 07]
 etc
Related work 	
•  Decompose graphs into a set of substructures
   - subgraphs, trees, paths etc
•  Build a substructure-based index
[Figure: Decompose → … → Index]
Drawbacks	
•  Require frequent subgraph mining
•  Do not scale to millions of graphs
Summary	
n  Efficientsimilarity search method for
    massive graph databases
n  Solve semi-conjunctive query efficiently
n  Built on wavelet trees
n  Use Weisfeiler-Lehman procedure to
    convert graphs into bag-of-words
n  Applicable to more than 20 million graphs
n  Software
    http://code.google.com/p/gwt/
Acknowledgements	

•  Prof. Shin-ichi Minato (Hokkaido Univ.)
•  Dr. Daisuke Okanohara (PFI)
•  Members in ERATO Minato Project
