Dmss2011 public

March 30, 2011, The 5th International Workshop on
Data-Mining and Statistical Science@Osaka University

Kernel-based Similarity Search
in Massive Graph Databases
with Wavelet Trees

Yasuo Tabei
(JST ERATO Minato Project)
Joint work with Koji Tsuda (AIST)

Outline
n  Introduction
l  Recent development of graph databases
l  Needs for graph similarity search

l  Bag-of-words representation of a graph

l  Semi-conjunctive query

n  Method
l  Scalable similarity search with wavelet trees
n  Experiments
l  Use a large-scale graph dataset
l  25 million chemical compounds

Graphs are everywhere

•  Gene co-expression network
•  RNA 2D structure
•  Protein 3D structure
•  Chemical compound
•  Social network

Graph Similarity Search
n  Retrievegraphs similar to the query
n  Large databases
l  More than 20 million chemical compounds in
PubChem database
n  Bag-of-words representation of graphs
l  WL procedure (NIPS 2009)
n  Why not use document retrieval methods?
l  Inverted index
l  Not that easy (explained later)

Weisfeiler-Lehman Procedure (NIPS,09)
n Convert
a graph into a set of words
(bag-of-words)
i) Make a label set of adjacent
A
vertices ex) {E,A,D}
B
E
ii) Sort ex) A,D,E
iii) Add the vertex label as a prefix
D
ex) B,A,D,E
A
iv) Map the label sequence to a
R
E
unique value
ex) B,A,D,E R
D
v) Assign the value as the new
Bag-of-words
{A,B,D,E,…,R,…}
vertex label
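The five steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `wl_bag_of_words`, the adjacency-dict graph encoding, and the choice of the joined label string as the "unique value" (instead of a compact symbol such as R) are all assumptions made for the example.

```python
def wl_bag_of_words(labels, adjacency, iterations=1):
    """Collect every vertex label seen during `iterations` relabeling rounds."""
    labels = dict(labels)
    words = set(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for v, neighbours in adjacency.items():
            # i) + ii) gather the neighbours' labels and sort them
            neighbour_labels = sorted(labels[u] for u in neighbours)
            # iii) put the vertex's own label in front
            signature = [labels[v]] + neighbour_labels
            # iv) map the sequence to a unique value (assumed: the joined string)
            new_labels[v] = ",".join(signature)
        # v) the unique values become the new vertex labels
        labels = new_labels
        words.update(labels.values())
    return words

# Hypothetical toy graph: vertex 0 (label B) is adjacent to E, A and D.
labels = {0: "B", 1: "E", 2: "A", 3: "D"}
adjacency = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
words = wl_bag_of_words(labels, adjacency)
```

Here vertex 0 receives the new label `B,A,D,E`, which plays the role of R in the slide's example.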

Search by cosine similarity
n  Identify
all graphs in the database whose
cosine is larger than a threshold 1-ε
| Wi !Q |
Wi s.t K N (Wi ,Q) = ! 1" !
| Wi | | Q |

l  Wi, Q: bag-of-words of graphs
n  The above solution can be relaxed as
follows,
If K (Q,W ) ! 1" ! , then
N

|Q |
(1! ! )2 | Q |!| W |!
(1! ! )2
l  Can be used for fast search

Semi-conjunctive query
n Cosine query can be relaxed to the following
form
Wi s.t | Wi !Q |" k
l  The number of common words between
two bag-of-words Wi and Q
Ex) |Wi Q|=|(A,C,E,F,H) (A,E,I,J,L)|
=|(A,E)|=2
l  k=(1-ε)2|Q|
l No false negatives

l False positives can easily be filtered out by
cosine calculations
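The two-stage idea (relaxed overlap test first, exact cosine filter second) might look as follows in Python. This is a brute-force sketch for illustration only: the function name `semi_conjunctive_search`, the toy database, and ε = 0.5 are assumptions, and no index is used.

```python
import math

def cosine(w, q):
    return len(w & q) / math.sqrt(len(w) * len(q))

def semi_conjunctive_search(database, query, eps):
    """Return (candidates, hits): overlap >= k, then cosine >= 1 - eps."""
    k = (1 - eps) ** 2 * len(query)
    candidates = [i for i, w in enumerate(database) if len(w & query) >= k]
    hits = [i for i in candidates if cosine(database[i], query) >= 1 - eps]
    return candidates, hits

# Hypothetical database of three bags-of-words.
database = [
    {"A", "C", "E", "F", "H"},   # overlap 2, cosine 0.4: false positive
    {"A", "E", "I", "J"},        # overlap 4, cosine about 0.89: true hit
    {"X", "Y"},                  # overlap 0: rejected by the relaxed test
]
query = {"A", "E", "I", "J", "L"}
candidates, hits = semi_conjunctive_search(database, query, eps=0.5)
```

With ε = 0.5 the threshold is k = 1.25, so the first bag passes the relaxed test but is removed by the cosine check.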

Inverted index
n  Innatural language processing, inverted
index has been used to solve semi-
conjunctive query

Inverted Index

n  Associative map

l  key word

l  value graph identifiers

including a word

Bottom-up search
i) Look the index up with the query bag-of-words
ii) Aggregate all the lists of graph indices
iii) Sort
iv) Scan
[Figure: inverted index example]
Query: (A,C,E)
Aggregation: (2,8,13,15,8,10,16,4,9,13,14)
Sort: (2,4,8,8,9,10,13,13,14,15,16)

k=
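The bottom-up procedure can be sketched in Python as below. The posting lists are an assumption inferred from the aggregated array shown on the slide (split where the ids stop increasing), the function name `bottom_up_search` is hypothetical, and k = 2 is chosen only for illustration.

```python
from itertools import groupby

def bottom_up_search(index, query, k):
    """Aggregate posting lists, sort, then scan for ids occurring >= k times."""
    # i) + ii) look up each query word and concatenate its list of graph ids
    aggregated = [g for word in query for g in index.get(word, [])]
    # iii) sort so that equal ids become adjacent
    ordered = sorted(aggregated)
    # iv) scan runs of equal ids and keep those with at least k occurrences
    hits = [g for g, run in groupby(ordered) if sum(1 for _ in run) >= k]
    return aggregated, ordered, hits

# Assumed posting lists consistent with the slide's aggregated array.
index = {"A": [2, 8, 13, 15], "C": [8, 10, 16], "E": [4, 9, 13, 14]}
aggregated, ordered, hits = bottom_up_search(index, ["A", "C", "E"], k=2)
```

The cost is dominated by sorting the aggregated array, whose length grows with the number of query words and the length of each posting list.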

Search time of inverted index
on 25 million graphs
n Searchtime of inverted index is not so different from
that of sequencial scan
40 sec
38 sec

Why?
n  Each word is not
Query:(A,C,E)
specific enough
Aggregation
Query contains 1000s
n 
(2,8,13,15,8,10,16,4,9,13,14)
of words
Sort
n  Aggregated array is

(2,4,8,8,9,10,13,13,14,15,16)
VERY long
n  Sorting takes O(ClogC)
C
in time

Overview of our method
n  Top-down search in a tree over the series of
graphs
n  Huge memory, if tree is implemented with
pointers
n  Wavelet Tree: Succinct data structure
n  The smaller the similarity threshold is, the
quicker the algorithm finishes
•  Not the case in inverted index

Binary tree over graphs
n  leaf graph
[1,8]
0
n  node interval
1
[1,4]
[5,8]
n  Each node is identified
0
1
0
1
by a bit string (v={01})
n  At the leaves, the graph
[1,2]
[3,4]
[5,6]
[7,8]
indices correspond to
0
1
0
1
0
1
0
1
int(v)+1
1
2
3
4
5
6
7
8
{000}
001}
{ {010}
{100}
{110}
{011}
{101}
{111}
(e.g., int(010)+1=2+1=3)

Summarization of bag-of-words
•  Represent a bag-of-words as a bit array
   Ex) $W_i = (1,3,4,7,8)$ → $x_i = (1,0,1,1,0,0,1,1)$ (bit positions 1-8)
•  Take the disjunction of all bit arrays in the interval of a node v
   Ex) For an interval [1,4]:
       $x_1 = (0,1,0,0,0,0,1,0)$
       $x_2 = (1,0,1,1,0,0,0,0)$
       $x_3 = (1,0,0,0,0,0,1,1)$
       $x_4 = (1,0,0,0,0,0,0,1)$
       $y_v = x_1 \lor x_2 \lor x_3 \lor x_4 = (1,1,1,1,0,0,1,1)$
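Both examples on this slide can be reproduced with a few lines of Python. The helper names `to_bits` and `summarize` are hypothetical; bit arrays are modeled as plain lists of 0/1.

```python
def to_bits(words, vocabulary_size):
    """Bit array whose j-th entry (1-based) is 1 if word j is in the bag."""
    return [1 if j in words else 0 for j in range(1, vocabulary_size + 1)]

def summarize(bit_arrays):
    """Bitwise OR (disjunction) of equally long bit arrays."""
    return [int(any(column)) for column in zip(*bit_arrays)]

x_i = to_bits({1, 3, 4, 7, 8}, vocabulary_size=8)

x1 = [0, 1, 0, 0, 0, 0, 1, 0]
x2 = [1, 0, 1, 1, 0, 0, 0, 0]
x3 = [1, 0, 0, 0, 0, 0, 1, 1]
x4 = [1, 0, 0, 0, 0, 0, 0, 1]
y_v = summarize([x1, x2, x3, x4])
```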

Binary tree over graphs
•  Assign to each node v a bit array $y_v$
•  $y_v$: bit array
   - The i-th bit is 1 if at least one graph in the node's interval has the corresponding word.
[Figure: the binary tree over graphs 1-8 with a bit array $y_v$ at each node. [1,8]: 111111; [1,4]: 110111; [5,8]: 101101; [1,2]: 010101; [3,4]: 110100; [5,6]: 100100; [7,8]: 001101.]

Top-down traversal
•  Q: bag-of-words of a query
•  Perform top-down traversal
•  Prune the search space if $\sum_{j \in Q} y_v[j] < k$
•  The larger k is, the smaller the search space is
[Figure: the binary tree over graphs 1-8 annotated with the bit array $y_v$ at each node]
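A pointer-based version of this traversal can be sketched in Python as follows. It is an illustration under assumptions: each node stores the union of its graphs' words as a set (equivalent to the bit array $y_v$), the names `build` and `search` are hypothetical, and the eight toy bags are the ones used in the index slides of this deck.

```python
def build(bags, lo, hi):
    """Node over graphs lo..hi (1-based, inclusive) holding the union of words."""
    if lo == hi:
        return {"lo": lo, "hi": hi, "words": set(bags[lo - 1]), "children": []}
    mid = (lo + hi) // 2
    left, right = build(bags, lo, mid), build(bags, mid + 1, hi)
    return {"lo": lo, "hi": hi,
            "words": left["words"] | right["words"],
            "children": [left, right]}

def search(node, query, k, out):
    """Collect graph ids whose overlap with `query` is at least k."""
    if len(node["words"] & query) < k:
        return out                      # prune this whole subtree
    if not node["children"]:
        out.append(node["lo"])          # surviving leaf = matching graph
    for child in node["children"]:
        search(child, query, k, out)
    return out

# Bags of graphs 1..8 over the words A-D.
BAGS = [{"A", "C"}, {"B", "C"}, {"A"}, {"D"},
        {"B", "D"}, {"A"}, {"B", "C"}, {"A"}]
root = build(BAGS, 1, 8)
hits_k2 = search(root, {"A", "C"}, 2, [])
hits_k1 = search(root, {"A", "C"}, 1, [])
```

With k = 2 the subtrees [3,4] and [5,6] are pruned at once, which illustrates why a larger k shrinks the search space.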

Huge Memory
n  Time is O(τm) : Very fast
l  τ: the number of traversed node
l  m: the number of bag-of-words in a query

n  Space is O(Mnlogn) bit
l  M: the number of unique words
l  n: the number of graphs

Wavelet Tree! (SODA,03)
n  Replace yv in each node v by a rank
dictionary
l  explained in next slides
n  Implement a tree without using pointers
n  Only 60% memory overhead compared to
the inverted index (Vigna,08)
n  Access to the summary information in any
internal node

Rank dictionary (Raman,02)
n  Give bit array B[1,n] the following operation:
l  rankc(B,i): return the number of c {0,1} in B[1…i]

Ex) B=0110011100
i 1 2 3 4 5 6 7 8 9 10
rank1(B,8)=5 0 1 1 0 0 1 1 1 0 0
rank0(B,5)=3
0 1 1 0 0 1 1 1 0 0

Implementation of rank dictionary
n  Divide the bit array B into
B
large blocks of length l=log2n
RL=Ranks of large blocks
RL
n  Divide each large block to
small blocks of length s=logn/2
Rs=Ranks of small blocks
RS
relative to the large block
rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank)
Time:O(1)
Memory: n +o(n) bits
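The two-level layout can be illustrated with a small Python class. This is a teaching sketch rather than a succinct implementation: the class name `RankDictionary` is hypothetical, the block sizes (4 and 2) are toy values instead of $\log^2 n$ and $(\log n)/2$, and the remaining rank is found by scanning at most one small block, where a real implementation would use a lookup table or popcount.

```python
class RankDictionary:
    """Two-level rank index over a list of 0/1 bits (toy block sizes)."""

    def __init__(self, bits, large=4, small=2):
        assert large % small == 0, "small blocks must tile large blocks"
        self.bits, self.large, self.small = bits, large, small
        self.RL = []  # number of 1s before each large block
        self.RS = []  # number of 1s before each small block, within its large block
        total = relative = 0
        for pos in range(0, len(bits), small):
            if pos % large == 0:
                self.RL.append(total)
                relative = 0
            self.RS.append(relative)
            ones = sum(bits[pos:pos + small])
            total += ones
            relative += ones

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive); rank1(0) is 0."""
        if i == 0:
            return 0
        sb = (i - 1) // self.small              # small block holding bit i
        lb = (sb * self.small) // self.large    # large block holding it
        remaining = sum(self.bits[sb * self.small:i])
        return self.RL[lb] + self.RS[sb] + remaining

    def rank0(self, i):
        """Number of 0s in B[1..i]."""
        return i - self.rank1(i)

B = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
rd = RankDictionary(B)
```

For the slide's example, `rd.rank1(8)` and `rd.rank0(5)` reproduce the values 5 and 3.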

Restricted inverted index
•  Concatenate the graph ids for all words in the root
•  Restrict the inverted index to the interval $[s_v,t_v]$ of a node v
[Figure: concatenated inverted index at the root and at its two children]
[1,8]  A_root:  A: 1 3 6 8 | B: 2 5 7 | C: 1 2 7 | D: 4 5
[1,4]  A_left:  A: 1 3 | B: 2 | C: 1 2 | D: 4   (ids ≤ 4)
[5,8]  A_right: A: 6 8 | B: 5 7 | C: 7 | D: 5   (ids > 4)

Whole structure of
restricted inverted index
[1,8]   A: 1 3 6 8 | B: 2 5 7 | C: 1 2 7 | D: 4 5
[1,4]   A: 1 3 | B: 2 | C: 1 2 | D: 4
[5,8]   A: 6 8 | B: 5 7 | C: 7 | D: 5
[1,2]   A: 1 | B: 2 | C: 1 2
[3,4]   A: 3 | D: 4
[5,6]   A: 6 | B: 5 | D: 5
[7,8]   A: 8 | B: 7 | C: 7
Leaves  1: A C | 2: B C | 3: A | 4: D | 5: B D | 6: A | 7: B C | 8: A
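The concatenation and restriction steps can be reproduced in Python. The helper names `concatenate` and `restrict` are hypothetical; the posting lists are read off the root row of the structure above.

```python
def concatenate(index):
    """Concatenate posting lists in word order; record each word's interval."""
    ids, intervals, start = [], {}, 1
    for word in sorted(index):
        postings = index[word]
        ids.extend(postings)
        intervals[word] = (start, start + len(postings) - 1)  # 1-based, inclusive
        start += len(postings)
    return ids, intervals

def restrict(ids, lo, hi):
    """Keep, in order, only the graph ids that fall in the interval [lo, hi]."""
    return [g for g in ids if lo <= g <= hi]

index = {"A": [1, 3, 6, 8], "B": [2, 5, 7], "C": [1, 2, 7], "D": [4, 5]}
a_root, c_root = concatenate(index)
a_left = restrict(a_root, 1, 4)
a_right = restrict(a_root, 5, 8)
```

Because restriction preserves order, the entries of each word stay contiguous in every node, which is what makes per-word intervals sufficient later on.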

Similarity search
n  To
retrieve graphs similar to a query
Q=(A,C), the tree is traversed in the top-
down manner.

[1,8] A
C
Aroot 1 3 6 8 2 5 7 1 2 7 4 5

4
>4
[1,4] A
C
[5,8] A
C
Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5

Similarity search
n  To retrieve graphs similar to a query Q=(A,C), the tree
is traversed in a top-down manner.
n  Observation
l  To perform top-down traversal, only intervals
of words in each node are necessary

A [1,4]
C [8,10]
Aroot 1 3 6 8 2 5 7 1 2 7 4 5

4
>4
A [1,2]
C [4,5]
A[1,2]
C[5,5]
Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5

Similarity search
n  Replace
restricted inverted index Av in
each node v with a bit array bv.
l  bv[i]=1 if Av[i] goes to the right child

0
0
1
1
0
1
1
0
0
1
0
1
Aroot 1 3 6 8 2 5 7 1 2 7 4 5
0
1

bleft
0
1
0
0
0
1
bright
0
1
0
1
1
0
Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5

Similarity search
n  Intervals of child nodes can be computed by
rank operations
l  sleft(v),j=rank0(bv,svj-1)+1,tleft(v),j=rank0(bv,tvj)
l  sright(v),j =rank1(bv,svj-1)+1,tright(v),j=rank1(bv,tvj)
C [8,10]
Ex)

0
0
1
1
0
1
1
0
0
1
0
1
Aroot 1 3 6 8 2 5 7 1 2 7 4 5
rank0(broot,8-1)+1=4, rank1(broot,8-1)+1=5,
rank0(broot,10)=5
rank1(broot,10)=5
C [4,5]
C [5,5]
bleft
0
1
0
0
0
1
bright
0
1
0
1
1
0
Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
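The interval update can be checked with a short Python sketch. The helper names `rank` and `child_intervals` are hypothetical, and `rank` here is a naive linear count standing in for the constant-time rank dictionary.

```python
def rank(bits, c, i):
    """Number of occurrences of bit c in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count(c)

def child_intervals(bits, s, t):
    """Map a word's interval [s, t] in node v to its left and right children."""
    left = (rank(bits, 0, s - 1) + 1, rank(bits, 0, t))
    right = (rank(bits, 1, s - 1) + 1, rank(bits, 1, t))
    return left, right

b_root = [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]
c_left, c_right = child_intervals(b_root, 8, 10)   # word C
a_left, a_right = child_intervals(b_root, 1, 4)    # word A
```

An interval with start greater than end means the word does not occur in that child.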

Wavelet Tree
n Wavelet tree can be obtained to replace the restricted
Inverted indices with bit arrays
n Wavelet tree consists of bit arrays bv and initial
intervals Croot.

Croot
A
B
C
D
0 0 1 1 0 1 1 0 0 1 0 1

0 1 0 0 0 1 0 1 0 1 1 0

0 1 0 1 0 1 1 0 0 1 0 0

Wavelet Tree
n Graphids can be recovered from bit strings
on the path from the root to leaves
Croot
A
B
C
D
0 0 1 1 0 1 1 0 0 1 0 1

0
1

0 1 0 0 0 1 0 1 0 1 1 0

0
1
0
1

0 1 0 1 0 1 1 0 0 1 0 0

0
1
0
1
0
1
0
1
000 001 010 011 100 101 110 111
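Putting the pieces together, a toy wavelet-tree search can be sketched in Python. This is an illustration under assumptions, not the gWT code: bit arrays are stored per node in a dict keyed by the node's bit-string path, `rank` is a naive count rather than a rank dictionary, and the names `build` and `search` are hypothetical. The number of query words with a non-empty interval equals $\sum_{j \in Q} y_v[j]$, so it drives the pruning.

```python
def build(ids, lo, hi, path="", nodes=None):
    """Store b_v for every internal node; a bit is 1 if the id goes right."""
    nodes = {} if nodes is None else nodes
    if lo == hi:
        return nodes
    mid = (lo + hi) // 2
    nodes[path] = [int(g > mid) for g in ids]
    build([g for g in ids if g <= mid], lo, mid, path + "0", nodes)
    build([g for g in ids if g > mid], mid + 1, hi, path + "1", nodes)
    return nodes

def rank(bits, c, i):
    return bits[:i].count(c)

def search(nodes, intervals, k, depth, path="", out=None):
    """Top-down traversal using only per-word intervals and rank operations."""
    out = [] if out is None else out
    live = [(s, t) for s, t in intervals if s <= t]   # query words present here
    if len(live) < k:
        return out                                    # prune
    if len(path) == depth:
        out.append(int(path, 2) + 1)                  # leaf: graph id = int(v)+1
        return out
    b = nodes[path]
    for c in (0, 1):
        child = [(rank(b, c, s - 1) + 1, rank(b, c, t)) for s, t in live]
        search(nodes, child, k, depth, path + str(c), out)
    return out

a_root = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
nodes = build(a_root, 1, 8)
query_intervals = [(1, 4), (8, 10)]                   # words A and C in C_root
hits_k2 = search(nodes, query_intervals, 2, depth=3)
hits_k1 = search(nodes, query_intervals, 1, depth=3)
```

The bit arrays produced by `build` match the three levels shown on the slide, and no graph ids are stored anywhere: they are recovered from the root-to-leaf path.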

Memory
n  (1+α)Nlogn + MlogN bits
Bit arrays bv
Initial intervals Croot
l  N: the number of all words in the database
l  n: the number of graphs

l  α: overhead for rank dictionary (α=0.6)

n  For inverted index, Nlogn bits
n  About 60% overhead to inverted index!!

Experiments
n  25 million chemical compounds from
PubChem database
n  Use search time and memory as
evaluation measures
n  Compare our method gWT to
l  inverted index
l  sequential scan implemented in G-Hash

(Wang et al, 2009)

Search time on 25 million graphs
[Bar chart: search times of 40 sec, 38 sec, 8 sec, 3 sec, and 2 sec]

Related work
•  A lot of methods have been proposed so far.
1. gIndex [Yan et al., 04]
2. Graph Grep [Shasha et al., 07]
3. Tree+Delta [Zhao et al., 07]
4. TreePi [Zhang et al., 07]
5. Gstring [Jiang et al., 07]
6. FG-Index [Cheng et al., 07]
7. GDIndex [Williams et al., 07]
etc.

Related work
•  Decompose graphs into a set of substructures
   - subgraphs, trees, paths, etc.
•  Build a substructure-based index
[Figure: graphs → Decompose → substructures → Index]

Drawbacks
•  Require frequent subgraph mining
•  Do not scale to millions of graphs

Summary
n  Efficientsimilarity search method for
massive graph databases
n  Solve semi-conjunctive query efficiently
n  Built on wavelet trees
n  Use Weisfeiler-Lehman procedure to
convert graphs into bag-of-words
n  Applicable to more than 20 million graphs
n  Software
http://code.google.com/p/gwt/

Acknowledgements

•  Prof. Shin-ichi Minato (Hokkaido Univ.)
•  Dr. Daisuke Okanohara (PFI)
•  Members in ERATO Minato Project
