BACKGROUND    PARSE TREE KERNELS    APPROXIMATE TREE KERNELS    RESULTS    Conclusion




                      Approximate Tree Kernels
         Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller


                                   Presented By
                               Niharjyoti Sarangi
                      Indian Institute of Technology Madras




                                      April 21, 2012



     OUTLINE OF THE PRESENTATION
     1   BACKGROUND
             Learning from tree-structured data
             Application Domains
     2   PARSE TREE KERNELS
             Computing PTK
             Computational constraints
     3   APPROXIMATE TREE KERNELS
             Computing ATK
             Validity of ATK
             Types of learning
     4   RESULTS
             Performance
             Time
             Memory
     5   Conclusion



     TREE-STRUCTURED DATA


              Trees: carry hierarchical information
              Flat feature vectors: fail to capture the underlying
              dependency structure

     Parse Tree
     An ordered, rooted tree that represents the syntactic structure
     of a string according to some formal grammar.


     A tree X is called a parse tree of G = (S, P, s) if X is derived by
     assembling productions p ∈ P such that every node x ∈ X is
     labeled with a symbol l(x) ∈ S.
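For concreteness, an ordered, rooted parse tree of this kind can be modeled with a small tree structure. The following minimal Python sketch is illustrative only (the `Node` class and `size` helper are assumptions for this presentation, not part of the paper); `label` plays the role of l(x):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One node of an ordered, rooted parse tree."""
    label: str                                    # grammar symbol l(x)
    children: List["Node"] = field(default_factory=list)  # ordered children

def size(x: Node) -> int:
    """Number of nodes |X| in the subtree rooted at x."""
    return 1 + sum(size(c) for c in x.children)

# A tiny parse tree for the production S -> NP VP:
tree = Node("S", [Node("NP", [Node("N")]), Node("VP", [Node("V")])])
```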



     EXAMPLES




     Figure: Parse trees for natural language text and the HTTP network
     protocol.



     LEARNING FROM TREES


              Kernel functions for structured data
              Convolution of local kernels
              Parse tree kernel proposed by Collins and Duffy (2002)




     Kernel Functions
     k : X × X → R is a symmetric and positive semi-definite
     function, which implicitly computes an inner product in a
     reproducing kernel Hilbert space.



     APPLICATION DOMAINS




         Natural Language Processing
         Web Spam Detection
         Network Intrusion Detection
         Information Retrieval from structured
         documents
         ...



     COMPUTING PTK


     A generic technique for defining kernel functions over
     structured data is the convolution of local kernels defined over
     sub-structures.
     Parse tree kernel

        k(X, Z) = Σ_{x ∈ X} Σ_{z ∈ Z} c(x, z),   where X and Z are two parse trees.

     Notations
     xi : i-th child of a node x
     |X|: Number of nodes in X
     χ: Set of all possible trees



     ILLUSTRATION




                  Figure: Shared subtrees in two parse trees.



     COUNTING FUNCTION

     c(x, z) is known as the counting function which recursively
     determines the number of shared subtrees rooted in the tree
     nodes x and z.
     Defining c(x, z)

        c(x, z) = 0                                    if x, z are not derived from the same production p ∈ P
                = λ                                    if x, z are leaf nodes
                = λ ∏_{i=1}^{|x|} (1 + c(x_i, z_i))    otherwise



     0 ≤ λ ≤ 1 balances the contribution of subtrees: small values
     of λ decay the contribution of lower nodes in large subtrees.
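Putting the kernel and the counting function together, the computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Node` class, the `same_production` test (same symbol, same ordered child symbols), and the value of λ are choices made here, not the authors' implementation:

```python
from dataclasses import dataclass, field
from typing import List

LAMBDA = 0.5  # decay factor, 0 <= lambda <= 1 (illustrative choice)

@dataclass
class Node:
    label: str                                    # grammar symbol l(x)
    children: List["Node"] = field(default_factory=list)

def same_production(x: Node, z: Node) -> bool:
    # Hypothetical test that x and z are derived from the same
    # production: same symbol and the same ordered child symbols.
    return (x.label == z.label
            and [c.label for c in x.children] == [c.label for c in z.children])

def count(x: Node, z: Node) -> float:
    """Counting function c(x, z): decayed count of shared subtrees
    rooted at the nodes x and z."""
    if not same_production(x, z):
        return 0.0
    if not x.children:                            # x and z are leaf nodes
        return LAMBDA
    result = LAMBDA
    for xi, zi in zip(x.children, z.children):    # pair up the i-th children
        result *= 1.0 + count(xi, zi)
    return result

def ptk(X: Node, Z: Node) -> float:
    """Parse tree kernel: convolution of c over all node pairs."""
    def nodes(t: Node):
        yield t
        for c in t.children:
            yield from nodes(c)
    return sum(count(x, z) for x in nodes(X) for z in nodes(Z))
```

The double loop over all node pairs is exactly what makes the exact kernel quadratic in the number of nodes, as the next slide quantifies.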



     COMPUTATIONAL COMPLEXITY


              The complexity is O(n²), where n is the number of nodes
              in each parse tree.

     Experimental data
     The computation of a parse tree kernel for two HTML documents
     comprising 10,000 nodes each, requires about 1 gigabyte of memory
     and takes over 100 seconds on a recent computer system.

              In practice we need to compare a large number of parse
              trees. Given the statistics above, the computing resources
              required render exact PTKs impractical at this scale.



     ATTEMPTED IMPROVEMENTS




              A feature selection procedure based on statistical tests
              (Suzuki et al.)
              Limiting computation to node pairs with matching grammar
              symbols (Moschitti)



     COMPUTING ATK


     Approximation of tree kernels is based on the observation that
     trees often contain redundant parts that are not only irrelevant
     for the learning task but also slow down the kernel computation
     unnecessarily.

     Approximate tree kernel

        k̂(X, Z) = Σ_{s ∈ S} w(s) Σ_{x ∈ X : l(x)=s} Σ_{z ∈ Z : l(z)=s} c̃(x, z),   where X and Z are two parse trees.


     Selection function: w : S → {0, 1}
     Controls whether subtrees rooted in nodes with the symbol
     s ∈ S contribute to the convolution (w(s) = 1) or not (w(s) = 0).



     APPROXIMATE COUNTING FUNCTION

     c̃(x, z) is the approximate counting function.

     Defining c̃(x, z)

        c̃(x, z) = 0                                    if x, z are not derived from the same production p ∈ P
                 = 0                                    if x or z is not selected
                 = λ                                    if x, z are leaf nodes
                 = λ ∏_{i=1}^{|x|} (1 + c̃(x_i, z_i))   otherwise



     The selection function w(s) is chosen based on the domain and
     data. The exact parse tree kernel is obtained as a special case
     of the ATK if w(s) = 1 for all symbols s ∈ S.
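The approximate kernel only differs from the exact one in the extra selection check. A self-contained Python sketch follows (the `Node` class, `same_production` test, and λ are the same illustrative assumptions as before; `w` is a plain dict mapping symbols to 0 or 1, with unlisted symbols treated as deselected):

```python
from dataclasses import dataclass, field
from typing import Dict, List

LAMBDA = 0.5  # decay factor (illustrative)

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def same_production(x: Node, z: Node) -> bool:
    # Hypothetical test: same symbol and same ordered child symbols.
    return (x.label == z.label
            and [c.label for c in x.children] == [c.label for c in z.children])

def count_approx(x: Node, z: Node, w: Dict[str, int]) -> float:
    """Approximate counting function c~(x, z): zero whenever x or z
    carries a deselected symbol, otherwise identical to c(x, z)."""
    if w.get(x.label, 0) == 0 or w.get(z.label, 0) == 0:
        return 0.0                    # x or z not selected
    if not same_production(x, z):
        return 0.0
    if not x.children:                # x and z are leaf nodes
        return LAMBDA
    result = LAMBDA
    for xi, zi in zip(x.children, z.children):
        result *= 1.0 + count_approx(xi, zi, w)
    return result

def atk(X: Node, Z: Node, w: Dict[str, int]) -> float:
    """Approximate tree kernel: convolution restricted to node pairs
    sharing a selected symbol s with w(s) = 1."""
    def nodes(t: Node):
        yield t
        for c in t.children:
            yield from nodes(c)
    return sum(count_approx(x, z, w)
               for x in nodes(X) for z in nodes(Z)
               if x.label == z.label and w.get(x.label, 0) == 1)
```

With w(s) = 1 for every symbol, `atk` reproduces the exact PTK value, matching the special-case remark above.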



     ATK IS A VALID KERNEL




     Proof
     Let Φ(X) be the vector of frequencies of all subtrees occurring
     in X. Then, by definition, K̂_w can always be written as

        K̂_w = ⟨P_w Φ(X), P_w Φ(Z)⟩,

     where P_w projects out the dimensions belonging to deselected
     symbols. For any w, the projection P_w is independent of the
     actual X and Z, and hence K̂_w is a valid kernel.



     ATK IS FASTER THAN PTK



     Speed-up factor q_w

        q_w = ( Σ_{s ∈ S} #s(X) #s(Z) ) / ( Σ_{s ∈ S} w(s) #s(X) #s(Z) )

     where #s(X) denotes the number of nodes x ∈ X labeled with the symbol s.

     From this expression we always have q_w ≥ 1, and rejecting even a
     single occurring symbol in the approximate tree kernel already
     yields a strict speed-up q_w > 1.
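Because q_w depends only on per-symbol node counts, it can be computed from symbol histograms without touching the trees themselves. A small sketch with hypothetical counts (the symbol names and frequencies below are made up for illustration):

```python
from collections import Counter

def speedup(hist_X: Counter, hist_Z: Counter, w: dict) -> float:
    """Speed-up factor q_w: ratio of all node comparisons to the
    comparisons remaining after selection by w."""
    total = sum(hist_X[s] * hist_Z[s] for s in hist_X)
    kept = sum(w.get(s, 0) * hist_X[s] * hist_Z[s] for s in hist_X)
    # kept > 0 as long as at least one shared symbol is selected
    return total / kept

# Hypothetical histograms #s(X), #s(Z); rejecting the frequent
# symbol "det" removes the bulk of the comparisons:
hx = Counter({"s": 1, "np": 40, "det": 120})
hz = Counter({"s": 1, "np": 35, "det": 100})
w = {"s": 1, "np": 1, "det": 0}
```

Rejecting only "det" here already shrinks the denominator from 13,401 to 1,401 comparisons, i.e. q_w ≈ 9.6.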



     SUPERVISED SETTING


     Given n labeled parse trees (X1 , y1 ), · · · , (Xn , yn ), where yi are
     the class labels.
     An ideal kernel Gram matrix Y is given as follows:

                          Y_ij = [|y_i = y_j|] − [|y_i ≠ y_j|]

     Kernel target alignment

        ⟨Y, K̂_w⟩_F = Σ_{y_i = y_j} (K̂_w)_ij − Σ_{y_i ≠ y_j} (K̂_w)_ij

     Our target now is to maximize this term with respect to w.
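The alignment objective is straightforward to evaluate for a given kernel matrix and label vector: it is the within-class kernel mass minus the between-class mass. A short sketch (the 4×4 kernel values below are invented purely for illustration):

```python
import numpy as np

def alignment(K: np.ndarray, y: np.ndarray) -> float:
    """Kernel-target alignment <Y, K>_F with Y_ij = +1 if y_i == y_j
    and -1 otherwise."""
    Y = np.where(y[:, None] == y[None, :], 1.0, -1.0)  # ideal Gram matrix
    return float((Y * K).sum())

# Hypothetical kernel matrix over four trees with labels 0, 0, 1, 1:
y = np.array([0, 0, 1, 1])
K = np.array([[2.0, 1.0, 0.1, 0.0],
              [1.0, 2.0, 0.0, 0.2],
              [0.1, 0.0, 2.0, 1.5],
              [0.0, 0.2, 1.5, 2.0]])
```

Here the within-class blocks contribute 13.0 and the between-class entries subtract 0.6, so the alignment is 12.4; a selection w that concentrates kernel mass within classes raises this value.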



     SUPERVISED SETTING (CONTD.)



     Optimization Problem

        w* = argmax_{w ∈ [0,1]^|S|}  Σ_{i,j=1, i≠j}^{n}  Σ_{s ∈ S} w(s)  Σ_{x ∈ X_i : l(x)=s}  Σ_{z ∈ X_j : l(z)=s}  c̃(x, z)

     subject to

        Σ_{s ∈ S} w(s) ≤ N,    N ∈ ℕ



     UNSUPERVISED SETTING



     Average frequency of node comparisons

        f(s) = (1/n²) Σ_{i,j=1}^{n} #s(X_i) #s(X_j)

     Comparison ratio

        ρ = (expected node comparisons) / (actual number of comparisons in the PTK)
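Both quantities depend only on the per-tree symbol histograms, and the double sum over i, j factorizes into a square of per-symbol totals. A minimal sketch (helper names are my own, not the paper's):

```python
from collections import Counter
from typing import Dict, List

def avg_frequency(hists: List[Counter]) -> Dict[str, float]:
    """f(s) = (1/n^2) * sum_{i,j} #s(X_i) #s(X_j), computed from the
    symbol histograms of the n trees."""
    n = len(hists)
    symbols = set().union(*hists)
    totals = {s: sum(h[s] for h in hists) for s in symbols}
    # The double sum factorizes: sum_{i,j} #s(X_i)#s(X_j) = (sum_i #s(X_i))^2
    return {s: totals[s] ** 2 / n ** 2 for s in symbols}

def comparison_ratio(f: Dict[str, float], w: Dict[str, int]) -> float:
    """rho: expected node comparisons under selection w, relative to
    the comparisons performed by the exact PTK."""
    return sum(w.get(s, 0) * f[s] for s in f) / sum(f.values())
```

The factorization keeps the cost linear in the number of trees rather than quadratic, which is what makes screening candidate selections cheap.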



     UNSUPERVISED SETTING (CONTD.)



     Optimization Problem

        w* = argmax_{w ∈ [0,1]^|S|}  Σ_{i,j=1, i≠j}^{n}  Σ_{s ∈ S} w(s)  Σ_{x ∈ X_i : l(x)=s}  Σ_{z ∈ X_j : l(z)=s}  c̃(x, z)

     subject to

        ( Σ_{s ∈ S} w(s) f(s) ) / ( Σ_{s ∈ S} f(s) ) ≤ ρ



     SYNTHETIC DATA




     Figure: Classification performance for the supervised synthetic data.
     Figure: Detection performance for the unsupervised synthetic data.



     REAL DATA




     Figure: Classification performance for the question classification task.
     Figure: Detection performance for the intrusion detection task (FTP).



     TIME




     Figure: Training and testing time of SVMs using the exact and the
     approximate tree kernel.



     TIME (CONTD.)




       Figure: Run-times for web spam (WS) and intrusion detection (ID).



     MEMORY




     Figure: Memory requirements for web spam (WS) and intrusion
     detection (ID).



     CONCLUSION



              Approximate tree kernels give us a fast and efficient
              way to work with parse trees.
              Improvements in both run-time and memory: for large
              trees, the approximation reduces a single kernel
              computation from about 1 gigabyte to less than 800
              kilobytes of memory, accompanied by run-time
              improvements of up to three orders of magnitude.
              The best results were obtained for network intrusion
              detection.



     QUESTIONS




                                     Any questions?
