Mining Adaptively Frequent Closed Unlabeled
       Rooted Trees in Data Streams

            Albert Bifet and Ricard Gavaldà

               Universitat Politècnica de Catalunya


14th ACM SIGKDD International Conference on Knowledge
         Discovery and Data Mining (KDD’08)
                2008 Las Vegas, USA
Tree Mining
   Mining frequent trees is becoming an important task
   Applications:
       chemical informatics
       computer vision
       text retrieval
       bioinformatics
       Web analysis: many link-based structures may be
       studied formally by means of unordered trees

Data Streams
   Sequence is potentially infinite
   High amount of data: sublinear space
   High speed of arrival: sublinear time per example
Introduction: Data Streams

   Data Streams
      Sequence is potentially infinite
      High amount of data: sublinear space
      High speed of arrival: sublinear time per example
      Once an element from a data stream has been processed,
      it is discarded or archived

   Example
   Puzzle: Finding Missing Numbers
       Let π be a permutation of {1, . . . , n}.
       Let π−1 be π with one element missing.
       π−1[i] arrives in increasing order
   Task: Determine the missing number
       Naive solution: use an n-bit vector to memorize all the
       numbers seen (O(n) space)
       Data stream solution: O(log n) space. Store
           n(n + 1)/2 − ∑j≤i π−1[j]
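The O(log n) trick above can be sketched in a few lines of Python (a minimal illustration, not code from the paper):

```python
# Streaming solution to the missing-number puzzle: instead of an
# n-bit vector (O(n) space), keep only a running sum, which fits in
# O(log n) bits. The sum works whether or not items arrive in order.

def find_missing(stream, n):
    """Return the element missing from a permutation of {1..n}
    seen as a one-pass stream."""
    remainder = n * (n + 1) // 2   # sum of 1..n
    for x in stream:               # each item is read once, then discarded
        remainder -= x
    return remainder               # whatever was never subtracted is missing

print(find_missing(iter([1, 2, 4, 5]), 5))  # -> 3
```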
Introduction: Trees


  Our trees are:                   Our subtrees are:
      Unlabeled                         Induced
      Ordered and Unordered
                    Two different ordered trees
                   but the same unordered tree
Introduction


      Induced subtrees: obtained by repeatedly removing leaf
      nodes




      Embedded subtrees: obtained by contracting some of the
      edges
Introduction

   What Is Tree Pattern Mining?

   Given a dataset of trees, find the complete set of frequent
   subtrees
       Frequent Tree Pattern (FS):
            Include all the trees whose support is no less than min_sup
       Closed Frequent Tree Pattern (CS):
            Include no tree which has a super-tree with the same
            support
       CS ⊆ FS
       Closed Frequent Tree Mining provides a compact
       representation of frequent trees without loss of information
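The FS/CS distinction can be illustrated with a toy script. Here strings and the substring relation stand in for trees and the subtree relation; this is an illustration of the definitions, not the paper's algorithm:

```python
# Frequent vs. closed patterns on a toy dataset. Strings play the role
# of trees and "substring" plays the role of the subtree relation.

def support(p, dataset):
    return sum(1 for t in dataset if p in t)

def frequent_and_closed(candidates, dataset, min_sup):
    fs = {p: support(p, dataset) for p in candidates
          if support(p, dataset) >= min_sup}
    # closed: no proper super-pattern with exactly the same support
    cs = {p: s for p, s in fs.items()
          if not any(p != q and p in q and fs[q] == s for q in fs)}
    return fs, cs

data = ["abc", "abd", "ab"]
fs, cs = frequent_and_closed(["a", "b", "ab", "abc"], data, min_sup=2)
print(sorted(fs), sorted(cs))  # -> ['a', 'ab', 'b'] ['ab']
```

Note that CS ⊆ FS holds by construction, and the closed pattern "ab" alone determines the supports of its sub-patterns.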
Introduction

   Unordered Subtree Mining
    [Tree figures A, B, X, Y omitted]

                          D = {A, B}, min_sup = 2

                           # Closed Subtrees: 2
                          # Frequent Subtrees: 9

         Closed Subtrees: X, Y

         Frequent Subtrees: [figure omitted]
Introduction


   Problem
   Given a data stream D of rooted, unlabeled and unordered
   trees, find the frequent closed trees.

   We provide three algorithms, of increasing power:
       Incremental
       Sliding Window
       Adaptive
Outline


   1   Introduction


   2   Data Streams


   3   ADWIN : Concept Drift Mining


   4   Adaptive Closed Frequent Tree Mining


   5   Summary
Data Streams



  Data Streams
  At any time t in the data stream, we would like the per-item
  processing time and storage to be simultaneously O(log^k (N · t)).

  Approximation algorithms
      Small error rate with high probability
      An algorithm (ε, δ)-approximates F if it outputs F̃ for
      which Pr[|F̃ − F| > εF] < δ.
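As a concrete instance of such a guarantee, the standard Hoeffding bound gives the number of samples needed to (ε, δ)-approximate a bounded mean. This is the simpler additive-error variant, not the relative-error form εF used above; the function name is illustrative:

```python
# Sample size for an additive (eps, delta)-approximation of a mean in
# [0, 1], from the Hoeffding bound: m >= ln(2/delta) / (2 * eps**2),
# so that Pr[|estimate - mean| > eps] < delta.

import math

def hoeffding_samples(eps, delta):
    """Samples sufficient for an additive (eps, delta) guarantee."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_samples(0.1, 0.05))  # -> 185
```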
Data Streams Approximation Algorithms



            1011000111 1010101

  Sliding Window
  We can maintain simple statistics over sliding windows, using
  O((1/ε) log² N) space, where
      N is the length of the sliding window
      ε is the accuracy parameter

      M. Datar, A. Gionis, P. Indyk, and R. Motwani.
      Maintaining stream statistics over sliding windows. 2002
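A simplified sketch of the exponential-histogram idea behind the Datar et al. result: the window is summarized by buckets of exponentially growing sizes, with at most k buckets per size. The class name and parameter k are illustrative, and this is a reduced variant, not the paper's data structure:

```python
# Approximate count of 1's in a sliding window of length N using
# buckets of exponentially growing sizes; with ~1/eps buckets per
# size, space is O((1/eps) log^2 N) bits overall.

class ExpHistogram:
    def __init__(self, window, k=2):
        self.window = window      # N, sliding-window length
        self.k = k                # max buckets per size (~1/eps)
        self.buckets = []         # (timestamp, size) pairs, newest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # expire the oldest bucket once it leaves the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            # merge the two oldest buckets whenever a size overflows
            size = 1
            while sum(1 for _, s in self.buckets if s == size) > self.k:
                idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                i, j = idx[-2], idx[-1]            # two oldest of that size
                self.buckets[i] = (self.buckets[i][0], 2 * size)
                del self.buckets[j]
                size *= 2

    def count(self):
        # every bucket counted fully except the oldest, counted at half
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2
```

Only the oldest bucket's contents are uncertain, which is what bounds the relative error of `count()`.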
ADWIN: Adaptive sliding window

  ADWIN
  An adaptive sliding window whose size is recomputed online
  according to the rate of change observed.

  ADWIN has rigorous guarantees (theorems)
      On ratio of false positives and negatives
      On the relation of the size of the current window and
      change rates

  ADWIN using a Data Stream Sliding Window Model,
      can provide the exact counts of 1’s in O(1) time per point.
      tries O(log W ) cutpoints
      uses O((1/ε) log W ) memory words
      the processing time per example is O(log W ) (amortized
      and worst-case).
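The cut-testing idea can be sketched as an uncompressed toy version (the function name, the delta parameter, and the exact Hoeffding-style threshold are illustrative; the real ADWIN stores the window in O((1/ε) log W) buckets and tries only O(log W) cutpoints, which is what gives the bounds above):

```python
# Toy ADWIN-style change detector: keep recent values in a list and,
# after each arrival, try every cutpoint; if the two subwindows have
# means differing by more than a Hoeffding-style bound eps_cut, drop
# the stale prefix. O(W^2) per step, unlike the real bucketed ADWIN.

import math

def adwin_step(window, x, delta=0.01):
    """Append x, then shrink the window while some split signals change."""
    window.append(x)
    changed = True
    while changed and len(window) >= 2:
        changed = False
        for i in range(1, len(window)):
            n0, n1 = i, len(window) - i
            mu0 = sum(window[:i]) / n0
            mu1 = sum(window[i:]) / n1
            m = 1 / (1 / n0 + 1 / n1)          # harmonic mean of sizes
            eps_cut = math.sqrt(math.log(4 * len(window) / delta) / (2 * m))
            if abs(mu0 - mu1) > eps_cut:
                del window[:i]                  # drop the older subwindow
                changed = True
                break
    return window
```

Fed 60 zeros and then 60 ones, the window shrinks shortly after the change and ends up holding (almost) only post-change values.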
Time Change Detectors and Predictors: A General
Framework

   [Diagram: the input xt feeds an Estimator, which outputs an
   Estimation; the Estimator feeds a Change Detector, which
   outputs an Alarm; a Memory module interacts with both the
   Estimator and the Change Detector]
Window Management Models


                      W = 101010110111111

Equal & fixed size              Total window against
subwindows                     subwindow
      1010 1011011 1111               10101011011 1111
[Kifer+ 04]                    [Gama+ 04]

Equal size adjacent            ADWIN: All Adjacent
subwindows                     subwindows
     1010101 1011     1111           1 01010110111111
[Dasu+ 06]
Pattern Relaxed Support


     Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng,
     Yunfeng Liu and Kunqing Xie.
     CLAIM: An Efficient Method for Relaxed Frequent Closed
     Itemsets Mining over Stream Data
       Linear Relaxed Interval: the support space of all
       subpatterns can be divided into n = 1/εr intervals, where
       εr is a user-specified relaxed factor, and each interval can
       be denoted by Ii = [li , ui ), where li = (n − i) · εr ≥ 0,
       ui = (n − i + 1) · εr ≤ 1 and i ≤ n.
       Linear Relaxed closed subpattern t: if and only if there
       exists no proper superpattern t′ of t such that their
       supports belong to the same interval Ii .
Pattern Relaxed Support



  As the number of closed frequent patterns is not linear with
  respect to support, we introduce a new relaxed support:
      Logarithmic Relaxed Interval: the support space of all
      subpatterns can be divided into n = 1/εr intervals, where
      εr is a user-specified relaxed factor, and each interval can
      be denoted by Ii = [li , ui ), where li = c^i , ui = c^(i+1) − 1
      and i ≤ n.
      Logarithmic Relaxed closed subpattern t: if and only if
      there exists no proper superpattern t′ of t such that their
      supports belong to the same interval Ii .
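The interval test reduces to comparing interval indices, as in this short sketch (function names are illustrative; integer arithmetic avoids floating-point rounding in the logarithm):

```python
# Logarithmic relaxed intervals I_i = [c**i, c**(i+1) - 1]: two
# supports are equivalent for relaxed closedness when they share i.

def log_interval(support, c=2):
    """Index i with c**i <= support < c**(i+1)."""
    assert support >= 1
    i = 0
    while c ** (i + 1) <= support:
        i += 1
    return i

def same_interval(s1, s2, c=2):
    """True if s1 and s2 fall in the same logarithmic interval."""
    return log_interval(s1, c) == log_interval(s2, c)

# with c = 2, supports 4..7 all map to interval 2, so a super-pattern
# with support 4 "absorbs" a pattern with support 7 under the relaxation
print(log_interval(7), log_interval(8))  # -> 2 3
```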
Galois Lattice of closed set of trees

   [Lattice diagram over a dataset D of three trees: nodes 1, 2, 3;
   12, 13, 23; 123]

   We need
       a Galois connection pair
       a closure operator
Incremental mining on closed frequent trees

    1   Adding a tree transaction does not decrease the
        number of closed trees for D.
    2   Adding a transaction with a closed tree does not
        modify the number of closed trees for D.
Sliding Window mining on closed frequent trees

    1   Deleting a tree transaction does not increase the
        number of closed trees for D.
    2   Deleting a tree transaction that is repeated does not
        modify the number of closed trees for D.
Algorithms

  Algorithms
      Incremental: I NC T REE N AT
      Sliding Window: W IN T REE N AT
      Adaptive: A DAT REE N AT Uses ADWIN to monitor change

  ADWIN
  An adaptive sliding window whose size is recomputed online
  according to the rate of change observed.

  ADWIN has rigorous guarantees (theorems)
      On ratio of false positives and negatives
      On the relation of the size of the current window and
      change rates
Experimental Validation: TN1

   [Plot: running time (sec.) vs dataset size (millions of trees),
   comparing CMTreeMiner and I NC T REE N AT]

     Figure: Time on experiments on ordered trees on TN1 dataset
Experimental Validation

   [Plot: number of closed trees vs number of samples, for
   AdaTreeInc 1 and AdaTreeInc 2]

   Figure: Number of closed trees maintaining the same number of
   closed datasets on input data
Summary



  Conclusions
      New logarithmic relaxed closed support
      Using Galois Lattice Theory, we present methods for mining
      closed trees
          Incremental: I NC T REE N AT
          Sliding Window: W IN T REE N AT
          Adaptive: A DAT REE N AT using ADWIN to monitor change

  Future Work
  Labeled Trees and XML data.

More Related Content

PDF
A Short Course in Data Stream Mining
PDF
Mining Frequent Closed Graphs on Evolving Data Streams
PDF
Internet of Things Data Science
PDF
Real-Time Big Data Stream Analytics
PPTX
STRIP: stream learning of influence probabilities.
PDF
Efficient Online Evaluation of Big Data Stream Classifiers
PPTX
Scaling Python to CPUs and GPUs
PDF
PyCon Estonia 2019
A Short Course in Data Stream Mining
Mining Frequent Closed Graphs on Evolving Data Streams
Internet of Things Data Science
Real-Time Big Data Stream Analytics
STRIP: stream learning of influence probabilities.
Efficient Online Evaluation of Big Data Stream Classifiers
Scaling Python to CPUs and GPUs
PyCon Estonia 2019

What's hot (20)

PDF
Introduction to Deep Learning with Python
PDF
Text classification in scikit-learn
PDF
Deep Learning through Examples
PPTX
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
PDF
[241]large scale search with polysemous codes
PDF
N-gram IDF: A Global Term Weighting Scheme Based on Information Distance (WWW...
PDF
Codes and Isogenies
PPTX
Scalable membership management
PDF
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
PDF
Alex Tellez, Deep Learning Applications
PDF
Introduction to Neural Networks in Tensorflow
PDF
Keynote at Converge 2019
PPTX
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
PDF
Latent Semantic Analysis of Wikipedia with Spark
PPTX
Diving into Deep Learning (Silicon Valley Code Camp 2017)
PDF
Array computing and the evolution of SciPy, NumPy, and PyData
PDF
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
PDF
Data Science and Machine Learning Using Python and Scikit-learn
PDF
Bayesian Counters
PDF
Sea Amsterdam 2014 November 19
Introduction to Deep Learning with Python
Text classification in scikit-learn
Deep Learning through Examples
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
[241]large scale search with polysemous codes
N-gram IDF: A Global Term Weighting Scheme Based on Information Distance (WWW...
Codes and Isogenies
Scalable membership management
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Alex Tellez, Deep Learning Applications
Introduction to Neural Networks in Tensorflow
Keynote at Converge 2019
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Latent Semantic Analysis of Wikipedia with Spark
Diving into Deep Learning (Silicon Valley Code Camp 2017)
Array computing and the evolution of SciPy, NumPy, and PyData
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Data Science and Machine Learning Using Python and Scikit-learn
Bayesian Counters
Sea Amsterdam 2014 November 19
Ad

Viewers also liked (20)

PDF
Introduction to Big Data
PDF
Mining Implications from Lattices of Closed Trees
PDF
Postgraduate Studies: Graduate School Experience
PDF
@Travis pm. presents 'what is klout?'
PDF
Introduction to Big Data Science
PDF
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
PDF
Adaptive XML Tree Mining on Evolving Data Streams
PDF
MOA : Massive Online Analysis
PDF
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PDF
New ensemble methods for evolving data streams
PDF
Sentiment Knowledge Discovery in Twitter Streaming Data
PDF
Leveraging Bagging for Evolving Data Streams
PDF
Understanding the Effects of Streamlining the Orchestration of Learning Activ...
PDF
Kalman Filters and Adaptive Windows for Learning in Data Streams
PDF
Ad hoc vs. organised orchestration: A comparative analysis of technology-driv...
PDF
Distributed Decision Tree Learning for Mining Big Data Streams
PDF
Apache Samoa: Mining Big Data Streams with Apache Flink
PPTX
High Availability in YARN
PDF
Moa: Real Time Analytics for Data Streams
PPTX
Data warehouse
Introduction to Big Data
Mining Implications from Lattices of Closed Trees
Postgraduate Studies: Graduate School Experience
@Travis pm. presents 'what is klout?'
Introduction to Big Data Science
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Adaptive XML Tree Mining on Evolving Data Streams
MOA : Massive Online Analysis
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
New ensemble methods for evolving data streams
Sentiment Knowledge Discovery in Twitter Streaming Data
Leveraging Bagging for Evolving Data Streams
Understanding the Effects of Streamlining the Orchestration of Learning Activ...
Kalman Filters and Adaptive Windows for Learning in Data Streams
Ad hoc vs. organised orchestration: A comparative analysis of technology-driv...
Distributed Decision Tree Learning for Mining Big Data Streams
Apache Samoa: Mining Big Data Streams with Apache Flink
High Availability in YARN
Moa: Real Time Analytics for Data Streams
Data warehouse
Ad

Similar to Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams (20)

PDF
18 Data Streams
PPTX
DeepFak.pptx asdasdasdasdasdasdasdasdasd
PDF
Lecture 7: Recurrent Neural Networks
PPTX
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
PPT
5.3 dyn algo-i
PPTX
Java and Deep Learning (Introduction)
PPT
5.1 mining data streams
PPTX
Publishing consuming Linked Sensor Data meetup Cuenca
PPT
Classification: Decision Trees , random Forest.ppt
PPTX
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
PPT
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
PDF
Deep Learning Based Voice Activity Detection and Speech Enhancement
PPT
19 algorithms-and-complexity-110627100203-phpapp02
KEY
Defense
PPTX
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
PDF
Time Series Data with Apache Cassandra
ODP
EOS5 Demo
ODP
End of Sprint 5
PDF
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
PPT
14889574 dl ml RNN Deeplearning MMMm.ppt
18 Data Streams
DeepFak.pptx asdasdasdasdasdasdasdasdasd
Lecture 7: Recurrent Neural Networks
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
5.3 dyn algo-i
Java and Deep Learning (Introduction)
5.1 mining data streams
Publishing consuming Linked Sensor Data meetup Cuenca
Classification: Decision Trees , random Forest.ppt
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Deep Learning Based Voice Activity Detection and Speech Enhancement
19 algorithms-and-complexity-110627100203-phpapp02
Defense
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Time Series Data with Apache Cassandra
EOS5 Demo
End of Sprint 5
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
14889574 dl ml RNN Deeplearning MMMm.ppt

More from Albert Bifet (11)

PDF
Artificial intelligence and data stream mining
PDF
MOA for the IoT at ACML 2016
PDF
Mining Big Data Streams with APACHE SAMOA
PDF
Real Time Big Data Management
PDF
Multi-label Classification with Meta-labels
PDF
Pitfalls in benchmarking data stream classification and how to avoid them
PDF
Efficient Data Stream Classification via Probabilistic Adaptive Windows
PPTX
Mining Big Data in Real Time
PDF
Mining Big Data in Real Time
PDF
Fast Perceptron Decision Tree Learning from Evolving Data Streams
PDF
Adaptive Learning and Mining for Data Streams and Frequent Patterns
Artificial intelligence and data stream mining
MOA for the IoT at ACML 2016
Mining Big Data Streams with APACHE SAMOA
Real Time Big Data Management
Multi-label Classification with Meta-labels
Pitfalls in benchmarking data stream classification and how to avoid them
Efficient Data Stream Classification via Probabilistic Adaptive Windows
Mining Big Data in Real Time
Mining Big Data in Real Time
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Adaptive Learning and Mining for Data Streams and Frequent Patterns

Recently uploaded (20)

PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPT
Teaching material agriculture food technology
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Machine learning based COVID-19 study performance prediction
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Spectroscopy.pptx food analysis technology
The Rise and Fall of 3GPP – Time for a Sabbatical?
Network Security Unit 5.pdf for BCA BBA.
sap open course for s4hana steps from ECC to s4
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Review of recent advances in non-invasive hemoglobin estimation
MYSQL Presentation for SQL database connectivity
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Teaching material agriculture food technology
MIND Revenue Release Quarter 2 2025 Press Release
20250228 LYD VKU AI Blended-Learning.pptx
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Dropbox Q2 2025 Financial Results & Investor Presentation
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
NewMind AI Weekly Chronicles - August'25 Week I
Machine learning based COVID-19 study performance prediction
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Spectroscopy.pptx food analysis technology

Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

  • 1. Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams Albert Bifet and Ricard Gavaldà Universitat Politècnica de Catalunya 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08) 2008 Las Vegas, USA
  • 2. Tree Mining Mining frequent trees is becoming an important task Applications: chemical informatics computer vision text retrieval bioinformatics Data Streams Web analysis. Sequence is potentially Many link-based infinite structures may be High amount of data: studied formally by sublinear space means of unordered High speed of arrival: trees sublinear time per example
  • 3–6. Introduction: Data Streams Data Streams: the sequence is potentially infinite; high amount of data: sublinear space; high speed of arrival: sublinear time per example; once an element from a data stream has been processed, it is discarded or archived. Example Puzzle: Finding Missing Numbers. Let π be a permutation of {1, . . . , n}, and let π−1 be π with one element missing; π−1[i] arrives in increasing order. Task: determine the missing number. A naive solution uses an n-bit vector to memorize all the numbers seen (O(n) space). A data stream solution uses O(log(n)) space: store the running value n(n + 1)/2 − ∑_{j≤i} π−1[j]; when the stream ends, this value is the missing number.
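The O(log n)-space solution on this slide fits in a few lines (a minimal sketch; the function name and list encoding of the stream are illustrative):

```python
def find_missing(stream, n):
    """Recover the missing element of a permutation of {1, ..., n}
    while storing only one running value: O(log n) bits of state."""
    remainder = n * (n + 1) // 2   # sum of the complete permutation
    for x in stream:               # subtract each arriving element once
        remainder -= x
    return remainder               # what is left is the missing number

print(find_missing([5, 1, 4, 2], n=5))  # → 3
```

Note that the running-sum trick does not even need the slide's assumption that elements arrive in increasing order.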
  • 7. Introduction: Trees Our trees are: unlabeled, ordered and unordered. Our subtrees are: induced. (Figure: two different ordered trees that are the same unordered tree.)
  • 8. Introduction Induced subtrees: obtained by repeatedly removing leaf nodes Embedded subtrees: obtained by contracting some of the edges
  • 9. Introduction What Is Tree Pattern Mining? Given a dataset of trees, find the complete set of frequent subtrees. Frequent Tree Pattern (FS): includes all the trees whose support is no less than min_sup. Closed Frequent Tree Pattern (CS): includes no tree which has a super-tree with the same support. CS ⊆ FS. Closed frequent tree mining provides a compact representation of frequent trees without loss of information.
  • 10. Introduction Unordered Subtree Mining (Figure: example trees A, B and subtrees X, Y.) D = {A, B}, min_sup = 2. # Closed Subtrees: 2 (namely X, Y). # Frequent Subtrees: 9 (shown in the figure).
  • 11. Introduction Problem: given a data stream D of rooted, unlabeled and unordered trees, find the frequent closed trees. We provide three algorithms of increasing power: Incremental, Sliding Window, and Adaptive.
  • 12. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 13. Data Streams At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously O(log^k (N, t)), i.e., polylogarithmic. Approximation algorithms: small error rate with high probability. An algorithm (ε, δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
  • 14–19. Data Streams Approximation Algorithms (Figure: a sliding window advancing one bit at a time over a binary stream.) Sliding Window: we can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where N is the length of the sliding window and ε is the accuracy parameter. M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002.
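The Datar–Gionis–Indyk–Motwani bound cited on this slide rests on exponential histograms. A condensed sketch of that technique, under simplifying assumptions (at most two buckets per size, estimate = total bucket mass minus half the oldest bucket; class and method names are mine):

```python
class ExpHistogram:
    """Sketch of an exponential histogram: approximately count the 1s
    among the last N bits of a stream using O(log^2 N) space."""

    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []  # (timestamp of bucket's most recent 1, size), newest first

    def update(self, bit):
        self.t += 1
        # Expire buckets whose most recent 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit:
            self.buckets.insert(0, (self.t, 1))
            # Keep at most two buckets per size: when a third appears,
            # merge the two oldest of that size into one of double size.
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    merged = (self.buckets[i + 1][0], self.buckets[i + 1][1] * 2)
                    self.buckets[i + 1:i + 3] = [merged]
                else:
                    i += 1

    def count(self):
        """Estimate: total bucket mass minus half of the (possibly
        partially expired) oldest bucket."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```

Because bucket sizes grow geometrically, only O(log N) buckets exist and the error is bounded by half the oldest bucket, which is the source of the accuracy parameter ε in the slide's space bound.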
  • 20. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 21. ADWIN: Adaptive sliding window ADWIN is an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems): on the ratio of false positives and negatives, and on the relation between the size of the current window and change rates. Using a Data Stream Sliding Window Model, ADWIN can provide the exact counts of 1’s in O(1) time per point; it tries O(log W) cutpoints, uses O((1/ε) log W) memory words, and its processing time per example is O(log W) (amortized and worst-case).
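A toy sketch of the ADWIN cut rule described above, under stated simplifications (brute force over every cutpoint rather than the paper's O(log W) bucket boundaries, and a simplified Hoeffding-style threshold; `adwin_sketch` is a hypothetical name, not the authors' implementation):

```python
import math
from collections import deque

def adwin_sketch(stream, delta=0.01):
    """Shrink the window whenever some split into two adjacent
    subwindows shows averages differing by more than eps_cut."""
    window = deque()
    for x in stream:
        window.append(x)
        shrunk = True
        while shrunk and len(window) > 1:
            shrunk = False
            w = list(window)
            for split in range(1, len(w)):
                w0, w1 = w[:split], w[split:]
                m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))  # harmonic mean of sizes
                eps_cut = math.sqrt(math.log(4 * len(w) / delta) / (2 * m))
                if abs(sum(w0) / len(w0) - sum(w1) / len(w1)) >= eps_cut:
                    for _ in range(split):      # drop the older subwindow
                        window.popleft()
                    shrunk = True
                    break
    return list(window)

# After a change from all 0s to all 1s, the stale prefix is dropped.
print(set(adwin_sketch([0] * 100 + [1] * 100)))
```

The window size is never chosen by the user: it grows while the data looks stationary and collapses as soon as two subwindows are distinguishable, which is exactly the adaptivity the slide claims.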
  • 22–24. Time Change Detectors and Predictors: A General Framework (Figure: the input xt feeds an Estimator that outputs an Estimation; a Change Detector watches the estimator and raises an Alarm; a Memory module supports both.)
  • 25–38. Window Management Models W = 101010110111111. Equal & fixed-size subwindows [Kifer+ 04]; total window against subwindow [Gama+ 04]; equal-size adjacent subwindows [Dasu+ 06]; ADWIN: all adjacent subwindows (the figure steps through every split of W into two adjacent subwindows).
  • 39. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 40. Pattern Relaxed Support Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data. Linear Relaxed Interval: the support space of all subpatterns can be divided into n = 1/εr intervals, where εr is a user-specified relaxation factor, and each interval can be denoted by Ii = [li, ui), where li = (n − i) · εr ≥ 0, ui = (n − i + 1) · εr ≤ 1 and i ≤ n. Linear Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval Ii.
  • 41. Pattern Relaxed Support As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support. Logarithmic Relaxed Interval: the support space of all subpatterns can be divided into n = 1/εr intervals, where εr is a user-specified relaxation factor, and each interval can be denoted by Ii = [li, ui), where li = c^i, ui = c^(i+1) − 1 and i ≤ n. Logarithmic Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval Ii.
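The interval definition above amounts to bucketing supports by their integer logarithm, which a tiny helper makes concrete (a hypothetical helper; `c` plays the role of the base fixed by the relaxation factor):

```python
def log_interval(support, c=2):
    """Index i of the logarithmic relaxed interval [c^i, c^(i+1) - 1]
    containing a support value, via integer arithmetic."""
    i = 0
    while c ** (i + 1) <= support:
        i += 1
    return i

# Supports 4..7 all fall in interval 2 for c = 2, so two patterns with
# supports 5 and 6 count as having "the same" support.
print([log_interval(s) for s in [1, 2, 3, 4, 7, 8]])  # → [0, 1, 1, 2, 2, 3]
```

Because the buckets widen geometrically, high-support patterns tolerate larger absolute support differences before being reported as separate closed patterns, matching the slide's motivation that the number of closed patterns is not linear in the support.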
  • 42. Galois Lattice of closed sets of trees (Figure: the lattice of closed sets 1, 2, 3, 12, 13, 23, 123 over dataset D.) We need: a Galois connection pair and a closure operator.
  • 43. Incremental mining on closed frequent trees 1. Adding a tree transaction does not decrease the number of closed trees for D. 2. Adding a transaction with a closed tree does not modify the number of closed trees for D. (Figure: the Galois lattice 1, 2, 3, 12, 13, 23, 123.)
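Both lattice properties on this slide can be checked on a toy example, using itemsets as a stand-in for trees (the closed patterns of a dataset are exactly the intersections of nonempty subsets of transactions; the dataset is mine, for illustration):

```python
from itertools import combinations

def closed_patterns(D):
    """All closed patterns of D: every intersection of a nonempty
    subset of transactions (the Galois closure on itemsets)."""
    closed = set()
    for k in range(1, len(D) + 1):
        for subset in combinations(D, k):
            closed.add(frozenset.intersection(*subset))
    return closed

D = [frozenset('ab'), frozenset('ac')]
base = closed_patterns(D)
with_new = closed_patterns(D + [frozenset('bc')])    # arbitrary new transaction
with_closed = closed_patterns(D + [frozenset('a')])  # 'a' is already closed in D

print(len(base), len(with_new), len(with_closed))  # → 3 7 3
```

Property 1 shows up as `len(with_new) >= len(base)` and property 2 as `len(with_closed) == len(base)`; these monotonicity facts are what let the incremental and sliding-window algorithms maintain the closed set without recomputing it from scratch.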
  • 44. Sliding Window mining on closed frequent trees 1. Deleting a tree transaction does not increase the number of closed trees for D. 2. Deleting a tree transaction that is repeated does not modify the number of closed trees for D. (Figure: the Galois lattice 1, 2, 3, 12, 13, 23, 123.)
  • 45. Algorithms Incremental: IncTreeNat. Sliding Window: WinTreeNat. Adaptive: AdaTreeNat, which uses ADWIN to monitor change. ADWIN: an adaptive sliding window whose size is recomputed online according to the rate of change observed; ADWIN has rigorous guarantees (theorems) on the ratio of false positives and negatives, and on the relation between the size of the current window and change rates.
  • 46. Experimental Validation: TN1 (Figure: time in seconds versus dataset size in millions, comparing CMTreeMiner against IncTreeNat on ordered trees on the TN1 dataset.)
  • 47. Experimental Validation (Figure: number of closed trees versus number of samples for AdaTreeInc 1 and AdaTreeInc 2, maintaining the same number of closed trees on the input data.)
  • 48. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 49. Summary Conclusions: a new logarithmic relaxed closed support; using Galois lattice theory, we present methods for mining closed trees: Incremental (IncTreeNat), Sliding Window (WinTreeNat), and Adaptive (AdaTreeNat, using ADWIN to monitor change). Future Work: labeled trees and XML data.