Efficient Itemset Generator Discovery over a Stream
Sliding Window
Chuancong Gao, Jianyong Wang
Database Laboratory
Department of Computer Science and Technology
Tsinghua University, Beijing 100084, China
C. Gao, J. Wang (Tsinghua Univ.) Efficient Itemset Generator Discovery over a Stream Sliding Window 1 / 28
Outline
Introduction
What is Generator
Why We Need Generators
What have We done
Related Work
The StreamGen Algorithm
FP-Tree
Enumeration Tree
ADD and REMOVE Operations
Extension for Mining Classification Rules
Evaluation Results
Conclusions
Introduction
What is Generator
Example:
Given the 4 transactions, with the minimum support threshold (sup_min) of 2:
A B C
A D
A B C D
A B D
Ø : 4
A : 4   B : 3   C : 2   D : 3
AB : 3   AC : 2   AD : 3   BC : 2   BD : 2
ABC : 2   ABD : 2
Figure legend: Equivalence Class, Generator Itemset, Closed Itemset
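The support values in the example can be reproduced with a few lines of brute-force counting. This is only a sketch for the toy data: the names `support` and `frequent_itemsets` are made up here for illustration, and enumerating every itemset is not how a real miner works.

```python
from itertools import combinations

# The four transactions and the threshold from the example above.
TRANSACTIONS = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
MIN_SUP = 2


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration; adequate for a toy example only."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            sup = support(combo, transactions)
            if sup >= min_sup:
                result["".join(combo)] = sup
    return result


FREQUENT = frequent_itemsets(TRANSACTIONS, MIN_SUP)
for name, sup in FREQUENT.items():
    print(f"{name or 'Ø'} : {sup}")
```

Itemsets such as CD (support 1) fall below the threshold and therefore do not appear in the result.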
What is Generator
Equivalence class: All the frequent itemsets contained in the same set of input transactions
Closed Itemset: The maximal one in an equivalence class
Generator Itemsets: The minimal ones
Characteristics:
Same equivalence class ⇒ same input transactions ⇒ same data distribution ⇒ same support value and confidence value;
No proper sub-itemset of a generator itemset is in the same equivalence class;
No proper super-itemset of a closed itemset is in the same equivalence class;
Each equivalence class has exactly one closed itemset, but one or more generator itemsets;
An itemset can be both a generator itemset and a closed itemset.
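As a rough illustration of these definitions, the sketch below groups the frequent itemsets of the four example transactions (A B C, A D, A B C D, A B D; minimum support 2) by the set of transactions that contain them, then picks the maximal and minimal members of each group. All function names are invented for this example, and the brute-force enumeration is an assumption made for brevity.

```python
from collections import defaultdict
from itertools import combinations

TRANSACTIONS = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
MIN_SUP = 2


def tidset(itemset, transactions):
    """Positions of the transactions that contain `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if set(itemset) <= t)


def equivalence_classes(transactions, min_sup):
    """Group frequent itemsets by the transactions that contain them."""
    items = sorted(set().union(*transactions))
    classes = defaultdict(list)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            tids = tidset(combo, transactions)
            if len(tids) >= min_sup:
                classes[tids].append(frozenset(combo))
    return classes


def closed_and_generators(members):
    """Maximal member(s) and minimal member(s) of one equivalence class."""
    closed = [s for s in members if not any(s < other for other in members)]
    generators = [s for s in members if not any(other < s for other in members)]
    return closed, generators


# Map each closed itemset to the generators of its equivalence class.
SUMMARY = {}
for members in equivalence_classes(TRANSACTIONS, MIN_SUP).values():
    closed, generators = closed_and_generators(members)
    SUMMARY["".join(sorted(closed[0]))] = sorted("".join(sorted(g)) for g in generators)
```

On this data the classes are {Ø, A}, {B, AB}, {C, AC, BC, ABC}, {D, AD} and {BD, ABD}, so the generators are Ø, B, C, D and BD.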
Why We Need Generators
Together with closed itemsets, they form a concise representation of equivalence classes;
As classification rules / features:
Each equivalence class has at least one generator, which shares the support and confidence of the other itemsets in the class;
There are far fewer generators than frequent itemsets;
Generators are the shortest itemsets in an equivalence class;
Their average size tends to be the smallest;
Preferred by the MDL (Minimum Description Length) principle.
What have We done
A novel algorithm to mine frequent generator itemsets over a stream sliding window.
Contributions:
First algorithm for mining frequent itemset generators over stream sliding windows;
Novel enumeration tree structure and some effective optimization techniques;
Extended to directly mine classification rules on a sliding window;
An extensive performance study shows that StreamGen outperforms other algorithms performing similar tasks and achieves high classification accuracy.
Related Work
Itemset Mining Algorithms:
Mining frequent patterns without candidate generation: A frequent-pattern tree approach. J. Han, J. Pei, Y. Yin, and R. Mao. Data Min. Knowl. Discov., 2004.
Closet: An efficient algorithm for mining frequent closed itemsets. J. Pei, J. Han, and R. Mao. SIGMOD Workshop DMKD, 2000.
Minimum description length principle: Generators are preferable to closed patterns. J. Li, H. Li, L. Wong, J. Pei, and G. Dong. AAAI, 2006.
Mining statistically important equivalence classes and delta-discriminative emerging patterns. J. Li, G. Liu, and L. Wong. SIGKDD, 2007.
Related Work
Stream Itemset Mining Algorithms:
Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Knowl. Inf. Syst., 2006.
Itemset based Classification Algorithms:
On mining instance-centric classification rules. J. Wang and G. Karypis. IEEE Trans. Knowl. Data Eng., 2006.
Discriminative frequent pattern analysis for effective classification. H. Cheng, X. Yan, J. Han, and C.-W. Hsu. ICDE, 2007.
The StreamGen Algorithm
Details of our algorithm here.
Example:
One running example of stream data containing 6 transaction itemsets, with a window size of 4.
ID   Itemset
1    ABC
2    AD
3    ABCD
4    ABD
5    BCD
6    CD
Along the timeline:
Window#1: IDs 1 to 4
Window#2: IDs 2 to 5
Window#3: IDs 3 to 6
A Few Basic Theorems
Theorem
A frequent itemset S is a generator iff there exists no subset of size |S| − 1 having the same support as S.
Hint:
Can be used to easily check whether an itemset is a generator.
Theorem
Any subset of a generator is also a generator.
Theorem
Any superset of an unpromising itemset (a frequent itemset that is not a generator) must be either unpromising or infrequent.
Hint:
Helps define the border between generators and non-generators;
Forms the foundation for the enumeration tree.
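The first theorem translates almost directly into code. The sketch below applies it to the first window of the running example (A B C, A D, A B C D, A B D; minimum support 2) and spot-checks the second theorem by brute force. The function names are illustrative only, and recounting support from scratch on every call is a simplification, not the way StreamGen obtains supports.

```python
from itertools import combinations

TRANSACTIONS = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
MIN_SUP = 2


def support(itemset, transactions=TRANSACTIONS):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def is_generator(itemset, min_sup=MIN_SUP, transactions=TRANSACTIONS):
    """First theorem: frequent, and no (|S| - 1)-subset has the same support."""
    items = frozenset(itemset)
    sup = support(items, transactions)
    if sup < min_sup:
        return False
    return all(support(items - {x}, transactions) != sup for x in items)


def subsets_are_generators(itemset):
    """Second theorem, checked by brute force for a single itemset."""
    items = sorted(itemset)
    return all(
        is_generator(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    )
```

Here `is_generator("BD")` holds because B and D both have support 3 while BD has support 2, whereas AB fails the test because B has the same support (3) as AB.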
FP-Tree
A modified FP-Tree to store and compress the transactions in each sliding window.
Example:
FP-Tree of the first sliding window in the previous example.
1 A B C
2 A D
3 A B C D
4 A B D
[Figure: FP-Tree rooted at Ø whose nodes carry counts (D:3, C:1, B:1, A:1, ...), with an IDTable linking transaction IDs 1 to 4 to tree nodes and a HeadTable listing the items A, B, D, C.]
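The slide gives only the figure, so the following is a loose sketch of the general idea rather than the authors' structure: a prefix tree with per-node counts, plus a table from transaction ID to the last node of that transaction's path so that an expiring transaction can be removed by walking parent links. The class names, the removal procedure and the item order passed to the constructor are all assumptions, and the head table is omitted.

```python
class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


class WindowTree:
    """Prefix tree with counts plus a transaction-ID table (hypothetical design)."""

    def __init__(self, item_order):
        # Fixed global item order used to sort every transaction (assumed).
        self.rank = {item: i for i, item in enumerate(item_order)}
        self.root = Node(None, None)
        self.id_table = {}  # transaction ID -> last node on its path

    def add(self, tid, transaction):
        node = self.root
        for item in sorted(transaction, key=lambda it: self.rank[it]):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
            child.count += 1
            node = child
        self.id_table[tid] = node

    def remove(self, tid):
        # Walk from the path's last node up to the root, decrementing counts
        # and pruning nodes that no remaining transaction passes through.
        node = self.id_table.pop(tid)
        while node.parent is not None:
            node.count -= 1
            if node.count == 0:
                del node.parent.children[node.item]
            node = node.parent


TREE = WindowTree("DCBA")
for tid, items in enumerate(["ABC", "AD", "ABCD", "ABD"], start=1):
    TREE.add(tid, items)
```

With the order D, C, B, A, three of the four transactions share a D node under the root, which is the kind of compression the tree is meant to provide.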
Enumeration Tree
Helps maintain the information of the mined generators and the border between generators and non-generators.
3 types of nodes:
Infrequent Node;
Unpromising Node;
Generator Node.
A hash table is kept for each level of the enumeration tree to accelerate the generator-checking operation.
Enumeration Tree
Example:
Enumeration tree of the first sliding window with minimum support 2
1 A B C
2 A D
3 A B C D
4 A B D
Ø:4 (generator)
A:4 (unpromising)   B:3 (generator)   C:2 (generator)   D:3 (generator)
BC:2 (unpromising)   BD:2 (generator)   CD:1 (infrequent)
In the original figure:
Solid border ellipse: Generator Node
Dotted border ellipse: Unpromising Node
Dotted border rectangle: Infrequent Node
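The example tree for the first window (A B C, A D, A B C D, A B D; minimum support 2) can be reproduced with the level-wise sketch below. The expansion rule used here, in which a generator node is joined only with generator siblings to its right and unpromising or infrequent nodes are never extended, is inferred from the example and the theorems rather than stated on the slides. Supports are recounted naively, and the code does not handle a window holding fewer than `min_sup` transactions.

```python
TRANSACTIONS = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
MIN_SUP = 2


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def node_type(itemset, transactions, min_sup):
    """'I' = infrequent, 'U' = unpromising, 'G' = generator."""
    sup = support(itemset, transactions)
    if sup < min_sup:
        return "I"
    if any(support(itemset - {x}, transactions) == sup for x in itemset):
        return "U"
    return "G"


def build_tree(transactions, min_sup):
    """Return {itemset string: (support, type)} for every node of the tree."""
    items = sorted(set().union(*transactions))
    root = frozenset()
    types = {root: node_type(root, transactions, min_sup)}
    level = [[frozenset([i]) for i in items]]  # each inner list is one sibling group
    while level:
        next_level = []
        for siblings in level:
            for s in siblings:
                types[s] = node_type(s, transactions, min_sup)
            gens = [s for s in siblings if types[s] == "G"]
            # Join each generator with the generator siblings to its right.
            for i, left in enumerate(gens):
                children = [left | right for right in gens[i + 1:]]
                if children:
                    next_level.append(children)
        level = next_level
    return {"".join(sorted(s)): (support(s, transactions), t) for s, t in types.items()}


TREE = build_tree(TRANSACTIONS, MIN_SUP)
```

Because A is unpromising it is never extended, so AB, AC and AD are absent from the tree, which matches the example above.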
ADD and REMOVE Operations
Core part:
Enumeration tree-node status transforming matrix.
        ADD                       REMOVE
Type    x < y   x = y   x > y     x < y   x = y   x > y
G       G       G       G         G       G/U     I/G
U       U       G/U     U         U       U       I/U
I       I       I       I/G/U     I       I       I
x = |itemset_n ∩ T|, y = |itemset_n| − 1, where T is the transaction being added or removed
G = Generator, U = Unpromising, I = Infrequent
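One way to read the matrix is as a lookup table: given the operation, the node's current type and how x compares with y, it lists the types the node may have afterwards. The sketch below encodes exactly the entries of the matrix; the names `MATRIX` and `possible_types` are invented for illustration, and deciding which of the listed outcomes actually occurs still requires the updated supports.

```python
# Possible node types after one update, copied from the matrix.
# Tuple positions correspond to the columns x < y, x = y, x > y.
MATRIX = {
    "ADD": {
        "G": ("G", "G", "G"),
        "U": ("U", "GU", "U"),
        "I": ("I", "I", "IGU"),
    },
    "REMOVE": {
        "G": ("G", "GU", "IG"),
        "U": ("U", "U", "IU"),
        "I": ("I", "I", "I"),
    },
}


def possible_types(operation, current, itemset, transaction):
    """Set of types a node may take after adding/removing `transaction`."""
    x = len(set(itemset) & set(transaction))
    y = len(set(itemset)) - 1
    column = 0 if x < y else 1 if x == y else 2
    return set(MATRIX[operation][current][column])
```

For instance, a node whose itemset shares fewer than |itemset| − 1 items with T always keeps its type, so such nodes need no further checking.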
Example of ADD Operation
Tree before the ADD:
Ø:4
A:4   B:3   C:2   D:3
BC:2   BD:2   CD:1
ADD
Type    x < y   x = y   x > y
G       G       G       G
U       U       G/U     U
I       I       I       I/G/U
x = |itemset_n ∩ T|, y = |itemset_n| − 1
T = B C D
1 A B C
2 A D
3 A B C D
4 A B D
5 B C D +
Tree after the ADD:
Ø:5
A:4   B:4   C:3   D:4
AB:3   AC:2   AD:3   BC:3   BD:3   CD:2
ABC:2   ABD:2   ACD:1
Example of REMOVE Operation
Tree before the REMOVE:
Ø:5
A:4   B:4   C:3   D:4
AB:3   AC:2   AD:3   BC:3   BD:3   CD:2
ABC:2   ABD:2   ACD:1
1 A B C −
2 A D
3 A B C D
4 A B D
5 B C D
REMOVE
Type    x < y   x = y   x > y
G       G       G/U     I/G
U       U       U       I/U
I       I       I       I
x = |itemset_n ∩ T|, y = |itemset_n| − 1
T = A B C
Tree after the REMOVE:
Ø:4
A:3   B:3   C:2   D:4
AB:2   AC:1   BC:2
Combine Two Operations
For Sliding Window:
Only ADD while the window is not full
Once the window is full, ADD the new transaction and REMOVE the oldest one
For Incremental:
Only ADD
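A minimal driver for the sliding-window case might look like the sketch below. The `on_add` and `on_remove` callbacks stand in for the ADD and REMOVE operations and are placeholders, not part of StreamGen; the order used (add the new transaction, then evict the oldest) follows the ADD and REMOVE examples, and the stream is the six-transaction running example with window size 4.

```python
from collections import deque


def slide(stream, window_size, on_add, on_remove):
    """Feed `stream` through a sliding window, yielding the IDs held after each step."""
    window = deque()
    for tid, transaction in stream:
        window.append((tid, transaction))
        on_add(tid, transaction)
        if len(window) > window_size:
            old_tid, old_transaction = window.popleft()
            on_remove(old_tid, old_transaction)
        yield [t for t, _ in window]


STREAM = list(enumerate(["ABC", "AD", "ABCD", "ABD", "BCD", "CD"], start=1))
LOG = []
WINDOWS = list(
    slide(
        STREAM,
        4,
        on_add=lambda tid, t: LOG.append(("ADD", tid)),
        on_remove=lambda tid, t: LOG.append(("REMOVE", tid)),
    )
)
```

For the incremental setting the same loop can be used with a window size that is never exceeded, so that only `on_add` is ever called.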
Extension for Mining Classification Rules
Algorithm 1: StreamGenRules(n)
Input: The root node n of the enumeration tree.
begin
    nodes ← getGenerators(n);
    sort nodes by info-gain;
    rules ← ∅;
    foreach cn ∈ nodes do
        if ∀r ∈ rules, r ⊄ cn then
            if cn covers at least one transaction then
                rules ← rules ∪ {cn};
                remove covered transactions;
                if no more transactions then
                    break;
    return rules;
end
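The selection loop of Algorithm 1 can be sketched in Python as follows. Several points are assumptions made for this illustration: generators arrive with a precomputed information gain, sorting is in descending order of gain, "covers" means the itemset is contained in the transaction, the subset test is read as "no already selected rule is contained in cn", and class labels are left out. The gain values in the usage lines are invented.

```python
def select_rules(generators, transactions):
    """generators: list of (frozenset, info_gain); transactions: list of sets."""
    remaining = list(transactions)
    rules = []
    # Assumed: highest information gain first.
    for itemset, _gain in sorted(generators, key=lambda g: g[1], reverse=True):
        if any(rule <= itemset for rule in rules):
            continue  # a more general rule has already been selected
        if not any(itemset <= t for t in remaining):
            continue  # covers no remaining transaction
        rules.append(itemset)
        remaining = [t for t in remaining if not itemset <= t]
        if not remaining:
            break
    return rules


WINDOW = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
# Hypothetical gains, for illustration only.
GENERATORS = [(frozenset("B"), 0.9), (frozenset("BD"), 0.8), (frozenset("D"), 0.1)]
RULES = select_rules(GENERATORS, WINDOW)
```

In this run B is selected first, BD is skipped because it contains the already selected B, and D is selected because it covers the one transaction (A D) that B leaves uncovered.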
Evaluation Results
Datasets
Dataset     # Items   # Tran.   # Pos.    # Neg.    Avg. Len.
mushroom    116       8,124     4,208     3,916     21.695
horse       89        368       232       136       16.769
adult       128       48,842    11,687    37,155    13.868
breast      45        699       458       241       8.977
hepatitis   55        155       32        123       17.923
pima        40        768       500       268       8
chess       75        3,196     -         -         37
connect     129       67,557    -         -         43
pumsb       2,113     49,046    -         -         74
The upper part (mushroom to pima) is used for both runtime evaluation and classification evaluation;
the lower part (chess, connect, pumsb) is used only for runtime evaluation.
Runtime Comparison with Moment
Comparison with Moment, a frequent closed itemset mining algorithm on sliding windows:
[Figure: eight plots of runtime (in seconds, log scale) versus minimum support threshold (in %) for Moment and StreamGen. Panels: mushroom (window sizes 2,000 and 4,000), chess (1,000 and 2,000), pumsb (2,500 and 10,000), connect (30,000 and 60,000).]
Memory Use Comparison with Moment
Peak memory use of Moment and StreamGen in KB:
Dataset     Window size   sup_min   Moment      StreamGen
mushroom    4,000         0.1       14,476      10,108
mushroom    2,000         0.1       12,504      8,472
chess       2,000         0.6       103,180     31,636
chess       1,000         0.75      34,624      9,176
connect-4   60,000        0.95      141,756     98,236
connect-4   30,000        0.998     73,056      52,372
pumsb       10,000        0.7       1,732,136   75,316
pumsb       2,500         0.75      90,944      23,472
Runtime Comparison with DPM & DDPMine
Comparison with DPM, a frequent generator itemset mining algorithm on static data:
[Figure: four plots of runtime (in seconds, log scale) versus minimum support threshold (in %) for DPM and StreamGen. Panels: mushroom (window size 4,000), chess (1,000), connect (67,000), pumsb (49,000).]
Comparison with DDPMine, a frequent-itemset-based classification rule mining algorithm on static data:
[Figure: two plots of runtime (in seconds, log scale) versus minimum support threshold (in %) for DDPMine and StreamGen. Panels: mushroom (window size 8,000), horse (600).]
*The runtimes of DPM & DDPMine are only measured on full-sized windows.
Classification Experiment Results
Classification Accuracy:
            StreamGen                                     DDPMine
Dataset     Accuracy   Max. len.   Avg. len.   Avg. num.   Accuracy   Max. len.   Avg. len.   Avg. num.
breast      96.708     3           1.551       23.6        95.28      9           2.448       11.6
adult       82.146     3           1.831       13          81.292     14          4.583       7.2
mushroom    98.918     3           1.958       9.6         97.184     22          15.592      16.2
hepatitis   82.006     4           2.387       15          76.986     8           4.8         5
horse       81.512     2           1.389       3.6         81.246     20          4.88        10
pima        74.87      4           1.663       18.4        75.124     7           2.435       12.6
Rule Example on “mushroom”:
StreamGen   DDPMine
38          17 39
12 25       5 7 8 11 13 15 16 17 18 19 20 26
13 25       8 17 18
7 67        5 7 9 13 14 15 16 17 18 19 20 40 41 46 53 54
66          2 7 9 11 13 14 15 16 17 18 19 20 21 38 40 44 53 54 76
7 68        2 7 9 11 13 14 15 16 17 18 19 20 28 38 40 44 53 54 76
11 18       2 7 9 11 13 14 15 16 17 18 19 20 32 38 40 53 54 65 76
6 18 37     2 7 9 11 13 14 15 16 17 18 19 20 22 32 38 40 53 54 76
4 53        2 7 9 11 13 14 15 16 17 18 19 20 28 32 38 40 46 53 54 76
-           2 7 9 11 13 14 15 16 17 18 19 20 21 32 38 40 45 46 53 54 76
-           2 7 9 11 13 14 15 16 17 18 19 20 21 32 34 38 40 46 48 53 54 76
Conclusions
Explored a new and challenging problem:
Mining frequent itemset generators over a stream sliding window;
Devised a novel enumeration tree structure;
Also proposed effective optimization techniques;
Outperformed other state-of-the-art algorithms in terms of efficiency and classification accuracy.
The End
Thank you for Listening!
Questions or Comments?
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Teaching material agriculture food technology
Encapsulation_ Review paper, used for researhc scholars
Getting Started with Data Integration: FME Form 101
Per capita expenditure prediction using model stacking based on satellite ima...
Spectral efficient network and resource selection model in 5G networks
Advanced methodologies resolving dimensionality complications for autism neur...
SOPHOS-XG Firewall Administrator PPT.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
Digital-Transformation-Roadmap-for-Companies.pptx
Group 1 Presentation -Planning and Decision Making .pptx

CIKM 2009 - Efficient itemset generator discovery over a stream sliding window

  • 1. Efficient Itemset Generator Discovery over a Stream Sliding Window Chuancong Gao, Jianyong Wang Database Laboratory Department of Computer Science and Technology Tsinghua University, Beijing 100084, China C. Gao, J. Wang (Tsinghua Univ.) Efficient Itemset Generator Discovery over a Stream Sliding Window 1 / 28
  • 2. Outline
      • Introduction
          • What is Generator
          • Why We Need Generators
          • What have We done
      • Related Work
      • The StreamGen Algorithm
          • FP-Tree
          • Enumeration Tree
          • ADD and REMOVE Operations
      • Extension for Mining Classification Rules
      • Evaluation Results
      • Conclusions
  • 4. Introduction: What is Generator
      Example: given the 4 transactions below, with a minimum support threshold (supmin) of 2.
      Transactions: ABC, AD, ABCD, ABD
      [Figure: lattice of frequent itemsets with their supports, grouped into equivalence classes]
        Size 0: Ø:4
        Size 1: A:4, B:3, C:2, D:3
        Size 2: AB:3, AC:2, AD:3, BC:2, BD:2
        Size 3: ABC:2, ABD:2
      Figure legend: Equivalence Class, Generator Itemset, Closed Itemset
  • 10. Introduction: What is Generator
      Equivalence class: all the frequent itemsets contained in the same set of input transactions.
      Closed itemset: the maximal itemset in an equivalence class.
      Generator itemsets: the minimal itemsets in an equivalence class.
      Characteristics:
        • Same equivalence class ⇒ same input transactions ⇒ same data distribution ⇒ same support value and confidence value;
        • No proper sub-itemset of a generator itemset belongs to the same equivalence class;
        • No proper super-itemset of a closed itemset belongs to the same equivalence class;
        • Each equivalence class has exactly one closed itemset, but one or more generator itemsets;
        • An itemset can be both a generator itemset and a closed itemset.
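The definitions above can be checked by brute force on the four example transactions (ABC, AD, ABCD, ABD, supmin = 2). The following is only an illustrative sketch, not the StreamGen algorithm: it enumerates every itemset, groups the frequent ones by the set of transactions containing them, and picks the maximal and minimal members of each group. The function names (`cover`, `equivalence_classes`, `closed_and_generators`, `label`) are hypothetical.

```python
from itertools import combinations

# The four example transactions and the minimum support threshold from the slides.
transactions = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
SUP_MIN = 2

def cover(itemset, transactions):
    """Ids of the transactions that contain every item of `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

def equivalence_classes(transactions, sup_min):
    """Group all frequent itemsets by their cover (exhaustive enumeration)."""
    items = sorted(set().union(*transactions))
    classes = {}
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            c = cover(s, transactions)
            if len(c) >= sup_min:          # keep frequent itemsets only
                classes.setdefault(c, []).append(s)
    return classes

def closed_and_generators(members):
    """Maximal members are closed itemsets, minimal members are generators."""
    closed = [s for s in members if not any(s < other for other in members)]
    gens = [s for s in members if not any(other < s for other in members)]
    return closed, gens

def label(itemset):
    """Readable name for an itemset; the empty set is shown as Ø."""
    return "".join(sorted(itemset)) or "Ø"

for members in equivalence_classes(transactions, SUP_MIN).values():
    closed, gens = closed_and_generators(members)
    print("closed:", [label(s) for s in closed], "generators:", [label(s) for s in gens])
```

On this example the sketch finds five equivalence classes, with generators Ø, B, C, D, BD and closed itemsets A, AB, ABC, AD, ABD. Exhaustive enumeration is exponential in the number of items, so it is only useful for checking small cases by hand.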
  • 17. Introduction: Why We Need Generators
      • Together with closed itemsets, generators form a concise representation of equivalence classes;
      • Generators can serve as classification rules / features:
          • Each equivalence class has at least one generator, sharing the same support and confidence with the other itemsets in the class;
          • The number of generators is much smaller than the number of all frequent itemsets;
          • Generators are the shortest itemsets in an equivalence class;
          • Their average size tends to be the smallest;
          • They are preferred by the MDL (Minimum Description Length) principle.
  • 22. Introduction: What have We done
      A novel algorithm to mine frequent generator itemsets over a stream sliding window.
      Contributions:
        • First algorithm for mining frequent itemset generators over stream sliding windows;
        • Novel enumeration tree structure and some effective optimization techniques;
        • Extended to directly mine classification rules on a sliding window;
        • An extensive performance study shows that StreamGen outperforms other algorithms performing similar tasks and achieves high classification accuracy.
  • 26. Related Work
      Itemset Mining Algorithms:
        • Mining frequent patterns without candidate generation: A frequent-pattern tree approach. J. Han, J. Pei, Y. Yin, and R. Mao. Data Min. Knowl. Discov., 2004.
        • Closet: An efficient algorithm for mining frequent closed itemsets. J. Pei, J. Han, and R. Mao. SIGMOD Workshop DMKD, 2000.
        • Minimum description length principle: Generators are preferable to closed patterns. J. Li, H. Li, L. Wong, J. Pei, and G. Dong. AAAI, 2006.
        • Mining statistically important equivalence classes and delta-discriminative emerging patterns. J. Li, G. Liu, and L. Wong. SIGKDD, 2007.
  • 29. Related Work
      Stream Itemset Mining Algorithms:
        • Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Knowl. Inf. Syst., 2006.
      Itemset based Classification Algorithms:
        • On mining instance-centric classification rules. J. Wang and G. Karypis. IEEE Trans. Knowl. Data Eng., 2006.
        • Discriminative frequent pattern analysis for effective classification. H. Cheng, X. Yan, J. Han, and C.-W. Hsu. ICDE, 2007.
  • 31. The StreamGen Algorithm
      Details of our algorithm here.
      Example: a running example of stream data containing 6 transaction itemsets, with a window size of 4.
        ID   Itemset
        1    ABC
        2    AD
        3    ABCD
        4    ABD
        5    BCD
        6    CD
      [Figure: timeline marking Window#1, Window#2 and Window#3 over the transaction IDs]
  • 33. The StreamGen Algorithm: A Few Basic Theorems
      Theorem: A frequent itemset S is a generator iff there exists no subset of size |S| − 1 having the same support as S.
        Hint: can be used to easily check whether an itemset is a generator.
      Theorem: Any subset of a generator is also a generator.
      Theorem: Any superset of an unpromising itemset must be either unpromising or infrequent.
        Hint: these help define the border between generators and non-generators, and form the foundation for the enumeration tree.
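The first theorem gives a direct test: compare the support of S with the support of each subset obtained by dropping one item. A minimal Python sketch follows, using the four example transactions from the introduction; `support` and `is_generator` are hypothetical names, and supports are recomputed by scanning the transactions rather than read from any tree structure.

```python
def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def is_generator(itemset, transactions, sup_min):
    """Frequent, and no subset of size |S| - 1 has the same support as S."""
    s = frozenset(itemset)
    sup = support(s, transactions)
    if sup < sup_min:
        return False                      # infrequent itemsets are not considered
    # Dropping one item can only keep or raise the support, so "!=" means ">".
    return all(support(s - {item}, transactions) != sup for item in s)

transactions = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]

for candidate in ["", "A", "B", "AB", "BD", "CD"]:
    print(repr(candidate), is_generator(candidate, transactions, sup_min=2))
```

With supmin = 2, the empty itemset, B and BD pass the test, while A and AB fail because a subset has the same support (Ø for A, B for AB), and CD fails because it is infrequent.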
  • 35. The StreamGen Algorithm: FP-Tree
      A modified FP-Tree is used to store and compress the transactions in each sliding window.
      Example: FP-Tree of the first sliding window in the previous example.
      Transactions: 1 ABC, 2 AD, 3 ABCD, 4 ABD
      [Figure: FP-Tree rooted at Ø with node labels D:3, C:1, B:1, A:1, C:1, B:1, A:1, A:1, B:1, A:1; an ID table with entries for transactions 1 to 4; and a head table over items A, B, D, C]
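To make the idea of a count-annotated prefix tree with an ID table concrete, here is a small Python sketch. It is an assumption-laden illustration, not the authors' modified FP-Tree: the class names `Node` and `WindowTree`, the fixed item order D, C, B, A (read off the node labels in the figure), and the choice to let the ID table point at the last node of each transaction's path are all assumptions, and the head table is omitted.

```python
class Node:
    """One prefix-tree node: an item, a count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

class WindowTree:
    """Prefix tree over the transactions currently in the window (sketch)."""
    def __init__(self, item_order):
        self.rank = {item: i for i, item in enumerate(item_order)}
        self.root = Node(None, None)
        self.id_table = {}                # transaction id -> last node of its path

    def add(self, tid, transaction):
        node = self.root
        for item in sorted(transaction, key=self.rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            child.count += 1              # one more transaction shares this prefix
            node = child
        self.id_table[tid] = node

    def remove(self, tid):
        node = self.id_table.pop(tid)
        while node.parent is not None:    # walk back up to the root
            node.count -= 1
            parent = node.parent
            if node.count == 0:           # prune paths no transaction uses
                del parent.children[node.item]
            node = parent

tree = WindowTree("DCBA")
for tid, t in enumerate(["ABC", "AD", "ABCD", "ABD"], start=1):
    tree.add(tid, t)
print({item: n.count for item, n in tree.root.children.items()})
```

Under these assumptions the root has two children, D with count 3 and C with count 1, which agrees with the D:3 and C:1 labels in the figure. Keeping a pointer to each transaction's last node makes removal of the oldest transaction a single walk up the tree.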
  • 40. The StreamGen Algorithm: Enumeration Tree
      The enumeration tree helps maintain the information of the mined generators and the border between generators and non-generators.
      3 types of nodes:
        • Infrequent Node;
        • Unpromising Node;
        • Generator Node.
      A hash table is prepared for each level of the enumeration tree to accelerate the checking operation.
  • 41. The StreamGen Algorithm: Enumeration Tree
      Example: enumeration tree of the first sliding window with minimum support 2.
      Transactions: 1 ABC, 2 AD, 3 ABCD, 4 ABD
      [Figure: enumeration tree with nodes Ø:4; A:4, B:3, C:2, D:3; BC:2, BD:2, CD:1]
      Figure legend: solid border ellipse = Generator Node; dotted border ellipse = Unpromising Node; dotted border rectangle = Infrequent Node
  • 42. The StreamGen Algorithm: ADD and REMOVE Operations
      Core part: the enumeration tree node status transformation matrix.
                    ADD                       REMOVE
      Type   x < y   x = y   x > y     x < y   x = y   x > y
      G      G       G       G         G       G/U     I/G
      U      U       G/U     U         U       U       I/U
      I      I       I       I/G/U     I       I       I
      x = |itemset_n ∩ T|, y = |itemset_n| − 1
      G = Generator, U = Unpromising, I = Infrequent
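The matrix can be read as a lookup table: given the operation, the current node type, and how x compares with y, it lists the node types that are still possible afterwards. The Python sketch below only encodes that lookup; the names `TRANSITIONS` and `possible_states` are hypothetical, and where the matrix lists more than one outcome the actual algorithm must still recheck supports to decide between them.

```python
# Possible node types after an operation, copied cell by cell from the matrix.
TRANSITIONS = {
    "ADD": {
        "G": {"<": {"G"}, "=": {"G"}, ">": {"G"}},
        "U": {"<": {"U"}, "=": {"G", "U"}, ">": {"U"}},
        "I": {"<": {"I"}, "=": {"I"}, ">": {"I", "G", "U"}},
    },
    "REMOVE": {
        "G": {"<": {"G"}, "=": {"G", "U"}, ">": {"I", "G"}},
        "U": {"<": {"U"}, "=": {"U"}, ">": {"I", "U"}},
        "I": {"<": {"I"}, "=": {"I"}, ">": {"I"}},
    },
}

def possible_states(op, node_type, itemset, transaction):
    """Look up the node types a node may have after `op` on `transaction`."""
    x = len(set(itemset) & set(transaction))   # x = |itemset_n ∩ T|
    y = len(set(itemset)) - 1                  # y = |itemset_n| - 1
    cmp = "<" if x < y else "=" if x == y else ">"
    return TRANSITIONS[op][node_type][cmp]

# Adding T = BCD to the first window: A is unpromising, CD is infrequent.
print(possible_states("ADD", "U", "A", "BCD"))
print(possible_states("ADD", "I", "CD", "BCD"))
```

Here x > y means the whole itemset is contained in T, and x = y means exactly one of its items is missing from T. Only cells with several entries require further work, which is what keeps most nodes untouched when a transaction enters or leaves the window.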
  • 43. The StreamGen Algorithm: Example of ADD Operation
      Window before: 1 ABC, 2 AD, 3 ABCD, 4 ABD; added transaction: 5 BCD (T = BCD)
      [Figure: enumeration tree before ADD with nodes Ø:4; A:4, B:3, C:2, D:3; BC:2, BD:2, CD:1]
                    ADD
      Type   x < y   x = y   x > y
      G      G       G       G
      U      U       G/U     U
      I      I       I       I/G/U
      x = |itemset_n ∩ T|, y = |itemset_n| − 1
      [Figure: enumeration tree after ADD with nodes Ø:5; A:4, B:4, C:3, D:4; AB:3, AC:2, AD:2, BC:3, BD:3, CD:2; ABC:2, ABD:2, ACD:1]
  • 44. The StreamGen Algorithm: Example of REMOVE Operation
      Window before: 1 ABC, 2 AD, 3 ABCD, 4 ABD, 5 BCD; removed transaction: 1 ABC (T = ABC)
      [Figure: enumeration tree before REMOVE with nodes Ø:5; A:4, B:4, C:3, D:4; AB:3, AC:2, AD:2, BC:3, BD:3, CD:2; ABC:2, ABD:2, ACD:1]
                    REMOVE
      Type   x < y   x = y   x > y
      G      G       G/U     I/G
      U      U       U       I/U
      I      I       I       I
      x = |itemset_n ∩ T|, y = |itemset_n| − 1
      [Figure: enumeration tree after REMOVE with nodes Ø:4; A:3, B:3, C:2, D:4; AB:2, AC:1, BC:2]
  • 45. The StreamGen Algorithm: Combine Two Operations
      For a sliding window:
        • ADD when the window is not full;
        • REMOVE when the window is full.
      For incremental mining:
        • Only ADD.
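One way to picture how the two operations combine is a small driver loop over the running example (6 transactions, window size 4). This Python sketch assumes that, once the window is full, the oldest transaction is removed before the new one is added; the function `slide` and its `add` / `remove` callbacks are hypothetical stand-ins for the real ADD and REMOVE operations.

```python
from collections import deque

def slide(stream, window_size, add, remove):
    """Feed (id, transaction) pairs through a fixed-size sliding window."""
    window = deque()
    for tid, transaction in stream:
        if len(window) == window_size:     # window full: drop the oldest first
            old_tid, old = window.popleft()
            remove(old_tid, old)
        window.append((tid, transaction))
        add(tid, transaction)
        yield [t for _, t in window]       # snapshot of the current window

stream = list(enumerate(["ABC", "AD", "ABCD", "ABD", "BCD", "CD"], start=1))
ops = []
windows = list(slide(
    stream, 4,
    add=lambda tid, t: ops.append(("ADD", tid)),
    remove=lambda tid, t: ops.append(("REMOVE", tid)),
))
print(ops)
print(windows[-1])
```

The first four transactions trigger only ADD; each later transaction triggers a REMOVE of the oldest one followed by an ADD. For incremental mining without a window, the same loop would simply never call `remove`.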
  • 46. Extension for Mining Classification Rules
      Algorithm 1: StreamGenRules(n)
      Input: the root node n of the enumeration tree.
      begin
          nodes ← getGenerators(n);
          sort nodes by info-gain;
          rules ← ∅;
          foreach cn ∈ nodes do
              if ∀r ∈ rules, r ⊂ cn then
                  if cn covers at least one transaction then
                      rules ← rules ∪ {cn};
                      remove covered transactions;
                      if no more transactions then
                          break;
          return rules;
      end
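The covering loop of Algorithm 1 can be sketched in Python as follows. Several parts are assumptions: candidates are passed in with precomputed scores instead of being read from the enumeration tree, "covers" is taken to mean that the itemset is contained in the transaction, and the subset test (whose symbol may have been damaged in extraction) is read as "skip a candidate that contains an already selected rule". Under that reading the test is only a shortcut, since such a candidate cannot cover any remaining transaction. The name `select_rules` and the example scores are hypothetical.

```python
def select_rules(candidates, transactions):
    """Greedy rule selection over (itemset, info_gain) pairs (sketch)."""
    remaining = [frozenset(t) for t in transactions]
    rules = []
    # Visit candidates from highest to lowest score.
    for itemset, _gain in sorted(candidates, key=lambda c: c[1], reverse=True):
        cn = frozenset(itemset)
        # Assumed reading of the subset test: a selected rule inside cn already
        # removed everything cn could cover, so cn is skipped.
        if any(r < cn for r in rules):
            continue
        if any(cn <= t for t in remaining):            # covers a transaction
            rules.append(cn)
            remaining = [t for t in remaining if not cn <= t]
            if not remaining:                          # everything is covered
                break
    return rules

# Made-up scores for four generators of the first example window.
candidates = [("B", 0.9), ("BD", 0.8), ("D", 0.5), ("C", 0.4)]
transactions = ["ABC", "AD", "ABCD", "ABD"]
print(select_rules(candidates, transactions))
```

With these made-up scores, B is selected first and covers three transactions, BD is skipped because it contains B, and D covers the remaining transaction AD, so the loop stops before reaching C.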
  • 47. Evaluation Results: Datasets
      Dataset     # Items   # Tran.   # Pos.   # Neg.   Avg. Len.
      mushroom        116     8,124    4,208    3,916      21.695
      horse            89       368      232      136      16.769
      adult           128    48,842   11,687   37,155      13.868
      breast           45       699      458      241       8.977
      hepatitis        55       155       32      123      17.923
      pima             40       768      500      268       8
      chess            75     3,196        -        -      37
      connect         129    67,557        -        -      43
      pumsb         2,113    49,046        -        -      74
      The upper part (mushroom to pima) is used for both runtime evaluation and classification evaluation; the bottom part (chess, connect, pumsb) is used only for runtime evaluation.
  • 48. Evaluation Results: Runtime Comparing with Moment
      Comparison with Moment, a frequent closed itemset mining algorithm on sliding windows.
      [Figure: eight plots of runtime (in seconds, log scale) against minimum support threshold (in %), each comparing Moment and StreamGen. Panels: mushroom, window size 2,000; mushroom, window size 4,000; chess, window size 1,000; chess, window size 2,000; pumsb, window size 2,500; pumsb, window size 10,000; connect, window size 30,000; connect, window size 60,000]
  • 49. Evaluation Results: Memory Use Comparing with Moment
      Peak memory use of Moment and StreamGen, in KB:
      Dataset     Window size   supmin      Moment   StreamGen
      mushroom          4,000   0.1         14,476      10,108
      mushroom          2,000   0.1         12,504       8,472
      chess             2,000   0.6        103,180      31,636
      chess             1,000   0.75        34,624       9,176
      connect-4        60,000   0.95       141,756      98,236
      connect-4        30,000   0.998       73,056      52,372
      pumsb            10,000   0.7      1,732,136      75,316
      pumsb             2,500   0.75        90,944      23,472
  • 51. Evaluation Results: Runtime Comparing with DPM & DDPMine
      Comparison with DPM, a frequent generator itemset mining algorithm on static data.
      [Figure: four plots of runtime (in seconds, log scale) against minimum support threshold (in %), each comparing DPM and StreamGen. Panels: mushroom, window size 4,000; chess, window size 1,000; connect, window size 67,000; pumsb, window size 49,000]
      Comparison with DDPMine, a frequent itemset based classification rule mining algorithm on static data.
      [Figure: two plots of runtime (in seconds, log scale) against minimum support threshold (in %), each comparing DDPMine and StreamGen. Panels: mushroom, window size 8,000; horse, window size 600]
      * The runtimes of DPM & DDPMine are only measured on full-sized windows.
  • 53. Evaluation Results: Classification Experiment Results
      Classification Accuracy:
                  StreamGen                                      DDPMine
      Dataset     Accuracy  max. len.  avg. len.  avg. num.     Accuracy  max. len.  avg. len.  avg. num.
      breast        96.708          3      1.551       23.6       95.28           9      2.448       11.6
      adult         82.146          3      1.831       13         81.292         14      4.583        7.2
      mushroom      98.918          3      1.958        9.6       97.184         22     15.592       16.2
      hepatitis     82.006          4      2.387       15         76.986          8      4.8          5
      horse         81.512          2      1.389        3.6       81.246         20      4.88        10
      pima          74.87           4      1.663       18.4       75.124          7      2.435       12.6
      Rule Example on “mushroom”: StreamGen DDPMine 38 17 39 12 25 5 7 8 11 13 15 16 17 18 19 20 26 13 25 8 17 18 7 67 5 7 9 13 14 15 16 17 18 19 20 40 41 46 53 54 66 2 7 9 11 13 14 15 16 17 18 19 20 21 38 40 44 53 54 76 7 68 2 7 9 11 13 14 15 16 17 18 19 20 28 38 40 44 53 54 76 11 18 2 7 9 11 13 14 15 16 17 18 19 20 32 38 40 53 54 65 76 6 18 37 2 7 9 11 13 14 15 16 17 18 19 20 22 32 38 40 53 54 76 4 53 2 7 9 11 13 14 15 16 17 18 19 20 28 32 38 40 46 53 54 76 2 7 9 11 13 14 15 16 17 18 19 20 21 32 38 40 45 46 53 54 76 2 7 9 11 13 14 15 16 17 18 19 20 21 32 34 38 40 46 48 53 54 76
  • 57. Conclusions
      • Explored a new and challenging problem: mining frequent itemset generators over a stream sliding window;
      • Devised a novel enumeration tree structure;
      • Also proposed effective optimization techniques;
      • Outperformed other state-of-the-art algorithms in terms of efficiency and classification accuracy.
  • 58. The End
      Thank you for listening! Questions or comments?