Massively Distributed Environments and Closed
Itemset Mining: the DCIM Approach
Mehdi Zitouni & Reza Akbarinia & Sadok Ben Yahia & Florent Masseglia
Mehdi.Zitouni@inria.fr
CAiSE 2017, June 16th, 2017, Essen, Germany
1
Plan
2
1 • Knowledge discovery in big data
2 • DCIM approach for CFI mining in big data
3 • Experimental results
4 • Conclusion
Big data mining
• Advances in hardware and software technologies : Internet, social
networks, smart phones, etc.
• Big data mining : multiple forms of knowledge
• Pattern recognition, statistics, databases, linguistics and visualization
3
Knowledge discovery
4
Knowledge discovery
5
Big data mining
• A class of useful patterns : Frequent Itemsets.
• Frequency of elements in a database : behavior of employees in companies, behavior of customers in stores, etc.
• When data volumes grow, the number of frequent itemsets grows !
• A condensed representation of frequent patterns that gives the same results :
Closed Frequent Itemsets
6
Preliminary Notions : CFI
• Itemset support : the number of transactions containing the itemset
• Frequent itemset : its support is ≥ a threshold σ specified by the user
• Closed frequent itemset : a condensed representation of frequent itemsets,
• it is frequent and closed (no superset has the same support count)
example : having σ = 2
• A, B, C, E : items
• ABC, BCE, … : itemsets
7
T id   Itemset
1      A C
2      B C E
3      A B C E
4      B E
5      A B C E
(lattice figure : with σ = 2, the supersets ABC, ABE and ACE all have support 2, BCE has support 3, and ABCE has support 2; ABCE is the closure of ABC, ABE and ACE)
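The toy example above can be checked in a few lines of plain Python. The following is an illustrative sketch (not DCIM code): it enumerates candidate itemsets by brute force, keeps the frequent ones, and reports those with no frequent proper superset of equal support.

```python
# Illustrative brute-force check of the toy example (not DCIM code).
from itertools import combinations

transactions = [{"A", "C"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
                {"B", "E"}, {"A", "B", "C", "E"}]
sigma = 2  # minimum support threshold from the slide

def support(itemset):
    # number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

items = set().union(*transactions)
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= sigma]

def is_closed(itemset):
    # closed: no proper superset has the same support count
    s = support(itemset)
    return all(support(sup) < s for sup in frequent if itemset < sup)

print([set(f) for f in frequent if is_closed(f)])
# -> {'C'}, {'A','C'}, {'B','E'}, {'B','C','E'}, {'A','B','C','E'}
```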
Preliminary Notions : MapReduce
• Distributed data processing platform by Google [1],
• Available as open-source Apache Hadoop.
• Programming Model based on Key-Value pairs : map and reduce functions !
8
[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
example : Word Count
(figure : the map phase turns the input chunks "A A B", "B B C", "C B A" into (word, 1) pairs; the shuffle groups the pairs by key; the reduce phase sums the counts per word)
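A minimal word-count sketch, written in plain Python rather than the Hadoop Java API; the function names are illustrative and the shuffle is simulated with a dictionary.

```python
# Illustrative word count: map emits (word, 1), reduce sums per key.
from collections import defaultdict

def map_phase(chunk):
    return [(word, 1) for word in chunk]

def reduce_phase(word, counts):
    return word, sum(counts)

chunks = [["A", "A", "B"], ["B", "B", "C"], ["C", "B", "A"]]

grouped = defaultdict(list)          # stands in for Hadoop's shuffle/sort
for chunk in chunks:
    for key, value in map_phase(chunk):
        grouped[key].append(value)

print(sorted(reduce_phase(k, v) for k, v in grouped.items()))
# -> [('A', 3), ('B', 4), ('C', 2)]
```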
DCIM algorithm
• Three steps :
1. Splitting : splits the dataset into multiple and successive parts
2. Job 1 : Frequency counting : first pass over the dataset and count the
support of each item and prune non-frequent ones
3. Job 2 : CFI Mining : mines the CFIs using a prime-number-based approach
• Prime-number-based approach : a data modeling scheme that avoids string operations,
which are very costly in terms of communication and execution time.
9
example : membership test
Assign primes X ↦ 2, Y ↦ 3, Z ↦ 5, so that XY ↦ 2 × 3 = 6.
Is X ⊂ XY ? Since (6 % 2) == 0, X ⊂ XY is true.
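The same membership test, sketched in Python; the encode helper is an illustrative name, not part of DCIM's API.

```python
# Prime encoding: an itemset is the product of its items' primes,
# and X ⊂ Y holds iff code(Y) is divisible by code(X).
primes = {"X": 2, "Y": 3, "Z": 5}

def encode(itemset):
    code = 1
    for item in itemset:
        code *= primes[item]
    return code

xy = encode({"X", "Y"})            # 2 * 3 = 6
print(xy % encode({"X"}) == 0)     # True  -> X ⊂ XY
print(xy % encode({"Z"}) == 0)     # False -> Z not in XY
```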
DCIM algorithm : Frequency counting
10
Example : having σ = 2
T id   Itemset
1      A D C
2      B C E
3      A B C E
4      B E
5      A B C E

Item   Support
A      3
B      4
C      4
D      1
E      4

Descending order of supports (D is pruned, its support is below σ) :
Item   Support   Prime
B      4         2
C      4         3
E      4         5
A      3         7

Itemset      Primes          Tid (multiplication)
C A          3, 7            21
B C E        2, 3, 5         30
B C E A      2, 3, 5, 7      210
B E          2, 5            10
B C E A      2, 3, 5, 7      210
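A minimal sketch of Job 1 on the toy dataset above, assuming σ = 2 and, as in the table, alphabetical tie-breaking between items of equal support; variable names are illustrative, not DCIM's code.

```python
# Job 1 sketch: count supports, prune infrequent items, assign primes
# in descending support order, and encode each transaction as a product.
from collections import Counter

transactions = [["A", "D", "C"], ["B", "C", "E"], ["A", "B", "C", "E"],
                ["B", "E"], ["A", "B", "C", "E"]]
sigma = 2
PRIMES = [2, 3, 5, 7, 11, 13]                    # enough for this toy example

supports = Counter(item for t in transactions for item in t)
frequent_items = [i for i, s in supports.items() if s >= sigma]   # D pruned
frequent_items.sort(key=lambda i: (-supports[i], i))              # B, C, E, A

prime_of = dict(zip(frequent_items, PRIMES))
# -> {'B': 2, 'C': 3, 'E': 5, 'A': 7}

encoded = []
for t in transactions:
    code = 1
    for item in t:
        if item in prime_of:                     # infrequent items are dropped
            code *= prime_of[item]
    encoded.append(code)

print(encoded)                                   # -> [21, 30, 210, 10, 210]
```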
DCIM algorithm : CFI Mining “Map Phase”
11
• The dataset is decomposed into sets of minimized contexts, denoted Conditional-contexts.
• What is a Conditional-context ?
Example : having σ = 2

Encoded dataset :
Itemset      Primes          Tid (product)
C A          3, 7            21
B C E        2, 3, 5         30
B C E A      2, 3, 5, 7      210
B E          2, 5            10
B C E A      2, 3, 5, 7      210

A-Conditional-context (transactions containing A, with A removed) :
Itemset      Primes          Product
C            3               3
B C E        2, 3, 5         30
B C E        2, 3, 5         30

AB-Conditional-context (remove «B») :
Itemset      Primes          Product
C E          3, 5            15
C E          3, 5            15
DCIM algorithm : CFI Mining “Map Phase”
12
(inputs are the encoded transactions of the table above : 21, 30, 210, 10, 210)

Map Inputs : Tid     Processing               Map Outputs
{C A} = 21           21 = 3 × 7               {A} = 7 : {C} = 3
{B C E} = 30         30 = 2 × 3 × 5           {E} = 5 : {BC} = 6
                     6 = 2 × 3                {C} = 3 : {B} = 2
{B C E A} = 210      210 = 2 × 3 × 5 × 7      {A} = 7 : {BCE} = 30
                     30 = 2 × 3 × 5           {E} = 5 : {BC} = 6
                     6 = 2 × 3                {C} = 3 : {B} = 2
{B E} = 10           10 = 2 × 5               {E} = 5 : {B} = 2
{B C E A} = 210      210 = 2 × 3 × 5 × 7      {A} = 7 : {BCE} = 30
                     30 = 2 × 3 × 5           {E} = 5 : {BC} = 6
                     6 = 2 × 3                {C} = 3 : {B} = 2
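A minimal sketch of this map step, reconstructed from the worked example rather than taken from the authors' implementation: each item of an encoded transaction is stripped in turn, from the least frequent (largest prime) to the most frequent, and the product of the remaining co-occurring items is emitted under that item's key.

```python
# Map sketch for Job 2: emit (item prime, conditional transaction) pairs.
def map_cfi(encoded_transaction, prime_of):
    primes_in_t = sorted((p for p in prime_of.values()
                          if encoded_transaction % p == 0), reverse=True)
    out, remaining = [], encoded_transaction
    for p in primes_in_t:
        remaining //= p              # strip the current item
        if remaining > 1:            # skip empty conditional contexts
            out.append((p, remaining))
    return out

prime_of = {"B": 2, "C": 3, "E": 5, "A": 7}
for tid in [21, 30, 210, 10, 210]:
    print(tid, "->", map_cfi(tid, prime_of))
# 21  -> [(7, 3)]
# 30  -> [(5, 6), (3, 2)]
# 210 -> [(7, 30), (5, 6), (3, 2)]
# 10  -> [(5, 2)]
```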
DCIM algorithm : CFI Mining “Reduce Phase”
13
• Closedness check — no superset of the itemset in question has the same support count — is performed through GCD computations.
Example :
A-Conditional-context, key {A} = 7, values {3, 30, 30} : GCD = 3 → output {3 × 7 = 21} → A C
E-Conditional-context, key {E} = 5, values {6, 6, 2, 6} : GCD = 2 → output {5 × 2 = 10} → B E
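A minimal sketch of the first closure step of the reducer (the recursive mining of derived keys such as {AB} or {EC} is left out): because the codes are products of distinct primes, the GCD of a key's conditional transactions encodes exactly the items that co-occur with the key in all of them.

```python
# Reduce sketch: closure of the key's item within its conditional context.
from functools import reduce
from math import gcd

def reduce_cfi(item_prime, conditional_contexts, sigma=2):
    support = len(conditional_contexts)          # one value per transaction
    if support < sigma:
        return None
    closure = item_prime * reduce(gcd, conditional_contexts)
    return closure, support

print(reduce_cfi(7, [3, 30, 30]))    # (21, 3) -> {A, C} is a CFI
print(reduce_cfi(5, [6, 6, 2, 6]))   # (10, 4) -> {B, E} is a CFI
print(reduce_cfi(3, [7, 2, 2, 2]))   # (3, 4)  -> {C}    is a CFI (GCD is 1)
```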
14
DCIM algorithm : CFI Mining “Reduce Phase”
Map Outputs
{A} = 7 : {C} = 3
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{E} = 5 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
Reduce Inputs                CFI Mining → Reduce Outputs
{A} = 7 : {3, 30, 30}        GCD(3, 30, 30) = 3 = C → 3 × 7 = 21 : {AC} is CFI
{AB} = 14 : {15, 15}         GCD(15, 15) = 15 = CE → 15 × 14 = 210 : {ABCE} is CFI
{AE} ? → {AE} ⊂ {ABCE}       STOP !
{E} = 5 : {6, 6, 2, 6}       GCD(6, 6, 2, 6) = 2 = B → 2 × 5 = 10 : {BE} is CFI
{EC} = 15 : {2, 2, 2}        GCD(2, 2, 2) = 2 = B → 2 × 15 = 30 : {BCE} is CFI
{C} = 3 : {7, 2, 2, 2}       GCD(7, 2, 2, 2) = 1 → 3 = C : {C} is CFI
CFIs = {C, AC, BE, BCE, ABCE}
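For readability, a tiny illustrative helper (not part of DCIM) that decodes a prime-encoded output back to item names, using the item-to-prime mapping of the example.

```python
# Decode a prime product back to the itemset it represents.
def decode(code, prime_of):
    return {item for item, p in prime_of.items() if code % p == 0}

prime_of = {"B": 2, "C": 3, "E": 5, "A": 7}
print(decode(21, prime_of))     # -> {'A', 'C'}
print(decode(210, prime_of))    # -> {'A', 'B', 'C', 'E'}
```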
15
Experimental Results : Datasets
• Wikipedia Articles
• Each line mimics a research article,
• 7,892,123 transactions with 6,853,616 items,
• Maximal length of a transaction is 153,953,
• ClueWeb
• One billion web pages in ten languages,
• 53,268,952 transactions with 11,153,752 items,
• Maximal length of a transaction is 689,153,
16
Experimental Results : Setup and implementation
• One of the clusters of Grid'5000,
• 32 nodes running Hadoop 2.6.0,
• 96 GB of RAM,
• 2.9 to 3.9 GHz processors,
• Java (OpenJDK 7).
• Compared to a basic MapReduce implementation of the CLOSET algorithm and to parallel FP-Growth.
• Execution time and speedup for multiple values of σ.
17
Efficiency : Wikipedia Articles (runtime chart, figure not reproduced)
18
Speedup : ClueWeb (speedup chart, figure not reproduced)
Conclusion
• Big data : a game-changing revolution !!
19
Conclusion
• A reliable and efficient parallel algorithm for CFI mining, namely DCIM,
• DCIM shows significantly better performance than state-of-the-art approaches,
• An efficient data modeling : prime-number processing !
→ The approach is effective and efficient
• Perspective : CFI mining in data streams
20
21
Thank you !
Questions ?