Massively Distributed Environments and Closed
Itemset Mining: the DCIM Approach
Mehdi Zitouni & Reza Akbarinia & Sadok Ben Yahia & Florent Masseglia
Mehdi.Zitouni@inria.fr
CAiSE 2017, June 16th, 2017, Essen, Germany
1
Plan
2
1 • Knowledge discovery in big data
2 • DCIM approach for CFI mining in big data
3 • Experimental results
4 • Conclusion
Big data mining
• Advances in hardware and software technologies : Internet, social
networks, smart phones, etc.
• Big data mining : multiple forms of knowledge
• Pattern recognition, statistics, databases, linguistics and visualization
3
Knowledge discovery
4
Knowledge discovery
5
Big data mining
• A class of useful patterns : Frequent Itemsets.
• Frequency of elements in a database : behavior of employees in companies, behavior of customers in stores, etc.
• When data volumes grow, the number of frequent itemsets grows !
• A condensed representation of frequent patterns that gives the same results :
Closed Frequent Itemsets
6
Preliminary Notions : CFI
• Itemset support : the number of transactions containing the itemset
• Frequent itemset : its support is ≥ a threshold σ specified by the user
• Closed frequent itemset : a condensed representation of frequent itemsets,
• it is frequent and closed (no superset has the same support count)
example : having σ = 2
• A, B, C, E : items
• ABC, BCE, … : itemsets
7
T id   Itemset
1      A C
2      B C E
3      A B C E
4      B E
5      A B C E
(lattice figure : with σ = 2, the supersets ABC, ABE and ACE all have support 2, BCE has support 3, and ABCE has support 2; ABCE is the closure of ABC, ABE and ACE)
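The toy example above can be checked in a few lines of plain Python. The following is an illustrative sketch (not DCIM code): it enumerates candidate itemsets by brute force, keeps the frequent ones, and reports those with no frequent proper superset of equal support.

```python
# Illustrative brute-force check of the toy example (not DCIM code).
from itertools import combinations

transactions = [{"A", "C"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
                {"B", "E"}, {"A", "B", "C", "E"}]
sigma = 2  # minimum support threshold from the slide

def support(itemset):
    # number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

items = set().union(*transactions)
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= sigma]

def is_closed(itemset):
    # closed: no proper superset has the same support count
    s = support(itemset)
    return all(support(sup) < s for sup in frequent if itemset < sup)

print([set(f) for f in frequent if is_closed(f)])
# -> {'C'}, {'A','C'}, {'B','E'}, {'B','C','E'}, {'A','B','C','E'}
```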
Preliminary Notions : MapReduce
• Distributed data processing platform by Google [1],
• Available as open-source Apache Hadoop.
• Programming Model based on Key-Value pairs : map and reduce functions !
8
[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
example : Word Count
(figure : the map phase turns the input chunks "A A B", "B B C", "C B A" into (word, 1) pairs; the shuffle groups the pairs by key; the reduce phase sums the counts per word)
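A minimal word-count sketch, written in plain Python rather than the Hadoop Java API; the function names are illustrative and the shuffle is simulated with a dictionary.

```python
# Illustrative word count: map emits (word, 1), reduce sums per key.
from collections import defaultdict

def map_phase(chunk):
    return [(word, 1) for word in chunk]

def reduce_phase(word, counts):
    return word, sum(counts)

chunks = [["A", "A", "B"], ["B", "B", "C"], ["C", "B", "A"]]

grouped = defaultdict(list)          # stands in for Hadoop's shuffle/sort
for chunk in chunks:
    for key, value in map_phase(chunk):
        grouped[key].append(value)

print(sorted(reduce_phase(k, v) for k, v in grouped.items()))
# -> [('A', 3), ('B', 4), ('C', 2)]
```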
DCIM algorithm
• Three steps :
1. Splitting : splits the dataset into multiple and successive parts
2. Job 1 : Frequency counting : first pass over the dataset and count the
support of each item and prune non-frequent ones
3. Job 2 : CFI Mining : mines the CFIs using a prime-number-based approach
• Prime-number-based approach : a data modeling scheme that avoids string operations,
which are very costly in terms of communication and execution time.
9
example : membership test
Assign primes X ↦ 2, Y ↦ 3, Z ↦ 5, so that XY ↦ 2 × 3 = 6.
Is X ⊂ XY ? Since (6 % 2) == 0, X ⊂ XY is true.
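The same membership test, sketched in Python; the encode helper is an illustrative name, not part of DCIM's API.

```python
# Prime encoding: an itemset is the product of its items' primes,
# and X ⊂ Y holds iff code(Y) is divisible by code(X).
primes = {"X": 2, "Y": 3, "Z": 5}

def encode(itemset):
    code = 1
    for item in itemset:
        code *= primes[item]
    return code

xy = encode({"X", "Y"})            # 2 * 3 = 6
print(xy % encode({"X"}) == 0)     # True  -> X ⊂ XY
print(xy % encode({"Z"}) == 0)     # False -> Z not in XY
```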
DCIM algorithm : Frequency counting
10
Example : having σ = 2
T id   Itemset
1      A D C
2      B C E
3      A B C E
4      B E
5      A B C E

Item   Support
A      3
B      4
C      4
D      1
E      4

Descending order of supports (D is pruned, its support is below σ) :
Item   Support   Prime
B      4         2
C      4         3
E      4         5
A      3         7

Itemset      Primes          Tid (multiplication)
C A          3, 7            21
B C E        2, 3, 5         30
B C E A      2, 3, 5, 7      210
B E          2, 5            10
B C E A      2, 3, 5, 7      210
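A minimal sketch of Job 1 on the toy dataset above, assuming σ = 2 and, as in the table, alphabetical tie-breaking between items of equal support; variable names are illustrative, not DCIM's code.

```python
# Job 1 sketch: count supports, prune infrequent items, assign primes
# in descending support order, and encode each transaction as a product.
from collections import Counter

transactions = [["A", "D", "C"], ["B", "C", "E"], ["A", "B", "C", "E"],
                ["B", "E"], ["A", "B", "C", "E"]]
sigma = 2
PRIMES = [2, 3, 5, 7, 11, 13]                    # enough for this toy example

supports = Counter(item for t in transactions for item in t)
frequent_items = [i for i, s in supports.items() if s >= sigma]   # D pruned
frequent_items.sort(key=lambda i: (-supports[i], i))              # B, C, E, A

prime_of = dict(zip(frequent_items, PRIMES))
# -> {'B': 2, 'C': 3, 'E': 5, 'A': 7}

encoded = []
for t in transactions:
    code = 1
    for item in t:
        if item in prime_of:                     # infrequent items are dropped
            code *= prime_of[item]
    encoded.append(code)

print(encoded)                                   # -> [21, 30, 210, 10, 210]
```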
DCIM algorithm : CFI Mining “Map Phase”
11
• The dataset is decomposed into sets of minimized contexts, denoted Conditional-contexts.
• What is a Conditional-context ?
Example : having σ = 2

Encoded dataset :
Itemset      Primes          Tid (product)
C A          3, 7            21
B C E        2, 3, 5         30
B C E A      2, 3, 5, 7      210
B E          2, 5            10
B C E A      2, 3, 5, 7      210

A-Conditional-context (transactions containing A, with A removed) :
Itemset      Primes          Product
C            3               3
B C E        2, 3, 5         30
B C E        2, 3, 5         30

AB-Conditional-context (remove «B») :
Itemset      Primes          Product
C E          3, 5            15
C E          3, 5            15
DCIM algorithm : CFI Mining “Map Phase”
12
(inputs are the encoded transactions of the table above : 21, 30, 210, 10, 210)

Map Inputs : Tid     Processing               Map Outputs
{C A} = 21           21 = 3 × 7               {A} = 7 : {C} = 3
{B C E} = 30         30 = 2 × 3 × 5           {E} = 5 : {BC} = 6
                     6 = 2 × 3                {C} = 3 : {B} = 2
{B C E A} = 210      210 = 2 × 3 × 5 × 7      {A} = 7 : {BCE} = 30
                     30 = 2 × 3 × 5           {E} = 5 : {BC} = 6
                     6 = 2 × 3                {C} = 3 : {B} = 2
{B E} = 10           10 = 2 × 5               {E} = 5 : {B} = 2
{B C E A} = 210      210 = 2 × 3 × 5 × 7      {A} = 7 : {BCE} = 30
                     30 = 2 × 3 × 5           {E} = 5 : {BC} = 6
                     6 = 2 × 3                {C} = 3 : {B} = 2
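A minimal sketch of this map step, reconstructed from the worked example rather than taken from the authors' implementation: each item of an encoded transaction is stripped in turn, from the least frequent (largest prime) to the most frequent, and the product of the remaining co-occurring items is emitted under that item's key.

```python
# Map sketch for Job 2: emit (item prime, conditional transaction) pairs.
def map_cfi(encoded_transaction, prime_of):
    primes_in_t = sorted((p for p in prime_of.values()
                          if encoded_transaction % p == 0), reverse=True)
    out, remaining = [], encoded_transaction
    for p in primes_in_t:
        remaining //= p              # strip the current item
        if remaining > 1:            # skip empty conditional contexts
            out.append((p, remaining))
    return out

prime_of = {"B": 2, "C": 3, "E": 5, "A": 7}
for tid in [21, 30, 210, 10, 210]:
    print(tid, "->", map_cfi(tid, prime_of))
# 21  -> [(7, 3)]
# 30  -> [(5, 6), (3, 2)]
# 210 -> [(7, 30), (5, 6), (3, 2)]
# 10  -> [(5, 2)]
```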
DCIM algorithm : CFI Mining “Reduce Phase”
13
• Closedness check — no superset of the itemset in question has the same support count — is performed through GCD computations.
Example :
A-Conditional-context, key {A} = 7, values {3, 30, 30} : GCD = 3 → output {3 × 7 = 21} → A C
E-Conditional-context, key {E} = 5, values {6, 6, 2, 6} : GCD = 2 → output {5 × 2 = 10} → B E
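A minimal sketch of the first closure step of the reducer (the recursive mining of derived keys such as {AB} or {EC} is left out): because the codes are products of distinct primes, the GCD of a key's conditional transactions encodes exactly the items that co-occur with the key in all of them.

```python
# Reduce sketch: closure of the key's item within its conditional context.
from functools import reduce
from math import gcd

def reduce_cfi(item_prime, conditional_contexts, sigma=2):
    support = len(conditional_contexts)          # one value per transaction
    if support < sigma:
        return None
    closure = item_prime * reduce(gcd, conditional_contexts)
    return closure, support

print(reduce_cfi(7, [3, 30, 30]))    # (21, 3) -> {A, C} is a CFI
print(reduce_cfi(5, [6, 6, 2, 6]))   # (10, 4) -> {B, E} is a CFI
print(reduce_cfi(3, [7, 2, 2, 2]))   # (3, 4)  -> {C}    is a CFI (GCD is 1)
```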
14
DCIM algorithm : CFI Mining “Reduce Phase”
Map Outputs
{A} = 7 : {C} = 3
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{E} = 5 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
Reduce Inputs                CFI Mining → Reduce Outputs
{A} = 7 : {3, 30, 30}        GCD(3, 30, 30) = 3 = C → 3 × 7 = 21 : {AC} is CFI
{AB} = 14 : {15, 15}         GCD(15, 15) = 15 = CE → 15 × 14 = 210 : {ABCE} is CFI
{AE} ? → {AE} ⊂ {ABCE}       STOP !
{E} = 5 : {6, 6, 2, 6}       GCD(6, 6, 2, 6) = 2 = B → 2 × 5 = 10 : {BE} is CFI
{EC} = 15 : {2, 2, 2}        GCD(2, 2, 2) = 2 = B → 2 × 15 = 30 : {BCE} is CFI
{C} = 3 : {7, 2, 2, 2}       GCD(7, 2, 2, 2) = 1 → 3 = C : {C} is CFI
CFIs = {C, AC, BE, BCE, ABCE}
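For readability, a tiny illustrative helper (not part of DCIM) that decodes a prime-encoded output back to item names, using the item-to-prime mapping of the example.

```python
# Decode a prime product back to the itemset it represents.
def decode(code, prime_of):
    return {item for item, p in prime_of.items() if code % p == 0}

prime_of = {"B": 2, "C": 3, "E": 5, "A": 7}
print(decode(21, prime_of))     # -> {'A', 'C'}
print(decode(210, prime_of))    # -> {'A', 'B', 'C', 'E'}
```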
15
Experimental Results : Datasets
• Wikipedia Articles
• Each line mimics a research article,
• 7,892,123 transactions with 6,853,616 items,
• Maximal length of a transaction is 153,953,
• ClueWeb
• One billion web pages in ten languages,
• 53,268,952 transactions with 11,153,752 items,
• Maximal length of a transaction is 689,153,
16
Experimental Results : Setup and implementation
• One of the clusters of Grid'5000,
• 32 nodes running Hadoop 2.6.0,
• 96 GB of RAM,
• 2.9 to 3.9 GHz processors,
• Java (OpenJDK 7).
• Compared to a basic MapReduce implementation of the CLOSET algorithm and to parallel FP-Growth.
• Execution time and speedup for multiple values of σ.
17
Efficiency : Wikipedia Articles (runtime chart, figure not reproduced)
18
Speedup : ClueWeb (speedup chart, figure not reproduced)
Conclusion
• Big data : a game-changing revolution !!
19
Conclusion
• A reliable and efficient parallel algorithm for CFI mining, namely DCIM,
• DCIM shows significantly better performance than state-of-the-art approaches,
• An efficient data modeling : prime-number processing !
→ The approach is effective and efficient
• Perspective : CFI mining in data streams
20
21
Thank you !
Questions ?