Putting OAC-triclustering on MapReduce
Sergey Zudin, Dmitry V. Gnatyshak, and Dmitry I. Ignatov
National Research University Higher School of Economics, Russian Federation
Faculty of Computer Science
CLA 2015, Clermont-Ferrand, France
October 13-16
S. Zudin et al. () OAC-triclustering on MapReduce CLA 2015 1 / 39
Outline
1 Motivation and previous work
2 Prime OAC-triclustering
Triadic Formal concept analysis
Basic algorithm
Online version of the algorithm
3 OAC-triclustering on MapReduce
MapReduce technology
MapReduce implementation
4 Experiments
Description of the experiments
Datasets
Results
5 Conclusion
Motivation
Large amounts of multimodal data:
Gene expression data
Folksonomies
Recommender Systems
Communities in multi-mode (social) networks
Pattern mining in relational databases
. . .
Non-binary data can be scaled (possibly increasing the dimensionality)
Increasing amount of big data: fast and/or distributed algorithms are
required (linear or sublinear, one-pass)
Existing methods: finding all n-sets (multimodal clusters) satisfying some
conditions (often an exponential number of patterns)
Motivation
IMDB example, [Mirkin et al., 2011]
Clump Movie-Keyword-Genre
Bicluster
{12 Angry Men (1957), To Kill a Mockingbird (1962), Witness for the Prosecution (1957)}, {Murder, Trial}, {n/a}
Tricluster
{12 Angry Men (1957), Double Indemnity (1944), Chinatown (1974), The Big Sleep (1946), Witness for the Prosecution (1957), Dial M for Murder (1954), Shadow of a Doubt (1943)}, {Murder, Trial, Widow, Marriage, Private detective, Blackmail, Letter}, {Crime, Drama, Thriller, Mystery, Film-Noir}
Previous and related work
A short (not full) list
Triadic FCA [Wille, 1995; Lehmann and Wille, 1995] and Polyadic FCA [Voutsadakis, 2002]
TRIAS [Jäschke et al., 2006] for mining (frequent) triconcepts
DataPeeler for closed n-sets [Cerf et al., 2009], MultiDupeHack [Cerf et al., 2013]
TriBox [Mirkin et al., 2011] for mining dense triboxes with LS criterion
Box OAC-triclustering and Spectral Triclustering [Ignatov et al., 2011, 2013]
Multi-way set enumeration in weight tensors [Schölkopf et al., 2011]
Quadri-concepts for personalised folksonomies [Jelassi et al., 2012, 2013]
Prime OAC-triclustering [Gnatyshak et al., 2012–2014]
Triadic Boolean tensor factorisation [Miettinen et al., 2011; Belohlavek et al., 2013] and Boolean tensor clustering [Miettinen et al., 2015]
Closed and connected patterns in multi-relational data [Spyropoulou et al., 2011–14]
Triadic FCA and triclustering: Searching for optimal patterns. Machine Learning journal [Ignatov et al., 2015] and CLA 2013
. . .
Prime OAC-triclustering
Formal concept analysis: triadic case
Definition
Let G, M, B be sets and the ternary relation I be a subset of their Cartesian
product: I ⊆ G × M × B. Then the tuple K = (G, M, B, I) is called a triadic
formal context.
G is a set of objects, M is a set of attributes, B is a set of conditions.
GM m1 m2 m3 m1 m2 m3 m1 m2 m3
g1 x x x x x x x x
g2 x x x x x
g3 x x x x
g4 x x x x x x
B b1 b2 b3
Definition
Galois operators (prime operators) are defined in a similar way to the dyadic case:
2^G → 2^M × 2^B
2^G × 2^M → 2^B
2^M → 2^G × 2^B
2^G × 2^B → 2^M
2^B → 2^G × 2^M
2^M × 2^B → 2^G
({g1, g2}, {m1, m2})′ = {b1, b3}
m2′ = {(g1, b1), (g2, b1), (g3, b1), (g1, b2), (g1, b3), (g2, b3), (g4, b3)}
Definition
The triple (X, Y, Z) is called a triadic formal concept of the context K = (G, M, B, I) if X ⊆ G, Y ⊆ M, Z ⊆ B, (X, Y)′ = Z, (X, Z)′ = Y, and (Y, Z)′ = X.
X is called the (formal) extent, Y the (formal) intent, and Z the (formal) modus.
Prime OAC-triclustering
Basic algorithm [Gnatyshak et al., 2013]
This method uses the following types of prime operators (for the context
K = (G, M, B, I)):
(g, m)′ = {b ∈ B | (g, m, b) ∈ I},
(g, b)′ = {m ∈ M | (g, m, b) ∈ I},
(m, b)′ = {g ∈ G | (g, m, b) ∈ I}
Definition
Then the triple T = ((m, b)′, (g, b)′, (g, m)′) is called the prime-based OAC-tricluster for a triple (g, m, b) ∈ I. The components of the tricluster are called, respectively, the tricluster extent, intent, and modus. The triple (g, m, b) is called the generating triple of the tricluster T.
Definition
Density of a tricluster: ρ(X, Y, Z) = |I ∩ (X × Y × Z)| / (|X| |Y| |Z|)
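As a minimal sketch of the definitions above (not the authors' implementation), the prime sets, the tricluster of a generating triple, and its density could be computed in Python as follows. The function names and the three-triple toy context are illustrative assumptions.

```python
from itertools import product


def prime_sets(triples):
    """Build the three prime dictionaries from a set of (g, m, b) triples."""
    obj_attr, obj_cond, attr_cond = {}, {}, {}
    for g, m, b in triples:
        obj_attr.setdefault((g, m), set()).add(b)    # (g, m)' is a set of conditions
        obj_cond.setdefault((g, b), set()).add(m)    # (g, b)' is a set of attributes
        attr_cond.setdefault((m, b), set()).add(g)   # (m, b)' is a set of objects
    return obj_attr, obj_cond, attr_cond


def tricluster(triple, primes):
    """Return ((m, b)', (g, b)', (g, m)') for a generating triple."""
    g, m, b = triple
    obj_attr, obj_cond, attr_cond = primes
    return (frozenset(attr_cond[(m, b)]),
            frozenset(obj_cond[(g, b)]),
            frozenset(obj_attr[(g, m)]))


def density(cluster, triples):
    """Share of cells of the cuboid X x Y x Z that belong to the relation I."""
    X, Y, Z = cluster
    hits = sum(1 for cell in product(X, Y, Z) if cell in triples)
    return hits / (len(X) * len(Y) * len(Z))


# Hypothetical toy context with three triples.
context = {("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1")}
T = tricluster(("g1", "m1", "b1"), prime_sets(context))
print(T, density(T, context))
```

Here the cuboid {g1, g2} × {m1, m2} × {b1} has four cells, three of which are in the relation, so the density is 3/4.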
Prime OAC-triclustering
Basic algorithm
An example of a tricluster based on triple (g, m, b):
Prime OAC-triclustering
Basic algorithm
Input: K = (G, M, B, I), a triadic context;
ρmin, a density threshold
Output: 𝒯 = {T = (X, Y, Z)}
𝒯 := ∅
for all (g, m): g ∈ G, m ∈ M do
    PrimesObjAttr[g, m] := (g, m)′
end for
for all (g, b): g ∈ G, b ∈ B do
    PrimesObjCond[g, b] := (g, b)′
end for
for all (m, b): m ∈ M, b ∈ B do
    PrimesAttrCond[m, b] := (m, b)′
end for
for all (g, m, b) ∈ I do
    T := (PrimesAttrCond[m, b], PrimesObjCond[g, b], PrimesObjAttr[g, m])
    Tkey := hash(T)
    if Tkey ∉ 𝒯.keys ∧ ρ(T) ≥ ρmin then
        𝒯[Tkey] := T
    end if
end for
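The basic algorithm can be sketched in Python roughly as below. This is an illustration under stated assumptions rather than the original code: prime sets are built only for pairs that occur in I (other pairs have empty primes and are never used), and triclusters are stored as tuples of frozensets, so the tuple itself plays the role of hash(T). The name `basic_oac_prime` is hypothetical.

```python
from itertools import product


def basic_oac_prime(triples, rho_min=0.0):
    """Return the distinct prime OAC-triclusters with density >= rho_min."""
    triples = set(triples)
    obj_attr, obj_cond, attr_cond = {}, {}, {}
    for g, m, b in triples:                      # first pass: prime sets
        obj_attr.setdefault((g, m), set()).add(b)
        obj_cond.setdefault((g, b), set()).add(m)
        attr_cond.setdefault((m, b), set()).add(g)

    result = set()
    for g, m, b in triples:                      # second pass: one tricluster per triple
        T = (frozenset(attr_cond[(m, b)]),
             frozenset(obj_cond[(g, b)]),
             frozenset(obj_attr[(g, m)]))
        if T in result:                          # already stored, skip the duplicate
            continue
        X, Y, Z = T
        hits = sum(1 for cell in product(X, Y, Z) if cell in triples)
        if hits / (len(X) * len(Y) * len(Z)) >= rho_min:
            result.add(T)
    return result


# Hypothetical toy context.
context = {("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1")}
print(len(basic_oac_prime(context)), len(basic_oac_prime(context, rho_min=0.8)))
```

With the threshold 0.8 the tricluster ({g1, g2}, {m1, m2}, {b1}) of density 3/4 is filtered out, while the two triclusters of density 1 remain.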
Prime OAC-triclustering
Online version of the algorithm [Gnatyshak et al., 2014]
Let K = (G, M, B, I) be a triadic context. We do not know G, M, B, I, or their
cardinalities in advance.
Input on each iteration: {(g, m, b)} = J ⊆ I.
Goal: maintain an updated version of the results and efficiently update them when
new triples are received.
We need to keep in memory the results of prime operators’ application (prime
sets):
PrimesObjAttr — dictionary with elements of type ((g, m), {b ∈ B}), g ∈ G,
m ∈ M;
PrimesObjCond — dictionary with elements of type ((g, b), {m ∈ M}),
g ∈ G, b ∈ B;
PrimesAttrCond — dictionary with elements of type ((m, b), {g ∈ G}),
m ∈ M, b ∈ B.
Prime OAC-triclustering
Online version of the algorithm
Remark
In this case we need to treat triclusters generated by different triples as different, even
if their extents, intents, and modi are equal.
Prime OAC-triclustering
Online version of the algorithm
Algorithm of triples addition:
Input: J is a set of triples to add;
𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set;
PrimesObjAttr, PrimesObjCond, PrimesAttrCond;
Output: 𝒯 = {T = (∗X, ∗Y, ∗Z)};
PrimesObjAttr, PrimesObjCond, PrimesAttrCond;
for all (g, m, b) ∈ J do
    PrimesObjAttr[g, m] := PrimesObjAttr[g, m] ∪ {b}
    PrimesObjCond[g, b] := PrimesObjCond[g, b] ∪ {m}
    PrimesAttrCond[m, b] := PrimesAttrCond[m, b] ∪ {g}
    𝒯 := 𝒯 ∪ {(&PrimesAttrCond[m, b], &PrimesObjCond[g, b], &PrimesObjAttr[g, m])}
end for
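A possible Python rendering of the addition step is sketched below. It assumes that the pointers (&) of the pseudocode can be modelled by storing references to the mutable prime sets, so that triclusters created earlier see later updates; the class name `OnlineOACPrime` and the choice of a dictionary keyed by generating triple are illustrative, not taken from the original code.

```python
from collections import defaultdict


class OnlineOACPrime:
    """Keeps prime sets and one tricluster (three set references) per triple."""

    def __init__(self):
        self.obj_attr = defaultdict(set)    # (g, m) -> conditions
        self.obj_cond = defaultdict(set)    # (g, b) -> attributes
        self.attr_cond = defaultdict(set)   # (m, b) -> objects
        self.triclusters = {}               # generating triple -> (extent, intent, modus)

    def add(self, triples):
        for g, m, b in triples:
            self.obj_attr[(g, m)].add(b)
            self.obj_cond[(g, b)].add(m)
            self.attr_cond[(m, b)].add(g)
            # Store references, not copies: the sets keep growing in place.
            self.triclusters[(g, m, b)] = (
                self.attr_cond[(m, b)],
                self.obj_cond[(g, b)],
                self.obj_attr[(g, m)],
            )


miner = OnlineOACPrime()
miner.add([("g1", "m1", "b1")])
miner.add([("g2", "m1", "b1")])
# The extent of the first tricluster has been updated by the second triple.
print(miner.triclusters[("g1", "m1", "b1")])
```

Each triple triggers a constant number of dictionary and set operations, which matches the one-pass, O(|I|) character of the procedure.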
Prime OAC-triclustering
Online version of the algorithm
A user may need to remove triclusters with the same extent, intent, and modus at the post-processing stage. At this stage we can also check various conditions (for instance, a minimal density condition).
Input: 𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set;
Output: 𝒯out = {T = (X, Y, Z)}, the processed tricluster hash-set;
𝒯out := ∅
for all T ∈ 𝒯 do
    Compute hash(T)
    if hash(T) ∉ 𝒯out.keys() then
        𝒯out[hash(T)] := T
    end if
end for
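The post-processing step might look as follows in Python. This is a sketch under assumptions: triclusters arrive as triples of sets, value-based deduplication is done with tuples of frozensets instead of an explicit hash(T), and the optional density check uses a hypothetical `rho_min` parameter.

```python
from itertools import product


def postprocess(triclusters, triples, rho_min=0.0):
    """Drop duplicate triclusters and those with density below rho_min."""
    triples = set(triples)
    kept = set()
    for X, Y, Z in triclusters:
        key = (frozenset(X), frozenset(Y), frozenset(Z))
        if key in kept:                          # same extent, intent and modus
            continue
        hits = sum(1 for cell in product(*key) if cell in triples)
        if hits / (len(X) * len(Y) * len(Z)) >= rho_min:
            kept.add(key)
    return kept


# Two generating triples that yield the same tricluster (toy data).
context = {("g1", "m1", "b1"), ("g2", "m1", "b1")}
raw = [({"g1", "g2"}, {"m1"}, {"b1"}), ({"g1", "g2"}, {"m1"}, {"b1"})]
print(postprocess(raw, context))
```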
Prime OAC-triclustering
Online version of the algorithm
Complexity summary:
Time complexity: O(|I|) (as there is a constant number of operations on
each step);
More precisely: 8|I| operations in total;
1 Modification of 3 prime sets (3);
2 Creation of a new tricluster (1);
3 Addition of pointers to its extent, intent, and modus (3);
4 Addition of the tricluster to the set of all triclusters (1).
Memory complexity: O(|I|) (as we need to keep in memory only prime sets,
|I| elements in each dictionary + keys).
Prime OAC-triclustering
Online version of the algorithm
Example:
→ (g1, m1, b1)
1 PrimesObjAttr = {((g1, m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1})}
4 T := T ∪ {PrimesAttrCond[m1, b1], PrimesObjCond[g1, b1], PrimesObjAttr[g1, m1]}
→ (g1, m2, b1)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1}), ((m2, b1), {g1})}
4 T := T ∪ {PrimesAttrCond[m2, b1], PrimesObjCond[g1, b1], PrimesObjAttr[g1, m2]}
→ (g2, m1, b1)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1})}
4 T := T ∪ {PrimesAttrCond[m1, b1], PrimesObjCond[g2, b1], PrimesObjAttr[g2, m1]}
→ (g2, m2, b1)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2})}
4 T := T ∪ {PrimesAttrCond[m2, b1], PrimesObjCond[g2, b1], PrimesObjAttr[g2, m2]}
→ (g3, m3, b1)
1 PrimesObjAttr =
{((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3})}
4 T := T ∪ {PrimesAttrCond[m3, b1], PrimesObjCond[g3, b1], PrimesObjAttr[g3, m3]}
→ (g1, m2, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1),
{b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1),
{m3}), ((g1, b2), {m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1),
{g3}), ((m2, b2), {g1})}
4 T := T ∪ {PrimesAttrCond[m2, b2], PrimesObjCond[g1, b2], PrimesObjAttr[g1, m2]}
→ (g2, m1, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}),
((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}),
((g1, b2), {m2}), ((g2, b2), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1}), ((m1, b2), {g2})}
4 T := T ∪ {PrimesAttrCond[m1, b2], PrimesObjCond[g2, b2], PrimesObjAttr[g2, m1]}
→ (g2, m2, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}),
((g2, m2), {b1, b2}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}),
((g1, b2), {m2}), ((g2, b2), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1, g2}), ((m1, b2), {g2})}
4 T := T ∪ {PrimesAttrCond[m2, b2], PrimesObjCond[g2, b2], PrimesObjAttr[g2, m2]}
→ (g3, m3, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2),
{b1, b2}), ((g3, m3), {b1, b2})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2),
{m2}), ((g2, b2), {m1, m2}), ((g3, b2), {m3})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1, g2}), ((m1, b2), {g2}), ((m3, b2), {g3})}
4 T := T ∪ {PrimesAttrCond[m3, b2], PrimesObjCond[g3, b2], PrimesObjAttr[g3, m3]}
Postprocessing:
1 T(g1,m1,b1) = ({g1, g2}, {m1, m2}, {b1}) ← add
2 T(g1,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← add
3 T(g2,m1,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
4 T(g2,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
5 T(g3,m3,b1) = ({g3}, {m3}, {b1, b2}) ← add
6 T(g1,m2,b2) = ({g1, g2}, {m2}, {b1, b2}) ← add
7 T(g2,m1,b2) = ({g2}, {m1, m2}, {b1, b2}) ← add
8 T(g2,m2,b2) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
9 T(g3,m3,b2) = ({g3}, {m3}, {b1, b2}) ← the same as T(g3,m3,b1), skip
The final output set of triclusters:
1 T1 = ({g1, g2}, {m1, m2}, {b1})
2 T2 = ({g1, g2}, {m1, m2}, {b1, b2})
3 T3 = ({g3}, {m3}, {b1, b2})
4 T4 = ({g1, g2}, {m2}, {b1, b2})
5 T5 = ({g2}, {m1, m2}, {b1, b2})
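The worked example can be replayed with a short Python script. It is only a sketch: the nine triples are fed in the order used on the slides, references to the growing prime sets stand in for the pointers, and the final set comprehension plays the role of the post-processing step. The names `TRIPLES` and `run_example` are illustrative.

```python
from collections import defaultdict

# The nine triples of the example, in arrival order.
TRIPLES = [
    ("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
    ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
    ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2"),
]


def run_example(triples):
    obj_attr = defaultdict(set)
    obj_cond = defaultdict(set)
    attr_cond = defaultdict(set)
    generated = []
    for g, m, b in triples:
        obj_attr[(g, m)].add(b)
        obj_cond[(g, b)].add(m)
        attr_cond[(m, b)].add(g)
        # Keep references to the prime sets; they are final after the loop.
        generated.append((attr_cond[(m, b)], obj_cond[(g, b)], obj_attr[(g, m)]))
    # Post-processing: freeze the sets and drop duplicate triclusters.
    return {tuple(frozenset(part) for part in T) for T in generated}


for extent, intent, modus in sorted(run_example(TRIPLES), key=str):
    print(sorted(extent), sorted(intent), sorted(modus))
```

Nine generating triples collapse into the five distinct triclusters T1 to T5 listed above.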
MapReduce Technology
MapReduce scheme [Dean and Ghemawat, 2004]
MapReduce Technology
MapReduce example
Figure: Word counting. Source: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
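Since the figure itself is not reproduced here, the word-counting scheme can be imitated in plain Python. The sketch below runs in a single process and only mimics the map, shuffle, and reduce phases; it does not use Hadoop, and all function names are illustrative.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word of an input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["deer bear river", "car car river"]))
```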
MapReduce Technology
Communication costs: Mining of Massive Datasets [Leskovec et al., 2013]
Chapter 2: MapReduce and the New Software Stack
“Replication Rate and Reducer Size: It is often convenient to measure
communication by the replication rate, which is the communication per input.
Also, the reducer size is the maximum number of inputs associated with any
reducer. For many problems, it is possible to derive a lower bound on replication
rate as a function of the reducer size.”
MapReduce Implementation
The previous lattice-oriented M/R implementations
A version of the Close-by-One algorithm was ported to the M/R framework [Krajca & Vychodil, 2009]
An M/R algorithm for computation of closed cube lattices was proposed [Kudryavcev & Kuznecov, 2009]
[Xu et al., 2012] demonstrated that iterative algorithms like Ganter’s NextClosure can benefit from the usage of iterative M/R schemes
MapReduce Implementation
Technologies and code repositories
Technologies used
Apache Hadoop [1]
Apache Maven (tool for automated project builds)
Apache Commons (for work with extended Java collections)
Google Guava (utilities and data structures)
Jackson JSON (open-source library for converting the object-oriented representation of an object, such as a tricluster, to a string)
TypeTools (for run-time type resolution of inbound and outbound key-value pairs)
. . .
Implementations
Source 1: “Chaining-job” module [2]
Source 2: M/R-based OAC Triclustering [3]
[1] http://hadoop.apache.org/
[2] https://github.com/zydins/chaining-job
[3] https://github.com/zydins/DistributedTriclustering
Two-stage MapReduce Implementation
Distributed OAC-triclustering: First Map
Input: S is a set of input triples as strings;
r is a number of reducers;
i is a grouping index (objects, attributes or conditions).
Output: J̃ is a list of ⟨key, triple⟩ pairs.
for all s ∈ S do
    t := transform(s)
    key := hash(t[i]) mod r
    J̃ := J̃ ∪ {⟨key, t⟩}
end for
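A rough Python counterpart of the first map is given below. Two assumptions are made that the slides do not state: input lines are tab-separated triples, and `zlib.crc32` is used as the hash so that keys are reproducible across processes (Python's built-in `hash` for strings is salted per run). The function name `first_map` is illustrative.

```python
import zlib


def first_map(lines, r, i):
    """Emit <key, triple> pairs, keyed by the hashed i-th entity modulo r."""
    pairs = []
    for s in lines:
        t = tuple(s.rstrip("\n").split("\t"))             # transform(s), assumed TSV
        key = zlib.crc32(t[i].encode("utf-8")) % r         # hash(t[i]) mod r
        pairs.append((key, t))
    return pairs


# Group by object (index 0) across 4 hypothetical reducers.
lines = ["g1\tm1\tb1", "g1\tm2\tb1", "g2\tm1\tb1"]
for key, triple in first_map(lines, r=4, i=0):
    print(key, triple)
```

All triples that share the grouping entity receive the same key and therefore reach the same first-stage reducer.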
Two-stage MapReduce Implementation
Distributed OAC-triclustering: First Reduce
Input: J is a list of triples (for a certain key);
𝒯 = {T = (X, Y, Z)} is the current set of triclusters;
PrimesOA, PrimesOC, PrimesAC.
Output: a file of strings, each an encoded ⟨triple, tricluster⟩ pair.
Primes ← initialise a new multimap
for all (g, m, b) ∈ J do
    Primes[g, m] := Primes[g, m] ∪ {b}
    Primes[g, b] := Primes[g, b] ∪ {m}
    Primes[m, b] := Primes[m, b] ∪ {g}
end for
for all (g, m, b) ∈ J do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    s := encode(⟨(g, m, b), T⟩)
    store s
end for
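The first reduce could be sketched in Python as follows. The JSON record layout is an assumption made for illustration (the slides only say that pairs are encoded as strings), and the multimap keys are tagged with "OA", "OC", and "AC" so that pairs of different types cannot collide.

```python
import json
from collections import defaultdict


def first_reduce(triples):
    """Compute primes within one partition and encode <triple, tricluster> pairs."""
    primes = defaultdict(set)
    for g, m, b in triples:
        primes[("OA", g, m)].add(b)
        primes[("OC", g, b)].add(m)
        primes[("AC", m, b)].add(g)
    records = []
    for g, m, b in triples:
        T = [sorted(primes[("AC", m, b)]),     # extent seen in this partition
             sorted(primes[("OC", g, b)]),     # intent
             sorted(primes[("OA", g, m)])]     # modus
        records.append(json.dumps({"triple": [g, m, b], "tricluster": T}))
    return records


for record in first_reduce([("g1", "m1", "b1"), ("g1", "m2", "b1")]):
    print(record)
```

Because a reducer sees only the triples of its own key, the prime sets computed here may be partial; they are merged in the second stage.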
Two-stage MapReduce Implementation
Distributed OAC-triclustering: Second Map
Input: S is a list of strings.
Output: T̃ is a list of ⟨tricluster, tricluster⟩ pairs.
Primes ← initialise a new multimap
for all s ∈ S do
    ⟨(g, m, b), T⟩ := decode(s)
    update Primes multimap appropriately
    I := I ∪ {(g, m, b)}
end for
for all (g, m, b) ∈ I do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    T̃ := T̃ ∪ {⟨T, T⟩}
end for
Two-stage MapReduce Implementation
Distributed OAC-triclustering: Second Reduce
Input: T̂ is a list of ⟨tricluster, list of triclusters⟩ pairs.
Output: File with a final set of triclusters {T = (X, Y, Z)}.
for all ⟨T, [T, . . . , T]⟩ ∈ T̂ do
    store T
end for
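The second stage can be approximated in Python as below. The interpretation of "update Primes multimap appropriately" as a union of the partial extent, intent, and modus of each record is an assumption, as is the use of already decoded records instead of strings; the function names are illustrative.

```python
from collections import defaultdict


def second_map(records):
    """Merge partial prime sets and emit <tricluster, tricluster> pairs."""
    primes = defaultdict(set)
    triples = []
    for (g, m, b), (X, Y, Z) in records:
        primes[("AC", m, b)] |= set(X)      # merge partial extents
        primes[("OC", g, b)] |= set(Y)      # merge partial intents
        primes[("OA", g, m)] |= set(Z)      # merge partial modi
        triples.append((g, m, b))
    pairs = []
    for g, m, b in triples:
        T = (frozenset(primes[("AC", m, b)]),
             frozenset(primes[("OC", g, b)]),
             frozenset(primes[("OA", g, m)]))
        pairs.append((T, T))
    return pairs


def second_reduce(pairs):
    """Group by key; every distinct tricluster is stored exactly once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return set(grouped)


# Records as two first-stage reducers (grouping by object) might produce them.
records = [
    (("g1", "m1", "b1"), (["g1"], ["m1"], ["b1"])),
    (("g2", "m1", "b1"), (["g2"], ["m1"], ["b1"])),
]
print(second_reduce(second_map(records)))
```

In this toy run both generating triples end up with the same merged tricluster, so the reducer for that key receives two values and stores one.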
Two-stage MapReduce Implementation
Communication costs
The time complexity of the M/R solution is composed of two terms, one for each stage: O(|I|/r) (or O(|I|)) and O(|I|).
The replication rate for the first M/R stage is r1 = 1 (each triple is passed as one key-value pair); the reducer size is q1 = |I|/r.
The replication rate for the second M/R stage is r2 = 1 (it assigns one key-value pair to each tricluster), but the reducer size varies from q2^min = 1 (no duplicate triclusters) to q2^max = |I| (one final tricluster when all the initial triples belong to one absolutely dense cuboid).
Experiments
Description of the experiments
OS X 10, 1.8 GHz Intel Core i5, 4 GB of 1600 MHz DDR3 RAM, and 8 GB of free space on the hard drive (typical commodity hardware).
Two M/R modes have been tested: a sequential mode of task completion and an emulation of the distributed mode with 16 first reducers and 32 threads for the second stage.
To evaluate the runtime more carefully, the average result of 5 runs of the algorithms has been recorded for each context.
Experiments
Datasets
Synthetic datasets. 1) 20,000 triples (25 unique entities of each type); 2) 100,000 triples (50 unique entities of each type); 3) 1,000,000 triples (all possible combinations of 100 unique entities of each type).
The 1st dataset contains duplicates since 25 × 25 × 25 gives only 15,625 unique triples. The 2nd one contains fewer triples than 50^3 = 125,000, the number of all possible combinations. The 3rd one is an absolutely dense cuboid 100 × 100 × 100.
The 3rd dataset does not result in 3^min(|G|,|M|,|B|) formal triconcepts; it is an example of the worst-case scenario for the second reducer (q2^max = |I|).
IMDB. Top-250 list of the best movies from Internet Movie Database
Bibsonomy. The data of bibsonomy.org from ECML PKDD discovery challenge 2008.
Context     |G|     |M|      |B|      # triples   Density
20k         25      25       25       20,000      1
100k        50      50       50       100,000     0.8
1m          100     100      100      1,000,000   1
IMDB        250     795      22       3,818       0.00087
BibSonomy   2,337   67,464   28,920   816,197     1.8 · 10^−7
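The density column can be rechecked with a few lines of Python, taking the density of a context as |I| / (|G| |M| |B|). The helper name `context_density` is illustrative. The 20k row is left out on purpose: that dataset contains duplicate triples, so its raw triple count cannot be plugged into the formula directly.

```python
def context_density(n_triples, n_objects, n_attributes, n_conditions):
    """Density of a triadic context: |I| / (|G| * |M| * |B|)."""
    return n_triples / (n_objects * n_attributes * n_conditions)


rows = {
    "100k": (100_000, 50, 50, 50),
    "1m": (1_000_000, 100, 100, 100),
    "IMDB": (3_818, 250, 795, 22),
    "BibSonomy": (816_197, 2_337, 67_464, 28_920),
}
for name, sizes in rows.items():
    print(name, f"{context_density(*sizes):.2g}")
```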
Experiments
Results
Algorithm/Context       IMDB (≈3k triples)   20k triples   100k triples   1m triples   Bibsonomy (≈800k triples)
Tribox                  324                  800           1,265          >3,000       >3,000
TRIAS                   189                  362           862            >3,000       >3,000
OAC Box                 374                  756           1,265          >3,000       >3,000
OAC Prime               7                    8             734            >3,000       >3,000
Online OAC prime        3                    3             3              5            >3,000
M/R OAC prime seq.      12                   30            81             166          1,534
M/R OAC prime distr.    1                    15            20             25           520
Alternative MapReduce decomposition
Variant I: First stage
First Map: Finding primes. During this phase every input triple (g, m, b) is
encoded by three key-value pairs ⟨(g, m), b⟩, ⟨(g, b), m⟩, and ⟨(m, b), g⟩. These
pairs are passed to the first reducer.
The replication rate is r1 = 3.
First Reduce: Finding primes. This reducer fills three corresponding dictionaries for primes of keys. So, for example, the first dictionary, PrimeOA, contains key-value pairs ⟨(g, m), {b1, b2, . . . , bn}⟩.
The reducer size is q1 = max(|G|, |M|, |B|).
The process can be stopped after the first reduce phase, and all the triclusters can then be found as (Prime[g, m], Prime[g, b], Prime[m, b]) by enumerating each (g, m, b) ∈ I. However, to do it faster and keep the result for further computation, it is possible to use M/R as well.
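The first stage of Variant I might be sketched in Python like this. The tags "OA", "OC", and "AC" prepended to the keys are an assumption added so that the three kinds of pairs can share one dictionary; the function names are illustrative and the shuffle is simulated in-process.

```python
from collections import defaultdict


def variant1_first_map(triples):
    """Emit three key-value pairs per triple (replication rate 3)."""
    for g, m, b in triples:
        yield ("OA", g, m), b
        yield ("OC", g, b), m
        yield ("AC", m, b), g


def variant1_first_reduce(pairs):
    """Collect, for every key, the set of values: the prime set of that pair."""
    primes = defaultdict(set)
    for key, value in pairs:
        primes[key].add(value)
    return dict(primes)


# Hypothetical toy context.
context = [("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1")]
pairs = list(variant1_first_map(context))
primes = variant1_first_reduce(pairs)
print(len(pairs), primes[("AC", "m1", "b1")])
```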
Alternative MapReduce decomposition
Variant I: Second stage
Second Map: Tricluster generation. The second map does the tricluster combining job, i.e. for each triple (g, m, b) it composes a new key-value pair, ⟨(g, m, b), ∅⟩. For each pair of each type, ⟨(g, m), Prime[g, m]⟩, ⟨(g, b), Prime[g, b]⟩, and ⟨(m, b), Prime[m, b]⟩, it generates the key-value pairs ⟨(g, m, b̃), Prime[g, m]⟩, ⟨(g, m̃, b), Prime[g, b]⟩, and ⟨(g̃, m, b), Prime[m, b]⟩, where g̃ ∈ G, m̃ ∈ M, and b̃ ∈ B.
r2 = (|I| + 3|G||M||B|)/(|I| + |G||M| + |G||B| + |M||B|) ≤ (ρ + 3)/(ρ + 3/max(|G|, |M|, |B|)), where ρ is the input tricontext density.
Second Reduce: Tricluster generation. For each key (g, m, b), the generating triple, the second reducer assembles just one value, its tricluster (Prime[g, m], Prime[g, b], Prime[m, b]). If there is no key-value pair ⟨(g, m, b), ∅⟩ for a particular triple (g, m, b), it does not output any key-value pair for that key.
The reducer size q2 is either 3 (no output) or 4 (tricluster assembled).
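To get a feel for the replication rate of this stage, the formula for r2 and its upper bound can be evaluated numerically. The sketch below simply transcribes the two expressions; the function names are illustrative, and the context sizes are those of the 1m and IMDB datasets from the experiments.

```python
def replication_rate(n_triples, g, m, b):
    """r2 = (|I| + 3|G||M||B|) / (|I| + |G||M| + |G||B| + |M||B|)."""
    return (n_triples + 3 * g * m * b) / (n_triples + g * m + g * b + m * b)


def replication_bound(n_triples, g, m, b):
    """Upper bound (rho + 3) / (rho + 3 / max(|G|, |M|, |B|))."""
    rho = n_triples / (g * m * b)
    return (rho + 3) / (rho + 3 / max(g, m, b))


for name, sizes in {"1m": (1_000_000, 100, 100, 100),
                    "IMDB": (3_818, 250, 795, 22)}.items():
    print(name, round(replication_rate(*sizes), 2), round(replication_bound(*sizes), 2))
```

For the dense 100 × 100 × 100 cuboid the bound is attained (all three dimensions are equal), whereas for a sparse context such as IMDB the rate is much larger, since most of the generated keys do not correspond to triples of I.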
Alternative MapReduce decomposition
Variant II: Second stage
Second Map: Tricluster generation with duplicate generating triples. The second map does the tricluster combining job, i.e. for each triple (g, m, b) it composes a new key-value pair: ⟨(Prime[g, m], Prime[g, b], Prime[m, b]), (g, m, b)⟩.
Second Reduce: Tricluster generation with duplicate generating triples. The second reducer just groups the values for each key: ⟨(X, Y, Z), {(g1, m1, b1), . . . , (gn, mn, bn)}⟩.
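Variant II can be illustrated with a small single-process Python sketch. It assumes the prime dictionaries of the first stage are available in memory, keeps the component order of the key exactly as written above (Prime[g, m], Prime[g, b], Prime[m, b]), and uses illustrative names throughout.

```python
from collections import defaultdict


def variant2_second_stage(triples):
    """Group generating triples by the tricluster they produce."""
    triples = list(triples)
    prime = defaultdict(set)
    for g, m, b in triples:                     # primes, as left by the first stage
        prime[("OA", g, m)].add(b)
        prime[("OC", g, b)].add(m)
        prime[("AC", m, b)].add(g)
    grouped = defaultdict(set)
    for g, m, b in triples:
        # Second map: emit <tricluster, generating triple>.
        key = (frozenset(prime[("OA", g, m)]),
               frozenset(prime[("OC", g, b)]),
               frozenset(prime[("AC", m, b)]))
        # Second reduce: collect all generating triples of the same key.
        grouped[key].add((g, m, b))
    return dict(grouped)


result = variant2_second_stage([("g1", "m1", "b1"), ("g2", "m1", "b1")])
for key, generators in result.items():
    print([sorted(part) for part in key], sorted(generators))
```

The number of generating triples per key is a by-product that can be useful later, for example when estimating tricluster density.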
Conclusion and further work
A MapReduce implementation of Prime OAC-triclustering has been proposed.
Communication costs have been analysed.
A comparison of the online version and the M/R one has been performed.
Further experiments are needed with other M/R variants and other triclustering algorithms.
A proper comparison of the proposed OAC triclustering and noise-tolerant patterns in n-ary relations, e.g., by DataPeeler descendants [Cerf et al., 2013], has not yet been conducted.
Thank you!
Questions?
S. Zudin et al. () OAC-triclustering on MapReduce CLA 2015 39 / 39

More Related Content

PDF
A One-Pass Triclustering Approach: Is There any Room for Big Data?
PDF
Graph Edit Distance: Basics & Trends
PDF
Accelerating Pseudo-Marginal MCMC using Gaussian Processes
PDF
Graph kernels
PDF
Recursive algorithms
PDF
A Note on TopicRNN
PDF
A Note on Latent LSTM Allocation
PDF
SIAM - Minisymposium on Guaranteed numerical algorithms
A One-Pass Triclustering Approach: Is There any Room for Big Data?
Graph Edit Distance: Basics & Trends
Accelerating Pseudo-Marginal MCMC using Gaussian Processes
Graph kernels
Recursive algorithms
A Note on TopicRNN
A Note on Latent LSTM Allocation
SIAM - Minisymposium on Guaranteed numerical algorithms

What's hot (19)

PDF
Slides: Simplifying Gaussian Mixture Models Via Entropic Quantization (EUSIPC...
PDF
Quantum Machine Learning and QEM for Gaussian mixture models (Alessandro Luongo)
PPT
Spsp fw
PDF
Parallel Evaluation of Multi-Semi-Joins
PDF
2-rankings of Graphs
PDF
A Note on Correlated Topic Models
PDF
Subquad multi ff
PPTX
Ponchon Savarait
PDF
ABC with Wasserstein distances
PDF
A study of the worst case ratio of a simple algorithm for simple assembly lin...
PDF
QMC: Operator Splitting Workshop, Forward-Backward Splitting Algorithm withou...
PPTX
論文紹介 Fast imagetagging
PDF
YaPingPresentation
PDF
Incremental and parallel computation of structural graph summaries for evolvi...
PDF
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
DOCX
R-ggplot2 package Examples
PDF
Litvinenko low-rank kriging +FFT poster
PDF
Ceske budevice
PDF
Coordinate sampler: A non-reversible Gibbs-like sampler
Slides: Simplifying Gaussian Mixture Models Via Entropic Quantization (EUSIPC...
Quantum Machine Learning and QEM for Gaussian mixture models (Alessandro Luongo)
Spsp fw
Parallel Evaluation of Multi-Semi-Joins
2-rankings of Graphs
A Note on Correlated Topic Models
Subquad multi ff
Ponchon Savarait
ABC with Wasserstein distances
A study of the worst case ratio of a simple algorithm for simple assembly lin...
QMC: Operator Splitting Workshop, Forward-Backward Splitting Algorithm withou...
論文紹介 Fast imagetagging
YaPingPresentation
Incremental and parallel computation of structural graph summaries for evolvi...
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
R-ggplot2 package Examples
Litvinenko low-rank kriging +FFT poster
Ceske budevice
Coordinate sampler: A non-reversible Gibbs-like sampler
Ad

Viewers also liked (17)

PDF
A lattice-based consensus clustering
PPTX
AIST 2016 Opening Slides
PDF
A lattice-based consensus clustering
PDF
Experimental Economics and Machine Learning workshop
PPTX
NIPS 2016, Tensor-Learn@NIPS, and IEEE ICDM 2016
PDF
Pattern-based classification of demographic sequences
PDF
Sequence mining
PDF
Context-Aware Recommender System Based on Boolean Matrix Factorisation
PDF
Поиск частых множеств признаков (товаров) и ассоциативные правила
PDF
On the Family of Concept Forming Operators in Polyadic FCA
PPTX
RAPS: A Recommender Algorithm Based on Pattern Structures
PDF
Pattern Mining and Machine Learning for Demographic Sequences
PDF
Searching for optimal patterns in Boolean tensors
PDF
Введение в рекомендательные системы. 3 case-study без NetFlix.
PPTX
Boolean matrix factorisation for collaborative filtering
PDF
Intro to Data Mining and Machine Learning
PPTX
Online Recommender System for Radio Station Hosting: Experimental Results Rev...
A lattice-based consensus clustering
AIST 2016 Opening Slides
A lattice-based consensus clustering
Experimental Economics and Machine Learning workshop
NIPS 2016, Tensor-Learn@NIPS, and IEEE ICDM 2016
Pattern-based classification of demographic sequences
Sequence mining
Context-Aware Recommender System Based on Boolean Matrix Factorisation
Поиск частых множеств признаков (товаров) и ассоциативные правила