Putting OAC-triclustering on MapReduce

Sergey Zudin, Dmitry V. Gnatyshak, and Dmitry I. Ignatov
National Research University Higher School of Economics, Russian Federation
Faculty of Computer Science
CLA 2015, Clermont-Ferrand, France
October 13-16

Outline
1 Motivation and previous work
2 Prime OAC-triclustering
Triadic Formal concept analysis
Basic algorithm
Online version of the algorithm
3 OAC-triclustering on MapReduce
MapReduce technology
MapReduce implementation
4 Experiments
Description of the experiments
Datasets
Results
5 Conclusion

Motivation
Large amounts of multimodal data:
Gene expression data
Folksonomies
Recommender Systems
Communities in multi-mode (social) networks
Pattern mining in relational databases
. . .
Non-binary data can be scaled (possibly increasing the dimensionality)
The amount of big data keeps growing: fast and/or distributed algorithms are
required (linear or sublinear, one-pass)
Existing methods: finding all n-sets (multimodal clusters) satisfying some
conditions (often an exponential number of patterns)

Motivation
IMDB example, [Mirkin et al., 2011]
Clump Movie-Keyword-Genre
Bicluster
{12 Angry Men (1957), To Kill a Mockingbird (1962),
Witness for the Prosecution (1957)}, {Murder, Trial}, {n/a}
Tricluster
{12 Angry Men (1957), Double Indemnity (1944), Chinatown (1974),
The Big Sleep (1946), Witness for the Prosecution (1957),
Dial M for Murder (1954), Shadow of a Doubt (1943)},
{Murder, Trial, Widow, Marriage, Private detective, Blackmail, Letter},
{Crime, Drama, Thriller, Mystery, Film-Noir}

Previous and related work
A short (not full) list
Triadic FCA [Wille, 1995; Lehmann and Wille, 1995] and Polyadic FCA
[Voutsadakis, 2002]
TRIAS [Jäschke et al., 2006] for mining (frequent) triconcepts
DataPeeler for closed n-sets [Cerf et al., 2009], MultiDupeHack [Cerf et al.,
2013]
TriBox [Mirkin et al., 2011] for mining dense triboxes with LS criterion
Box OAC-triclustering and Spectral Triclustering [Ignatov et al., 2011, 2013]
Multi-way set enumeration in weight tensors [Schölkopf et al., 2011]

Previous and related work
A short (not full) list
Quadri-concepts for personalised folksonomies [Jelassi et al., 2012, 2013]
Prime OAC-triclustering [Gnatyshak et al., 2012–2014]
Triadic Boolean tensor factorisation [Miettinen et al., 2011; Belohlavek et al.,
2013] and Boolean tensor clustering [Miettinen et al., 2015]
Closed and connected patterns in multi-relational data [Spyropoulou et al.,
2011–14]
Triadic FCA and triclustering: Searching for optimal patterns. Machine
Learning journal [Ignatov et al., 2015] and CLA 2013
. . .

Prime OAC-triclustering
Formal concept analysis: triadic case
Definition
Let G, M, B be sets and the ternary relation I be a subset of their Cartesian
product: I ⊆ G × M × B. Then the tuple K = (G, M, B, I) is called a triadic
formal context.
G is a set of objects, M is a set of attributes, B is a set of conditions.
GM m1 m2 m3 m1 m2 m3 m1 m2 m3
g1 x x x x x x x x
g2 x x x x x
g3 x x x x
g4 x x x x x x
B b1 b2 b3

Definition
Galois operators (prime operators) are defined in a similar way to the dyadic case:
2^G → 2^M × 2^B        2^G × 2^M → 2^B
2^M → 2^G × 2^B        2^G × 2^B → 2^M
2^B → 2^G × 2^M        2^M × 2^B → 2^G

({g1, g2}, {m1, m2})′ = {b1, b3}

m2′ = {(g1, b1), (g2, b1), (g3, b1), (g1, b2), (g1, b3), (g2, b3), (g4, b3)}

Definition
The triple (X, Y, Z) is called a triadic formal concept of the context
K = (G, M, B, I) if X ⊆ G, Y ⊆ M, Z ⊆ B, (X, Y)′ = Z, (X, Z)′ = Y, and
(Y, Z)′ = X.
X is called the (formal) extent, Y the (formal) intent, and Z the (formal) modus.

Basic algorithm [Gnatyshak et al., 2013]
This method uses the following types of prime operators (for the context
K = (G, M, B, I)):
(g, m)′ = {b ∈ B | (g, m, b) ∈ I},
(g, b)′ = {m ∈ M | (g, m, b) ∈ I},
(m, b)′ = {g ∈ G | (g, m, b) ∈ I}
Definition
The triple T = ((m, b)′, (g, b)′, (g, m)′) is called the prime-based
OAC-tricluster for a triple (g, m, b) ∈ I. The components of the tricluster are
called, respectively, the tricluster extent, intent, and modus. The triple
(g, m, b) is called the generating triple of the tricluster T.
Definition
Density of a tricluster: ρ(X, Y, Z) = |I ∩ (X × Y × Z)| / (|X| |Y| |Z|)

Basic algorithm
An example of a tricluster based on triple (g, m, b):

Basic algorithm
Input: K = (G, M, B, I) is a triadic context;
       ρmin is a density threshold
Output: 𝒯 = {T = (X, Y, Z)}
𝒯 := ∅
for all (g, m): g ∈ G, m ∈ M do
    PrimesObjAttr[g, m] := (g, m)′
end for
for all (g, b): g ∈ G, b ∈ B do
    PrimesObjCond[g, b] := (g, b)′
end for
for all (m, b): m ∈ M, b ∈ B do
    PrimesAttrCond[m, b] := (m, b)′
end for
for all (g, m, b) ∈ I do
    T := (PrimesAttrCond[m, b], PrimesObjCond[g, b], PrimesObjAttr[g, m])
    Tkey := hash(T)
    if Tkey ∉ 𝒯.keys ∧ ρ(T) ≥ ρmin then
        𝒯[Tkey] := T
    end if
end for
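
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `density` and `prime_oac_triclusters` are made up here, a tuple of frozensets stands in for the hash key, and prime sets are built only for pairs that actually occur in I, which is enough because only those are looked up in the final loop.

```python
from collections import defaultdict


def density(tricluster, triples):
    """rho(X, Y, Z) = |I ∩ (X × Y × Z)| / (|X| |Y| |Z|)."""
    extent, intent, modus = tricluster
    hits = sum((g, m, b) in triples for g in extent for m in intent for b in modus)
    return hits / (len(extent) * len(intent) * len(modus))


def prime_oac_triclusters(triples, rho_min=0.0):
    """Return the distinct prime OAC-triclusters of I with density >= rho_min."""
    triples = set(triples)
    primes_obj_attr = defaultdict(set)   # (g, m) -> (g, m)'
    primes_obj_cond = defaultdict(set)   # (g, b) -> (g, b)'
    primes_attr_cond = defaultdict(set)  # (m, b) -> (m, b)'
    for g, m, b in triples:
        primes_obj_attr[g, m].add(b)
        primes_obj_cond[g, b].add(m)
        primes_attr_cond[m, b].add(g)

    found = set()
    for g, m, b in triples:
        t = (frozenset(primes_attr_cond[m, b]),
             frozenset(primes_obj_cond[g, b]),
             frozenset(primes_obj_attr[g, m]))
        # The tuple of frozensets is hashable, so set membership plays the role of Tkey.
        if t not in found and density(t, triples) >= rho_min:
            found.add(t)
    return found
```

Computing the density by enumerating X × Y × Z is the simplest choice for a sketch; it is cubic in the tricluster size and would need a smarter count on large data.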

Online version of the algorithm [Gnatyshak et al., 2014]
Let K = (G, M, B, I) be a triadic context. We do not know G, M, B, I, or their
cardinalities in advance.
Input on each iteration: {(g, m, b)} = J ⊆ I.
Goal: maintain an updated version of the results and efficiently update them when
new triples are received.
We need to keep in memory the results of prime operators’ application (prime
sets):
PrimesObjAttr — dictionary with elements of type ((g, m), {b ∈ B}), g ∈ G,
m ∈ M;
PrimesObjCond — dictionary with elements of type ((g, b), {m ∈ M}),
g ∈ G, b ∈ B;
PrimesAttrCond — dictionary with elements of type ((m, b), {g ∈ G}),
m ∈ M, b ∈ B.

Remark
In this case we need to treat triclusters based on different triples as different, even
if their extents, intents, and modi are equal.

Algorithm of triples addition:
Input: J is a set of triples to add;
       𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set;
       PrimesObjAttr, PrimesObjCond, PrimesAttrCond;
Output: 𝒯 = {T = (∗X, ∗Y, ∗Z)};
        PrimesObjAttr, PrimesObjCond, PrimesAttrCond;
for all (g, m, b) ∈ J do
    PrimesObjAttr[g, m] := PrimesObjAttr[g, m] ∪ {b}
    PrimesObjCond[g, b] := PrimesObjCond[g, b] ∪ {m}
    PrimesAttrCond[m, b] := PrimesAttrCond[m, b] ∪ {g}
    𝒯 := 𝒯 ∪ {(&PrimesAttrCond[m, b], &PrimesObjCond[g, b], &PrimesObjAttr[g, m])}
end for
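
The update step can be sketched in Python as below. This is an illustration under assumptions rather than the original code: the class name `OnlineOAC` and its attribute names are hypothetical, and the pointers (&) of the pseudocode are mimicked by storing references to the very same mutable set objects that live in the prime dictionaries, so a tricluster recorded earlier reflects later additions.

```python
from collections import defaultdict


class OnlineOAC:
    """Keeps prime sets and triclusters up to date as triples arrive."""

    def __init__(self):
        self.primes_obj_attr = defaultdict(set)   # (g, m) -> {b}
        self.primes_obj_cond = defaultdict(set)   # (g, b) -> {m}
        self.primes_attr_cond = defaultdict(set)  # (m, b) -> {g}
        # generating triple -> (extent, intent, modus); the three sets are the
        # same objects as the dictionary values above, not copies.
        self.triclusters = {}

    def add(self, batch):
        """Process one batch J of (g, m, b) triples in a single pass."""
        for g, m, b in batch:
            self.primes_obj_attr[g, m].add(b)
            self.primes_obj_cond[g, b].add(m)
            self.primes_attr_cond[m, b].add(g)
            self.triclusters[g, m, b] = (
                self.primes_attr_cond[m, b],
                self.primes_obj_cond[g, b],
                self.primes_obj_attr[g, m],
            )
```

Keying the tricluster dictionary by the generating triple keeps triclusters of different triples apart even when their components coincide, and makes a repeated input triple harmless.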

A user may need to remove triclusters with the same extent, intent, and
modus at the post-processing stage. At this stage we can also check various
conditions (for instance, a minimal density condition).
Input: 𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set;
Output: 𝒯′ is the processed tricluster hash-set;
𝒯′ := ∅
for all T ∈ 𝒯 do
    Compute hash(T)
    if hash(T) ∉ 𝒯′.keys() then
        𝒯′ := 𝒯′ ∪ {T}
    end if
end for

Complexity summary:
Time complexity: O(|I|) (as there is a constant number of operations on
each step);
More precisely: 8|I| operations in total;
1 Modification of 3 prime sets (3);
2 Creation of a new tricluster (1);
3 Addition of pointers to its extent, intent, and modus (3);
4 Addition of the tricluster to the set of all triclusters (1).
Memory complexity: O(|I|) (as we need to keep in memory only prime sets,
|I| elements in each dictionary + keys).

Example:

→ (g1, m1, b1)
1 PrimesObjAttr = {((g1, m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1})}
4 T := T ∪ {PrimesAttrCond[m1, b1], PrimesObjCond[g1, b1], PrimesObjAttr[g1, m1]}

→ (g1, m2, b1)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1}), ((m2, b1), {g1})}

→ (g2, m1, b1)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1})}

→ (g2, m2, b1)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2})}

→ (g3, m3, b1)
1 PrimesObjAttr =
{((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3})}

→ (g1, m2, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1),
{b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1),
{m3}), ((g1, b2), {m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1),
{g3}), ((m2, b2), {g1})}

→ (g2, m1, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}),
((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}),
((g1, b2), {m2}), ((g2, b2), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1}), ((m1, b2), {g2})}

→ (g2, m2, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}),
((g2, m2), {b1, b2}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}),
((g1, b2), {m2}), ((g2, b2), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1, g2}), ((m1, b2), {g2})}

→ (g3, m3, b2)
1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2),
{b1, b2}), ((g3, m3), {b1, b2})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2),
{m2}), ((g2, b2), {m1, m2}), ((g3, b2), {m3})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}),
((m2, b2), {g1, g2}), ((m1, b2), {g2}), ((m3, b2), {g3})}

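Postprocessing:
1 T(g1,m1,b1) = ({g1, g2}, {m1, m2}, {b1}) ← add
2 T(g1,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← add
3 T(g2,m1,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
4 T(g2,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
5 T(g3,m3,b1) = ({g3}, {m3}, {b1, b2}) ← add
6 T(g1,m2,b2) = ({g1, g2}, {m2}, {b1, b2}) ← add
7 T(g2,m1,b2) = ({g2}, {m1, m2}, {b1, b2}) ← add
8 T(g2,m2,b2) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
9 T(g3,m3,b2) = ({g3}, {m3}, {b1, b2}) ← the same as T(g3,m3,b1), skip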

The final output set of triclusters:
1 T1 = ({g1, g2}, {m1, m2}, {b1})
2 T2 = ({g1, g2}, {m1, m2}, {b1, b2})
3 T3 = ({g3}, {m3}, {b1, b2})
4 T4 = ({g1, g2}, {m2}, {b1, b2})
5 T5 = ({g2}, {m1, m2}, {b1, b2})
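
The post-processing pass can be replayed on the nine example triples with a short Python sketch. The names `TRIPLES` and `postprocess` are introduced here for illustration only; the dictionary keeps the first generating triple of each distinct (extent, intent, modus), which corresponds to the "add" and "skip" decisions listed above.

```python
from collections import defaultdict

# The nine triples of the example, in arrival order.
TRIPLES = [
    ("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
    ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
    ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2"),
]


def postprocess(triples):
    """Build the final prime sets, then keep one tricluster per distinct (X, Y, Z)."""
    obj_attr = defaultdict(set)
    obj_cond = defaultdict(set)
    attr_cond = defaultdict(set)
    for g, m, b in triples:
        obj_attr[g, m].add(b)
        obj_cond[g, b].add(m)
        attr_cond[m, b].add(g)

    unique = {}  # (extent, intent, modus) -> first generating triple seen
    for g, m, b in triples:
        t = (frozenset(attr_cond[m, b]),
             frozenset(obj_cond[g, b]),
             frozenset(obj_attr[g, m]))
        unique.setdefault(t, (g, m, b))  # later duplicates are skipped
    return unique


for (extent, intent, modus), source in postprocess(TRIPLES).items():
    print(sorted(extent), sorted(intent), sorted(modus), "from", source)
```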

MapReduce Technology
MapReduce scheme [Dean and Ghemawat, 2004]

MapReduce example
Figure: Word counting. Source:
http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
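
The word-counting scheme in the figure can be imitated on a single machine with plain Python. This is only a sketch of the map, shuffle, and reduce steps: the function names are invented for the illustration and no Hadoop API is involved.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a <word, 1> pair for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))
```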

Communication costs: Mining of Massive Datasets [Leskovec et al., 2013]
Chapter 2: MapReduce and the New Software Stack
“Replication Rate and Reducer Size: It is often convenient to measure
communication by the replication rate, which is the communication per input.
Also, the reducer size is the maximum number of inputs associated with any
reducer. For many problems, it is possible to derive a lower bound on replication
rate as a function of the reducer size.”

MapReduce Implementation
Previous lattice-oriented M/R implementations
A version of the Close-by-One algorithm was ported to the M/R framework [Krajca
& Vychodil, 2009]
An M/R algorithm for the computation of closed cube lattices was proposed
[Kudryavcev & Kuznecov, 2009]
[Xu et al., 2012] demonstrated that iterative algorithms like Ganter’s
NextClosure can benefit from the usage of iterative M/R schemes

MapReduce Implementation
Technologies and code repositories
Technologies used
Apache Hadoop (http://hadoop.apache.org/)
Apache Maven (framework for automatic project assembly)
Apache Commons (for working with extended Java collections)
Google Guava (utilities and data structures)
Jackson JSON (open-source library for converting the object-oriented
representation of an object, such as a tricluster, to a string)
TypeTools (for real-time type resolution of inbound and outbound key-value
pairs)
. . .
Implementations
Source 1: “Chaining-job” module (https://github.com/zydins/chaining-job)
Source 2: M/R-based OAC Triclustering (https://github.com/zydins/DistributedTriclustering)

Two-stage MapReduce Implementation
Distributed OAC-triclustering: First Map
Input: S is a set of input triples as strings;
       r is the number of reducers;
       i is the grouping index (objects, attributes, or conditions).
Output: J̃ is a list of ⟨key, triple⟩ pairs.
J̃ := ∅
for all s ∈ S do
    t := transform(s)
    key := hash(t[i]) mod r
    J̃ := J̃ ∪ {⟨key, t⟩}
end for
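
A possible Python rendering of the first map is given below. Several details are assumptions made for the sketch: the input strings are taken to be tab-separated, `transform` is a stand-in for whatever parsing the real job performs, and `zlib.crc32` replaces the unspecified hash function because Python's built-in `hash` for strings changes between interpreter runs.

```python
import zlib


def transform(s, sep="\t"):
    """Parse one input line into a (g, m, b) triple (assumed tab-separated)."""
    g, m, b = s.strip().split(sep)
    return (g, m, b)


def first_map(lines, r, i):
    """Return <key, triple> pairs, where key = hash(t[i]) mod r.

    i = 0 groups by object, i = 1 by attribute, i = 2 by condition.
    """
    pairs = []
    for s in lines:
        t = transform(s)
        key = zlib.crc32(t[i].encode("utf-8")) % r
        pairs.append((key, t))
    return pairs
```

All triples that share the i-th component receive the same key, so they end up at the same first reducer.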

Distributed OAC-triclustering: First Reduce
Input: J is a list of triples (for a certain key);
       𝒯 = {T = (X, Y, Z)} is the current set of triclusters;
       PrimesOA, PrimesOC, PrimesAC.
Output: file of strings: encoded ⟨triple, tricluster⟩ pairs.
Primes ← initialise a new multimap
for all (g, m, b) ∈ J do
    Primes[g, m] := Primes[g, m] ∪ {b}
    Primes[g, b] := Primes[g, b] ∪ {m}
    Primes[m, b] := Primes[m, b] ∪ {g}
end for
for all (g, m, b) ∈ J do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    s := encode(⟨(g, m, b), T⟩)
    store s
end for

Distributed OAC-triclustering: Second Map
Input: S is a list of strings.
Output: T̃ is a list of ⟨tricluster, tricluster⟩ pairs.
Primes ← initialise a new multimap
I := ∅
for all s ∈ S do
    ⟨(g, m, b), T⟩ := decode(s)
    update Primes multimap appropriately
    I := I ∪ {(g, m, b)}
end for
for all (g, m, b) ∈ I do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    T̃ := T̃ ∪ {⟨T, T⟩}
end for

Distributed OAC-triclustering: Second Reduce
Input: T̂ is a list of ⟨tricluster, list of triclusters⟩ pairs.
Output: File with the final set of triclusters {T = (X, Y, Z)}.
for all ⟨T, [T, . . . , T]⟩ ∈ T̂ do
    store T
end for
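
The two stages can be simulated end to end in Python. The sketch below makes several assumptions that the slides leave open: JSON is used for `encode`/`decode`, `zlib.crc32` is the hash, "update Primes multimap appropriately" is read as merging each partial extent, intent, and modus into the global prime sets, and keys are tagged "OA", "OC", "AC" so that pairs of different types cannot collide in the single multimap.

```python
import json
import zlib
from collections import defaultdict


def first_map(triples, r, i):
    """Assign each triple to one of r reducers by hashing its i-th component."""
    buckets = defaultdict(list)
    for t in triples:
        buckets[zlib.crc32(t[i].encode("utf-8")) % r].append(t)
    return buckets


def first_reduce(bucket):
    """Compute prime sets inside one bucket; emit encoded <triple, tricluster> strings."""
    primes = defaultdict(set)
    for g, m, b in bucket:
        primes["OA", g, m].add(b)
        primes["OC", g, b].add(m)
        primes["AC", m, b].add(g)
    out = []
    for g, m, b in bucket:
        partial = [sorted(primes["AC", m, b]),
                   sorted(primes["OC", g, b]),
                   sorted(primes["OA", g, m])]
        out.append(json.dumps([[g, m, b], partial]))
    return out


def second_map(strings):
    """Merge partial prime sets, then emit one <tricluster, tricluster> pair per triple."""
    primes = defaultdict(set)
    triples = set()
    for s in strings:
        (g, m, b), (extent, intent, modus) = json.loads(s)
        primes["AC", m, b].update(extent)
        primes["OC", g, b].update(intent)
        primes["OA", g, m].update(modus)
        triples.add((g, m, b))
    pairs = []
    for g, m, b in triples:
        t = (frozenset(primes["AC", m, b]),
             frozenset(primes["OC", g, b]),
             frozenset(primes["OA", g, m]))
        pairs.append((t, t))
    return pairs


def second_reduce(pairs):
    """Grouping by key leaves each distinct tricluster exactly once."""
    return {key for key, _ in pairs}


def run(triples, r=4, i=0):
    buckets = first_map(triples, r, i)
    strings = [s for bucket in buckets.values() for s in first_reduce(bucket)]
    return second_reduce(second_map(strings))
```

Under this reading the merge is lossless: every element of a full prime set appears in the partial set produced by the bucket that holds its triple, so the union over all buckets restores the complete set regardless of the grouping index.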

Communication costs
The time complexity of the M/R solution is composed of two terms, one for
each stage: O(|I|/r) (or O(|I|)) and O(|I|).
The replication rate for the first M/R stage is r1 = 1 (each triple is passed as
one key-value pair), and the reducer size is q1 = |I|/r.
The replication rate for the second M/R stage is r2 = 1 (it assigns one
key-value pair to each tricluster), but the reducer size varies from q2^min = 1
(no duplicate triclusters) to q2^max = |I| (one final tricluster when all the
initial triples belong to one absolutely dense cuboid).

Experiments
OS X 10, 1.8 GHz Intel Core i5, 4 GB of 1600 MHz DDR3 memory, and 8 GB of free
space on the hard drive (typical commodity hardware).
Two M/R modes have been tested: sequential task completion and emulation of
distributed mode with 16 first reducers and 32 threads for the
second stage.
To evaluate the runtime more carefully, the average result of 5 runs of the
algorithms has been recorded for each context.

Experiments
Datasets
Synthetic datasets. 1) 20,000 triples (25 unique entities of each type); 2) 100,000 triples (50
unique entities of each type); 3) 1,000,000 triples (all possible combinations of 100 unique
entities of each type).
The 1st dataset contains duplicates, since 25 × 25 × 25 gives only 15,625 unique triples. The 2nd
one contains fewer triples than 50^3 = 125,000, the number of all possible combinations. The 3rd
one is an absolutely dense cuboid 100 × 100 × 100.
The 3rd dataset does not result in 3^min(|G|,|M|,|B|) formal triconcepts, but it is an example of the
worst-case scenario for the second reducer (q2^max = |I|).
IMDB. Top-250 list of the best movies from Internet Movie Database
Bibsonomy. The data of bibsonomy.org from ECML PKDD discovery challenge 2008.
Context     |G|     |M|      |B|      # triples   Density
20k         25      25       25       20,000      1
100k        50      50       50       100,000     0.8
1m          100     100      100      1,000,000   1
IMDB        250     795      22       3,818       0.00087
BibSonomy   2,337   67,464   28,920   816,197     1.8 · 10^−7

Experiments
Results
Algorithm/Context       IMDB            20k       100k      1m        Bibsonomy
                        (≈3k triples)   triples   triples   triples   (≈800k triples)
Tribox                  324             800       1,265     >3,000    >3,000
TRIAS                   189             362       862       >3,000    >3,000
OAC Box                 374             756       1,265     >3,000    >3,000
OAC Prime               7               8         734       >3,000    >3,000
Online OAC prime        3               3         3         5         >3,000
M/R OAC prime seq.      12              30        81        166       1,534
M/R OAC prime distr.    1               15        20        25        520

Alternative MapReduce decomposition
Variant I: First stage
First Map: Finding primes. During this phase every input triple (g, m, b) is
encoded by three key-value pairs ⟨(g, m), b⟩, ⟨(g, b), m⟩, and ⟨(m, b), g⟩. These
pairs are passed to the first reducer.
The replication rate is r1 = 3.
First Reduce: Finding primes. This reducer fills three corresponding dictionaries
for primes of keys. So, for example, the first dictionary, PrimeOA, contains
key-value pairs ⟨(g, m), {b1, b2, . . . , bn}⟩.
The reducer size is q1 = max(|G|, |M|, |B|)
The process can be stopped after the first reduce phase, and all the triclusters
can then be found as (Prime[g, m], Prime[g, b], Prime[m, b]) by enumerating
(g, m, b) ∈ I. However, to do this faster and keep the result for further
computation, M/R can be used for this step as well.
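
The first stage of Variant I can be sketched in Python as follows. The function names and the "OA"/"OC"/"AC" tags on the keys are choices made for this illustration (the tags keep the three kinds of pairs apart in one dictionary); the last function shows the option of stopping after the first reduce and enumerating the triples directly.

```python
from collections import defaultdict


def first_map(triple):
    """Encode one triple by three key-value pairs (replication rate 3)."""
    g, m, b = triple
    return [(("OA", g, m), b), (("OC", g, b), m), (("AC", m, b), g)]


def first_reduce(pairs):
    """Collect the values of each key into its prime set."""
    primes = defaultdict(set)
    for key, value in pairs:
        primes[key].add(value)
    return dict(primes)


def triclusters_by_enumeration(triples, primes):
    """Stop after the first reduce: look up the three prime sets of every triple.

    Components are returned in (extent, intent, modus) order.
    """
    return {
        (g, m, b): (primes["AC", m, b], primes["OC", g, b], primes["OA", g, m])
        for g, m, b in triples
    }
```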

Variant I: Second stage
Second Map: Tricluster generation. The second map combines triclusters: for
each triple (g, m, b) it composes a new key-value pair ⟨(g, m, b), ∅⟩.
For each pair of each type, ⟨(g, m), Prime[g, m]⟩, ⟨(g, b), Prime[g, b]⟩, and
⟨(m, b), Prime[m, b]⟩, it generates key-value pairs ⟨(g, m, b̃), Prime[g, m]⟩,
⟨(g, m̃, b), Prime[g, b]⟩, and ⟨(g̃, m, b), Prime[m, b]⟩, where g̃ ∈ G, m̃ ∈ M,
and b̃ ∈ B.
The replication rate is r2 = (|I| + 3|G||M||B|)/(|I| + |G||M| + |G||B| + |M||B|) ≤
(ρ + 3)/(ρ + 3/max(|G|, |M|, |B|)), where ρ is the input tricontext density.
(Dividing the numerator and denominator by |G||M||B| gives
(ρ + 3)/(ρ + 1/|G| + 1/|M| + 1/|B|), and each reciprocal is at least
1/max(|G|, |M|, |B|).)
Second Reduce: Tricluster generation. For each key (g, m, b), the generating
triple, the second reducer assembles a single value, its tricluster (Prime[g, m],
Prime[g, b], Prime[m, b]). If there is no key-value pair ⟨(g, m, b), ∅⟩ for a
particular triple (g, m, b), it outputs nothing for that key.
The reducer size q2 is either 3 (no output) or 4 (tricluster assembled).
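
A hedged Python sketch of this second stage is shown below. It assumes the prime sets arrive as a dictionary with keys tagged "OA", "OC", or "AC" (a convention introduced for the sketch, not taken from the slides), uses `None` for the ∅ marker, and returns each tricluster in (extent, intent, modus) order.

```python
from collections import defaultdict


def second_map(triples, primes, G, M, B):
    """Emit a marker per triple and spread every prime set over the missing index."""
    pairs = [((g, m, b), None) for g, m, b in triples]  # None stands for the ∅ marker
    for (kind, x, y), values in primes.items():
        value = (kind, frozenset(values))
        if kind == "OA":    # key (g, m): vary the condition
            pairs += [((x, y, b), value) for b in B]
        elif kind == "OC":  # key (g, b): vary the attribute
            pairs += [((x, m, y), value) for m in M]
        else:               # key (m, b): vary the object
            pairs += [((g, x, y), value) for g in G]
    return pairs


def second_reduce(pairs):
    """Assemble a tricluster only for keys that carry the ∅ marker."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    result = {}
    for triple, values in groups.items():
        if None not in values:
            continue  # (g, m, b) is not in I: no output for this key
        parts = dict(v for v in values if v is not None)
        result[triple] = (parts["AC"], parts["OC"], parts["OA"])
    return result
```

In this sketch only pairs with a non-empty prime set are spread, so a key outside I may receive fewer than three values; the sizes 3 and 4 quoted above correspond to the case where all three pairs exist.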

Variant II: Second stage
Second Map: Tricluster generation with duplicate generating triples.
The second map combines triclusters: for each triple (g, m, b) it
composes a new key-value pair:
⟨(Prime[g, m], Prime[g, b], Prime[m, b]), (g, m, b)⟩.
Second Reduce: Tricluster generation with duplicate generating triples.
The second reducer just groups values for each key: ⟨(X, Y, Z), {(g1, m1, b1), . . . ,
(gn, mn, bn)}⟩.
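
Variant II can be illustrated with the following Python sketch. As before, the tagged prime dictionary is an assumption of the sketch, and the key is built in (extent, intent, modus) order; grouping by that key collects all generating triples of each distinct tricluster.

```python
from collections import defaultdict


def second_map(triples, primes):
    """Emit <tricluster, generating triple> for every triple."""
    for g, m, b in triples:
        key = (frozenset(primes["AC", m, b]),
               frozenset(primes["OC", g, b]),
               frozenset(primes["OA", g, m]))
        yield key, (g, m, b)


def second_reduce(pairs):
    """Group the generating triples of each distinct tricluster."""
    groups = defaultdict(list)
    for key, triple in pairs:
        groups[key].append(triple)
    return dict(groups)
```

Compared with Variant I, the tricluster itself serves as the key, so duplicates are merged by the shuffle and the list of generating triples is kept for free.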

Conclusion and further work
A MapReduce implementation of prime OAC-triclustering has been proposed.
Communication costs have been analysed.
The online version has been compared with the M/R one.
Further experiments are needed with other M/R variants and other
triclustering algorithms.
A proper comparison of the proposed OAC-triclustering with noise-tolerant
patterns in n-ary relations, e.g., those found by DataPeeler descendants
[Cerf et al., 2013], has not yet been conducted.

Thank you!
Questions?
