Three-way join in one round on Hadoop

COMP 6231
GROUP 7
IRAJ HEDAYATISOMARIN, ZAKARIA NASERELDINE, JINYANG DU

Problem statement
R ⋈ S ⋈ T
In this part of the second project, we aim to compute a three-way join in a single round of MapReduce.
[Figure: relations R, S and T combined into R join S join T]

Algorithm Overview
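First relation: R(a, b)
Second relation: S(b, c)
Third relation: T(c, d)

Reducers are arranged as an imagined matrix, for example 4 × 4:
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

Mapper
Input: R,(a,b)  S,(b,c)  T,(c,d)
h(b) = x
h(c) = y
(x, y) is the coordinate of a reducer in the imagined matrix of reducers.
Output: <KEY, VALUE> = <(x, y), (relation_name, tuple)>

Reducer
In-memory join of the tuples received at coordinate (x, y)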

Mapping and Hashing
<KEY, VALUE>=<(X,Y), (relation_name, tuple)>
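relation_name: fetched from the input file name
tuple: exactly the same as the input tuple
h(b) = x
h(c) = y

Keys emitted for each input tuple (11 × 11 matrix of reducers):
First relation: R     (h(b), 1), (h(b), 2), …, (h(b), 11)
Second relation: S    (h(b), h(c))
Third relation: T     (1, h(c)), (2, h(c)), …, (11, h(c))

Reducer # = (x − 1) × (# of reducers per dimension) + y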
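
The mapping rule can be sketched in a few lines of Python. This is a minimal illustration, not the group's Hadoop code: the hash function `h` (CRC32 modulo the side length), the constant `SIDE` and the function names `map_tuple` and `reducer_number` are all assumptions made for this example. Only the key layout and the reducer-number formula come from the slides.

```python
import zlib
from typing import Iterator, Tuple

SIDE = 11  # reducers per dimension (11 x 11 = 121), as in the slides


def h(value, side: int = SIDE) -> int:
    """Hypothetical hash onto 1..side; the slides do not say which hash is used."""
    return zlib.crc32(str(value).encode()) % side + 1


def map_tuple(relation: str, tup: tuple, side: int = SIDE) -> Iterator[Tuple[tuple, tuple]]:
    """Yield <(x, y), (relation_name, tuple)> pairs for one input tuple."""
    if relation == "R":      # (a, b): only b is known, so send to the whole row x = h(b)
        x = h(tup[1], side)
        keys = [(x, y) for y in range(1, side + 1)]
    elif relation == "S":    # (b, c): both coordinates are known, so one reducer
        keys = [(h(tup[0], side), h(tup[1], side))]
    elif relation == "T":    # (c, d): only c is known, so send to the whole column y = h(c)
        y = h(tup[0], side)
        keys = [(x, y) for x in range(1, side + 1)]
    else:
        raise ValueError(f"unknown relation: {relation}")
    for key in keys:
        yield key, (relation, tup)


def reducer_number(x: int, y: int, side: int = SIDE) -> int:
    """Reducer # = (x - 1) * (reducers per dimension) + y, with x, y in 1..side."""
    return (x - 1) * side + y
```

With this layout, an S tuple (b, c) always lands on a reducer that also receives every R tuple with the same b and every T tuple with the same c, which is what makes a single round sufficient.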

In-memory join algorithm
NESTED LOOP JOIN
For each tuple in R
  For each tuple in S
    If R.b == S.b then
      For each tuple in T
        If S.c == T.c then
          Print (R.a, S.b, S.c, T.d)
Complexity: O(n^3)

SORT-BASED JOIN ALGORITHM
1. Divide the input list into three sorted lists (one per relation) using binary search: O(n log n)
2. Execute the in-memory join algorithm
• UNTIL R or S is empty DO
  • IF the first items in both lists are equal THEN
    • join all the tuples with the same value and remove them from the lists
  • ELSE
    • take the list with the smaller first item and remove items until reaching an item equal to or greater than the first item of the other list
Complexity:
1. Divide list: O(n log n)
2. In-memory join:
  1. R ⋈ S = O(n)
  2. RS ⋈ T = O(n)
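
A minimal Python sketch of the sort-based join at a single reducer is given here for illustration. It is not the group's implementation: the names `merge_join` and `three_way_join` are hypothetical, and Python's built-in `sorted` stands in for the binary-search insertion described above (both are O(n log n)).

```python
def merge_join(left, right, left_key, right_key):
    """Sort-merge join of two lists; returns pairs (l, r) whose keys are equal."""
    left = sorted(left, key=left_key)
    right = sorted(right, key=right_key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1            # skip left items smaller than the right front item
        elif lk > rk:
            j += 1            # skip right items smaller than the left front item
        else:
            # Find the run of equal keys on both sides and join all of them.
            i_end = i
            while i_end < len(left) and left_key(left[i_end]) == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append((l, r))
            i, j = i_end, j_end
    return out


def three_way_join(values):
    """values: iterable of (relation_name, tuple) pairs received by one reducer."""
    rel = {"R": [], "S": [], "T": []}
    for name, tup in values:
        rel[name].append(tup)
    # R(a, b) joined with S(b, c) on b
    rs = merge_join(rel["R"], rel["S"], lambda r: r[1], lambda s: s[0])
    # (R, S) pairs joined with T(c, d) on c
    rst = merge_join(rs, rel["T"], lambda p: p[1][1], lambda t: t[0])
    return [(r[0], s[0], s[1], t[1]) for (r, s), t in rst]
```

For example, `three_way_join([("R", (1, 2)), ("S", (2, 3)), ("T", (3, 4))])` returns `[(1, 2, 3, 4)]`, matching the (R.a, S.b, S.c, T.d) output of the nested loop version.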

Number of reducers
We decided to arrange the reducers as a square matrix. This choice constrains the number of reducers: in this case, 128 reducers were available, but only 11 × 11 = 121 of them were used.
On the other hand, choosing a different number of reducers in each dimension leads to more data replication and to inefficiency.
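
The 121-out-of-128 figure follows from taking the largest square that fits in the available reducers. A short sketch, with the hypothetical helper name `square_grid`:

```python
import math


def square_grid(available: int):
    """Return (side, used, idle) for the largest square matrix of reducers."""
    side = math.isqrt(available)   # largest integer whose square fits
    used = side * side
    return side, used, available - used
```

For 128 available reducers this gives a side of 11, with 121 reducers used and 7 left idle.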

Number of reducers (example 1, replication problem)
[Figure: non-square reducer matrix, columns 1 to 16, rows 1 to 4 shown]
# of reducers = 128
Assumption: R >> T, and both have a uniform distribution
T(R) = 1,000,000
T(T) = 1,000
For the square matrix:
Replicated data = 1,000,000 × 11 + 1,000 × 11 = 11,011,000
For the above matrix:
Replicated data = 1,000,000 × 16 + 1,000 × 16 = 16,016,000
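
The two totals can be reproduced with a one-line cost function. The function name `replicated_tuples` is hypothetical, and the copy counts (11 and 16 for both relations) are taken directly from the figures on this slide rather than derived independently.

```python
def replicated_tuples(size_r: int, size_t: int, copies_r: int, copies_t: int) -> int:
    """Total tuples shipped when each R tuple is sent copies_r times and each T tuple copies_t times."""
    return size_r * copies_r + size_t * copies_t


# Square 11 x 11 matrix: every R and T tuple is sent to 11 reducers.
square = replicated_tuples(1_000_000, 1_000, 11, 11)

# Matrix from the slide: the slide counts 16 copies for both relations.
non_square = replicated_tuples(1_000_000, 1_000, 16, 16)
```

Because R dominates the total, the cost is driven almost entirely by how many copies of each R tuple are made.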

Number of reducers (example 1, inefficiency problem)
[Figure: the same reducer matrix, columns 1 to 16, rows 1 to 4 shown, with some reducers marked IDLE and others FULL]
# of reducers = 128
Assumption: T >> R, and T is not uniformly distributed
T(R) = 1,000
T(T) = 1,000,000
When the hash range is reduced, it is more likely that two values hash to the same location.
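
A toy illustration of this effect, under assumptions that are not in the slides: the hash is a plain modulo, and the skewed key set consists of multiples of 4. The helper name `bucket_loads` is hypothetical.

```python
from collections import Counter


def bucket_loads(keys, buckets: int) -> Counter:
    """Count how many keys fall into each bucket under a simple modulo hash."""
    return Counter(k % buckets for k in keys)


# Assumed skewed data: 44 keys that are all multiples of 4.
keys = [4 * i for i in range(44)]

narrow = bucket_loads(keys, 4)    # range of 4: every key lands in bucket 0
wide = bucket_loads(keys, 11)     # range of 11: keys spread over all 11 buckets
```

With the range of 4, one bucket holds all 44 keys while the other three stay empty; with the range of 11, each bucket holds 4 keys. This mirrors the IDLE and FULL reducers in the figure.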

Experimental results
37 seconds
