Repartition join in MapReduce

• What is Reduce-side join?
• Steps used to join the datasets in Reduce-side
join
• Sample datasets used in this project
• Scenario flow
• Practical demonstration of Reduce-side join

• A join performed in the reduce phase, based on the join key,
is called a reduce-side join. Reduce-side joins are the
easiest kind of MapReduce join to implement.
• What makes reduce-side joins straightforward is the fact
that Hadoop sends identical keys to the same reducer, so
the data arrives already organized by key.
• To perform the join, we simply need to cache a key and
compare it to incoming keys. As long as the keys match, we
can join the values from the corresponding keys.
• The trade-off with reduce-side joins is performance, since
all of the data is shuffled across the network.
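
The "cache a key and compare it to incoming keys" idea can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the Tagged class, the "L"/"R" source tags, and the comma-joined output format are hypothetical, and the input list is assumed to be sorted by key already, with "L" records ahead of "R" records inside each key, which is the order a reducer would see after the shuffle and a secondary sort.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record: join key, source tag ("L" or "R"), and payload.
final class Tagged {
    final int key;
    final String source;
    final String value;

    Tagged(int key, String source, String value) {
        this.key = key;
        this.source = source;
        this.value = value;
    }
}

final class CachedKeyJoin {
    // Joins a list that is already sorted by key, with "L" records
    // ahead of "R" records inside each key (an assumption of this sketch).
    static List<String> join(List<Tagged> sortedByKey) {
        List<String> joined = new ArrayList<>();
        List<String> leftValues = new ArrayList<>();
        boolean haveKey = false;
        int cachedKey = 0;
        for (Tagged t : sortedByKey) {
            // A new key starts: replace the cached key and forget old values.
            if (!haveKey || t.key != cachedKey) {
                cachedKey = t.key;
                haveKey = true;
                leftValues.clear();
            }
            if ("L".equals(t.source)) {
                leftValues.add(t.value);
            } else {
                // The key matches the cached key, so the values can be joined.
                for (String left : leftValues) {
                    joined.add(t.key + "," + left + "," + t.value);
                }
            }
        }
        return joined;
    }
}
```

Only the values of the current key are held in memory. An "R" record whose key has no "L" record produces no output, so the sketch behaves like an inner join.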

• The map output key for every dataset being joined has to be the
join key, so that matching records reach the same reducer.
• Each record has to be tagged with the identity of its dataset in the
mapper, so that the reducer can tell the datasets apart and process
them accordingly.
• In each reducer, the values from both datasets, for the keys
assigned to that reducer, are available to be processed as required.
• A secondary sort is needed to guarantee the ordering of the
values sent to the reducer.
• If the input files are in different formats, we need separate
mappers, and the driver must use the MultipleInputs class to add
each input and associate it with its mapper.
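
A rough sketch of the tagging step for two inputs in different formats, written as plain Java functions rather than real Hadoop Mapper subclasses. The line layouts (comma-separated employee lines, pipe-separated salary lines), the tag values, and the tab-separated value format are all assumptions made for illustration. In an actual job each function body would live in its own mapper, registered for its input path with MultipleInputs.addInputPath in the driver.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

final class TaggingMappers {
    // Assumed tag values identifying the source dataset.
    static final int EMPLOYEE = 1;
    static final int SALARY = 2;

    // Hypothetical employee line: "empNo,name,dept" (comma separated).
    // Emits (join key, tag + TAB + name); the dept field is dropped.
    static Map.Entry<String, String> mapEmployee(String line) {
        String[] fields = line.split(",");
        return new SimpleEntry<>(fields[0].trim(), EMPLOYEE + "\t" + fields[1].trim());
    }

    // Hypothetical salary line: "empNo|salary" (pipe separated).
    // Emits (join key, tag + TAB + salary).
    static Map.Entry<String, String> mapSalary(String line) {
        String[] fields = line.split("\\|");
        return new SimpleEntry<>(fields[0].trim(), SALARY + "\t" + fields[1].trim());
    }
}
```

Both functions emit the same join key for the same employee, so the two records meet in one reducer, and the leading tag lets the reducer tell which dataset each value came from.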

1. Map output key
The key is empNo, since it is the join key for the employee and salary datasets
[Implementation: in the mapper]
2. Tagging the data with the dataset identity
Add an attribute called srcIndex to tag each record with its source dataset
(1=employee, 2=salary)
3. Discarding unwanted attributes
4. Composite key
Make the map output key a composite of empNo and srcIndex
[Implementation: create custom writable]
5. Partitioner
Partition the data on the natural key, empNo
[Implementation: create custom partitioner class]
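
The composite key and the natural-key partitioner might look roughly like this in plain Java. The class names EmpKey and NaturalKeyPartitioner are hypothetical. In a real job the key would be a custom writable implementing Hadoop's WritableComparable, and the partitioner would extend Hadoop's Partitioner class and override getPartition(key, value, numPartitions); this sketch leaves those types out so that it runs on the JDK alone.

```java
// Hypothetical composite key: natural key (empNo) plus source tag (srcIndex).
final class EmpKey {
    final int empNo;     // natural key, used for partitioning and grouping
    final int srcIndex;  // 1 = employee, 2 = salary

    EmpKey(int empNo, int srcIndex) {
        this.empNo = empNo;
        this.srcIndex = srcIndex;
    }
}

final class NaturalKeyPartitioner {
    // Chooses a reducer from empNo alone, ignoring srcIndex, so the employee
    // record and the salary records of one empNo land on the same reducer.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int getPartition(EmpKey key, int numPartitions) {
        return (Integer.hashCode(key.empNo) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

If the partitioner hashed the whole composite key instead, the two srcIndex values for one empNo could be sent to different reducers and the join would silently miss matches.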

6. Sorting
Sort the data on empNo first, and then on srcIndex
[Implementation: create custom sorting comparator
class]
7. Grouping
Group the data on the natural key, empNo
[Implementation: create custom grouping comparator
class]
8. Joining
Iterate through the values for a key and complete the
join for employee and salary data.
[Implementation: in the reducer]
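
Steps 6 to 8 can be simulated in memory with plain Java to show how the sort order, the grouping, and the join interact. Everything here is a sketch under assumptions: JoinRecord and RepartitionJoinSketch are hypothetical names, the payloads are made-up sample values, empNo is assumed to be unique in the employee dataset, and the join is treated as an inner join. In a real job the two comparators would be custom comparator classes registered on the Job with setSortComparatorClass and setGroupingComparatorClass, and the inner loop would be the body of the reducer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical map output record: composite key fields plus a payload.
final class JoinRecord {
    final int empNo;
    final int srcIndex;   // 1 = employee, 2 = salary
    final String payload;

    JoinRecord(int empNo, int srcIndex, String payload) {
        this.empNo = empNo;
        this.srcIndex = srcIndex;
        this.payload = payload;
    }
}

final class RepartitionJoinSketch {
    // Step 6: sort on empNo first, then on srcIndex, so that the employee
    // record (srcIndex 1) precedes the salary records (srcIndex 2).
    static final Comparator<JoinRecord> SORT =
        Comparator.comparingInt((JoinRecord r) -> r.empNo)
                  .thenComparingInt(r -> r.srcIndex);

    // Step 7: grouping compares the natural key only.
    static final Comparator<JoinRecord> GROUP =
        Comparator.comparingInt((JoinRecord r) -> r.empNo);

    // Step 8: walk each group once and join employee with salary data.
    static List<String> run(List<JoinRecord> mapOutput) {
        List<JoinRecord> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<String> joined = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            // One pass of this outer loop plays the role of one reduce call.
            JoinRecord first = sorted.get(i);
            String employee = first.srcIndex == 1 ? first.payload : null;
            int j = i;
            while (j < sorted.size() && GROUP.compare(first, sorted.get(j)) == 0) {
                JoinRecord r = sorted.get(j);
                if (r.srcIndex == 2 && employee != null) {
                    joined.add(r.empNo + "," + employee + "," + r.payload);
                }
                j++;
            }
            i = j;
        }
        return joined;
    }
}
```

Because the employee record always comes first within a group, the reducer only has to remember one employee payload per key and can stream the salary records past it. Salary records with no matching employee are dropped.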
