Repartition join in MapReduce
• What is a reduce-side join?
• Steps used to join the datasets in a reduce-side join
• Sample datasets used in this project
• Scenario flow
• Practical demonstration of a reduce-side join
• A join of datasets performed in the reduce phase, based on the join key, is called a reduce-side join. Reduce-side joins are the easiest to implement.
• What makes reduce-side joins straightforward is that Hadoop sends identical keys to the same reducer, so by default the data is organized for us.
• To perform the join, we simply need to cache a key and compare it to incoming keys. As long as the keys match, we can join the values from the corresponding keys.
• The trade-off with reduce-side joins is performance, since all of the data is shuffled across the network.
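The idea can be sketched in plain Python. This is a simulation only, not Hadoop code: the sample records, the `partition` function (a simple modulo on the key) and the reducer count are assumptions made for illustration. It shows that once the reducer is chosen from the key alone, records with the same key from both datasets end up together and can be joined there.

```python
from collections import defaultdict

# Hypothetical sample records: (join key, payload).
employees = [(101, "Alice"), (102, "Bob"), (103, "Carol")]
salaries = [(101, 5000), (102, 6000), (104, 7000)]

NUM_REDUCERS = 2  # assumed reducer count


def partition(key, num_reducers):
    """Pick a reducer from the key alone, so equal keys always land together."""
    return key % num_reducers


def shuffle(records, num_reducers):
    """Simulate the shuffle: route each record to a reducer, then group by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


def reduce_side_join(employees, salaries, num_reducers=NUM_REDUCERS):
    # Tag each value with its source so the reducer can tell them apart.
    tagged = [(k, ("emp", v)) for k, v in employees]
    tagged += [(k, ("sal", v)) for k, v in salaries]
    joined = []
    for reducer in shuffle(tagged, num_reducers):
        for key in sorted(reducer):
            names = [v for tag, v in reducer[key] if tag == "emp"]
            amounts = [v for tag, v in reducer[key] if tag == "sal"]
            # Inner join: emit only keys present in both datasets.
            joined += [(key, n, a) for n in names for a in amounts]
    return sorted(joined)


result = reduce_side_join(employees, salaries)
print(result)
```

Every record from both inputs passes through `shuffle`, including keys 103 and 104 that never find a match, which is the performance cost described above.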
• The map output key for each dataset being joined has to be the join key, so that matching records reach the same reducer.
• Each dataset has to be tagged with its identity in the mapper, to help differentiate between the datasets in the reducer so they can be processed accordingly.
• In each reducer, the values from both datasets, for the keys assigned to that reducer, are available to be processed as required.
• A secondary sort needs to be done to ensure the ordering of the values sent to the reducer.
• If the input files are of different formats, we need separate mappers, and we need to use the MultipleInputs class in the driver to add the inputs and associate each input with its specific mapper.
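A small plain-Python sketch of the tagging step, again a simulation rather than Hadoop code. The two file formats (comma-separated employee lines, tab-separated salary lines), the tag strings and the function names are all assumptions for illustration; the `inputs` list merely stands in for the driver wiring each input to its own mapper.

```python
from collections import defaultdict


def employee_mapper(line):
    """Parse an assumed comma-separated line: empNo,name,department."""
    emp_no, name, _dept = line.split(",")
    return emp_no, ("EMP", name)  # key = join key, value tagged with its source


def salary_mapper(line):
    """Parse an assumed tab-separated line: empNo<TAB>salary."""
    emp_no, salary = line.split("\t")
    return emp_no, ("SAL", int(salary))


# Stand-in for the driver: each input is paired with its own mapper.
inputs = [
    (["101,Alice,Sales", "102,Bob,IT"], employee_mapper),
    (["101\t5000", "102\t6000"], salary_mapper),
]

map_output = [mapper(line) for lines, mapper in inputs for line in lines]

# Group map output by join key, as the framework would before the reducer runs.
grouped = defaultdict(list)
for key, tagged_value in map_output:
    grouped[key].append(tagged_value)


def reducer(key, values):
    """Use the tag to separate the two datasets, then pair them up."""
    names = [v for tag, v in values if tag == "EMP"]
    amounts = [v for tag, v in values if tag == "SAL"]
    return [(key, n, a) for n in names for a in amounts]


joined = [row for key in sorted(grouped) for row in reducer(key, grouped[key])]
print(joined)
```

Without a secondary sort, this reducer has to hold all values for a key in memory before it can pair them, which is what the ordering requirement above is meant to avoid.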
1. Map output key
The key will be empNo, as it is the join key for the employee and salary datasets
[Implementation: in the mapper]
2. Tagging the data with the dataset identity
Add an attribute called srcIndex to tag the identity of the data (1=employee, 2=salary)
[Implementation: in the mapper]
3. Discarding unwanted attributes
[Implementation: in the mapper]
4. Composite key
Make the map output key a composite of empNo and srcIndex
[Implementation: create custom writable]
5. Partitioner
Partition the data on the natural key, empNo
[Implementation: create custom partitioner class]
6. Sorting
Sort the data on empNo first, and then on srcIndex
[Implementation: create custom sorting comparator class]
7. Grouping
Group the data based on the natural key, empNo
[Implementation: create custom grouping comparator class]
8. Joining
Iterate through the values for a key and complete the join for employee and salary data.
[Implementation: in the reducer]
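The eight steps can be walked through end to end in a plain-Python simulation. This is a sketch under stated assumptions, not the Hadoop implementation: the sample rows, the function names and the modulo partitioner are hypothetical, and Python tuples stand in for the custom writable and comparator classes.

```python
from itertools import groupby

EMPLOYEE, SALARY = 1, 2  # srcIndex values from step 2

# Hypothetical raw records; the department field is dropped in step 3.
employee_rows = [(102, "Bob", "IT"), (101, "Alice", "Sales")]
salary_rows = [(101, 5000), (102, 6000), (101, 5500)]


def map_employee(row):
    emp_no, name, _dept = row          # step 3: discard unwanted attributes
    return (emp_no, EMPLOYEE), name    # steps 1, 2, 4: composite key


def map_salary(row):
    emp_no, amount = row
    return (emp_no, SALARY), amount


def partition(composite_key, num_reducers):
    emp_no, _src_index = composite_key  # step 5: natural key only
    return emp_no % num_reducers


def reduce_join(emp_no, tagged_values):
    """Step 8: the employee value arrives first thanks to the secondary sort."""
    name, out = None, []
    for src_index, value in tagged_values:
        if src_index == EMPLOYEE:
            name = value
        elif name is not None:          # inner join: skip unmatched salaries
            out.append((emp_no, name, value))
    return out


def run_job(num_reducers=2):
    pairs = [map_employee(r) for r in employee_rows]
    pairs += [map_salary(r) for r in salary_rows]
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    results = []
    for bucket in buckets:
        # Step 6: tuple ordering sorts on empNo first, then srcIndex.
        bucket.sort(key=lambda kv: kv[0])
        # Step 7: group on the natural key, ignoring srcIndex.
        for emp_no, group in groupby(bucket, key=lambda kv: kv[0][0]):
            tagged = [(key[1], value) for key, value in group]
            results.extend(reduce_join(emp_no, tagged))
    return results


print(sorted(run_job()))
```

Because the employee record (srcIndex 1) sorts ahead of the salary records (srcIndex 2) within each group, `reduce_join` only needs to remember one name while it streams through the salaries.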