View MR Design Patterns course details at www.edureka.co/mapreduce-design-patterns
Application of the Join Pattern
MapReduce Design Patterns
Slide 2
Objectives
At the end of this module, you will be able to understand:
Why design patterns are used in MR
Who should know MapReduce design patterns
The design patterns available in MR
The Join pattern
Slide 3
Why Design Patterns in MR?
General, reusable, optimized solutions to the most common problems
Templates for solving problems that recur in different situations
Speed up the development process
Tried and tested design principles
An initial guideline for solving the most common problems in MR
Help build sophisticated, well-optimized solutions
Slide 4
Who should know MR Design Patterns?
A Java developer who wants to explore the world of Big Data
A MapReduce programmer who wants to deepen their MR expertise
Anyone who aims to become a Hadoop Architect
Slide 5
Available Design Patterns in MR
Summarization Pattern
Filtering Pattern
Data Organization Pattern
Join Pattern
Meta Pattern
Input & Output Pattern
Slide 6
Join Patterns – What is it?
• Datasets generally exist in multiple sources
• Deriving their full value requires merging them together
• Join patterns are used for this purpose
• Performing joins on the fly on Big Data can be costly in terms of time
Example: Joining StackOverflow data from Comments and Posts on UserId
Slide 7
Join Patterns – What is it? (Contd.)
• The join patterns we will talk about are:
» Reduce Side Join / Repartition Join
» Reduce Side Join with Bloom Filter
» Replicated Join
» Composite Join
» Cartesian Product
Slide 8
Join – Refresher
• Inner Join
• Outer Join
» Left Outer Join
» Right Outer Join
» Full Outer Join
• Anti Join
• Cartesian Product
Slide 9
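As a refresher aid, the join types listed above can be illustrated with a small plain-Python sketch. This is not Hadoop code. The sample data, the function names, and the one-record-per-key simplification are all assumptions made for illustration. "Anti join" is taken here as the full outer join minus the inner join, which is one common definition; SQL engines often use the term for left-side rows with no match.

```python
from itertools import product

# Hypothetical sample data, keyed by user ID (one record per key on each side).
users = {1: "Pune", 2: "Austin", 3: "Oslo"}   # userId -> location
comments = {2: 5, 3: 1, 4: 7}                 # userId -> upVotes


def inner_join(left, right):
    # Keep only keys present on both sides.
    return {k: (left[k], right[k]) for k in left if k in right}


def left_outer_join(left, right):
    # Keep every left key; a missing right side becomes None.
    return {k: (left[k], right.get(k)) for k in left}


def right_outer_join(left, right):
    # Keep every right key; a missing left side becomes None.
    return {k: (left.get(k), right[k]) for k in right}


def full_outer_join(left, right):
    # Keep every key from either side.
    return {k: (left.get(k), right.get(k)) for k in left.keys() | right.keys()}


def anti_join(left, right):
    # Assumed definition: full outer join minus inner join.
    matched = left.keys() & right.keys()
    return {k: v for k, v in full_outer_join(left, right).items() if k not in matched}


def cartesian_product(left, right):
    # Pair every left record with every right record, ignoring keys.
    return list(product(left.items(), right.items()))
```

With this data, the inner join keeps users 2 and 3, the left outer join adds user 1 with an empty comment side, and the anti join keeps only users 1 and 4.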
Reduce Side Join – Description
• Easiest to implement, but can take the longest to execute
• Supports all types of join operations
• Can join multiple data sources, but is expensive in terms of network resources and time
• All data is transferred across the network
Example: Join PostLinks table data in StackOverflow to Posts data
Slide 10
Reduce Side Join – Description (Contd.)
• Applicability – use it when:
» Multiple large data sets need to be joined (if one of the data sources is small, consider a replicated join instead)
» Different data sources are linked by a foreign key
» You want all join operations to be supported
Slide 11
Reduce Side Join – Structure
Slide 12
Reduce Side Join – Structure (Contd.)
• Mapper
» The output key should be the foreign key
» The value can be the whole record plus an identifier for the source it came from
» Use projection to output only the required fields
• Combiner
» Not required; it provides no additional benefit
Slide 13
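The mapper responsibilities above can be sketched in plain Python as follows. This is a simulation rather than the Hadoop Mapper API. The record layout (dicts with ID, UserID, Location and upVotes fields), the tags "U" and "C", and the function names are hypothetical choices for illustration.

```python
def map_user(record):
    # Key = the users-side join key; value = source tag "U" plus only
    # the field the join needs (projection drops everything else).
    yield record["ID"], ("U", {"Location": record["Location"]})


def map_comment(record):
    # Key = the foreign key pointing at users; value = source tag "C"
    # plus the projected field.
    yield record["UserID"], ("C", {"upVotes": record["upVotes"]})


# Hypothetical input records; the extra fields are dropped by projection.
user = {"ID": 7, "Location": "Pune", "DisplayName": "asha", "Age": 30}
comment = {"UserID": 7, "upVotes": 3, "Text": "Nice answer"}

mapped = list(map_user(user)) + list(map_comment(comment))
```

Both records are emitted under the same key (7), so the framework would route them to the same reducer, and the tag lets the reducer tell the two sources apart.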
Reduce Side Join – Structure (Contd.)
• Partitioner
» Use a custom partitioner if required
• Reducer
» Reducer logic depends on the type of join required
» The reducer receives the data from all the different sources for each key
Slide 14
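A minimal plain-Python sketch of the reducer side, assuming the mappers emitted (key, (tag, value)) pairs with the hypothetical tags "L" and "R" for the left and right sources. The shuffle function stands in for the grouping the framework performs between map and reduce; none of these names come from the Hadoop API.

```python
from collections import defaultdict


def shuffle(mapped_pairs):
    # Stand-in for the framework's shuffle: group all tagged values by key.
    groups = defaultdict(list)
    for key, tagged in mapped_pairs:
        groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values, join_type="inner"):
    # Split the values for this key back into their two sources.
    left = [v for tag, v in tagged_values if tag == "L"]
    right = [v for tag, v in tagged_values if tag == "R"]

    # Outer joins pad the missing side with None so the key still appears.
    if join_type in ("left", "full") and not right:
        right = [None]
    if join_type in ("right", "full") and not left:
        left = [None]

    # Emit every left/right combination for the key; if one side is still
    # empty (inner join with no match), nothing is emitted.
    return [(key, l, r) for l in left for r in right]


pairs = [(1, ("L", "a")), (2, ("L", "b")), (2, ("R", "x")),
         (2, ("R", "y")), (3, ("R", "z"))]
groups = shuffle(pairs)
```

The only thing that changes between join types is how an empty side is handled, which is why a single reducer design can support all of them.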
Reduce Side Join – Analogy
• Resemblances
» SQL
SELECT users.ID, users.Location, comments.upVotes
FROM users
[INNER|LEFT|RIGHT] JOIN comments
ON users.ID = comments.UserID
» Pig
» Supports inner & outer joins
» Inner Join
A = JOIN comments BY userID, users BY userID;
» Outer Join
A = JOIN comments BY userID [LEFT|RIGHT|FULL] OUTER, users BY userID;
Slide 15
Reduce Side Join – Performance
• Performance
» All of the data moves across the network to the reducers
» You can optimize by using projection and sending only the required fields
» The number of reducers is typically higher than normal
» If any other join type fits your problem, use that instead
Slide 16
Reduce Side Join – Use Cases
• Join tweets with users' personal information for behavioral analysis
• Join the PostLinks and Posts tables from StackOverflow to have all related posts in one place
Slide 17
Reduce Side Join Example – Problem
• Your dataset is the StackOverflow dataset. Look at the PostLinks.xml and Posts.xml files. Join the two tables on PostId in PostLinks and Id in Posts
» Use the MultipleInputs class
» Use projection on PostLinks to output only the PostId and RelatedPostId fields
Slide 18
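Before the demo, the shape of a solution can be sketched in plain Python. This is a simulation, not the Hadoop job itself: the loop over (input, mapper) pairs only mimics what MultipleInputs does by giving each input its own mapper. The sample rows are made up, and the sketch assumes each line of the dump is a `<row ... />` element carrying its fields as attributes. Projecting Title from Posts is an illustrative choice; the problem statement only fixes the projection on PostLinks.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample rows (not taken from the real dump).
POSTS = [
    '<row Id="10" PostTypeId="1" Title="How do joins work?" />',
    '<row Id="11" PostTypeId="1" Title="What is a combiner?" />',
]
POST_LINKS = [
    '<row Id="1" PostId="10" RelatedPostId="42" LinkTypeId="1" />',
    '<row Id="2" PostId="10" RelatedPostId="57" LinkTypeId="1" />',
]


def map_post(line):
    # Key = Posts.Id; value is tagged "P" so the reducer knows the source.
    row = ET.fromstring(line).attrib
    yield row["Id"], ("P", row.get("Title"))


def map_post_link(line):
    # Key = PostLinks.PostId; projection keeps only RelatedPostId.
    row = ET.fromstring(line).attrib
    yield row["PostId"], ("L", row["RelatedPostId"])


def reduce_inner(post_id, tagged_values):
    # Inner join: emit output only when both sources have records for the key.
    posts = [v for tag, v in tagged_values if tag == "P"]
    links = [v for tag, v in tagged_values if tag == "L"]
    return [(post_id, title, related) for title in posts for related in links]


def run_job():
    groups = defaultdict(list)
    # Stand-in for MultipleInputs: each input is read by its own mapper.
    for lines, mapper in ((POSTS, map_post), (POST_LINKS, map_post_link)):
        for line in lines:
            for key, value in mapper(line):
                groups[key].append(value)
    # Stand-in for the reduce phase, visiting keys in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_inner(key, groups[key]))
    return output
```

Post 11 has no links in the sample data, so an inner join drops it; a left outer variant would keep it with an empty related-post field.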
DEMO
Reduce Side Join Example
Slide 19
Questions
Slide 20

Mrdp reduce side_join