Module 04 - MapReduce Framework
NPN Training
Training is the essence of success & we are committed to it
www.npntraining.com
Course Topics
Module - I: Understanding Big Data
Module - II: Hadoop 1.x & 2.x Architecture
Module - III: Hadoop Setup and Configuration
Module - IV: MapReduce Framework – I
Module - V: Hive and Hive Query Language
Module - VI: Advanced Hive using Java
Module - VII: MapReduce Framework – II
Module - VIII: NoSQL & HBase
Module - IX: Advanced HBase using Java
Module - X: MapReduce Framework – III
Module - XI: MRUnit
Module - XII: Pig and Pig Latin
Module - XIII: Advanced Pig using Java
Module - XIV: Sqoop
Module - XV: Hue
Module - XVI: Project Discussion & Use case
www.npntraining.com/courses/big-data-and-hadoop.php
Topics for the Module
Map, Reduce Paradigm
Overview of Record Reader & Input Splits
Executing MapReduce programs
Data Flow in MapReduce
What is MapReduce Framework
Word Count Implementation
Relation between InputSplits & HDFS Block
Role of Key-Value Pairs
Exploring different command line options in executing MapReduce programs
Hadoop Datatypes
Hadoop MapReduce Framework
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
In the MapReduce programming model, work is divided into two phases:
a) Map Phase
b) Reduce Phase
The Map phase takes a piece of input and performs some operation on it (e.g. extracting a field).
The Reduce phase aggregates similar pieces of information produced by the Map phase (e.g. averaging fields with the same name).
These pieces of information are represented as key-value pairs.
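As a rough illustration of the two phases, the plain-Java sketch below imitates them in memory without using Hadoop at all. The record layout (`name,department,salary`), the class name `AverageByKey` and the sample records are assumptions made up for this example; the map step extracts a field and the reduce step averages the values that share a key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the map and reduce phases; this is not the Hadoop API.
class AverageByKey {

    // Map phase: extract two fields from one record and emit a (key, value) pair.
    // Assumed record layout: name,department,salary
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        return new SimpleEntry<>(fields[1], Integer.parseInt(fields[2]));
    }

    // Reduce phase: aggregate all values that share a key, here by averaging.
    static double reduce(String key, List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    static Map<String, Double> run(List<String> records) {
        // Group the map output by key; in Hadoop the framework does this step.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Integer> pair = map(record);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample records.
        List<String> records = Arrays.asList(
                "asha,sales,3000", "ravi,sales,2000", "meena,hr,9000", "john,hr,6000");
        System.out.println(run(records));
    }
}
```

With these sample records the average is 2500.0 for `sales` and 7500.0 for `hr`.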
Developer writes the Map Class and the Reduce Class
Employee.dat (100 MB) is stored as two blocks: 64 MB and 36 MB
Each block is read through an InputSplit by its own Map Task
Map Class
1 Read the data from input file
2 Write business logic for processing the data
3 Send the result (output). Intermediate data is temporary data stored on the local FS
Hadoop Framework does the sort and shuffling of Map data to the Reducer
Reduce Class
1 Read all the output of the maps
2 Aggregation or consolidation logic
3 Final output to HDFS (blocks and replication will be there)
Values shown in the diagram: 3000 2000 9000 6000
Sample.dat (100 MB), stored as two blocks: 64 MB and 36 MB
Well come to NPN Training
we promise we teach you the best
We teach various course like Java,J2EE,Selenium,Hadoop
Each time, the Map Task reads one key-value pair
Key --> Byte offset of the line within the file
Value --> Entire line
Block 1
0, Well come to NPN Training
26, we promise we teach you the best
Block 2
59, We teach various course like Java,J2EE,Selenium,Hadoop
(Keys are offsets from the start of the file, so they do not restart at 0 in the second block. Each offset counts the newline that ends the previous line.)
By default Hadoop uses the TextInputFormat class, which is responsible for creating InputSplits and dividing them into records.
TextInputFormat creates the key-value pairs.
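To make the offset keys concrete, here is a small plain-Java sketch that produces (byte offset, line) pairs for the three sample lines. It only imitates what a line-oriented input format hands to the mapper. The class `OffsetRecords` and its method are hypothetical names, and the sketch assumes `\n` line endings and UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Imitates the (byte offset, line) records of a line-oriented input format.
// Plain Java only; Hadoop's TextInputFormat is not used here.
class OffsetRecords {

    static Map<Long, String> toRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // Next key = current offset + line length in bytes + 1 byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "Well come to NPN Training\n"
                + "we promise we teach you the best\n"
                + "We teach various course like Java,J2EE,Selenium,Hadoop\n";
        toRecords(text).forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```

For this input the keys come out as 0, 26 and 59, because each key counts the newline that ends the previous line.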
Word Count Use Case
Let’s assume we have a large collection of text documents in a folder (say, 1000 documents with an average of 1 million words each).
We have to count how many times each word is repeated in the documents.
How would you solve this using a simple Java program?
How many lines of code would you write?
How long would the program take to run?
MapReduce Paradigm
Input
C++ J2EE Python LISP
Python Java Python
JSP Python LISP
Python Servlet JSP
Input Split 1, (K1,V1) pairs created by the framework
< 0, C++ J2EE Python LISP >
< 21, Python Java Python >
Mapper 1 output, List (K2,V2)
<C++,1> <J2EE,1> <Python,1> <LISP,1> <Python,1> <Java,1> <Python,1>
Input Split 2, (K1,V1) pairs created by the framework
< 0, JSP Python LISP >
< 16, Python Servlet JSP >
Mapper 2 output, List (K2,V2)
<JSP,1> <Python,1> <LISP,1> <Python,1> <Servlet,1> <JSP,1>
Shuffle & Sort (framework), Reducer input (K2, list(V2))
<C++, [1]>
<Java, [1]>
<J2EE, [1]>
<LISP, [1,1]>
<Python, [1,1,1,1,1]>
<JSP, [1,1]>
<Servlet, [1]>
Reducer output, List (K3,V3)
(C++, 1)
(Java, 1)
(J2EE, 1)
(LISP, 2)
(Python, 5)
(JSP, 2)
(Servlet, 1)
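The flow on this slide can be imitated in a few lines of plain Java. The sketch below is not Hadoop code: the class `WordCountSketch` and its `map`, `shuffleAndSort` and `reduce` methods are hypothetical stand-ins for the mapper, the framework's shuffle and sort step, and the reducer, and everything runs in a single JVM.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map -> shuffle & sort -> reduce for word count.
// Plain Java only; none of these methods belong to the Hadoop API.
class WordCountSketch {

    // Map: (K1 = offset, V1 = line) -> list of (K2 = word, V2 = 1).
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle & sort: group values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: (K2 = word, list(V2)) -> V3 = total count for that word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            intermediate.addAll(map(offset, line));
            offset += line.length() + 1; // assumes single-byte characters and '\n' endings
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "C++ J2EE Python LISP", "Python Java Python",
                "JSP Python LISP", "Python Servlet JSP")));
    }
}
```

Running it on the four input lines gives, among others, a count of 5 for Python and 2 for LISP. In a real job the map calls would run as separate tasks on different nodes.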
Word Count Implementation
Map Class
Reduce Class
Driver Class
Why MapReduce
Processing data in parallel
Taking processing to the data
Diagram labels: Map Task, HDFS Block, Node, Rack, Data Center
Input Splits
InputSplits: Logical Division
HDFS Blocks: Physical Division
Relation Between Input Splits and HDFS Blocks
Records: 1 2 3 4 5 6 7 8 9 10 11
Blocks: 64 MB, 64 MB, 64 MB
Blocks are cut in between the records.
The last record of a block may cross the block boundary.
Splits are aware of the record positions.
InputFormat is responsible for creating InputSplits and dividing them into records.
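One way to see how a record that crosses a boundary is still read exactly once is the sketch below. It follows the convention that, to my understanding, Hadoop's line-based record reader uses: a split that does not start at byte 0 skips its first (possibly partial) line, and every split keeps reading until it finishes the line that starts at or before its end. The class `SplitLineReader` and the tiny byte ranges are hypothetical, chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of reading whole lines from a byte range [start, end) of a file.
// Plain Java only; this is an imitation, not Hadoop's record reader.
class SplitLineReader {

    // Index just past the next '\n' at or after 'from' (or the end of the data).
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // The previous split reads the line that reaches into this range.
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        // Read every line that begins at or before 'end', even if it ends later.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        // Two "splits" whose boundary (byte 5) falls in the middle of "bbbb".
        System.out.println(readSplit(data, 0, 5));
        System.out.println(readSplit(data, 5, data.length));
    }
}
```

The first call returns `aa` and `bbbb` in full, and the second returns `cc` and `dddd`, so no record is lost or read twice.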
Relation Between Input Splits and HDFS Blocks
Diagram labels: Block, InputSplits, Map Task
A Block is the physical representation of data. A Split is the logical representation of the data present in Blocks.
Block size and split size can be changed through configuration properties.
Map reads data from a Block through Splits, i.e. the Split acts as a broker between the Block and the Mapper.
Suppose the map reads Block 1 from record aa to record JJ, and the last record continues into Block 2. A Block knows nothing about records, so the map cannot tell from the Block alone where that record ends. This is where the Split comes in: it forms a logical grouping over Block 1 and Block 2, the InputFormat and RecordReader turn it into offset (key) and line (value) pairs, and these pairs are sent to the map for further processing.
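How the block size and the split-size settings interact can be sketched as follows. The formula and the 1.1 "slop" factor reflect my understanding of Hadoop's file-based input formats, and the properties usually involved are `dfs.blocksize`, `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`. Treat the class `SplitPlanner` as a simplified illustration and not as the framework's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of how split boundaries can be derived from file length,
// block size and the configured minimum and maximum split size.
class SplitPlanner {

    // Assumed tolerance: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns one {start, length} pair per split.
    static List<long[]> plan(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1, Long.MAX_VALUE);
        for (long[] s : plan(100 * mb, splitSize)) {
            System.out.println("start=" + s[0] / mb + " MB, length=" + s[1] / mb + " MB");
        }
    }
}
```

With a 64 MB block size and default limits, a 100 MB file yields two splits of 64 MB and 36 MB, matching the earlier slides. Raising the minimum split size above the block size would produce fewer, larger splits.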
Hadoop MapReduce: A Closer Look
Node 1 and Node 2 each run the same pipeline:
Files loaded from local HDFS store
InputFormat
Split, Split, Split
RecordReaders (RR), one per Split
Input (K, V) pairs
Map, one per Split
Intermediate (K, V) pairs
Partitioner
Shuffling process: intermediate (K, V) pairs exchanged by all nodes
Sort
Reduce
Final (K, V) pairs
OutputFormat
Writeback to local HDFS store
Data Flow in MapReduce
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data exchange between nodes in a “shuffle” process
Intermediate data of the same key goes to the same reducer
Reducer output is stored
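The rule that intermediate data with the same key goes to the same reducer is usually achieved by hashing the key. The sketch below shows the idea in plain Java. The masking-and-modulo formula matches what I understand Hadoop's default hash partitioner to do, but the class `HashPartitionSketch` and its methods are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of hash partitioning: every occurrence of a key maps to one reducer index.
class HashPartitionSketch {

    static int partition(String key, int numReducers) {
        // Mask the sign bit so that a negative hashCode still gives a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups intermediate keys by the reducer index they are assigned to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            byReducer.computeIfAbsent(partition(key, numReducers), r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Python", "Java", "Python", "LISP", "JSP", "Python");
        System.out.println(assign(keys, 2));
    }
}
```

Because the index depends only on the key and the number of reducers, every mapper on every node sends a given key to the same reducer without any coordination.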
Agenda for Next Class
Overview of Hive and its Architecture
Understanding Hive Metastore
Schema on Read and Schema on Write
Hive Data Model and Complex Data types
Internal vs External Tables
Exporting Data from Hive
www.npntraining.com +91 9535584691
