Module IV - MapReduce Programming - I

Module 04 - MapReduce Framework
NPN Training
Training is the essence of success & we are committed to it
www.npntraining.com

Course Topics
Module - I: Understanding Big Data
Module - II: Hadoop 1.x & 2.x Architecture
Module - III: Hadoop Setup and Configuration
Module - IV: MapReduce Framework – I
Module - V: Hive and Hive Query Language
Module - VI: Advance Hive using Java
Module - VII: MapReduce Framework – II
Module - VIII: No SQL & HBase
Module - IX: Advance HBase using Java
Module - X: MapReduce Framework – III
Module - XI: MR Unit
Module - XII: Pig and Pig Latin
Module - XIII: Advance Pig using Java
Module - XIV: Sqoop
Module - XV: Hue
Module - XVI: Project Discussion & Use case
www.npntraining.com/courses/big-data-and-hadoop.php

Topics for the Module
What is MapReduce Framework
Map, Reduce Paradigm
Role of Key-Value Pairs
Word Count Implementation
Hadoop Datatypes
Overview of Record Reader & Input Splits
Relation between InputSplits & HDFS Blocks
Data Flow in MapReduce
Executing MapReduce programs
Exploring different command line options in executing MapReduce programs

Hadoop MapReduce Framework
MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster.
In the MapReduce programming model, work is divided into two phases:
a) Map Phase
b) Reduce Phase
The Map phase takes a piece of input and performs some operation on it (e.g. extracting a field).
The Reduce phase aggregates similar pieces of information produced by the Map phase
(e.g. averaging fields with the same name).
These pieces of information are represented as key-value pairs.
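To make the two phases concrete, here is a minimal sketch in plain Java, without Hadoop. The map step extracts a field from each record and the reduce step averages the fields that share a name. The class name `AverageByKeySketch`, the `name,value` record layout and the sample data are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the Map and Reduce phases (no Hadoop involved).
class AverageByKeySketch {

    // Map phase: extract the fields of interest from one "name,value" record.
    static Map.Entry<String, Double> map(String record) {
        String[] fields = record.split(",");
        return Map.entry(fields[0].trim(), Double.parseDouble(fields[1].trim()));
    }

    // Reduce phase: aggregate all values that share the same key.
    static double reduce(String key, List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    static Map<String, Double> run(List<String> records) {
        // Group the map output by key; in Hadoop the framework performs this step.
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Double> kv = map(record);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample records.
        List<String> records = List.of("salary,3000", "salary,2000", "bonus,9000", "bonus,6000");
        System.out.println(run(records));
    }
}
```

Each record passes through `map` independently, which is what allows the Map phase to run in parallel; only `reduce` needs to see every value for a key together.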

Developer: Map and Reduce
Employee.dat (100 MB) is stored as two blocks: 64 MB and 36 MB.
Each block is presented as an InputSplit, and each InputSplit is processed by its own Map Task.
Map Class
1 Read the data from the input file
2 Write the business logic for processing the data
3 Send the result (output) as intermediate data (temporary data stored on the local FS)
The Hadoop framework sorts and shuffles the map data to the Reducer.
Reduce Class
1 Read all the output of the maps
2 Aggregation or consolidation logic
3 Final output to HDFS (blocks and replication will be there)
Sample values shown in the figure: 3000, 2000, 9000, 6000

Sample.dat
100 MB, stored as two blocks (64 MB and 36 MB)
Well come to NPN Training
we promise we teach you the best
We teach various course like Java,J2EE,Selenium,Hadoop
Each time, the Map Task reads one key-value pair
Key --> Byte offset of the line
Value --> Entire line
Block 1
0, Well come to NPN Training
26, we promise we teach you the best
Block 2
0, We teach various course like Java,J2EE,Selenium,Hadoop
(The offset counts the newline character, so the second line starts at byte 26.
Offsets are shown relative to each block for simplicity; Hadoop reports the
offset from the start of the file.)
By default Hadoop uses the TextInputFormat class, which is responsible
for creating InputSplits and dividing them into records.
The TextInputFormat class creates the key-value pairs.
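The key-value pairs described above can be imitated in plain Java. The sketch below is not Hadoop's TextInputFormat; it only shows how a byte offset (key) and a line (value) could be derived from text, assuming `\n` line endings. The class name `OffsetRecordsSketch` and the tab-separated output are choices made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Derives (byte offset, line) records from text, assuming '\n' line endings.
class OffsetRecordsSketch {

    // Returns one "offset<TAB>line" string per line of the input.
    static List<String> toRecords(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            if (atEnd || data[i] == '\n') {
                // Skip the empty tail that follows a final newline.
                if (!(atEnd && lineStart == i)) {
                    String line = new String(data, lineStart, i - lineStart, StandardCharsets.UTF_8);
                    records.add(lineStart + "\t" + line);
                }
                lineStart = i + 1; // the next line starts after the newline byte
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "Well come to NPN Training\nwe promise we teach you the best\n";
        toRecords(sample).forEach(System.out::println);
    }
}
```

For the first two lines of Sample.dat this yields the keys 0 and 26, because the first line is 25 bytes long and is followed by one newline byte.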

Word Count Use case
Let’s assume we have a large collection of text documents in a folder
(say, 1000 documents, each with an average of 1 million words).
We have to count how many times each word is repeated in the documents.
How would you solve this using a simple Java program?
How many lines of code will you write?
How long will the program take to run?

MapReduce Paradigm
Input
C++ J2EE Python LISP
Python Java Python
JSP Python LISP
Python Servlet JSP
Input Split 1, read as (K1,V1) by the framework
< 0, C++ J2EE Python LISP >
< 21, Python Java Python >
Input Split 2, read as (K1,V1) by the framework
< 0, JSP Python LISP >
< 16, Python Servlet JSP >
(K1 is the byte offset of the line within its split, counting the newline character)
Mapper 1 output, List (K2,V2)
<C++,1> <J2EE,1> <Python,1> <LISP,1> <Python,1> <Java,1> <Python,1>
Mapper 2 output, List (K2,V2)
<JSP,1> <Python,1> <LISP,1> <Python,1> <Servlet,1> <JSP,1>
Shuffle & Sort by the framework, (K2, list(V2))
<C++, [1]>
<J2EE, [1]>
<JSP, [1,1]>
<Java, [1]>
<LISP, [1,1]>
<Python, [1,1,1,1,1]>
<Servlet, [1]>
Reducer output, List (K3,V3)
(C++, 1)
(J2EE, 1)
(JSP, 2)
(Java, 1)
(LISP, 2)
(Python, 5)
(Servlet, 1)

Word Count Implementation
Map Class
Reduce Class
Driver Class
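The three parts listed above can be sketched in plain Java before moving to the real Hadoop classes. This is only an imitation: `map`, `reduce` and `run` stand in for the Map class, Reduce class and Driver class, and a `TreeMap` stands in for the framework's shuffle and sort. None of the names below come from the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of a word count job (no Hadoop involved).
class WordCountSketch {

    // "Map class": emits (word, 1) for every word in one input line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // "Reduce class": sums the list of counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // "Driver class": wires the phases together. The grouping in the middle
    // stands in for the shuffle and sort that the framework would perform.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // +1 for the newline; assumes single-byte characters
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "C++ J2EE Python LISP", "Python Java Python",
                "JSP Python LISP", "Python Servlet JSP");
        System.out.println(run(lines));
    }
}
```

Run on the four sample lines from the MapReduce Paradigm slide, this prints `{C++=1, J2EE=1, JSP=2, Java=1, LISP=2, Python=5, Servlet=1}`. In a real job the mapper ignores the offset key in the same way, since only the line content matters for counting.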

Why MapReduce
Processing data in parallel
Taking processing to the data
(Figure labels: Map Task, HDFS Block, Node, Rack, Data Center)

Input Splits
InputSplits: logical division
HDFS Blocks: physical division
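Because a split is only a logical division, it can be described by a start position and a length. The sketch below plans such splits for a file, assuming the split size equals the block size. Hadoop's own split calculation applies further rules (for example, configurable minimum and maximum split sizes), which are ignored here. `SplitPlanSketch` and `planSplits` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Plans logical splits as {start, length} pairs; no data is read or copied.
class SplitPlanSketch {

    static List<long[]> planSplits(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        while (start < fileSize) {
            // The last split may be shorter than the others.
            long length = Math.min(splitSize, fileSize - start);
            splits.add(new long[] {start, length});
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 100 MB file with a 64 MB split size, as in the Employee.dat example.
        for (long[] s : planSplits(100 * mb, 64 * mb)) {
            System.out.println("start=" + s[0] / mb + " MB, length=" + s[1] / mb + " MB");
        }
    }
}
```

For the 100 MB file this gives two splits, 64 MB and 36 MB, matching the two blocks shown earlier.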

Relation between Input Splits and HDFS Blocks
(Figure: records 1 to 11 laid out across three 64 MB blocks and grouped into splits)
Blocks are cut in between the records.
The last record of a block may cross the boundary of the block.
Splits are aware of the record positions.
InputFormat is responsible for creating InputSplits and dividing them into records.
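One common way to handle records that cross a boundary is sketched below. A reader whose split does not start at byte 0 skips its first (possibly partial) line, and every reader keeps going past the end of its split until the line it started is complete. This is a simplified illustration in plain Java, not Hadoop's actual record reader; `SplitReaderSketch` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Reads whole lines for one split so that every line is read exactly once.
class SplitReaderSketch {

    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        // A split that does not begin the file skips its first line:
        // the reader of the previous split is responsible for it.
        if (start != 0) {
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // Read every line that STARTS at or before the split end,
        // even if the line itself ends beyond the boundary.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Index just past the next '\n' at or after 'from', or data.length if there is none.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // Three splits of 4, 4 and 1 bytes; the boundaries cut through "bb" and "cc".
        System.out.println(readSplit(data, 0, 4));
        System.out.println(readSplit(data, 4, 4));
        System.out.println(readSplit(data, 8, 1));
    }
}
```

With the 9-byte sample, the first split returns `aa` and `bb` (it finishes `bb` although the boundary cuts it), the second returns `cc`, and the third returns nothing, so no line is lost or read twice.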

Relation between Input Splits and HDFS Blocks
(Figure: Block, InputSplits, Map Task)
A Block is the physical representation of data. A Split is the logical representation of the data present in Blocks.
Block and split sizes can be changed through configuration properties.
A map reads data from Blocks through splits, i.e. the split acts as a broker between the Block and the Mapper.
Suppose the map reads block 1 from record aa to JJ and the last record continues into
block 2. A block knows nothing about the data held in a different block, so the map
on its own cannot finish that record. This is where the Split comes in: it forms a
logical grouping that spans Block 1 and Block 2 as if they were a single block. The
InputFormat and RecordReader then form the offset (key) and line (value) pairs and
send them to the map for further processing.

Hadoop MapReduce: A Closer Look
(Figure: the same pipeline runs on Node 1 and Node 2)
Files loaded from local HDFS store
InputFormat
Splits
RecordReaders (RR)
Input (K, V) pairs
Map
Intermediate (K, V) pairs
Partitioner
Shuffling process: intermediate (K, V) pairs exchanged by all nodes
Sort
Reduce
Final (K, V) pairs
OutputFormat
Writeback to local HDFS store

Data Flow in MapReduce
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data is exchanged between nodes in a “shuffle” process
Intermediate data of the same key goes to the same reducer
Reducer output is stored
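How "the same key goes to the same reducer" can be achieved is sketched below using hash partitioning: the reducer index is derived from the key's hash code, so equal keys always land on the same reducer no matter which node produced them. The class and method names are invented for this example, and Java's `String.hashCode` is used as a stand-in for the hash of a Hadoop key type.

```java
// Chooses a reducer for a key by hashing, so equal keys map to the same reducer.
class HashPartitionSketch {

    static int partitionFor(String key, int numReducers) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] words = {"C++", "Java", "Python", "LISP", "Python"};
        for (String w : words) {
            System.out.println(w + " -> reducer " + partitionFor(w, 3));
        }
    }
}
```

Both occurrences of `Python` print the same reducer index, which is what allows a single reducer to see the complete list of values for that key.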

Agenda for Next Class
Overview of Hive and its Architecture
Understanding Hive Metastore
Schema on Read and Schema on Write
Hive Data Model and Complex Data types
Internal vs External Tables
Exporting Data from Hive

www.npntraining.com +91 9535584691
