Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

 Recap
 Definition
 Analogy
 Phases: MapReduce

 Access speed has not kept up with storage capacity
 Processing data in parallel is better
 A cluster architecture suits Hadoop
 How Hadoop got started
 HDFS architecture (block size and replication)
 NameNode and Secondary NameNode
 A 5,000-foot overview of how HDFS writes happen

 MapReduce is a framework for writing applications that process large
amounts of structured and unstructured data in parallel across a cluster of
thousands of machines, in a reliable and fault-tolerant manner.
 Framework
 Write Applications
 Process Large Data
 Structured or Unstructured
 Process Data In Parallel
 Reliable
 Fault-tolerant

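Analogy: sort, then merge
 Input: E-Sarjapur, E-K.R.Puram, N-Yelahanka, S-J P Nagar, N-Hebbal, W-Rajajinagar
 Sort: each name goes into one of two bins by first letter, A-M or N-Z
 Merge: Hebbal, JPNagar, KRPuram, Rajajinagar, Sarjapur, Yelahanka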
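
The analogy above can be sketched in a few lines of JavaScript. This is only an illustration of the idea, not Hadoop code: `binFor` is a hypothetical helper, and the zone prefixes (E-, N-, S-, W-) are assumed to be labels that play no part in the ordering, so they are dropped.

```javascript
// Locality names from the slide, with the zone prefix removed (an assumption).
const localities = ["Sarjapur", "KRPuram", "Yelahanka", "JPNagar", "Hebbal", "Rajajinagar"];

// "Sort" step: route each name to a bin by its first letter (A-M or N-Z).
// binFor is a hypothetical helper, not part of Hadoop.
function binFor(name) {
  return name[0].toUpperCase() <= "M" ? "A-M" : "N-Z";
}

const bins = { "A-M": [], "N-Z": [] };
for (const name of localities) {
  bins[binFor(name)].push(name);
}

// Each bin can be sorted independently (in parallel on a real cluster).
for (const key of Object.keys(bins)) {
  bins[key].sort();
}

// "Merge" step: the bins cover disjoint, ordered letter ranges,
// so concatenating them in range order gives a fully sorted list.
const merged = [...bins["A-M"], ...bins["N-Z"]];
console.log(merged);
```

Because no name in the A-M bin can sort after a name in the N-Z bin, the merge is a plain concatenation; that is the property a range-based split buys.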

MapReduce data flow (Mappers and Reducers run on the DataNodes / TaskTrackers):
HDFS -> Input Splits -> Mappers -> Sort and Shuffle -> Reducers (Aggregation) -> HDFS

Word count example (words are counted case-insensitively)
 Input: Near ear here here there Hear Ear dear There
 Splitting:
   Split 1: Near ear here
   Split 2: Here there Hear
   Split 3: Ear Dear There
 Mapping:
   Split 1: (Near,1) (Ear,1) (Here,1)
   Split 2: (Here,1) (There,1) (Hear,1)
   Split 3: (Ear,1) (Dear,1) (There,1)
 Shuffling:
   Ear: 1,1
   Dear: 1
   Here: 1,1
   Hear: 1
   Near: 1
   There: 1,1
 Reducing:
   (Ear,2) (Dear,1) (Here,2) (Hear,1) (There,2) (Near,1)
 Final Result:
   Ear,2
   Dear,1
   Here,2
   Hear,1
   There,2
   Near,1

Key/value types at each stage:
 Input to Mapper: <K1,V1>
 Output from Mapper: <K2,V2>
 Input to Reducers: <K2,(V2,V2)>
 Output from Reducers: <K3,V3>
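
The shuffle step is where mapper output <K2,V2> becomes reducer input <K2,(V2,V2)>. A minimal sketch of that grouping, using the pairs from the word count example, might look like this. `shuffle` is a hypothetical helper written for illustration and not a Hadoop API; keys are assumed to be lowercased already.

```javascript
// Mapper output <K2,V2> from the word count example, keys already lowercased.
const mapped = [
  ["near", 1], ["ear", 1], ["here", 1],   // split 1
  ["here", 1], ["there", 1], ["hear", 1], // split 2
  ["ear", 1], ["dear", 1], ["there", 1],  // split 3
];

// Shuffle: collect every value that shares a key -> <K2,(V2,V2,...)>.
// shuffle is a hypothetical helper, not a Hadoop API.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  // Order the groups by key before they are handed to the reduce step.
  return [...groups.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Reduce: sum each group's values -> <K3,V3>.
const counts = shuffle(mapped).map(([key, values]) => [
  key,
  values.reduce((sum, v) => sum + v, 0),
]);
console.log(counts);
```

Note that "here" and "hear" stay separate keys, so each gets its own group and its own count.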

// MapReduce functions in JavaScript

// map: split a line of text into words and emit (word, 1) for each one.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: sum all the counts emitted for a single word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
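
To try functions of this shape outside a cluster, a small in-memory driver is enough. The sketch below is a stand-in built on assumptions: `makeContext`, `makeIterator`, and `runLocally` are hypothetical helpers that only mimic the `context.write`, `hasNext`, and `next` calls used above, and they are not the Hadoop runtime. Compact copies of the word count functions are repeated so the block runs on its own.

```javascript
// Compact copies of the word count functions, so this block is self-contained.
const map = (key, value, context) => {
  for (const word of value.split(/[^a-zA-Z]/)) {
    if (word !== "") context.write(word.toLowerCase(), 1);
  }
};
const reduce = (key, values, context) => {
  let sum = 0;
  while (values.hasNext()) sum += parseInt(values.next(), 10);
  context.write(key, sum);
};

// Hypothetical stand-in for the framework's context: collects written pairs.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}

// Hypothetical stand-in for the values iterator passed to reduce.
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

// Run map over every line, group by key, then run reduce once per key.
function runLocally(lines) {
  const mapContext = makeContext();
  lines.forEach((line, offset) => map(offset, line, mapContext));

  const groups = new Map();
  for (const [k, v] of mapContext.pairs) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }

  const reduceContext = makeContext();
  for (const k of [...groups.keys()].sort()) {
    reduce(k, makeIterator(groups.get(k)), reduceContext);
  }
  return Object.fromEntries(reduceContext.pairs);
}

const result = runLocally(["Near ear here", "Here there Hear", "Ear Dear There"]);
console.log(result);
```

The driver runs everything in one process and in a fixed order; it is useful for checking the logic of a map or reduce function, not for judging performance.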

 Job Client
   Submits the job to the JobTracker
 JobTracker: orchestrates jobs
   Queries the NameNode for data locations
   Creates the execution plan
   Submits tasks to the TaskTrackers
   Manages the phases (Map, Shuffle and Reduce)
   Updates job status
 TaskTracker: executes the job's tasks
   Reports progress

HDFS block placement across RACK 1 - DataNodes and RACK 2 - DataNodes (each block is stored as three replicas)
File metadata:
 /user/kc/data01.txt: Blocks 1, 2, 3, 4
 /user/apb/data02.txt: Blocks 5, 6
Block locations:
 Block 1: R1DN01, R1DN02, R2DN01
 Block 2: R1DN01, R1DN02, R2DN03
 Block 3: R1DN02, R1DN03, R2DN01
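
The block locations above are what make data-local scheduling possible: a map task is best placed on a node that already holds a replica of its block. The following is a simplified sketch of that preference order, not the actual JobTracker scheduler. `pickNode` and `rackOf` are hypothetical helpers, the rack is assumed to be encoded in the first two characters of the node name, and `R3DN01` is an invented node used only to show the fallback case.

```javascript
// Block -> replica locations, copied from the slide (R = rack, DN = DataNode).
const blockLocations = {
  1: ["R1DN01", "R1DN02", "R2DN01"],
  2: ["R1DN01", "R1DN02", "R2DN03"],
  3: ["R1DN02", "R1DN03", "R2DN01"],
};

// Hypothetical helper: assume the rack is the first two characters of the name.
const rackOf = (node) => node.slice(0, 2);

// Pick a node for a map task over `block`, given the nodes with a free slot.
// Preference order: data-local, then rack-local, then any free node.
function pickNode(block, freeNodes) {
  const replicas = blockLocations[block];
  const dataLocal = freeNodes.find((n) => replicas.includes(n));
  if (dataLocal) return { node: dataLocal, locality: "data-local" };

  const replicaRacks = new Set(replicas.map(rackOf));
  const rackLocal = freeNodes.find((n) => replicaRacks.has(rackOf(n)));
  if (rackLocal) return { node: rackLocal, locality: "rack-local" };

  return freeNodes.length > 0 ? { node: freeNodes[0], locality: "off-rack" } : null;
}

console.log(pickNode(1, ["R2DN02", "R1DN02"])); // R1DN02 holds a replica of block 1
console.log(pickNode(2, ["R1DN03"]));           // same rack as a replica, no local copy
console.log(pickNode(3, ["R3DN01"]));           // R3DN01 is a made-up node on a third rack
```

Moving the computation to the data in this way avoids copying a whole block over the network just to read it once.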

Job execution flow (Client, JobTracker, TaskTracker):
 Splits
 RecordReader (uses bytes and storage location from the InputSplit)
 map()
 Combiner
 Partitioner
 Shuffle and Sort
 Reduce
 OutputFormat
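
Two of the stages above, the Combiner and the Partitioner, are easy to overlook. A minimal sketch of what each one does, under stated assumptions: `combine` and `partition` are hypothetical helpers, the sample pairs are invented, and the hash function is a simple illustrative one rather than the one Hadoop uses.

```javascript
// Mapper output on one node (hypothetical sample data).
const mapperOutput = [["here", 1], ["ear", 1], ["here", 1], ["near", 1]];

// Combiner: pre-aggregate on the map side, so fewer pairs cross the network.
function combine(pairs) {
  const sums = new Map();
  for (const [key, value] of pairs) sums.set(key, (sums.get(key) || 0) + value);
  return [...sums.entries()];
}

// Partitioner: an illustrative string hash modulo the number of reducers.
// Pairs with the same key always land on the same reducer.
function partition(key, numReducers) {
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % numReducers;
}

const combined = combine(mapperOutput);
const numReducers = 2;
const byReducer = Array.from({ length: numReducers }, () => []);
for (const [key, value] of combined) {
  byReducer[partition(key, numReducers)].push([key, value]);
}

console.log(combined);
console.log(byReducer);
```

A combiner like this is only safe when the reduce operation is associative and commutative, as summing is; the partitioner's one hard requirement is that it is deterministic per key.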


 Support Team’s blog: http://blogs.msdn.com/b/bigdatasupport/
 Facebook Page: https://www.facebook.com/MicrosoftBigData
 Facebook Group: https://www.facebook.com/groups/bigdatalearnings/
 Twitter: @debarchans
 Twitter: @confusionblinds
 Read more:
 http://en.wikipedia.org/wiki/Hadoop
 http://en.wikipedia.org/wiki/Big_data
 Next Session:
 Apache Hadoop – Setup Lab
