Introduction
• Recap
• Definition
• Analogy
• Phases: Map and Reduce

Recap
• Access speed has not kept up with storage capacity
• Processing data in parallel is faster
• A cluster architecture suits Hadoop
• How Hadoop got started
• HDFS architecture (block size and replication)
• Name Node and Secondary Name Node
• 5,000-foot overview of how HDFS writes happen

Definition
MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
• Framework
• Write applications
• Process large data
• Structured or unstructured
• Process data in parallel
• Reliable
• Fault-tolerant
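Analogy: sorting addresses
Unsorted addresses (zone prefix, then locality): E-Sarjapur, E-K.R.Puram, N-Yelahanka, S-J P Nagar, N-Hebbal, W-Rajajinagar
Sort: place each locality in one of two bins by first letter, A to M or N to Z, and sort each bin
Merge: Hebbal, JPNagar, KRPuram, Rajajinagar, Sarjapur, Yelahanka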
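The analogy can be tried out in a few lines. The following is only a sketch under assumptions: the function names are made up for illustration, the zone prefixes are dropped, and the two bins are taken to be A to M and N to Z as the slide suggests.

```javascript
// Localities from the analogy, with the zone prefixes removed.
const addresses = ["Sarjapur", "KRPuram", "Yelahanka", "JPNagar", "Hebbal", "Rajajinagar"];

// Partition step: route each address to a bin by its first letter.
function partitionByLetter(items) {
  const buckets = { "A-M": [], "N-Z": [] };
  for (const item of items) {
    const first = item[0].toUpperCase();
    buckets[first <= "M" ? "A-M" : "N-Z"].push(item);
  }
  return buckets;
}

// Sort step: each bin is sorted on its own, so bins could be sorted in parallel.
function sortBuckets(buckets) {
  const sorted = {};
  for (const name of Object.keys(buckets)) {
    sorted[name] = buckets[name].slice().sort();
  }
  return sorted;
}

// Merge step: the bins cover disjoint, ordered letter ranges, so
// concatenating them in range order yields a fully sorted list.
function mergeBuckets(sorted) {
  return sorted["A-M"].concat(sorted["N-Z"]);
}

const sortedAddresses = mergeBuckets(sortBuckets(partitionByLetter(addresses)));
console.log(sortedAddresses);
```

Running it prints the merged list from the slide: Hebbal, JPNagar, KRPuram, Rajajinagar, Sarjapur, Yelahanka.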
MapReduce data flow (runs on the Data Nodes / Task Trackers)
HDFS → Input Splits → Mappers → Sort and Shuffle → Reducers (aggregation) → HDFS
Word count example
Input:
    Near ear here
    here there Hear
    Ear dear There
Splitting (one split per line):
    Split 1: Near ear here
    Split 2: Here there Hear
    Split 3: Ear Dear There
Mapping:
    Split 1: Near,1  Ear,1  Here,1
    Split 2: Here,1  There,1  Hear,1
    Split 3: Ear,1  Dear,1  There,1
Shuffling (words are compared case-insensitively):
    Ear: 1,1
    Dear: 1
    Here: 1,1
    Hear: 1
    There: 1,1
    Near: 1
Reducing:
    Ear,2  Dear,1  Here,2  Hear,1  There,2  Near,1
Final result:
    Ear,2  Dear,1  Here,2  Hear,1  There,2  Near,1

Key/value types at each stage:
    Input to mapper: <K1,V1>
    Output from mapper: <K2,V2>
    Input to reducers: <K2,(V2,V2)>
    Output from reducers: <K3,V3>
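To make the type flow concrete, here is a small in-memory sketch of the shuffle and reduce steps applied to the mapper output of the example. It is an illustration only: the function names are hypothetical, and the words are lowercased on the assumption that "Here" and "here" should count as the same word.

```javascript
// Mapper output <K2,V2> for the three input lines, lowercased.
const mapped = [
  ["near", 1], ["ear", 1], ["here", 1],
  ["here", 1], ["there", 1], ["hear", 1],
  ["ear", 1], ["dear", 1], ["there", 1],
];

// Shuffle: group values by key, producing <K2, list of V2>, sorted by key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const ordered = [...groups.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return new Map(ordered);
}

// Reduce: collapse each list of values into a single <K3,V3> pair.
function reduceCounts(groups) {
  const counts = {};
  for (const [key, values] of groups) {
    counts[key] = values.reduce((sum, v) => sum + v, 0);
  }
  return counts;
}

const grouped = shuffle(mapped);
const wordCounts = reduceCounts(grouped);
console.log(wordCounts);
```

The result is dear 1, ear 2, hear 1, here 2, near 1, there 2, which agrees with the final counts in the example.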
// MapReduce word count in JavaScript
var map = function (key, value, context) {
    // Split the input line on every non-letter character.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            // Emit (word, 1); lowercasing makes the count case-insensitive.
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    // Add up every count emitted for this word.
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
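The functions above need a framework to call them. As a rough way to try them without a cluster, the sketch below repeats the two functions and drives them with stand-in objects. The `makeContext`, `makeIterator` and `runLocally` helpers are hypothetical; they only mimic the `context.write` and `values.hasNext()/next()` calls the slide code relies on and are not part of any Hadoop API.

```javascript
// Word-count functions in the shape used on the slide.
function map(key, value, context) {
  for (const word of value.split(/[^a-zA-Z]/)) {
    if (word !== "") context.write(word.toLowerCase(), 1);
  }
}

function reduce(key, values, context) {
  let sum = 0;
  while (values.hasNext()) sum += parseInt(values.next(), 10);
  context.write(key, sum);
}

// Stand-in for the framework's context: it just collects written pairs.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}

// Stand-in for the framework's value iterator.
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

function runLocally(lines) {
  // Map phase: one call per input line; the key is the line index.
  const mapCtx = makeContext();
  lines.forEach((line, index) => map(index, line, mapCtx));

  // Shuffle phase: group the mapper output by key.
  const groups = new Map();
  for (const [k, v] of mapCtx.pairs) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }

  // Reduce phase: one call per distinct key, in sorted key order.
  const reduceCtx = makeContext();
  for (const k of [...groups.keys()].sort()) {
    reduce(k, makeIterator(groups.get(k)), reduceCtx);
  }
  return reduceCtx.pairs;
}

const localOutput = runLocally(["Near ear here", "here there Hear", "Ear dear There"]);
console.log(localOutput);
```

With the three lines from the word count example, this prints the pairs dear 1, ear 2, hear 1, here 2, near 1, there 2.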
• Job Client
    • Submits the job to the Job Tracker
• Job Tracker: orchestrates jobs
    • Queries the Name Node for data locations
    • Creates the execution plan
    • Submits the job to the Task Trackers
    • Manages the phases (Map, Shuffle and Reduce)
    • Updates status
• Task Tracker: executes job tasks
    • Reports progress
HDFS block placement across RACK 1 DataNodes and RACK 2 DataNodes
File metadata:
    /user/kc/data01.txt: Blocks 1, 2, 3, 4
    /user/apb/data02.txt: Blocks 5, 6
Block locations (three replicas each, spread over both racks):
    Block 1: R1DN01, R1DN02, R2DN01
    Block 2: R1DN01, R1DN02, R2DN03
    Block 3: R1DN02, R1DN03, R2DN01
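Block locations like these are what let map tasks run next to their data. The sketch below is a deliberately simplified, hypothetical scheduler: it assigns each block's map task to the least-loaded node that already holds a replica. The real Job Tracker is more involved (task slots, heartbeats, rack-level fallback), so treat this as an illustration of the idea only.

```javascript
// Block locations as listed on the slide.
const blockLocations = {
  Block1: ["R1DN01", "R1DN02", "R2DN01"],
  Block2: ["R1DN01", "R1DN02", "R2DN03"],
  Block3: ["R1DN02", "R1DN03", "R2DN01"],
};

// Hypothetical scheduler: pick, for each block, the replica holder with
// the fewest tasks so far, so that the map task reads its block locally.
function assignMapTasks(locations) {
  const load = {};        // node -> number of tasks assigned so far
  const assignment = {};  // block -> chosen node
  for (const block of Object.keys(locations)) {
    let best = null;
    for (const node of locations[block]) {
      const nodeLoad = load[node] || 0;
      if (best === null || nodeLoad < (load[best] || 0)) best = node;
    }
    assignment[block] = best;
    load[best] = (load[best] || 0) + 1;
  }
  return assignment;
}

const taskAssignment = assignMapTasks(blockLocations);
console.log(taskAssignment);
```

For the three blocks above this assigns Block1 to R1DN01, Block2 to R1DN02 and Block3 to R1DN03, so every task runs on a node that stores its block.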
Job execution flow: Client → Job Tracker → Task Tracker
Processing stages, in order:
• Input Splits
• RecordReader (uses the bytes and storage location from the InputSplit)
• map()
• Combiner
• Partitioner
• Shuffle and Sort
• Reduce
• Output Format
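Two of these stages, the Combiner and the Partitioner, are easy to picture in code. The following is a sketch under assumptions rather than Hadoop's implementation: `combine` and `partition` are illustrative names, and the string hash is a simple stand-in. Hadoop's default HashPartitioner similarly takes the key's hash code modulo the number of reduce tasks.

```javascript
// Combiner: pre-aggregate one mapper's output locally, so that fewer
// pairs have to cross the network during the shuffle.
function combine(pairs) {
  const sums = new Map();
  for (const [key, value] of pairs) {
    sums.set(key, (sums.get(key) || 0) + value);
  }
  return [...sums.entries()];
}

// Partitioner: choose the reducer for a key, using a simple string hash
// modulo the number of reducers. Equal keys always go to the same reducer.
function partition(key, numReducers) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return hash % numReducers;
}

// One mapper's raw output: "here" appears twice.
const mapperOutput = [["here", 1], ["there", 1], ["here", 1]];
const combined = combine(mapperOutput);
const routed = combined.map(([key, value]) => ({ key, value, reducer: partition(key, 2) }));
console.log(routed);
```

After combining, the mapper sends two pairs (here 2, there 1) instead of three, and each pair is tagged with the reducer that will receive it.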
Apache Hadoop - A Deep Dive (Part 2 - MapReduce)
• Support team's blog: http://blogs.msdn.com/b/bigdatasupport/
• Facebook page: https://www.facebook.com/MicrosoftBigData
• Facebook group: https://www.facebook.com/groups/bigdatalearnings/
• Twitter: @debarchans
• Twitter: @confusionblinds
• Read more:
    • http://en.wikipedia.org/wiki/Hadoop
    • http://en.wikipedia.org/wiki/Big_data
• Next session:
    • Apache Hadoop – Setup Lab