SlideShare a Scribd company logo
www.edureka.co/big-data-and-hadoop
Hadoop the ultimate data storage
And processing Together
Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
Analyze different use-cases where MapReduce is used
Differentiate between Traditional way and MapReduce way
Learn about Hadoop 2.x MapReduce architecture and components
Understand execution flow of YARN MapReduce application
Implement basic MapReduce concepts
Run a MapReduce Program
At the end of this module, you will be able to
Slide 3 www.edureka.co/big-data-and-hadoop
Where MapReduce is Used?
Weather Forecasting
HealthCare
 Problem Statement:
» De-identify personal health information.
 Problem Statement:
» Finding Maximum temperature recorded in a year.
Slide 4 www.edureka.co/big-data-and-hadoop
Where MapReduce is Used?
MapReduce
FeaturesLarge Scale
Distributed Model
Used in
Function
Design Pattern
Parallel
Programming
A Program Model
Classification
Analytics
Recommendation
Index and Search
Map
Reduce
Classification
Eg: Top N records
Analytics
Eg: Join, Selection
Recommendation
Eg: Sort
Summarization
Eg: Inverted Index
Implemented
Google
Apache Hadoop
HDFS
Pig
Hive
HBase
For
Slide 5 www.edureka.co/big-data-and-hadoop
The Traditional Way
Very
Big
Data
Split Data matches
All
matches
grep
grep
grep cat
grep
:
matches
matches
matches
Split Data
Split Data
Split Data
Slide 6 www.edureka.co/big-data-and-hadoop
MapReduce Way
Very
Big
Data
Split Data
All
matches
:
Split Data
Split Data
Split Data
M
A
P
R
E
D
U
C
E
MapReduce Framework
Slide 7 www.edureka.co/big-data-and-hadoop
MapReduce Paradigm
The Overall MapReduce Word Count Process
Input Splitting Mapping Shuffling Reducing Final Result
List(K3,V3)
Deer Bear River
Dear Bear River
Car Car River
Deer Car Bear
Bear, 2
Car, 3
Deer, 2
River, 2
Deer, 1
Bear, 1
River, 1
Car, 1
Car, 1
River, 1
Deer, 1
Car, 1
Bear, 1
K2,List(V2)List(K2,V2)
K1,V1
Car Car River
Deer Car Bear
Bear, 2
Car, 3
Deer, 2
River, 2
Bear, (1,1)
Car, (1,1,1)
Deer, (1,1)
River, (1,1)
Slide 8 www.edureka.co/big-data-and-hadoop
Anatomy of a MapReduce Program
MapReduce
Map:
Reduce:
(K1, V1) List (K2, V2)
(K2, list (V2)) List (K3, V3)
Key Value
Slide 9 www.edureka.co/big-data-and-hadoop
Why MapReduce?
Two biggest Advantages:
» Taking processing to the data
» Processing data in parallel
a
b
c
Map Task
HDFS Block
Data Center
Rack
Node
Slide 10 www.edureka.co/big-data-and-hadoop
 ApplicationMaster
» One per application
» Short life
» Coordinates and Manages MapReduce Jobs
» Negotiates with Resource Manager to
schedule tasks
» The tasks are started by NodeManager(s)
 Job History Server
» Maintains information about submitted
MapReduce jobs after their ApplicationMaster
terminates
 Client
» Submits a MapReduce Job
 Resource Manager
» Cluster Level resource manager
» Long Life, High Quality Hardware
 Node Manager
» One per Data Node
» Monitors resources on Data Node
Hadoop 2.x MapReduce Components
 Container
» Created by NM when requested
» Allocates certain amount of resources
(memory, CPU etc.) on a slave node
Slide 11 www.edureka.co/big-data-and-hadoop
BATCH
(MapReduce)
INTERACTIVE
(Text)
ONLINE
(HBase)
STREAMING
(Storm, S4, …)
GRAPH
(Giraph)
IN-MEMORY
(Spark)
HPC MPI
(OpenMPI)
OTHER
(Search)
(Weave..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
YARN – Moving beyond MapReduce
Slide 12 www.edureka.co/big-data-and-hadoop
MapReduce Application Execution
Executing MapReduce Application on YARN
Slide 13 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
MapReduce Job Execution
» Job Submission
» Job Initialization
» Tasks Assignment
» Memory Assignment
» Status Updates
» Failure Recovery
Slide 14 www.edureka.co/big-data-and-hadoop
HDFS
Application Job Object
Client JVM
Client
Resource
Manager
Management Node
Run Job
2. Get New Application ID
4. Submit Application Context
3. Prepare the
Application submit
context
3.1 App Jar
3.2 Job Resources(Block
locations)
3.3 User Information
1. Notify Start Application
YARN MR Application Execution Flow
Slide 15 www.edureka.co/big-data-and-hadoop
HDFS
3. Prepare the
Application submit
context
3.1 App Jar
3.2 Job Resources(Block
locations)
3.3 User Information
Node Manager
5. Start AppMaster container /
Allocate Context for AppMaster
App Master
6.Alloate
Container for
AppMaster
7.Request
Resources
8.Notify with resources
Availability
Data Node
YARN MR Application Execution Flow
Application Job Object
Client JVM
Client
Resource
Manager
Management Node
Run Job
2. Get New Application ID
4. Submit Application Context
1. Notify Start Application
Slide 16 www.edureka.co/big-data-and-hadoop
HDFS
Resource
Manager
3. Prepare the Application
submit context
3.1 App Jar
3.2 Job Resources(Block
locations)
3.3 User Information
Management Node
Node Manager
5. Start AppMaster container / Allocate
Context for AppMaster
App Master
6. Allocate
Container for
AppMaster
7.Request
Resources
8.Notify with resources
Availability
Data Node
Client
Node Manager
Data node-1
Node Manager
Map Block
9.Start Container
in the worker node
Data node-2
Node Manager
Map Block
10.NM allocate
Container
10.NM allocate
Container
2. Get New Application
4. Submit Application
1. Notify Start Application
9.Start Container
in the worker
node
YARN MR Application Execution Flow
Slide 17 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
11.Task get Executed.
12.If any reducer in a Job Reducer, again AppMaster Request the Node Manager to start the and Allocate
Container
13.Output of All the Maps given to reducer and Reducer get executed
14.Once Job finished, Application Master notify the Resource Manager and Client Library
15.Application Master closed.
Slide 18 www.edureka.co/big-data-and-hadoop
Hadoop 2.x : YARN Workflow
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Container 1.2
Container 1.1
Container 2.1
Container 2.2
Container 2.3
App
Master 2
App
Master 1
Scheduler
Applications
Manager (AsM)
Resource
Manager
Slide 19 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application Client RM NM AM
1
Slide 20 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application
2. RM allocates a container to start AM
Client RM NM AM
1
2
Slide 21 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
Client RM NM AM
1
2
3
Slide 22 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks containers from RM
Client RM NM AM
1
2
3
4
Slide 23 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks containers from RM
5. AM notifies NM to launch containers
Client RM NM AM
1
2
3
4
5
Slide 24 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in container
Client RM NM AM
1
2
3
4
5
6
Slide 25 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application’s status
Client RM NM AM
1
2
3
4
5
7 6
Slide 26 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence :
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application’s status
8. AM unregisters with RM
Client RM NM AM
1
2
3
4
5
7
8
6
Slide 27 www.edureka.co/big-data-and-hadoop
Input Splits
INPUT DATA
Physical
Division
Logical
Division
HDFS
Blocks
Input
Splits
Slide 28 www.edureka.co/big-data-and-hadoop
Relation Between Input Splits and HDFS Blocks
1 2 3 4 5 6 7 8 9 10 11
 Logical records do not fit neatly into the HDFS blocks.
 Logical records are lines that cross the boundary of the blocks.
 First split contains line 5 although it spans across blocks.
File
Lines
Block
Boundary
Block
Boundary
Block
Boundary
Block
Boundary
Split Split Split
Slide 29 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
Input data is distributed to nodes
Node 1 Node 2
INPUT DATA
Slide 30 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
Input data is distributed to nodes
Each map task works on a “split” of data
Map
Node 1
Map
Node 2
INPUT DATA
Slide 31 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Map
Node 1
Map
Node 2
INPUT DATA
Slide 32 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data exchange between nodes in a “shuffle” process
Map
Node 1
Map
Node 2
Node 1 Node 2
INPUT DATA
Slide 33 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data exchange between nodes in a “shuffle” process
Intermediate data of the same key goes to the same reducer
Map
Node 1
Map
Node 2
Reduce
Node 1
Reduce
Node 2
INPUT DATA
Slide 34 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data exchange between nodes in a “shuffle” process
Intermediate data of the same key goes to the same reducer
Reducer output is stored
Map
Node 1
Map
Node 2
Reduce
Node 1
Reduce
Node 2
INPUT DATA
Slide 35 www.edureka.co/big-data-and-hadoop
Combiner
Combiner
Reducer
(B,1)
(C,1)
(D,1)
(E,1)
(D,1)
(B,1)
(D,1)
(A,1)
(A,1)
(C,1)
(B,1)
(D,1)
(B,2)
(C,1)
(D,2)
(E,1)
(D,2)
(A,2)
(C,1)
(B,1)
(A, [2])
(B, [2,1])
(C, [1,1])
(D, [2,2])
(E, [1])
(A,2)
(B,3)
(C,2)
(D,4)
(E,1)
Shuffle
CombinerMapper
Mapper
B
C
D
E
D
B
D
A
A
C
B
D
Block1Block2
Slide 36 www.edureka.co/big-data-and-hadoop
Partitioner – Redirecting Output from Mapper
Map
Map
Map
Reducer
Reducer
Reducer
Partitioner
Partitioner
Partitioner
Slide 37 www.edureka.co/big-data-and-hadoop
Getting Data to the Mapper
Input File Input File
Input split Input split Input split Input split
RecordReader RecordReader RecordReader RecordReader
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)
Slide 38 www.edureka.co/big-data-and-hadoop
Partition and Shuffle
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)
Partitioner Partitioner Partitioner Partitioner
(intermediates) (intermediates) (intermediates)
Reducer Reducer Reducer
Slide 39 www.edureka.co/big-data-and-hadoop
Demo of Word Count Program
To illustrate Default Input Format
(Text Input Format)
Demo
Slide 40 www.edureka.co/big-data-and-hadoop
Input file
Input Split Input Split Input Split
Record
Reader
Record
Reader
Record
Reader
Mapper Mapper Mapper
(Intermediates) (Intermediates) (Intermediates)
InputFormat
Input Split
Record
Reader
Mapper
Input file
(Intermediates)
Input Format
Slide 41 www.edureka.co/big-data-and-hadoop
Combine File
Input Format<K,V>
Text Input Format
Key Value Text
Input Format
Nline Input Format
Sequence File
Input Format<K,V>
File Input Format
<K,V>
Input Format<K,V>
org.apache.hadoop.mapreduce
<<interface>>
Composable
Input Format
<K,V>
Composite Input Format
<K,V>
DB Input
Format<T>
Sequence File As
Binary Input Format
Sequence File As
Text Input Format
Sequence File Input
Filter<K,V>
Input Format – Class Hierarchy
Slide 42 www.edureka.co/big-data-and-hadoop
Reducer
RecordWriter
Output file
Reducer
RecordWriter
Output file
Reducer
RecordWriter
Output file
OutputFormat
Output Format
Slide 43 www.edureka.co/big-data-and-hadoop
Text Output Format
<K,V>
Sequence File
Output Format<K,V>
Output Format <K,V>
org.apache.hadoop.mapreduce
DB Output Format
<K,V>
File Output Format
<K,V>
Null Output Format
<K,V>
Filter Output Format
<K,V>
Sequence File As Binary
Output Format
Lazy Output Format
<K,V>
Output Format – Class Hierarchy
Slide 44 www.edureka.co/big-data-and-hadoop
Demo
Demo: Custom Input Format
XML Parsing with Map Reduce

More Related Content

PPTX
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
KEY
Large scale ETL with Hadoop
ODP
Hadoop demo ppt
PPTX
Hadoop And Their Ecosystem
PDF
Introduction to Big Data & Hadoop
PDF
Practical Problem Solving with Apache Hadoop & Pig
PPTX
Hadoop introduction , Why and What is Hadoop ?
PDF
Hadoop-Introduction
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Large scale ETL with Hadoop
Hadoop demo ppt
Hadoop And Their Ecosystem
Introduction to Big Data & Hadoop
Practical Problem Solving with Apache Hadoop & Pig
Hadoop introduction , Why and What is Hadoop ?
Hadoop-Introduction

What's hot (20)

PDF
Hadoop Administration pdf
PDF
Integration of HIve and HBase
PDF
Hadoop MapReduce Framework
PDF
Basics of big data analytics hadoop
PDF
Big data Hadoop Analytic and Data warehouse comparison guide
PPTX
Transformation Processing Smackdown; Spark vs Hive vs Pig
PPT
Hadoop MapReduce Fundamentals
PPTX
Introduction to Pig
PDF
Hadoop Overview & Architecture
 
PPTX
February 2014 HUG : Pig On Tez
PPTX
Mutable Data in Hive's Immutable World
PPTX
NoSQL Needs SomeSQL
PDF
20131205 hadoop-hdfs-map reduce-introduction
PDF
Scaling HDFS to Manage Billions of Files with Key-Value Stores
PPTX
Functional Programming and Big Data
PPT
Seminar Presentation Hadoop
PDF
Hadoop Developer
PDF
Overview of Hadoop and HDFS
PPTX
Real time hadoop + mapreduce intro
PPTX
Introduction to Hadoop and Hadoop component
Hadoop Administration pdf
Integration of HIve and HBase
Hadoop MapReduce Framework
Basics of big data analytics hadoop
Big data Hadoop Analytic and Data warehouse comparison guide
Transformation Processing Smackdown; Spark vs Hive vs Pig
Hadoop MapReduce Fundamentals
Introduction to Pig
Hadoop Overview & Architecture
 
February 2014 HUG : Pig On Tez
Mutable Data in Hive's Immutable World
NoSQL Needs SomeSQL
20131205 hadoop-hdfs-map reduce-introduction
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Functional Programming and Big Data
Seminar Presentation Hadoop
Hadoop Developer
Overview of Hadoop and HDFS
Real time hadoop + mapreduce intro
Introduction to Hadoop and Hadoop component
Ad

Viewers also liked (20)

PDF
Efficient processing of large and complex XML documents in Hadoop
PDF
Pig - Processing XML data
PDF
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
PPTX
Data Ingestion, Extraction & Parsing on Hadoop
PDF
A poster version of HadoopXML
PDF
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
PPSX
La plateforme OpenData 3.0 pour libérer et valoriser les données
PDF
Hadoop in Data Warehousing
PPTX
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
PDF
Hadoop combiner and partitioner
PDF
Webinar: Ways to Succeed with Hadoop in 2015
PDF
MapReduce 簡單介紹與練習
PPTX
MapReduce Design Patterns
PDF
Map reduce: beyond word count
PDF
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
PPTX
Kafka & Hadoop - for NYC Kafka Meetup
PPTX
Flume vs. kafka
PPTX
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
PPTX
Hadoop data ingestion
PPSX
Sales Forecasting
Efficient processing of large and complex XML documents in Hadoop
Pig - Processing XML data
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
Data Ingestion, Extraction & Parsing on Hadoop
A poster version of HadoopXML
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
La plateforme OpenData 3.0 pour libérer et valoriser les données
Hadoop in Data Warehousing
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Hadoop combiner and partitioner
Webinar: Ways to Succeed with Hadoop in 2015
MapReduce 簡單介紹與練習
MapReduce Design Patterns
Map reduce: beyond word count
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Kafka & Hadoop - for NYC Kafka Meetup
Flume vs. kafka
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Hadoop data ingestion
Sales Forecasting
Ad

Similar to XML Parsing with Map Reduce (20)

PDF
Bulk Loading Into HBase With MapReduce
PDF
Distributed Cache With MapReduce
PPTX
Hadoop Adminstration with Latest Release (2.0)
PDF
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
PDF
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
PPTX
What is hadoop
PDF
Big Data Analytics Chapter3-6@2021.pdf
PPTX
batch-6 - Manisalyabatch_8_bda - Tejaswini Chowdary.pptx Kumar.pptx
PDF
Hadoop
PPT
Hadoop and Mapreduce Introduction
PDF
Introduction to Big Data and Hadoop
PDF
Big data presentation
PPTX
Big Data & Hadoop Tutorial
PDF
Introduction to Big data & Hadoop -I
DOCX
project report on hadoop
PDF
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
PDF
Hadoop Tutorial for Big Data Enthusiasts
PDF
Hadoop installation by santosh nage
PPT
Hadoop Map-Reduce from the subject: Big Data Analytics
PPTX
Big data
Bulk Loading Into HBase With MapReduce
Distributed Cache With MapReduce
Hadoop Adminstration with Latest Release (2.0)
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
What is hadoop
Big Data Analytics Chapter3-6@2021.pdf
batch-6 - Manisalyabatch_8_bda - Tejaswini Chowdary.pptx Kumar.pptx
Hadoop
Hadoop and Mapreduce Introduction
Introduction to Big Data and Hadoop
Big data presentation
Big Data & Hadoop Tutorial
Introduction to Big data & Hadoop -I
project report on hadoop
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
Hadoop Tutorial for Big Data Enthusiasts
Hadoop installation by santosh nage
Hadoop Map-Reduce from the subject: Big Data Analytics
Big data

More from Edureka! (20)

PDF
What to learn during the 21 days Lockdown | Edureka
PDF
Top 10 Dying Programming Languages in 2020 | Edureka
PDF
Top 5 Trending Business Intelligence Tools | Edureka
PDF
Tableau Tutorial for Data Science | Edureka
PDF
Python Programming Tutorial | Edureka
PDF
Top 5 PMP Certifications | Edureka
PDF
Top Maven Interview Questions in 2020 | Edureka
PDF
Linux Mint Tutorial | Edureka
PDF
How to Deploy Java Web App in AWS| Edureka
PDF
Importance of Digital Marketing | Edureka
PDF
RPA in 2020 | Edureka
PDF
Email Notifications in Jenkins | Edureka
PDF
EA Algorithm in Machine Learning | Edureka
PDF
Cognitive AI Tutorial | Edureka
PDF
AWS Cloud Practitioner Tutorial | Edureka
PDF
Blue Prism Top Interview Questions | Edureka
PDF
Big Data on AWS Tutorial | Edureka
PDF
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
PDF
Kubernetes Installation on Ubuntu | Edureka
PDF
Introduction to DevOps | Edureka
What to learn during the 21 days Lockdown | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
Tableau Tutorial for Data Science | Edureka
Python Programming Tutorial | Edureka
Top 5 PMP Certifications | Edureka
Top Maven Interview Questions in 2020 | Edureka
Linux Mint Tutorial | Edureka
How to Deploy Java Web App in AWS| Edureka
Importance of Digital Marketing | Edureka
RPA in 2020 | Edureka
Email Notifications in Jenkins | Edureka
EA Algorithm in Machine Learning | Edureka
Cognitive AI Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
Blue Prism Top Interview Questions | Edureka
Big Data on AWS Tutorial | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
Kubernetes Installation on Ubuntu | Edureka
Introduction to DevOps | Edureka

Recently uploaded (20)

PPTX
Cloud computing and distributed systems.
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Unlocking AI with Model Context Protocol (MCP)
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
NewMind AI Monthly Chronicles - July 2025
PPTX
MYSQL Presentation for SQL database connectivity
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
cuic standard and advanced reporting.pdf
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Cloud computing and distributed systems.
Building Integrated photovoltaic BIPV_UPV.pdf
Unlocking AI with Model Context Protocol (MCP)
“AI and Expert System Decision Support & Business Intelligence Systems”
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
NewMind AI Weekly Chronicles - August'25 Week I
NewMind AI Monthly Chronicles - July 2025
MYSQL Presentation for SQL database connectivity
The Rise and Fall of 3GPP – Time for a Sabbatical?
Network Security Unit 5.pdf for BCA BBA.
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
cuic standard and advanced reporting.pdf
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Understanding_Digital_Forensics_Presentation.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Encapsulation_ Review paper, used for researhc scholars
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows

XML Parsing with Map Reduce

  • 1. www.edureka.co/big-data-and-hadoop Hadoop the ultimate data storage And processing Together
  • 2. Slide 2 www.edureka.co/big-data-and-hadoop Objectives Analyze different use-cases where MapReduce is used Differentiate between Traditional way and MapReduce way Learn about Hadoop 2.x MapReduce architecture and components Understand execution flow of YARN MapReduce application Implement basic MapReduce concepts Run a MapReduce Program At the end of this module, you will be able to
  • 3. Slide 3 www.edureka.co/big-data-and-hadoop Where MapReduce is Used? Weather Forecasting HealthCare  Problem Statement: » De-identify personal health information.  Problem Statement: » Finding Maximum temperature recorded in a year.
  • 4. Slide 4 www.edureka.co/big-data-and-hadoop Where MapReduce is Used? MapReduce FeaturesLarge Scale Distributed Model Used in Function Design Pattern Parallel Programming A Program Model Classification Analytics Recommendation Index and Search Map Reduce Classification Eg: Top N records Analytics Eg: Join, Selection Recommendation Eg: Sort Summarization Eg: Inverted Index Implemented Google Apache Hadoop HDFS Pig Hive HBase For
  • 5. Slide 5 www.edureka.co/big-data-and-hadoop The Traditional Way Very Big Data Split Data matches All matches grep grep grep cat grep : matches matches matches Split Data Split Data Split Data
  • 6. Slide 6 www.edureka.co/big-data-and-hadoop MapReduce Way Very Big Data Split Data All matches : Split Data Split Data Split Data M A P R E D U C E MapReduce Framework
  • 7. Slide 7 www.edureka.co/big-data-and-hadoop MapReduce Paradigm The Overall MapReduce Word Count Process Input Splitting Mapping Shuffling Reducing Final Result List(K3,V3) Deer Bear River Dear Bear River Car Car River Deer Car Bear Bear, 2 Car, 3 Deer, 2 River, 2 Deer, 1 Bear, 1 River, 1 Car, 1 Car, 1 River, 1 Deer, 1 Car, 1 Bear, 1 K2,List(V2)List(K2,V2) K1,V1 Car Car River Deer Car Bear Bear, 2 Car, 3 Deer, 2 River, 2 Bear, (1,1) Car, (1,1,1) Deer, (1,1) River, (1,1)
  • 8. Slide 8 www.edureka.co/big-data-and-hadoop Anatomy of a MapReduce Program MapReduce Map: Reduce: (K1, V1) List (K2, V2) (K2, list (V2)) List (K3, V3) Key Value
  • 9. Slide 9 www.edureka.co/big-data-and-hadoop Why MapReduce? Two biggest Advantages: » Taking processing to the data » Processing data in parallel a b c Map Task HDFS Block Data Center Rack Node
  • 10. Slide 10 www.edureka.co/big-data-and-hadoop  ApplicationMaster » One per application » Short life » Coordinates and Manages MapReduce Jobs » Negotiates with Resource Manager to schedule tasks » The tasks are started by NodeManager(s)  Job History Server » Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates  Client » Submits a MapReduce Job  Resource Manager » Cluster Level resource manager » Long Life, High Quality Hardware  Node Manager » One per Data Node » Monitors resources on Data Node Hadoop 2.x MapReduce Components  Container » Created by NM when requested » Allocates certain amount of resources (memory, CPU etc.) on a slave node
  • 11. Slide 11 www.edureka.co/big-data-and-hadoop BATCH (MapReduce) INTERACTIVE (Text) ONLINE (HBase) STREAMING (Storm, S4, …) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) OTHER (Search) (Weave..) http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html YARN – Moving beyond MapReduce
  • 12. Slide 12 www.edureka.co/big-data-and-hadoop MapReduce Application Execution Executing MapReduce Application on YARN
  • 13. Slide 13 www.edureka.co/big-data-and-hadoop YARN MR Application Execution Flow MapReduce Job Execution » Job Submission » Job Initialization » Tasks Assignment » Memory Assignment » Status Updates » Failure Recovery
  • 14. Slide 14 www.edureka.co/big-data-and-hadoop HDFS Application Job Object Client JVM Client Resource Manager Management Node Run Job 2. Get New Application ID 4. Submit Application Context 3. Prepare the Application submit context 3.1 App Jar 3.2 Job Resources(Block locations) 3.3 User Information 1. Notify Start Application YARN MR Application Execution Flow
  • 15. Slide 15 www.edureka.co/big-data-and-hadoop HDFS 3. Prepare the Application submit context 3.1 App Jar 3.2 Job Resources(Block locations) 3.3 User Information Node Manager 5. Start AppMaster container / Allocate Context for AppMaster App Master 6.Alloate Container for AppMaster 7.Request Resources 8.Notify with resources Availability Data Node YARN MR Application Execution Flow Application Job Object Client JVM Client Resource Manager Management Node Run Job 2. Get New Application ID 4. Submit Application Context 1. Notify Start Application
  • 16. Slide 16 www.edureka.co/big-data-and-hadoop HDFS Resource Manager 3. Prepare the Application submit context 3.1 App Jar 3.2 Job Resources(Block locations) 3.3 User Information Management Node Node Manager 5. Start AppMaster container / Allocate Context for AppMaster App Master 6. Allocate Container for AppMaster 7.Request Resources 8.Notify with resources Availability Data Node Client Node Manager Data node-1 Node Manager Map Block 9.Start Container in the worker node Data node-2 Node Manager Map Block 10.NM allocate Container 10.NM allocate Container 2. Get New Application 4. Submit Application 1. Notify Start Application 9.Start Container in the worker node YARN MR Application Execution Flow
  • 17. Slide 17 www.edureka.co/big-data-and-hadoop YARN MR Application Execution Flow 11.Task get Executed. 12.If any reducer in a Job Reducer, again AppMaster Request the Node Manager to start the and Allocate Container 13.Output of All the Maps given to reducer and Reducer get executed 14.Once Job finished, Application Master notify the Resource Manager and Client Library 15.Application Master closed.
  • 18. Slide 18 www.edureka.co/big-data-and-hadoop Hadoop 2.x : YARN Workflow Node Manager Node Manager Node Manager Node Manager Node Manager Node Manager Node Manager Node Manager Node Manager Node Manager Node Manager Node Manager Container 1.2 Container 1.1 Container 2.1 Container 2.2 Container 2.3 App Master 2 App Master 1 Scheduler Applications Manager (AsM) Resource Manager
  • 19. Slide 19 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application Client RM NM AM 1
  • 20. Slide 20 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application 2. RM allocates a container to start AM Client RM NM AM 1 2
  • 21. Slide 21 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application 2. RM allocates a container to start AM 3. AM registers with RM Client RM NM AM 1 2 3
  • 22. Slide 22 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application 2. RM allocates a container to start AM 3. AM registers with RM 4. AM asks containers from RM Client RM NM AM 1 2 3 4
  • 23. Slide 23 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application 2. RM allocates a container to start AM 3. AM registers with RM 4. AM asks containers from RM 5. AM notifies NM to launch containers Client RM NM AM 1 2 3 4 5
  • 24. Slide 24 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application 2. RM allocates a container to start AM 3. AM registers with RM 4. AM asks containers from RM 5. AM notifies NM to launch containers 6. Application code is executed in container Client RM NM AM 1 2 3 4 5 6
  • 25. Slide 25 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application 2. RM allocates a container to start AM 3. AM registers with RM 4. AM asks containers from RM 5. AM notifies NM to launch containers 6. Application code is executed in container 7. Client contacts RM/AM to monitor application’s status Client RM NM AM 1 2 3 4 5 7 6
  • 26. Slide 26 www.edureka.co/big-data-and-hadoop Summary: Application Workflow Execution Sequence : 1. Client submits an application 2. RM allocates a container to start AM 3. AM registers with RM 4. AM asks containers from RM 5. AM notifies NM to launch containers 6. Application code is executed in container 7. Client contacts RM/AM to monitor application’s status 8. AM unregisters with RM Client RM NM AM 1 2 3 4 5 7 8 6
  • 27. Slide 27 www.edureka.co/big-data-and-hadoop Input Splits INPUT DATA Physical Division Logical Division HDFS Blocks Input Splits
  • 28. Slide 28 www.edureka.co/big-data-and-hadoop Relation Between Input Splits and HDFS Blocks 1 2 3 4 5 6 7 8 9 10 11  Logical records do not fit neatly into the HDFS blocks.  Logical records are lines that cross the boundary of the blocks.  First split contains line 5 although it spans across blocks. File Lines Block Boundary Block Boundary Block Boundary Block Boundary Split Split Split
  • 29. Slide 29 www.edureka.co/big-data-and-hadoop MapReduce Job Submission Flow Input data is distributed to nodes Node 1 Node 2 INPUT DATA
  • 30. Slide 30 www.edureka.co/big-data-and-hadoop MapReduce Job Submission Flow Input data is distributed to nodes Each map task works on a “split” of data Map Node 1 Map Node 2 INPUT DATA
  • 31. Slide 31 www.edureka.co/big-data-and-hadoop MapReduce Job Submission Flow Input data is distributed to nodes Each map task works on a “split” of data Mapper outputs intermediate data Map Node 1 Map Node 2 INPUT DATA
  • 32. Slide 32 www.edureka.co/big-data-and-hadoop MapReduce Job Submission Flow Input data is distributed to nodes Each map task works on a “split” of data Mapper outputs intermediate data Data exchange between nodes in a “shuffle” process Map Node 1 Map Node 2 Node 1 Node 2 INPUT DATA
  • 33. Slide 33 www.edureka.co/big-data-and-hadoop MapReduce Job Submission Flow Input data is distributed to nodes Each map task works on a “split” of data Mapper outputs intermediate data Data exchange between nodes in a “shuffle” process Intermediate data of the same key goes to the same reducer Map Node 1 Map Node 2 Reduce Node 1 Reduce Node 2 INPUT DATA
  • 34. Slide 34 www.edureka.co/big-data-and-hadoop MapReduce Job Submission Flow Input data is distributed to nodes Each map task works on a “split” of data Mapper outputs intermediate data Data exchange between nodes in a “shuffle” process Intermediate data of the same key goes to the same reducer Reducer output is stored Map Node 1 Map Node 2 Reduce Node 1 Reduce Node 2 INPUT DATA
  • 35. Slide 35 www.edureka.co/big-data-and-hadoop Combiner Combiner Reducer (B,1) (C,1) (D,1) (E,1) (D,1) (B,1) (D,1) (A,1) (A,1) (C,1) (B,1) (D,1) (B,2) (C,1) (D,2) (E,1) (D,2) (A,2) (C,1) (B,1) (A, [2]) (B, [2,1]) (C, [1,1]) (D, [2,2]) (E, [1]) (A,2) (B,3) (C,2) (D,4) (E,1) Shuffle CombinerMapper Mapper B C D E D B D A A C B D Block1Block2
  • 36. Slide 36 www.edureka.co/big-data-and-hadoop Partitioner – Redirecting Output from Mapper Map Map Map Reducer Reducer Reducer Partitioner Partitioner Partitioner
  • 37. Slide 37 www.edureka.co/big-data-and-hadoop Getting Data to the Mapper Input File Input File Input split Input split Input split Input split RecordReader RecordReader RecordReader RecordReader Mapper Mapper Mapper Mapper (intermediates) (intermediates) (intermediates) (intermediates)
  • 38. Slide 38 www.edureka.co/big-data-and-hadoop Partition and Shuffle Mapper Mapper Mapper Mapper (intermediates) (intermediates) (intermediates) (intermediates) Partitioner Partitioner Partitioner Partitioner (intermediates) (intermediates) (intermediates) Reducer Reducer Reducer
  • 39. Slide 39 www.edureka.co/big-data-and-hadoop Demo of Word Count Program To illustrate Default Input Format (Text Input Format) Demo
  • 40. Slide 40 www.edureka.co/big-data-and-hadoop Input file Input Split Input Split Input Split Record Reader Record Reader Record Reader Mapper Mapper Mapper (Intermediates) (Intermediates) (Intermediates) InputFormat Input Split Record Reader Mapper Input file (Intermediates) Input Format
  • 41. Slide 41 www.edureka.co/big-data-and-hadoop Combine File Input Format<K,V> Text Input Format Key Value Text Input Format Nline Input Format Sequence File Input Format<K,V> File Input Format <K,V> Input Format<K,V> org.apache.hadoop.mapreduce <<interface>> Composable Input Format <K,V> Composite Input Format <K,V> DB Input Format<T> Sequence File As Binary Input Format Sequence File As Text Input Format Sequence File Input Filter<K,V> Input Format – Class Hierarchy
  • 42. Slide 42 www.edureka.co/big-data-and-hadoop Reducer RecordWriter Output file Reducer RecordWriter Output file Reducer RecordWriter Output file OutputFormat Output Format
  • 43. Slide 43 www.edureka.co/big-data-and-hadoop Text Output Format <K,V> Sequence File Output Format<K,V> Output Format <K,V> org.apache.hadoop.mapreduce DB Output Format <K,V> File Output Format <K,V> Null Output Format <K,V> Filter Output Format <K,V> Sequence File As Binary Output Format Lazy Output Format <K,V> Output Format – Class Hierarchy