Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka

EDUREKA HADOOP CERTIFICATION TRAINING www.edureka.co/big-data-and-hadoop

Agenda
➢ Big Data Growth Drivers
➢ What is Big Data?
➢ Hadoop Introduction
➢ Hadoop Master/Slave Architecture
➢ Hadoop Core Components
➢ HDFS Data Blocks
➢ HDFS Read/Write Mechanism
➢ What is MapReduce
➢ MapReduce Program
➢ MapReduce Job Workflow
➢ Hadoop Ecosystem
➢ Hadoop Use Case: Analyzing Olympic Dataset

Big Data Growth Drivers

Data Generated Every 60 Seconds:
➢ Users like 4,166,667 posts
➢ Users send 347,222 tweets
➢ Users cast 18,327 votes
➢ Users like 1,736,111 posts
➢ Users upload 300 hours of new video

Global Mobile Data Traffic, 2015 to 2020
Cisco forecasts 30.6 exabytes per month of mobile data traffic by 2020.
3 major trends contributing to the growth of mobile data traffic:
➢ Adapting to Smarter Mobile Devices
➢ Defining Cell Network Advances: 2G, 3G, and 4G (5G Perspectives)
➢ Reviewing Tiered Pricing: Unlimited Data and Shared Plans
Source: http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html

What is Big Data?

“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
The five Vs of Big Data:
➢ Volume: Processing increasingly huge data sets
➢ Variety: Processing different types of data
➢ Velocity: Data is being generated at an alarming rate
➢ Value: Finding the correct meaning in the data
➢ Veracity: Uncertainty and inconsistencies in the data

Let us understand the problems with Big Data and traditional systems with a story

Story of Big Data & Traditional System
Scenario:
Bob has opened a small restaurant in his city

Traditional Scenario
➢ Restaurant: a single cook and one food shelf handle 2 orders per hour
➢ Traditional processing system (RDBMS): data is generated at a steady rate and is structured in nature

Failure of Traditional System
Scenario 2:
➢ They started taking online orders
➢ 10 orders per hour
➢ Still a single cook (regular computing system) and one food shelf (data)
Big Data Scenario:
Heterogeneous data is being generated at an alarming rate by multiple sources, and the traditional processing system (RDBMS) cannot keep up

Issue 1: Too Many Orders Per Hour
Solution: Hiring Multiple Cooks

Need of an Effective Solution
Restaurant scenario:
Multiple cooks cooking food from a single food shelf (data)
Issue:
The food shelf becomes the BOTTLENECK

Need of an Effective Solution
Data warehouse scenario:
Multiple processing units for data processing
Issue:
Bringing the data to the processing units generates a lot of network overhead

Issue 2: Food Shelf becomes the Bottleneck
Solution: Distributed and Parallel Approach

Effective Solution
➢ Distributed food shelf: the ingredients are spread across several shelves
➢ Map: one cook cooks the meat while another cooks the sauce
➢ Reduce: the parts are assembled to cook the final order (meat sauce)

Need of a Framework
Do we have a framework that works like that?

Apache Hadoop: Framework to Process Big Data
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion
➢ Storage: Distributed file system
➢ Processing: Allows parallel & distributed processing

Hadoop: Master/Slave Architecture

Scenario:
A project manager manages a team of four employees: Bob, Alice, John, and James. He assigns a project to each of them and tracks the progress.

Initial assignment, which the manager records as metadata:
➢ John: A
➢ James: B
➢ Bob: C
➢ Alice: D

Updated assignment, with some projects now held by more than one employee:
➢ John: A, C
➢ James: B
➢ Bob: C, A
➢ Alice: D, A

In Hadoop terms, the project manager is the master node, which keeps the metadata, and the four employees are the slave nodes, which hold the projects.

Hadoop Core Components
➢ Storage: Distributed file system
➢ Processing: Allows parallel & distributed processing

HDFS Core Components:
01 NameNode
02 DataNode
03 Secondary NameNode

NameNode & DataNode
NameNode:
➢ Maintains and manages DataNodes
➢ Records metadata, i.e. information about data blocks, e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc.
➢ Receives heartbeat and block report from all the DataNodes
DataNode:
➢ Slave daemons
➢ Stores actual data
➢ Serves read and write requests from the clients

Secondary NameNode & Checkpointing
[Figure: checkpointing between the NameNode and the Secondary NameNode. Labels: editLog, fsImage, editLog (new), FsImage (final), first-time copy, temporary, during checkpoint]
➢ Checkpointing is the process of combining edit logs with the FsImage
➢ The Secondary NameNode takes over the responsibility of checkpointing, thereby making the NameNode more available
➢ Allows faster failover, as it prevents the edit logs from growing too large
➢ Checkpointing happens periodically (default: 1 hour)
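
The idea of combining edit logs with the FsImage can be illustrated with a toy model. This is only a sketch: the dictionary-based namespace, the `create`/`delete` operation names, and the function `apply_edits` are assumptions made for illustration and do not reflect the real HDFS on-disk formats.

```python
# Toy model of checkpointing: merge an edit log into an fsimage snapshot.
# The data structures and operation names are illustrative assumptions.

def apply_edits(fsimage, edit_log):
    """Return a new namespace snapshot with every logged edit applied in order."""
    merged = dict(fsimage)  # copy, so the original snapshot is left untouched
    for op, path, size_mb in edit_log:
        if op == "create":
            merged[path] = size_mb
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Namespace at the last checkpoint: path -> file size in MB.
fsimage = {"/logs/a.txt": 380, "/logs/b.txt": 64}

# Changes recorded since that checkpoint.
edit_log = [
    ("create", "/logs/c.txt", 200),
    ("delete", "/logs/b.txt", None),
]

# The checkpoint yields the final fsimage; a fresh, empty edit log takes over.
final_fsimage = apply_edits(fsimage, edit_log)
new_edit_log = []

print(final_fsimage)
```

Because the merged snapshot already contains every logged change, a restart only has to replay the short new edit log, which is why keeping the log small speeds up recovery.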

How is the data actually stored in DataNodes?

HDFS Data Blocks
➢ Each file is stored on HDFS as blocks
➢ The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x)
➢ Example: a 380 MB file is split into three blocks, Blk 1 (128 MB), Blk 2 (128 MB), and Blk 3 (124 MB), which are stored on the DataNodes
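
The block arithmetic for the 380 MB file can be checked with a few lines of Python. The function name `split_into_blocks` is hypothetical; it only reproduces the size calculation and does not touch HDFS.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the list of block sizes (in MB) for a file of the given size."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data that is left.
        blocks.append(remainder)
    return blocks


print(split_into_blocks(380))      # 128 MB blocks, as in Hadoop 2.x
print(split_into_blocks(380, 64))  # 64 MB blocks, as in Hadoop 1.x
```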

Fault Tolerance: How Does Hadoop Cope with DataNode Failure?

Fault Tolerance
Scenario:
One of the DataNodes containing the data blocks (128 MB, 128 MB, 124 MB) has crashed

Solution: Replication Factor

Fault Tolerance: Replication Factor
Solution:
Each data block is replicated (three times by default), and the replicas are distributed across different DataNodes
As the saying goes: never put all your eggs in the same basket
[Figure: a NameNode and four DataNodes holding replicated copies of the 128 MB, 128 MB, and 124 MB blocks]
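
A small simulation shows why replication protects against a single DataNode failure. This is a sketch under stated assumptions: the round-robin placement below is a simplification chosen for illustration, not the placement policy HDFS actually uses, and the names `place_replicas`, `surviving_blocks`, `dn1` to `dn4`, and `blk1` to `blk3` are hypothetical.

```python
def place_replicas(block_ids, datanodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct DataNodes (round-robin)."""
    if replication_factor > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication_factor)
        ]
    return placement


def surviving_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica after a node fails."""
    return {
        block
        for block, nodes in placement.items()
        if any(node != failed_node for node in nodes)
    }


datanodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["blk1", "blk2", "blk3"], datanodes)

for block, nodes in placement.items():
    print(block, nodes)

# Even if dn1 crashes, every block is still readable from another DataNode.
print(sorted(surviving_blocks(placement, "dn1")))
```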

HDFS Write Mechanism

HDFS Write Mechanism – Pipeline Setup

HDFS Write Mechanism – Writing a Block

HDFS Write Mechanism - Acknowledgement

HDFS Multi-Block Write Mechanism
For Block A: 1A -> 2A -> 3A -> 4A
For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B

HDFS Read Mechanism

Let us understand MapReduce with a story

Story of MapReduce
➢ Each student has to count the occurrences of the word "Julius" in the whole book
➢ Answers: 45, 45, 45, and 46; the majority of the students have answered 45
➢ Time: 4 hours

Story of MapReduce
➢ Map: Each student counts the number of occurrences in one chapter (Ch. 1 to Ch. 4), all in parallel, taking 1 hr. each
➢ Reduce: The professor sums up the answers given by the students to get the final output: 12 + 8 + 14 + 11 = 45
➢ Total time: 1 hr. + 2 mins.
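
The story can be mimicked in plain Python. This is an illustrative sketch, not Hadoop code: the chapter texts are made up so that the per-chapter counts match the numbers in the story, and a thread pool stands in for the four students working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def count_word(chapter_text, word="Julius"):
    """Map step: count how often `word` occurs in one chapter."""
    return chapter_text.split().count(word)


# Made-up chapter texts containing the word 12, 8, 14, and 11 times.
chapters = [
    "Julius spoke first " * 12,
    "then Julius left " * 8,
    "Julius came back " * 14,
    "and Julius stayed " * 11,
]

# Map: each chapter is counted independently, so the work can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_chapter = list(pool.map(count_word, chapters))

# Reduce: the partial counts are summed into the final answer.
total = sum(per_chapter)

print(per_chapter, total)
```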

What is MapReduce?
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment

MapReduce Word Count Program
Three Major Parts of a MapReduce Program:
➢ Driver Code: You specify all the job configurations here, such as the job name, input path, output path, etc.
➢ Mapper Code: You write the mapper logic here, i.e. how the map task will process the data to produce the key-value pairs to be aggregated
➢ Reducer Code: You write the reducer logic here, which combines the intermediate key-value pairs generated by the Mapper to give the final aggregated output

Mapper Code
Type parameters, in order: Mapper key input type (byte offset), Mapper value input type, Mapper key output type, Mapper value output type
Mapper Input:
➢ The key is the byte offset of each line in the text file: LongWritable
➢ The value is each individual line: Text
Mapper Output:
➢ The key is each tokenized word: Text
➢ The value is hardcoded to 1 in our case: IntWritable
➢ Example: Dear 1, Bear 1, etc.
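
The mapper's input and output pairs can be sketched in plain Python. This is not the Hadoop Java API: `read_records` and `mapper` are hypothetical helper names, and the three-line input text is an assumed example chosen to match the "Dear 1, Bear 1" output above.

```python
def read_records(text):
    """Yield (byte offset, line) pairs, the shape of input a word-count mapper receives."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))  # offsets are counted in bytes


def mapper(byte_offset, line):
    """Emit (word, 1) for every token in the line; the offset key is not used."""
    for word in line.split():
        yield word, 1


# Assumed sample input file.
text = "Dear Bear River\nCar Car River\nDear Car Bear\n"

records = list(read_records(text))
pairs = [pair for offset, line in records for pair in mapper(offset, line)]

print(records)
print(pairs)
```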

Reducer Code
Type parameters, in order: Reducer key input type, Reducer value input type, Reducer key output type, Reducer value output type
Reducer Input:
➢ The keys are the unique words generated after the sorting and shuffling phase: Text
➢ The value is a list of integers corresponding to each key: IntWritable
➢ Example: Bear, [1, 1], etc.
Reducer Output:
➢ The key is each unique word present in the input text file: Text
➢ The value is the number of occurrences of each unique word: IntWritable
➢ Example: Bear, 2; Car, 3, etc.
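
The shuffle-and-sort step and the reducer can be sketched in plain Python as well. Again this is only an illustration and not the Hadoop Java API: `shuffle_and_sort` and `reducer` are hypothetical names, and the mapper output list is an assumed example consistent with "Bear, [1, 1]" and "Bear, 2; Car, 3" above.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group the mapper's values by key and return the groups sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Sum the list of counts for one word."""
    return key, sum(values)


# Assumed mapper output for a three-line input file.
mapper_output = [
    ("Dear", 1), ("Bear", 1), ("River", 1),
    ("Car", 1), ("Car", 1), ("River", 1),
    ("Dear", 1), ("Car", 1), ("Bear", 1),
]

reducer_input = shuffle_and_sort(mapper_output)
reducer_output = [reducer(key, values) for key, values in reducer_input]

print(reducer_input)
print(reducer_output)
```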

Driver Code
In the driver class, we set the configuration of our MapReduce job to run in Hadoop:
➢ Specify the name of the job and the data types of the input/output of the mapper and reducer
➢ Specify the names of the mapper and reducer classes
➢ Specify the paths of the input and output folders
➢ The method setInputFormatClass() is used for specifying the unit of work for the mapper
➢ The main() method is the entry point for the driver

YARN Components
ResourceManager:
➢ Master daemon that manages all other daemons & accepts job submissions
➢ Allocates the first container for the AppMaster
NodeManager:
➢ Responsible for containers, monitoring their resource usage (CPU, memory, disk, network) & reporting the same to the RM
AppMaster:
➢ One per application
➢ Coordinates and manages MR jobs
➢ Negotiates resources from the RM
Container:
➢ Allocates a certain amount of resources (memory, CPU, etc.) on a slave node (NM)

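MapReduce Job Workflow
[Figure: MapReduce job workflow. Map side: input split -> map -> in-memory buffer -> partition, sort, and spill to disk -> merge into partitions. Reduce side: fetch partitions from this map and from other maps -> merge -> reduce phase]
➢ mapreduce.task.io.sort.mb (default: 100 MB): size of the in-memory buffer used while sorting map output
➢ mapreduce.task.io.sort.factor (default: 10): number of streams merged at once while sorting files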
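
The buffer, spill, and merge steps can be imitated in a few lines of Python. This is a rough sketch under stated assumptions: `buffer_limit` plays the role of `mapreduce.task.io.sort.mb` but counts records rather than megabytes, `merge_factor` plays the role of `mapreduce.task.io.sort.factor`, partitioning is left out, and the function names are hypothetical.

```python
import heapq


def map_side_spills(pairs, buffer_limit):
    """Buffer records; whenever the buffer is full, sort it and spill it as one run."""
    spills, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_limit:
            spills.append(sorted(buffer))
            buffer = []
    if buffer:
        spills.append(sorted(buffer))  # final, partially filled buffer
    return spills


def merge_runs(runs, merge_factor):
    """Merge sorted runs, at most `merge_factor` at a time, until one run remains."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    runs = list(runs)
    while len(runs) > 1:
        batch, runs = runs[:merge_factor], runs[merge_factor:]
        runs.append(list(heapq.merge(*batch)))
    return runs[0] if runs else []


pairs = [("River", 1), ("Bear", 1), ("Dear", 1), ("Car", 1),
         ("Bear", 1), ("Car", 1), ("Dear", 1)]

spills = map_side_spills(pairs, buffer_limit=3)
merged = merge_runs(spills, merge_factor=2)

print(spills)
print(merged)
```

A smaller buffer produces more spill runs, and a smaller merge factor needs more merge passes to combine them, which is the trade-off the two properties control.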

YARN Architecture
[Figure: YARN architecture. A client talks to the Resource Manager; three Node Managers each host an App Master and a container. Arrows: Node Status, Resource Request, MapReduce Status]

Hadoop Architecture: HDFS & YARN
[Figure: NameNode, Secondary NameNode, and ResourceManager at the top; below them, four machines that each run a DataNode together with a NodeManager]

Hadoop Cluster
[Figure: a Hadoop cluster of master and slave machines. Three racks (Rack 1, Rack 2, Rack 3), each with its own switch connected to a core switch; the NameNode and Secondary NameNode are master machines, and the remaining machines are slave nodes]

Hadoop Cluster Modes
Standalone (or Local) Mode
➢ No daemons, everything runs in a single JVM
➢ Suitable for running MapReduce programs during development
➢ Has no DFS or Distributed File System
Pseudo Distributed Mode
➢ All Hadoop daemons run on the local machine
Multi-Node Cluster Mode
➢ Hadoop daemons run on a cluster of machines

Hadoop Ecosystem

Hadoop Use Case: Analyzing Olympic Dataset
Problem statement:
➢ Find the top 10 countries that won the most medals
➢ Find the total number of gold medals won by each country
➢ Which countries have won the most medals in swimming?

Dataset Description
The data set consists of the following fields:
➢ Athlete: The name of the athlete
➢ Age: The age of the athlete
➢ Country: The name of the country that participated in the Olympics
➢ Year: The year of the Olympics
➢ Closing Date: The date of the closing ceremony
➢ Sport: The name of the sport
➢ Gold Medals: No. of gold medals
➢ Silver Medals: No. of silver medals
➢ Bronze Medals: No. of bronze medals
➢ Total Medals: Total no. of medals
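
The three questions in the problem statement can be prototyped in plain Python before being written as Hadoop jobs. This is only a sketch: the rows below are made-up placeholder records (not real Olympic results), the date format is an assumption, and the helper `sum_by_country` is a hypothetical name.

```python
from collections import Counter

FIELDS = ("athlete", "age", "country", "year", "closing_date",
          "sport", "gold", "silver", "bronze", "total")

# Made-up placeholder rows that follow the field order described above.
rows = [
    ("Athlete A", 23, "Country X", 2008, "2008-08-24", "Swimming", 2, 1, 0, 3),
    ("Athlete B", 27, "Country Y", 2008, "2008-08-24", "Swimming", 1, 0, 0, 1),
    ("Athlete C", 30, "Country X", 2012, "2012-08-12", "Athletics", 0, 1, 1, 2),
    ("Athlete D", 21, "Country Z", 2012, "2012-08-12", "Swimming", 0, 0, 1, 1),
    ("Athlete E", 25, "Country Y", 2012, "2012-08-12", "Athletics", 1, 1, 0, 2),
]
records = [dict(zip(FIELDS, row)) for row in rows]


def sum_by_country(records, field, sport=None):
    """Sum one medal field per country, optionally restricted to a single sport."""
    totals = Counter()
    for rec in records:
        if sport is None or rec["sport"] == sport:
            totals[rec["country"]] += rec[field]
    return totals


top_countries = sum_by_country(records, "total").most_common(10)
gold_by_country = dict(sum_by_country(records, "gold"))
swimming = sum_by_country(records, "total", sport="Swimming")

print(top_countries)
print(gold_by_country)
print(swimming.most_common(1))
```

Each query follows the word-count pattern: the country acts as the key, the medal count as the value, and the sum is the reduce step.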

Demo

Thank You …
Questions/Queries/Feedback
