Hadoop Architecture in Depth

Big Data
Syed
Solutions Engineer - Big Data
mail.syed786@gmail.com
info.syedacademy@gmail.com
+91-9030477368

Need For A New Processing Platform (Big Data)
What is Big Data?
• Twitter (over 7 TB/day)
• Facebook (over 10 TB/day)
• Google (over 20 PB/day)
Where does it come from?
Existing systems (vertical scalability)
Why Hadoop (horizontal scalability)?

Companies Using Hadoop
Yahoo
Google
Facebook
LinkedIn
IBM
Amazon
Hortonworks
Cloudera
NY Times
… the list goes on.

What is Hadoop?
• A flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.
• Written almost entirely in Java.
• Open source, distributed under the Apache License.
• Hadoop core components: HDFS and MapReduce.
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
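
To make "simple programming models" concrete, here is a minimal in-memory sketch of the MapReduce idea using word count. It is plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps on many machines over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

The programmer writes only the map and reduce functions; in this sketch the grouping step stands in for the work the framework does between them.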

What Hadoop Is Not
• A database
• An online transaction processing (OLTP) system
• A replacement for every programming language

Hadoop Introduction and Architecture

Hadoop High-Level Architecture

HDFS - Hadoop Distributed File System
Design of HDFS
Where HDFS is not a good fit
Why Is a Block in HDFS So Large?

NameNode
• A deeper look at the NameNode

• What is a DataNode?
DataNode

How Do We Fix a Single NameNode Failure?

NameNode HA – Shared Storage

Hadoop JournalNode
JournalNode machines are the machines on which you run the JournalNodes. The
JournalNode daemon is relatively lightweight, so it can reasonably be collocated on
machines with other Hadoop daemons, for example the NameNodes, the JobTracker, or
the YARN ResourceManager.
Note: There must be at least 3 JournalNode daemons, because edit log modifications must
be written to a majority of JNs. With 3 JournalNodes, the system can tolerate the failure
of a single machine. You may run more than 3 JournalNodes, but to actually increase the
number of failures the system can tolerate, run an odd number of JNs (3, 5, 7, etc.).
With N JournalNodes, the system can tolerate at most (N - 1) / 2 failures (rounded down)
and continue to function normally.
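
The quorum arithmetic above can be checked with a short sketch. The helper names (majority, tolerated_failures) are hypothetical and exist only for illustration; they are not part of any Hadoop tool.

```python
def majority(n):
    """Smallest number of JournalNodes that forms a majority of n."""
    if n < 1:
        raise ValueError("need at least one JournalNode")
    return n // 2 + 1


def tolerated_failures(n):
    """Failures the quorum survives: (n - 1) / 2, rounded down."""
    if n < 1:
        raise ValueError("need at least one JournalNode")
    return (n - 1) // 2


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(n, "JNs: majority", majority(n), "tolerates", tolerated_failures(n))
```

Going from 3 to 4 JournalNodes raises the majority from 2 to 3 but still tolerates only 1 failure, which is why an odd count is recommended.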

Hadoop 1
• Limited to about 4,000 nodes per cluster
• O(# of tasks in a cluster)
• JobTracker bottleneck: resource management, job scheduling, and monitoring
• Only one namespace for managing HDFS
• Map and reduce slots are static
• MapReduce is the only type of job that can run

Hadoop 2
• Potentially up to 10,000 nodes per cluster
• O(cluster size)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• Backward and forward compatible with MRv1
• Any application can integrate with Hadoop
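
One way to see why static map and reduce slots waste capacity is a toy model, sketched below under strong assumptions: every task needs exactly one slot or one container, and the function names (run_static, run_pooled) are hypothetical. Real YARN schedulers allocate containers by resources such as memory and CPU, so this is an illustration of the idea, not of the actual scheduler.

```python
def run_static(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Hadoop 1 style: a map slot can only run a map task, a reduce slot only a reduce task."""
    return min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)


def run_pooled(containers, map_tasks, reduce_tasks):
    """Simplified Hadoop 2 style: any free container can host any task."""
    return min(containers, map_tasks + reduce_tasks)


if __name__ == "__main__":
    # Ten units of capacity and ten pending map tasks, no reduce tasks yet.
    print("static slots running:", run_static(6, 4, 10, 0))
    print("pooled containers running:", run_pooled(10, 10, 0))
```

In this model the static layout leaves the four reduce slots idle during a map-heavy phase, while the pooled layout can use all ten units of capacity.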

Thank you!
www.syedacademy.com
