BigData
Syed
Solutions Engineer - Big Data
mail.syed786@gmail.com
info.syedacademy@gmail.com
+91-9030477368
Need For A New Processing Platform (Big Data)
What is Big Data?
• Twitter (over 7 TB/day)
• Facebook (over 10 TB/day)
• Google (over 20 PB/day)
Where does it come from?
Existing systems (vertical scalability)
Why Hadoop (horizontal scalability)?
Companies Using Hadoop
• Yahoo
• Google
• Facebook
• LinkedIn
• IBM
• Amazon
• HortonWorks
• Cloudera
• NY Times
• … the list goes on.
What is Hadoop?
• Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.
• Written primarily in Java.
• Open source and distributed under the Apache License.
• Hadoop core components: HDFS and MapReduce.
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
What Hadoop Is Not
• A database
• An online transaction processing (OLTP) system
• A replacement for all programming languages
Hadoop Introduction and Architecture
Hadoop High-Level Architecture
HDFS - Hadoop Distributed File System
Design of HDFS
Where HDFS is not a good fit
Why Is a Block in HDFS So Large?
HDFS Architecture
Let us Zoom into HDFS
NameNode
• Deeper Things about the NameNode
Secondary NameNode
DataNode
• What is a DataNode?
NameNode and DataNodes
Feature Matrix
Yahoo Study
Still need to be fixed
How Do We Fix a Single NameNode
Feature
HDFS Architecture
NameNode HA(V2)
NameNode HA – Shared Storage
NameNode HA
HDFS Federation
Hadoop JournalNode
JournalNode machines are the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager.
Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JournalNodes (JNs). This allows the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but to actually increase the number of failures the system can tolerate, you should run an odd number of JNs (i.e., 3, 5, 7, etc.). When running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures (rounded down) and continue to function normally.
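The majority rule above can be checked with a few lines of arithmetic. The following is a minimal sketch, not Hadoop code; the function name `journalnode_quorum` is made up for illustration.

```python
from typing import Tuple


def journalnode_quorum(n: int) -> Tuple[int, int]:
    """Return (majority_size, tolerated_failures) for n JournalNodes.

    An edit must reach a majority of JNs, so the cluster keeps working
    as long as at most (n - 1) // 2 JNs are down.
    """
    if n < 1:
        raise ValueError("need at least one JournalNode")
    majority = n // 2 + 1
    tolerated = (n - 1) // 2
    return majority, tolerated


for n in (3, 4, 5, 6, 7):
    majority, tolerated = journalnode_quorum(n)
    print(f"{n} JNs: majority={majority}, tolerated failures={tolerated}")
```

Going from 3 to 4 JournalNodes raises the majority size without raising the number of tolerated failures, which is why an odd count is recommended.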
Hadoop 1
Limited to about 4,000 nodes per cluster
Scalability is O(number of tasks in a cluster)
JobTracker is a bottleneck: it handles resource management, job scheduling, and monitoring
Only one namespace for managing HDFS
Map and Reduce slots are static
MapReduce is the only kind of job that can run
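Hadoop 1 - Reading Files
[Diagram: The Hadoop Client sends a read-file request to the NameNode, which returns the DataNodes, block IDs, etc. The client then reads the blocks from the DataNodes. Each worker runs a DataNode and a TaskTracker (DN | TT), spread across Rack1, Rack2, Rack3 ... RackN. DataNodes send heartbeats and block reports to the NameNode, and the Secondary NameNode (SNameNode) checkpoints the NameNode's fsimage/edit files.]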
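The read flow in this slide (ask the NameNode for block locations, then fetch the blocks from DataNodes) can be sketched as a toy model. This is an in-memory illustration only: `ToyNameNode`, `ToyDataNode`, `read_file`, the file path, and the block IDs are all hypothetical and are not Hadoop APIs.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self):
        # path -> ordered list of (block_id, [names of DataNodes with a replica])
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Holds the actual block bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes
        self.alive = True

    def read_block(self, block_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client side: one metadata lookup, then block reads from the DataNodes."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        for dn_name in replicas:
            try:
                data += datanodes[dn_name].read_block(block_id)
                break  # this block is done, move on to the next one
            except ConnectionError:
                continue  # try the next replica
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data


# Hypothetical cluster state: two blocks, two replicas each.
dns = {name: ToyDataNode(name) for name in ("dn1", "dn2", "dn3")}
nn = ToyNameNode()
nn.block_map["/logs/a.txt"] = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]
dns["dn1"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks["blk_2"] = b"world"
dns["dn3"].blocks["blk_2"] = b"world"

dns["dn1"].alive = False  # simulate a failed DataNode
result = read_file(nn, dns, "/logs/a.txt")
print(result)
```

The file data never passes through the NameNode in this model; it only answers the location query, and the client falls back to another replica when a DataNode is unavailable.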
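Hadoop 1 - Writing Files
[Diagram: The Hadoop Client sends a write request to the NameNode, which returns the target DataNodes, etc. The client writes blocks to the DataNodes, which replicate them through a pipeline (replication pipelining). Workers (DN | TT) are spread across Rack1, Rack2, Rack3 ... RackN and send block reports to the NameNode; the Secondary NameNode (SNameNode) checkpoints the NameNode's fsimage/edit files.]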
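Replication pipelining, as labelled in the write slide, means each DataNode stores the block and forwards it to the next node in the list returned by the NameNode. The sketch below is a simplified in-memory model; `pipeline_write`, the node names, and the block ID are hypothetical, and real HDFS streams data in packets rather than whole blocks.

```python
def pipeline_write(block_id, data, pipeline, storage):
    """Store the block on the first node, then forward it down the pipeline.

    Returns the nodes that acknowledged the write, in the order the
    acknowledgements travel back (last node in the pipeline first).
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, {})[block_id] = data  # local write on this node
    acks = pipeline_write(block_id, data, rest, storage)  # forward downstream
    acks.append(head)  # acknowledge only after downstream nodes have acked
    return acks


# Hypothetical pipeline, as if returned by the NameNode for one block.
storage = {}
acks = pipeline_write("blk_1", b"payload", ["dn1", "dn4", "dn5"], storage)
print(acks)
print(sorted(storage))
```

The client sends the block once, to the first DataNode, and the replicas are produced by the DataNodes themselves.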
Hadoop 2
Potentially up to 10,000 nodes per cluster
Scalability is O(cluster size)
Supports multiple namespaces for managing HDFS
Efficient cluster utilization (YARN)
Backward and forward compatible with MRv1
Any application can integrate with Hadoop
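One way to see why static Map and Reduce slots waste capacity, and why a shared resource pool improves utilization, is a small counting model. This is a deliberately simplified sketch: the function names and the slot and container counts are assumptions for illustration, and YARN actually allocates containers by memory and CPU rather than by a fixed count.

```python
def running_with_static_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Hadoop 1 style: a map task can only use a map slot, a reduce task a reduce slot."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def running_with_shared_containers(map_tasks, reduce_tasks, containers):
    """Simplified Hadoop 2 style: any pending task can use any free container."""
    return min(map_tasks + reduce_tasks, containers)


# A map-heavy phase: six map tasks waiting, no reduce tasks yet.
pending_maps, pending_reduces = 6, 0
static = running_with_static_slots(pending_maps, pending_reduces, map_slots=4, reduce_slots=2)
shared = running_with_shared_containers(pending_maps, pending_reduces, containers=6)
print(static, shared)
```

With the same total capacity, the static layout leaves the two reduce slots idle during the map phase, while the shared pool runs all six tasks.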
Hadoop 2 - Basics
Hadoop 2 - Reading Files (w/ NN Federation)
[Diagram: The Hadoop Client sends a read-file request to one of the federated NameNodes (NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4), which returns the DataNodes, block IDs, etc. The client then reads the blocks from the DataNodes. Each worker runs a DataNode and a NodeManager (DN | NM), spread across Rack1, Rack2, Rack3 ... RackN. DataNodes register with the NameNodes and send heartbeats and block reports. Each NameNode has either a Secondary NameNode (checkpoint from an fsimage/edit copy) or a Backup NameNode (fs sync and checkpoint).]
Block Pools: ns1 on dn1, dn2; ns2 on dn1, dn3; ns3 on dn4, dn5; ns4 on dn4, dn5
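With federation, a client first has to decide which namespace (and therefore which NameNode) owns a path, and each DataNode can hold blocks from more than one block pool. The sketch below models both ideas. The mount-table paths are hypothetical, the helper names are made up, and the block-pool placement simply copies the labels from the slide.

```python
# Hypothetical client-side mount table: path prefix -> namespace.
MOUNT_TABLE = {"/user": "ns1", "/data": "ns2", "/tmp": "ns3", "/apps": "ns4"}

# Block pool placement as labelled in the slide: namespace -> DataNodes.
BLOCK_POOLS = {
    "ns1": ["dn1", "dn2"],
    "ns2": ["dn1", "dn3"],
    "ns3": ["dn4", "dn5"],
    "ns4": ["dn4", "dn5"],
}


def resolve_namespace(path, mount_table):
    """Pick the namespace whose mount point is the longest matching prefix of path."""
    matches = [
        prefix
        for prefix in mount_table
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no namespace is mounted for {path}")
    return mount_table[max(matches, key=len)]


def pools_on_datanode(datanode, block_pools):
    """List the namespaces whose block pools have blocks on this DataNode."""
    return sorted(ns for ns, dns in block_pools.items() if datanode in dns)


print(resolve_namespace("/user/syed/file.txt", MOUNT_TABLE))
print(pools_on_datanode("dn1", BLOCK_POOLS))
```

Each NameNode manages only its own namespace, so adding a namespace spreads the metadata load, while the DataNodes remain shared storage for several block pools.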
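Hadoop 2 - Writing Files
[Diagram: The Hadoop Client sends a write request to one of the federated NameNodes (NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4), which returns the target DataNodes, etc. The client writes blocks to the DataNodes, which replicate them through a pipeline (replication pipelining). Workers (DN | NM) are spread across Rack1, Rack2, Rack3 ... RackN and send block reports to the NameNodes. Each NameNode has either a Secondary NameNode (checkpoint from an fsimage/edit copy) or a Backup NameNode (fs sync and checkpoint).]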
Thank you!
www.syedacademy.com


Hadoop Architecture in Depth
