Ali Bahu
10/17/2012
APACHE HADOOP
INTRODUCTION
 Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It enables applications to work with thousands of computationally independent computers and petabytes of data.
 Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
 Hadoop is implemented in Java.
WHY HADOOP
 Need to process multi-petabyte datasets
 It is expensive to build reliability into each application
 Node failure is expected, and Hadoop handles it
 The number of nodes is not constant
 Efficient, reliable, open source (Apache license)
 Workloads are I/O-bound, not CPU-bound
WHO USES HADOOP
 Amazon/A9
 Facebook
 Google
 IBM
 Joost
 Last.fm
 New York Times
 PowerSet
 Veoh
 Yahoo!
COMMODITY HARDWARE
 Typically a two-level architecture
 Nodes are commodity PCs
 30-40 nodes per rack
 Uplink from each rack is 3-4 gigabit
 Rack-internal links are one gigabit
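The numbers above imply heavy oversubscription at the rack boundary, which is why Hadoop tries to keep traffic inside a rack. A quick back-of-the-envelope check (taking the upper end of both ranges):

```python
# Rack oversubscription, using the figures above.
nodes_per_rack = 40        # upper end of the 30-40 range
node_link_gbit = 1         # rack-internal links are one gigabit
uplink_gbit = 4            # upper end of the 3-4 gigabit uplink

intra_rack_capacity = nodes_per_rack * node_link_gbit   # 40 Gbit aggregate
oversubscription = intra_rack_capacity / uplink_gbit

print(oversubscription)  # 10.0, i.e. a 10:1 oversubscribed uplink
```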
HDFS ARCHITECTURE
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
 Very large distributed file system
 10K nodes, 100 million files, 10 PB
 Assumes Commodity Hardware
 Files are replicated to handle hardware failure
 Detects failures and recovers from them
 Optimized for batch processing
 Data locations are exposed so that computation can move to where the data resides
 Provides very high aggregate bandwidth
 Runs in user space on heterogeneous operating systems
 Single Namespace for entire cluster
 Data Coherency
 Write-once-read-many access model
 Client can only append to existing files
 Files are broken up into blocks
 Typically 128 MB block size
 Each block replicated on multiple Data Nodes
 Intelligent Client
 Client can find location of blocks
 Client accesses data directly from Data Node
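The block model above is easy to quantify. A small sketch, using the typical 128 MB block size and a default replication factor of 3 (the file size is illustrative):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # typical HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

def block_count(file_size: int) -> int:
    """Number of blocks a file of the given size occupies."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

file_size = 1 * 1024**3                  # a 1 GB file
blocks = block_count(file_size)          # 8 blocks
stored = blocks * REPLICATION            # 24 block replicas cluster-wide
print(blocks, stored)
```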
NAME NODE METADATA
 Meta-data in Memory
 The entire metadata is in main memory
 No demand paging of meta-data
 Types of Metadata
 List of files
 List of Blocks for each file
 List of Data Nodes for each block
 File attributes, e.g. creation time, replication factor, etc.
 Transaction Log
 Records file creations, file deletions, etc.
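A toy model of the in-memory metadata listed above (the class and field names here are illustrative, not Hadoop's actual data structures): the whole namespace lives in main memory, and every mutation is also recorded in the transaction log.

```python
from dataclasses import dataclass, field

@dataclass
class BlockInfo:
    block_id: int
    datanodes: list          # Data Nodes currently holding a replica

@dataclass
class FileMeta:
    path: str
    replication: int         # file attribute: replication factor
    blocks: list = field(default_factory=list)

# The entire namespace is kept in main memory, keyed by path.
namespace = {}
f = FileMeta(path="/logs/access.log", replication=3)
f.blocks.append(BlockInfo(block_id=1, datanodes=["dn1", "dn2", "dn3"]))
namespace[f.path] = f

# Every change is also appended to the transaction log.
edit_log = [("create", f.path)]
```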
DATA NODE
Block Server
 Stores data in the local file system (e.g. ext3)
 Stores meta-data of a block (e.g. CRC)
 Serves data and meta-data to Clients
Block Report
 Periodically sends a report of all existing blocks to
the Name Node
Facilitates Pipelining of Data
 Forwards data to other specified Data Nodes
BLOCK PLACEMENT
 Block Placement Strategy
 One replica on local node
 Second replica on a remote rack
 Third replica on same remote rack
 Additional replicas are randomly placed
 Clients read from nearest replica
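The placement strategy above can be sketched as follows. This is a simplified model with an invented three-rack topology; real HDFS also weighs node load and free space when choosing targets.

```python
import random

# rack name -> nodes in that rack (illustrative topology)
RACKS = {
    "rack1": ["r1n1", "r1n2", "r1n3"],
    "rack2": ["r2n1", "r2n2", "r2n3"],
    "rack3": ["r3n1", "r3n2", "r3n3"],
}

def place_replicas(local_node, local_rack, replication=3):
    """One replica on the local node, the second and third on the same
    remote rack, any extras placed randomly - mirroring the policy above."""
    targets = [local_node]
    remote_rack = random.choice([r for r in RACKS if r != local_rack])
    targets += random.sample(RACKS[remote_rack], 2)
    while len(targets) < replication:
        candidate = random.choice(sum(RACKS.values(), []))
        if candidate not in targets:
            targets.append(candidate)
    return targets[:replication]

print(place_replicas("r1n1", "rack1"))
```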
DATA CORRECTNESS
 Use Checksums to validate data
 Use CRC32, etc.
 File creation
 Client computes a checksum per 512 bytes
 Data Node stores the checksum
 File access
 Client retrieves the data and checksum from the Data Node
 If validation fails, the client tries other replicas
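The per-chunk checksum scheme can be illustrated with the standard-library CRC-32 (HDFS's on-disk checksum format differs in detail; this only shows the idea):

```python
import zlib

CHUNK = 512  # checksum granularity described above

def chunk_checksums(data: bytes):
    """CRC32 per 512-byte chunk, as computed at file-creation time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, checksums):
    """On read, the client recomputes and compares; on mismatch it
    would fall back to another replica."""
    return chunk_checksums(data) == checksums

payload = b"x" * 1300                 # spans three chunks
sums = chunk_checksums(payload)
assert verify(payload, sums)
corrupted = payload[:-1] + b"y"       # flip the last byte
assert not verify(corrupted, sums)
```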
NAMENODE FAILURE
 The Name Node is a single point of failure
 Transaction Log stored in multiple directories
 A directory on the local file system
 A directory on a remote file system (NFS/CIFS)
DATA PIPELINING
 Client retrieves a list of DataNodes on which to place replicas of a block
 Client writes the block to the first DataNode
 The first DataNode forwards the data to the next DataNode in the pipeline
 When all replicas are written, the client moves on to write the next block in the file
 The Rebalancer is used to ensure that the percentage of disk used is similar across DataNodes
 Usually run when new DataNodes are added
 The cluster remains online while the Rebalancer is active
 The Rebalancer is throttled to avoid network congestion
 Command-line tool
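The write pipeline above can be sketched as a chain: the client hands the block to the first Data Node, and each node stores its replica and forwards the block downstream. The node names and function shape here are illustrative.

```python
def pipeline_write(block, pipeline, stored=None):
    """The head of the pipeline persists its replica, then forwards the
    block to the remaining nodes, as in the pipeline described above."""
    stored = {} if stored is None else stored
    if not pipeline:
        return stored                  # all replicas written; client moves on
    head, *rest = pipeline
    stored[head] = block               # this Data Node persists the block
    return pipeline_write(block, rest, stored)

replicas = pipeline_write(b"block data", ["dn1", "dn2", "dn3"])
print(sorted(replicas))  # ['dn1', 'dn2', 'dn3']
```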
HADOOP MAP/REDUCE
 The Map-Reduce programming model
 Framework for distributed processing of large data sets
 Pluggable user code runs in generic framework
 Common design pattern in data processing
 cat * | grep | sort | uniq -c | cat > file
 input | map | shuffle | reduce | output
 Useful for:
 Log processing
 Web search indexing
 Ad-hoc queries, etc.
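The input | map | shuffle | reduce | output pattern above can be sketched as a word count in plain Python; no Hadoop APIs, just the shape of the model:

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]   # input

# map: each input record -> (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: fold each key's values into one result
counts = {key: sum(values) for key, values in groups.items()}

print(counts["the"])  # 3
```

In Hadoop, the map and reduce steps are the pluggable user code; the shuffle, distribution, and fault handling are supplied by the framework.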