Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
      Instructor: Professor Lothar Piepmayer




  HDFS at a glance
Agenda

1. Design of HDFS
2.1 HDFS Concepts - Blocks
2.2 HDFS Concepts - Namenode and datanodes
3.1 Dataflow - Anatomy of a file read
3.2 Dataflow - Anatomy of a file write
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line
The Design of HDFS

Very large distributed file system
  Up to 10K nodes, 1 billion files, 100PB
Streaming data access
  Write once, read many times
Commodity hardware
  Files are replicated to handle hardware failure
        Detect failures and recover from them
Poor fit for

Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
HDFS Blocks

Normal filesystem blocks are a few kilobytes
HDFS has a large block size
    Default 64MB
    Typical 128MB
Unlike in a filesystem for a single disk, a file in HDFS that is
 smaller than a single block does not occupy a full block
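
The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sizes involved, assuming the 64MB default; the function name `split_into_blocks` is hypothetical and not part of any Hadoop API.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes

# A 200MB file: three full 64MB blocks plus one 8MB block.
print([size // MB for size in split_into_blocks(200 * MB)])
# A 1MB file: a single block that uses 1MB of storage, not 64MB.
print([size // MB for size in split_into_blocks(1 * MB)])
```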
HDFS Blocks


A file is stored as blocks on various nodes in the Hadoop cluster.
HDFS keeps several replicas of each data block,
 placed on different nodes across the cluster.
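
A toy sketch of why replication protects against node failure. The round-robin placement below is a hypothetical stand-in chosen for simplicity; real HDFS placement is rack-aware and is not modelled here. The replication factor of 3 and the node names are assumptions for the example.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only, not the HDFS algorithm.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(4, nodes)
# Every block survives the loss of any single node.
print(surviving_blocks(placement, "node1"))
```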
HDFS Blocks

[Figure]
Source: Dhruba Borthakur, Design and Evolution of the Apache Hadoop File System (HDFS)
Why are blocks in HDFS so large?

Minimize the cost of seeks
=> With large enough blocks, seek time is small relative to transfer time,
   so reads run at close to the disk transfer rate
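
A worked example of the seek-cost argument. The 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not measurements from this deck, and both function names are hypothetical.

```python
def block_size_for_seek_overhead(seek_time_s, transfer_rate_mb_s, overhead):
    """Block size (MB) at which seek time is `overhead` times the transfer time."""
    transfer_time_s = seek_time_s / overhead
    return transfer_rate_mb_s * transfer_time_s

def seek_overhead(block_size_mb, seek_time_s, transfer_rate_mb_s):
    """Ratio of seek time to the time needed to transfer one block."""
    return seek_time_s / (block_size_mb / transfer_rate_mb_s)

# Assumed disk: 10 ms seek, 100 MB/s transfer.
# For seeks to cost 1% of transfer time, blocks need to be about 100MB.
print(block_size_for_seek_overhead(0.010, 100, 0.01))
# With a 4KB block, the seek takes far longer than the transfer itself.
print(seek_overhead(4 / 1024, 0.010, 100))
```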
Benefits of the block abstraction

A file can be larger than any single disk in the network
Simplifies the storage subsystem
Provides fault tolerance and availability
Namenode & Datanodes

 Namenode (master)
 – manages the filesystem namespace
 – maintains the filesystem tree and metadata for all the
 files and directories in the tree.
 Datanodes (slaves)
 – store data in the local file system
 – Periodically report back to the namenode with lists of all
 existing blocks
 Clients communicate with both namenode and datanodes.
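
The division of labour above can be sketched as a toy in-memory model. The class and method names (`NameNode`, `receive_block_report`, `locate`) and the block ids are hypothetical; this is not the Hadoop implementation, only an illustration of what metadata the namenode holds and how block reports keep it current.

```python
class NameNode:
    """Toy namenode: keeps the namespace and the block-to-datanode map."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        """Replace what is known about `datanode` with its latest full report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return [(block id, sorted datanodes)] for a file, in file order."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]

namenode = NameNode()
namenode.add_file("/test/data.txt", ["blk_1", "blk_2"])
namenode.receive_block_report("datanode1", ["blk_1", "blk_2"])
namenode.receive_block_report("datanode2", ["blk_1"])
print(namenode.locate("/test/data.txt"))
```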
Anatomy of a File Read


Benefits:
- Avoids a bottleneck at the namenode
- Supports many concurrent clients
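
A minimal sketch of the read path that produces these benefits: the client makes one metadata lookup, then pulls block data directly from datanodes. The dictionaries stand in for the namenode and datanodes, and picking the first listed replica is an assumption standing in for "closest replica".

```python
def read_file(path, namenode_index, datanode_store):
    """Toy read: look up block locations once, then fetch from datanodes."""
    data = bytearray()
    # Step 1: ask the (toy) namenode where each block of the file lives.
    for block_id, holders in namenode_index[path]:
        # Assumption: the first listed replica is the closest one.
        source = holders[0]
        # Step 2: read the block straight from that datanode.
        data += datanode_store[source][block_id]
    return bytes(data)

namenode_index = {
    "/test/hello.txt": [
        ("blk_1", ["datanode1", "datanode2"]),
        ("blk_2", ["datanode2"]),
    ]
}
datanode_store = {
    "datanode1": {"blk_1": b"hello "},
    "datanode2": {"blk_1": b"hello ", "blk_2": b"world"},
}
content = read_file("/test/hello.txt", namenode_index, datanode_store)
print(content)
```

Because the block bytes never pass through the namenode, many clients can read at once without it becoming the limiting component.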
Writing in HDFS


Exceptions: a datanode fails
  The pipeline is closed; the failed node is removed from it
   and its partial block is discarded
  The namenode arranges a replica on a new datanode
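
The failure handling above, as a toy sketch. Function names, node names, and the `min_replicas=1` threshold are assumptions for illustration; real HDFS also re-queues unacknowledged packets, which is not modelled here.

```python
def write_block(block_id, pipeline, is_alive, min_replicas=1):
    """Toy pipeline write.

    Returns (nodes that stored the block, number of replicas still missing).
    """
    # Failed nodes are dropped from the pipeline; the write continues.
    healthy = [node for node in pipeline if is_alive(node)]
    if len(healthy) < min_replicas:
        raise IOError(f"could not write {block_id}: no healthy datanode")
    missing = len(pipeline) - len(healthy)
    return healthy, missing

def rereplicate(stored_on, missing, all_nodes, is_alive):
    """Namenode's part: choose new live datanodes for the missing replicas."""
    candidates = [n for n in all_nodes if n not in stored_on and is_alive(n)]
    return stored_on + candidates[:missing]

all_nodes = ["dn1", "dn2", "dn3", "dn4"]
failed = {"dn2"}
is_alive = lambda node: node not in failed

stored_on, missing = write_block("blk_1", ["dn1", "dn2", "dn3"], is_alive)
final_holders = rereplicate(stored_on, missing, all_nodes, is_alive)
print(stored_on, missing, final_holders)
```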
Coherency Model


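Data being written is not guaranteed to be visible to readers while the write is in progress
Use sync() to force written data to become visible
Applications should call sync() at suitable points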
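
A toy model of the visibility rule, not the HDFS client API. The class `ToyHdfsFile` is hypothetical and deliberately simplified: it treats everything written since the last `sync()` as invisible, whereas real HDFS makes completed blocks visible on its own.

```python
class ToyHdfsFile:
    """Models visibility only: readers see data up to the last sync()/close()."""

    def __init__(self):
        self._written = bytearray()
        self._visible_length = 0

    def write(self, data):
        # Data is accepted but not yet visible to readers.
        self._written += data

    def sync(self):
        # Everything written so far becomes visible.
        self._visible_length = len(self._written)

    def close(self):
        # Closing a file implies a sync.
        self.sync()

    def read_visible(self):
        return bytes(self._written[: self._visible_length])

f = ToyHdfsFile()
f.write(b"content")
before = f.read_visible()   # nothing visible yet
f.sync()
after = f.read_visible()    # visible after sync()
print(before, after)
```

The trade-off for an application is between durability and throughput: syncing more often loses less data on a client failure but adds overhead.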
Parallel copying in HDFS

Transfer data between clusters
   % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Implemented as a MapReduce job; each file is copied by a single map
Each map copies at least 256MB
Default maximum is 20 maps per node
Copying between clusters running different HDFS versions requires a version-independent protocol such as webhdfs:
   % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
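
A rough sketch of how the two limits above interact when estimating the number of maps. This is an approximation built only from the 256MB and 20-per-node figures on this slide; the function name is hypothetical, and the estimate ignores the number of files.

```python
MB = 1024 * 1024
GB = 1024 * MB

def distcp_map_count(total_bytes, num_nodes,
                     min_bytes_per_map=256 * MB, max_maps_per_node=20):
    """Estimate map count: enough maps for >= 256MB each, capped per node."""
    # Integer ceiling division avoids floating-point rounding.
    wanted = max(1, -(-total_bytes // min_bytes_per_map))
    return min(wanted, max_maps_per_node * num_nodes)

# 1GB on a 3-node cluster: four maps of 256MB each.
print(distcp_map_count(1 * GB, 3))
# 1000GB on 100 nodes: capped at 20 * 100 maps, so each copies more than 256MB.
print(distcp_map_count(1000 * GB, 100))
```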
Setup

Cluster with 3 nodes:
    4 GB RAM
    2 CPUs @ 2.0 GHz+
    100 GB HDD
Running in VMware on 3 different servers
Network: 100 Mbps
Operating system: Ubuntu 11.04
    Windows: not tested
Setup Guide - Single Node


Java runtime, ssh
  http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml
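
A sketch of what the two files might contain for a single-node setup, generated here with Python for illustration. The property names `fs.default.name` and `dfs.replication` are the Hadoop 1.x settings; the values shown (`hdfs://localhost/`, replication 1) are assumptions for a single-node demo, not taken from this deck, and the helper `hadoop_config_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def hadoop_config_xml(properties):
    """Build a Hadoop-style <configuration> document from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Assumed contents of core-site.xml: where the default filesystem lives.
core_site = hadoop_config_xml({"fs.default.name": "hdfs://localhost/"})
# Assumed contents of hdfs-site.xml: one replica, since there is one datanode.
hdfs_site = hadoop_config_xml({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```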
Cluster


/etc/hadoop/masters
/etc/hadoop/slaves
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
Command Line

Similar to *nix
    hadoop fs -ls /
    hadoop fs -mkdir /test
    hadoop fs -rmr /test
    hadoop fs -cp /1 /2
    hadoop fs -copyFromLocal /3 hdfs://localhost/
Namenode-specific:
    hadoop namenode -format
    start-all.sh
Command Line

Sorting: Standard method to test cluster
    TeraGen: Generate dummy data
    TeraSort: Sort
    TeraValidate: Validate sort result
Command Line:
    hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar
     terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
Benchmark Result

2 Nodes, 1GB data: 0:03:38
3 Nodes, 1GB data: 0:03:13

2 Nodes, 10GB data: 0:38:07
3 Nodes, 10GB data: 0:31:28

The virtual machines' hard disks are the bottleneck
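
The timings above can be turned into speedups with a short calculation. Only the four measurements from this slide are used; the helper names are hypothetical. Going from 2 to 3 nodes would give a speedup of 1.5 if scaling were linear.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# (data size, node count) -> elapsed time, as reported on the slide.
results = {
    ("1GB", 2): "0:03:38",
    ("1GB", 3): "0:03:13",
    ("10GB", 2): "0:38:07",
    ("10GB", 3): "0:31:28",
}

def speedup(size):
    """Speedup from adding the third node for a given data size."""
    return to_seconds(results[(size, 2)]) / to_seconds(results[(size, 3)])

for size in ("1GB", "10GB"):
    print(size, round(speedup(size), 2))
```

Both speedups come out well below 1.5 (roughly 1.13 and 1.21), which is consistent with a shared bottleneck such as the virtual machines' disks.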
Who wins...?
References

Hadoop: The Definitive Guide
