Hadoop at a glance

Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
Instructor: Professor Lothar Piepmayer

HDFS at a glance

Agenda

1. Design of HDFS
2.1 HDFS Concepts - Blocks
2.2 HDFS Concepts - Namenode and datanodes
3.1 Dataflow - Anatomy of a file read
3.2 Dataflow - Anatomy of a file write
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line

The Design of HDFS

Very large distributed file system
Up to 10K nodes, 1 billion files, 100PB
Streaming data access
Write once, read many times
Commodity hardware
Files are replicated to handle hardware failure
Detect failures and recover from them

Poor fit for

Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
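
The small-files limit comes from the namenode keeping all filesystem metadata in memory. The sketch below is a rough estimate only. It assumes the rule of thumb of about 150 bytes per file or block object given in Hadoop: The Definitive Guide, and the function name is illustrative.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb: namenode memory per file or block object


def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate the namenode memory needed for the metadata of num_files files."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its block objects
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for n in (1_000_000, 1_000_000_000):
        print(f"{n:>13,} files -> {namenode_memory_bytes(n) / 1e9:.1f} GB")
```

Under this assumption the cost depends on the number of files, not on how much data they hold, which is why many small files are more expensive than a few large ones.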

HDFS Blocks

Ordinary filesystem blocks are a few kilobytes
HDFS uses a much larger block size
- Default: 64 MB
- Typical: 128 MB
Unlike a filesystem for a single disk, a file in HDFS that is
smaller than a single block does not occupy a full block
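
A minimal sketch of how a file maps onto blocks, assuming the 64 MB default above. The helper name is hypothetical and this is not HDFS code. It only illustrates that the last block holds just the remaining bytes.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks used by a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)  # the last block only holds the remaining bytes
    return sizes


if __name__ == "__main__":
    print([s // MB for s in split_into_blocks(200 * MB)])  # a 200 MB file
    print([s // MB for s in split_into_blocks(1 * MB)])    # a 1 MB file
```

A 200 MB file uses three full blocks plus one 8 MB block, and a 1 MB file uses a single 1 MB block rather than 64 MB of disk.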

HDFS Blocks

A file is stored as blocks spread across nodes in the Hadoop cluster.
HDFS keeps several replicas of each data block.
Every data block is replicated to multiple nodes
across the cluster.

HDFS Blocks

Source: Dhruba Borthakur, Design and Evolution of the Apache Hadoop File System (HDFS)

Why are blocks in HDFS so large?

Minimize the cost of seeks
=> Transfer time dominates seek time, so large files are read at close to the disk transfer rate
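
The seek argument can be checked with simple arithmetic. The figures below (10 ms seek time, 100 MB/s transfer rate) are assumed round numbers for illustration, not measurements from this cluster.

```python
def seek_overhead(block_size_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Return seek time as a fraction of the time needed to transfer one block."""
    transfer_ms = block_size_mb / transfer_mb_per_s * 1000.0
    return seek_ms / transfer_ms


if __name__ == "__main__":
    for size_mb in (0.004, 64, 100, 128):  # 0.004 MB is roughly a 4 KB block
        print(f"{size_mb:>8} MB block: seek = {seek_overhead(size_mb):.2%} of transfer time")
```

With these assumed numbers, a block of about 100 MB keeps the seek at roughly 1% of the transfer time, while a kilobyte-sized block spends far longer seeking than transferring.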

Benefit of Block abstraction

A file can be larger than any single disk in the network
Simplifies the storage subsystem
Works well with replication to provide fault tolerance and availability

Namenode & Datanodes

- Namenode (master)
  – manages the filesystem namespace
  – maintains the filesystem tree and the metadata for all the
    files and directories in the tree
- Datanodes (slaves)
  – store data in the local filesystem
  – periodically report back to the namenode with lists of all
    the blocks they store
- Clients communicate with both the namenode and the datanodes.

Anatomy of a File Read

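Benefits:
- Avoids a bottleneck: clients read data directly from datanodes; the namenode only serves block locations
- Scales to many concurrent clients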
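
A toy model of the read path, as a sketch only. The class and method names are hypothetical and are not the Hadoop API. It assumes the namenode keeps a path-to-blocks map and learns block locations from datanode block reports, so a client needs only one metadata lookup before reading from datanodes directly.

```python
from collections import defaultdict


class ToyNamenode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.namespace = {}                # path -> ordered list of block ids
        self.locations = defaultdict(set)  # block id -> datanodes holding it

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Record the full list of blocks a datanode currently stores."""
        for holders in self.locations.values():
            holders.discard(datanode)      # drop what the previous report said
        for block_id in block_ids:
            self.locations[block_id].add(datanode)

    def get_block_locations(self, path):
        """What a client asks for before reading: blocks and their datanodes."""
        return [(b, sorted(self.locations[b])) for b in self.namespace[path]]


if __name__ == "__main__":
    nn = ToyNamenode()
    nn.add_file("/foo", ["blk_1", "blk_2"])
    nn.block_report("dn1", ["blk_1", "blk_2"])
    nn.block_report("dn2", ["blk_1"])
    print(nn.get_block_locations("/foo"))
```

Because the namenode in this model returns only locations, the file data never passes through it, which is the reason it does not become the bottleneck when many clients read at once.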

Writing in HDFS

[Diagram labels: Namenode, Datanode, Block]

Writing in HDFS

Exception: a datanode fails during the write
The pipeline is closed; the failed node's address and its partial block are
removed
The namenode arranges a replica on a new datanode
Coherency Model

Data being written is not guaranteed to be visible to readers
Use sync() to force buffered data to become visible
Call it at suitable points in applications
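
A deliberately simplified model of this behaviour, assuming that written data stays in a buffer until sync() is called. The class is hypothetical and not the Hadoop API, and real HDFS visibility rules are more detailed than this.

```python
class ToyOutputStream:
    """Models a writer whose data becomes visible to readers only after sync()."""

    def __init__(self):
        self._buffered = bytearray()  # written but not yet visible
        self._visible = bytearray()   # what a new reader would see

    def write(self, data):
        self._buffered.extend(data)

    def sync(self):
        """Make everything written so far visible to readers."""
        self._visible.extend(self._buffered)
        self._buffered.clear()

    def visible_length(self):
        return len(self._visible)


if __name__ == "__main__":
    out = ToyOutputStream()
    out.write(b"content")
    print(out.visible_length())  # nothing visible yet
    out.sync()
    print(out.visible_length())  # all 7 bytes visible
```

The trade-off for an application is between durability and throughput: calling sync() more often loses less data on a crash but adds overhead.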

Parallel copying in HDFS

Transfer data between clusters
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Implemented as a MapReduce job; each file is copied by a single map
Each map copies at least 256 MB
Default maximum is 20 maps per node
Between clusters running different HDFS versions, use the webhdfs protocol:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
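
The two sizing rules (at least 256 MB per map, at most 20 maps per node) can be combined into a rough estimate of how many maps a copy uses. This is a sketch with a hypothetical function name. It ignores the number of files, which also limits the map count because each file is copied by a single map.

```python
MB = 1024 * 1024
GB = 1024 * MB


def distcp_maps(total_bytes, nodes, min_bytes_per_map=256 * MB, max_maps_per_node=20):
    """Estimate the number of maps used to copy total_bytes on a cluster of the given size."""
    wanted = max(1, -(-total_bytes // min_bytes_per_map))  # ceiling division
    return min(wanted, max_maps_per_node * nodes)


if __name__ == "__main__":
    print(distcp_maps(1 * GB, nodes=3))       # small copy: limited by 256 MB per map
    print(distcp_maps(1000 * GB, nodes=100))  # large copy: limited by 20 maps per node
```

Under these rules a 1 GB copy gets 4 maps, while a 1000 GB copy on 100 nodes is capped at 2000 maps, so each map copies about 512 MB.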

Setup

Cluster with 3 nodes:
- 4 GB RAM
- 2 CPUs @ 2.0 GHz+
- 100 GB HDD
Running under VMware on 3 different servers
Network: 100 Mbps
Operating system: Ubuntu 11.04
- Windows: not tested

Setup Guide - Single Node

Prerequisites: Java runtime, ssh
http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml

Cluster

/etc/hadoop/masters
/etc/hadoop/slaves
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html

Command Line

Similar to *nix
% hadoop fs -ls /
% hadoop fs -mkdir /test
% hadoop fs -rmr /test
% hadoop fs -cp /1 /2
% hadoop fs -copyFromLocal /3 hdfs://localhost/
Namenode-specific:
% hadoop namenode -format
% start-all.sh

Command Line

Sorting: the standard way to test a cluster
- TeraGen: generates dummy data
- TeraSort: sorts it
- TeraValidate: validates the sort result
Command line:
% hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41

Benchmark Result

2 Nodes, 1GB data: 0:03:38
3 Nodes, 1GB data: 0:03:13

2 Nodes, 10GB data: 0:38:07
3 Nodes, 10GB data: 0:31:28

The virtual machines' hard disks are the bottleneck
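
The timings above can be turned into speedup figures with a few lines of arithmetic. The helper names are illustrative; the input values are the four results listed on this slide.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def speedup(time_2_nodes, time_3_nodes):
    """Return how many times faster the 3-node run was than the 2-node run."""
    return to_seconds(time_2_nodes) / to_seconds(time_3_nodes)


if __name__ == "__main__":
    print(f"1 GB:  {speedup('0:03:38', '0:03:13'):.2f}x")
    print(f"10 GB: {speedup('0:38:07', '0:31:28'):.2f}x")
```

Going from 2 to 3 nodes would give 1.5x with perfect scaling. The measured speedups are about 1.13x for 1 GB and 1.21x for 10 GB.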

References

Tom White, Hadoop: The Definitive Guide
