DataTorrent
HADOOP
Interacting with HDFS
→ What's the “Need” ? ←
❏ Big data Ocean
❏ Expensive hardware
❏ Frequent Failures and Difficult recovery
❏ Scaling up with more machines
→ Hadoop ←
Open source software
- a Java framework
- initial release: 2006 (version 1.0 in December 2011)
It provides both,
Storage → [HDFS]
Processing → [MapReduce]
HDFS: Hadoop Distributed File System
→ How Hadoop addresses the need? ←
Big data Ocean
Have multiple machines. Each will store some portion of data, not the entire data.
Expensive hardware
Use commodity hardware. Simple and cheap.
Frequent Failures and Difficult recovery
Have multiple copies of data. Have the copies in different machines.
Scaling up with more machines
If more processing is needed, add new machines on the fly
→ HDFS ←
Runs on Commodity hardware: Doesn't require expensive machines
Large Files; Write-once, Read-many (WORM)
Files are split into blocks
Actual blocks go to DataNodes
The metadata is stored at NameNode
Replicate blocks to different nodes
Default configuration:
Block size = 128MB
Replication Factor = 3
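To see how HDFS actually split and replicated a stored file, fsck can print its blocks and their locations (a quick check, assuming a running cluster; the path is illustrative):
hdfs fsck /user/USERNAME/demo/data/file.txt -files -blocks -locations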
→ Where NOT TO use Hadoop/HDFS ←
Low latency data access
HDFS is optimized for high throughput of data at the expense of latency.
Large number of small files
The NameNode keeps the entire file-system metadata in memory.
Many small files mean too much metadata compared to the actual data.
Multiple writers / Arbitrary file modifications
No support for multiple writers for a file
Writes always append to the end of a file
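The only supported mutation is in fact an append; a small illustration (paths are hypothetical, and append must be enabled on the cluster):
hdfs dfs -appendToFile local-delta.txt /user/USERNAME/demo/data/file.txt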
→ Some Key Concepts ←
❏NameNode
❏DataNodes
❏JobTracker (MR v1)
❏TaskTrackers (MR v1)
❏ResourceManager (MR v2)
❏NodeManagers (MR v2)
❏ApplicationMasters (MR v2)
→ NameNode & DataNodes ←
❏NameNode:
Centerpiece of HDFS: The Master
Stores only the metadata: file names, block locations, etc.
Critical component; When down, whole cluster is considered down; Single point of failure
Should be configured with higher RAM
❏DataNode:
Stores the actual data: The Slave
In constant communication with NameNode
When down, it does not affect the availability of data/cluster
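A quick way to see this master/slave split on a live cluster (needs HDFS admin rights; a sketch, not cluster-specific):
hdfs dfsadmin -report
This prints the NameNode's view of total capacity and of every DataNode it is receiving heartbeats from.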
→ JobTracker & TaskTrackers ←
❏JobTracker:
Talks to the NameNode to determine location of the data
Monitors all TaskTrackers and submits status of the job back to the client
When down, HDFS is still functional; no new MR job; existing jobs halted
Replaced by ResourceManager/ApplicationMaster in MRv2
❏TaskTracker:
Runs on all DataNodes
TaskTracker communicates with JobTracker signaling the task progress
TaskTracker failure is not considered fatal
→ ResourceManager & NodeManager ←
❏Present in Hadoop v2.0
❏Equivalent of JobTracker & TaskTracker in v1.0
❏ResourceManager (RM):
Usually runs on the master node; distributes resources among applications.
Two main components: Scheduler and ApplicationsManager (AM)
❏NodeManager (NM):
Per-node framework agent
Responsible for containers
Monitors their resource usage
→ Hadoop 1.0 vs. 2.0 ←
HDFS 1.0:
Single point of failure
Horizontal scaling performance issue
HDFS 2.0:
HDFS High Availability (see the check below)
HDFS Snapshot
Improved performance
HDFS Federation
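With HA enabled, each NameNode is either active or standby; a quick status check (a sketch — service IDs such as nn1 come from dfs.ha.namenodes.* in your config):
hdfs haadmin -getServiceState nn1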
(Diagram: HDFS Federation)
→ Interacting with HDFS ←
Command prompt:
Similar to Linux terminal commands
Modeled on the Unix shell; the interface is POSIX-like (HDFS relaxes some POSIX requirements)
Web Interface:
Similar to browsing an FTP site on the web
Interacting With HDFS
On Command Prompt
→ Notes ←
File Paths on HDFS:
hdfs://<namenode>:<port>/path/to/file.txt
hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
/user/USERNAME/demo/file.txt
demo/file.txt
File System:
Local: local file system (Linux)
HDFS: Hadoop Distributed File System
At some places: …
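The absolute and relative path forms above address the same files; for example (assuming fs.defaultFS is hdfs://localhost:8020 and the current user is USERNAME), these three listings are equivalent:
hdfs dfs -ls hdfs://localhost:8020/user/USERNAME/demo
hdfs dfs -ls /user/USERNAME/demo
hdfs dfs -ls demo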
→ Before we start ←
Command:
hdfs
Usage:
hdfs [--config confdir] COMMAND
Example:
hdfs dfs
hdfs dfsadmin
hdfs fsck
hdfs `dfs` commands
→ In general Syntax for `dfs` commands ←
hdfs
dfs
-<COMMAND>
-[OPTIONS]
<PARAMETERS>
e.g.
hdfs dfs -ls -R /user/USERNAME/demo/data/
0. Do it yourself
Syntax:
hdfs dfs -help [COMMAND … ]
hdfs dfs -usage [COMMAND … ]
Example:
hdfs dfs -help cat
hdfs dfs -usage cat
1. List the file/directory
Syntax:
hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>
Example:
hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -ls /user/USERNAME/demo/list-dir-example
hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example
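Each listing line has the form <permissions> <#replicas> <owner> <group> <size in bytes> <modification date> <name>; a sample line (illustrative values):
-rw-r--r--   3 USERNAME hadoop    1048576 2016-01-01 12:00 /user/USERNAME/demo/list-dir-example/file.txt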
2. Creating a directory
Syntax:
hdfs dfs -mkdir [-p] <hdfs-dir-path>
Example:
hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example
hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-example/dir1/dir2/dir3
3. Create a file on local & put it on HDFS
Syntax:
vi filename.txt
hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
Example:
vi file-copy-to-hdfs.txt
hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
4. Get a file from HDFS to local
Syntax:
hdfs dfs -get <hdfs-file-path> [local-dir-path]
Example:
hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/
5. Copy From LOCAL To HDFS
Syntax:
hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
Example:
hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/
6. Copy To LOCAL From HDFS
Syntax:
hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
Example:
hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/
7. Move a file from local to HDFS
Syntax:
hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
Example:
hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/
8. Copy a file within HDFS
Syntax:
hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/
9. Move a file within HDFS
Syntax:
hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/
10. Merge files on HDFS
Syntax:
hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>
Examples:
hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt
11. View file contents
Syntax:
hdfs dfs -cat <hdfs-file-path>
hdfs dfs -tail <hdfs-file-path>
hdfs dfs -text <hdfs-file-path>
Examples:
hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt
hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head
12. Remove files/dirs from HDFS
Syntax:
hdfs dfs -rm [options] <hdfs-file-path>
Examples:
hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt
hdfs dfs -rm -R /user/USERNAME/demo/remove-example/
hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/
13. Change file/dir properties
Syntax:
hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>
hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>
hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>
Examples:
hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt
14. Check the file size
Syntax:
hdfs dfs -du <hdfs-file-path>
Examples:
hdfs dfs -du /user/USERNAME/demo/data/file.txt
hdfs dfs -du -s -h /user/USERNAME/demo/data/
15. Create a zero byte file in HDFS
Syntax:
hdfs dfs -touchz <hdfs-file-path>
Examples:
hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt
16. File test operations
Syntax:
hdfs dfs -test -[defsz] <hdfs-file-path>
Examples:
hdfs dfs -test -e /user/USERNAME/demo/data/file.txt
echo $?
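-test reports only via the exit code (0 = true), which makes it script-friendly; a small sketch (path illustrative):
if hdfs dfs -test -e /user/USERNAME/demo/data/file.txt; then
    echo "file exists"
else
    echo "file missing"
fi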
17. Get FileSystem Statistics
Syntax:
hdfs dfs -stat [format] <hdfs-file-path>
Format Options:
%b - file size in bytes      %g - group name of owner
%n - filename                %o - block size
%r - replication             %u - user name of owner
%y - modification date
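For example, to print name, block size, replication factor and modification date in one line (path illustrative):
hdfs dfs -stat "%n %o %r %y" /user/USERNAME/demo/data/file.txt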
18. Get File/Dir Counts
Syntax:
hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>
Example:
hdfs dfs -count -v /user/USERNAME/demo/
19. Set replication factor
Syntax:
hdfs dfs -setrep [-w] [-R] <n> <hdfs-file-path>
Examples:
hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt
20. Set Block Size
Syntax:
hdfs dfs -D dfs.blocksize=<blocksize> -copyFromLocal <local-file-path> <hdfs-file-path>
Examples:
hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
21. Empty the HDFS trash
Syntax:
hdfs dfs -expunge
Location:
Deleted files are moved to /user/<username>/.Trash/ (trash must be enabled via fs.trash.interval in core-site.xml)
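A quick look at the mechanism (illustrative path; with trash enabled, -rm moves rather than deletes):
hdfs dfs -rm /user/USERNAME/demo/data/old-file.txt
hdfs dfs -ls /user/USERNAME/.Trash/Current/user/USERNAME/demo/data/
hdfs dfs -expunge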
Other hdfs commands (admin)
22. HDFS Admin Commands: fsck
Syntax:
hdfs fsck <hdfs-file-path>
Options:
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]]
[-includeSnapshots]
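For instance, to inspect block placement under a directory, or to scan the whole namespace for corrupt blocks (paths illustrative):
hdfs fsck /user/USERNAME/demo/ -files -blocks -racks
hdfs fsck / -list-corruptfileblocks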
23. HDFS Admin Commands: dfsadmin
Syntax:
hdfs dfsadmin
Options:
[-report [-live] [-dead] [-decommissioning]]
[-safemode enter | leave | get | wait]
[-refreshNodes]
[-refresh <host:ipc_port> <key> [arg1..argn]]
[-shutdownDatanode <datanode:port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-help [cmd]]
Examples:
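(two common read-only checks; these require HDFS admin privileges)
hdfs dfsadmin -report -live
hdfs dfsadmin -safemode get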
24. HDFS Admin Commands: namenode
Syntax:
hdfs namenode
Options:
[-checkpoint] |
[-format [-clusterid cid ] [-force] [-nonInteractive] ] |
[-upgrade [-clusterid cid] ] |
[-rollback] |
[-recover [-force] ] |
[-metadataVersion ]
Examples:
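(shown for first-time setup only — formatting wipes existing metadata)
hdfs namenode -format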
25. HDFS Admin Commands: getconf
Syntax:
hdfs getconf [-options]
Options:
[ -namenodes ] [ -secondaryNameNodes ]
[ -backupNodes ] [ -includeFile ]
[ -excludeFile ] [ -nnRpcAddresses ]
[ -confKey [key] ]
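For example, to list the NameNodes or read one configuration key:
hdfs getconf -namenodes
hdfs getconf -confKey dfs.blocksize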
Again... THE most important commands!!
Syntax:
hdfs dfs -help [options]
hdfs dfs -usage [options]
Examples:
hdfs dfs -help help
hdfs dfs -usage usage
Interacting With HDFS
In Web Browser
Web HDFS
URL:
http://<namenode-host>:50070/explorer.html (Hadoop 2.x; the port is 9870 in Hadoop 3.x)
Examples:
http://localhost:50070/explorer.html
http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html
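The same NameNode endpoint also serves the WebHDFS REST API, so the file system can be scripted over plain HTTP (a sketch; host and path are illustrative, and WebHDFS must be enabled via dfs.webhdfs.enabled):
curl -i "http://localhost:50070/webhdfs/v1/user/USERNAME/demo?op=LISTSTATUS"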
References
1. http://www.hadoopinrealworld.com
2. http://www.slideshare.net/sanjeeb85/hdfscommandreference
3. http://www.slideshare.net/jaganadhg/hdfs-10509123
4. http://www.slideshare.net/praveenbhat2/adv-os-presentation
5. http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
6. http://www.snia.org/sites/default/files/Hadoop2_New_And_Noteworthy_SNIA_v3.pdf
7. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
8. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Thank You!!
APPENDIX
Copy data from one cluster to another
Description:
Copy data between Hadoop clusters
Syntax:
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
Where srclist.file contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
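distcp runs as a MapReduce job, so copies are parallelized across the cluster; for repeated syncs, -update copies only the files that differ at the destination (a sketch):
hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo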
Editor's Notes
  • #5: Commodity Hardware: -affordable and easy to obtain -capable of running Windows, Linux, or MS-DOS without requiring any special devices or equipment -broadly compatible and can function on a plug and play basis -low-end but functional product without distinctive features
  • #6: BLOCK: A physical storage disk has a block size - the minimum amount of data it can read or write; normally 512 bytes. File systems for a single disk also deal with data in blocks, normally a few kilobytes (e.g., 4 KB). Hadoop has a much larger block size: 64 MB by default in Hadoop 1.x (128 MB from 2.x). Files in HDFS are broken down into block-sized chunks and stored as independent units. However, files smaller than a block do not occupy the entire block. Why such large blocks?
  • #12: > NameNode does not store the actual data or the dataset. The data itself is actually stored in the DataNodes. > NameNode knows the list of the blocks and its location for any given file in HDFS. > With this information NameNode knows how to construct the file from blocks.
  • #14: JobTracker finds the best TaskTracker nodes to execute tasks based on: -data locality -available slots to execute a task on a given node
  • #17: HDFS High Availability: NameNode metadata is written to shared storage (Journal Manager). Only one active NN can write to the shared storage; passive NNs read & replay metadata from it. When the active NN fails, one of the passive NNs is promoted to active. Snapshot: able to store a checkpointed state of HDFS. Improved performance: multithreaded random read - HDFS v1: 264 MB/sec, HDFS v2: 1395 MB/sec (about 5X!). Federation: the NameNode stores metadata in memory; with a very large number of files the NameNode could exhaust memory, so metadata is spread over multiple NameNodes.
  • #18: Details about HDFS Federation.
  • #22: The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] Generic options supported are -conf <configuration file> specify an application configuration file -D <property=value> use value for given property -fs <local|namenode:port> specify a namenode -jt <local|resourcemanager:port> specify a ResourceManager -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath. -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
  • #24: The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] Generic options supported are -conf <configuration file> specify an application configuration file -D <property=value> use value for given property -fs <local|namenode:port> specify a namenode -jt <local|resourcemanager:port> specify a ResourceManager -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath. -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
  • #25: Everything you need to know about hdfs commands: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
  • #26: Description: List the contents that match the specified pattern. If path is not specified, the contents of /user/<current_user> are listed Options: -d Directories are listed as plain files. -h Formats the sizes of files in a human-readable fashion, rather than a number of bytes. -R Recursively list the contents of directories. Output: (<permissions> <-/#replicas> <userid> <groupid> <size(in bytes)> <modification_date> <directoryName/fileName>)
  • #27: Description: Create a directory in specified location. Options: -p : Create directories in the specified path, if does not exist
  • #28: Description: Copy files into fs. Options: -f : If the file already exists, copying does not fail & the destination is overwritten. -p : Preserves access time, modification time, ownership and the mode.
  • #29: Description: Copy files from hdfs. When copying multiple files, the destination must be a directory. Options: -p : Preserves access time, modification time, ownership and the mode. -ignorecrc : Files that fail the CRC check may be copied with this option. -crc : Files and CRCs may be copied using this option.
  • #30: Description: Copy files from local filesystem into hdfs. It is same as “put” command but more specific w.r.t local filesystem Options: -f : If the file already exists, copying does not fail & the destination is overwritten. -p : Preserves access time, modification time, ownership and the mode.
  • #31: Description: Copy files from hdfs to local filesystem. When copying multiple files, the destination must be a directory. It is same as “get” command but more specific w.r.t local filesystem Options: -p : Preserves access time, modification time, ownership and the mode. -ignorecrc : Files that fail the CRC check may be copied with this option. -crc : Files and CRCs may be copied using this option.
  • #32: Description: Same as -copyFromLocal, except that the source is deleted after it's copied.
  • #33: Description: Copy files from hdfs to the same hdfs. File pattern can be specified. When copying multiple files, the destination must be a directory. Options: -f : If the file already exists, copying does not fail & the destination is overwritten. -p : Preserves access time, modification time, ownership and the mode.
  • #34: Description: Move files from hdfs to the same hdfs. File pattern can be specified. When moving multiple files, the destination must be a directory.
  • #35: Description: Get all the files in the directories that match the source file pattern and merge and sort them to only one file on local fs. <src> is kept. Options: -nl Add a newline character at the end of each file.
  • #36: -cat [-ignoreCrc] <src> ... : Fetch all files that match the file pattern <src> and display their content on stdout. -tail [-f] <file> : Show the last 1KB of the file. -f Shows appended data as the file grows. -text [-ignoreCrc] <src> ... : Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream and Avro.
  • #37: Description: Delete all files that match the specified file pattern. Equivalent to the Unix command "rm <src>" Options: -skipTrash option bypasses trash, if enabled, and immediately deletes <src> -f If the file does not exist, do not display a diagnostic message or modify the exit status to reflect an error. -[rR] Recursively deletes directories
  • #38: -chgrp [-R] GROUP PATH... : This is equivalent to -chown ... :GROUP ... -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... : Changes permissions of a file. This works similar to the shell's chmod command with a few exceptions. -R modifies the files recursively. This is the only option currently supported. <MODE> Mode is the same as mode used for the shell's command. -chown [-R] [OWNER][:[GROUP]] PATH... : Changes owner and group of a file. This is similar to the shell's chown command with a few exceptions. -R modifies the files recursively. This is the only option currently supported.
  • #39: -du [-s] [-h] <path> ... : Show the amount of space, in bytes, used by the files that match the specified file pattern. The following flags are optional: -s Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size. -h Formats the sizes of files in a human-readable fashion rather than a number of bytes. Note that, even without the -s option, this only shows size summaries one level deep into a directory. The output is in the form size disk space consumed name(full path)
  • #40: -touchz <path> ... : Creates a file of zero length at <path> with current time as the timestamp of that <path>. An error is returned if the file exists with non-zero length
  • #41: Options: -d: if the path is a directory, return 0. -e: if the path exists, return 0. -f: if the path is a file, return 0. -s: if the path is not empty, return 0. -z: if the file is zero length, return 0.
  • #42: -stat [format] <path> ... : Print statistics about the file/directory at <path> in the specified format. Format accepts file size in bytes (%b), group name of owner (%g), filename (%n), block size (%o), replication (%r), user name of owner (%u), modification date (%y, %Y)
  • #43: -count [-q] [-h] [-v] <path> ... : Count the number of directories, files and bytes under the paths that match the specified file pattern. The -h option shows file sizes in human readable format. The -v option displays a header line. The output columns are: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME or, with the -q option: QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  • #44: -setrep [-R] [-w] <rep> <path> ... : Set the replication level of a file. If <path> is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at <path>. -w It requests that the command waits for the replication to complete. This can potentially take a very long time. -R It is accepted for backwards compatibility. It has no effect.
  • #45: The block size specified by dfs.blocksize should be multiple of 512 -copyFromLocal error if block size is not valid: Invalid values: dfs.bytes-per-checksum (=512) must divide block size (=104857601).
  • #46: -expunge : Empty the trash. To enable the HDFS trash, set fs.trash.interval > 0 in core-site.xml. Deleted data goes to the HDFS folder: /user/<username>/.Trash/
  • #48: Options: -move move corrupted files to /lost+found -delete delete corrupted files -files print out files being checked -openforwrite print out files opened for write -includeSnapshots include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it -list-corruptfileblocks print out list of missing blocks and files they belong to -blocks print out block report -locations print out locations for every block -racks print out network topology for data-node locations
  • #50: Command: hdfs dfsadmin -help report Description: Reports basic filesystem information and statistics. The dfs usage can be different from "du" usage, because it measures raw space used by replication, checksums, snapshots and etc. on all the DNs. Optional flags may be used to filter the list of displayed DNs. Options: -report [-live] [-dead] [-decommissioning]
  • #53: hdfs getconf [-namenodes] gets list of namenodes in the cluster. [-secondaryNameNodes] gets list of secondary namenodes in the cluster. [-backupNodes] gets list of backup nodes in the cluster. [-includeFile] gets the include file path that defines the datanodes that can join the cluster. [-excludeFile] gets the exclude file path that defines the datanodes that need to be decommissioned. [-nnRpcAddresses] gets the namenode rpc addresses. [-confKey [key]] gets a specific key from the configuration