Hadoop Cluster Configuration and Data Loading
Hadoop Cluster Specification
• Hadoop is designed to run on commodity hardware
• “Commodity” does not mean “low-end.”
• Processor: two quad-core 2-2.5 GHz CPUs
• Memory: 16-24 GB ECC RAM
• Storage: 4 × 1 TB SATA disks
• Network: Gigabit Ethernet
Hadoop Cluster Architecture
Hadoop Cluster Configuration files
Filename | Format | Description
hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
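The three *-site.xml files share one layout: a `<configuration>` root holding `<property>` elements, each with a `<name>` and a `<value>`. The sketch below renders that layout from Python dictionaries. The property names are the Hadoop 1.x ones that go with the jobtracker/tasktracker daemons named in the table; the host name `master` and the ports are placeholders assumed for illustration, not values taken from the slides.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of property names/values as a Hadoop *-site.xml string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed example values: "master" and the ports are hypothetical.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://master:8020"},
    "hdfs-site.xml": {"dfs.replication": 3},
    "mapred-site.xml": {"mapred.job.tracker": "master:8021"},
}

for filename, props in SITE_FILES.items():
    print(filename)
    print(render_site_xml(props))
```

The masters and slaves files need no such structure: they are plain host lists, one machine per line.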
Hadoop Cluster Modes
• Standalone (or local) mode
There are no daemons running and everything runs in a single JVM. Standalone
mode is suitable for running MapReduce programs during development, since it
is easy to test and debug them.
• Pseudo-distributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a
small scale.
• Fully distributed mode
The Hadoop daemons run on a cluster of machines.
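In practice the three modes differ mainly in a handful of configuration properties. The following sketch lists typical values per mode, using Hadoop 1.x property names. The host names `namenode` and `jobtracker` and the port 8021 are assumptions for illustration; a real cluster would substitute its own.

```python
# Typical key properties per mode (Hadoop 1.x names).
# "namenode", "jobtracker" and port 8021 are hypothetical placeholders.
MODE_PROPERTIES = {
    "standalone": {
        "fs.default.name": "file:///",   # local filesystem, no HDFS daemons
        "mapred.job.tracker": "local",   # jobs run in-process, in a single JVM
    },
    "pseudo-distributed": {
        "fs.default.name": "hdfs://localhost/",
        "dfs.replication": "1",          # only one datanode is available
        "mapred.job.tracker": "localhost:8021",
    },
    "fully-distributed": {
        "fs.default.name": "hdfs://namenode/",
        "dfs.replication": "3",          # the HDFS default
        "mapred.job.tracker": "jobtracker:8021",
    },
}


def uses_daemons(mode):
    """Return True when the mode runs Hadoop daemons (anything but standalone)."""
    return MODE_PROPERTIES[mode]["mapred.job.tracker"] != "local"


for mode in MODE_PROPERTIES:
    print(mode, "-> daemons:", uses_daemons(mode))
```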
Multi-Node Hadoop Cluster
Reference: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
A Typical Production Hadoop Cluster
Machine Type | Workload Pattern/Cluster Type | Storage | Processor (# of Cores) | Memory (GB)
Slaves | Balanced workload | Four to six 1 TB disks | Dual Quad | 24
Slaves | Compute intensive workload | Four to six 1 TB or 2 TB disks | Dual Hexa Quad | 24-48
Slaves | I/O intensive workload | Twelve 1 TB disks | Dual Quad | 24-48
Slaves | HBase clusters | Twelve 1 TB disks | Dual Hexa Quad | 48-96
Masters | All workload patterns/HBase clusters | Four to six 2 TB disks | Dual Quad | Depends on number of file system objects to be created by NameNode.
Network: Dual 1 GB links for all nodes in a 20 node rack and 2 x 10 GB interconnect links per rack going to a pair of central switches.
Reference: http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm
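The disk counts in the table translate into HDFS capacity only after replication is accounted for. A rough sizing sketch follows. The replication factor of 3 is the HDFS default; the 25% of raw disk reserved for non-HDFS use (intermediate MapReduce output, logs, the operating system) is an assumed rule of thumb, not a figure from the slides.

```python
def raw_capacity_tb(nodes, disks_per_node, disk_tb):
    """Total raw disk across all slave nodes, in TB."""
    return nodes * disks_per_node * disk_tb


def usable_capacity_tb(nodes, disks_per_node, disk_tb,
                       replication=3, reserved_fraction=0.25):
    """Approximate HDFS capacity in TB after reserving space and replicating.

    reserved_fraction is an assumption; tune it to the actual workload.
    """
    raw = raw_capacity_tb(nodes, disks_per_node, disk_tb)
    return raw * (1 - reserved_fraction) / replication


# One 20-node rack of balanced-workload slaves with six 1 TB disks each.
print(raw_capacity_tb(20, 6, 1))     # 120
print(usable_capacity_tb(20, 6, 1))  # 30.0
```

Under these assumptions, 120 TB of raw disk holds roughly 30 TB of user data.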
MapReduce Job execution (Map Task)
MapReduce Job execution (Reduce Task)
Hadoop Shell commands
• Create a directory in HDFS at given path(s)
Usage: hadoop fs -mkdir <paths>
Example: hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2
• List the contents of a directory
Usage: hadoop fs -ls <args>
Example: hadoop fs -ls /user/saurzcode
• Upload and download a file in HDFS.
Usage: hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example: hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
Usage: hadoop fs -get <hdfs_src> <localdst>
Example: hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/
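The commands above are often chained: create a directory, upload into it, then list it to confirm. A minimal Python sketch of that sequence follows. The helper names `hadoop_fs` and `upload` are hypothetical, and by default the helper only prints each command (a dry run), so the sketch can be tried without a Hadoop installation.

```python
import shlex
import subprocess


def hadoop_fs(*args, dry_run=True):
    """Build a `hadoop fs` command; execute it only when dry_run is False."""
    argv = ["hadoop", "fs", *args]
    if dry_run:
        print(shlex.join(argv))
    else:
        # Raises CalledProcessError if the hadoop command exits non-zero.
        subprocess.run(argv, check=True)
    return argv


def upload(local_file, hdfs_dir, dry_run=True):
    """Create hdfs_dir, upload local_file into it, then list the directory."""
    return [
        hadoop_fs("-mkdir", hdfs_dir, dry_run=dry_run),
        hadoop_fs("-put", local_file, hdfs_dir, dry_run=dry_run),
        hadoop_fs("-ls", hdfs_dir, dry_run=dry_run),
    ]


upload("/home/saurzcode/Samplefile.txt", "/user/saurzcode/dir3/")
```

Passing `dry_run=False` runs the commands for real. Note that `-mkdir` behavior differs between Hadoop releases when the directory already exists or its parent is missing, so a production script would need to handle that case.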
Hadoop Shell commands (contd.)
• See contents of a file
Usage: hadoop fs -cat <path[filename]>
Example: hadoop fs -cat /user/saurzcode/dir1/abc.txt
• Move file from source to destination.
Usage: hadoop fs -mv <src> <dest>
Example: hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
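• Remove a file or directory in HDFS. -rm removes files; -rmr removes a directory and everything under it (newer Hadoop releases deprecate -rmr in favor of -rm -r).
Usage: hadoop fs -rm <arg>
Example: hadoop fs -rm /user/saurzcode/dir1/abc.txt
Usage: hadoop fs -rmr <arg>
Example: hadoop fs -rmr /user/saurzcode/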
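A recursive remove deletes everything under the given path, so scripts that issue it usually add a guard. The sketch below builds the remove command and refuses recursive deletes on very shallow paths such as `/` or `/user`. The helper name `rm_command` and the `min_depth` policy are assumptions made for illustration; it emits `-rmr` to match the usage shown here, whereas newer Hadoop releases prefer `-rm -r`.

```python
from pathlib import PurePosixPath


def rm_command(path, recursive=False, min_depth=2):
    """Build a `hadoop fs` remove command with a basic safety check.

    min_depth is a hypothetical policy: recursive deletes are refused for
    paths with fewer than min_depth components below the root.
    """
    p = PurePosixPath(path)
    if not p.is_absolute():
        raise ValueError(f"expected an absolute HDFS path, got {path!r}")
    depth = len(p.parts) - 1  # parts[0] is the root "/"
    if recursive and depth < min_depth:
        raise ValueError(f"refusing recursive delete of shallow path {path!r}")
    flag = "-rmr" if recursive else "-rm"
    return ["hadoop", "fs", flag, str(p)]


print(rm_command("/user/saurzcode/dir1/abc.txt"))
print(rm_command("/user/saurzcode/dir1", recursive=True))
```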
Hadoop Shell commands (contd.)
• Display the last kilobyte of a file.
Usage: hadoop fs -tail <path[filename]>
Example: hadoop fs -tail /user/saurzcode/dir1/abc.txt
• Display the size of a file, or of each file and directory under a directory path.
Usage: hadoop fs -du <path>
Example: hadoop fs -du /user/saurzcode/dir1/abc.txt
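`hadoop fs -tail` works on bytes, not lines: it prints the last kilobyte of the file. The following local-filesystem analogue illustrates that behavior in Python. It is not Hadoop code, and the function name `tail_bytes` is hypothetical.

```python
import os
import tempfile


def tail_bytes(path, limit=1024):
    """Return at most the last `limit` bytes of a local file."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)   # seek returns the new position
        f.seek(max(size - limit, 0))    # back up by `limit`, or to the start
        return f.read()


# Demonstrate on a throwaway 3000-byte file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3000)
print(len(tail_bytes(tmp.name)))  # 1024
os.remove(tmp.name)
```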
Hadoop Copy Commands
• Copy a file from source to destination
Usage: hadoop fs -cp <source> <dest>
Example: hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
• Copy a file between the local file system and HDFS
Usage: hadoop fs -copyFromLocal <localsrc> URI
Example: hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Usage: hadoop fs -copyToLocal URI <localdst>
Example: hadoop fs -copyToLocal /user/saurzcode/abc.txt /home/saurzcode/abc.txt
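One way to check a data load is a round trip: copy a file into HDFS with -copyFromLocal, copy it back with -copyToLocal, and compare checksums of the two local files. The sketch below builds both commands and provides the local comparison. All function names are hypothetical, and the commands are only constructed, not executed.

```python
import hashlib


def copy_from_local(local_src, hdfs_uri):
    """Command that copies a local file into HDFS."""
    return ["hadoop", "fs", "-copyFromLocal", local_src, hdfs_uri]


def copy_to_local(hdfs_uri, local_dst):
    """Command that copies an HDFS file to the local file system."""
    return ["hadoop", "fs", "-copyToLocal", hdfs_uri, local_dst]


def sha256_of(path):
    """SHA-256 hex digest of a local file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_commands(local_src, hdfs_uri, local_copy):
    """The two commands of a load-and-verify round trip, in order."""
    return [copy_from_local(local_src, hdfs_uri),
            copy_to_local(hdfs_uri, local_copy)]


for cmd in round_trip_commands("/home/saurzcode/abc.txt",
                               "/user/saurzcode/abc.txt",
                               "/tmp/abc.copy.txt"):
    print(" ".join(cmd))
```

After running both commands, `sha256_of(local_src) == sha256_of(local_copy)` indicates the file survived the round trip unchanged.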
Hadoop Cluster Configuration and Data Loading - Module 2