A new way to store and analyze
data
An elephant can't jump, but it can carry a heavy load.
Overview…
 What is Hadoop?
 Why Hadoop?
 Famous Hadoop users.
 Core-Components of Hadoop.
 Hadoop installation and configuration.
 Starting your Single-node cluster.
 Stopping your Single-node cluster.
 Running a single-node cluster program.
What is Apache Hadoop?
The best-known technology used
for Big Data is Hadoop.
Apache Hadoop is a framework that allows
for the distributed processing of large data
sets across clusters of commodity computers
using a simple programming model.
Why Hadoop?
 Need to process a 100 TB dataset:
 On 1 node,
 scanning @ 50 MB/s ≈ 23 days
 On a 1000-node cluster,
 scanning @ 50 MB/s ≈ 33 mins
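The back-of-the-envelope arithmetic behind these figures can be checked directly (using 100 TB = 100 × 10^6 MB):

```shell
# Total scan time for 100 TB at 50 MB/s on a single node, in seconds.
seconds=$((100 * 1000000 / 50))                  # 2,000,000 s
echo "1 node:     $((seconds / 86400)) days"     # ~23 days
echo "1000 nodes: $((seconds / 1000 / 60)) min"  # ~33 min
```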
Famous Hadoop Users..
Two clusters of 8,000 and 3,000
nodes
100 nodes
4,500 nodes
532 nodes
Core-Components of Hadoop..
 The Cluster is the set of host machines (nodes). Nodes may be
partitioned into racks. This is the hardware part of the infrastructure.
 The YARN Infrastructure (Yet Another Resource Negotiator) is
the framework responsible for providing the computational
resources (e.g., CPUs, memory, etc.) needed for application
executions. Two important elements are:
 Resource Manager
 Node Manager
 The HDFS (Hadoop Distributed File System), inspired by the
Google File System, is the primary distributed storage used by Hadoop.
 The MapReduce engine is the original processing model for Hadoop
clusters. It distributes work across the cluster (the map step), then
organizes and reduces the results from the nodes into a response to the
query.
MapReduce Example
Two input files:
file1: “hello world hello moon”
file2: “goodbye world goodnight moon”
Three operations:
 map
 combine
 reduce
We'll use WordCount as an example: count occurrences of each
word across the input files.
What is the output per step?
MAP
First map:
< hello, 1 >
< world, 1 >
< hello, 1 >
< moon, 1 >
Second map:
< goodbye, 1 >
< world, 1 >
< goodnight, 1 >
< moon, 1 >
COMBINE
First map:
< hello, 2 >
< world, 1 >
< moon, 1 >
Second map:
< goodbye, 1 >
< world, 1 >
< goodnight, 1 >
< moon, 1 >
REDUCE
< goodbye, 1 >
< goodnight, 1 >
< moon, 2 >
< world, 2 >
< hello, 2 >
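The same WordCount data flow can be sketched as a plain shell pipeline: `tr` plays the role of the mappers (one word per line), while `sort | uniq -c` stands in for the shuffle and reduce phases. This only illustrates the data flow, not how Hadoop actually executes it:

```shell
# Recreate the two example input files.
printf 'hello world hello moon\n'       > file1
printf 'goodbye world goodnight moon\n' > file2

# map: split into one word per line; shuffle + reduce: group and count.
cat file1 file2 | tr ' ' '\n' | sort | uniq -c
# Expected counts: goodbye 1, goodnight 1, hello 2, moon 2, world 2
```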
Hadoop Installation and Configuration...
1. Installing Java:- Java is the primary requirement for running
Hadoop on any system.
$ java -version
Cont...
2. Configuring SSH:- Hadoop requires SSH access to manage its
nodes, i.e. remote machines plus your local machine.
$ ssh-keygen -t rsa -P ""
Cont...
3. The next step is to test the SSH setup by connecting to your local
machine as the current user.
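For the passwordless login test to succeed, the public key generated in the previous step has to be in authorized_keys first; a minimal sketch, assuming the default key path produced by `ssh-keygen -t rsa`:

```shell
# Authorize the freshly generated key for logins to this machine.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# Test: this should now log in to localhost without a password prompt.
ssh localhost
```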
Cont...
4. Disabling IPv6 : To disable IPv6 on Ubuntu, open /etc/sysctl.conf
in the editor of your choice and add the following lines to the end of
the file:
/etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take
effect.
Cont...
You can check whether IPv6 is enabled on your machine with the
following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled; a value of 1 means it is
disabled (that’s what we want).
Cont...
5. Installing Hadoop:-
Cont...
Let’s unpack the Hadoop tarball:
$ tar xvzf hadoop-2.8.1.tar.gz
Cont...
6. Configure Hadoop Pseudo-Distributed Mode:-
• Setup Environment Variables - First we need to set the environment variables
used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.
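The exact variables appeared on the slide as a screenshot; a typical set for a Hadoop 2.x install looks like the sketch below. The `$HOME/hadoop` path is an assumption -- adjust it to wherever you unpacked the tarball:

```shell
# Append to ~/.bashrc -- hypothetical install location $HOME/hadoop.
export HADOOP_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```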
Cont...
• Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the
JAVA_HOME environment variable.
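For example (the JDK path below is an assumption -- use the location of your own Java install, e.g. as reported by `readlink -f "$(which java)"`):

```shell
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
```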
Cont...
• Edit Configuration Files:- Hadoop has many configuration
files, which need to be configured per the requirements of your
Hadoop infrastructure.
$ gedit hadoop/etc/hadoop/core-site.xml:
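The slide showed the file contents as a screenshot; a typical minimal core-site.xml for a pseudo-distributed setup points the default filesystem at a local HDFS instance (port 9000 is a common convention, not mandated):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```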
Cont...
$ gedit hadoop/etc/hadoop/hdfs-site.xml
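A typical minimal hdfs-site.xml for a single-node cluster lowers the replication factor to 1, since there is only one DataNode to hold the blocks:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```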
Cont...
$ gedit hadoop/etc/hadoop/mapred-site.xml
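A typical minimal mapred-site.xml tells MapReduce jobs to run on YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```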
Cont...
$ gedit hadoop/etc/hadoop/yarn-site.xml
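A typical minimal yarn-site.xml enables the auxiliary shuffle service that MapReduce jobs depend on:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```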
Cont...
7. Formatting the NameNode:- Now format the NameNode using the
following command.
$ hdfs namenode -format
Starting your Single-node cluster
Now run start-dfs.sh script.
$ start-dfs.sh
Now run start-yarn.sh script.
$ start-yarn.sh
This will start up a NameNode, a DataNode, a ResourceManager and a NodeManager
on your machine (YARN replaces the old JobTracker/TaskTracker daemons).
A nifty tool for checking whether the expected Hadoop processes
are running is jps
$ /usr/lib/jvm/java-8-oracle/bin/jps
Go to the browser and open http://127.0.0.1:50070
Hadoop installation with an example
Stopping your Single-node cluster
$ stop-yarn.sh
$ stop-dfs.sh
Running a single-node cluster program
WordCount Program :
- What is the WordCount program?
- WordCount is a simple application that counts how many
times different words appear in a given input set. Steps for this
session:
1. Upload the input data text file into HDFS.
2. Create your own jar for the WordCount program.
3. Run the jar on the Hadoop cluster.
4. View the output on the command line.
5. View and download the output file.
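The five steps above can be sketched as shell commands; wordcount.jar, the WordCount class name, and the HDFS paths below are placeholders for your own files, not names from the original slides:

```shell
# 1. Upload the input text file into HDFS.
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input.txt /user/hduser/input

# 2-3. Run your own jar on the cluster (jar and class names are hypothetical).
hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output

# 4. View the output on the command line.
hdfs dfs -cat /user/hduser/output/part-r-00000

# 5. Download the output to the local filesystem.
hdfs dfs -get /user/hduser/output ./output
```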
Input File..
Output File..
Download Output File..
 You can download the output file:
References..
• Apache Hadoop
(http://hadoop.apache.org)
• Hadoop on Wikipedia
(http://en.wikipedia.org/wiki/Hadoop)
• Cloudera - Apache Hadoop for the Enterprise
(http://www.cloudera.com)