A new way to store and analyze
data
An elephant can't jump, but it can carry a heavy load.
Overview…
 What is Hadoop?
 Why Hadoop?
 Famous Hadoop users.
 Core-Components of Hadoop.
 Hadoop installation and configuration.
 Starting your Single-node cluster.
 Stopping your Single-node cluster.
 Running a single-node cluster program.
What is Apache Hadoop?
The best-known technology used
for Big Data is Hadoop.
Apache Hadoop is a framework that allows
for the distributed processing of large data
sets across clusters of commodity computers
using a simple programming model.
Why Hadoop?
 Need to process a 100 TB dataset:
 On 1 node,
 scanning @ 50 MB/s ≈ 23 days
 On a 1000-node cluster,
 scanning @ 50 MB/s ≈ 33 mins
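The back-of-the-envelope arithmetic behind these figures can be checked directly (using 100 TB = 100 × 10^6 MB):

```shell
# Total scan time for 100 TB at 50 MB/s on a single node, in seconds.
seconds=$((100 * 1000000 / 50))                  # 2,000,000 s
echo "1 node:     $((seconds / 86400)) days"     # ~23 days
echo "1000 nodes: $((seconds / 1000 / 60)) min"  # ~33 min
```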
Famous Hadoop Users..
Two clusters of 8,000 and 3,000
nodes
100 nodes
4,500 nodes
532 nodes
Core-Components of Hadoop..
 The Cluster is the set of host machines (nodes). Nodes may be
partitioned into racks. This is the hardware part of the infrastructure.
 The YARN Infrastructure (Yet Another Resource Negotiator) is
the framework responsible for providing the computational
resources (e.g., CPUs, memory, etc.) needed for application
executions. Two important elements are:
 Resource Manager
 Node Manager
 The HDFS (Hadoop Distributed File System), inspired by the
Google File System, is the primary distributed storage used by Hadoop.
 The MapReduce engine is the original processing model for Hadoop
clusters. It distributes work across the cluster (the map step), then
organizes and reduces the results from the nodes into a response to the
query.
MapReduce Example
Two input files:
file1: “hello world hello moon”
file2: “goodbye world goodnight moon”
Three operations:
 map
 combine
 reduce
We'll use WordCount as an example: count occurrences of each
word across the input files.
What is the output per step?
MAP
First map:
< hello, 1 >
< world, 1 >
< hello, 1 >
< moon, 1 >
Second map:
< goodbye, 1 >
< world, 1 >
< goodnight, 1 >
< moon, 1 >
COMBINE
First map:
< hello, 2 >
< world, 1 >
< moon, 1 >
Second map:
< goodbye, 1 >
< world, 1 >
< goodnight, 1 >
< moon, 1 >
REDUCE
< goodbye, 1 >
< goodnight, 1 >
< moon, 2 >
< world, 2 >
< hello, 2 >
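The same WordCount data flow can be sketched as a plain shell pipeline: `tr` plays the role of the mappers (one word per line), while `sort | uniq -c` stands in for the shuffle and reduce phases. This only illustrates the data flow, not how Hadoop actually executes it:

```shell
# Recreate the two example input files.
printf 'hello world hello moon\n'       > file1
printf 'goodbye world goodnight moon\n' > file2

# map: split into one word per line; shuffle + reduce: group and count.
cat file1 file2 | tr ' ' '\n' | sort | uniq -c
# Expected counts: goodbye 1, goodnight 1, hello 2, moon 2, world 2
```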
Hadoop Installation and Configuration...
1. Installing Java:- Java is the primary requirement for running
Hadoop on any system.
$ java -version
Cont...
2. Configuring SSH:- Hadoop requires SSH access to manage its
nodes, i.e. remote machines plus your local machine.
$ ssh-keygen -t rsa -P ""
Cont...
3. The next step is to test the SSH setup by connecting to your local
machine as the current user.
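For the passwordless login test to succeed, the public key generated in the previous step has to be in authorized_keys first; a minimal sketch, assuming the default key path produced by `ssh-keygen -t rsa`:

```shell
# Authorize the freshly generated key for logins to this machine.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# Test: this should now log in to localhost without a password prompt.
ssh localhost
```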
Cont...
4. Disabling IPv6 : To disable IPv6 on Ubuntu, open /etc/sysctl.conf
in the editor of your choice and add the following lines to the end of
the file:
/etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take
effect.
Cont...
You can check whether IPv6 is enabled on your machine with the
following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled; a value of 1 means it is
disabled (that’s what we want).
Cont...
5. Installing Hadoop:-
Cont...
Let’s unpack the Hadoop tarball:
$ tar xvzf hadoop-2.8.1.tar.gz
Cont...
6. Configure Hadoop Pseudo-Distributed Mode:-
• Setup Environment Variables - First we need to set the environment variables
used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.
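The exact variables appeared on the slide as a screenshot; a typical set for a Hadoop 2.x install looks like the sketch below. The `$HOME/hadoop` path is an assumption -- adjust it to wherever you unpacked the tarball:

```shell
# Append to ~/.bashrc -- hypothetical install location $HOME/hadoop.
export HADOOP_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```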
Cont...
• Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the
JAVA_HOME environment variable.
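For example (the JDK path below is an assumption -- use the location of your own Java install, e.g. as reported by `readlink -f "$(which java)"`):

```shell
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
```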
Cont...
• Edit Configuration Files:- Hadoop has many configuration
files, which need to be configured per the requirements of your
Hadoop infrastructure.
$ gedit hadoop/etc/hadoop/core-site.xml:
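The slide showed the file contents as a screenshot; a typical minimal core-site.xml for a pseudo-distributed setup points the default filesystem at a local HDFS instance (port 9000 is a common convention, not mandated):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```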
Cont...
$ gedit hadoop/etc/hadoop/hdfs-site.xml
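A typical minimal hdfs-site.xml for a single-node cluster lowers the replication factor to 1, since there is only one DataNode to hold the blocks:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```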
Cont...
$ gedit hadoop/etc/hadoop/mapred-site.xml
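A typical minimal mapred-site.xml tells MapReduce jobs to run on YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```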
Cont...
$ gedit hadoop/etc/hadoop/yarn-site.xml
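A typical minimal yarn-site.xml enables the auxiliary shuffle service that MapReduce jobs depend on:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```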
Cont...
7. Formatting the NameNode:- Now format the NameNode using the
following command.
$ hdfs namenode -format
Starting your Single-node cluster
Now run start-dfs.sh script.
$ start-dfs.sh
Now run start-yarn.sh script.
$ start-yarn.sh
This will start up a NameNode, a DataNode, a ResourceManager and a NodeManager
on your machine (YARN replaces the old JobTracker/TaskTracker daemons).
A nifty tool for checking whether the expected Hadoop processes
are running is jps
$ /usr/lib/jvm/java-8-oracle/bin/jps
Go to the browser and open http://127.0.0.1:50070
Hadoop installation with an example
Stopping your Single-node cluster
$ stop-yarn.sh
$ stop-dfs.sh
Running a single-node cluster program
WordCount Program :
- What is the WordCount program?
- WordCount is a simple application that counts how many
times different words appear in a given input set. Steps for this
session:
1. Upload the input data text file into HDFS.
2. Create your own jar for the WordCount program.
3. Run the jar on the Hadoop cluster.
4. View the output on the command line.
5. View and download the output file.
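The five steps above can be sketched as shell commands; wordcount.jar, the WordCount class name, and the HDFS paths below are placeholders for your own files, not names from the original slides:

```shell
# 1. Upload the input text file into HDFS.
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input.txt /user/hduser/input

# 2-3. Run your own jar on the cluster (jar and class names are hypothetical).
hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output

# 4. View the output on the command line.
hdfs dfs -cat /user/hduser/output/part-r-00000

# 5. Download the output to the local filesystem.
hdfs dfs -get /user/hduser/output ./output
```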
Input File..
Output File..
Download Output File..
 You can download the output file:
References..
• Apache Hadoop
(http://hadoop.apache.org)
• Hadoop on Wikipedia
(http://en.wikipedia.org/wiki/Hadoop)
• Cloudera - Apache Hadoop for the Enterprise
(http://www.cloudera.com)