Introduction to Hadoop

What is BigData?
The term “BigData” describes collections of data so large and
complex that they are difficult to capture, search, store, process
and analyze with a traditional database management system.
This data comes from everywhere, for example:
- Social media sites
- Traffic and satellite data
- The digital world
- Software logs
- Business data
- And many more

• BigData includes both structured and unstructured data.
• BigData is difficult to work with using most relational
database management systems.
• BigData is more than simply a matter of size; it is an
opportunity to find insights in new and emerging types
of data and content, to make your business more agile.
• Why is it so important?
1. More data can lead to more accurate analyses.
2. More accurate analyses lead to better decision making.
3. Better decisions mean greater operational efficiency,
cost reductions and reduced risk.

What is Hadoop?
“Apache Hadoop is an open source software framework
used to process large data sets across distributed clusters
of commodity hardware using simple programming models.”
Hadoop does not depend on highly available hardware; it is
designed to detect and handle failures in software.
Hadoop processes the data in parallel on a large cluster.
Google created its own distributed computing framework
and published papers about it. Hadoop was developed on
the basis of those papers.
Core Hadoop consists of two components:
-The Hadoop Distributed File System (HDFS)
-MapReduce
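
To make the MapReduce component more concrete, here is a minimal single-process sketch of the programming model in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are assumptions chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "hadoop stores big data"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions would run in parallel on many nodes, with the input read from HDFS; this sketch only shows the shape of the computation.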

Why Hadoop?
- Economical (cost effective)
- Flexible
- Scalable
- Solves BigData problems
- Reliable
- Smart

How Hadoop works
[Diagram: a client submits its program and data to the master node,
which runs the HDFS Name Node and the MapReduce Job Tracker. Each of
the three slave nodes runs an HDFS Data Node and a MapReduce Task Tracker.]

STEPS:
Step 1 : The data is broken into blocks of 64 MB or
128 MB, and the blocks are moved to different
nodes.
Step 2 : Once all the blocks are moved, the Hadoop
framework passes the program to each
node.
Step 3 : The Job Tracker then starts scheduling the
program on individual nodes.
Step 4 : Once all the nodes are done, the output is
returned.
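
The four steps above can be sketched as a small simulation in plain Python. This is an illustration under stated assumptions, not how Hadoop is implemented: the helper names (split_into_blocks, place_blocks, run_job), the node names and the round-robin placement are all hypothetical. Real HDFS chooses block locations with its own policy and also replicates each block.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, one of the block sizes named in Step 1


def split_into_blocks(total_bytes, block_size=BLOCK_SIZE):
    """Step 1a: return the size of each block for a file of total_bytes."""
    full, rest = divmod(total_bytes, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, nodes):
    """Step 1b: assign block indexes to nodes round-robin (a simplification)."""
    placement = {node: [] for node in nodes}
    for index in range(num_blocks):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


def run_job(blocks, placement, program):
    """Steps 2-4: run the program on each node's blocks, then combine."""
    per_node = {
        node: sum(program(blocks[i]) for i in indexes)
        for node, indexes in placement.items()
    }
    return sum(per_node.values()), per_node


if __name__ == "__main__":
    blocks = split_into_blocks(200 * 1024 * 1024)  # a hypothetical 200 MB file
    placement = place_blocks(len(blocks), ["slave1", "slave2", "slave3"])
    # The "program" here just reports how many bytes it saw in a block.
    total, per_node = run_job(blocks, placement, lambda size: size)
    print(len(blocks), placement, total)
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB block. Each node processes only the blocks placed on it, and the per-node results are summed into the final output.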

History
Hadoop was inspired by Google’s MapReduce, a
software framework in which an application is broken
down into numerous small parts. Any of those parts
(also called fragments or blocks) can be run on any node
in the cluster.
Doug Cutting, Hadoop’s creator, named the framework
after his child’s stuffed toy elephant.
In 2002, Doug Cutting created an open source web
crawler project. In 2003 and 2004, Google published its
GFS and MapReduce papers. In 2006, Doug Cutting developed
the open source MapReduce and HDFS project. In 2008, Yahoo
ran a 4,000-node Hadoop cluster and Hadoop won the
terabyte sort benchmark. In 2009, Facebook launched
SQL support for Hadoop.

Pig
Apache Pig is a platform for analyzing large data sets.
It consists of a high-level language for expressing data
analysis programs. Introduced by Yahoo.
Hive
Apache Hive is data warehouse software used for
querying and managing large data sets on a distributed
cluster. Introduced by Facebook.
HBase
Apache HBase is a distributed, column-oriented database
built on top of HDFS and Hadoop.

Sqoop
The name Sqoop is a combination of “SQL” and “Hadoop”.
Sqoop is an import and export utility: a data transfer
tool that gets data into Hadoop from relational systems and
puts data back into an RDBMS for analysis with BI tools.
ZooKeeper
Apache ZooKeeper is a coordination service for distributed
systems. It is fast and scalable.
Oozie
Oozie is a workflow engine that runs on a server. It is a job
scheduling service within a Hadoop cluster.

Flume
Flume is a service that lets you ingest data
(typically file data) into HDFS. It is defined as a distributed,
reliable and available service for moving large amounts of
data as it is produced.
Ganesh L. Sanap
connectoganesh@gmail.com
