BIRLA VISHVAKARMA
MAHAVIDHYALAYA
TOPIC : DATABASE MANAGEMENT SYSTEM
SUBMITTED TO : Mrs. Bijal Dalwadi
SUBMITTED BY : Nikita Sure (140080116025)
Jimmy Chopda (140080116013)
Vruti Tankaria (140080116057)
Meshwa Patel (140080116035)
PART 1
What is Hadoop?
History of Hadoop.
Why Hadoop?
Where is Hadoop used?
What is Hadoop?
Apache Hadoop is an open-source
software framework written in Java for
distributed storage and distributed
processing of very large data sets on
computer clusters built from commodity
hardware.
Hadoop was created by Doug
Cutting and Mike Cafarella in 2005.
Cutting, who was working at Yahoo!
at that time, named it after his
son’s toy elephant.
It was originally developed to
support distribution for the Nutch
search engine project.
Its latest release, version 2.7.1,
came out on July 6, 2015.
Doug Cutting
Hadoop - Why ?
The complexity of modern analytics needs is
outstripping the available computing power
of legacy systems. With its distributed
processing, Hadoop can handle large
volumes of structured and unstructured data
more efficiently than the traditional
enterprise data warehouse. Because Hadoop
is open source and can run on commodity
hardware, the initial cost savings are
dramatic and continue to grow as your
organizational data grows.
 Smart meters are deployed in homes worldwide to help consumers and utility
companies manage the use of water, electricity, and gas better. Historically,
meter readers would walk from house to house recording meter readouts and
reporting them to the utility company for billing purposes. Because of the labor
costs, many utilities switched from monthly readings to quarterly. This delayed
revenue and made it impossible to analyze residential usage in any detail.
Consider a fictional company called CostCutter Utilities that serves 10 million
households. Once a quarter, they gathered 10 million readings to produce
utility bills. With government regulation and the price of oil skyrocketing,
CostCutter started deploying smart meters so they could get hourly readings of
electricity usage. They now collect 21.6 billion sensor readings per quarter
from the smart meters. Analysis of the meter data over months and years can
be correlated with energy saving campaigns, weather patterns, and local
events, providing savings insights both for consumers and CostCutter Utilities.
When consumers are offered a billing plan that has cheaper electricity from 8
p.m. to 5 a.m., they demand five-minute intervals in their smart meter reports
so they can identify high-use activity in their homes. At five-minute intervals,
the smart meters are collecting more than 100 billion meter readings every 90
days, and CostCutter Utilities now has a big data problem. Their data volume
exceeds their ability to process it with existing software and hardware. So
CostCutter Utilities turns to Hadoop to handle the incoming meter readings.
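The reading volumes quoted above can be sanity-checked with a quick back-of-the-envelope calculation (a small sketch, assuming a 90-day quarter):

```python
# Back-of-the-envelope check of CostCutter's meter-reading volumes.
households = 10_000_000
days_per_quarter = 90

# Hourly readings: 24 readings per household per day.
hourly = households * 24 * days_per_quarter
print(hourly)        # 21,600,000,000 -> the 21.6 billion figure

# Five-minute readings: 12 readings per household per hour.
five_minute = households * 12 * 24 * days_per_quarter
print(five_minute)   # 259,200,000,000 -> well over 100 billion
```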
The base Apache Hadoop framework is composed
of the following modules:
1. Hadoop Common – contains libraries and
utilities needed by other Hadoop modules
2. Hadoop Distributed File System (HDFS) – a
distributed file-system that stores data on
commodity machines
3. Hadoop YARN – a resource-management
platform responsible for managing computing
resources in clusters and using them for
scheduling of users' applications
4. Hadoop MapReduce – a programming model for
large-scale data processing
PART 2
 MapReduce.
 HDFS.
 Introduction to Hadoop architecture.
What is MapReduce?
 MapReduce is a processing technique and
a programming model for distributed
computing based on Java.
 The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
 Map takes a set of data and converts it
into another set of data, where individual
elements are broken down into tuples.
 Reduce takes the output of a Map as its
input and combines those tuples into a
smaller set of tuples.
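The Map and Reduce phases described above can be sketched with a toy word count in plain Python. This is a conceptual illustration of the model only, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Toy word-count sketch of the MapReduce model (not Hadoop's Java API).

def map_phase(document):
    # Map: break the input into (key, value) tuples -- here, (word, 1).
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: combine each group into a smaller set of tuples.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase("the quick fox saw the dog"))
print(counts)   # {'the': 2, 'quick': 1, 'fox': 1, 'saw': 1, 'dog': 1}
```

In real Hadoop the map and reduce functions run in parallel on many nodes, with the framework handling the shuffle between them.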
What is HDFS?
 HDFS is a Java-based file system that provides
scalable and reliable data storage, and it was
designed to span large clusters of commodity
servers. HDFS has demonstrated production
scalability of up to 200 PB of storage and a single
cluster of 4500 servers, supporting close to a billion
files and blocks.
 Hadoop can work directly with any mountable
distributed file system, but the most common file
system used by Hadoop is the HDFS. It is a fault-
tolerant distributed file system that is designed for
commonly available hardware. It is well-suited for
large data sets due to its high throughput access to
application data.
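How HDFS spreads a large file over commodity servers can be sketched with a little arithmetic, assuming the common defaults of a 128 MB block size and a replication factor of 3:

```python
import math

# Rough sketch of how HDFS lays out a file, assuming the common
# defaults of a 128 MB block size and a replication factor of 3.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    # The file is split into fixed-size blocks...
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # ...and every block is stored on REPLICATION different DataNodes.
    stored_mb = file_size_mb * REPLICATION
    return blocks, stored_mb

blocks, stored = hdfs_footprint(1000)   # a 1 GB file
print(blocks, stored)                   # 8 blocks, 3000 MB of raw storage
```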
HADOOP ARCHITECTURE
 "Hadoop employs a master/slave
architecture for both distributed storage
and distributed computation." In
distributed storage, the NameNode is the
master and the DataNodes are the slaves.
In distributed computation, the
JobTracker is the master and the
TaskTrackers are the slaves.
MASTERS
1. NAMENODE
 The NameNode is the HEART of an HDFS file
system. It keeps the directory tree of all files in
the file system, and tracks where across the
cluster the file data is kept. It does not store the
data of these files itself.
 When the NameNode goes down, the file system
goes offline.
2. JOBTRACKER
 The JobTracker is the service within Hadoop that
farms out MapReduce tasks to specific nodes in
the cluster, ideally the nodes that have the data,
or at least are in the same rack.
 The JobTracker is a single point of failure
for the Hadoop MapReduce service. If it goes
down, all running jobs are halted.
SLAVES
1. DATANODE
 A DataNode stores data in the Hadoop
Distributed File System. A functional file
system has more than one DataNode, with data
replicated across them.
2. TASKTRACKER
 A TaskTracker is a node in the cluster that
accepts tasks - Map, Reduce, and Shuffle
operations - from the JobTracker.
PART 3
 How the Hadoop architecture works.
 Reading files from HDFS.
 Writing files to HDFS.
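The read and write paths follow the master/slave split described earlier: a client asks the NameNode (master) where the blocks of a file live, then moves the data itself to or from the DataNodes (slaves). A toy Python simulation of that flow, with made-up node names and no real Hadoop APIs:

```python
import itertools

# Toy simulation of the HDFS write and read paths (no real Hadoop APIs).
# The NameNode (master) holds only metadata; DataNodes (slaves) hold bytes.
namenode = {}                       # filename -> [(block_id, [node, ...])]
datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}
node_cycle = itertools.cycle(datanodes)

def write_file(filename, blocks, replication=2):
    # Write path: the client streams each block to `replication` DataNodes,
    # and the NameNode records which nodes hold which block.
    entries = []
    for block_id, data in enumerate(blocks):
        targets = [next(node_cycle) for _ in range(replication)]
        for node in targets:
            datanodes[node][(filename, block_id)] = data
        entries.append((block_id, targets))
    namenode[filename] = entries

def read_file(filename):
    # Read path: ask the NameNode for block locations, then fetch each
    # block from one of the DataNodes that holds it.
    return "".join(datanodes[nodes[0]][(filename, block_id)]
                   for block_id, nodes in namenode[filename])

write_file("report.txt", ["hello ", "world"])
print(read_file("report.txt"))   # hello world
```

Note how the NameNode never touches file contents, which is why its metadata fits in memory even for very large clusters, and why losing it takes the whole file system offline.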
PART 4
 Advantages of Hadoop.
 Disadvantages of Hadoop.
 Where is it used?
 Subprojects of Hadoop.
Hadoop-Related Subprojects
 Pig
◦ High-level language for data analysis
 HBase
◦ Table storage for semi-structured data
 Zookeeper
◦ Coordinating distributed applications
 Hive
◦ SQL-like query language and metastore
 Mahout
◦ Machine learning
Etc….
Hadoop advantages
1. Scalable
Hadoop is a highly scalable storage platform, because it can
store and distribute very large data sets across hundreds of
inexpensive servers that operate in parallel.
2. Cost-effective
Hadoop is designed as a scale-out architecture that can
affordably store all of a company’s data for later use. The cost
savings are staggering: instead of costing thousands to tens of
thousands of pounds per terabyte, Hadoop offers computing
and storage capabilities for hundreds of pounds per terabyte.
3. Flexible
 Hadoop can be used for a wide variety of purposes, such as
log processing, recommendation systems, data
warehousing, market campaign analysis and fraud
detection.
4. Fast
 Hadoop’s unique storage method is based on a distributed
file system that basically ‘maps’ data wherever it is located
on a cluster. The tools for data processing are often on the
same servers where the data is located, resulting in much
faster data processing. If you’re dealing with large volumes
of unstructured data, Hadoop is able to efficiently process
terabytes of data in just minutes, and petabytes in hours.
5. Resilient to failure
 A key advantage of using Hadoop is its fault tolerance.
When data is sent to an individual node, that data is also
replicated to other nodes in the cluster, which means that
in the event of failure, there is another copy available for
use.
Problems
 Coding is tedious
 Want to change your data? In a database,
a single SQL UPDATE does it. Hadoop does
not support in-place updates.
 Hadoop stores data in files, and does not
index them. If you want to find something,
you have to run a MapReduce job going
through all the data.
 Hadoop shines where the data is too
big for a traditional database!
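The indexing point above can be made concrete with a small sketch: finding one record in Hadoop-style flat files means scanning everything (which is what a MapReduce job does), while a database index turns the same search into a single lookup. The record format here is invented for illustration:

```python
# Sketch of why unindexed storage hurts: flat files require a full scan,
# while a pre-built index answers the same query with one lookup.

records = [f"user{i},city{i % 50}" for i in range(100_000)]

# Hadoop-style: scan every record (what a MapReduce job would do).
scan_hits = [r for r in records if r.startswith("user99999,")]

# Database-style: build an index once, then answer queries instantly.
index = {r.split(",")[0]: r for r in records}
lookup_hit = index["user99999"]

print(scan_hits[0] == lookup_hit)   # True
```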
What do we want?
 Guaranteed data processing
 Fault-tolerance
 No intermediate message brokers!
 Higher level abstraction than message
passing
 “Just works” !!
Who uses Hadoop?
Hadoop is in use at many organizations that
handle big data:
 Amazon/A9
 Facebook
 Google
 IBM
 New York Times
 PowerSet
 Yahoo!
ANY QUESTIONS?