Hadoop

Hadoop Architecture
Presented by :
Yojana Nanaware
ME(CSE-I)

Agenda
• What is Hadoop?
• Why, When, Where?
• Hadoop : How?
• Hadoop Architecture
• Hadoop Common
• HDFS
• Hadoop Map/Reduce
• Process
• Hadoop Community
• Conclusion
• References

What is Hadoop?
• A SMART WAY TO STORE & ANALYZE DATA
• Created by Douglas Reed Cutting, an open-source developer who also originated Lucene and Nutch
• Open-source project administered by the Apache Software Foundation. Hadoop consists of two key services:

What is Hadoop?
– Hadoop Distributed File System (HDFS).
– Map/Reduce .
• Hadoop runs large-scale, high-performance processing jobs and completes them in spite of system changes or failures

Why Hadoop?
• Need to process 100TB datasets
• On 1 node :
– Scanning @ 50MB/s=23 days
– MTBF = 3 years
• On 1000 node cluster :
– Scanning @ 50MB/s=33mins
– MTBF = 1 day
• Need an efficient, reliable & usable framework
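
The figures above can be checked with a short calculation. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes), a 365-day year, data split evenly across nodes, and independent node failures, none of which the slide states. The function names are made up for the illustration.

```python
# Back-of-the-envelope check of the scan-time and MTBF figures.
# Assumptions (not from the slide): decimal units, 365-day year,
# even data distribution, independent node failures.

DATASET_BYTES = 100 * 10**12   # 100 TB
SCAN_RATE_BPS = 50 * 10**6     # 50 MB/s per node
NODE_MTBF_DAYS = 3 * 365       # 3 years per node


def scan_seconds(nodes):
    """Time to scan the whole dataset when it is split evenly over `nodes`."""
    return DATASET_BYTES / (SCAN_RATE_BPS * nodes)


def cluster_mtbf_days(nodes):
    """Expected time until some node in the cluster fails."""
    return NODE_MTBF_DAYS / nodes


if __name__ == "__main__":
    print(f"1 node:     {scan_seconds(1) / 86400:.1f} days to scan")
    print(f"1000 nodes: {scan_seconds(1000) / 60:.1f} minutes to scan")
    print(f"1000 nodes: a failure about every {cluster_mtbf_days(1000):.1f} days")
```

The speed-up is what makes the cluster attractive, and the roughly daily failure is why the framework has to handle failures itself.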

Where & When?
• Where
– Batch data processing, not real-time/user-facing
– Highly parallel, data-intensive distributed applications
– Very large production deployments
• When
– Processing lots of unstructured data
– When your processing can easily be made parallel
– Running batch jobs is acceptable
– When you have access to lots of cheap hardware

Hadoop : How?
• Commodity hardware cluster
• Distributed File System
– Modeled on GFS
• Distributed Processing Framework
– Using Map/Reduce metaphor
• Open-source Java
– Originated within the Apache Lucene project

Hadoop Architecture
Hadoop consists of:
•Hadoop Common
– Common utilities that support the other Hadoop subprojects
•HDFS
– Provides high-throughput access to application data
•MapReduce
– Framework for distributed processing of large data sets on compute clusters

Hadoop Common
• It is a set of utilities
• Includes File system, RPC, & Serialization
libraries

HDFS
• Primary storage system
• Creates multiple replicas of data blocks &
distributes them on compute nodes
throughout a cluster to enable reliable,
extremely rapid computations.
• Replication & locality
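
A toy sketch of the replication idea follows. It is not HDFS's actual placement policy (which is rack-aware); the round-robin scheme, block size, node names, and function names are all assumptions chosen to keep the example small.

```python
# Toy illustration of block replication; NOT the real HDFS placement policy.
# Block size, replication factor, and node names are assumptions.


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
    print(placement)
    print(readable_after_failure(placement, "n2"))
```

With three replicas, losing any single node leaves every block readable; with one replica it does not. Locality follows from the same table: a task that needs block 0 can be scheduled on any node that already holds it.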

Hadoop MapReduce
• The Map/Reduce programming model
– Framework
– Pluggable user code
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for
– log processing
– web search indexing
– Ad-hoc queries

Map/Reduce Implementation
1. Input files are split
2. Master & workers are assigned
3. Map tasks run
4. Intermediate data is written to disk
5. Intermediate data is read & sorted
6. Reduce tasks run
7. Results are returned
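
The seven steps above can be mimicked in a single process to show how the pieces fit together. This is only a sketch: real Hadoop spreads the work over many machines and writes intermediate data to local disk, and `run_job` with its parameters is a hypothetical name, not a Hadoop API.

```python
# Single-process sketch of the seven steps. All names are hypothetical.
from collections import defaultdict


def run_job(records, mapper, reducer, num_splits=3, num_reducers=2):
    # 1. Split the input into roughly equal chunks.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # 2-3. Each split stands in for the work one map worker is assigned.
    # 4. Intermediate pairs are partitioned by key; the in-memory lists
    #    stand in for the per-reducer files a map task writes to disk.
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                partitions[hash(key) % num_reducers].append((key, value))

    # 5. Each reducer reads its partition and sorts it by key.
    # 6. The reduce function is called once per key with all its values.
    output = {}
    for partition in partitions:
        grouped = defaultdict(list)
        for key, value in sorted(partition, key=lambda kv: kv[0]):
            grouped[key].append(value)
        for key, values in grouped.items():
            output[key] = reducer(key, values)

    # 7. Return the combined result to the caller.
    return output


if __name__ == "__main__":
    logs = ["INFO start", "ERROR disk", "INFO done", "ERROR net", "WARN slow"]
    counts = run_job(
        logs,
        mapper=lambda line: [(line.split()[0], 1)],
        reducer=lambda key, values: sum(values),
    )
    print(counts)
```

Because every pair with the same key lands in the same partition, each reducer sees all the values for its keys, which is the property the shuffle step guarantees.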

Example of Map/Reduce word count
• Read text files & count how often each word occurs.
– The input is text files
– The output is a text file
• Each line: word, tab, count
• Map – Produce pairs of (word, count)
• Reduce – For each word, sum up the counts
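
A minimal sketch of this word count, written in the tab-separated text style that Hadoop Streaming uses between mapper and reducer. It runs locally with `sorted()` standing in for the shuffle/sort phase; the function names are hypothetical, and a real job would read from standard input rather than a list.

```python
# Word count with tab-separated lines between map and reduce.
# A local sketch only; function names are hypothetical.
from itertools import groupby


def map_words(lines):
    """Map: emit "word<TAB>1" for every word in the input text."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(sorted_lines):
    """Reduce: sum the counts per word; input must be sorted by word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def word_count(lines):
    # sorted() stands in for the shuffle/sort between map and reduce.
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for line in word_count(["the cat", "the dog"]):
        print(line)
```

Each output line has the form word, tab, count, matching the output format described above.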

Process
• Installation
– Requirements: Linux, Java 1.6, sshd, rsync
– Configure SSH for passwordless authentication
– Unpack the Hadoop distribution
– Edit a few configuration files
– Format the DFS on the name node
– Start all the daemon processes
• Execution
– Compile your job into a jar file
– Copy input data into HDFS
– Execute bin/hadoop jar with the relevant arguments
– Monitor tasks via the web interface (optional)
– Examine the output when the job is complete
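
The command-line steps above might look roughly like the following. The sketch only builds the command lines and does not run them. It assumes an older Hadoop release of the Java 1.6 era (0.20/1.x-style), where the scripts live under bin/ in the unpacked distribution. The jar name, main class, and paths are hypothetical placeholders.

```python
# Builds (does not run) command lines for the installation and execution
# steps. Assumes a Hadoop 0.20/1.x-style layout with scripts under bin/.
# The jar name, main class, and paths used below are hypothetical.


def install_commands():
    return [
        ["bin/hadoop", "namenode", "-format"],  # format the DFS on the name node
        ["bin/start-all.sh"],                   # start all the daemon processes
    ]


def job_commands(jar, main_class, local_input, hdfs_input, hdfs_output):
    return [
        # copy input data into HDFS
        ["bin/hadoop", "fs", "-put", local_input, hdfs_input],
        # run the job jar with its arguments
        ["bin/hadoop", "jar", jar, main_class, hdfs_input, hdfs_output],
        # examine the output once the job is complete
        ["bin/hadoop", "fs", "-cat", f"{hdfs_output}/part-*"],
    ]


if __name__ == "__main__":
    commands = install_commands() + job_commands(
        "wordcount.jar", "WordCount", "input.txt", "input", "output"
    )
    for command in commands:
        print(" ".join(command))
```

To execute the steps for real, each list could be passed to `subprocess.run(command, check=True)` from the Hadoop installation directory.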

Hadoop Community
• Hadoop Users
– Adobe
– Alibaba
– Amazon
– AOL
– Facebook
– Google
– IBM
• Major Contributors
– Apache
– Cloudera
– Yahoo

Conclusion
• Designed to run on cheap commodity hardware
• Handles data replication & node failure
• Cost-saving, efficient & reliable data processing

References
• http://www.newyorksys.com/hadoop-online-training
• Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
• http://hadoop.apache.org/core/docs/current/api/
