The document discusses Apache Hadoop, an open-source Java framework for storing and processing large data sets across clusters of commodity machines using HDFS (the Hadoop Distributed File System) and MapReduce. HDFS provides reliable, scalable, fault-tolerant storage, while MapReduce enables efficient parallel data processing through a programming model inspired by functional programming. The document covers the architecture of HDFS, its limitations, and the responsibilities of its components, such as namenodes and datanodes, as well as task management in MapReduce jobs.
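To make the MapReduce model concrete, below is a minimal sketch of the canonical word-count job, adapted from the standard Hadoop MapReduce tutorial pattern rather than from this document itself. The map function emits a `(word, 1)` pair for each token, and the reduce function sums the counts per word; the class names (`WordCount`, `TokenizerMapper`, `IntSumReducer`) are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each line of input, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The functional-programming lineage is visible here: map and reduce are pure transformations over key-value pairs, which is what lets the framework partition the input across datanodes, run tasks in parallel, and transparently re-execute failed tasks.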