The document is a hands-on introduction to Hadoop, detailing its origins, architecture, and programming model, with a particular focus on MapReduce. It discusses why scalability and fault tolerance matter when processing large datasets, contrasting traditional database methods with Hadoop's approach. The document also covers typical use cases and operational components, and gives examples and benchmarks comparing traditional processing to Hadoop's performance.
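The MapReduce model mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy (not an example from the document, and not the Hadoop API itself): a mapper emits key-value pairs, a shuffle step groups values by key, and a reducer aggregates each group — here, the classic word count.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for one word.
    return word, sum(counts)

def run_mapreduce(lines):
    # Shuffle phase: group intermediate values by key,
    # mimicking what the Hadoop framework does between map and reduce.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

result = run_mapreduce(["Hadoop scales out", "Hadoop tolerates faults"])
print(result)  # {'hadoop': 2, 'scales': 1, 'out': 1, 'tolerates': 1, 'faults': 1}
```

In real Hadoop, the same three phases run in parallel across a cluster, with the framework handling data distribution, shuffling, and recovery from node failures.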