This chapter surveys methods for processing large amounts of data across distributed systems. It introduces MapReduce, the programming model Google developed to process vast datasets across thousands of servers: work is divided into independent tasks that transform the data (the map phase), and the intermediate results are then collected and aggregated (the reduce phase). The chapter also covers scaling computation by launching many independent virtual machines and distributing work to them through a messaging queue. Overall, it provides an overview of approaches to parallel and distributed processing of big data on cloud infrastructure.
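The map/reduce split described above can be sketched in plain Python. This is a minimal single-machine illustration of the idea, not Google's implementation: the word-count task, the function names, and the explicit shuffle step are all illustrative choices.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: each document is processed independently,
    # emitting intermediate (word, 1) pairs.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key, so each
    # reducer sees all values for the keys it owns.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big systems", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 3, "data": 2, "systems": 1}
```

Because each map task touches only its own input and each reduce task only its own key group, the two phases can be spread across many machines; in a real deployment the shuffle step is handled by the framework over the network.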
Related topics: