Thinking in MapReduce - StampedeCon 2013

Thinking in MapReduce
Ryan Brush
@ryanbrush

We programmers have had it pretty good

Hardware has scaled up faster than our problem sets

[Figure: Software Engineers, Moore’s Law]

But the party is ending (or at least changing)

Data is growing faster than we can scale individual machines

So we have to spread our work across many machines

This is a big deal in health care
- Fragmented information
- Spread across many systems
- No one has the complete picture

We need to put the picture back together again
- Better-informed decisions
- Reduce systematic friction
- Understand and improve the health of populations

Chart Search
- Information extraction
- Semantic markup of documents
- Related concepts in search results

Medical Alerts
- Detect health risks in incoming data
- Notify clinicians to address those risks
- Quickly include new knowledge

Population Health
- Securely bring together health data
- Identify opportunities to improve care
- Support application of improvements
- Close the loop

The Unreasonable Effectiveness of Data
Simple models with lots of data almost always outperform complex models with less data
Peter Norvig, http://www.youtube.com/watch?v=yvDCzhbjYWs

So how can we tackle such large data sets?

Can we adapt what has worked historically?

After all, Relational Databases are Awesome
- Atomic, transactional updates
- Declarative queries
- Guaranteed consistency
- Easy to reason about
- Long track record of success
…so use them!
But…

Those advantages have a cost
Global, atomic, consistent state means global coordination
Coordination does not scale linearly

The costs of coordination
Remember the network effect?
2 nodes = 1 channel
5 nodes = 10 channels
12 nodes = 66 channels
25 nodes = 300 channels
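The channel counts above match a fully connected network, where every pair of nodes needs its own channel, which gives n(n-1)/2 channels for n nodes. A minimal Python sketch, assuming that pairwise model (the function name `channels` is illustrative and not from the talk):

```python
def channels(nodes: int) -> int:
    """Number of pairwise channels in a fully connected group of nodes."""
    return nodes * (nodes - 1) // 2


# Reproduce the node counts listed on the slide.
for n in (2, 5, 12, 25):
    print(f"{n} nodes = {channels(n)} channels")
```

The count grows roughly with the square of the node count, which is why coordination cost outpaces the capacity added by each new node.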

The result is we don’t scale linearly as we add nodes

Independence → Parallelizable
Parallelizable → Scalable

“Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part

The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.

Amdahl’s Law
S = 1 / ((1 - P) + P / N)
S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
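Plugging numbers into Amdahl’s Law, S = 1 / ((1 - P) + P / N), shows how quickly the serial part dominates. The sketch below is illustrative only; the function name `amdahl_speedup` and the sample value P = 0.95 are assumptions, not figures from the talk.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup S for parallelizable ratio p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Assumed example: 95% of the work can run in parallel.
for n in (10, 100, 1000):
    print(n, "processors ->", round(amdahl_speedup(0.95, n), 2), "x")
```

With 5% of the work serial, the speedup can never reach 20x no matter how many processors are added, which is why the serial part has to be kept as small as possible.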

MapReduce Primer
[Diagram: Input Data is divided into Split 1, Split 2, Split 3, ..., Split N. Map Phase: Mapper 1, Mapper 2, Mapper 3, ..., Mapper N, one per split. Shuffle. Reduce Phase: Reducer 1, Reducer 2, ..., Reducer N.]
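The flow in this diagram can be imitated in a single Python process. This is a toy sketch of the three phases, not Hadoop's API: `run_mapreduce`, the sensor data, and the two helper functions are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(splits, mapper, reducer, num_reducers=2):
    """Simulate the Map, Shuffle, and Reduce phases in a single process."""
    # Map phase: every split is handled independently of the others.
    mapped = [list(mapper(split)) for split in splits]

    # Shuffle: send each key to exactly one reducer and group its values.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)

    # Reduce phase: each reducer sees all values for the keys it owns.
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            results[key] = reducer(key, values)
    return results


# Hypothetical input: (sensor, reading) pairs spread over three splits.
SPLITS = [
    [("a", 3), ("b", 7)],
    [("a", 9), ("c", 1)],
    [("b", 4)],
]


def identity_mapper(split):
    # Emit each record unchanged as a (key, value) pair.
    yield from split


def max_reducer(key, values):
    # Keep the highest reading seen for this key.
    return max(values)


highest = run_mapreduce(SPLITS, identity_mapper, max_reducer)
print(highest)
```

The only point where data from different splits meets is the shuffle, which is the step that needs the network in a real cluster.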

MapReduce Example: Word Count
[Diagram: Books feed mappers that count words per book (Map Phase). After the Shuffle, reducers sum words in the ranges A-C, D-E, ..., W-Z (Reduce Phase).]
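The word count above might look like this in plain Python. It is a sketch under assumptions: the two sample books are made up, words are assumed to be lowercase a-z after normalization, and the `F-V` range is a stand-in for the ranges the slide elides between D-E and W-Z.

```python
from collections import Counter, defaultdict

# Hypothetical input: each book is one input split.
BOOKS = {
    "book1": "the cat saw the dog",
    "book2": "a dog saw a zebra",
}


def count_words_per_book(text):
    """Map phase: count the words of a single book."""
    return Counter(text.lower().split())


def reducer_for(word):
    """Shuffle: pick a reducer by first letter (assumes lowercase a-z words)."""
    first = word[0]
    if first <= "c":
        return "A-C"
    if first <= "e":
        return "D-E"
    if first >= "w":
        return "W-Z"
    return "F-V"  # assumption: stands in for the ranges the slide elides


def word_count(books):
    # Map: one independent task per book.
    per_book = [count_words_per_book(text) for text in books.values()]

    # Shuffle: route every (word, count) pair to the reducer owning that word.
    inbox = defaultdict(lambda: defaultdict(list))
    for counts in per_book:
        for word, count in counts.items():
            inbox[reducer_for(word)][word].append(count)

    # Reduce: each reducer sums the partial counts for its words.
    return {
        name: {word: sum(counts) for word, counts in words.items()}
        for name, words in inbox.items()
    }


totals = word_count(BOOKS)
print(totals)
```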

The network is a shared resource
Too much data to move to computation
So move computation to data

MapReduce Data Locality
[Diagram: Input Data splits 1 to N feed Mappers 1 to N in the Map Phase, followed by the Shuffle and Reducers 1 to N in the Reduce Phase. A legend symbol denotes a physical machine.]

Data locality only guaranteed in the Map phase
So do as much work as possible there
Some jobs have no reducer at all!
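One way to do more work in the map phase is to pre-aggregate inside each mapper so that fewer records cross the network, similar in spirit to what Hadoop calls a combiner. The sketch below only compares shuffle sizes on made-up data; the function names and the status values are hypothetical.

```python
from collections import Counter

# Hypothetical input: status codes spread over two splits.
SPLITS = [
    ["error", "ok", "ok", "ok"],
    ["ok", "error", "ok"],
]


def map_naive(split):
    """Emit one (status, 1) pair per record; everything goes to the shuffle."""
    return [(status, 1) for status in split]


def map_preaggregated(split):
    """Count locally first, then emit one pair per distinct status."""
    return list(Counter(split).items())


def reduce_counts(pairs):
    """Sum the partial counts for each status."""
    totals = Counter()
    for status, count in pairs:
        totals[status] += count
    return dict(totals)


naive_shuffle = [pair for split in SPLITS for pair in map_naive(split)]
compact_shuffle = [pair for split in SPLITS for pair in map_preaggregated(split)]

print(len(naive_shuffle), "records shuffled without pre-aggregation")
print(len(compact_shuffle), "records shuffled with pre-aggregation")
```

Both variants reduce to the same totals; only the amount of data moved between machines differs.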

MapReduce is a building block

So let’s build higher-level functions

Grouping and Aggregating
[Diagram: Books feed mappers that count words per book (Map Phase). After the Shuffle, reducers sum words in the ranges A-C, D-E, ..., W-Z (Reduce Phase).]

Joins
[Diagram: Data Set 1 and Data Set 2 are each divided into Split 1, Split 2, Split 3. Map Phase: each split is grouped by key. Shuffle. Reduce Phase: Reducer 1, Reducer 2, ..., Reducer N.]

Joins
[Diagram: Persons and Visits are each divided into Split 1, Split 2, Split 3. Map Phase: each split is grouped by person id. Shuffle. Reduce Phase: Reducer 1, Reducer 2, ..., Reducer N.]
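A join of Persons and Visits by person id can be sketched as follows: mappers tag each record with its source, the shuffle groups by person id, and the reducer pairs up the two sides. The records, field layout, and function names are hypothetical, and the sketch performs an inner join as one possible choice.

```python
from collections import defaultdict

# Hypothetical records: (person_id, name) and (person_id, visit_date).
PERSONS = [(1, "Ada"), (2, "Grace")]
VISITS = [(1, "2013-01-05"), (2, "2013-02-01"), (1, "2013-03-10"), (3, "2013-04-01")]


def map_persons(rows):
    # Key by person id and tag the value with its source data set.
    for person_id, name in rows:
        yield person_id, ("person", name)


def map_visits(rows):
    for person_id, date in rows:
        yield person_id, ("visit", date)


def shuffle(*mapped_streams):
    """Group tagged values from both data sets by person id."""
    groups = defaultdict(list)
    for stream in mapped_streams:
        for key, value in stream:
            groups[key].append(value)
    return groups


def reduce_join(person_id, tagged_values):
    """Pair every person record with every visit sharing the same id."""
    names = [value for tag, value in tagged_values if tag == "person"]
    dates = [value for tag, value in tagged_values if tag == "visit"]
    return [(person_id, name, date) for name in names for date in dates]


groups = shuffle(map_persons(PERSONS), map_visits(VISITS))
joined = sorted(
    row for person_id, tagged in groups.items() for row in reduce_join(person_id, tagged)
)
print(joined)
```

The visit for person id 3 has no matching person record, so the inner join drops it.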

Map-Side Joins
[Diagram: Data Set 1 is divided into Split 1, Split 2, Split 3. Map Phase: Mapper 1, Mapper 2, and Mapper 3 each read one split, and each is paired with Data set 2. Shuffle. Reduce Phase: Reducer 1, Reducer 2, ...]
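A map-side join can be sketched by handing every mapper its own copy of the second data set, so matching happens without waiting for the shuffle. This assumes the second data set is small enough to hold in memory on each mapper; the department lookup, visit records, and function name are hypothetical.

```python
# Assumption: data set 2 is small enough to copy to every mapper.
DEPARTMENTS = {10: "Cardiology", 20: "Oncology"}

# Hypothetical data set 1, divided into splits of (visit_id, department_id).
VISIT_SPLITS = [
    [("v1", 10), ("v2", 20)],
    [("v3", 10), ("v4", 99)],
]


def map_join(split, lookup):
    """Join each record against the in-memory lookup inside the mapper."""
    for visit_id, department_id in split:
        if department_id in lookup:
            yield visit_id, lookup[department_id]


# Each split is joined independently; no grouping by key is needed.
joined = [row for split in VISIT_SPLITS for row in map_join(split, DEPARTMENTS)]
print(joined)
```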

Filtering
Map or reduce functions can simply discard data we’re not interested in
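Filtering in a map function can be as small as the sketch below, which also happens to be a job that needs no reducer. The blood pressure readings, the threshold of 140, and the function names are hypothetical examples.

```python
# Hypothetical input: (patient_id, systolic_pressure) readings in two splits.
SPLITS = [
    [("p1", 118), ("p2", 150)],
    [("p3", 165), ("p4", 121)],
]


def is_elevated(reading):
    """Hypothetical predicate: keep readings at or above 140."""
    _, systolic = reading
    return systolic >= 140


def filter_mapper(split, predicate):
    """Emit only the records we are interested in; discard the rest."""
    for record in split:
        if predicate(record):
            yield record


# Map-only job: each split is filtered independently and written out as is.
kept = [record for split in SPLITS for record in filter_mapper(split, is_elevated)]
print(kept)
```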

And Others
- Distinct
- Sort
- Binning
- Top N
- ...
More sophisticated patterns composable
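As one example from this list, Top N can be split so that each mapper keeps only its local top N and a single reducer merges those short candidate lists. A minimal sketch with made-up numbers and hypothetical function names:

```python
import heapq

# Hypothetical input: scores spread over three splits.
SPLITS = [[5, 42, 7], [19, 3, 88], [61, 2]]


def map_top_n(split, n):
    """Each mapper keeps only its n largest values."""
    return heapq.nlargest(n, split)


def reduce_top_n(candidate_lists, n):
    """The reducer merges the per-mapper candidates into the global top n."""
    merged = (value for candidates in candidate_lists for value in candidates)
    return heapq.nlargest(n, merged)


top_two = reduce_top_n([map_top_n(split, 2) for split in SPLITS], 2)
print(top_two)
```

Each mapper forwards at most N values, so the reducer's input stays small regardless of how large the splits are.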

Chain Jobs Together
Large-scale joins must have a reduce phase
Multiple joins or group-by operations mean multiple jobs
Normalize Data → Join Related Items → Compute Summary → Output
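The chain of normalize, join, and summarize steps can be pictured as functions feeding one another, where each function stands for a separate job. This is a single-process sketch with hypothetical data and function names; in a real cluster the join and the summary would each need their own shuffle.

```python
from collections import Counter

# Hypothetical raw input with inconsistent person ids.
RAW_VISITS = [(" P1", "checkup"), ("p2 ", "lab"), ("P1 ", "lab"), ("p9", "lab")]
PERSON_REGION = {"p1": "north", "p2": "south"}


def normalize(raw_visits):
    """Job 1: clean up person ids (no reducer needed)."""
    return [(person_id.strip().lower(), kind) for person_id, kind in raw_visits]


def join(visits, person_region):
    """Job 2: attach each person's region (a join by person id)."""
    return [
        (person_region[person_id], kind)
        for person_id, kind in visits
        if person_id in person_region
    ]


def summarize(joined):
    """Job 3: count visits per region (a group-by)."""
    return dict(Counter(region for region, _ in joined))


# The output of each job is the input of the next.
summary = summarize(join(normalize(RAW_VISITS), PERSON_REGION))
print(summary)
```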

Codified in High-Level Libraries
Hive, Pig, Cascading, and Crunch provide simple means to use these patterns
Apache Crunch
The era of writing MapReduce by hand is over

How do we use these tools?

Start with the question you want to ask, then transform the data to answer it.

output = transform(input)
Functional over Place-Oriented Programming

Work with data holistically
Re-running functions simpler to reason about than updating state
Hadoop makes this possible at scale
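The difference between re-running a function and updating state can be seen in a few lines. In this sketch, with made-up events and hypothetical function names, `transform` derives its output from the whole input every time, while `update_in_place` mutates shared state and gives a different answer if it is accidentally run twice.

```python
# Hypothetical input: (key, amount) events, kept immutable as a tuple.
EVENTS = (("a", 1), ("b", 2), ("a", 3))


def transform(events):
    """Functional style: build the output from the full input, mutating nothing."""
    totals = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
    return totals


def update_in_place(state, events):
    """Place-oriented style: apply events to existing state."""
    for key, amount in events:
        state[key] = state.get(key, 0) + amount


# Re-running the function is safe: same input, same output.
first_run = transform(EVENTS)
second_run = transform(EVENTS)

# Re-applying the update is not: a retry double-counts every event.
state = {}
update_in_place(state, EVENTS)
update_in_place(state, EVENTS)

print(first_run, second_run, state)
```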

Don’t be afraid to re-process the world
“Something’s wrong, we’re above 95% usage!” - Traditional System Administrator
“Something’s wrong, we’re below 95% usage!” - Hadoop System Administrator

Maximize Resource Usage

From Databases to Dataspaces
(Also referred to as Data Lakes)
Franklin, Halevy, Maier, http://homes.cs.washington.edu/~alon/files/dataspacesDec05.pdf

Bring all of your data together...
...structured or unstructured...
...transform it with unlimited computation...
...at any time for any new need.
Franklin, Halevy, Maier

And offer a variety of interactive access patterns.
SQL, Search, Domain-Specific Apps

Hadoop is becoming an adaptive, multi-purpose platform.

The gap between asking novel questions and our ability to answer them is closing.

Questions?
@ryanbrush
https://engineering.cerner.com
We’re hiring!
