Noam Shaish

Spark Streaming
Scale · Fault tolerance · High throughput
Agenda
❖ Overview
❖ Architecture
❖ Fault-tolerance
❖ Why Spark Streaming? We have Storm
❖ Demo
Overview
❖ Spark Streaming is an extension of the core Spark API. It enables scalable,
  high-throughput, fault-tolerant stream processing of live data streams.
❖ Connections are provided for most common data sources, such as Kafka,
  Flume, Twitter, ZeroMQ, Kinesis, TCP sockets, etc.
❖ Spark Streaming differs from most online processing solutions by adopting
  a mini-batch approach instead of a record-at-a-time data stream.
❖ Based on the Discretized Streams paper:
  Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing.
  Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica.
  Berkeley EECS (2012-12-14).
  www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Overview
Spark Streaming runs a streaming computation as a series of very small,
deterministic batch jobs:

  live data stream -> [Spark Streaming] -> batches of X milliseconds -> [Spark] -> processed results

❖ Chops the live stream into batches of X milliseconds
❖ Spark treats each batch of data as an RDD
❖ Processed results of the RDD operations are returned in batches
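
To make the model concrete, here is a minimal sketch of a complete streaming
program (the TCP text source on localhost:9999 is an assumption for
illustration; the StreamingContext calls are the standard API):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("minimal-streaming")
    // every 1-second slice of input becomes one RDD, processed as one small batch job
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()   // one small Spark job per batch
    ssc.start()
    ssc.awaitTermination()
  }
}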
DStream, not just RDD

Transformations
• map()
• flatMap()
• filter()
• count()
• repartition()
• union()
• reduce()
• countByValue()
• reduceByKey()
• join()
• cogroup()
• transform()
• updateStateByKey()

Output Operations
• print()
• foreachRDD()
• saveAsObjectFiles()
• saveAsTextFiles()
• saveAsHadoopFiles()
• saveToCassandra() *

Window Operations
• window()
• countByWindow()
• reduceByWindow()
• reduceByKeyAndWindow()
• countByValueAndWindow()

* from the Datastax Cassandra connector
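
Of these, updateStateByKey() is the stateful one: it keeps running state
across batches. A sketch, assuming a DStream of pairs named `pairs:
DStream[(String, Long)]` (stateful operations also require a checkpoint
directory to be configured):

// running total per key across all batches seen so far;
// requires ssc.checkpoint(...) to have been set
val runningCounts = pairs.updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum)
}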
Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

The Twitter Streaming API feeds the tweets DStream: batch @ t, batch @ t+1,
batch @ t+2, batch @ t+3, ... Each batch is stored in memory as an RDD
(immutable, distributed).
Example 1 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

flatMap runs on every batch of the tweets DStream, producing new RDDs for
each batch; together these form a new DStream, hashTags:
[#hobbitch, #bilboleggins, ...]
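
The getTags helper is not defined on the slides; a plausible sketch against
twitter4j's Status type (this definition is an assumption, not part of the
deck):

import twitter4j.Status

// hypothetical helper: pull the hash tag texts out of a tweet
def getTags(status: Status): Seq[String] =
  status.getHashtagEntities.map(entity => "#" + entity.getText).toSeq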
Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveToCassandra("keyspace", "tableName")

Every batch of the hashTags DStream is saved to Cassandra.
Example 2 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()

tweets DStream -> flatMap -> hashTags -> map -> reduceByKey -> tagCounts:
[(#hobbitch, 10), (#bilboleggins, 34), ...]
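
As the map/reduceByKey arrows suggest, countByValue() on a DStream is
effectively shorthand for a map into (value, 1) pairs followed by
reduceByKey(); an equivalent formulation:

// equivalent to hashTags.countByValue(), evaluated per batch
val tagCounts = hashTags.map(tag => (tag, 1L)).reduceByKey(_ + _)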
Example 3 - Count the hash tags over the last 10 minutes

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

window() is the sliding window operation; Minutes(10) is the window length,
and Seconds(1) is the sliding interval.
Example 3 - Count the hash tags over the last 10 minutes

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Diagram: batches t-1 through t+3; the sliding window spans several hashTags
batches, and the count runs over all data in the window.]
Example 4 - Count hash tags over the last 10 minutes smartly

val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Diagram: as the window slides from t to t+1, the count of the new batch
entering the window is added (+) and the count of the batch leaving the
window is subtracted (-), instead of recounting the whole window.]

A generalization of this smart window reduce exists (see the sketch below):
reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
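
A sketch of that incremental form over (tag, count) pairs; the inverse
function removes the contribution of the batch that slides out of the window,
and this variant requires checkpointing to be enabled:

val tagCounts = hashTags
  .map(tag => (tag, 1L))
  .reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,   // fold in counts entering the window
    (a: Long, b: Long) => a - b,   // remove counts leaving the window
    Minutes(10),                   // window length
    Seconds(1)                     // sliding interval
  )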
Architecture
❖ Receivers divide input data into mini batches
❖ The batch size is defined in milliseconds (best practice: greater
  than 500 milliseconds)

  input streams -> receivers (Spark Streaming) -> batches of input RDDs
    -> Spark Engine -> batches of output RDDs
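
The batch interval is fixed once, when the StreamingContext is created;
following the 500 ms guidance above (a sketch, assuming an existing SparkConf
named conf):

import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// intervals much below ~500 ms tend to let per-batch scheduling overhead dominate
val ssc = new StreamingContext(conf, Milliseconds(500))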
Fault-tolerance
❖ Input RDDs are not generated from a fault-tolerant source
❖ Data is therefore replicated among worker nodes
  (default replication factor of 2)
❖ In stateful jobs, checkpoints should be used
❖ Journaling, as in a database (a write-ahead log), can be activated

[Diagram: tweets RDD -> flatMap -> hashTags RDD; input data is replicated
in memory, and lost partitions are recomputed on other workers.]
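
A sketch of turning on both mechanisms mentioned above (the configuration key
and checkpoint call are the standard Spark Streaming ones; the HDFS path is
illustrative):

val conf = new SparkConf()
  .setAppName("fault-tolerant-app")
  // "journaling": write-ahead log for data arriving at receivers
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
// needed by stateful operations and for driver recovery
ssc.checkpoint("hdfs:///checkpoints/my-app")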
Fault-tolerance
❖ Two kinds of data to recover in the event of failure:
• Data received and replicated -
  this data survives failure of a single worker node, since a copy of it
  exists on one of the other nodes.
• Data received but buffered for replication -
  as this is not replicated, the only way to recover that data is to get
  it from the source again.
Fault-tolerance
❖ Two receiver semantics:
• Reliable receiver -
  acknowledges only after received data is replicated. If the receiver fails,
  buffered data does not get acknowledged to the source. When the receiver
  is restarted, the source resends the data, and therefore no data is lost
  due to the failure.
• Unreliable receiver -
  such receivers can lose data when they fail due to worker or driver
  failures.
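
Schematically, a reliable receiver built on Spark's Receiver API acknowledges
the source only after store() returns, i.e. after Spark has stored and
replicated the block. In the sketch below, fetchBatch and ack are hypothetical
stand-ins for a real source client:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class AckingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("acking-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val records = fetchBatch()   // hypothetical: pull a chunk from the source
          store(records.iterator)      // blocks until Spark has stored/replicated it
          ack(records)                 // only now is it safe to acknowledge
        }
      }
    }.start()
  }

  def onStop(): Unit = {}

  // hypothetical source client calls, not a real API
  private def fetchBatch(): Seq[String] = ???
  private def ack(records: Seq[String]): Unit = ???
}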
Fault-tolerance

Deployment scenario     | Receiver failure                         | Driver failure
------------------------+------------------------------------------+------------------------------------------
Without write-ahead log | Buffered data lost with unreliable       | Buffered data lost with unreliable
                        | receivers.                               | receivers.
                        | Zero data loss with reliable receivers   | Past data lost with all receivers.
                        | and files.                               | Zero data loss with files.
------------------------+------------------------------------------+------------------------------------------
With write-ahead log    | Zero data loss with receivers and files  | Zero data loss with receivers and files
Why Spark Streaming? We have Storm
One model to rule them all
❖ Same model for offline AND online processing
❖ Common code base for offline AND online processing
❖ Fewer bugs, since code is not duplicated
❖ Fewer bugs caused by differences between frameworks
❖ Increased developer productivity
One stack to rule them all
❖ Explore data interactively using the Spark shell to identify the problem
❖ Use the same code in standalone Spark to identify the problem in the
  production environment
❖ Use similar code in Spark Streaming to monitor the problem online

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> va

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Performance
❖ Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node

[Charts: throughput per node (MB/s) versus record size (100 and 1000 bytes)
for Grep and WordCount, Spark vs. Storm.]

Tested with 100 EC2 instances with 4 cores each.
Comparison taken from Tathagata Das and Reynold Xin's Hadoop Summit 2013 presentation.
Community
Monitoring
In addition, the StreamingListener interface provides further information at
various levels (application, job, task, etc.).
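
A minimal sketch of plugging a listener in (StreamingListener and its batch
events live in org.apache.spark.streaming.scheduler):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class LogBatchListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // scheduling delay growing over time is the classic sign of falling behind
    println(s"batch ${info.batchTime}: scheduling ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new LogBatchListener)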
  	
  
Language
[Comparison chart.]
Utilization
❖ Spark 1.2 introduces dynamic cluster resource allocation
❖ Jobs can request more resources and release resources they no longer need
❖ Available only on YARN
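
A sketch of the relevant configuration (standard dynamic-allocation keys; the
executor bounds are illustrative):

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")    // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.shuffle.service.enabled", "true")        // external shuffle service is required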
Demo
https://github.com/NoamShaish/spark-streaming-workshop.git
