Noam Shaish

Spark Streaming
Scale · Fault tolerance · High throughput
Agenda
❖ Overview
❖ Architecture
❖ Fault-tolerance
❖ Why Spark Streaming? We have Storm
❖ Demo
Overview
❖ Spark Streaming is an extension of the core Spark API. It enables scalable,
  high-throughput, fault-tolerant stream processing of live data streams.
❖ Connections are provided for most common data sources, such as Kafka,
  Flume, Twitter, ZeroMQ, Kinesis, TCP sockets, etc.
❖ Spark Streaming differs from most online processing solutions by adopting
  a mini-batch approach instead of a record-at-a-time data stream.
❖ Based on the Discretized Streams paper:
  Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing.
  Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica.
  Berkeley EECS (2012-12-14).
  www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Overview
Spark Streaming runs a streaming computation as a series of very small,
deterministic batch jobs:

  live data stream -> [Spark Streaming] -> batches of X milliseconds -> [Spark] -> processed results

❖ Chops the live stream into batches of X milliseconds
❖ Spark treats each batch of data as an RDD
❖ Processed results of the RDD operations are returned in batches
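
To make the model concrete, here is a minimal sketch of a complete streaming
program (the TCP text source on localhost:9999 is an assumption for
illustration; the StreamingContext calls are the standard API):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("minimal-streaming")
    // every 1-second slice of input becomes one RDD, processed as one small batch job
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()   // one small Spark job per batch
    ssc.start()
    ssc.awaitTermination()
  }
}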
DStream, not just RDD

Transformations
• map()
• flatMap()
• filter()
• count()
• repartition()
• union()
• reduce()
• countByValue()
• reduceByKey()
• join()
• cogroup()
• transform()
• updateStateByKey()

Output Operations
• print()
• foreachRDD()
• saveAsObjectFiles()
• saveAsTextFiles()
• saveAsHadoopFiles()
• saveToCassandra() *

Window Operations
• window()
• countByWindow()
• reduceByWindow()
• reduceByKeyAndWindow()
• countByValueAndWindow()

* from the Datastax Cassandra connector
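
Of these, updateStateByKey() is the stateful one: it keeps running state
across batches. A sketch, assuming a DStream of pairs named `pairs:
DStream[(String, Long)]` (stateful operations also require a checkpoint
directory to be configured):

// running total per key across all batches seen so far;
// requires ssc.checkpoint(...) to have been set
val runningCounts = pairs.updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum)
}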
Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

The Twitter Streaming API feeds the tweets DStream: batch @ t, batch @ t+1,
batch @ t+2, batch @ t+3, ... Each batch is stored in memory as an RDD
(immutable, distributed).
Example 1 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

flatMap runs on every batch of the tweets DStream, producing new RDDs for
each batch; together these form a new DStream, hashTags:
[#hobbitch, #bilboleggins, ...]
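
The getTags helper is not defined on the slides; a plausible sketch against
twitter4j's Status type (this definition is an assumption, not part of the
deck):

import twitter4j.Status

// hypothetical helper: pull the hash tag texts out of a tweet
def getTags(status: Status): Seq[String] =
  status.getHashtagEntities.map(entity => "#" + entity.getText).toSeq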
Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveToCassandra("keyspace", "tableName")

Every batch of the hashTags DStream is saved to Cassandra.
Example 2 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()

tweets DStream -> flatMap -> hashTags -> map -> reduceByKey -> tagCounts:
[(#hobbitch, 10), (#bilboleggins, 34), ...]
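
As the map/reduceByKey arrows suggest, countByValue() on a DStream is
effectively shorthand for a map into (value, 1) pairs followed by
reduceByKey(); an equivalent formulation:

// equivalent to hashTags.countByValue(), evaluated per batch
val tagCounts = hashTags.map(tag => (tag, 1L)).reduceByKey(_ + _)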
Example 3 - Count the hash tags over the last 10 minutes

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

window() is the sliding window operation; Minutes(10) is the window length,
and Seconds(1) is the sliding interval.
Example 3 - Count the hash tags over the last 10 minutes

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Diagram: batches t-1 through t+3; the sliding window spans several hashTags
batches, and the count runs over all data in the window.]
Example 4 - Count hash tags over the last 10 minutes smartly

val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Diagram: as the window slides from t to t+1, the count of the new batch
entering the window is added (+) and the count of the batch leaving the
window is subtracted (-), instead of recounting the whole window.]

A generalization of this smart window reduce exists (see the sketch below):
reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
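
A sketch of that incremental form over (tag, count) pairs; the inverse
function removes the contribution of the batch that slides out of the window,
and this variant requires checkpointing to be enabled:

val tagCounts = hashTags
  .map(tag => (tag, 1L))
  .reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,   // fold in counts entering the window
    (a: Long, b: Long) => a - b,   // remove counts leaving the window
    Minutes(10),                   // window length
    Seconds(1)                     // sliding interval
  )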
Architecture
❖ Receivers divide input data into mini batches
❖ The batch size is defined in milliseconds (best practice: greater
  than 500 milliseconds)

  input streams -> receivers (Spark Streaming) -> batches of input RDDs
    -> Spark Engine -> batches of output RDDs
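
The batch interval is fixed once, when the StreamingContext is created;
following the 500 ms guidance above (a sketch, assuming an existing SparkConf
named conf):

import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// intervals much below ~500 ms tend to let per-batch scheduling overhead dominate
val ssc = new StreamingContext(conf, Milliseconds(500))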
Fault-tolerance
❖ Input RDDs are not generated from a fault-tolerant source
❖ Data is therefore replicated among worker nodes
  (default replication factor of 2)
❖ In stateful jobs, checkpoints should be used
❖ Journaling, as in a database (a write-ahead log), can be activated

[Diagram: tweets RDD -> flatMap -> hashTags RDD; input data is replicated
in memory, and lost partitions are recomputed on other workers.]
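
A sketch of turning on both mechanisms mentioned above (the configuration key
and checkpoint call are the standard Spark Streaming ones; the HDFS path is
illustrative):

val conf = new SparkConf()
  .setAppName("fault-tolerant-app")
  // "journaling": write-ahead log for data arriving at receivers
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
// needed by stateful operations and for driver recovery
ssc.checkpoint("hdfs:///checkpoints/my-app")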
Fault-tolerance
❖ Two kinds of data to recover in the event of failure:
• Data received and replicated -
  this data survives failure of a single worker node, since a copy of it
  exists on one of the other nodes.
• Data received but buffered for replication -
  as this is not replicated, the only way to recover that data is to get
  it from the source again.
Fault-tolerance
❖ Two receiver semantics:
• Reliable receiver -
  acknowledges only after received data is replicated. If the receiver fails,
  buffered data does not get acknowledged to the source. When the receiver
  is restarted, the source resends the data, and therefore no data is lost
  due to the failure.
• Unreliable receiver -
  such receivers can lose data when they fail due to worker or driver
  failures.
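
Schematically, a reliable receiver built on Spark's Receiver API acknowledges
the source only after store() returns, i.e. after Spark has stored and
replicated the block. In the sketch below, fetchBatch and ack are hypothetical
stand-ins for a real source client:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class AckingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("acking-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val records = fetchBatch()   // hypothetical: pull a chunk from the source
          store(records.iterator)      // blocks until Spark has stored/replicated it
          ack(records)                 // only now is it safe to acknowledge
        }
      }
    }.start()
  }

  def onStop(): Unit = {}

  // hypothetical source client calls, not a real API
  private def fetchBatch(): Seq[String] = ???
  private def ack(records: Seq[String]): Unit = ???
}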
Fault-tolerance

Deployment scenario     | Receiver failure                         | Driver failure
------------------------+------------------------------------------+------------------------------------------
Without write-ahead log | Buffered data lost with unreliable       | Buffered data lost with unreliable
                        | receivers.                               | receivers.
                        | Zero data loss with reliable receivers   | Past data lost with all receivers.
                        | and files.                               | Zero data loss with files.
------------------------+------------------------------------------+------------------------------------------
With write-ahead log    | Zero data loss with receivers and files  | Zero data loss with receivers and files
Why Spark Streaming? We have Storm
One model to rule them all
❖ Same model for offline AND online processing
❖ Common code base for offline AND online processing
❖ Fewer bugs, since code is not duplicated
❖ Fewer bugs caused by differences between frameworks
❖ Increased developer productivity
One stack to rule them all
❖ Explore data interactively using the Spark shell to identify the problem
❖ Use the same code in standalone Spark to identify the problem in the
  production environment
❖ Use similar code in Spark Streaming to monitor the problem online

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> va

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Performance
❖ Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node

[Charts: throughput per node (MB/s) versus record size (100 and 1000 bytes)
for Grep and WordCount, Spark vs. Storm.]

Tested with 100 EC2 instances with 4 cores each.
Comparison taken from Tathagata Das and Reynold Xin's Hadoop Summit 2013 presentation.
Community
Monitoring
In addition, the StreamingListener interface provides further information at
various levels (application, job, task, etc.).
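
A minimal sketch of plugging a listener in (StreamingListener and its batch
events live in org.apache.spark.streaming.scheduler):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class LogBatchListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // scheduling delay growing over time is the classic sign of falling behind
    println(s"batch ${info.batchTime}: scheduling ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new LogBatchListener)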
  	
  
Language
[Comparison chart.]
Utilization
❖ Spark 1.2 introduces dynamic cluster resource allocation
❖ Jobs can request more resources and release resources they no longer need
❖ Available only on YARN
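
A sketch of the relevant configuration (standard dynamic-allocation keys; the
executor bounds are illustrative):

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")    // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.shuffle.service.enabled", "true")        // external shuffle service is required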
Demo
https://github.com/NoamShaish/spark-streaming-workshop.git
