SlideShare a Scribd company logo
Apache Storm and Spark
Streaming Compared
P. Taylor Goetz, Hortonworks
@ptgoetz
Honestly...
• I know a lot more about Apache Storm than I do
Apache Spark Streaming.
• I've been involved with Apache Storm, in one
way or another, since it was open-sourced.
• I'm admittedly biased.
But...
• A number of articles/papers comparing Apache
Storm and Spark Streaming are inaccurate in
terms of Storm’s features and performance
characteristics.
• Code and configuration for those studies is not
available, so independent verification is
impossible.
• Claims don't match real-world observations.
But...
• There is an inherent “Home Team Advantage” in
any benchmark comparison.
• Without open source code, any benchmark
claims are essentially marketing fluff, and should
be taken with a grain or two of NaCl.
• Any benchmark claim should be independently
verifiable.
Spark Streaming Paper
• Compares Spark Streaming (Micro-Batch) to
Core Storm (One-at-a-Time)
• A more appropriate comparison would have
been with Storm’s Trident (Micro-Batch) API
• Trident mentioned only in passing (on pages 3
and 12)
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Spark Streaming Paper
• Benchmark code/configuration not publicly
available
• Performance claims not independently verifiable
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Spark Streaming Paper
• Granted, the Spark Streaming paper is almost 2
years old and written at a time when Trident was
relatively new.
• However, that paper is often cited when
comparing Apache Storm and Spark Streaming,
particularly in terms of performance.
• A lot can change in 2 years.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Streaming and batch
processing are
fundamentally different.
Batch vs. Streaming
• Storm is a stream processing framework that
also does micro-batching (Trident).

• Spark is a batch processing framework that also
does micro-batching (Spark Streaming).
Batch vs. Streaming
Batch Streaming
Batch vs. Streaming
Batch Streaming
Micro-Batch
Apache Storm: Two
Streaming APIs
Core Storm (Spouts and Bolts)!
• One at a Time
• Lower Latency
• Operates on Tuple Streams
Trident (Streams and Operations)!
• Micro-Batch
• Higher Throughput
• Operates on Streams of Tuple Batches and Partitions
Language Options
Core Storm Storm Trident Spark Streaming
• Java
• Clojure
• Scala
• Python
• Ruby
• others*
• Java
• Clojure
• Scala
• Java
• Scala
• Python
*Storm’s Multi-Lang feature allows the use of virtually any programming language.
Reliability Models
Core Storm Storm Trident
Spark
Streaming
At Most Once Yes Yes No
At Least Once Yes Yes No*
Exactly Once No Yes Yes*
*In some node failure scenarios, Spark Streaming
falls back to at-least-once processing or data loss.
Programing Model
Core Storm Storm Trident Spark Streaming
Stream Primitive Tuple
Tuple, Tuple
Batch, Partition
DStream
Stream Source Spouts
Spouts, Trident
Spouts
HDFS, Network
Computation/
Transformation
Bolts
Filters,
Functions,
Aggregations,
Joins
Transformation,
Window
Operations
Stateful
Operations
No
(roll your own)
Yes Yes
Output/
Persistence
Bolts State, MapState foreachRDD
Production Deployments
Apache Storm Spark Streaming
• Too many to list



http://
storm.incubator.apache.org/
documentation/Powered-
By.html
• Sharethrough



http://
engineering.sharethrough.com/blog/
2014/06/27/sharethrough-at-spark-
summit-2014-spark-streaming-for-
realtime-auctions/
Support
Apache Storm Spark
Spark
Streaming
Hadoop Distro
Hortonworks,
MapR
Cloudera,
MapR,
Hortonworks
(preview)
Hortonworks,
Cloudera,
MapR
Resource
Management
YARN, Mesos YARN, Mesos YARN*, Mesos
Provisioning/
Monitoring
Apache
Ambari
Cloudera
Manager
?
*With issues: http://spark-summit.org/wp-content/uploads/2014/07/
Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
Failure Scenarios
Worker Failure:
Spark Streaming
"So if a worker node fails, then the system can recompute
the lost from the the left over copy of the input data.
However, if the worker node where a network receiver was
running fails, then a tiny bit of data may be lost, that is, the
data received by the system but not yet replicated to other
node(s)."
Only HDFS-backed data sources are fully fault tolerant.
https://spark.apache.org/docs/latest/streaming-programming-
guide.html#fault-tolerance-properties
Worker Failure:
Spark Streaming
https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-
zero-data-loss-in-spark-streaming.html
Solution?: Write Ahead Logs (SPARK-3129)
• Enabling WAL requires DFS (HDFS, S3) — no such
requirement with Storm
• Incurs a performance penalty that adds to overall latency
• Full fault tolerance still requires a data source that can
replay data (e.g. Kafka)!
• Architectural band aid?
Worker Failure:
Apache Storm
• If a supervisor node fails, Nimbus will reassign that node's
tasks to other nodes in the cluster.
• Any tuples sent to a failed node will time out and be
replayed (In Trident, any batches will be replayed).
• Delivery guarantees dependent on a reliable data source.
Data Source Reliability
• A data source is considered unreliable if there is no means
to replay a previously-received message.
• A data source is considered reliable if it can somehow replay
a message if processing fails at any point.
• A data source is considered durable if it can replay any
message or set of messages given the necessary selection
criteria.
!
(These are my terms.)
Reliability Limitations:
Apache Storm
• Exactly once processing requires a durable data source.
• At least once processing requires a reliable data source.
• An unreliable data source can be wrapped to provide
additional guarantees.
• With durable and reliable sources, Storm will not drop data.
• Common pattern: Back unreliable data sources with
Apache Kafka (minor latency hit traded for 100% durability).
Apache Storm Spouts
Durable!
Kafka











Reliable!
JMS
RabbitMQ /
AMQP
Kestrel
Amazon SQS
Amazon Kinesis
Unreliable!
Twitter
Scribe
MongoDB
Apache Storm Output
(Bolts, Trident State)
• Cassandra
• HBase
• HDFS
• Kafka
• Redis
• Memcached
• R
• JMS
• MongoDB
• RDBMS
Apache Storm + Kafka
Apache Kafka is an ideal source for Storm topologies. It
provides everything necessary for:
• At most once processing
• At least once processing
• Exactly once processing
Apache Storm includes Kafka spout implementations for all
levels of reliability.
Kafka Supports a wide variety of languages and integration
points for both producers and consumers.
Reliability Limitations:
Spark Streaming
• Fault tolerance and reliability guarantees require
HDFS-backed data source.
• Moving data to HDFS prior to stream processing
introduces additional latency.
• Network data sources (Kafka, etc.) are
vulnerable to data loss in the event of a worker
node failure.
https://spark.apache.org/docs/latest/streaming-programming-
guide.html#fault-tolerance-properties
Performance
“The main reason cited by Tathagata for Spark's
performance gain over Storm is the aggregation of
small records that occurs through the mechanics
of RDDs.”
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
In other words: Micro-Batching
Performance
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
Storm capped at 10k msgs/sec/node?
Spark Streaming 40x faster than Storm?
Others may disagree…
https://twitter.com/
nathanmarz/status/
207989068519317505
http://www.slideshare.net/
JamesSirota/cisco-opensoc
Netty Transport
• Introduced in Apache Storm
0.9.0
• Faster, pure Java alternative
for 0MQ
• Yahoo! Engineering
announcement:

http://yahooeng.tumblr.com/post/
64758709722/making-storm-fly-
with-netty
• Performance Test Code:

https://github.com/yahoo/storm-
perf-test
Netty
0mq
STORM-297
• Introduced in Apache Storm
0.9.2-incubating
• Big performance boost,
especially for small messages
• JIRA Discussion:

https://issues.apache.org/jira/
browse/STORM-297
• Performance Test Code:

https://github.com/yahoo/storm-
perf-test
Benchmarking Storm
• 5 nodes on AWS (m1.large - not very powerful)
• 1 ZooKeeper, 1 Nimbus, 3 Supervisors
• Storm Core API and Trident API benchmarks
• Is Trident API slower than Core API?
https://github.com/ptgoetz/storm-benchmark
Is Trident API slower than
Core API?
• On low-power hardware with 3 supervisor nodes…
• Core API:
~150k msg./sec. with ~80 ms. latency
• Trident API:
~300k msg./sec. with ~250 ms. latency
• Higher throughput possible with increased latency
• Better performance with bigger hardware
Is Spark + Spark Streaming a
"Lambda Architecture in a Box?"
• No!
• Lambda is a lot more than batch + streaming.
• Lambda is powerful when applied correctly, but is
not right for every use case.
• Spark and Spark Streaming have overlapping
programming models for batch and micro-batch.
• The rest is up to you (as it is with Storm).
Final Thoughts
In general (not specific to Spark Streaming):!
• Beware any claim that A is X times faster than B.
• Performance is a matter of proper tuning for the
use case at hand.
• Any system can be hobbled to look bad in a
benchmark.
Recommendation
• It is up to you, and your specific use case.
• Consider fault tolerance. Is data loss
acceptable?
• Consider all facets and make informed
decisions.
• Rely on your own benchmarks
Questions?
Thank you!

More Related Content

PPTX
Delta lake and the delta architecture
PPTX
Voldemort
PPTX
Apache Helix presentation at SOCC 2012
PDF
SRv6 Network Programming: deployment use-cases
KEY
NoSQL Databases: Why, what and when
PPTX
Apache Kafka
PDF
Architectural Overview of MapR's Apache Hadoop Distribution
PDF
Cassandra overview
Delta lake and the delta architecture
Voldemort
Apache Helix presentation at SOCC 2012
SRv6 Network Programming: deployment use-cases
NoSQL Databases: Why, what and when
Apache Kafka
Architectural Overview of MapR's Apache Hadoop Distribution
Cassandra overview

What's hot (20)

PPTX
Apache Kafka
PPTX
Introduction to Apache Kafka
PPTX
YARN High Availability
PDF
PPTX
The Impala Cookbook
PDF
Google Spanner
PPTX
MariaDB Galera Cluster
PDF
Can Apache Kafka Replace a Database?
PPTX
Kafka at Peak Performance
PDF
Building an open data platform with apache iceberg
PDF
A Thorough Comparison of Delta Lake, Iceberg and Hudi
PDF
Spark SQL
PDF
Service Function Chaining with SRv6
PDF
Iceberg + Alluxio for Fast Data Analytics
PPTX
Kafka presentation
PDF
Apache Hudi: The Path Forward
PDF
HDFS Overview
PDF
What is in a Lucene index?
PDF
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
PPTX
Rds data lake @ Robinhood
Apache Kafka
Introduction to Apache Kafka
YARN High Availability
The Impala Cookbook
Google Spanner
MariaDB Galera Cluster
Can Apache Kafka Replace a Database?
Kafka at Peak Performance
Building an open data platform with apache iceberg
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Spark SQL
Service Function Chaining with SRv6
Iceberg + Alluxio for Fast Data Analytics
Kafka presentation
Apache Hudi: The Path Forward
HDFS Overview
What is in a Lucene index?
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
Rds data lake @ Robinhood
Ad

Viewers also liked (8)

PPTX
Yahoo compares Storm and Spark
PDF
Hadoop Summit Europe 2014: Apache Storm Architecture
PDF
지금 핫한 Real-time In-memory Stream Processing 이야기
PDF
검색로그시스템 with Python
PDF
파이썬 데이터 분석 3종세트
PDF
Workflow Engines for Hadoop
PPTX
Performance Comparison of Streaming Big Data Platforms
PDF
Realtime Analytics with Storm and Hadoop
Yahoo compares Storm and Spark
Hadoop Summit Europe 2014: Apache Storm Architecture
지금 핫한 Real-time In-memory Stream Processing 이야기
검색로그시스템 with Python
파이썬 데이터 분석 3종세트
Workflow Engines for Hadoop
Performance Comparison of Streaming Big Data Platforms
Realtime Analytics with Storm and Hadoop
Ad

Similar to Apache storm vs. Spark Streaming (20)

PPTX
Spark vs storm
PDF
Apache Storm vs. Spark Streaming - two stream processing platforms compared
PPTX
Stream Computing (The Engineer's Perspective)
PPTX
Comparison of various streaming technologies
PDF
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
PPTX
Past, Present, and Future of Apache Storm
PPTX
Building Streaming Applications with Apache Storm 1.1
PDF
Comparing processing frameworks v7
PDF
Stream processing using Apache Storm - Big Data Meetup Athens 2016
PDF
4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
DOCX
INFO491FinalPaper
PPTX
Streaming and Messaging
PPTX
Your Guide to Streaming - The Engineer's Perspective
PDF
Apache Storm Concepts
PPTX
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
PDF
Distributed real time stream processing- why and how
PDF
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
PDF
Spark streaming State of the Union - Strata San Jose 2015
PPTX
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
PPTX
Next-Gen Decision Making in Under 2ms
Spark vs storm
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Stream Computing (The Engineer's Perspective)
Comparison of various streaming technologies
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Past, Present, and Future of Apache Storm
Building Streaming Applications with Apache Storm 1.1
Comparing processing frameworks v7
Stream processing using Apache Storm - Big Data Meetup Athens 2016
4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
INFO491FinalPaper
Streaming and Messaging
Your Guide to Streaming - The Engineer's Perspective
Apache Storm Concepts
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Distributed real time stream processing- why and how
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Spark streaming State of the Union - Strata San Jose 2015
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Next-Gen Decision Making in Under 2ms

More from P. Taylor Goetz (6)

PPTX
Flux: Apache Storm Frictionless Topology Configuration & Deployment
PPTX
From Device to Data Center to Insights: Architectural Considerations for the ...
PPTX
Large Scale Graph Analytics with JanusGraph
PPTX
The Future of Apache Storm
PDF
Scaling Apache Storm - Strata + Hadoop World 2014
PPTX
Cassandra and Storm at Health Market Sceince
Flux: Apache Storm Frictionless Topology Configuration & Deployment
From Device to Data Center to Insights: Architectural Considerations for the ...
Large Scale Graph Analytics with JanusGraph
The Future of Apache Storm
Scaling Apache Storm - Strata + Hadoop World 2014
Cassandra and Storm at Health Market Sceince

Recently uploaded (20)

PPTX
CHAPTER 12 - CYBER SECURITY AND FUTURE SKILLS (1) (1).pptx
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PDF
AI in Product Development-omnex systems
PPTX
history of c programming in notes for students .pptx
PDF
Digital Strategies for Manufacturing Companies
PDF
How Creative Agencies Leverage Project Management Software.pdf
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PPTX
L1 - Introduction to python Backend.pptx
PPT
Introduction Database Management System for Course Database
DOCX
The Five Best AI Cover Tools in 2025.docx
PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PPTX
VVF-Customer-Presentation2025-Ver1.9.pptx
PPTX
Operating system designcfffgfgggggggvggggggggg
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PPTX
Online Work Permit System for Fast Permit Processing
PDF
System and Network Administraation Chapter 3
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PPTX
ISO 45001 Occupational Health and Safety Management System
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
CHAPTER 12 - CYBER SECURITY AND FUTURE SKILLS (1) (1).pptx
Upgrade and Innovation Strategies for SAP ERP Customers
AI in Product Development-omnex systems
history of c programming in notes for students .pptx
Digital Strategies for Manufacturing Companies
How Creative Agencies Leverage Project Management Software.pdf
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
L1 - Introduction to python Backend.pptx
Introduction Database Management System for Course Database
The Five Best AI Cover Tools in 2025.docx
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
VVF-Customer-Presentation2025-Ver1.9.pptx
Operating system designcfffgfgggggggvggggggggg
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
Online Work Permit System for Fast Permit Processing
System and Network Administraation Chapter 3
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
ISO 45001 Occupational Health and Safety Management System
2025 Textile ERP Trends: SAP, Odoo & Oracle
How to Choose the Right IT Partner for Your Business in Malaysia

Apache storm vs. Spark Streaming

  • 1. Apache Storm and Spark Streaming Compared P. Taylor Goetz, Hortonworks @ptgoetz
  • 2. Honestly... • I know a lot more about Apache Storm than I do Apache Spark Streaming. • I've been involved with Apache Storm, in one way or another, since it was open-sourced. • I'm admittedly biased.
  • 3. But... • A number of articles/papers comparing Apache Storm and Spark Streaming are inaccurate in terms of Storm’s features and performance characteristics. • Code and configuration for those studies is not available, so independent verification is impossible. • Claims don't match real-world observations.
  • 4. But... • There is an inherent “Home Team Advantage” in any benchmark comparison. • Without open source code, any benchmark claims are essentially marketing fluff, and should be taken with a grain or two of NaCl. • Any benchmark claim should be independently verifiable.
  • 5. Spark Streaming Paper • Compares Spark Streaming (Micro-Batch) to Core Storm (One-at-a-Time) • A more appropriate comparison would have been with Storm’s Trident (Micro-Batch) API • Trident mentioned only in passing (on pages 3 and 12) http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  • 6. Spark Streaming Paper • Benchmark code/configuration not publicly available • Performance claims not independently verifiable http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  • 7. Spark Streaming Paper • Granted, the Spark Streaming paper is almost 2 years old and written at a time when Trident was relatively new. • However, that paper is often cited when comparing Apache Storm and Spark Streaming, particularly in terms of performance. • A lot can change in 2 years. http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  • 8. Streaming and batch processing are fundamentally different.
  • 9. Batch vs. Streaming • Storm is a stream processing framework that also does micro-batching (Trident).
 • Spark is a batch processing framework that also does micro-batching (Spark Streaming).
  • 11. Batch vs. Streaming Batch Streaming Micro-Batch
  • 12. Apache Storm: Two Streaming APIs Core Storm (Spouts and Bolts)! • One at a Time • Lower Latency • Operates on Tuple Streams Trident (Streams and Operations)! • Micro-Batch • Higher Throughput • Operates on Streams of Tuple Batches and Partitions
  • 13. Language Options Core Storm Storm Trident Spark Streaming • Java • Clojure • Scala • Python • Ruby • others* • Java • Clojure • Scala • Java • Scala • Python *Storm’s Multi-Lang feature allows the use of virtually any programming language.
  • 14. Reliability Models Core Storm Storm Trident Spark Streaming At Most Once Yes Yes No At Least Once Yes Yes No* Exactly Once No Yes Yes* *In some node failure scenarios, Spark Streaming falls back to at-least-once processing or data loss.
  • 15. Programing Model Core Storm Storm Trident Spark Streaming Stream Primitive Tuple Tuple, Tuple Batch, Partition DStream Stream Source Spouts Spouts, Trident Spouts HDFS, Network Computation/ Transformation Bolts Filters, Functions, Aggregations, Joins Transformation, Window Operations Stateful Operations No (roll your own) Yes Yes Output/ Persistence Bolts State, MapState foreachRDD
  • 16. Production Deployments Apache Storm Spark Streaming • Too many to list
 
 http:// storm.incubator.apache.org/ documentation/Powered- By.html • Sharethrough
 
 http:// engineering.sharethrough.com/blog/ 2014/06/27/sharethrough-at-spark- summit-2014-spark-streaming-for- realtime-auctions/
  • 17. Support Apache Storm Spark Spark Streaming Hadoop Distro Hortonworks, MapR Cloudera, MapR, Hortonworks (preview) Hortonworks, Cloudera, MapR Resource Management YARN, Mesos YARN, Mesos YARN*, Mesos Provisioning/ Monitoring Apache Ambari Cloudera Manager ? *With issues: http://spark-summit.org/wp-content/uploads/2014/07/ Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
  • 19. Worker Failure: Spark Streaming "So if a worker node fails, then the system can recompute the lost from the the left over copy of the input data. However, if the worker node where a network receiver was running fails, then a tiny bit of data may be lost, that is, the data received by the system but not yet replicated to other node(s)." Only HDFS-backed data sources are fully fault tolerant. https://spark.apache.org/docs/latest/streaming-programming- guide.html#fault-tolerance-properties
  • 20. Worker Failure: Spark Streaming https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and- zero-data-loss-in-spark-streaming.html Solution?: Write Ahead Logs (SPARK-3129) • Enabling WAL requires DFS (HDFS, S3) — no such requirement with Storm • Incurs a performance penalty that adds to overall latency • Full fault tolerance still requires a data source that can replay data (e.g. Kafka)! • Architectural band aid?
  • 21. Worker Failure: Apache Storm • If a supervisor node fails, Nimbus will reassign that node's tasks to other nodes in the cluster. • Any tuples sent to a failed node will time out and be replayed (In Trident, any batches will be replayed). • Delivery guarantees dependent on a reliable data source.
  • 22. Data Source Reliability • A data source is considered unreliable if there is no means to replay a previously-received message. • A data source is considered reliable if it can somehow replay a message if processing fails at any point. • A data source is considered durable if it can replay any message or set of messages given the necessary selection criteria. ! (These are my terms.)
  • 23. Reliability Limitations: Apache Storm • Exactly once processing requires a durable data source. • At least once processing requires a reliable data source. • An unreliable data source can be wrapped to provide additional guarantees. • With durable and reliable sources, Storm will not drop data. • Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
  • 24. Apache Storm Spouts Durable! Kafka
 
 
 
 
 
 Reliable! JMS RabbitMQ / AMQP Kestrel Amazon SQS Amazon Kinesis Unreliable! Twitter Scribe MongoDB
  • 25. Apache Storm Output (Bolts, Trident State) • Cassandra • HBase • HDFS • Kafka • Redis • Memcached • R • JMS • MongoDB • RDBMS
  • 26. Apache Storm + Kafka Apache Kafka is an ideal source for Storm topologies. It provides everything necessary for: • At most once processing • At least once processing • Exactly once processing Apache Storm includes Kafka spout implementations for all levels of reliability. Kafka Supports a wide variety of languages and integration points for both producers and consumers.
  • 27. Reliability Limitations: Spark Streaming • Fault tolerance and reliability guarantees require HDFS-backed data source. • Moving data to HDFS prior to stream processing introduces additional latency. • Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure. https://spark.apache.org/docs/latest/streaming-programming- guide.html#fault-tolerance-properties
  • 28. Performance “The main reason cited by Tathagata for Spark's performance gain over Storm is the aggregation of small records that occurs through the mechanics of RDDs.” http://www.cs.duke.edu/~kmoses/cps516/dstream.html In other words: Micro-Batching
  • 29. Performance http://www.cs.duke.edu/~kmoses/cps516/dstream.html Storm capped at 10k msgs/sec/node? Spark Streaming 40x faster than Storm? Others may disagree…
  • 31. Netty Transport • Introduced in Apache Storm 0.9.0 • Faster, pure Java alternative for 0MQ • Yahoo! Engineering announcement:
 http://yahooeng.tumblr.com/post/ 64758709722/making-storm-fly- with-netty • Performance Test Code:
 https://github.com/yahoo/storm- perf-test Netty 0mq
  • 32. STORM-297 • Introduced in Apache Storm 0.9.2-incubating • Big performance boost, especially for small messages • JIRA Discussion:
 https://issues.apache.org/jira/ browse/STORM-297 • Performance Test Code:
 https://github.com/yahoo/storm- perf-test
  • 33. Benchmarking Storm • 5 nodes on AWS (m1.large - not very powerful) • 1 ZooKeeper, 1 Nimbus, 3 Supervisors • Storm Core API and Trident API benchmarks • Is Trident API slower than Core API? https://github.com/ptgoetz/storm-benchmark
  • 34. Is Trident API slower than Core API? • On low-power hardware with 3 supervisor nodes… • Core API: ~150k msg./sec. with ~80 ms. latency • Trident API: ~300k msg./sec. with ~250 ms. latency • Higher throughput possible with increased latency • Better performance with bigger hardware
  • 35. Is Spark + Spark Streaming a "Lambda Architecture in a Box?" • No! • Lambda is a lot more than batch + streaming. • Lambda is powerful when applied correctly, but is not right for every use case. • Spark and Spark Streaming have overlapping programming models for batch and micro-batch. • The rest is up to you (as it is with Storm).
  • 36. Final Thoughts In general (not specific to Spark Streaming):! • Beware any claim that A is X times faster than B. • Performance is a matter of proper tuning for the use case at hand. • Any system can be hobbled to look bad in a benchmark.
  • 37. Recommendation • It is up to you, and your specific use case. • Consider fault tolerance. Is data loss acceptable? • Consider all facets and make informed decisions. • Rely on your own benchmarks