Kafka Streams
vs
Spark Structured Streaming
Modern Stream Processing Engines Compared
© Jacek Laskowski / @JacekLaskowski / jacek@japila.pl / DataMass Summit 2019
● A freelance IT consultant
● Specializing in Spark, Kafka, Kafka Streams, Scala
● Development | Consulting | Training
● "The Internals Of" online books
● A contributor to Apache Spark
● A Confluent Community Catalyst (Class of 2019-2020)
● Contact me at jacek@japila.pl
● Follow @JacekLaskowski on Twitter for more #ApacheSpark #ApacheKafka #KafkaStreams
Jacek Laskowski
Friendly reminder
Pictures...take a lot of pictures! 📷
The Features of Both
1. Stream Processing Engines
2. High-Level DSL for defining processing flow (logic)
a. Topology
b. Dataflow
3. Low-Level API for custom flows
4. Logical and physical plans
a. Logical “what” and executable “how”
Kafka Streams (1 of 2)
1. Kafka Streams 2.3.0
2. Java and Scala APIs
3. Yet Another Command-Line Application (“YACA”)
a. High-availability and fault tolerance OOTB
b. Creating consumer groups OOTB
4. Support for Apache Kafka only
a. Use Kafka Connect to go beyond Kafka
5. Data Abstractions
a. High-level Streams DSL (KStream, KTable, GlobalKTable)
b. Low-level Processor API
6. One record at a time
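A minimal sketch of the three Streams DSL abstractions, assuming the Kafka Streams Scala API (kafka-streams-scala 2.3.0) is on the classpath. The topic names (`clicks`, `users`, `countries`) and the object name are hypothetical and only serve the illustration.

```scala
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.kstream.GlobalKTable
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.kstream.{KStream, KTable}

object DslAbstractionsSketch {

  // Topic names below are hypothetical.
  def build(): Topology = {
    val builder = new StreamsBuilder

    // KStream: an unbounded stream of independent records.
    val clicks: KStream[String, String] =
      builder.stream[String, String]("clicks")

    // KTable: a changelog where a newer record replaces the older one per key.
    val users: KTable[String, String] =
      builder.table[String, String]("users")

    // GlobalKTable: a table whose full contents are available to every instance.
    val countries: GlobalKTable[String, String] =
      builder.globalTable[String, String]("countries")

    builder.build()
  }

  def main(args: Array[String]): Unit =
    println(build().describe())
}
```

Building the topology does not contact a Kafka broker, so the sketch can be tried without a running cluster.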
Kafka Streams (2 of 2)
1. ETL only
a. No support for SQL or Machine Learning
b. SQL is available separately through KSQL (built on Kafka Streams)
2. Java 11 supported
3. Scala 2.12
4. No interactive shell / REPL for learning and prototyping
5. Rich join support (stream-stream, stream-table, stream-global table)
6. Reading from and writing to a single Kafka cluster
7. Uses RocksDB for persistent state storage
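As an illustration of the join support listed above, here is a hedged sketch of a stream-table join with the Kafka Streams Scala API (2.3.0 assumed). The topics `clicks`, `users` and `clicks-enriched`, and the meaning given to keys and values, are assumptions made for the example.

```scala
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object StreamTableJoinSketch {

  def build(): Topology = {
    val builder = new StreamsBuilder

    // Hypothetical input: key = user id, value = visited page.
    val clicks = builder.stream[String, String]("clicks")

    // Hypothetical lookup table: key = user id, value = user name.
    val users = builder.table[String, String]("users")

    // Inner join: each click is enriched with the current table value
    // for the same key; clicks without a matching user are dropped.
    clicks
      .join(users)((page, name) => s"$name viewed $page")
      .to("clicks-enriched")

    builder.build()
  }

  def main(args: Array[String]): Unit =
    println(build().describe())
}
```

A stream-stream join would additionally need a join window, which this sketch leaves out.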
Kafka Streams Code / Topology (1 of 2)
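The code shown on this slide is not part of the transcript. The following is a minimal sketch of what defining a topology could look like, not the author's original code. It assumes the Kafka Streams Scala API (2.3.0); the topic names `input` and `output` are hypothetical.

```scala
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object TopologySketch {

  // Describes the processing logic only; nothing is executed here.
  def build(): Topology = {
    val builder = new StreamsBuilder
    builder
      .stream[String, String]("input")       // source topic (hypothetical)
      .filter((_, value) => value.nonEmpty)  // drop empty values
      .mapValues(_.toUpperCase)              // transform one record at a time
      .to("output")                          // sink topic (hypothetical)
    builder.build()
  }

  def main(args: Array[String]): Unit =
    // Prints the sources, processors and sinks of the topology.
    println(build().describe())
}
```

`describe()` is a convenient way to inspect the processing graph before running it.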
Kafka Streams Code / Execution Env (2 of 2)
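The code shown on this slide is not part of the transcript. A hedged sketch of the execution side, not the author's original code, could look as follows. It assumes Kafka Streams 2.3.0 with the Scala API and a broker at `localhost:9092`; the application id and topic names are hypothetical.

```scala
import java.time.Duration
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object ExecutionEnvSketch {

  def config(): Properties = {
    val props = new Properties()
    // The application id also serves as the consumer group id.
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-streams-sketch")
    // Assumed broker address.
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props
  }

  def main(args: Array[String]): Unit = {
    val builder = new StreamsBuilder
    builder.stream[String, String]("input").to("output")

    // A Kafka Streams application is a regular JVM process:
    // no cluster manager is involved.
    val streams = new KafkaStreams(builder.build(), config())

    // Close the application cleanly when the JVM shuts down.
    sys.addShutdownHook {
      streams.close(Duration.ofSeconds(10))
    }

    streams.start()
  }
}
```

Starting more instances with the same application id spreads the input partitions across them, which is how the command-line application scales out.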
Spark Structured Streaming (1 of 2)
1. Apache Spark 2.4.4
2. Stream Processing API for Scala, Java, Python, SQL
a. Useful for software developers and data scientists
3. Requires cluster manager
a. Apache Hadoop’s YARN / Apache Mesos / DC/OS / Spark Standalone
4. Lots of data sources
a. Kafka, JSON, Parquet, CSV, Avro, ORC, socket
b. Data Source API
5. Data abstraction: streaming Dataset
Spark Structured Streaming (2 of 2)
1. ETL + Machine Learning
a. Spark MLlib supports streaming Datasets
2. Java 8 only
3. Scala 2.12
4. spark-shell for learning and prototyping
5. Streaming joins and aggregations
6. Reading from one Kafka cluster and writing to another Kafka cluster
7. Uses HDFS (or another Hadoop-compatible file system) for checkpointing and persistent state storage
Spark “Streams” Code / Loading Data (1 of 3)
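The code shown on this slide is not part of the transcript. The following sketch, not the author's original code, shows one way to load a streaming Dataset from Kafka. It assumes Spark 2.4 with the `spark-sql-kafka-0-10` module on the classpath; the broker address and the topic name `input` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadingDataSketch {

  // Describes a streaming source; no records are read until a query starts.
  def load(spark: SparkSession): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "input")                        // hypothetical topic
      .load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("loading-sketch")
      .getOrCreate()

    val records = load(spark)
    // The Kafka source has a fixed schema; key and value arrive as binary.
    records.printSchema()

    spark.stop()
  }
}
```

The result is an ordinary DataFrame whose `isStreaming` flag is set, so the usual Dataset operators apply to it.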
Spark “Streams” Code / Processing (2 of 3)
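The code shown on this slide is not part of the transcript. A hedged sketch of a processing step, not the author's original code, is given below. It assumes Spark 2.4 and input with `key` and `value` columns, as the Kafka source produces; the transformation itself (uppercasing values) is a placeholder chosen for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

object ProcessingSketch {

  // Works the same way on batch and streaming DataFrames.
  def process(records: DataFrame): DataFrame =
    records
      .select(col("key").cast("string"), col("value").cast("string"))
      .where(col("value").isNotNull)
      .withColumn("value", upper(col("value")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("processing-sketch")
      .getOrCreate()
    import spark.implicits._

    // A small batch DataFrame stands in for the stream to try the logic out.
    val sample = Seq(("k1", "hello"), ("k2", "spark")).toDF("key", "value")
    process(sample).show()

    spark.stop()
  }
}
```

Keeping the logic in a function over `DataFrame` makes it possible to check it on static data before attaching it to a streaming source.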
Spark “Streams” Code / Saving Data (3 of 3)
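The code shown on this slide is not part of the transcript. The sketch below, not the author's original code, writes a streaming Dataset to Kafka. It assumes Spark 2.4 with the `spark-sql-kafka-0-10` module; both broker addresses, the topic names, the trigger interval and the checkpoint path are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.streaming.{DataStreamWriter, Trigger}

object SavingDataSketch {

  // The Kafka sink expects a `value` column (and optionally `key`).
  def writer(processed: DataFrame): DataStreamWriter[Row] =
    processed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9093") // assumed target cluster
      .option("topic", "output")                           // hypothetical topic
      // Offsets and state are checkpointed here so the query can recover.
      .option("checkpointLocation", "/tmp/checkpoints/saving-sketch")
      .trigger(Trigger.ProcessingTime("10 seconds"))       // micro-batch interval

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("saving-sketch")
      .getOrCreate()

    // Read from one (assumed) cluster and write to another.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    val query = writer(input).start()
    query.awaitTermination()
  }
}
```

Nothing runs until `start()` is called; the returned `StreamingQuery` is the handle for monitoring and stopping the query.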
“The Internals Of” Online Books
1. The Internals of Kafka Streams
2. The Internals of Spark Structured Streaming
Questions?
1. Follow @jaceklaskowski on Twitter (DMs open)
2. Upvote my questions and answers on Stack Overflow
3. Contact me at jacek@japila.pl
4. Connect with me on LinkedIn
