Hands-on with Apache Spark:
Creating a Fast Data Pipeline
with Structured Streaming and
Spark Streaming
Gerard Maas
Senior SW Engineer, Lightbend, Inc.
Gerard Maas
Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Agenda
Creating a Fast Data Pipeline with
Structured Streaming and
Spark Streaming
Data Pipelines
[Diagram: Data, Kafka, Processor]
● Create Composable Streaming Applications
● Using the Best Tool for the Job
● Generating a Network Effect
[Diagram: Apache Spark Core, Spark SQL, Spark MLlib, Spark Streaming, Structured Streaming, Datasets/Frames, GraphFrames, Data Sources]
[Diagram highlights: RDD, DStream, Query]
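Spark Streaming:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// local[2]: one thread for the socket receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _) // counts within each batch
wordCounts.print()
ssc.start()
ssc.awaitTermination()
https://spark.apache.org/docs/latest/streaming-programming-guide.html
Structured Streaming:
import spark.implicits._ // assumes a SparkSession named spark; needed for .as[String]
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count() // running count over the whole stream
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html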
             Structured Streaming                                   Spark Streaming
Time         Abstract (Processing Time, Event Time)                 Fixed to microbatch Streaming Interval
Execution    Fixed Micro batch, Best Effort MB, Continuous (NRT)    Fixed Micro batch
Abstraction  DataFrames/Dataset                                     DStream, RDD; Access to the scheduler
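The three execution modes listed for Structured Streaming in the comparison above correspond to trigger settings on a streaming query. The following is a minimal sketch, assuming Spark 2.3 or later on the classpath. The object name `TriggerModes`, the helper `startConsole`, the use of the built-in `rate` source as test input, and the chosen intervals are illustrative assumptions, not taken from the talk.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Hypothetical helper object for illustration; not part of the talk's notebooks.
object TriggerModes {

  // Starts a console query. With no trigger, Spark falls back to its default:
  // best-effort micro-batches, where the next batch starts when the previous one ends.
  def startConsole(stream: DataFrame, trigger: Option[Trigger]): StreamingQuery = {
    val writer = stream.writeStream.format("console")
    trigger.fold(writer)(t => writer.trigger(t)).start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-modes").master("local[2]").getOrCreate()

    // The built-in "rate" source emits (timestamp, value) rows; used here only as test input.
    val ticks = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    val modes: Seq[(String, Option[Trigger])] = Seq(
      "fixed micro-batch"       -> Some(Trigger.ProcessingTime("2 seconds")),
      "best-effort micro-batch" -> None,
      // Continuous processing is experimental; "1 second" is the checkpoint interval.
      "continuous"              -> Some(Trigger.Continuous("1 second"))
    )

    // Run the modes one after another so that the continuous query,
    // which keeps its tasks running, does not starve the others of cores.
    for ((name, trigger) <- modes) {
      println(s"--- $name ---")
      val query = startConsole(ticks, trigger)
      query.awaitTermination(5000) // let it run for about 5 seconds
      query.stop()
    }
    spark.stop()
  }
}
```

Spark Streaming has no equivalent switch: the batch interval passed to `StreamingContext` fixes both the time model and the execution model.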
Sensor Anomaly Detection Pipeline
Sensor Data Multiplexer
● Data Exploration [Structured Streaming]
● Data Preparation [Structured Streaming]
● Online Model Creation + Training [Spark Streaming]
● Anomaly Detection [Structured Streaming]
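The Data Exploration stage is paired in this deck with a Kafka source, a memory sink and SQL operations. A minimal sketch of that combination follows, assuming Spark 2.x with the `spark-sql-kafka-0-10` package available. The object name `ExploreSensorData`, the broker address `localhost:9092`, the topic `sensor-raw` and the table name `sensor_raw` are hypothetical and are not taken from the talk's notebooks.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical object for illustration; broker, topic and table names are assumptions.
object ExploreSensorData {

  // Subscribes to a Kafka topic; requires the spark-sql-kafka-0-10 package.
  def kafkaSource(spark: SparkSession, brokers: String, topic: String): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .load()
      // Kafka delivers key and value as binary; cast them to strings for inspection.
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

  // Collects the stream into an in-memory table on the driver (suitable for small volumes only).
  def toMemoryTable(stream: DataFrame, tableName: String): StreamingQuery =
    stream.writeStream
      .format("memory")
      .queryName(tableName) // name of the table to query with SQL
      .outputMode("append")
      .start()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explore").master("local[2]").getOrCreate()
    val query = toMemoryTable(kafkaSource(spark, "localhost:9092", "sensor-raw"), "sensor_raw")

    Thread.sleep(10000) // give the query some time to ingest data
    spark.sql("SELECT key, count(*) AS readings FROM sensor_raw GROUP BY key").show()

    query.stop()
    spark.stop()
  }
}
```

The memory sink keeps all received rows on the driver, so it fits interactive exploration in a notebook rather than a long-running job.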
Sensor Anomaly Detection
[Diagram: Sensor Data Multiplexer, Structured Streaming, Local Process, Structured Streaming]
Live
Sensor Anomaly Detection Pipeline
Sensor Data Multiplexer
● Data Exploration [Structured Streaming]: Kafka Source, Memory Sink, SQL Operations
● Data Preparation [Structured Streaming]: Event Time, Windows, Watermark, Kafka Sink
● Online Model Creation + Training [Spark Streaming]: RDD Programming, Local vs Distributed, Use Spark SQL, Kafka Source + Sink
● Anomaly Detection [Structured Streaming]: Controlled Streams: Spark Streaming + Structured Streaming
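The Data Preparation stage combines event time, windows, a watermark and a Kafka sink. The sketch below shows one way these pieces can fit together in Structured Streaming. It assumes Spark 2.x with the `spark-sql-kafka-0-10` package, and a JSON payload with fields `id`, `ts` and `value`. The object name `PrepareSensorData`, the topics `sensor-raw` and `sensor-avg`, the broker address, the checkpoint path and the window and watermark durations are all hypothetical, chosen for illustration rather than taken from the talk's notebooks.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, from_json, struct, to_json, window}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType, TimestampType}

// Hypothetical object for illustration; schema, topics and durations are assumptions.
object PrepareSensorData {

  // Assumed JSON payload: {"id": "...", "ts": "...", "value": 1.0}
  val readingSchema: StructType = new StructType()
    .add("id", StringType)
    .add("ts", TimestampType)
    .add("value", DoubleType)

  // Parses the binary Kafka value column into typed columns.
  def parse(raw: DataFrame): DataFrame =
    raw.select(from_json(col("value").cast("string"), readingSchema).as("r")).select("r.*")

  // Average per sensor over 10-second tumbling event-time windows.
  // The watermark lets Spark drop state for windows that lie more than
  // 30 seconds behind the latest event time seen so far.
  def windowedAverages(readings: DataFrame): DataFrame =
    readings
      .withWatermark("ts", "30 seconds")
      .groupBy(window(col("ts"), "10 seconds"), col("id"))
      .agg(avg(col("value")).as("avg_value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("prepare").master("local[2]").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sensor-raw")
      .load()

    val query = windowedAverages(parse(raw))
      // The Kafka sink expects a string or binary "value" column and an optional "key".
      .select(col("id").as("key"),
              to_json(struct(col("window"), col("id"), col("avg_value"))).as("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "sensor-avg")
      .option("checkpointLocation", "/tmp/sensor-avg-checkpoint") // required by the Kafka sink
      .outputMode("update") // emit a window's row whenever its average changes
      .start()

    query.awaitTermination()
  }
}
```

Keeping `parse` and `windowedAverages` as plain `DataFrame => DataFrame` functions means the same logic can be checked against a static DataFrame before it is attached to a stream.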
Resources
Notebooks used today:
https://github.com/maasg/spark-notebooks/tree/master/streaming-anomaly-detection
Pipelines:
https://www.reactivesummit.org/2018/schedule/taking-the-pain-out-of-deploying-streaming-applications
Structured Streaming + Spark Streaming:
https://www.reactivesummit.org/2018/schedule/processing-fast-data-with-apache-spark-the-tale-of-two-streaming-apis
Fast Data:
https://www.lightbend.com/products/fast-data-platform
Your turn for...
Questions?
Thank You!
