Spark
Structured
Streaming
Ayush Hooda
Software Consultant
Knoldus Inc.
Agenda
● Streaming – What and Why?
● RDD vs DataFrames vs Datasets
● Programming Model
● Streaming DataFrames and Datasets
● Defining Schema
● Output Modes
● Basic operations
● Window operations on event time*
RDD
Some key features of RDDs:
● Resilient
● Type Safe
● Immutable
● Lazy evaluation
and many more
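A minimal sketch (not part of the original slides) illustrating the type safety and lazy evaluation of RDDs; the object name and local-mode SparkSession are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch extends App {
  val spark = SparkSession.builder()
    .appName("rdd-sketch")
    .master("local[*]")
    .getOrCreate()

  // numbers is an RDD[Int], so map/filter are checked at compile time.
  val numbers = spark.sparkContext.parallelize(1 to 100)

  // Transformations are lazy; nothing executes until an action is called.
  val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

  // reduce is an action: only here does Spark actually run the chain.
  println(evenSquares.reduce(_ + _))

  spark.stop()
}
```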
Problems with RDDs
● They express the ‘how’ of a solution rather than the
‘what’; the RDD API is opaque to Spark’s optimizer.
● They cannot be optimized by Spark.
● It’s too easy to build an inefficient RDD
transformation chain.
DataFrames
● DataFrame API provides a higher-level abstraction,
allowing you to use a query language to manipulate
data.
● Provides SQL functionality.
● Focus on the ‘what’ rather than the ‘how’.
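A minimal sketch of the DataFrame API, where we describe what we want (filter, group, count) and let Spark's optimizer decide how to execute it; the file path and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch extends App {
  val spark = SparkSession.builder()
    .appName("dataframe-sketch")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val people = spark.read
    .option("header", "true")
    .csv("/path/to/people.csv") // hypothetical input file

  // Declarative query: Catalyst plans how this actually runs.
  people.filter($"age" > 21)
    .groupBy($"city")
    .count()
    .show()

  spark.stop()
}
```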
DataFrames
“Let Spark figure out how to do it for you.
As with an RDBMS, we just fire SQL queries
and are not concerned with how the query
produces the data, i.e., we care about the
result, not the process.”
DataFrames
● There are three logical plans, followed by a physical plan:
● Parsed logical plan: the unresolved plan produced
by parsing the query.
● Analyzed logical plan: resolves column names,
table names, etc. against the catalog.
● Optimized logical plan: Catalyst applies its
optimization rules.
● Physical plan (actual RDDs): the executable plan
selected for execution.
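A quick way to see all of these plans is Dataset.explain(true); a small sketch, assuming an active SparkSession named `spark`:

```scala
// Prints the parsed, analyzed, and optimized logical plans,
// followed by the physical plan.
val df = spark.range(100)
  .filter("id % 2 = 0")
  .selectExpr("id * 10 AS value")

df.explain(true)
```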
Spark Structured Streaming
Problems with DataFrames
● So far, everything looks good, but...
● We have lost type safety.
Solution ?
● We can convert DataFrames back into RDDs, but
then we lose the optimizations.
● We would like to get back our compile-time
safety without giving up the optimizations.
Datasets
● An extension to the DataFrame API.
● Conceptually similar to RDDs (you can
actually operate on objects).
● Interoperates easily with the DataFrame API.
● Like an RDD, a Dataset has a type.
● Datasets use Encoders for
serialization/deserialization, which are
considerably faster than Java or Kryo serialization.
Datasets
● Providing a type to the previous example using a Dataset (sketched below).
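The original slide showed the code as an image; below is a minimal, hypothetical sketch of the idea: a case class gives the data a type, and spark.implicits._ supplies the Encoder.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain class.
case class Person(name: String, age: Int)

object DatasetSketch extends App {
  val spark = SparkSession.builder()
    .appName("dataset-sketch")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._ // brings the Encoders for case classes into scope

  // A typed Dataset: the compiler knows each row is a Person.
  val people: Dataset[Person] = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

  // We operate on real objects with compile-time checking,
  // while Spark still optimizes the query through Catalyst.
  people.filter(_.age > 26).map(_.name).show()

  spark.stop()
}
```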
Programming model
● Structured Streaming treats a live data stream as an
unbounded input table that is continuously appended to;
a query on it produces a Result Table that is updated on
every trigger.
Spark Structured Streaming
Output Modes
● The “Output” is defined as what gets written out to the external storage. The output can be
defined in a different mode:
● Complete Mode - The entire updated Result Table will be written to the external storage.
● Append Mode - Only the new rows appended in the Result Table since the last trigger will
be written to the external storage. This is applicable only to queries where existing
rows in the Result Table are not expected to change.
● Update Mode - Only the rows that were updated in the Result Table since the last trigger
will be written to the external storage. Note that this differs from Complete Mode in
that it only outputs the rows that have changed since the last trigger. If the query
doesn’t contain aggregations, it is equivalent to Append mode.
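A minimal sketch of choosing an output mode when starting a streaming query. It assumes `lines` is a streaming DataFrame with a string `value` column and that spark.implicits._ is in scope; the word-count aggregation and checkpoint path are illustrative.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val wordCounts = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy($"word")
  .count()

// Because the query aggregates, Complete (or Update) mode is appropriate.
val query = wordCounts.writeStream
  .outputMode("complete")                       // "append" | "complete" | "update"
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", "/tmp/checkpoints/word-count") // hypothetical path
  .start()

query.awaitTermination()
```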
Streaming DataFrames and
Datasets
● Streaming DataFrames can be created through the
DataStreamReader interface returned by
SparkSession.readStream().
● For example, defining a streaming DataFrame from
Kafka (data source), as sketched below.
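A sketch of what the (image-only) slide showed: defining a streaming DataFrame from Kafka. The broker address and topic name are hypothetical, and the spark-sql-kafka-0-10 connector must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-structured-streaming")
  .master("local[*]")
  .getOrCreate()

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
  .option("subscribe", "events")                        // hypothetical topic
  .option("startingOffsets", "latest")
  .load()

// Kafka rows carry binary key/value columns; cast them to strings to work with them.
val messages = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```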
Defining Schema
● Structured Streaming requires you to specify
the schema.
There are two ways you can specify a schema:
1. Build the schema manually (see the sketch below).
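A minimal sketch of building the schema by hand with StructType, assuming an active SparkSession named `spark`; the field names and input directory are hypothetical.

```scala
import org.apache.spark.sql.types._

val userSchema = StructType(Seq(
  StructField("id",        LongType,      nullable = false),
  StructField("name",      StringType,    nullable = true),
  StructField("timestamp", TimestampType, nullable = true)
))

// Apply the schema when defining the streaming source,
// e.g. a directory of JSON files.
val stream = spark.readStream
  .schema(userSchema)
  .json("/path/to/input-dir") // hypothetical directory
```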
Defining Schema
2. Use the business object that describes the
dataset (see the sketch below).
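A sketch of deriving the schema from a business object instead of building a StructType by hand, assuming an active SparkSession named `spark` with spark.implicits._ in scope; the Person class and input path are hypothetical.

```scala
import org.apache.spark.sql.Encoders

// Hypothetical business object describing one record.
case class Person(id: Long, name: String, age: Int)

// The Encoder derives the schema directly from the case class.
val personSchema = Encoders.product[Person].schema

val people = spark.readStream
  .schema(personSchema)
  .json("/path/to/people")  // hypothetical input directory
  .as[Person]               // a typed streaming Dataset[Person]
```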
Operations on streaming
DataFrames/Datasets
● You can apply various kinds of operations to
streaming DataFrames/Datasets.
● Some of the basic operations are selection,
projection, and aggregation, as sketched below.
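A sketch of these basic operations, assuming `people` is the streaming Dataset[Person] from the previous sketch and spark.implicits._ is in scope.

```scala
import org.apache.spark.sql.functions._

// Typed selection and projection on the domain objects.
val adults = people.filter(_.age >= 18).map(p => (p.name, p.age))

// Untyped, SQL-like aggregation.
val countsByAge = people.groupBy($"age").count()

// The stream can also be registered as a temporary view and queried with SQL.
people.createOrReplaceTempView("people")
val viaSql = spark.sql("SELECT age, COUNT(*) AS total FROM people GROUP BY age")
```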
Window Operations on Event time
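A minimal sketch of event-time windowing with a watermark, assuming `events` is a streaming DataFrame with `timestamp` and `word` columns (hypothetical names) and spark.implicits._ is in scope.

```scala
import org.apache.spark.sql.functions._

val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")          // tolerate data up to 10 minutes late
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"), // 10-minute windows sliding every 5 minutes
    $"word"
  )
  .count()

windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()
```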
References
● https://spark.apache.org/docs/latest/structured-stre
● https://www.youtube.com/watch?v=pZQsDloGB4w&
● https://jaceklaskowski.gitbooks.io/spark-structured-
Thank you