Realtime Structured
Streaming in Azure
Databricks
Brian Steele - Principal Consultant
bsteele@pragmaticworks.com
• You currently have high volume data that you are
processing in a batch format
• You are trying to get real-time insights from your data
• You have great knowledge of your data, but limited knowledge of Azure Databricks or other Spark systems
Your Current Situation
Prior Architecture
[Diagram: Source System → Azure Data Factory (Daily File Extract) → Batch Processing]
New Architecture
[Diagram: Source System → Realtime Message Streaming to Event Hubs → Structured Streaming → Realtime Transaction Processing, bypassing the daily batch extract]
• Azure Databricks is an Apache Spark-based analytics platform
optimized for the Microsoft Azure cloud services platform.
• Designed with the founders of Apache Spark, Databricks is integrated
with Azure to provide one-click setup, streamlined workflows, and an
interactive workspace that enables collaboration between data
scientists, data engineers, and business analysts.
• Azure Databricks is a fast, easy, and collaborative Apache Spark-based
analytics service.
Why Azure Databricks?
• For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real-time using Kafka, Event Hubs, or IoT Hub.
• This data lands in a data lake for long-term persistent storage, in Azure Blob Storage or Azure Data Lake Storage.
• As part of your analytics workflow, use Azure Databricks to read data
from multiple data sources such as Azure Blob Storage, Azure Data
Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and
turn it into breakthrough insights using Spark.
• Azure Databricks provides enterprise-grade Azure security, including
Azure Active Directory integration, role-based controls, and SLAs that
protect your data and your business.
• Structured Streaming is the Apache Spark API that lets you
express computation on streaming data in the same way
you express a batch computation on static data.
• The Spark SQL engine performs the computation
incrementally and continuously updates the result as
streaming data arrives.
• Databricks maintains the current checkpoint of the data
processed, making restart after failure nearly seamless.
• Can bring impactful insights to the users in almost real-
time.
Advantages of Structured Streaming
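Below is a minimal PySpark sketch of this batch/stream symmetry. It assumes a Delta table named `events` with an `event_type` column; the checkpoint path is a placeholder, and `spark` is the session a Databricks notebook provides.

```python
# Batch: read a static Delta table and aggregate it.
batch_counts = spark.read.table("events").groupBy("event_type").count()

# Streaming: the same computation expressed against the table as a stream.
stream_counts = spark.readStream.table("events").groupBy("event_type").count()

# The Spark SQL engine updates the aggregate incrementally as rows arrive;
# the checkpoint is what makes restart after a failure nearly seamless.
query = (stream_counts.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/mnt/checkpoints/event_counts")  # placeholder path
         .toTable("event_counts"))
```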
Streaming Data Source/Sinks
Sources:
• Azure Event Hubs / IoT Hubs
• Azure Data Lake Gen2 (Auto Loader)
• Apache Kafka
• Amazon Kinesis
• Amazon S3 with Amazon SQS
• Databricks Delta Tables
Sinks:
• Databricks Delta Tables
• Almost any sink using foreachBatch
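As an illustration of one of these sources, here is a hedged Auto Loader sketch; the storage account, container, schema location, and checkpoint path are placeholders.

```python
# Auto Loader (cloudFiles) incrementally picks up new files as they land in ADLS Gen2.
raw_stream = (spark.readStream
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", "/mnt/schemas/raw_events")    # placeholder
              .load("abfss://raw@mystorageacct.dfs.core.windows.net/events/"))   # placeholder

# Delta table as the sink; foreachBatch (covered later) reaches sinks without native support.
(raw_stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/raw_events")                    # placeholder
 .toTable("raw_events"))
```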
• Source Parameters
• Source Format/Location
• Batch/File Size
• Transformations
• Streaming data can be transformed in the
same ways as static data
• Output Parameters
• Output Format/Location
• Checkpoint Location
Structured Streaming
[Diagram: Event Hub source → Structured Streaming]
DEMO
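A sketch of a pipeline along these lines, assuming the azure-eventhubs-spark connector is installed on the cluster; the secret scope, message schema, and checkpoint path are illustrative assumptions, not part of the original demo.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Source parameters: the connector expects an encrypted connection string.
connection_string = dbutils.secrets.get("demo-scope", "eventhub-connection")   # hypothetical secret
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Transformations: Event Hubs delivers each message as a binary 'body' column,
# so cast it to a string and parse the JSON payload (schema assumed for the demo).
schema = StructType([
    StructField("transactionId", StringType()),
    StructField("customerId", StringType()),
    StructField("itemId", StringType()),
    StructField("timeStamp", TimestampType()),
])

transactions = (spark.readStream
                .format("eventhubs")
                .options(**eh_conf)
                .load()
                .select(F.from_json(F.col("body").cast("string"), schema).alias("t"))
                .select("t.*"))

# Output parameters: Delta table plus the checkpoint, which tracks the message
# offset on each Event Hubs partition so the stream can restart where it left off.
(transactions.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/transactions")                 # placeholder
 .toTable("transactions"))
```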
Join Operations
• Join Types
• Inner
• Left
• Not Stateful by default
Stream-Static Joins
[Diagram: Event Hub stream + static file → Structured Streaming]
DEMO
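A hedged sketch of a stream-static join, reusing the `transactions` stream from the earlier Event Hubs sketch; the `items` table and its `itemName` column are assumed for illustration.

```python
# Static dimension data: a Delta table that is re-read for each micro-batch.
items = spark.read.table("items")

# Inner and left joins of a stream to a static DataFrame are not stateful,
# so each micro-batch is simply matched against the static records.
enriched = (transactions
            .join(items, on="itemId", how="left")
            .select("transactionId", "customerId", "itemId", "itemName", "timeStamp"))

(enriched.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/enriched_transactions")  # placeholder
 .toTable("enriched_transactions"))
```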
• Join Types
• Inner (Watermark and Time
Constraint Optional)
• Left Outer (Watermark and Time Constraint Required)
• Right Outer (Watermark and Time Constraint Required)
• You can also Join Static
Tables/Files into your Stream-
Stream Join
Stream-Stream Joins
[Diagram: two Event Hub streams (optionally joined with static tables/files) processed as micro-batches in Structured Streaming]
• Watermark – How late a record can
arrive and after what time can it be
removed from the state.
• Time Constraint – How long the records will be kept in state in relation to the other stream
• Only used in stateful operation
• Ignored in non-stateful streaming
queries and batch queries
Watermark vs. Time Constraint
[Diagram: a transaction stream and a view stream, each from an Event Hub, joined in Structured Streaming]
Transaction 1/Customer 1/Item 1
Transaction 2/Customer 2/Item 1
Transaction 3/Customer 1/Item 2
View 1/Customer 1/Item 1
View 2/Customer 2/Item 2
View 3/Customer 3/Item 3
View 4/Customer 1/Item 2
Watermark
10 Minutes
Watermark
5 Minutes
Time constraint:
View.timeStamp >= Transaction.timeStamp
and View.timeStamp <= Transaction.timeStamp + interval 5 minutes
[Timeline, 10:00-10:15: Transaction 1 occurs 10:00, received 10:01; View 6 at 10:00; View 1 at 10:02; View 2 at 10:03; View 3 occurs 10:04, received 10:08; View 4 at 10:06; View 5 occurs 10:04, received 10:12. Watermark window: 10:00-10:10; time-constraint window: 10:00-10:05.]
DEMO
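A sketch of a stream-stream join for this scenario, assuming a `views` stream built the same way as the `transactions` stream, a 10-minute watermark on transactions and a 5-minute watermark on views (matching the diagram above), and the time constraint shown earlier.

```python
from pyspark.sql import functions as F

# Watermarks: how late an event may arrive before its state can be dropped.
transactions_wm = transactions.withWatermark("timeStamp", "10 minutes")
views_wm = views.withWatermark("timeStamp", "5 minutes")

# Time constraint: how long rows are held in state relative to the other stream.
# A left outer join also returns transactions with no view within 5 minutes.
joined = transactions_wm.alias("t").join(
    views_wm.alias("v"),
    F.expr("""
        v.customerId = t.customerId AND
        v.timeStamp >= t.timeStamp AND
        v.timeStamp <= t.timeStamp + interval 5 minutes
    """),
    "leftOuter",
)
```

Note that the left outer side only emits unmatched transactions once the watermark guarantees no matching view can still arrive.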
• Allows Batch Type Processing to be performed on Streaming Data
• Perform processes without adding to state
• dropDuplicates
• Aggregating data
• Perform a Merge/Upsert with Existing Static Data
• Write Data to multiple sinks/destinations
• Write Data to sinks not supported in Structured Streaming
foreachBatch
DEMO
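A hedged foreachBatch sketch covering the merge/upsert and multiple-sink cases, applied to the `transactions` stream from the earlier sketches; the target table name and JDBC connection details are placeholders.

```python
from delta.tables import DeltaTable

def process_micro_batch(micro_batch_df, batch_id):
    # Batch-style processing on the micro-batch: dedupe without adding to stream state.
    deduped = micro_batch_df.dropDuplicates(["transactionId"])

    # MERGE/upsert into an existing Delta table (assumed to exist).
    target = DeltaTable.forName(spark, "transactions_current")
    (target.alias("tgt")
     .merge(deduped.alias("src"), "tgt.transactionId = src.transactionId")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # A second sink in the same batch, e.g. an Azure SQL table that is not a
    # supported streaming sink (connection details are placeholders).
    (deduped.write
     .format("jdbc")
     .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=sales")
     .option("dbtable", "dbo.Transactions")
     .option("user", "demo")
     .option("password", dbutils.secrets.get("demo-scope", "sql-password"))
     .mode("append")
     .save())

(transactions.writeStream
 .foreachBatch(process_micro_batch)
 .option("checkpointLocation", "/mnt/checkpoints/transactions_merge")  # placeholder
 .start())
```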
• Spark Shuffle Partitions
• Set equal to the number of cores on the cluster
• Maximum Records per Micro-Batch
• File Source/Delta Lake – maxFilesPerTrigger, maxBytesPerTrigger
• Event Hubs – maxEventsPerTrigger
• Limit Stateful Operations – reduces state size and memory errors
• Watermarking
• MERGE/Join/Aggregation
• Broadcast Joins
• Output Tables – Influences downstream streams
• Manually re-partition
• Delta Lake – Auto-Optimize
Going to Production
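A few of these settings as a hedged sketch; the values, table name, and configuration choices are illustrative, not prescriptive.

```python
# Shuffle partitions: set to the cluster's core count before the stream first runs;
# for stateful queries the partition count is then fixed by the checkpoint.
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

# Bound the micro-batch size at the source (file/Delta sources shown here;
# the Event Hubs connector uses maxEventsPerTrigger instead).
bounded = (spark.readStream
           .option("maxFilesPerTrigger", 100)
           .option("maxBytesPerTrigger", "1g")
           .table("raw_events"))            # placeholder table

# Delta auto-optimize: compact small files as the stream writes them.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```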
Conclusion
Have Any Questions?
Editor's Notes
  • #3: Question responses from the polls. For the last year or so I have been working very heavily in Databricks – specifically using it for big data processing with Structured Streaming. So what we are going to look at today is for the user who maybe has played a little with Databricks, has used Spark in some other form in the past, and has at least an idea of or need for big data processing, specifically in a real-time solution.
  • #6: So why Azure Databricks? I have worked with many big data systems over the years on several different platforms, and I had also used Spark before. But as more of a data architect and developer, I was always put off by what seemed like the over-complexity of the Spark ecosystem. There were a lot of elements, it took a lot of "under the hood" setup and tuning, and I would just always rather use something else – especially as we moved to Azure and the cloud, where I could just throw a never-ending amount of processing at my big data problems. With Databricks I now get the best of both worlds: a simple-to-set-up, simple-to-maintain, easy-to-scale Spark-based system with all the development and processing benefits and without all the technical and administrative overhead. So with Azure Databricks you get Spark – directly from the people that invented it – but in a fast, easy, and collaborative cloud service.
  • #7: You also get great integration with all the other Azure elements – Event Hubs, Key Vaults, Data Lakes, Azure SQL, data warehouse, Data Factory, and even Azure DevOps. Then you overlay your existing Azure security model with Active Directory right on top to provide a completely integrated security model.
  • #8: Structured Streaming then allows you to take all of that integration and processing power and apply it to a stream of big data to gain near real-time processing capabilities. So you can process through large amounts of messages/events/files as they are received and perform the same computations on the data that you could with a static data set. At the same time, Databricks automatically keeps a record of the data as it is processed, allowing almost seamless restarts if a failure were to occur in the process. This allows you to generate datasets in near real time – providing marketable insights to your business.
  • #9: There are several different source and sink locations that can be used with streaming in Databricks. Within the Azure ecosystem, Azure Event Hubs and Databricks Delta tables in Azure Data Lake are the most popular, but other source streams like Apache Kafka or Amazon Kinesis are also popular. You can also use the file queue in Data Lake Gen2 with Auto Loader to load blob files as they are saved to a file location. You can use almost anything as a sink by using the foreachBatch method, which we will take a look at later.
  • #10: So a typical structured streaming pipeline is made up of 3 parts: the source, any transformations, and the output sink or destination. In our first example we will look at the source being an Event Hub message stream, add some minor transformations, and then sink the results to a Databricks Delta table. Each source has some specific options or parameters, such as format, connection information, file location, etc. The transformations can be any transformation you can perform on a static dataset. And the output can again have specific options and formats based on the type, including the destination location or partitioning information. The key element that makes the sink of a streaming data source different is the checkpoint location. This checkpoint allows the stream to keep track of which messages have been read from the source and, if the stream is interrupted, where to pick up on restart. In the case of the Event Hub queue, the checkpoint keeps track of the specific message offset on each partition. Also note that to use an Event Hub source you must add the Azure Event Hubs library to your cluster and import the microsoft.azure.eventhubs library into your notebook.
  • #11: TASK – Need data elements and code. Databricks environment Can all be in the same command, can be in as many commands as you want
  • #12: Structured Streaming supports joining a streaming Dataset or DataFrame with a static Dataset or DataFrame – such as binding our transactional table to other dimensional information, like sales info to an item table, customer information, or sales territories. It also supports joining to another streaming Dataset/DataFrame. The result of the streaming join is generated incrementally as the micro-batches are executed and looks similar to the results of our previous streaming aggregation example. So in the upcoming demonstrations we will look at a few of these examples and see how the join types (i.e. inner, outer, etc.) are handled. In all the supported join types, the result of the join with a streaming Dataset/DataFrame will be exactly the same as if it was with a static Dataset/DataFrame containing the same data as the stream.
  • #13: When a streaming dataset and a static dataset are used, then only an inner join and a left outer join are supported. Right outer joins and full outer joins are not supported. Inner joins and left outer joins on streaming and static datasets don’t have to be stateful, which improves your performance. The records in any single micro batch can be matched with a static set of records.
  • #14: TASK – need data and example code
  • #15: Stream-to-stream joins support inner, left, and right joins, but with differing requirements. While watermarking is not required on an inner join, it is best to use it unless you can be sure both records will exist at some point. Otherwise you may have records that stay in state indefinitely and are never cleaned up.
  • #16: It's very important to understand the difference between watermark and time constraint. Watermarking a stream decides how delayed a record can arrive and gives a time when records can be dropped. For example, if you set a watermark for 30 minutes, then records older than 30 minutes will be dropped/ignored. Time constraints decide how long the record will be retained in Spark's state in correlation to the other stream.
  • #17: So in our scenario we are going to receive our transaction data, and in addition we are going to get view data from our website. We want to analyze: for Customer X, after buying Item Y, how many other items did they view in the next 5 minutes? Another thing to remember that often gets people is that the watermark is not from the "current time"; it is from the last event time that the system saw. So if you have not received new messages in the stream, it will not apply.
  • #18: We have several possible outcomes. The transaction may be late, so how long do we want to keep that record? – this can depend on the volume of records and the source system. If you have a large volume, but few late records you can make this timeframe shorter. The views may be late, or even before the transaction – so again how long do we want to keep those records in memory – it has to be >= 5 minutes since that is our time constraint. They may not view anything else, so if we want to know that, we need to use a left join so we can get transactions that have no view data within 5 minutes.
  • #19: TASK – Need data elements and code. Databricks environment Can all be in the same command, can be in as many commands as you want
  • #20: The last element of structured streaming that we are going to review is foreachBatch. What foreachBatch really lets you do is "cheat" on your streaming. You can take the streaming micro-batch, put it in the foreachBatch method, then perform anything you could normally do in standard batch processing. One of the key things is to perform normally "stateful" processing – a great example of this is dropping duplicates. As you get into more complex data structures you might also need to perform aggregations on the micro-batch itself. So if you had a complex structure like a sales ticket that contained multiple individual sale items, you might want to aggregate those by item or department before saving them. In foreachBatch you could perform the aggregation, then save the data. Another great use is when you need to save the same streaming data to multiple sinks. This might be to update a summary dataset and to save the detailed record at the same time. This method can also be used to write data to sinks that are not supported in streaming – such as a SQL database table.
  • #21: TASK – need data and example code
  • #22: This topic could really be its own webinar, but I did want to touch on some of the items you will want to look at when you get ready to move to production with your stream. There is a really good session from the spark AI 2020 summit that does a very good job of what types of issue to look for and I will put that in the chat. https://databricks.com/session_na20/performant-streaming-in-production-preventing-common-pitfalls-when-productionizing-streaming-jobs But some of the items we want to watch for that are harder to fix once you have started to run a process in production are things like the shuffle partition setting, which can limit the disk shuffle and greatly increase performance. Once that is set the value is saved in the delta metadata and is hard to change if you need to scale up or down the number of cores on your cluster. Another is the “auto-optimize” setting on your delta tables. By default, if you write streaming data to delta you will get a lot of very small files. You can setup a job to optimize the tables periodically, but in a real-time environment it is best to let the system optimize as data is processed. You can set your delta tables to auto-optimize which will reduce your number of files and increase the size of the files to help downstream performance. You can also manipulate the size of the micro batch by changing the number of events/files/bytes that are consumed – depending on your source. This again is to help keep your processing from having to use disk for the shuffle partitions. Finally, as you design your streaming environment try to limit the number of stateful processes you bring into the streams. By limiting things like deduplication of the stream itself, the number of aggregations, the length of any watermarking, or by using the broadcast join hint on smaller static tables, you can greatly increase your record thruput and reduce memory usage and errors.