Realtime Structured
Streaming in Azure
Databricks
Brian Steele - Principal Consultant
bsteele@pragmaticworks.com
• You currently have high volume data that you are
processing in a batch format
• You are trying to get real-time insights from your data
• You have great knowledge of your data, but limited knowledge of Azure Databricks or other Spark systems
Your Current Situation
Prior Architecture
[Diagram: Source System → Azure Data Factory (Daily File Extract) → Batch Processing]
New Architecture
[Diagram: Source System → Realtime Message Streaming to Event Hubs → Structured Streaming → Realtime Transaction Processing, bypassing the daily batch extract]
• Azure Databricks is an Apache Spark-based analytics platform
optimized for the Microsoft Azure cloud services platform.
• Designed with the founders of Apache Spark, Databricks is integrated
with Azure to provide one-click setup, streamlined workflows, and an
interactive workspace that enables collaboration between data
scientists, data engineers, and business analysts.
• Azure Databricks is a fast, easy, and collaborative Apache Spark-based
analytics service.
Why Azure Databricks?
• For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real-time using Kafka, Event Hubs, or IoT Hub.
• This data lands in a data lake for long-term persistent storage, in Azure Blob Storage or Azure Data Lake Storage.
• As part of your analytics workflow, use Azure Databricks to read data
from multiple data sources such as Azure Blob Storage, Azure Data
Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and
turn it into breakthrough insights using Spark.
• Azure Databricks provides enterprise-grade Azure security, including
Azure Active Directory integration, role-based controls, and SLAs that
protect your data and your business.
• Structured Streaming is the Apache Spark API that lets you
express computation on streaming data in the same way
you express a batch computation on static data.
• The Spark SQL engine performs the computation
incrementally and continuously updates the result as
streaming data arrives.
• Databricks maintains the current checkpoint of the data
processed, making restart after failure nearly seamless.
• Can bring impactful insights to the users in almost real-
time.
Advantages of Structured Streaming
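Below is a minimal PySpark sketch of this batch/stream symmetry. It assumes a Delta table named `events` with an `event_type` column; the checkpoint path is a placeholder, and `spark` is the session a Databricks notebook provides.

```python
# Batch: read a static Delta table and aggregate it.
batch_counts = spark.read.table("events").groupBy("event_type").count()

# Streaming: the same computation expressed against the table as a stream.
stream_counts = spark.readStream.table("events").groupBy("event_type").count()

# The Spark SQL engine updates the aggregate incrementally as rows arrive;
# the checkpoint is what makes restart after a failure nearly seamless.
query = (stream_counts.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/mnt/checkpoints/event_counts")  # placeholder path
         .toTable("event_counts"))
```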
Streaming Data Source/Sinks
Sources:
• Azure Event Hubs / IoT Hubs
• Azure Data Lake Gen2 (Auto Loader)
• Apache Kafka
• Amazon Kinesis
• Amazon S3 with Amazon SQS
• Databricks Delta Tables
Sinks:
• Databricks Delta Tables
• Almost any sink using foreachBatch
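As an illustration of one of these sources, here is a hedged Auto Loader sketch; the storage account, container, schema location, and checkpoint path are placeholders.

```python
# Auto Loader (cloudFiles) incrementally picks up new files as they land in ADLS Gen2.
raw_stream = (spark.readStream
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", "/mnt/schemas/raw_events")    # placeholder
              .load("abfss://raw@mystorageacct.dfs.core.windows.net/events/"))   # placeholder

# Delta table as the sink; foreachBatch (covered later) reaches sinks without native support.
(raw_stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/raw_events")                    # placeholder
 .toTable("raw_events"))
```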
• Source Parameters
• Source Format/Location
• Batch/File Size
• Transformations
• Streaming data can be transformed in the
same ways as static data
• Output Parameters
• Output Format/Location
• Checkpoint Location
Structured Streaming
[Diagram: Event Hub source → Structured Streaming]
DEMO
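A sketch of a pipeline along these lines, assuming the azure-eventhubs-spark connector is installed on the cluster; the secret scope, message schema, and checkpoint path are illustrative assumptions, not part of the original demo.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Source parameters: the connector expects an encrypted connection string.
connection_string = dbutils.secrets.get("demo-scope", "eventhub-connection")   # hypothetical secret
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Transformations: Event Hubs delivers each message as a binary 'body' column,
# so cast it to a string and parse the JSON payload (schema assumed for the demo).
schema = StructType([
    StructField("transactionId", StringType()),
    StructField("customerId", StringType()),
    StructField("itemId", StringType()),
    StructField("timeStamp", TimestampType()),
])

transactions = (spark.readStream
                .format("eventhubs")
                .options(**eh_conf)
                .load()
                .select(F.from_json(F.col("body").cast("string"), schema).alias("t"))
                .select("t.*"))

# Output parameters: Delta table plus the checkpoint, which tracks the message
# offset on each Event Hubs partition so the stream can restart where it left off.
(transactions.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/transactions")                 # placeholder
 .toTable("transactions"))
```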
Join Operations
• Join Types
• Inner
• Left
• Not Stateful by default
Stream-Static Joins
[Diagram: Event Hub stream + static file → Structured Streaming]
DEMO
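A hedged sketch of a stream-static join, reusing the `transactions` stream from the earlier Event Hubs sketch; the `items` table and its `itemName` column are assumed for illustration.

```python
# Static dimension data: a Delta table that is re-read for each micro-batch.
items = spark.read.table("items")

# Inner and left joins of a stream to a static DataFrame are not stateful,
# so each micro-batch is simply matched against the static records.
enriched = (transactions
            .join(items, on="itemId", how="left")
            .select("transactionId", "customerId", "itemId", "itemName", "timeStamp"))

(enriched.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/enriched_transactions")  # placeholder
 .toTable("enriched_transactions"))
```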
• Join Types
• Inner (Watermark and Time
Constraint Optional)
• Left Outer (Watermark and Time Constraint Required)
• Right Outer (Watermark and Time Constraint Required)
• You can also Join Static
Tables/Files into your Stream-
Stream Join
Stream-Stream Joins
[Diagram: two Event Hub streams (optionally joined with static tables/files) processed as micro-batches in Structured Streaming]
• Watermark – How late a record can
arrive and after what time can it be
removed from the state.
• Time Constraint – How long the records will be kept in state in relation to the other stream
• Only used in stateful operation
• Ignored in non-stateful streaming
queries and batch queries
Watermark vs. Time Constraint
[Diagram: a transaction stream and a view stream, each from an Event Hub, joined in Structured Streaming]
Transaction 1/Customer 1/Item 1
Transaction 2/Customer 2/Item 1
Transaction 3/Customer 1/Item 2
View 1/Customer 1/Item 1
View 2/Customer 2/Item 2
View 3/Customer 3/Item 3
View 4/Customer 1/Item 2
Watermark
10 Minutes
Watermark
5 Minutes
Time constraint:
View.timeStamp >= Transaction.timeStamp
and View.timeStamp <= Transaction.timeStamp + interval 5 minutes
[Timeline, 10:00-10:15: Transaction 1 occurs 10:00, received 10:01; View 6 at 10:00; View 1 at 10:02; View 2 at 10:03; View 3 occurs 10:04, received 10:08; View 4 at 10:06; View 5 occurs 10:04, received 10:12. Watermark window: 10:00-10:10; time-constraint window: 10:00-10:05.]
DEMO
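A sketch of a stream-stream join for this scenario, assuming a `views` stream built the same way as the `transactions` stream, a 10-minute watermark on transactions and a 5-minute watermark on views (matching the diagram above), and the time constraint shown earlier.

```python
from pyspark.sql import functions as F

# Watermarks: how late an event may arrive before its state can be dropped.
transactions_wm = transactions.withWatermark("timeStamp", "10 minutes")
views_wm = views.withWatermark("timeStamp", "5 minutes")

# Time constraint: how long rows are held in state relative to the other stream.
# A left outer join also returns transactions with no view within 5 minutes.
joined = transactions_wm.alias("t").join(
    views_wm.alias("v"),
    F.expr("""
        v.customerId = t.customerId AND
        v.timeStamp >= t.timeStamp AND
        v.timeStamp <= t.timeStamp + interval 5 minutes
    """),
    "leftOuter",
)
```

Note that the left outer side only emits unmatched transactions once the watermark guarantees no matching view can still arrive.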
• Allows Batch Type Processing to be performed on Streaming Data
• Perform processes without adding to state
• dropDuplicates
• Aggregating data
• Perform a Merge/Upsert with Existing Static Data
• Write Data to multiple sinks/destinations
• Write Data to sinks not supported in Structured Streaming
foreachBatch
DEMO
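A hedged foreachBatch sketch covering the merge/upsert and multiple-sink cases, applied to the `transactions` stream from the earlier sketches; the target table name and JDBC connection details are placeholders.

```python
from delta.tables import DeltaTable

def process_micro_batch(micro_batch_df, batch_id):
    # Batch-style processing on the micro-batch: dedupe without adding to stream state.
    deduped = micro_batch_df.dropDuplicates(["transactionId"])

    # MERGE/upsert into an existing Delta table (assumed to exist).
    target = DeltaTable.forName(spark, "transactions_current")
    (target.alias("tgt")
     .merge(deduped.alias("src"), "tgt.transactionId = src.transactionId")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # A second sink in the same batch, e.g. an Azure SQL table that is not a
    # supported streaming sink (connection details are placeholders).
    (deduped.write
     .format("jdbc")
     .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=sales")
     .option("dbtable", "dbo.Transactions")
     .option("user", "demo")
     .option("password", dbutils.secrets.get("demo-scope", "sql-password"))
     .mode("append")
     .save())

(transactions.writeStream
 .foreachBatch(process_micro_batch)
 .option("checkpointLocation", "/mnt/checkpoints/transactions_merge")  # placeholder
 .start())
```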
• Spark Shuffle Partitions
• Set equal to the number of cores on the cluster
• Maximum Records per Micro-Batch
• File Source/Delta Lake – maxFilesPerTrigger, maxBytesPerTrigger
• Event Hubs – maxEventsPerTrigger
• Limit Stateful Operations – reduces state size and memory errors
• Watermarking
• MERGE/Join/Aggregation
• Broadcast Joins
• Output Tables – Influences downstream streams
• Manually re-partition
• Delta Lake – Auto-Optimize
Going to Production
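A few of these settings as a hedged sketch; the values, table name, and configuration choices are illustrative, not prescriptive.

```python
# Shuffle partitions: set to the cluster's core count before the stream first runs;
# for stateful queries the partition count is then fixed by the checkpoint.
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

# Bound the micro-batch size at the source (file/Delta sources shown here;
# the Event Hubs connector uses maxEventsPerTrigger instead).
bounded = (spark.readStream
           .option("maxFilesPerTrigger", 100)
           .option("maxBytesPerTrigger", "1g")
           .table("raw_events"))            # placeholder table

# Delta auto-optimize: compact small files as the stream writes them.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```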
Conclusion
Have Any Questions?
Editor's Notes
  • #3: Question responses from the polls. For the last year or so I have been working very heavily in Databricks – specifically using it for big data processing with Structured Streaming. So what we are going to look at today is for the user who maybe has played a little with Databricks, has used Spark in some other form in the past, and has at least an idea of or need for big data processing, specifically in a real-time solution.
  • #6: So why Azure Databricks? I have worked with many big data systems over the years on several different platforms, and I had also used Spark before. But as more of a data architect and developer, I was always put off by what seemed like the over-complexity of the Spark ecosystem. There were a lot of elements, it took a lot of "under the hood" setup and tuning, and I would just always rather use something else – especially as we moved to Azure and the cloud, where I could just throw a never-ending amount of processing at my big data problems. With Databricks I now get the best of both worlds: a simple-to-set-up, simple-to-maintain, easy-to-scale Spark-based system with all the development and processing benefits and without all the technical and administrative overhead. So with Azure Databricks you get Spark – directly from the people that invented it – but in a fast, easy, and collaborative cloud service.
  • #7: You also get great integration with all the other Azure elements – Event Hubs, Key Vaults, Data Lakes, Azure SQL, data warehouse, Data Factory, and even Azure DevOps. Then you overlay your existing Azure security model with Active Directory right on top to provide a completely integrated security model.
  • #8: Structured Streaming then allows you to take all of that integration and processing power and apply it to a stream of big data to gain near real-time processing capabilities. So you can process through large amounts of messages/events/files as they are received and perform the same computations on the data that you could with a static data set. At the same time, Databricks automatically keeps a record of the data as it is processed, allowing almost seamless restarts if a failure were to occur in the process. This allows you to generate datasets in near real time – providing marketable insights to your business.
  • #9: There are several different source and sink locations that can be used with streaming in Databricks. Within the Azure ecosystem, Azure Event Hubs and Databricks Delta tables in Azure Data Lake are the most popular, but other source streams like Apache Kafka or Amazon Kinesis are also popular. You can also use the file queue in Data Lake Gen2 with Auto Loader to load blob files as they are saved to a file location. You can use almost anything as a sink by using the foreachBatch method, which we will take a look at later.
  • #10: So a typical structured streaming pipeline is made up of 3 parts: the source, any transformations, and the output sink or destination. In our first example we will look at the source being an Event Hub message stream, add some minor transformations, and then sink the results to a Databricks Delta table. Each source has some specific options or parameters, such as format, connection information, file location, etc. The transformations can be any transformation you can perform on a static dataset. And the output can again have specific options and formats based on the type, including the destination location or partitioning information. The key element that makes the sink of a streaming data source different is the checkpoint location. This checkpoint allows the stream to keep track of which messages have been read from the source and, if the stream is interrupted, where to pick up on restart. In the case of the Event Hub queue, the checkpoint keeps track of the specific message offset on each partition. Also note that to use an Event Hub source you must add the Azure Event Hubs library to your cluster and import the microsoft.azure.eventhubs library into your notebook.
  • #11: TASK – Need data elements and code. Databricks environment Can all be in the same command, can be in as many commands as you want
  • #12: Structured Streaming supports joining a streaming Dataset or DataFrame with a static Dataset or DataFrame – such as binding our transactional table to other dimensional information, like sales info to an item table, customer information, or sales territories. It also supports joining to another streaming Dataset/DataFrame. The result of the streaming join is generated incrementally as the micro-batches are executed and looks similar to the results of our previous streaming aggregation example. So in the upcoming demonstrations we will look at a few of these examples and see how the join types (i.e. inner, outer, etc.) are handled. In all the supported join types, the result of the join with a streaming Dataset/DataFrame will be exactly the same as if it was with a static Dataset/DataFrame containing the same data as the stream.
  • #13: When a streaming dataset and a static dataset are used, then only an inner join and a left outer join are supported. Right outer joins and full outer joins are not supported. Inner joins and left outer joins on streaming and static datasets don’t have to be stateful, which improves your performance. The records in any single micro batch can be matched with a static set of records.
  • #14: TASK – need data and example code
  • #15: Stream-to-stream joins support inner, left, and right joins, but with differing requirements. While watermarking is not required on an inner join, it is best to use it unless you can be sure both records will exist at some point. Otherwise you may have records that stay in state indefinitely and are never cleaned up.
  • #16: It's very important to understand the difference between watermark and time constraint. Watermarking a stream decides how delayed a record can arrive and gives a time when records can be dropped. For example, if you set a watermark for 30 minutes, then records older than 30 minutes will be dropped/ignored. Time constraints decide how long the record will be retained in Spark's state in correlation to the other stream.
  • #17: So in our scenario we are going to receive our transaction data, and in addition we are going to get view data from our website. We want to analyze: for Customer X, after buying Item Y, how many other items did they view in the next 5 minutes? Another thing to remember that often gets people is that the watermark is not from the "current time"; it is from the last event time that the system saw. So if you have not received new messages in the stream, it will not apply.
  • #18: We have several possible outcomes. The transaction may be late, so how long do we want to keep that record? – this can depend on the volume of records and the source system. If you have a large volume, but few late records you can make this timeframe shorter. The views may be late, or even before the transaction – so again how long do we want to keep those records in memory – it has to be >= 5 minutes since that is our time constraint. They may not view anything else, so if we want to know that, we need to use a left join so we can get transactions that have no view data within 5 minutes.
  • #19: TASK – Need data elements and code. Databricks environment Can all be in the same command, can be in as many commands as you want
  • #20: The last element of structured streaming that we are going to review is foreachBatch. What foreachBatch really lets you do is "cheat" on your streaming. You can take the streaming micro-batch, put it in the foreachBatch method, then perform anything you could normally do in standard batch processing. One of the key things is to perform normally "stateful" processing – a great example of this is dropping duplicates. As you get into more complex data structures you might also need to perform aggregations on the micro-batch itself. So if you had a complex structure like a sales ticket that contained multiple individual sale items, you might want to aggregate those by item or department before saving them. In foreachBatch you could perform the aggregation, then save the data. Another great use is when you need to save the same streaming data to multiple sinks. This might be to update a summary dataset and to save the detailed record at the same time. This method can also be used to write data to sinks that are not supported in streaming – such as a SQL database table.
  • #21: TASK – need data and example code
  • #22: This topic could really be its own webinar, but I did want to touch on some of the items you will want to look at when you get ready to move to production with your stream. There is a really good session from the spark AI 2020 summit that does a very good job of what types of issue to look for and I will put that in the chat. https://databricks.com/session_na20/performant-streaming-in-production-preventing-common-pitfalls-when-productionizing-streaming-jobs But some of the items we want to watch for that are harder to fix once you have started to run a process in production are things like the shuffle partition setting, which can limit the disk shuffle and greatly increase performance. Once that is set the value is saved in the delta metadata and is hard to change if you need to scale up or down the number of cores on your cluster. Another is the “auto-optimize” setting on your delta tables. By default, if you write streaming data to delta you will get a lot of very small files. You can setup a job to optimize the tables periodically, but in a real-time environment it is best to let the system optimize as data is processed. You can set your delta tables to auto-optimize which will reduce your number of files and increase the size of the files to help downstream performance. You can also manipulate the size of the micro batch by changing the number of events/files/bytes that are consumed – depending on your source. This again is to help keep your processing from having to use disk for the shuffle partitions. Finally, as you design your streaming environment try to limit the number of stateful processes you bring into the streams. By limiting things like deduplication of the stream itself, the number of aggregations, the length of any watermarking, or by using the broadcast join hint on smaller static tables, you can greatly increase your record thruput and reduce memory usage and errors.