Dustin Vannoy
Data Engineer
Cloud + Streaming
Spark Streaming with
Azure Databricks
Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
© Microsoft Azure + AI Conference All rights reserved.
Agenda
 Shifting to Streaming
 Spark Structured Streaming
 Apache Kafka
 Azure Event Hubs
 Get Hands On
Shifting to Streaming
If you haven’t started with streaming, you will soon
Life’s a batch stream
Why Streaming?
Data Engineers have decided that the business only updates in batch. Our customers and business leaders know better.
Is streaming ingestion easier?
 Dealing with a large set of data at
once brings its own challenges
 Process as it comes in for cleaner
logic
 Even if not doing real-time analytics
yet, prepare for when you will
Spark Structured Streaming
Technology overview
Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows. Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
What is Spark?
 Fast, general purpose engine for large-scale data processing
 Replaces MapReduce as Hadoop parallel programming API
 Many options:
 YARN / Spark cluster / Local
 Scala / Python / Java / R
 Spark Core / SQL / Streaming / ML / Graph
What is Spark Structured Streaming?
 Alternative to traditional Spark Streaming which used
DStreams
 If you are familiar with Spark, it is best to think of
Structured Streaming as the Spark SQL API, but for streaming
 In Python, import from pyspark.sql.streaming (in Scala, org.apache.spark.sql.streaming)
What is Spark Structured Streaming?
Tathagata Das “TD” - Lead Developer on Spark Streaming
 “Fast, fault-tolerant, exactly-once stateful stream
processing without having to reason about streaming”
 “The simplest way to perform streaming analytics is not
having to reason about streaming at all”
 A table that is constantly appended with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
© Microsoft Azure + AI Conference All rights reserved.
Structured Streaming - Read
df = (spark.readStream
    .format("kafka")
    .options(**consumer_config)
    .load())
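The consumer_config dict above is not defined in the deck; a minimal sketch of what it might hold for the Spark Kafka source, with placeholder broker and topic names (option keys follow the spark-sql-kafka connector — "kafka."-prefixed settings pass through to the Kafka client):

```python
# Hypothetical consumer options for the Spark Kafka source.
# Broker address and topic name are placeholders, not from the deck.
consumer_config = {
    "kafka.bootstrap.servers": "broker1:9092",  # placeholder broker
    "subscribe": "rides",                        # placeholder topic
    "startingOffsets": "earliest",               # read from the beginning
}
```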
Structured Streaming - Write
(df.writeStream
    .format("kafka")
    .options(**producer_config)
    .option("checkpointLocation", "/tmp/cp001")
    .start())
Structured Streaming - Window
(df2.groupBy(
        col("VendorID"),
        window(col("pickup_dt"), "10 minutes"))
    .avg("trip_distance"))
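Under the hood, a 10-minute tumbling window just assigns each event to the bucket its timestamp falls in. A pure-Python sketch of that assignment, independent of Spark, for intuition only:

```python
from datetime import datetime, timedelta

def window_start(ts: datetime, minutes: int = 10) -> datetime:
    """Floor a timestamp to the start of its tumbling window."""
    delta = timedelta(minutes=ts.minute % minutes,
                      seconds=ts.second,
                      microseconds=ts.microsecond)
    return ts - delta

# Two events 4 minutes apart land in the same 10-minute window...
a = window_start(datetime(2020, 1, 1, 9, 3))
b = window_start(datetime(2020, 1, 1, 9, 7))
# ...while an event at 9:12 lands in the next one.
c = window_start(datetime(2020, 1, 1, 9, 12))
```

Spark then averages trip_distance within each (VendorID, window) bucket as new micro-batches arrive.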
Structured Streaming – Output Mode
 Output written at the end of each micro-batch (by default a new micro-batch starts as soon as the previous one completes; a fixed interval can be set with a trigger)
 Append mode
 Just keep adding newest records
 Complete mode
 Output latest state of table
 Useful for aggregation results
Apache Kafka
Technology overview
Why Kafka?
Streaming data directly from one system to another is often problematic. Kafka serves as a scalable broker, keeping up with producers and persisting data for all consumers.
The Log
“It is an append-only, totally-ordered sequence of records ordered by time.” - Jay Kreps
Reference: The Log - Jay Kreps
Kafka Topics
Feed to which records are published
Multiple partitions per topic
Order retained within partition
Consumers and offset
Offset = record id
Consumers read in order
Multiple consumers per topic
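These offset mechanics can be illustrated without Kafka at all. A toy sketch of one append-only partition where each consumer tracks its own offset (class and record names are invented for illustration):

```python
class Partition:
    """Toy append-only log: each record gets the next sequential offset."""
    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read_from(self, offset):
        """A consumer reads everything at or after its stored offset."""
        return self.records[offset:]

p = Partition()
for event in ["ride_start", "ride_end", "payment"]:
    p.append(event)

# A brand-new consumer (offset 0) sees all records, in order;
# a caught-up consumer (offset 3) sees nothing new yet.
assert p.read_from(0) == ["ride_start", "ride_end", "payment"]
assert p.read_from(3) == []
```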
Event Hubs
Technology overview
Why Event Hubs?
Same core capability as Kafka, using PaaS instead of IaaS. Choose between the Kafka or Event Hubs APIs and avoid the operational overhead of managing Kafka.
Event Hubs key concepts
 Namespace = container to hold multiple Event Hubs
 Event Hub = Topic
 Partitions and Consumer Groups
 Same concepts as Kafka
 Minor differences in implementation
 Throughput Units define level of scalability
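When a Kafka client talks to Event Hubs, the Kafka-compatible endpoint is exposed on port 9093 with SASL_SSL/PLAIN authentication, following the pattern in the Azure/azure-event-hubs-for-kafka samples. A sketch of the client settings — the namespace name, connection string, and hub name here are placeholders:

```python
# Placeholder values: substitute your own namespace and connection string.
# Pattern follows the Azure/azure-event-hubs-for-kafka samples.
connection_string = "Endpoint=sb://mynamespace.servicebus.windows.net/;..."

eventhubs_kafka_config = {
    "kafka.bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    # Event Hubs uses the literal username "$ConnectionString" with the
    # namespace connection string as the password.
    "kafka.sasl.jaas.config": (
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="$ConnectionString" password="{}";'.format(connection_string)
    ),
    "subscribe": "my-event-hub",  # the Event Hub name acts as the topic
}
```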
Event Hubs Namespace Setup
 Standard pricing to enable Kafka
 Each Throughput unit
 1 MB/s ingress
 2 MB/s egress
 Auto Inflate to allow autoscale
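Those per-unit figures make sizing simple arithmetic: take the larger of the two directions. A quick sketch (the example workload numbers are made up):

```python
import math

INGRESS_PER_TU_MB = 1  # MB/s in per throughput unit
EGRESS_PER_TU_MB = 2   # MB/s out per throughput unit

def required_tus(ingress_mb_s, egress_mb_s):
    """Throughput units needed to cover both directions."""
    return max(math.ceil(ingress_mb_s / INGRESS_PER_TU_MB),
               math.ceil(egress_mb_s / EGRESS_PER_TU_MB))

# Example: 6 MB/s in and 8 MB/s out needs max(6, 4) = 6 TUs
print(required_tus(6, 8))
```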
Event Hub Setup
 Partition count
 Sets max # of parallel consumers (per consumer group)
 Message retention
 More days = More $
 Capture
 Save to Azure Storage
Shared Access Key
Demo
Structured Streaming + Event Hubs for Kafka
References
 The Log - Jay Kreps
 https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
 https://github.com/Azure/azure-event-hubs-for-kafka
 https://github.com/Azure/azure-event-hubs-spark
 Please use EventsXD to fill out a session evaluation.
Thank you!
Editor's Notes
  • #2: In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough any more. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size. In this session we will talk about why we need to shift some of our workloads from batch data jobs to streaming in real-time. We'll dive into how Spark Structured Streaming in Azure Databricks enables this along with streaming data systems such as Kafka and Event Hubs. We will discuss the concepts, how Azure Databricks enables stream processing, and review code examples on a sample data set.
  • #4: Shifting to Streaming: We don’t have to convince our stakeholders that they don’t really need streaming. Understand the needs, find the right uses for streaming, and make it happen. Discuss pros and cons, considerations before going to production, and general use cases in AI/ML. Spark, Event Hubs, and Kafka: Define the systems we will be using for this session, including some of the reasons we choose them, and talk about some of the options for using these together. Getting Hands On: Review dependencies that are not covered, walk through basic setup of the most important pieces, and demo the use case code, highlighting some important Structured Streaming components. Best Practices: Cover things to consider when working with Spark Structured Streaming and Kafka or Event Hubs.
  • #7: In the world of data science, those of us who develop ETL pipelines have determined that everything can be processed in nightly or hourly batches, but that only makes sense to data engineers. Our customers and business leaders see information is being created all the time and realize it should be available much sooner.
  • #8: Dealing with a large set of data at once brings its own challenges (a lot of resources at once, large table joins, run out of memory, etc) Process as it comes in for cleaner logic (rather than seeing latest state, we see events as they happen and update state downstream) Even if not doing real-time analytics yet, prepare for when you will - the times they are a’changin
  • #11: A fast and general engine for large-scale data processing that uses memory to provide a performance benefit. Often replaces MapReduce as the parallel programming API on Hadoop: the way it handles data (RDDs) provides one performance benefit, and use of memory when possible provides another large one. Can run on Hadoop (using YARN) but also as a separate Spark cluster. Local mode is possible as well but reduces the performance benefits; I find it’s still a useful API though. Run Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying Python and Scala and picking whichever is easiest for you. Several modules for different use cases share a similar API, so you can swap between modes relatively easily. For example, we have both streaming and batch sources of some data and we reuse the rest of the Spark processing transformations.
  • #16: Window is essentially like grouping. Continuously compute the average distance for each vendor over the last 10 minutes
  • #30: Quick overview of important databricks workspace segments – Clusters, Tables, Notebooks Open create_parquet_tables notebook and run first few commands as examples of working without delta