Dustin Vannoy
Data Engineer
Cloud + Streaming
Spark Streaming with
Azure Databricks
Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
© Microsoft Azure + AI Conference All rights reserved.
Agenda
 Shifting to Streaming
 Spark Structured Streaming
 Apache Kafka
 Azure Event Hubs
 Get Hands On
Shifting to Streaming
If you haven’t started with streaming, you will soon
Life’s a batch stream
Why Streaming?
Data Engineers have decided that the business only updates in batch. Our customers and business leaders know better.
Is streaming ingestion easier?
 Dealing with a large set of data at
once brings its own challenges
 Process as it comes in for cleaner
logic
 Even if not doing real-time analytics
yet, prepare for when you will
Spark Structured Streaming
Technology overview
Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows. Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
What is Spark?
 Fast, general purpose engine for large-scale data processing
 Replaces MapReduce as Hadoop parallel programming API
 Many options:
 YARN / Spark cluster / Local
 Scala / Python / Java / R
 Spark Core / SQL / Streaming / ML / Graph
What is Spark Structured Streaming?
 Alternative to traditional Spark Streaming which used
DStreams
 If you are familiar with Spark, it is best to think of
Structured Streaming as the Spark SQL API, but for streaming
 In Python, import from pyspark.sql.streaming (in Scala, org.apache.spark.sql.streaming)
What is Spark Structured Streaming?
Tathagata Das “TD” - Lead Developer on Spark Streaming
 “Fast, fault-tolerant, exactly-once stateful stream
processing without having to reason about streaming”
 “The simplest way to perform streaming analytics is not
having to reason about streaming at all”
 A table that is constantly appended with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
© Microsoft Azure + AI Conference All rights reserved.
Structured Streaming - Read
df = (spark.readStream
    .format("kafka")
    .options(**consumer_config)
    .load())
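The consumer_config dict above is not defined in the deck; a minimal sketch of what it might hold for the Spark Kafka source, with placeholder broker and topic names (option keys follow the spark-sql-kafka connector — "kafka."-prefixed settings pass through to the Kafka client):

```python
# Hypothetical consumer options for the Spark Kafka source.
# Broker address and topic name are placeholders, not from the deck.
consumer_config = {
    "kafka.bootstrap.servers": "broker1:9092",  # placeholder broker
    "subscribe": "rides",                        # placeholder topic
    "startingOffsets": "earliest",               # read from the beginning
}
```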
Structured Streaming - Write
(df.writeStream
    .format("kafka")
    .options(**producer_config)
    .option("checkpointLocation", "/tmp/cp001")
    .start())
Structured Streaming - Window
(df2.groupBy(
        col("VendorID"),
        window(col("pickup_dt"), "10 minutes"))
    .avg("trip_distance"))
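Under the hood, a 10-minute tumbling window just assigns each event to the bucket its timestamp falls in. A pure-Python sketch of that assignment, independent of Spark, for intuition only:

```python
from datetime import datetime, timedelta

def window_start(ts: datetime, minutes: int = 10) -> datetime:
    """Floor a timestamp to the start of its tumbling window."""
    delta = timedelta(minutes=ts.minute % minutes,
                      seconds=ts.second,
                      microseconds=ts.microsecond)
    return ts - delta

# Two events 4 minutes apart land in the same 10-minute window...
a = window_start(datetime(2020, 1, 1, 9, 3))
b = window_start(datetime(2020, 1, 1, 9, 7))
# ...while an event at 9:12 lands in the next one.
c = window_start(datetime(2020, 1, 1, 9, 12))
```

Spark then averages trip_distance within each (VendorID, window) bucket as new micro-batches arrive.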
Structured Streaming – Output Mode
 Output written at the end of each micro-batch (by default a new micro-batch starts as soon as the previous one completes; a fixed interval can be set with a trigger)
 Append mode
 Just keep adding newest records
 Complete mode
 Output latest state of table
 Useful for aggregation results
Apache Kafka
Technology overview
Why Kafka?
Streaming data directly from one system to another is often problematic. Kafka serves as a scalable broker, keeping up with producers and persisting data for all consumers.
The Log
“It is an append-only, totally-ordered sequence of records ordered by time.” - Jay Kreps
Reference: The Log - Jay Kreps
Kafka Topics
Feed to which records are published
Multiple partitions per topic
Order retained within partition
Consumers and offset
Offset = record id
Consumers read in order
Multiple consumers per topic
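These offset mechanics can be illustrated without Kafka at all. A toy sketch of one append-only partition where each consumer tracks its own offset (class and record names are invented for illustration):

```python
class Partition:
    """Toy append-only log: each record gets the next sequential offset."""
    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read_from(self, offset):
        """A consumer reads everything at or after its stored offset."""
        return self.records[offset:]

p = Partition()
for event in ["ride_start", "ride_end", "payment"]:
    p.append(event)

# A brand-new consumer (offset 0) sees all records, in order;
# a caught-up consumer (offset 3) sees nothing new yet.
assert p.read_from(0) == ["ride_start", "ride_end", "payment"]
assert p.read_from(3) == []
```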
Event Hubs
Technology overview
Why Event Hubs?
Same core capability as Kafka, using PaaS instead of IaaS. Choose between the Kafka or Event Hubs APIs and avoid the operational overhead of managing Kafka.
Event Hubs key concepts
 Namespace = container to hold multiple Event Hubs
 Event Hub = Topic
 Partitions and Consumer Groups
 Same concepts as Kafka
 Minor differences in implementation
 Throughput Units define level of scalability
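When a Kafka client talks to Event Hubs, the Kafka-compatible endpoint is exposed on port 9093 with SASL_SSL/PLAIN authentication, following the pattern in the Azure/azure-event-hubs-for-kafka samples. A sketch of the client settings — the namespace name, connection string, and hub name here are placeholders:

```python
# Placeholder values: substitute your own namespace and connection string.
# Pattern follows the Azure/azure-event-hubs-for-kafka samples.
connection_string = "Endpoint=sb://mynamespace.servicebus.windows.net/;..."

eventhubs_kafka_config = {
    "kafka.bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    # Event Hubs uses the literal username "$ConnectionString" with the
    # namespace connection string as the password.
    "kafka.sasl.jaas.config": (
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="$ConnectionString" password="{}";'.format(connection_string)
    ),
    "subscribe": "my-event-hub",  # the Event Hub name acts as the topic
}
```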
Event Hubs Namespace Setup
 Standard pricing to enable Kafka
 Each Throughput unit
 1 MB/s ingress
 2 MB/s egress
 Auto Inflate to allow autoscale
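Those per-unit figures make sizing simple arithmetic: take the larger of the two directions. A quick sketch (the example workload numbers are made up):

```python
import math

INGRESS_PER_TU_MB = 1  # MB/s in per throughput unit
EGRESS_PER_TU_MB = 2   # MB/s out per throughput unit

def required_tus(ingress_mb_s, egress_mb_s):
    """Throughput units needed to cover both directions."""
    return max(math.ceil(ingress_mb_s / INGRESS_PER_TU_MB),
               math.ceil(egress_mb_s / EGRESS_PER_TU_MB))

# Example: 6 MB/s in and 8 MB/s out needs max(6, 4) = 6 TUs
print(required_tus(6, 8))
```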
Event Hub Setup
 Partition count
 Sets max # of parallel consumers (per consumer group)
 Message retention
 More days = More $
 Capture
 Save to Azure Storage
Shared Access Key
Demo
Structured Streaming + Event Hubs for Kafka
References
 The Log - Jay Kreps
 https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
 https://github.com/Azure/azure-event-hubs-for-kafka
 https://github.com/Azure/azure-event-hubs-spark
 Please use EventsXD to fill out a session evaluation.
Thank you!
Editor's Notes
  • #2: In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough any more. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size. In this session we will talk about why we need to shift some of our workloads from batch data jobs to streaming in real-time. We'll dive into how Spark Structured Streaming in Azure Databricks enables this along with streaming data systems such as Kafka and Event Hubs. We will discuss the concepts, how Azure Databricks enables stream processing, and review code examples on a sample data set.
  • #4: Shifting to Streaming: We don’t have to convince our stakeholders that they don’t really need streaming. Understand the needs, find the right uses for streaming, and make it happen. Discuss pros and cons, considerations before going to production, and general use cases in AI/ML. Spark, Event Hubs, and Kafka: Define the systems we will be using for this session, including some of the reasons we choose them, and talk about some of the options for using these together. Getting Hands On: Review dependencies that are not covered, walk through basic setup of the most important pieces, and demo the use case code, highlighting some important Structured Streaming components. Best Practices: Cover things to consider when working with Spark Structured Streaming and Kafka or Event Hubs.
  • #7: In the world of data science, those of us who develop ETL pipelines have determined that everything can be processed in nightly or hourly batches, but that only makes sense to data engineers. Our customers and business leaders see information is being created all the time and realize it should be available much sooner.
  • #8: Dealing with a large set of data at once brings its own challenges (a lot of resources at once, large table joins, run out of memory, etc) Process as it comes in for cleaner logic (rather than seeing latest state, we see events as they happen and update state downstream) Even if not doing real-time analytics yet, prepare for when you will - the times they are a’changin
  • #11: A fast and general engine for large-scale data processing that uses memory to provide a performance benefit. Often replaces MapReduce as the parallel programming API on Hadoop: the way it handles data (RDDs) provides one performance benefit, and use of memory when possible provides another large one. Can run on Hadoop (using YARN) but also as a separate Spark cluster. Local mode is possible as well but reduces the performance benefits; I find it’s still a useful API though. Run Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying Python and Scala and picking whichever is easiest for you. Several modules for different use cases share a similar API, so you can swap between modes relatively easily. For example, we have both streaming and batch sources of some data and we reuse the rest of the Spark processing transformations.
  • #16: Window is essentially like grouping. Continuously compute the average distance for each vendor over the last 10 minutes
  • #30: Quick overview of important databricks workspace segments – Clusters, Tables, Notebooks Open create_parquet_tables notebook and run first few commands as examples of working without delta