Next Generation Apache Spark
Structured Streaming
Karthik Ramasamy
Head of Streaming, Databricks
Project #Lightspeed
Stream Processing
DBMS / CDC, Apps, collection agents, IoT devices
Streaming data lands in message bus (e.g. Pulsar, Kafka) / Files
Streaming Transformations: Window aggregation, Pattern detection, Enrichment, Routing
Data continuously, incrementally processed as it appears
Triggers and Alerts
Real-time Analytics
Applications
Operational Applications
Explosion of streaming
Trillions of rows of data processed from thousands of sources
Industries: Manufacturing; Retail; Financial Services; Healthcare; Energy; Gaming; Technology & Software; Media & Entertainment
Use cases: Fraud Detection; Personalization; Covid-19 Response; Predictive Maintenance; Smart Pricing; Player Interaction Analytics; Connected Cars, Smart Homes; Content Recommendations
Growth of Spark Structured Streaming
>150% YoY streaming job growth
Most downloaded streaming engine from Maven Central
1200+ customers using Structured Streaming on the Lakehouse
9x growth in usage in 3 years
Spark Structured Streaming
Powers thousands of your everyday life applications today
Unified Batch & Streaming APIs
Lets developers use the same business logic across batch and stream processing
Fault Tolerance & Recovery
Automatic checkpointing & failure recovery allowing for reliable operations
Performance | Throughput
Handles > 14M events/sec (1.2T events per day) for the most challenging workloads
Flexible operations
Arbitrary logic and operations on the output of a streaming query
Stateful Processing
Support for stateful aggregations and joins along with watermarks for bounded states
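The "watermarks for bounded states" idea can be illustrated with a small plain-Python sketch. It is not Spark code: the class name, the window length and the lateness value are all hypothetical, and the watermark here advances per event, whereas a micro-batch engine would advance it per batch. It only shows why a watermark lets an engine finalize old windows and drop their state.

```python
from collections import defaultdict

WINDOW_SECONDS = 10    # tumbling window length (hypothetical value)
ALLOWED_LATENESS = 5   # how far the watermark trails the max event time (hypothetical)


class WindowedCounter:
    """Counts events per tumbling window; evicts windows behind the watermark."""

    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> running count
        self.finalized = {}                   # window start -> final count
        self.max_event_time = 0               # highest event time seen so far

    def process(self, event_time):
        """Returns True if the event was counted, False if it was too late."""
        watermark = self.max_event_time - ALLOWED_LATENESS
        window_start = event_time - event_time % WINDOW_SECONDS
        # An event is too late when its whole window lies behind the watermark.
        if window_start + WINDOW_SECONDS <= watermark:
            return False
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        self._evict(self.max_event_time - ALLOWED_LATENESS)
        return True

    def _evict(self, watermark):
        # Finalize every window that ends at or before the watermark, so the
        # amount of state kept in memory stays bounded.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= watermark:
                self.finalized[start] = self.open_windows.pop(start)


counter = WindowedCounter()
results = [counter.process(t) for t in [1, 3, 12, 4, 27, 2]]
```

In this run the event at time 4 arrives out of order but is still counted, while the final event at time 2 is dropped because its window was already finalized.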
New streaming applications
Proactive Maintenance in Oil Drilling
Elevator Dispatch
Tracing Microservices
1. Consistent sub-second latency
2. Ease of expressing processing logic for complex use cases
3. Integrations with new cloud source and sink systems
Structured Streaming needs to evolve to satisfy these new requirements
Project Lightspeed
Next generation of Spark Structured Streaming
Project Lightspeed
Faster and simpler stream processing
Predictable Low Latency
Target reduction in tail latency by up to 2x
Enhanced Functionality
Advanced capabilities for processing data with new operators and easy-to-use APIs
Operations & Troubleshooting
Simplifying deployment, operations, monitoring, and troubleshooting
Connectors & Ecosystem
Improving ecosystem support for connectors, authentication & authorization features
Project Lightspeed - Predictable Low Latency
Faster bookkeeping - Offset management
Sequential (440 ms): for each micro-batch, persist offset ranges to external storage, process the micro-batch, then mark the batch done in external storage
Overlapped (120 ms): persist offset ranges to external storage asynchronously while the micro-batch is being processed
73% improvement in latency for stateless pipelines
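The difference between sequential and overlapped offset bookkeeping can be sketched in plain Python. This is a toy model, not Spark internals: the function names and the sleep durations are hypothetical stand-ins for external-storage writes and micro-batch processing. The last helper only checks the arithmetic behind the slide's 73% figure.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STORAGE_WRITE_S = 0.02  # hypothetical latency of one external-storage write
PROCESS_S = 0.02        # hypothetical processing time of one micro-batch


def write_to_storage(record):
    """Stand-in for a durable write to external storage."""
    time.sleep(STORAGE_WRITE_S)
    return record


def process_batch(batch_id):
    """Stand-in for running one micro-batch."""
    time.sleep(PROCESS_S)


def run_sequential(num_batches):
    """Persist offsets, process, then mark the batch done, one after another."""
    start = time.perf_counter()
    for batch_id in range(num_batches):
        write_to_storage(("offsets", batch_id))
        process_batch(batch_id)
        write_to_storage(("done", batch_id))
    return time.perf_counter() - start


def run_overlapped(num_batches):
    """Persist offsets in the background while the batch is processed."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        for batch_id in range(num_batches):
            pending = pool.submit(write_to_storage, ("offsets", batch_id))
            process_batch(batch_id)
            # Assumption of this sketch: wait for the offset write before
            # starting the next batch. There is no separate "done" write.
            pending.result()
    return time.perf_counter() - start


def improvement(before_ms, after_ms):
    """Relative latency reduction, e.g. 440 ms -> 120 ms."""
    return (before_ms - after_ms) / before_ms


sequential_s = run_sequential(3)
overlapped_s = run_overlapped(3)
slide_improvement_pct = round(improvement(440, 120) * 100)
```

With the slide's numbers, (440 - 120) / 440 is about 0.727, which rounds to the quoted 73%.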
Project Lightspeed - Python as a first class citizen
Aggr & Grouping: agg(), count(), min(), max(), mean(), groupby(), orderBy()
Filtering: select(), selectExpr(), distinct(), where()
map(), mapValues(), flatMap(), flatMapValues()
Input & Output: csv(), json(), parquet(), orc(), schema(), text(), foreach(), foreachBatch()
Query Management: awaitTermination(), exception(), explain(), status, stop()
Joins, etc: crossJoin(), crosstab(), join(), union(), unionAll()
DDL Operations: createGlobalTempView(), createTempView(), drop(), drop_duplicates(), registerTempTable()
Windowing: window(), session_window()
Arbitrary Stateful Processing: mapGroupsWithState(), flatMapGroupsWithState()
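The speaker notes name an exponentially weighted average as a use case that needs arbitrary stateful processing. A minimal plain-Python sketch of that per-key state update is shown here. It does not use any Spark API: the function names, the dictionary standing in for a state store, and the smoothing factor are all assumptions made for illustration.

```python
ALPHA = 0.5  # smoothing factor (hypothetical value)


def update_ewma(previous, value, alpha=ALPHA):
    """Returns the new exponentially weighted average for one key."""
    if previous is None:
        # First value seen for this key: the average starts at that value.
        return value
    return alpha * value + (1 - alpha) * previous


def process_micro_batch(state_store, records):
    """Applies each (key, value) record to the per-key state, in order."""
    for key, value in records:
        state_store[key] = update_ewma(state_store.get(key), value)
    return state_store


# State survives across micro-batches, which is what makes this "stateful".
state = {}
process_micro_batch(state, [("sensor-a", 10.0), ("sensor-b", 4.0)])
process_micro_batch(state, [("sensor-a", 20.0)])
```

The key point is that the result for a key depends on everything seen for that key so far, so the engine has to keep and checkpoint that state between micro-batches.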
Project Lightspeed - Improve Debuggability
Visualize the pipeline as data flow
Provide timeline view of metrics for operators
Group operator metrics by executor
Incorporate source and sink specific metrics
and many more…
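As a rough illustration of "group operator metrics by executor", the sketch below regroups flat metric samples in plain Python. The record shape, the executor and operator names, and the function name are hypothetical; the slides do not describe the actual metrics format.

```python
from collections import defaultdict


def group_metrics_by_executor(samples):
    """Sums processed rows per operator, grouped under each executor."""
    grouped = defaultdict(lambda: defaultdict(int))
    for sample in samples:
        grouped[sample["executor"]][sample["operator"]] += sample["rows"]
    # Convert nested defaultdicts to plain dicts for easy printing/comparison.
    return {executor: dict(ops) for executor, ops in grouped.items()}


# Hypothetical samples, one per operator per micro-batch.
samples = [
    {"executor": "exec-1", "operator": "filter", "rows": 100},
    {"executor": "exec-2", "operator": "filter", "rows": 80},
    {"executor": "exec-1", "operator": "aggregate", "rows": 40},
    {"executor": "exec-1", "operator": "filter", "rows": 50},
]
by_executor = group_metrics_by_executor(samples)
```

Grouping this way makes it easier to spot a single slow or overloaded executor than scanning metrics reported only at the micro-batch level.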
Interested in Collaboration?
SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency
Karthik Ramasamy
Head of Streaming
Thank you

Data Con LA 2022 Keynote

Editor's Notes

  • #2: <TRANSITION TO KARTHIK> So what happened in the last 6-9 months is that we’ve invested heavily in building up a strong streaming team that’s going to take Structured Streaming and elevate it to the next level. We actually have the CEO of Pulsar, Karthik, who is going to present this talk. He built a very popular streaming engine prior to this that many of you may have used, and today we are very excited to introduce Karthik to share our vision to grow Structured Streaming to the next level.
  • #3: We have seen an explosion of streaming applications across all industries. In fact, data streaming is part of your everyday life and is reshaping and transforming every industry you can imagine: in finance, in retail, in healthcare, in manufacturing…
  • #4: We have seen an explosion of streaming applications across all industries. In fact, data streaming is part of your everyday life and is reshaping and transforming every industry you can imagine: in finance, in retail, in healthcare, in manufacturing…
  • #5: KARTHIK: Thank you, Ali. We are very data-driven at Databricks and we’ve been looking at the metrics, and of all the numbers we’ve seen, this is the most surprising statistic that I’ve seen at Databricks. And we haven’t even done much on this; in fact, we developed Structured Streaming many years ago and not too much investment went into it, and still the growth is 160% of a large base. This is a significant portion of our revenue. Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer communities. The majority of streaming workloads we saw were customers migrating their batch workloads to take advantage of the lower latency, fault tolerance, and support for incremental processing that streaming has to offer. The result is that we have seen tremendous adoption from streaming customers for both open source Spark and Databricks. The graph below shows the weekly number of streaming jobs on Databricks over the past three years, which has grown from thousands to 3+ million, and is still accelerating. Per Matei - to update, not to use graph, but to say a double digit percentage of our workflows is streaming and have a number here and we see that increasing over time. X many trillions of records p/day.
  • #6: …and many of our customers, from enterprises to startups, have adopted and are continuing to adopt streaming in the lakehouse.
  • #7: Why do I believe Spark Structured Streaming is growing? Several properties of Structured Streaming have made it popular, and here are the top 5. Unification - The foremost advantage of Structured Streaming is that it uses the same API as batch processing, making the transition to real-time processing from batch much simpler. Fault Tolerance & Recovery - Structured Streaming checkpoints state automatically at every stage of processing. When a failure occurs, it automatically recovers from the previous state. The failure recovery is very fast since it is restricted to failed tasks as opposed to restarting the entire streaming pipeline in other systems. AFAIK, SS runs on spot instances, making streaming cost effective. Performance - Structured Streaming provides very high throughput with seconds of latency at a lower cost, taking full advantage of the performance optimizations in the Spark SQL engine. Flexible Operations - The ability to apply arbitrary logic and operations on the output of a streaming query using foreachBatch. This enables developers to perform operations like upserts, writes to multiple sinks, as well as interaction with external data sources. Over 40% of our users on Databricks take advantage of this feature. Stateful Processing - Support for stateful aggregations and joins along with watermarks for bounded state and late order processing. In addition, arbitrary stateful operations with [flat]mapGroupsWithState backed by a RocksDB state store are provided for efficient and fault-tolerant state management (as of Spark 3.2).
  • #8: As SS grew in leaps and bounds, developers started using it for emerging new applications such as: monitoring expensive drill bits continuously and stopping them from hitting rock surfaces; continuously monitoring the data from elevators for emergencies and quickly alerting the dispatch; stitching together the requests and responses from logs of microservices that serve a web request, for tracing and troubleshooting. These exposed some of the shortcomings of SS such as … I think if we can address all of these, we will be able to increase adoption and see skyrocketing growth. So,
  • #9: What are we doing about it?
  • #10: I am very excited to announce that we are launching Project Lightspeed to take SS into the next generation.
  • #11: Project Lightspeed advances SS across four pillars … In the next few slides, I will give a glimpse of some of the Lightspeed features.
  • #12: SS has several bookkeeping steps - (b) plan offset ranges, (e) mark batch done. These are forced into storage, (b) and (a), and in sequence, which increases latency. In the default trigger, eliminate (e) and overlap the execution of the micro-batch with storing the offset range asynchronously.
  • #13: SS pipelines can be programmed using multiple languages: Java, Scala, Python and SQL. Python is a popular choice. Python provides several APIs … But there is a gap: arbitrary stateful processing, which is needed for exponentially weighted averages. The key challenge with this API is executing arbitrary Python code in a JVM system.
  • #14: Streaming pipelines are brittle. There can be several reasons - a surge in data to be processed, resources not adequately provisioned, a bug in user code. SS provides tons of metrics & logs at the micro-batch level.
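
The automatic checkpointing and recovery described in note #7 can be illustrated with a small plain-Python sketch. It is a simplified model, not how Spark implements checkpoints: the JSON file layout, the function names and the simulated failure are assumptions, and it restarts the whole loop rather than only failed tasks. It shows the core idea that progress and state are persisted together, so a restart resumes from the last completed batch without reprocessing or losing data.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Returns the last saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_offset": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, state):
    """Writes the checkpoint to a temp file, then swaps it in atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(source, path, batch_size=2, fail_at_batch=None):
    """Sums the source in micro-batches, checkpointing after each batch."""
    state = load_checkpoint(path)
    batch_no = 0
    while state["next_offset"] < len(source):
        if fail_at_batch is not None and batch_no == fail_at_batch:
            raise RuntimeError("simulated failure")
        start = state["next_offset"]
        batch = source[start:start + batch_size]
        state = {
            "next_offset": start + len(batch),
            "total": state["total"] + sum(batch),
        }
        save_checkpoint(path, state)
        batch_no += 1
    return state["total"]


source = [1, 2, 3, 4, 5]
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run(source, checkpoint_path, fail_at_batch=1)  # fails after the first batch
except RuntimeError:
    pass
recovered_total = run(source, checkpoint_path)     # resumes from offset 2
```

The second call picks up at the saved offset, so the first batch is not counted twice and the final total matches a run with no failure.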