Data Con LA 2022 Keynote

Next Generation Apache Spark Structured Streaming
Karthik Ramasamy
Head of Streaming, Databricks
Project #Lightspeed

Stream Processing
Sources: DBMS / CDC, apps, collection agents, IoT devices
Streaming data lands in a message bus (e.g. Pulsar, Kafka) or files
Streaming transformations: window aggregation, pattern detection, enrichment, routing
Data is processed continuously and incrementally as it appears
Outputs: triggers and alerts, real-time analytics applications, operational applications
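
As an illustration of window aggregation over data that is processed incrementally as it appears, here is a minimal sketch in plain Python. It is not Spark code: the function name `tumbling_window_counts`, the event layout, and the device ids are all assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_s):
    """Yield a running (window_start, key, count) update after every event.

    `events` is an iterable of (event_time_s, key) pairs. Each event is
    folded into the count of its fixed-size window as soon as it appears,
    instead of waiting for the whole data set.
    """
    counts = defaultdict(int)
    for event_time_s, key in events:
        # Align the event time to the start of its tumbling window.
        window_start = (event_time_s // window_s) * window_s
        counts[(window_start, key)] += 1
        yield window_start, key, counts[(window_start, key)]


# Hypothetical sensor readings: (seconds since start, device id).
events = [(1, "pump-a"), (4, "pump-a"), (7, "pump-b"), (12, "pump-a")]
updates = list(tumbling_window_counts(events, window_s=10))
print(updates)
```

Each update could feed a trigger, an alert, or a dashboard; a real engine would also need to decide when a window is final, which this sketch leaves out.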

Explosion of streaming
Trillions of rows of data processed from thousands of sources
Industries: Manufacturing; Retail; Financial Services; Healthcare; Energy; Gaming; Technology & Software; Media & Entertainment
Use cases: Fraud Detection; Personalization; Covid-19 Response; Predictive Maintenance; Smart Pricing; Player Interaction Analytics; Connected Cars, Smart Homes; Content Recommendations

Growth of Spark Structured Streaming
>150% YoY streaming job growth
Most downloaded streaming engine from Maven Central

1200+ customers using Structured Streaming on the Lakehouse
9x growth in usage in 3 years

Spark Structured Streaming
Powers thousands of the applications you use in everyday life
Unified Batch & Streaming APIs
Lets developers use the same business logic across batch and stream processing
Fault Tolerance & Recovery
Automatic checkpointing & failure recovery allowing for reliable operations
Performance | Throughput
Handles > 14M events/sec (1.2T events per day) for the most challenging workloads
Flexible operations
Arbitrary logic and operations on the output of a streaming query
Stateful Processing
Support for stateful aggregations and joins along with watermarks for bounded states
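
To make "watermarks for bounded states" concrete, the following is a simplified sketch in plain Python. It is an assumption-laden model, not Spark's implementation: the function `aggregate_with_watermark`, its parameters, and the rule of advancing the watermark after every event are chosen only for illustration.

```python
def aggregate_with_watermark(events, window_s, allowed_lateness_s):
    """Count events per tumbling window while keeping state bounded.

    The watermark trails the largest event time seen so far by
    `allowed_lateness_s`. A window whose end is at or before the
    watermark is finalized and removed from state; events that arrive
    for such a window afterwards are dropped.
    """
    state = {}        # window_start -> count, open windows only
    finalized = []    # (window_start, count) for closed windows
    dropped = 0
    max_event_time = 0
    for event_time_s in events:
        max_event_time = max(max_event_time, event_time_s)
        watermark = max_event_time - allowed_lateness_s
        window_start = (event_time_s // window_s) * window_s
        if window_start + window_s <= watermark:
            dropped += 1  # too late: this window was already finalized
        else:
            state[window_start] = state.get(window_start, 0) + 1
        # Evict every window the watermark has passed, oldest first.
        for start in sorted(s for s in state if s + window_s <= watermark):
            finalized.append((start, state.pop(start)))
    return finalized, state, dropped


# Hypothetical event times in seconds; 3 is late but accepted, 8 is too late.
finalized, open_state, dropped = aggregate_with_watermark(
    [1, 5, 12, 3, 27, 8], window_s=10, allowed_lateness_s=5
)
print(finalized, open_state, dropped)
```

Without the eviction step, `state` would grow with every new window; the watermark is what lets old windows be emitted and forgotten.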

New streaming applications
Examples: Proactive Maintenance in Oil Drilling, Elevator Dispatch, Tracing Microservices
1. Consistent sub-second latency
2. Ease of expressing processing logic for complex use cases
3. Integrations with new cloud source and sink systems

Structured Streaming needs to evolve to satisfy these new requirements

Project Lightspeed
Next generation of Spark Structured Streaming

Project Lightspeed
Faster and simpler stream processing
Predictable Low Latency
Target reduction in tail latency by up to 2x
Enhanced Functionality
Advanced capabilities for processing data with new operators and easy-to-use APIs
Operations & Troubleshooting
Simplifying deployment, operations, monitoring, and troubleshooting
Connectors & Ecosystem
Improving ecosystem support for connectors, authentication & authorization features

Project Lightspeed - Predictable Low Latency
Faster bookkeeping - Offset management
[Diagram: micro-batch timelines, sequential vs. overlapped offset management]
Sequential: each micro-batch persists its offset ranges to external storage, is processed, and is then marked done in external storage before the next micro-batch starts.
Overlapped: offset ranges are persisted to external storage asynchronously while micro-batches 1, 2 and 3 are processed.
Latency: 440 ms (sequential) vs. 120 ms (overlapped)
73% improvement in latency for stateless pipelines
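
The difference between sequential and overlapped offset bookkeeping can be sketched in plain Python as follows. This is a toy model under stated assumptions, not how Spark is implemented: `OffsetLog`, `run_sequential`, `run_overlapped`, and the simulated storage delay are hypothetical names and values. The last lines check that the 440 ms and 120 ms figures quoted on the slide are consistent with the stated 73% improvement.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class OffsetLog:
    """Hypothetical stand-in for external storage of offset ranges."""

    def __init__(self, write_delay_s=0.001):
        self.entries = {}
        self.write_delay_s = write_delay_s

    def persist(self, batch_id, offset_range):
        time.sleep(self.write_delay_s)  # simulated storage latency
        self.entries[batch_id] = offset_range


def process(records):
    """Placeholder for the micro-batch's actual processing."""
    return sum(records)


def run_sequential(batches, log):
    """Persist offsets, then process: the write blocks every micro-batch."""
    results = []
    for batch_id, (offsets, records) in enumerate(batches):
        log.persist(batch_id, offsets)
        results.append(process(records))
    return results


def run_overlapped(batches, log):
    """Persist offsets on a background thread while processing continues."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = []
        for batch_id, (offsets, records) in enumerate(batches):
            pending.append(pool.submit(log.persist, batch_id, offsets))
            results.append(process(records))
        for future in pending:
            future.result()  # surface any storage error before returning
    return results


# Hypothetical micro-batches: ((first_offset, last_offset), records).
batches = [((0, 99), [1, 2, 3]), ((100, 199), [4, 5]), ((200, 299), [6])]
seq_log, ovl_log = OffsetLog(), OffsetLog()
seq_results = run_sequential(batches, seq_log)
ovl_results = run_overlapped(batches, ovl_log)

# Latency figures from the slide: (440 - 120) / 440 is roughly 73%.
improvement_pct = round((440 - 120) / 440 * 100)
print(seq_results, ovl_results, improvement_pct)
```

Both variants record the same offset ranges and produce the same results; the overlapped one simply takes the storage write off the processing path.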

Project Lightspeed - Python as a first class citizen
Input & Output: csv(), json(), parquet(), orc(), schema(), text(), foreach(), foreachBatch()
Aggr & Grouping: agg(), count(), min(), max(), mean(), groupby(), orderby()
Filtering: select(), selectExpr(), distinct(), where()
map(), mapValues(), flatMap(), flatMapValues()
Query Management: awaitTermination(), exception(), explain(), status, stop()
Joins, etc: crossJoin(), crosstab(), join(), union(), unionAll()
DDL Operations: createGlobalTempView(), createTempView(), drop(), drop_duplicates(), registerTempTable()
Windowing: window(), session_window()
Arbitrary Stateful Processing: mapGroupsWithState(), flatMapGroupsWithState()
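
The idea behind arbitrary stateful processing, where user logic is called per key with state that survives across micro-batches, can be sketched in plain Python. This is not the PySpark API: the driver function `flat_map_groups_with_state`, the callback signature, and the threshold of 100 are assumptions made up for this illustration.

```python
def flat_map_groups_with_state(batches, update_fn):
    """Run `update_fn` once per key per micro-batch with persistent state.

    `update_fn(key, values, state)` receives the key, that key's values
    in the current micro-batch, and the state saved from earlier batches
    (None the first time). It returns (new_state, emitted_rows).
    """
    state_store = {}
    outputs = []
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        for key, values in grouped.items():
            new_state, emitted = update_fn(key, values, state_store.get(key))
            state_store[key] = new_state
            outputs.extend(emitted)
    return outputs, state_store


def alert_on_running_total(key, values, state):
    """Hypothetical user logic: emit a row once the running total hits 100."""
    total = (state or 0) + sum(values)
    emitted = [(key, total)] if total >= 100 else []
    return total, emitted


# Two hypothetical micro-batches of (key, value) records.
batches = [
    [("elevator-1", 40), ("elevator-2", 10)],
    [("elevator-1", 70), ("elevator-2", 20)],
]
alerts, final_state = flat_map_groups_with_state(batches, alert_on_running_total)
print(alerts, final_state)
```

The user function stays free of bookkeeping: the driver owns grouping and the state store, which is the division of labor this family of operators is meant to provide.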

Project Lightspeed - Improve Debuggability
Visualize the pipeline as data flow
Provide timeline view of metrics for operators
Group operator metrics by executor
Incorporate source and sink specific metrics

Interested in Collaboration?
SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency

Karthik Ramasamy
Head of Streaming
Thank you
