What's New in Apache Spark 3.0 & Why Care?
Good evening, friends of Bogotá Spark!
Jules S. Damji, Databricks
Apache Spark Bogota Meetup
September 23, 2020
Jules S. Damji
Senior Developer Advocate @
Databricks
Joined Databricks in 2016
20+ years as a software engineer at companies:
Sun, Netscape, VeriSign, @Home,
LoudCloud/Opsware, Hortonworks, etc.
About Me
Unified data analytics platform for
data science, data engineering, and business analytics to solve tough
data problems
Original creators of popular data and machine learning open source projects
Global company with 5,000 customers and 450+ partners
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Performance
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
Enhancements
DELETE/UPDATE/
MERGE in Catalyst
Reserved
Keywords
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL Compatibility
Built-in Data Sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested
Column Filter
Pushdown
CSV Filter
Pushdown
New Binary
Data Source
Data Source V2 API +
Catalog Support
Java 11 Support
Hadoop 3
Support
Hive 3.x Metastore
Hive 2.3 Execution
Extensibility and Ecosystem
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
3400+ Resolved
JIRAs
in Spark 3.0
Blog
Agenda
Performance
Spark 3.0 comes with performance improvements to make Spark faster, cheaper, and more flexible
Usability
Spark is easier to use
Compatibility Considerations
View notable compatibility/behavior changes
Spark Ecosystem
Learn about developments in Delta Lake, Project Hydrogen, and Project Zen
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Blog
Spark Catalyst Optimizer
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Spark 3.0, Rule + Cost + Runtime
Optimization in Spark 2.x
Blog
Adaptive Query Execution
Based on statistics of the finished plan nodes, re-optimize the execution
plan of the remaining queries
▪ Dynamically switch join strategies
▪ Dynamically coalesce shuffle partitions
▪ Dynamically optimize skew joins
adaptive planning
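These three capabilities all hang off a handful of configuration flags. A minimal sketch of enabling AQE in PySpark, using the property names documented for Spark 3.0 (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Sketch: enable Adaptive Query Execution and its runtime optimizations.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # master switch (off by default in 3.0)
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # coalesce shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
    .getOrCreate()
)
# Join-strategy switching (e.g., SMJ -> BHJ) happens automatically once AQE is on.
```

Because AQE re-optimizes between query stages, no query changes are needed; the flags alone activate the behavior.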
Performance Pitfall
Choose Broadcast Hash Join?
▪ Increase "spark.sql.autoBroadcastJoinThreshold"?
▪ Use a "broadcast" hint?
However
Hard to tune
Hard to maintain over time
OOM…
Using the wrong join strategy
Adaptive Query Execution
Vision: No more manual setting of broadcast hints/thresholds! Capability: SMJ -> BHJ at runtime
Diagram: the static plan picks a sort-merge join (shuffle write + sort on both children) based on an estimated size of 15 MB; at runtime the actual shuffle output is only 8 MB, so AQE replans the not-yet-started stages as a broadcast hash join, reusing the completed shuffle output via local shuffle reads.
Performance Pitfall
Tuning spark.sql.shuffle.partitions
▪ Default magic number: 200 !?!
However
▪ Too small: GC pressure; disk spilling
▪ Too large: Inefficient I/O; scheduler pressure
▪ Hard to tune over the whole query plan
▪ Hard to maintain over time
Choosing the wrong shuffle partition number
Adaptive Query Execution
Vision: No more manual tuning of spark.sql.shuffle.partitions! Capability: Coalesce shuffle partitions
Diagram: the initial plan is Scan -> Filter -> Shuffle (50 partitions) -> Sort; after Stage 1 executes, AQE re-optimizes the remaining plan and inserts a Coalesce (5 partitions) node between the shuffle and the sort.
Set the initial partition number (200 or some X) high enough to accommodate the largest
data size of the entire query execution;
Spark then automatically coalesces partitions if needed after each query stage
Performance Pitfall
Symptoms of data skew
▪ Frozen/long-running tasks
▪ Disk spilling
▪ Low resource utilization in most nodes
▪ OOM
Common workarounds
▪ Find the skewed values and rewrite the queries
▪ Add extra salt keys…
Data skew
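Key salting, the manual pre-3.0 workaround, can be sketched in plain Python: each hot key is split into N sub-keys so its rows spread over N partitions, and the small side is replicated once per salt value so the join still matches. The data and N below are illustrative:

```python
import random

# Sketch: key salting for a skewed join key.
N = 4  # number of salt buckets per key
big = [("hot", i) for i in range(1000)] + [("cold", 0)]  # skewed "fact" side
small = [("hot", "H"), ("cold", "C")]                    # "dimension" side

# Salt the big side: key -> (key, random bucket).
salted_big = [((k, random.randrange(N)), v) for k, v in big]

# Replicate the small side once per salt value so every bucket finds a match.
salted_small = [((k, s), v) for k, v in small for s in range(N)]

# "hot" is now spread over up to N join keys instead of one giant partition.
hot_buckets = {key for key, _ in salted_big if key[0] == "hot"}
```

AQE's skew-join handling makes this manual rewrite unnecessary in most cases.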
Anybody dealt with
data skews while
running Spark jobs?
Adaptive Query Execution
Data Skew
Adaptive Query Execution
Vision: No more manual tuning of skew hints! Capability: Split skewed shuffle partitions at runtime
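AQE's skew handling is configuration-driven; a hedged sketch, assuming an existing SparkSession `spark` (the threshold values shown are the documented 3.0 defaults, not recommendations):

```python
# Sketch: enable AQE skew-join handling.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition is treated as skewed if it is larger than skewedPartitionFactor
# times the median partition size AND larger than the byte threshold below.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```

Skewed partitions are then split into smaller tasks at runtime, with the other join side replicated as needed.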
AQE Configuration Settings
Property Name: spark.sql.adaptive.coalescePartitions.enabled
Default: true
Meaning: When true and spark.sql.adaptive.enabled is true, Spark coalesces contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes) to avoid too many small tasks.
Since Version: 3.0.0

Property Name: spark.sql.adaptive.coalescePartitions.minPartitionNum
Default: Default Parallelism
Meaning: The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. Only takes effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
Since Version: 3.0.0

Property Name: spark.sql.adaptive.coalescePartitions.initialPartitionNum
Default: 200
Meaning: The initial number of shuffle partitions before coalescing. By default it equals spark.sql.shuffle.partitions. Only takes effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
Since Version: 3.0.0

Property Name: spark.sql.adaptive.advisoryPartitionSizeInBytes
Default: 64 MB
Meaning: The advisory size in bytes of a shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). Takes effect when Spark coalesces small shuffle partitions or splits a skewed shuffle partition.
Since Version: 3.0.0
AQE is not enabled by default. Set spark.sql.adaptive.enabled to true to use the features above.
Adaptive Query Execution
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Dynamic Partition Pruning
• Skip scanning partitions based on
the query results of other
query fragments.
• Important for star-schema
queries.
• Significant speedup in TPC-DS.
Static Partition Pruning
SELECT * FROM Sales WHERE store_id = 5
Most optimizations employ simple static partition pruning
Basic Data Flow Filter Push-down
Partitioned files with
multi-columnar data
A Common Workload
SELECT * FROM Sales JOIN Stores
WHERE Stores.city = 'New York'
Scan
Sales
Join
Scan
Stores
Filter
city = 'New York'
● Static pruning cannot be applied
● Filter is only acting on the smaller
dimensional table, not the larger fact
table
Small dimensional table
Larger fact table
Star Schema Queries
Table Denormalization
SELECT * FROM Sales JOIN Stores
WHERE Stores.city = 'New York'
Scan
Sales
Join
Scan
Stores
Filter
city = 'New York'
Scan
Dynamic Partition Pruning
Physical Plan Optimization
Partitioned files with
multi-columnar data
File Scan
Dynamic Filter
Broadcast
Exchange
File Scan with DIM
filter
Broadcast Hash
Join
SCAN Fact Table
Dynamic Partition Pruning
60 of 102 TPC-DS queries show a speedup between 2x and 18x
Dynamic Partition Pruning Configuration
Dynamic Partition Pruning is enabled by default
Property Name: spark.sql.optimizer.dynamicPartitionPruning.enabled
Default: true
Meaning: When true, Spark generates a predicate for the partition column when it is used as a join key.
Since Version: 3.0.0
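Nothing changes in the query itself; DPP kicks in automatically for a partitioned fact table joined to a filtered dimension table. A sketch of the query shape that benefits, assuming an existing SparkSession `spark` (table and column names are hypothetical):

```python
# Sketch: a star-schema join that can trigger dynamic partition pruning,
# assuming `sales` is partitioned by store_id and `stores` is a small dimension table.
result = spark.sql("""
    SELECT s.*
    FROM sales s
    JOIN stores st ON s.store_id = st.store_id
    WHERE st.city = 'New York'
""")
# At runtime, Spark broadcasts the filtered store_ids and scans only the
# matching partitions of `sales`, instead of the whole fact table.
result.explain(mode="formatted")  # the fact-table scan shows a dynamic pruning expression
```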
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Optimizer Hints
▪ Join hints influence the optimizer's choice of join strategy
▪ Broadcast hash join
▪ Sort-merge join NEW
▪ Shuffle hash join NEW
▪ Shuffle nested loop join NEW
▪ Should be used with extreme caution.
▪ Difficult to manage over time.
Join Strategies
Sort-Merge
▪ Most robust: handles any data size
▪ Needs to shuffle and sort
▪ Can be slow when the table size is small
Broadcast Hash
▪ Requires one side to be small
▪ No shuffle or sort
▪ Very fast
Shuffle Hash
▪ Needs to shuffle, but no sort
▪ Can handle large tables
▪ Will OOM if data is skewed
Shuffle Nested Loop
▪ Doesn't require join keys
▪ Broadcast Hash Join
SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key
▪ Sort-Merge Join
SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Hash Join
SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Nested Loop Join
SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
How to Use SQL Join Hints?
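The same hints are available on DataFrames via DataFrame.hint; a sketch, assuming DataFrames `a` and `b` already exist (the hint names below are the Spark 3.0 ones):

```python
from pyspark.sql.functions import broadcast

# Sketch: the four join hints through the DataFrame API.
a.join(broadcast(b), "key")                    # broadcast hash join
a.join(b.hint("merge"), "key")                 # sort-merge join
a.join(b.hint("shuffle_hash"), "key")          # shuffle hash join
a.join(b.hint("shuffle_replicate_nl"), "key")  # shuffle-and-replicate nested loop join
```

As with the SQL hints, these override the optimizer's own choice, so the same cautions apply.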
Join Hint Syntax
Shuffle merge hint, shown in both SQL and Python.
Enable new use cases and simplify Spark application development
Formatted Explain
SQL Engine
pandas UDF
enhancements
Usability and Richer APIs
Spark SQL: Old Explain
How many of you
have scratched
your heads
reading this?
Spark SQL: New EXPLAIN FORMATTED
Header: Basic operating tree for
the execution plan
Footer: Each operator
with additional attributes
* Project (4)
+- * Filter (3)
+- * ColumnarToRow (2)
+- Scan parquet default.tab1 (1)
(1) Scan parquet default.tab1
Output [2]: [key#5, val#6]
Batched: true
Location: InMemoryFileIndex [file:/user/hive/warehouse/tab1]
PushedFilters: [IsNotNull(key)]
ReadSchema: struct<key:int,val:int>
(2) ColumnarToRow [codegen id : 1]
Input [2]: [key#5, val#6]
(3) Filter [codegen id : 1]
Input [2]: [key#5, val#6]
Condition : (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#27, [id=#164]))
(4) Project [codegen id : 1]
Output [2]: [key#5, val#6]
Input [2]: [key#5, val#6]
EXPLAIN FORMATTED
SELECT *
FROM tab1
WHERE key = (SELECT max(key)
FROM tab2
WHERE val > 5)
DataFrame.explain(mode)
Modes:
• simple
• extended
• codegen
• formatted
query = """SELECT * FROM tab1
WHERE key = (SELECT max(key)
FROM tab2 WHERE val > 5)"""
df = spark.sql(query)
df.explain(mode="formatted")
Enable new use cases and simplify Spark application development
Usability and Richer APIs
pandas UDF
enhancements
Structured
Streaming
Pandas UDFs (a.k.a. Vectorized UDFs)
Introduced in Spark 2.3; redesigned in Spark 3.0 around Python type hints.
Spark 3.0 also splits the old grouped-map Pandas UDF into Pandas Function APIs.
Supported function APIs include:
▪ Grouped Map
▪ Map
▪ Co-grouped Map
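A sketch of the Spark 3.0 style, where the UDF type is inferred from Python type hints instead of a PandasUDFType argument, assuming an existing SparkSession `spark` (column names are illustrative):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 3.0: type hints (pd.Series -> pd.Series) identify this as a scalar Pandas UDF.
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df = spark.range(3).selectExpr("cast(id as double) as x")
df.select(plus_one("x")).show()

# Pandas Function API: grouped map via applyInPandas.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(x=pdf.x - pdf.x.mean())

df.groupBy().applyInPandas(subtract_mean, schema="x double")
```

The old Spark 2.3 syntax still works in 3.0, but the type-hint form is the recommended one.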
Spark + AI Session
Blog
Enable new use cases and simplify Spark application development
Usability and Richer APIs
Structured
Streaming UI
Improved Web UI: Structured Streaming Tab
Improved Web UI: Structured Streaming
Get real-time metrics via the structured
streaming tab including:
▪ Input rate
▪ Process rate
▪ Input rows
▪ Batch duration
▪ Operation duration
▪ 2-minute window display
Documentation
Structured Streaming UI
Improve the plug-in interface and extend the deployment environments
Hive 3.x Metastore
Hive 2.3 Execution
Hadoop 3
Support
Java 11
Support
Compatibility and Migration Considerations
Spark 3.0 Builds
• Only builds with Scala 2.12
• Deprecates Python 2 (already EOL)
• Can build with various Hadoop/Hive versions
– Hadoop 2.7 + Hive 1.2
– Hadoop 2.7 + Hive 2.3 (supports Java 11) [Default]
– Hadoop 3.2 + Hive 2.3 (supports Java 11)
• Supports the following Hive metastore versions:
– "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
The Apache Spark Ecosystem
A New Standard for Building Data Lakes
A new approach to building Data Lakes
• Open format based on Parquet, with ACID
transactions
• Adds reliability, data quality, and performance to Data
Lakes
• Brings the best of data warehousing and data lakes
• Based on open source and open format (Parquet)
• Enabled by Apache Spark
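A minimal sketch of reading and writing a Delta table from Spark, assuming the Delta Lake package (e.g. io.delta:delta-core_2.12) is on the classpath and an existing SparkSession `spark`; the path is hypothetical:

```python
# Sketch: ACID writes and reads with Delta Lake's Spark data source.
df = spark.range(100)
df.write.format("delta").mode("overwrite").save("/tmp/events")  # transactional write
delta_df = spark.read.format("delta").load("/tmp/events")       # consistent read

# Time travel: read an earlier version of the same table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
```

Concurrent readers never see a half-written table: the transaction log makes each write atomic.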
Challenges with data lakes
1. Hard to append data. Adding newly
arrived data leads to incorrect reads.
2. Modification of existing data is difficult. GDPR/CCPA
require making fine-grained changes to existing data
lakes. Very costly with Spark.
3. Jobs failing midway. Half of the data appears
in the data lake, the rest is missing.
How many of
you have built
data lakes?
Challenges with data lakes
4. Real-time operations are hard. Mixing streaming and batch
leads to inconsistency.
5. Costly to keep historical versions of the data. Regulated
environments require reproducibility, auditing, and governance.
6. Difficult to handle large metadata. For large data lakes,
the metadata itself becomes difficult to manage.
Challenges with data lakes
7. "Too many files" problem. Data lakes are
not great at handling millions of small files.
8. Fine-grained access control is difficult. Enforcing
enterprise-wide role-based access control on data
is difficult.
Challenges with data lakes
9. Hard to get great performance. Partitioning the data for
performance is error-prone and difficult to change.
10. Data quality issues. Hard to ensure that all the
data is correct and of the right quality.
Challenges with data lakes
1. Hard to append data
2. Modification of existing data difficult
3. Jobs failing mid way
4. Real-time operations hard
5. Costly to keep historical data versions
6. Difficult to handle large metadata
7. Poor performance
8. “Too many files” problem
9. Fine grained access control difficult
10. Data quality issues
ACID transactions
Spark under the hood:
- auto-indexing
- fine-grained ACLs and RBAC
- schema enforcement and evolution
From Data Lakes -> Delta Lake -> Lakehouse
VLDB Conference paper.
Delta Lake Connectors
Standardize your big data storage with an open format accessible from various tools
Amazon Redshift
Amazon Athena
https://databricks.com/diving-into-delta-lake-talks
https://delta.io
Project Hydrogen
What is Project Hydrogen?
The goal of Project Hydrogen is to enable first-class
support for all distributed ML frameworks
https://vimeo.com/274267107
Incompatible Execution Models
Spark
▪ Tasks are independent
▪ Parallel and massively scalable
▪ If a task crashes, rerun it
Distributed ML Frameworks
▪ Complete coordination among tasks
▪ Optimized for communication
▪ If a task crashes, rerun all tasks
Barrier Execution Mode (Spark 2.4)
▪ Since 2.4, gang scheduling has been implemented on top of the
MapReduce execution model
▪ Gang scheduling enables barrier execution mode
Stage 1: Data Prep
(embarrassingly parallel)
Stage 2: Dist ML training
(gang scheduled)
Stage 3: Data Sink
(embarrassingly parallel)
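Gang scheduling is exposed through the RDD barrier API; a sketch, assuming an existing SparkContext `sc` (the training function here is a stand-in for launching a real distributed ML framework):

```python
from pyspark import BarrierTaskContext

# Sketch: run all tasks of a stage together under barrier execution mode.
def train(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()            # wait until every task in the stage reaches this point
    # ... hand off to the distributed training framework here ...
    yield ctx.partitionId()

rdd = sc.parallelize(range(8), numSlices=4)
ids = rdd.barrier().mapPartitions(train).collect()
```

Under barrier mode, either all four tasks are scheduled at once or none are, and a failure of one task restarts the whole stage, matching the ML frameworks' execution model.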
Accelerator Aware Scheduling (Spark 3.0) - Motivation
▪ Deep learning workloads often use GPUs or other accelerators to
speed up processing on large datasets
▪ Popular cluster managers YARN and Kubernetes support GPUs
▪ Spark 2.x can support those cluster managers, but is not aware of
available GPUs and cannot request or schedule them
Accelerator Aware Scheduling in Spark 3.0
▪ Used to accelerate special workloads like deep learning and signal
processing
▪ Supports Standalone, YARN, and Kubernetes
▪ Supports GPUs
▪ Required resources are specified by configuration, so it works only at
the application level
Future work:
▪ Support TPUs, FPGAs, etc.
▪ Support job/stage/task level resource allocation
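Accelerators are requested through configuration and read back inside tasks; a sketch (the amounts and the discovery-script path are illustrative, and a GPU-equipped cluster is assumed):

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

# Sketch: request one GPU per executor and one per task (Spark 3.0 resource configs).
spark = (
    SparkSession.builder
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")  # illustrative path
    .getOrCreate()
)

def which_gpu(_):
    # Each task can see the accelerator addresses assigned to it.
    yield TaskContext.get().resources()["gpu"].addresses

spark.sparkContext.parallelize(range(2), 2).mapPartitions(which_gpu).collect()
```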
Web UI for accelerators
Project Zen: PySpark Improvements
• Redesigning PySpark documentation
• PySpark type hints
• Visualization
• Standardized warnings and exceptions
Blog
How many of you
have scratched
your head looking
at a PySpark stack
trace?
THE FREE VIRTUAL EVENT FOR DATA TEAMS
● Three days of sessions, keynotes, training and demos
● Catch up on rapid advances and best practices in
Apache Spark™, Delta Lake, MLflow and Redash
● Network with more than 20,000 data professionals
from across Europe and around the world
Thank you for your support & contributions!
Happy 10th Birthday
cc: Acknowledgements to Xiao Li and Doug Bateman
More Related Content

PDF
Hyperspace: An Indexing Subsystem for Apache Spark
PDF
Deep Dive into the New Features of Apache Spark 3.0
PDF
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
PDF
Koalas: Making an Easy Transition from Pandas to Apache Spark
PDF
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
PDF
Data Security at Scale through Spark and Parquet Encryption
PDF
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
PDF
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Hyperspace: An Indexing Subsystem for Apache Spark
Deep Dive into the New Features of Apache Spark 3.0
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Koalas: Making an Easy Transition from Pandas to Apache Spark
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Data Security at Scale through Spark and Parquet Encryption
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
From HDFS to S3: Migrate Pinterest Apache Spark Clusters

What's hot (20)

PDF
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
PDF
Designing and Building Next Generation Data Pipelines at Scale with Structure...
PDF
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
PDF
Accelerating Data Processing in Spark SQL with Pandas UDFs
PDF
Deep Dive into GPU Support in Apache Spark 3.x
PPTX
Catalyst optimizer
PDF
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
PDF
Speed up UDFs with GPUs using the RAPIDS Accelerator
PDF
Deep Dive into the New Features of Apache Spark 3.1
PDF
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
PDF
Building Robust, Adaptive Streaming Apps with Spark Streaming
PDF
Faster Data Integration Pipeline Execution using Spark-Jobserver
PDF
Informational Referential Integrity Constraints Support in Apache Spark with ...
PDF
Building a SIMD Supported Vectorized Native Engine for Spark SQL
PDF
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
PDF
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
PDF
Improving Apache Spark for Dynamic Allocation and Spot Instances
PDF
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
PDF
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
PDF
How to performance tune spark applications in large clusters
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Accelerating Data Processing in Spark SQL with Pandas UDFs
Deep Dive into GPU Support in Apache Spark 3.x
Catalyst optimizer
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Speed up UDFs with GPUs using the RAPIDS Accelerator
Deep Dive into the New Features of Apache Spark 3.1
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Building Robust, Adaptive Streaming Apps with Spark Streaming
Faster Data Integration Pipeline Execution using Spark-Jobserver
Informational Referential Integrity Constraints Support in Apache Spark with ...
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Improving Apache Spark for Dynamic Allocation and Spot Instances
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
How to performance tune spark applications in large clusters
Ad

Similar to Apache Spark 3.0: Overview of What’s New and Why Care (20)

PDF
What’s New in the Upcoming Apache Spark 3.0
PDF
Spark + AI Summit recap jul16 2020
PDF
Mastering Query Optimization Techniques for Modern Data Engineers
PPTX
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
PPTX
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PPTX
Running Presto and Spark on the Netflix Big Data Platform
PDF
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
PDF
Headaches and Breakthroughs in Building Continuous Applications
PPTX
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
PDF
Healthcare Claim Reimbursement using Apache Spark
PDF
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
PPTX
PASS Spanish Recomendaciones para entornos de SQL Server productivos
PPTX
Explore big data at speed of thought with Spark 2.0 and Snappydata
PPTX
iceberg introduction.pptx
PPTX
Make your SharePoint fly by tuning and optimizing SQL Server
PPTX
Espc17 make your share point fly by tuning and optimising sql server
PDF
How Adobe Does 2 Million Records Per Second Using Apache Spark!
PDF
Very large scale distributed deep learning on BigDL
PDF
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PDF
PostgreSQL Table Partitioning / Sharding
What’s New in the Upcoming Apache Spark 3.0
Spark + AI Summit recap jul16 2020
Mastering Query Optimization Techniques for Modern Data Engineers
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
Running Presto and Spark on the Netflix Big Data Platform
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
Headaches and Breakthroughs in Building Continuous Applications
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Healthcare Claim Reimbursement using Apache Spark
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
PASS Spanish Recomendaciones para entornos de SQL Server productivos
Explore big data at speed of thought with Spark 2.0 and Snappydata
iceberg introduction.pptx
Make your SharePoint fly by tuning and optimizing SQL Server
Espc17 make your share point fly by tuning and optimising sql server
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Very large scale distributed deep learning on BigDL
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PostgreSQL Table Partitioning / Sharding
Ad

More from Databricks (20)

PPTX
DW Migration Webinar-March 2022.pptx
PPTX
Data Lakehouse Symposium | Day 1 | Part 1
PPT
Data Lakehouse Symposium | Day 1 | Part 2
PPTX
Data Lakehouse Symposium | Day 2
PPTX
Data Lakehouse Symposium | Day 4
PDF
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
PDF
Democratizing Data Quality Through a Centralized Platform
PDF
Learn to Use Databricks for Data Science
PDF
Why APM Is Not the Same As ML Monitoring
PDF
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
PDF
Stage Level Scheduling Improving Big Data and AI Integration
PDF
Simplify Data Conversion from Spark to TensorFlow and PyTorch
PDF
Scaling your Data Pipelines with Apache Spark on Kubernetes
PDF
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
PDF
Sawtooth Windows for Feature Aggregations
PDF
Re-imagine Data Monitoring with whylogs and Spark
PDF
Raven: End-to-end Optimization of ML Prediction Queries
PDF
Processing Large Datasets for ADAS Applications using Apache Spark
PDF
Massive Data Processing in Adobe Using Delta Lake
PDF
Machine Learning CI/CD for Email Attack Detection
DW Migration Webinar-March 2022.pptx
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 4
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Democratizing Data Quality Through a Centralized Platform
Learn to Use Databricks for Data Science
Why APM Is Not the Same As ML Monitoring
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Stage Level Scheduling Improving Big Data and AI Integration
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Sawtooth Windows for Feature Aggregations
Re-imagine Data Monitoring with whylogs and Spark
Raven: End-to-end Optimization of ML Prediction Queries
Processing Large Datasets for ADAS Applications using Apache Spark
Massive Data Processing in Adobe Using Delta Lake
Machine Learning CI/CD for Email Attack Detection

Recently uploaded (20)

PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PPTX
Introduction to Artificial Intelligence
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
Digital Strategies for Manufacturing Companies
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PDF
medical staffing services at VALiNTRY
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
AI in Product Development-omnex systems
PPTX
Operating system designcfffgfgggggggvggggggggg
PDF
System and Network Administration Chapter 2
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
Understanding Forklifts - TECH EHS Solution
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PPTX
ai tools demonstartion for schools and inter college
PPTX
L1 - Introduction to python Backend.pptx
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
Introduction to Artificial Intelligence
Which alternative to Crystal Reports is best for small or large businesses.pdf
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PTS Company Brochure 2025 (1).pdf.......
Digital Strategies for Manufacturing Companies
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
medical staffing services at VALiNTRY
Adobe Illustrator 28.6 Crack My Vision of Vector Design
AI in Product Development-omnex systems
Operating system designcfffgfgggggggvggggggggg
System and Network Administration Chapter 2
How to Choose the Right IT Partner for Your Business in Malaysia
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
Understanding Forklifts - TECH EHS Solution
Upgrade and Innovation Strategies for SAP ERP Customers
ai tools demonstartion for schools and inter college
L1 - Introduction to python Backend.pptx

Apache Spark 3.0: Overview of What’s New and Why Care

  • 1. What's New in Apache Spark 3.0 & Why Care? Buenas noches amigos de Bogotá Spark! J Jules S. Damji, Databricks Apache Spark Bogota Meetup September 23, 2020
  • 2. Jules S. Damji Senior Developer Advocate @ Databricks Joined Databricks in 2016 20+ Software Engineer at companies: Sun, Netscape, VeriSign, @Home, LoudCloud/Opsware, Hortonworks etc About Me
  • 3. Unified data analytics platform for data science, data engineering, and business analytics to solve tough data problems Original creators of popular data and machine learning open source projects Global company with 5,000 customers and 450+ partners
  • 4. Adaptive Query Execution Dynamic Partition Pruning Query Compilation Speedup Join Hints Performance Richer APIs Accelerator-aware Scheduler Built-in Functions pandas UDF Enhancements DELETE/UPDATE/ MERGE in Catalyst Reserved Keywords Proleptic Gregorian Calendar ANSI Store Assignment Overflow Checking SQL Compatibility Built-in Data Sources Parquet/ORC Nested Column Pruning Parquet: Nested Column Filter Pushdown CSV Filter Pushdown New Binary Data Source Data Source V2 API + Catalog Support Java 11 Support Hadoop 3 Support Hive 3.x Metastore Hive 2.3 Execution Extensibility and Ecosystem Structured Streaming UI DDL/DML Enhancements Observable Metrics Event Log Rollover Monitoring and Debuggability Delta ,
  • 6. Agenda Performance Spark 3.0 comes with performance improvements to make Spark faster, cheaper, and more flexible Usability Spark is easier to use Compatibility Considerations View notable compatibility/behavior changes Spark Ecosystem Learn about developments in Delta Lake , Project Hydrogen and Project Zen
  • 7. Performance Achieve high performance for interactive, batch, streaming and ML workloads Adaptive Query Execution Dynamic Partition Pruning Join Hints Blog
  • 8. Spark Catalyst Optimizer Spark 1.x, Rule Spark 2.x, Rule + Cost Spark 3.0, Rule + Cost + Runtime
  • 10. Adaptive Query Execution Based on statistics of the finished plan nodes, re-optimize the execution plan of the remaining queries ▪ Dynamically switch join strategies ▪ Dynamically coalesce shuffle partitions ▪ Dynamically optimize skew joins adaptive planning
  • 11. Performance Pitfall Choose Broadcast Hash Join? ▪ Increase “spark.sql.autoBroadcastJoinThreshold”? ▪ Use “broadcast” hint? However Hard to tune Hard to maintain over time OOM… Using the wrong join strategy
  • 12. Adaptive Query Execution Vision: No more manual setting of broadcast hints/thresholds! Capability: SMJ -> BHJ at runtime SMJ Sort Sort Shuffle Write Shuffle Write Left Child Right Child BHJ Broadcast Shuffle Shuffle Left Child Right Child Static size: 15MB Actual: 8MB Shuffle Read Shuffle Read Local Shuffle Read Local Shuffle Read Not Started Done New Plan - Changed
  • 13. Performance Pitfall Tuning spark.sql.shuffle.partitions ▪ Default magic number: 200 !?! However ▪ Too small: GC pressure; disk spilling ▪ Too large: Inefficient I/O; scheduler pressure ▪ Hard to tune over the whole query plan ▪ Hard to maintain over time Choosing the wrong shuffle partition number
  • 14. Adaptive Query Execution VISION: No more manual tuning of spark.shuffle.partitions! Capability: Coalesce shuffle partitions Filter Scan Execute Shuffle (50 part.) Sort Stage 1 OptimizeFilter Scan Shuffle (50 part.) Sort Stage 1 Filter Scan Shuffle (50 part.) Sort Stage 1 Coalesce (5 part.) Set the initial partition number 200 or X to accommodate the largest data size of the entire query execution Automatically coalesce partitions if needed after each query stage
  • 15. Performance Pitfall Symptoms of data skew ▪ Frozen/long-running tasks ▪ Disk spilling ▪ Low resource utilization in most nodes ▪ OOM Various ways ▪ Find the skew values and rewrite the queries ▪ Adding extra skew keys… Data skew Anybody dealt with data skews while running Spark jobs?
  • 17. Adaptive Query Execution VISION: No more manual tuning of skew hints!
  • 18. AQE Configuration Settings Property Name Default Meaning Since Version spark.sql.adaptive. coalescePartitions. enabled true When true and spark.sql.adaptive.enabled is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks. 3.0.0 spark.sql.adaptive. coalescePartitions. minPartitionNum Default Parallelism The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. 3.0.0 spark.sql.adaptive. coalescePartitions. initialPartitionNum 200 The initial number of shuffle partitions before coalescing. By default it equals to spark.sql.shuffle.partitions. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. 3.0.0 spark.sql.adaptive. advisoryPartitionSizeInBytes 64 MB The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition. 3.0.0 AQE is not enabled by default. Set spark.sql.adaptive.enabled to true to use the features below.
  • 20. Performance Achieve high performance for interactive, batch, streaming and ML workloads Adaptive Query Execution Dynamic Partition Pruning Join Hints
  • 21. Dynamic Partition Pruning • Avoid partition scanning based on the query results of the other query fragments. • Important for star-schema queries. • Significant speedup in TPC-DS.
  • 22. Static Partition Pruning SELECT * FROM Sales WHERE store_id = 5 Most optimizations employ simple static partition pruning Basic Data Flow Filter Push-down Partitioned files with multi-columnar data
  • 23. A Common Workload SELECT * FROM Sales JOIN Stores WHERE Stores.city = 'New York' Scan Sales Join Scan Stores Filter city = 'New York' ● Static pruning cannot be applied ● The filter acts only on the small dimension table, not on the larger fact table Small dimension table Larger fact table Star Schema Queries
  • 24. Table Denormalization SELECT * FROM Sales JOIN Stores WHERE Stores.city = 'New York' Scan Sales Join Scan Stores Filter city = 'New York' Scan
  • 25. Dynamic Partition Pruning Physical Plan Optimization Partitioned files with multi-columnar data File Scan Dynamic Filter Broadcast Exchange File Scan with DIM filter Broadcast Hash Join SCAN Fact Table
  • 26. Dynamic Partition Pruning 60 / 102 TPC-DS queries: a speedup between 2x and 18x
  • 27. Dynamic Partition Pruning Configuration Dynamic partition pruning is enabled by default. spark.sql.optimizer.dynamicPartitionPruning.enabled | Default: true | When true, Spark generates a predicate for the partition column when it is used as a join key | Since 3.0.0
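Conceptually, the dynamic filter works like this. Below is a dependency-free sketch with made-up store data (Spark does this at the physical-plan level, not in user code): the dimension-side filter runs first, and the resulting join keys prune fact-table partitions before they are ever scanned.

```python
# Toy model of dynamic partition pruning. All data is hypothetical.
stores = [
    {"store_id": 1, "city": "New York"},
    {"store_id": 2, "city": "Bogota"},
    {"store_id": 3, "city": "New York"},
]
# Fact table, partitioned on disk by store_id.
sales_partitions = {1: ["sale-a"], 2: ["sale-b"], 3: ["sale-c"], 4: ["sale-d"]}

# 1. Evaluate the dimension filter (Filter city = 'New York').
keys = {s["store_id"] for s in stores if s["city"] == "New York"}

# 2. Scan only the fact partitions whose partition key matches.
scanned = {pid: rows for pid, rows in sales_partitions.items() if pid in keys}
print(sorted(scanned))  # [1, 3] -- partitions 2 and 4 were never scanned
```

The savings grow with the size of the fact table: the broadcast result of the small dimension scan is reused as a filter on the partition column before the large scan starts.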
  • 28. Performance Achieve high performance for interactive, batch, streaming and ML workloads Adaptive Query Execution Dynamic Partition Pruning Join Hints
  • 29. Optimizer Hints ▪ Join hints influence the optimizer's choice of join strategy ▪ Broadcast hash join ▪ Sort-merge join NEW ▪ Shuffle hash join NEW ▪ Shuffle nested loop join NEW ▪ Use with extreme caution: hints are difficult to manage over time.
  • 30. Join Strategies Sort-Merge: most robust; handles any data size; needs to shuffle and sort; can be slow when a table is small. Broadcast Hash: requires one side to be small; no shuffle or sort; very fast. Shuffle Hash: needs to shuffle, but no sort; can handle large tables; will OOM if data is skewed. Shuffle Nested Loop: doesn't require join keys.
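The trade-off behind the broadcast hash strategy can be seen in a minimal sketch (toy data, not Spark internals): the small side is materialized as a hash map available to every task, so the large side streams through with no shuffle or sort.

```python
def broadcast_hash_join(large, small, key):
    """Toy broadcast hash join: build a hash map from the small
    (broadcast) side, then probe it while streaming the large side."""
    lookup = {row[key]: row for row in small}  # the "broadcast" side
    out = []
    for row in large:  # probe side: no shuffle, no sort
        match = lookup.get(row[key])
        if match is not None:
            out.append({**row, **match})
    return out

sales = [{"key": 1, "amount": 10}, {"key": 2, "amount": 20}, {"key": 9, "amount": 5}]
stores = [{"key": 1, "city": "Bogota"}, {"key": 2, "city": "New York"}]
print(broadcast_hash_join(sales, stores, "key"))
```

If `stores` were too large to fit in memory on each task, this strategy would fail, which is why Spark falls back to sort-merge or shuffle hash joins for two large tables.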
  • 31. How to Use SQL Join Hints? ▪ Broadcast Hash Join: SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key ▪ Sort-Merge Join: SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key ▪ Shuffle Hash Join: SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key ▪ Shuffle Nested Loop Join (no join keys required): SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
  • 32. Join Hint Syntax: Shuffle Merge (SQL and Python examples)
  • 33. Usability and Richer APIs Enable new use cases and simplify Spark application development: Formatted Explain, SQL Engine, pandas UDF enhancements
  • 34. Spark SQL: Old Explain How many of you have scratched your heads reading this?
  • 35. Spark SQL: New EXPLAIN FORMATTED Header: Basic operating tree for the execution plan Footer: Each operator with additional attributes
  • 36. * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- Scan parquet default.tab1 (1) (1) Scan parquet default.tab1 Output [2]: [key#5, val#6] Batched: true Location: InMemoryFileIndex [file:/user/hive/warehouse/tab1] PushedFilters: [IsNotNull(key)] ReadSchema: struct<key:int,val:int> (2) ColumnarToRow [codegen id : 1] Input [2]: [key#5, val#6] (3) Filter [codegen id : 1] Input [2]: [key#5, val#6] Condition : (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#27, [id=#164])) (4) Project [codegen id : 1] Output [2]: [key#5, val#6] Input [2]: [key#5, val#6] EXPLAIN FORMATTED SELECT * FROM tab1 WHERE key = (SELECT max(key) FROM tab2 WHERE val > 5)
  • 37. DataFrame.explain(mode) Modes: • simple • extended • codegen • formatted query = """SELECT * FROM tab1 WHERE key = (SELECT max(key) FROM tab2 WHERE val > 5)""" df = spark.sql(query) df.explain(mode="formatted")
  • 38. Usability and Richer APIs Enable new use cases and simplify Spark application development: pandas UDF enhancements, Structured Streaming
  • 39. Pandas UDFs (a.k.a. Vectorized UDFs) SPARK 2.3 SPARK 3.0 Python Type Hints
  • 40. Pandas UDFs Pandas Function APIs - Grouped Map
  • 41. Pandas UDFs Supported function APIs include: Grouped Map Map Co-grouped Map Pandas Function APIs Spark + AI Session Blog
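In Spark 3.0, pandas UDFs are declared with Python type hints, and Spark infers the UDF variant from them. The dependency-free sketch below mimics that inference with a stand-in Series class; the names `infer_udf_kind` and `Series` are hypothetical (the real API is pyspark.sql.functions.pandas_udf, whose decorated function takes and returns pandas.Series batches).

```python
from typing import get_type_hints

class Series:
    """Stand-in for pandas.Series, so the sketch needs no dependencies."""

def infer_udf_kind(func):
    # Mimics how Spark 3.0 reads type hints to pick the pandas UDF variant.
    hints = get_type_hints(func)
    ret = hints.pop("return", None)
    if hints and all(h is Series for h in hints.values()) and ret is Series:
        return "SERIES_TO_SERIES"
    return "UNKNOWN"

def plus_one(s: Series) -> Series:  # in Spark, s would be a pandas.Series batch
    return s

print(infer_udf_kind(plus_one))  # SERIES_TO_SERIES
```

The practical benefit is that one decorator, @pandas_udf, replaces the Spark 2.3 zoo of explicit PandasUDFType constants: the hints say what kind of UDF you wrote.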
  • 42. Usability and Richer APIs Enable new use cases and simplify Spark application development: Structured Streaming UI
  • 43. Improved Web UI: Structured Streaming Tab
  • 44. Improved Web UI: Structured Streaming Get real-time metrics via the Structured Streaming tab, including: ▪ Input rate ▪ Process rate ▪ Input rows ▪ Batch duration ▪ Operation duration ▪ 2-minute window display Documentation
  • 46. Improve the plug-in interface and extend the deployment environments Hive 3.x Metastore Hive 2.3 Execution Hadoop 3 Support Java 11 Support Compatibility and Migration Considerations
  • 47. Spark 3.0 Builds • Only builds with Scala 2.12 • Deprecates Python 2 (already EOL) • Can build with various Hadoop/Hive versions – Hadoop 2.7 + Hive 1.2 – Hadoop 2.7 + Hive 2.3 (supports Java 11) [Default] – Hadoop 3.2 + Hive 2.3 (supports Java 11) • Supports the following Hive metastore versions: – "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
  • 49. The Apache Spark Ecosystem
  • 52. A New Standard for Building Data Lakes A new approach to building data lakes • Open format based on Parquet, with ACID transactions • Adds reliability, data quality, and performance to data lakes • Brings together the best of data warehousing and data lakes • Based on open source and an open format (Parquet) • Enabled by Apache Spark
  • 53. Challenges with data lakes 1. Hard to append data. Adding newly arrived data leads to incorrect reads. 2. Modification of existing data is difficult. GDPR/CCPA require making fine-grained changes to existing data in the lake, which is very costly with Spark alone. 3. Jobs failing midway. Half of the data appears in the data lake, the rest is missing. How many of you have built data lakes?
  • 54. Challenges with data lakes 4. Real-time operations hard – mixing streaming and batch leads to inconsistency. 5. Costly to keep historical versions of the data – regulated environments require reproducibility, auditing, and governance. 6. Difficult to handle large metadata – for large data lakes the metadata itself becomes difficult to manage.
  • 55. Challenges with data lakes 7. "Too many files" problem. Data lakes are not great at handling millions of small files. 8. Fine-grained access control is difficult. Enforcing enterprise-wide role-based access control on data is hard.
  • 56. Challenges with data lakes 9. Hard to get great performance – partitioning the data for performance error-prone and difficult to change. 10. Data quality issues. Hard to ensure that all the data is correct and has the right quality.
  • 57. Challenges with data lakes 1. Hard to append data 2. Modification of existing data difficult 3. Jobs failing mid way 4. Real-time operations hard 5. Costly to keep historical data versions 6. Difficult to handle large metadata 7. Poor performance 8. “Too many files” problem 9. Fine grained access control difficult 10. Data quality issues ACID transactions Spark under the hood - auto-indexing - fine grained ACLs and RBAC - schema enforcement and evolution
  • 58. From Data Lakes -> Delta Lake -> Lakehouse VLDB Conference paper.
  • 59. Delta Lake Connectors Standardize your big data storage with an open format accessible from various tools Amazon Redshift Amazon Athena
  • 62. What is Project Hydrogen? The goal of Project Hydrogen is to enable first-class support for all distributed ML frameworks https://vimeo.com/274267107
  • 63. Incompatible Execution Models Spark: ▪ Tasks are independent ▪ Parallel and massively scalable ▪ If a task crashes, rerun it. Distributed ML frameworks: ▪ Complete coordination among tasks ▪ Optimized for communication ▪ If a task crashes, rerun all tasks.
  • 64. Barrier Execution Mode (Spark 2.4) ▪ Since 2.4, gang scheduling has been implemented on top of the MapReduce execution model ▪ Gang scheduling enables barrier execution mode Stage 1: Data Prep (embarrassingly parallel) Stage 2: Dist ML training (gang scheduled) Stage 3: Data Sink (embarrassingly parallel)
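The gang-scheduling semantics can be illustrated locally with Python's threading.Barrier (an analogy only; in Spark you would call RDD.barrier().mapPartitions(...)): no task proceeds past the barrier until every task has reached it.

```python
import threading

NUM_TASKS = 3
barrier = threading.Barrier(NUM_TASKS)
reached = []
lock = threading.Lock()

def task(task_id):
    # Independent stage work (e.g., loading a data shard) happens here.
    with lock:
        reached.append(task_id)
    barrier.wait()  # blocks until ALL tasks arrive, like barrier execution mode
    # Coordinated work (e.g., a distributed training step) would happen here.

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(reached))  # [0, 1, 2]
```

This is exactly the contract distributed ML frameworks need: either all tasks in the stage are running together, or none are, and a crash restarts the whole gang.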
  • 65. Accelerator Aware Scheduling (Spark 3.0) - Motivation ▪ Deep learning workloads often use GPUs or other accelerators to speed up processing on large datasets ▪ Popular cluster managers YARN and Kubernetes support GPUs ▪ Spark 2.x can support those cluster managers, but is not aware of available GPUs and cannot request or schedule them
  • 66. Accelerator Aware Scheduling in Spark 3.0 ▪ Accelerates specialized workloads such as deep learning and signal processing ▪ Supports Standalone, YARN, and Kubernetes ▪ Supports GPUs ▪ Required resources are specified by configuration, so it works only at the application level Future work: ▪ Support TPUs, FPGAs, etc. ▪ Support job/stage/task-level resource allocation
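A minimal submission sketch, assuming a YARN cluster with two GPUs per executor and one GPU per task; the discovery-script path and job filename are hypothetical placeholders:

```shell
spark-submit \
  --master yarn \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh \
  my_deep_learning_job.py
```

Inside a task, the assigned devices can then be read back via TaskContext (e.g., TaskContext.get().resources()["gpu"].addresses in PySpark), so each task pins its work to its own GPUs.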
  • 67. Web UI for accelerators
  • 68. Project Zen: PySpark Improvements
  • 69. • Redesigning PySpark documentation • PySpark type hints • Visualization • Standardized warnings and exceptions Blog How many of you have scratched your head looking at a PySpark stack trace?
  • 71. THE FREE VIRTUAL EVENT FOR DATA TEAMS ● Three days of sessions, keynotes, training and demos ● Catch up on rapid advances and best practices in Apache Spark™, Delta Lake, MLflow and Redash ● Network with more than 20,000 data professionals from across Europe and around the world
  • 72. Thank you for your support & contributions! Happy 10th Birthday cc: Acknowledgements to Xiao Li and Doug Bateman