What's New in Apache Spark 3.0 & Why Care?
Good evening, friends of Bogotá Spark!
Jules S. Damji, Databricks
Apache Spark Bogota Meetup
September 23, 2020
Jules S. Damji
Senior Developer Advocate @
Databricks
Joined Databricks in 2016
20+ years as a software engineer at companies:
Sun, Netscape, VeriSign, @Home,
LoudCloud/Opsware, Hortonworks, etc.
About Me
Unified data analytics platform for
data science, data engineering, and business analytics to solve tough
data problems
Original creators of popular data and machine learning open source projects
Global company with 5,000 customers and 450+ partners
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Performance
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
Enhancements
DELETE/UPDATE/
MERGE in Catalyst
Reserved
Keywords
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL Compatibility
Built-in Data Sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested
Column Filter
Pushdown
CSV Filter
Pushdown
New Binary
Data Source
Data Source V2 API +
Catalog Support
Java 11 Support
Hadoop 3
Support
Hive 3.x Metastore
Hive 2.3 Execution
Extensibility and Ecosystem
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
3400+ Resolved
JIRAs
in Spark 3.0
Blog
Agenda
Performance
Spark 3.0 comes with performance improvements to make Spark faster, cheaper, and more flexible
Usability
Spark is easier to use
Compatibility Considerations
View notable compatibility/behavior changes
Spark Ecosystem
Learn about developments in Delta Lake, Project Hydrogen, and Project Zen
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Blog
Spark Catalyst Optimizer
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Spark 3.0, Rule + Cost + Runtime
Optimization in Spark 2.x
Blog
Adaptive Query Execution
Based on statistics of the finished plan nodes, re-optimize the execution
plan of the remaining queries
▪ Dynamically switch join strategies
▪ Dynamically coalesce shuffle partitions
▪ Dynamically optimize skew joins
adaptive planning
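These three capabilities all hang off a handful of configuration flags. A minimal sketch of enabling AQE in PySpark, using the property names documented for Spark 3.0 (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Sketch: enable Adaptive Query Execution and its runtime optimizations.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # master switch (off by default in 3.0)
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # coalesce shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
    .getOrCreate()
)
# Join-strategy switching (e.g., SMJ -> BHJ) happens automatically once AQE is on.
```

Because AQE re-optimizes between query stages, no query changes are needed; the flags alone activate the behavior.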
Performance Pitfall
Choose Broadcast Hash Join?
▪ Increase "spark.sql.autoBroadcastJoinThreshold"?
▪ Use a "broadcast" hint?
However
Hard to tune
Hard to maintain over time
OOM…
Using the wrong join strategy
Adaptive Query Execution
Vision: No more manual setting of broadcast hints/thresholds! Capability: SMJ -> BHJ at runtime
Diagram: the static plan picks a sort-merge join (shuffle write + sort on both children) based on an estimated size of 15 MB; at runtime the actual shuffle output is only 8 MB, so AQE replans the not-yet-started stages as a broadcast hash join, reusing the completed shuffle output via local shuffle reads.
Performance Pitfall
Tuning spark.sql.shuffle.partitions
▪ Default magic number: 200 !?!
However
▪ Too small: GC pressure; disk spilling
▪ Too large: Inefficient I/O; scheduler pressure
▪ Hard to tune over the whole query plan
▪ Hard to maintain over time
Choosing the wrong shuffle partition number
Adaptive Query Execution
Vision: No more manual tuning of spark.sql.shuffle.partitions! Capability: Coalesce shuffle partitions
Diagram: the initial plan is Scan -> Filter -> Shuffle (50 partitions) -> Sort; after Stage 1 executes, AQE re-optimizes the remaining plan and inserts a Coalesce (5 partitions) node between the shuffle and the sort.
Set the initial partition number (200 or some X) high enough to accommodate the largest
data size of the entire query execution;
Spark then automatically coalesces partitions if needed after each query stage
Performance Pitfall
Symptoms of data skew
▪ Frozen/long-running tasks
▪ Disk spilling
▪ Low resource utilization in most nodes
▪ OOM
Common workarounds
▪ Find the skewed values and rewrite the queries
▪ Add extra salt keys…
Data skew
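Key salting, the manual pre-3.0 workaround, can be sketched in plain Python: each hot key is split into N sub-keys so its rows spread over N partitions, and the small side is replicated once per salt value so the join still matches. The data and N below are illustrative:

```python
import random

# Sketch: key salting for a skewed join key.
N = 4  # number of salt buckets per key
big = [("hot", i) for i in range(1000)] + [("cold", 0)]  # skewed "fact" side
small = [("hot", "H"), ("cold", "C")]                    # "dimension" side

# Salt the big side: key -> (key, random bucket).
salted_big = [((k, random.randrange(N)), v) for k, v in big]

# Replicate the small side once per salt value so every bucket finds a match.
salted_small = [((k, s), v) for k, v in small for s in range(N)]

# "hot" is now spread over up to N join keys instead of one giant partition.
hot_buckets = {key for key, _ in salted_big if key[0] == "hot"}
```

AQE's skew-join handling makes this manual rewrite unnecessary in most cases.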
Anybody dealt with
data skews while
running Spark jobs?
Adaptive Query Execution
Data Skew
Adaptive Query Execution
Vision: No more manual tuning of skew hints! Capability: Split skewed shuffle partitions at runtime
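AQE's skew handling is configuration-driven; a hedged sketch, assuming an existing SparkSession `spark` (the threshold values shown are the documented 3.0 defaults, not recommendations):

```python
# Sketch: enable AQE skew-join handling.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition is treated as skewed if it is larger than skewedPartitionFactor
# times the median partition size AND larger than the byte threshold below.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```

Skewed partitions are then split into smaller tasks at runtime, with the other join side replicated as needed.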
AQE Configuration Settings
Property Name: spark.sql.adaptive.coalescePartitions.enabled
Default: true
Meaning: When true and spark.sql.adaptive.enabled is true, Spark coalesces contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes) to avoid too many small tasks.
Since Version: 3.0.0

Property Name: spark.sql.adaptive.coalescePartitions.minPartitionNum
Default: Default Parallelism
Meaning: The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. Only takes effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
Since Version: 3.0.0

Property Name: spark.sql.adaptive.coalescePartitions.initialPartitionNum
Default: 200
Meaning: The initial number of shuffle partitions before coalescing. By default it equals spark.sql.shuffle.partitions. Only takes effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
Since Version: 3.0.0

Property Name: spark.sql.adaptive.advisoryPartitionSizeInBytes
Default: 64 MB
Meaning: The advisory size in bytes of a shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). Takes effect when Spark coalesces small shuffle partitions or splits a skewed shuffle partition.
Since Version: 3.0.0
AQE is not enabled by default. Set spark.sql.adaptive.enabled to true to use the features above.
Adaptive Query Execution
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Dynamic Partition Pruning
• Skip scanning partitions based on
the query results of other
query fragments.
• Important for star-schema
queries.
• Significant speedup in TPC-DS.
Static Partition Pruning
SELECT * FROM Sales WHERE store_id = 5
Most optimizations employ simple static partition pruning
Basic Data Flow Filter Push-down
Partitioned files with
multi-columnar data
A Common Workload
SELECT * FROM Sales JOIN Stores
WHERE Stores.city = 'New York'
Scan
Sales
Join
Scan
Stores
Filter
city = 'New York'
● Static pruning cannot be applied
● Filter is only acting on the smaller
dimensional table, not the larger fact
table
Small dimensional table
Larger fact table
Star Schema Queries
Table Denormalization
SELECT * FROM Sales JOIN Stores
WHERE Stores.city = 'New York'
Scan
Sales
Join
Scan
Stores
Filter
city = 'New York'
Scan
Dynamic Partition Pruning
Physical Plan Optimization
Partitioned files with
multi-columnar data
File Scan
Dynamic Filter
Broadcast
Exchange
File Scan with DIM
filter
Broadcast Hash
Join
SCAN Fact Table
Dynamic Partition Pruning
60 of 102 TPC-DS queries show a speedup between 2x and 18x
Dynamic Partition Pruning Configuration
Dynamic Partition Pruning is enabled by default
Property Name: spark.sql.optimizer.dynamicPartitionPruning.enabled
Default: true
Meaning: When true, Spark generates a predicate for the partition column when it is used as a join key.
Since Version: 3.0.0
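Nothing changes in the query itself; DPP kicks in automatically for a partitioned fact table joined to a filtered dimension table. A sketch of the query shape that benefits, assuming an existing SparkSession `spark` (table and column names are hypothetical):

```python
# Sketch: a star-schema join that can trigger dynamic partition pruning,
# assuming `sales` is partitioned by store_id and `stores` is a small dimension table.
result = spark.sql("""
    SELECT s.*
    FROM sales s
    JOIN stores st ON s.store_id = st.store_id
    WHERE st.city = 'New York'
""")
# At runtime, Spark broadcasts the filtered store_ids and scans only the
# matching partitions of `sales`, instead of the whole fact table.
result.explain(mode="formatted")  # the fact-table scan shows a dynamic pruning expression
```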
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Optimizer Hints
▪ Join hints influence the optimizer's choice of join strategy
▪ Broadcast hash join
▪ Sort-merge join NEW
▪ Shuffle hash join NEW
▪ Shuffle nested loop join NEW
▪ Should be used with extreme caution.
▪ Difficult to manage over time.
Join Strategies
Sort-Merge
▪ Most robust: handles any data size
▪ Needs to shuffle and sort
▪ Can be slow when the table size is small
Broadcast Hash
▪ Requires one side to be small
▪ No shuffle or sort
▪ Very fast
Shuffle Hash
▪ Needs to shuffle, but no sort
▪ Can handle large tables
▪ Will OOM if data is skewed
Shuffle Nested Loop
▪ Doesn't require join keys
▪ Broadcast Hash Join
SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key
▪ Sort-Merge Join
SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Hash Join
SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Nested Loop Join
SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
How to Use SQL Join Hints?
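The same hints are available on DataFrames via DataFrame.hint; a sketch, assuming DataFrames `a` and `b` already exist (the hint names below are the Spark 3.0 ones):

```python
from pyspark.sql.functions import broadcast

# Sketch: the four join hints through the DataFrame API.
a.join(broadcast(b), "key")                    # broadcast hash join
a.join(b.hint("merge"), "key")                 # sort-merge join
a.join(b.hint("shuffle_hash"), "key")          # shuffle hash join
a.join(b.hint("shuffle_replicate_nl"), "key")  # shuffle-and-replicate nested loop join
```

As with the SQL hints, these override the optimizer's own choice, so the same cautions apply.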
Join Hint Syntax
Shuffle merge hint, shown in both SQL and Python.
Enable new use cases and simplify Spark application development
Formatted Explain
SQL Engine
pandas UDF
enhancements
Usability and Richer APIs
Spark SQL: Old Explain
How many of you
have scratched
your heads
reading this?
Spark SQL: New EXPLAIN FORMATTED
Header: Basic operating tree for
the execution plan
Footer: Each operator
with additional attributes
* Project (4)
+- * Filter (3)
+- * ColumnarToRow (2)
+- Scan parquet default.tab1 (1)
(1) Scan parquet default.tab1
Output [2]: [key#5, val#6]
Batched: true
Location: InMemoryFileIndex [file:/user/hive/warehouse/tab1]
PushedFilters: [IsNotNull(key)]
ReadSchema: struct<key:int,val:int>
(2) ColumnarToRow [codegen id : 1]
Input [2]: [key#5, val#6]
(3) Filter [codegen id : 1]
Input [2]: [key#5, val#6]
Condition : (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#27, [id=#164]))
(4) Project [codegen id : 1]
Output [2]: [key#5, val#6]
Input [2]: [key#5, val#6]
EXPLAIN FORMATTED
SELECT *
FROM tab1
WHERE key = (SELECT max(key)
FROM tab2
WHERE val > 5)
DataFrame.explain(mode)
Modes:
• simple
• extended
• codegen
• formatted
query = """SELECT * FROM tab1
WHERE key = (SELECT max(key)
FROM tab2 WHERE val > 5)"""
df = spark.sql(query)
df.explain(mode="formatted")
Enable new use cases and simplify Spark application development
Usability and Richer APIs
pandas UDF
enhancements
Structured
Streaming
Pandas UDFs (a.k.a. Vectorized UDFs)
Introduced in Spark 2.3; redesigned in Spark 3.0 around Python type hints.
Spark 3.0 also splits the old grouped-map Pandas UDF into Pandas Function APIs.
Supported function APIs include:
▪ Grouped Map
▪ Map
▪ Co-grouped Map
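A sketch of the Spark 3.0 style, where the UDF type is inferred from Python type hints instead of a PandasUDFType argument, assuming an existing SparkSession `spark` (column names are illustrative):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 3.0: type hints (pd.Series -> pd.Series) identify this as a scalar Pandas UDF.
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df = spark.range(3).selectExpr("cast(id as double) as x")
df.select(plus_one("x")).show()

# Pandas Function API: grouped map via applyInPandas.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(x=pdf.x - pdf.x.mean())

df.groupBy().applyInPandas(subtract_mean, schema="x double")
```

The old Spark 2.3 syntax still works in 3.0, but the type-hint form is the recommended one.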
Spark + AI Session
Blog
Enable new use cases and simplify Spark application development
Usability and Richer APIs
Structured
Streaming UI
Improved Web UI: Structured Streaming Tab
Improved Web UI: Structured Streaming
Get real-time metrics via the structured
streaming tab including:
▪ Input rate
▪ Process rate
▪ Input rows
▪ Batch duration
▪ Operation duration
▪ 2-minute window display
Documentation
Structured Streaming UI
Improve the plug-in interface and extend the deployment environments
Hive 3.x Metastore
Hive 2.3 Execution
Hadoop 3
Support
Java 11
Support
Compatibility and Migration Considerations
Spark 3.0 Builds
• Only builds with Scala 2.12
• Deprecates Python 2 (already EOL)
• Can build with various Hadoop/Hive versions
– Hadoop 2.7 + Hive 1.2
– Hadoop 2.7 + Hive 2.3 (supports Java 11) [Default]
– Hadoop 3.2 + Hive 2.3 (supports Java 11)
• Supports the following Hive metastore versions:
– "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
The Apache Spark Ecosystem
A New Standard for Building Data Lakes
A new approach to building Data Lakes
• Open format based on Parquet, with ACID
transactions
• Adds reliability, data quality, and performance to Data
Lakes
• Brings the best of data warehousing and data lakes
• Based on open source and open format (Parquet)
• Enabled by Apache Spark
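A minimal sketch of reading and writing a Delta table from Spark, assuming the Delta Lake package (e.g. io.delta:delta-core_2.12) is on the classpath and an existing SparkSession `spark`; the path is hypothetical:

```python
# Sketch: ACID writes and reads with Delta Lake's Spark data source.
df = spark.range(100)
df.write.format("delta").mode("overwrite").save("/tmp/events")  # transactional write
delta_df = spark.read.format("delta").load("/tmp/events")       # consistent read

# Time travel: read an earlier version of the same table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
```

Concurrent readers never see a half-written table: the transaction log makes each write atomic.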
Challenges with data lakes
1. Hard to append data. Adding newly
arrived data leads to incorrect reads.
2. Modification of existing data is difficult. GDPR/CCPA
require making fine-grained changes to existing data
lakes. Very costly with Spark.
3. Jobs failing midway. Half of the data appears
in the data lake, the rest is missing.
How many of
you have built
data lakes?
Challenges with data lakes
4. Real-time operations are hard. Mixing streaming and batch
leads to inconsistency.
5. Costly to keep historical versions of the data. Regulated
environments require reproducibility, auditing, and governance.
6. Difficult to handle large metadata. For large data lakes,
the metadata itself becomes difficult to manage.
Challenges with data lakes
7. "Too many files" problem. Data lakes are
not great at handling millions of small files.
8. Fine-grained access control is difficult. Enforcing
enterprise-wide role-based access control on data
is difficult.
Challenges with data lakes
9. Hard to get great performance. Partitioning the data for
performance is error-prone and difficult to change.
10. Data quality issues. Hard to ensure that all the
data is correct and of the right quality.
Challenges with data lakes
1. Hard to append data
2. Modification of existing data difficult
3. Jobs failing mid way
4. Real-time operations hard
5. Costly to keep historical data versions
6. Difficult to handle large metadata
7. Poor performance
8. “Too many files” problem
9. Fine grained access control difficult
10. Data quality issues
ACID transactions
Spark under the hood:
- auto-indexing
- fine-grained ACLs and RBAC
- schema enforcement and evolution
From Data Lakes -> Delta Lake -> Lakehouse
VLDB Conference paper.
Delta Lake Connectors
Standardize your big data storage with an open format accessible from various tools
Amazon Redshift
Amazon Athena
https://databricks.com/diving-into-delta-lake-talks
https://delta.io
Project Hydrogen
What is Project Hydrogen?
The goal of Project Hydrogen is to enable first-class
support for all distributed ML frameworks
https://vimeo.com/274267107
Incompatible Execution Models
Spark
▪ Tasks are independent
▪ Parallel and massively scalable
▪ If a task crashes, rerun it
Distributed ML Frameworks
▪ Complete coordination among tasks
▪ Optimized for communication
▪ If a task crashes, rerun all tasks
Barrier Execution Mode (Spark 2.4)
▪ Since 2.4, gang scheduling has been implemented on top of the
MapReduce execution model
▪ Gang scheduling enables barrier execution mode
Stage 1: Data Prep
(embarrassingly parallel)
Stage 2: Dist ML training
(gang scheduled)
Stage 3: Data Sink
(embarrassingly parallel)
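Gang scheduling is exposed through the RDD barrier API; a sketch, assuming an existing SparkContext `sc` (the training function here is a stand-in for launching a real distributed ML framework):

```python
from pyspark import BarrierTaskContext

# Sketch: run all tasks of a stage together under barrier execution mode.
def train(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()            # wait until every task in the stage reaches this point
    # ... hand off to the distributed training framework here ...
    yield ctx.partitionId()

rdd = sc.parallelize(range(8), numSlices=4)
ids = rdd.barrier().mapPartitions(train).collect()
```

Under barrier mode, either all four tasks are scheduled at once or none are, and a failure of one task restarts the whole stage, matching the ML frameworks' execution model.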
Accelerator Aware Scheduling (Spark 3.0) - Motivation
▪ Deep learning workloads often use GPUs or other accelerators to
speed up processing on large datasets
▪ Popular cluster managers YARN and Kubernetes support GPUs
▪ Spark 2.x can support those cluster managers, but is not aware of
available GPUs and cannot request or schedule them
Accelerator Aware Scheduling in Spark 3.0
▪ Used to accelerate special workloads like deep learning and signal
processing
▪ Supports Standalone, YARN, and Kubernetes
▪ Supports GPUs
▪ Required resources are specified by configuration, so it works only at
the application level
Future work:
▪ Support TPUs, FPGAs, etc.
▪ Support job/stage/task level resource allocation
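Accelerators are requested through configuration and read back inside tasks; a sketch (the amounts and the discovery-script path are illustrative, and a GPU-equipped cluster is assumed):

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

# Sketch: request one GPU per executor and one per task (Spark 3.0 resource configs).
spark = (
    SparkSession.builder
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")  # illustrative path
    .getOrCreate()
)

def which_gpu(_):
    # Each task can see the accelerator addresses assigned to it.
    yield TaskContext.get().resources()["gpu"].addresses

spark.sparkContext.parallelize(range(2), 2).mapPartitions(which_gpu).collect()
```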
Web UI for accelerators
Project Zen: PySpark Improvements
• Redesigning PySpark documentation
• PySpark type hints
• Visualization
• Standardized warnings and exceptions
Blog
How many of you
have scratched
your head looking
at a PySpark stack
trace?
THE FREE VIRTUAL EVENT FOR DATA TEAMS
● Three days of sessions, keynotes, training and demos
● Catch up on rapid advances and best practices in
Apache Spark™, Delta Lake, MLflow and Redash
● Network with more than 20,000 data professionals
from across Europe and around the world
Thank you for your support & contributions!
Happy 10th Birthday
cc: Acknowledgements to Xiao Li and Doug Bateman
More Related Content

PDF
Hyperspace: An Indexing Subsystem for Apache Spark
PDF
Deep Dive into the New Features of Apache Spark 3.0
PDF
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
PDF
Koalas: Making an Easy Transition from Pandas to Apache Spark
PDF
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
PDF
Data Security at Scale through Spark and Parquet Encryption
PDF
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
PDF
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Hyperspace: An Indexing Subsystem for Apache Spark
Deep Dive into the New Features of Apache Spark 3.0
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Koalas: Making an Easy Transition from Pandas to Apache Spark
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Data Security at Scale through Spark and Parquet Encryption
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
From HDFS to S3: Migrate Pinterest Apache Spark Clusters

What's hot (20)

PDF
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
PDF
Designing and Building Next Generation Data Pipelines at Scale with Structure...
PDF
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
PDF
Accelerating Data Processing in Spark SQL with Pandas UDFs
PDF
Deep Dive into GPU Support in Apache Spark 3.x
PPTX
Catalyst optimizer
PDF
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
PDF
Speed up UDFs with GPUs using the RAPIDS Accelerator
PDF
Deep Dive into the New Features of Apache Spark 3.1
PDF
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
PDF
Building Robust, Adaptive Streaming Apps with Spark Streaming
PDF
Faster Data Integration Pipeline Execution using Spark-Jobserver
PDF
Informational Referential Integrity Constraints Support in Apache Spark with ...
PDF
Building a SIMD Supported Vectorized Native Engine for Spark SQL
PDF
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
PDF
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
PDF
Improving Apache Spark for Dynamic Allocation and Spot Instances
PDF
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
PDF
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
PDF
How to performance tune spark applications in large clusters
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Accelerating Data Processing in Spark SQL with Pandas UDFs
Deep Dive into GPU Support in Apache Spark 3.x
Catalyst optimizer
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Speed up UDFs with GPUs using the RAPIDS Accelerator
Deep Dive into the New Features of Apache Spark 3.1
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Building Robust, Adaptive Streaming Apps with Spark Streaming
Faster Data Integration Pipeline Execution using Spark-Jobserver
Informational Referential Integrity Constraints Support in Apache Spark with ...
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Improving Apache Spark for Dynamic Allocation and Spot Instances
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
How to performance tune spark applications in large clusters
Ad

Similar to Apache Spark 3.0: Overview of What’s New and Why Care (20)

PDF
What’s New in the Upcoming Apache Spark 3.0
PDF
Spark + AI Summit recap jul16 2020
PDF
Mastering Query Optimization Techniques for Modern Data Engineers
PPTX
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
PPTX
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PPTX
Running Presto and Spark on the Netflix Big Data Platform
PDF
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
PDF
Headaches and Breakthroughs in Building Continuous Applications
PPTX
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
PDF
Healthcare Claim Reimbursement using Apache Spark
PDF
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
PPTX
PASS Spanish Recomendaciones para entornos de SQL Server productivos
PPTX
Explore big data at speed of thought with Spark 2.0 and Snappydata
PPTX
iceberg introduction.pptx
PPTX
Make your SharePoint fly by tuning and optimizing SQL Server
PPTX
Espc17 make your share point fly by tuning and optimising sql server
PDF
How Adobe Does 2 Million Records Per Second Using Apache Spark!
PDF
Very large scale distributed deep learning on BigDL
PDF
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PDF
PostgreSQL Table Partitioning / Sharding
What’s New in the Upcoming Apache Spark 3.0
Spark + AI Summit recap jul16 2020
Mastering Query Optimization Techniques for Modern Data Engineers
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
Running Presto and Spark on the Netflix Big Data Platform
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
Headaches and Breakthroughs in Building Continuous Applications
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Healthcare Claim Reimbursement using Apache Spark
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
PASS Spanish Recomendaciones para entornos de SQL Server productivos
Explore big data at speed of thought with Spark 2.0 and Snappydata
iceberg introduction.pptx
Make your SharePoint fly by tuning and optimizing SQL Server
Espc17 make your share point fly by tuning and optimising sql server
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Very large scale distributed deep learning on BigDL
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PostgreSQL Table Partitioning / Sharding
Ad

More from Databricks (20)

PPTX
DW Migration Webinar-March 2022.pptx
PPTX
Data Lakehouse Symposium | Day 1 | Part 1
PPT
Data Lakehouse Symposium | Day 1 | Part 2
PPTX
Data Lakehouse Symposium | Day 2
PPTX
Data Lakehouse Symposium | Day 4
PDF
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
PDF
Democratizing Data Quality Through a Centralized Platform
PDF
Learn to Use Databricks for Data Science
PDF
Why APM Is Not the Same As ML Monitoring
PDF
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
PDF
Stage Level Scheduling Improving Big Data and AI Integration
PDF
Simplify Data Conversion from Spark to TensorFlow and PyTorch
PDF
Scaling your Data Pipelines with Apache Spark on Kubernetes
PDF
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
PDF
Sawtooth Windows for Feature Aggregations
PDF
Re-imagine Data Monitoring with whylogs and Spark
PDF
Raven: End-to-end Optimization of ML Prediction Queries
PDF
Processing Large Datasets for ADAS Applications using Apache Spark
PDF
Massive Data Processing in Adobe Using Delta Lake
PDF
Machine Learning CI/CD for Email Attack Detection
DW Migration Webinar-March 2022.pptx
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 4
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Democratizing Data Quality Through a Centralized Platform
Learn to Use Databricks for Data Science
Why APM Is Not the Same As ML Monitoring
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Stage Level Scheduling Improving Big Data and AI Integration
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Sawtooth Windows for Feature Aggregations
Re-imagine Data Monitoring with whylogs and Spark
Raven: End-to-end Optimization of ML Prediction Queries
Processing Large Datasets for ADAS Applications using Apache Spark
Massive Data Processing in Adobe Using Delta Lake
Machine Learning CI/CD for Email Attack Detection

Recently uploaded (20)

PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PPTX
Introduction to Artificial Intelligence
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
Digital Strategies for Manufacturing Companies
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PDF
medical staffing services at VALiNTRY
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
AI in Product Development-omnex systems
PPTX
Operating system designcfffgfgggggggvggggggggg
PDF
System and Network Administration Chapter 2
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
Understanding Forklifts - TECH EHS Solution
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PPTX
ai tools demonstartion for schools and inter college
PPTX
L1 - Introduction to python Backend.pptx
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
Introduction to Artificial Intelligence
Which alternative to Crystal Reports is best for small or large businesses.pdf
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PTS Company Brochure 2025 (1).pdf.......
Digital Strategies for Manufacturing Companies
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
medical staffing services at VALiNTRY
Adobe Illustrator 28.6 Crack My Vision of Vector Design
AI in Product Development-omnex systems
Operating system designcfffgfgggggggvggggggggg
System and Network Administration Chapter 2
How to Choose the Right IT Partner for Your Business in Malaysia
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
Understanding Forklifts - TECH EHS Solution
Upgrade and Innovation Strategies for SAP ERP Customers
ai tools demonstartion for schools and inter college
L1 - Introduction to python Backend.pptx

Apache Spark 3.0: Overview of What’s New and Why Care

  • 1. What's New in Apache Spark 3.0 & Why Care? Buenas noches amigos de Bogotá Spark! J Jules S. Damji, Databricks Apache Spark Bogota Meetup September 23, 2020
  • 2. Jules S. Damji Senior Developer Advocate @ Databricks Joined Databricks in 2016 20+ Software Engineer at companies: Sun, Netscape, VeriSign, @Home, LoudCloud/Opsware, Hortonworks etc About Me
  • 3. Unified data analytics platform for data science, data engineering, and business analytics to solve tough data problems Original creators of popular data and machine learning open source projects Global company with 5,000 customers and 450+ partners
  • 4. Adaptive Query Execution Dynamic Partition Pruning Query Compilation Speedup Join Hints Performance Richer APIs Accelerator-aware Scheduler Built-in Functions pandas UDF Enhancements DELETE/UPDATE/ MERGE in Catalyst Reserved Keywords Proleptic Gregorian Calendar ANSI Store Assignment Overflow Checking SQL Compatibility Built-in Data Sources Parquet/ORC Nested Column Pruning Parquet: Nested Column Filter Pushdown CSV Filter Pushdown New Binary Data Source Data Source V2 API + Catalog Support Java 11 Support Hadoop 3 Support Hive 3.x Metastore Hive 2.3 Execution Extensibility and Ecosystem Structured Streaming UI DDL/DML Enhancements Observable Metrics Event Log Rollover Monitoring and Debuggability Delta ,
  • 6. Agenda Performance Spark 3.0 comes with performance improvements to make Spark faster, cheaper, and more flexible Usability Spark is easier to use Compatibility Considerations View notable compatibility/behavior changes Spark Ecosystem Learn about developments in Delta Lake , Project Hydrogen and Project Zen
  • 7. Performance Achieve high performance for interactive, batch, streaming and ML workloads Adaptive Query Execution Dynamic Partition Pruning Join Hints Blog
  • 8. Spark Catalyst Optimizer Spark 1.x, Rule Spark 2.x, Rule + Cost Spark 3.0, Rule + Cost + Runtime
  • 10. Adaptive Query Execution Based on statistics of the finished plan nodes, re-optimize the execution plan of the remaining queries ▪ Dynamically switch join strategies ▪ Dynamically coalesce shuffle partitions ▪ Dynamically optimize skew joins adaptive planning
  • 11. Performance Pitfall Choose Broadcast Hash Join? ▪ Increase “spark.sql.autoBroadcastJoinThreshold”? ▪ Use “broadcast” hint? However Hard to tune Hard to maintain over time OOM… Using the wrong join strategy
  • 12. Adaptive Query Execution Vision: No more manual setting of broadcast hints/thresholds! Capability: SMJ -> BHJ at runtime SMJ Sort Sort Shuffle Write Shuffle Write Left Child Right Child BHJ Broadcast Shuffle Shuffle Left Child Right Child Static size: 15MB Actual: 8MB Shuffle Read Shuffle Read Local Shuffle Read Local Shuffle Read Not Started Done New Plan - Changed
  • 13. Performance Pitfall Tuning spark.sql.shuffle.partitions ▪ Default magic number: 200 !?! However ▪ Too small: GC pressure; disk spilling ▪ Too large: Inefficient I/O; scheduler pressure ▪ Hard to tune over the whole query plan ▪ Hard to maintain over time Choosing the wrong shuffle partition number
  • 14. Adaptive Query Execution VISION: No more manual tuning of spark.shuffle.partitions! Capability: Coalesce shuffle partitions Filter Scan Execute Shuffle (50 part.) Sort Stage 1 OptimizeFilter Scan Shuffle (50 part.) Sort Stage 1 Filter Scan Shuffle (50 part.) Sort Stage 1 Coalesce (5 part.) Set the initial partition number 200 or X to accommodate the largest data size of the entire query execution Automatically coalesce partitions if needed after each query stage
  • 15. Performance Pitfall Symptoms of data skew ▪ Frozen/long-running tasks ▪ Disk spilling ▪ Low resource utilization in most nodes ▪ OOM Various ways ▪ Find the skew values and rewrite the queries ▪ Adding extra skew keys… Data skew Anybody dealt with data skews while running Spark jobs?
  • 17. Adaptive Query Execution VISION: No more manual tuning of skew hints!
  • 18. AQE Configuration Settings Property Name Default Meaning Since Version spark.sql.adaptive. coalescePartitions. enabled true When true and spark.sql.adaptive.enabled is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks. 3.0.0 spark.sql.adaptive. coalescePartitions. minPartitionNum Default Parallelism The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. 3.0.0 spark.sql.adaptive. coalescePartitions. initialPartitionNum 200 The initial number of shuffle partitions before coalescing. By default it equals to spark.sql.shuffle.partitions. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. 3.0.0 spark.sql.adaptive. advisoryPartitionSizeInBytes 64 MB The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition. 3.0.0 AQE is not enabled by default. Set spark.sql.adaptive.enabled to true to use the features below.
  • 20. Performance Achieve high performance for interactive, batch, streaming and ML workloads Adaptive Query Execution Dynamic Partition Pruning Join Hints
  • 21. Dynamic Partition Pruning • Avoid partition scanning based on the query results of the other query fragments. • Important for star-schema queries. • Significant speedup in TPC-DS.
  • 22. Static Partition Pruning SELECT * FROM Sales WHERE store_id = 5 Most optimizations employ simple static partition pruning Basic Data Flow Filter Push-down Partitioned files with multi-columnar data
  • 23. A Common Workload SELECT * FROM Sales JOIN Stores WHERE Stores.city = 'New York' Scan Sales Join Scan Stores Filter city = 'New York' ● Static pruning cannot be applied ● The filter acts only on the small dimension table, not on the larger fact table Small dimension table Larger fact table Star Schema Queries
  • 24. Table Denormalization SELECT * FROM Sales JOIN Stores WHERE Stores.city = 'New York' Scan Sales Join Scan Stores Filter city = 'New York' Scan
  • 25. Dynamic Partition Pruning Physical Plan Optimization Partitioned files with multi-columnar data File Scan Dynamic Filter Broadcast Exchange File Scan with DIM filter Broadcast Hash Join SCAN Fact Table
  • 26. Dynamic Partition Pruning 60 / 102 TPC-DS queries: a speedup between 2x and 18x
  • 27. Dynamic Partition Pruning Configuration Dynamic partition pruning is enabled by default. spark.sql.optimizer.dynamicPartitionPruning.enabled | Default: true | When true, Spark generates a predicate for the partition column when it is used as a join key | Since 3.0.0
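Conceptually, the dynamic filter works like this. Below is a dependency-free sketch with made-up store data (Spark does this at the physical-plan level, not in user code): the dimension-side filter runs first, and the resulting join keys prune fact-table partitions before they are ever scanned.

```python
# Toy model of dynamic partition pruning. All data is hypothetical.
stores = [
    {"store_id": 1, "city": "New York"},
    {"store_id": 2, "city": "Bogota"},
    {"store_id": 3, "city": "New York"},
]
# Fact table, partitioned on disk by store_id.
sales_partitions = {1: ["sale-a"], 2: ["sale-b"], 3: ["sale-c"], 4: ["sale-d"]}

# 1. Evaluate the dimension filter (Filter city = 'New York').
keys = {s["store_id"] for s in stores if s["city"] == "New York"}

# 2. Scan only the fact partitions whose partition key matches.
scanned = {pid: rows for pid, rows in sales_partitions.items() if pid in keys}
print(sorted(scanned))  # [1, 3] -- partitions 2 and 4 were never scanned
```

The savings grow with the size of the fact table: the broadcast result of the small dimension scan is reused as a filter on the partition column before the large scan starts.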
  • 28. Performance Achieve high performance for interactive, batch, streaming and ML workloads Adaptive Query Execution Dynamic Partition Pruning Join Hints
  • 29. Optimizer Hints ▪ Join hints influence the optimizer's choice of join strategy ▪ Broadcast hash join ▪ Sort-merge join NEW ▪ Shuffle hash join NEW ▪ Shuffle nested loop join NEW ▪ Use with extreme caution: hints are difficult to manage over time.
  • 30. Join Strategies Sort-Merge: most robust; handles any data size; needs to shuffle and sort; can be slow when a table is small. Broadcast Hash: requires one side to be small; no shuffle or sort; very fast. Shuffle Hash: needs to shuffle, but no sort; can handle large tables; will OOM if data is skewed. Shuffle Nested Loop: doesn't require join keys.
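The trade-off behind the broadcast hash strategy can be seen in a minimal sketch (toy data, not Spark internals): the small side is materialized as a hash map available to every task, so the large side streams through with no shuffle or sort.

```python
def broadcast_hash_join(large, small, key):
    """Toy broadcast hash join: build a hash map from the small
    (broadcast) side, then probe it while streaming the large side."""
    lookup = {row[key]: row for row in small}  # the "broadcast" side
    out = []
    for row in large:  # probe side: no shuffle, no sort
        match = lookup.get(row[key])
        if match is not None:
            out.append({**row, **match})
    return out

sales = [{"key": 1, "amount": 10}, {"key": 2, "amount": 20}, {"key": 9, "amount": 5}]
stores = [{"key": 1, "city": "Bogota"}, {"key": 2, "city": "New York"}]
print(broadcast_hash_join(sales, stores, "key"))
```

If `stores` were too large to fit in memory on each task, this strategy would fail, which is why Spark falls back to sort-merge or shuffle hash joins for two large tables.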
  • 31. How to Use SQL Join Hints? ▪ Broadcast Hash Join: SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key ▪ Sort-Merge Join: SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key ▪ Shuffle Hash Join: SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key ▪ Shuffle Nested Loop Join (no join keys required): SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
  • 32. Join Hint Syntax: Shuffle Merge (SQL and Python examples)
  • 33. Usability and Richer APIs Enable new use cases and simplify Spark application development: Formatted Explain, SQL Engine, pandas UDF enhancements
  • 34. Spark SQL: Old Explain How many of you have scratched your heads reading this?
  • 35. Spark SQL: New EXPLAIN FORMATTED Header: Basic operating tree for the execution plan Footer: Each operator with additional attributes
  • 36. * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- Scan parquet default.tab1 (1) (1) Scan parquet default.tab1 Output [2]: [key#5, val#6] Batched: true Location: InMemoryFileIndex [file:/user/hive/warehouse/tab1] PushedFilters: [IsNotNull(key)] ReadSchema: struct<key:int,val:int> (2) ColumnarToRow [codegen id : 1] Input [2]: [key#5, val#6] (3) Filter [codegen id : 1] Input [2]: [key#5, val#6] Condition : (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#27, [id=#164])) (4) Project [codegen id : 1] Output [2]: [key#5, val#6] Input [2]: [key#5, val#6] EXPLAIN FORMATTED SELECT * FROM tab1 WHERE key = (SELECT max(key) FROM tab2 WHERE val > 5)
  • 37. DataFrame.explain(mode) Modes: • simple • extended • codegen • formatted query = """SELECT * FROM tab1 WHERE key = (SELECT max(key) FROM tab2 WHERE val > 5)""" df = spark.sql(query) df.explain(mode="formatted")
  • 38. Usability and Richer APIs Enable new use cases and simplify Spark application development: pandas UDF enhancements, Structured Streaming
  • 39. Pandas UDFs (a.k.a. Vectorized UDFs) SPARK 2.3 SPARK 3.0 Python Type Hints
  • 40. Pandas UDFs Pandas Function APIs - Grouped Map
  • 41. Pandas UDFs Supported function APIs include: Grouped Map Map Co-grouped Map Pandas Function APIs Spark + AI Session Blog
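In Spark 3.0, pandas UDFs are declared with Python type hints, and Spark infers the UDF variant from them. The dependency-free sketch below mimics that inference with a stand-in Series class; the names `infer_udf_kind` and `Series` are hypothetical (the real API is pyspark.sql.functions.pandas_udf, whose decorated function takes and returns pandas.Series batches).

```python
from typing import get_type_hints

class Series:
    """Stand-in for pandas.Series, so the sketch needs no dependencies."""

def infer_udf_kind(func):
    # Mimics how Spark 3.0 reads type hints to pick the pandas UDF variant.
    hints = get_type_hints(func)
    ret = hints.pop("return", None)
    if hints and all(h is Series for h in hints.values()) and ret is Series:
        return "SERIES_TO_SERIES"
    return "UNKNOWN"

def plus_one(s: Series) -> Series:  # in Spark, s would be a pandas.Series batch
    return s

print(infer_udf_kind(plus_one))  # SERIES_TO_SERIES
```

The practical benefit is that one decorator, @pandas_udf, replaces the Spark 2.3 zoo of explicit PandasUDFType constants: the hints say what kind of UDF you wrote.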
  • 42. Usability and Richer APIs Enable new use cases and simplify Spark application development: Structured Streaming UI
  • 43. Improved Web UI: Structured Streaming Tab
  • 44. Improved Web UI: Structured Streaming Get real-time metrics via the Structured Streaming tab, including: ▪ Input rate ▪ Process rate ▪ Input rows ▪ Batch duration ▪ Operation duration ▪ 2-minute window display Documentation
  • 46. Improve the plug-in interface and extend the deployment environments Hive 3.x Metastore Hive 2.3 Execution Hadoop 3 Support Java 11 Support Compatibility and Migration Considerations
  • 47. Spark 3.0 Builds • Only builds with Scala 2.12 • Deprecates Python 2 (already EOL) • Can build with various Hadoop/Hive versions – Hadoop 2.7 + Hive 1.2 – Hadoop 2.7 + Hive 2.3 (supports Java 11) [Default] – Hadoop 3.2 + Hive 2.3 (supports Java 11) • Supports the following Hive metastore versions: – "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
  • 49. The Apache Spark Ecosystem
  • 52. A New Standard for Building Data Lakes A new approach to building data lakes • Open format based on Parquet, with ACID transactions • Adds reliability, data quality, and performance to data lakes • Brings together the best of data warehousing and data lakes • Based on open source and an open format (Parquet) • Enabled by Apache Spark
  • 53. Challenges with data lakes 1. Hard to append data. Adding newly arrived data leads to incorrect reads. 2. Modification of existing data is difficult. GDPR/CCPA require making fine-grained changes to existing data in the lake, which is very costly with Spark alone. 3. Jobs failing midway. Half of the data appears in the data lake, the rest is missing. How many of you have built data lakes?
  • 54. Challenges with data lakes 4. Real-time operations hard – mixing streaming and batch leads to inconsistency. 5. Costly to keep historical versions of the data – regulated environments require reproducibility, auditing, and governance. 6. Difficult to handle large metadata – for large data lakes the metadata itself becomes difficult to manage.
  • 55. Challenges with data lakes 7. "Too many files" problem. Data lakes are not great at handling millions of small files. 8. Fine-grained access control is difficult. Enforcing enterprise-wide role-based access control on data is hard.
  • 56. Challenges with data lakes 9. Hard to get great performance – partitioning the data for performance error-prone and difficult to change. 10. Data quality issues. Hard to ensure that all the data is correct and has the right quality.
  • 57. Challenges with data lakes 1. Hard to append data 2. Modification of existing data difficult 3. Jobs failing mid way 4. Real-time operations hard 5. Costly to keep historical data versions 6. Difficult to handle large metadata 7. Poor performance 8. “Too many files” problem 9. Fine grained access control difficult 10. Data quality issues ACID transactions Spark under the hood - auto-indexing - fine grained ACLs and RBAC - schema enforcement and evolution
  • 58. From Data Lakes -> Delta Lake -> Lakehouse VLDB Conference paper.
  • 59. Delta Lake Connectors Standardize your big data storage with an open format accessible from various tools Amazon Redshift Amazon Athena
  • 62. What is Project Hydrogen? The goal of Project Hydrogen is to enable first-class support for all distributed ML frameworks https://vimeo.com/274267107
  • 63. Incompatible Execution Models Spark: ▪ Tasks are independent ▪ Parallel and massively scalable ▪ If a task crashes, rerun it. Distributed ML frameworks: ▪ Complete coordination among tasks ▪ Optimized for communication ▪ If a task crashes, rerun all tasks.
  • 64. Barrier Execution Mode (Spark 2.4) ▪ Since 2.4, gang scheduling has been implemented on top of the MapReduce execution model ▪ Gang scheduling enables barrier execution mode Stage 1: Data Prep (embarrassingly parallel) Stage 2: Dist ML training (gang scheduled) Stage 3: Data Sink (embarrassingly parallel)
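The gang-scheduling semantics can be illustrated locally with Python's threading.Barrier (an analogy only; in Spark you would call RDD.barrier().mapPartitions(...)): no task proceeds past the barrier until every task has reached it.

```python
import threading

NUM_TASKS = 3
barrier = threading.Barrier(NUM_TASKS)
reached = []
lock = threading.Lock()

def task(task_id):
    # Independent stage work (e.g., loading a data shard) happens here.
    with lock:
        reached.append(task_id)
    barrier.wait()  # blocks until ALL tasks arrive, like barrier execution mode
    # Coordinated work (e.g., a distributed training step) would happen here.

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(reached))  # [0, 1, 2]
```

This is exactly the contract distributed ML frameworks need: either all tasks in the stage are running together, or none are, and a crash restarts the whole gang.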
  • 65. Accelerator Aware Scheduling (Spark 3.0) - Motivation ▪ Deep learning workloads often use GPUs or other accelerators to speed up processing on large datasets ▪ Popular cluster managers YARN and Kubernetes support GPUs ▪ Spark 2.x can support those cluster managers, but is not aware of available GPUs and cannot request or schedule them
  • 66. Accelerator Aware Scheduling in Spark 3.0 ▪ Accelerates specialized workloads such as deep learning and signal processing ▪ Supports Standalone, YARN, and Kubernetes ▪ Supports GPUs ▪ Required resources are specified by configuration, so it works only at the application level Future work: ▪ Support TPUs, FPGAs, etc. ▪ Support job/stage/task-level resource allocation
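A minimal submission sketch, assuming a YARN cluster with two GPUs per executor and one GPU per task; the discovery-script path and job filename are hypothetical placeholders:

```shell
spark-submit \
  --master yarn \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh \
  my_deep_learning_job.py
```

Inside a task, the assigned devices can then be read back via TaskContext (e.g., TaskContext.get().resources()["gpu"].addresses in PySpark), so each task pins its work to its own GPUs.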
  • 67. Web UI for accelerators
  • 68. Project Zen: PySpark Improvements
  • 69. • Redesigning PySpark documentation • PySpark type hints • Visualization • Standardized warnings and exceptions Blog How many of you have scratched your head looking at a PySpark stack trace?
  • 71. THE FREE VIRTUAL EVENT FOR DATA TEAMS ● Three days of sessions, keynotes, training and demos ● Catch up on rapid advances and best practices in Apache Spark™, Delta Lake, MLflow and Redash ● Network with more than 20,000 data professionals from across Europe and around the world
  • 72. Thank you for your support & contributions! Happy 10th Birthday cc: Acknowledgements to Xiao Li and Doug Bateman