- Aparup Chatterjee
Introduction to Apache Spark 3.X Dynamic Partition Pruning (DPP)
Spark 3.x environment details:
• Hadoop 3.2
• Spark 3.1
• Python 3.8
Spark 2.x environment details:
• Hadoop 2.9
• Spark 2.3
• Python 2.7.14
Sources: Spark+AI Summit North America 2020, Spark 3.x official docs, and Databricks
Another big improvement in Spark 3.X is Dynamic Partition Pruning (DPP).
Before going into the details of DPP, I would like to explain its two key building blocks:
1. Filter Pushdown
2. Partition Pruning
Details of the GCP-based big data components used
Filter Pushdown / Predicate Pushdown
Filter pushdown (also called predicate pushdown) is an optimization in Spark SQL.
One strategy for improving Spark SQL query performance is to reduce the amount of data read (I/O) and transferred from the data storage to the executors.
Generally, when we use a Filter/Where condition in a Spark SQL query, the Catalyst optimizer attempts to "push down" the filtering operation to the data source layer to save I/O cost (pic. Filter Push-down), instead of reading the full files, sending them across to the executors (pic. Basic data-flow), and only then filtering.
Example:
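A minimal PySpark sketch of filter pushdown, assuming a hypothetical Parquet dataset at /data/people.parquet with an integer age column; the condition should appear under PushedFilters in the physical plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-pushdown-demo").getOrCreate()

# Read a Parquet dataset (hypothetical path) and apply a filter.
df = spark.read.parquet("/data/people.parquet")
adults = df.filter(df.age > 40)

# The physical plan should list the condition under PushedFilters,
# meaning the scan itself can skip non-matching row groups/files.
adults.explain(True)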
Filter Pushdown Limitation
Filter pushdown depends on the data schema: if the filter condition requires casting the content of a field, the cast cannot be pushed down and Spark will read the full file.
Example: the age column is read as a string from a CSV/Hive table.
So, as data engineers, we should make sure a column has the proper type before it reaches the filter condition, so that Spark can use filter pushdown for optimization.
You can achieve this by casting on the fly or by adding a custom schema to the DataFrame.
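Two hedged ways to do this, assuming the age column arrives as a string from a hypothetical people.csv (column names are illustrative); either cast once before filtering or supply an explicit schema so the comparison is on a typed column:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Option 1: cast on the fly, then filter on the typed column.
raw = spark.read.option("header", True).csv("/data/people.csv")
typed = raw.withColumn("age", col("age").cast("int"))
typed.filter(col("age") > 40).explain(True)

# Option 2: supply a custom schema up front so age is already an integer.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])
df2 = spark.read.option("header", True).schema(schema).csv("/data/people.csv")
df2.filter(col("age") > 40).explain(True)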
Partition Pruning
Partition pruning is another optimization, and it builds on the predicate/filter pushdown mechanism.
When we filter on a partitioned column, the Catalyst optimizer pushes the partition filters down to the data source level. The scan then reads only the directories (not the actual data content) that match the partition filters, reducing disk I/O.
Example: if the filter is on a partition column, there is no need for file-level predicate pushdown; instead Spark pushes down the partition filter, reads only the matching partition directories, and the remaining partitions are pruned/skipped, thereby avoiding unnecessary I/O.
Let's take the same dataset, with the table partitioned by age.
Let's have a look at the physical plan:
As you can see, Spark prunes/skips the irrelevant partitions and reads only the required partition, whose filter is pushed down to the data source level. This data pruning/skipping allows for a big performance boost.
This is also called Static Partition Pruning.
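A sketch of static partition pruning, reusing the df from the first pushdown sketch and rewriting it partitioned by age; the condition now shows up under PartitionFilters instead of PushedFilters:

# Rewrite the dataset partitioned by age (one sub-directory per distinct age value).
df.write.mode("overwrite").partitionBy("age").parquet("/data/people_by_age")

# Filtering on the partition column lets Spark list only the matching directories.
partitioned = spark.read.parquet("/data/people_by_age")
partitioned.filter("age > 40").explain(True)
# Expected shape (illustrative): PartitionFilters: [isnotnull(age#...), (age#... > 40)], PushedFilters: []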
Quick Recap of Filter Pushdown and Partition Pruning
Filter Pushdown: When you filter on a column that is not a partition column, Spark has to consider every part file in the folder of that Parquet table. With pushdown filtering, Spark uses the footer of each part file (where min, max and count statistics are stored) to determine whether your search value falls within that range. If yes, Spark reads the file fully; if not, Spark skips the whole file, saving at least the cost of a full read.
When we filter on the non-partitioned DataFrame, the pushed filters are:
PushedFilters: [IsNotNull(age), GreaterThan(age,40)]
Partition Pruning: When you filter on the columns you partitioned by, Spark skips the non-matching files completely, and they cost no I/O at all.
When we filter on the partitioned DataFrame, the pushed filters are:
PartitionFilters: [isnotnull(age#102), (age#102 > 40)], PushedFilters: []
Spark does not need to push down the age filter when working off the partitioned DataFrame, because it can use a partition filter, which is a lot faster.
Dynamic Partition Pruning
• Dynamic partition pruning improves job performance by more accurately selecting the specific partitions within a table that need to be read and processed for a given query.
• Dynamic partition pruning allows the Spark engine to infer dynamically at runtime, based on the data in the selected columns, which partitions need to be read and which can be safely eliminated.
• By reducing the amount of data read and processed, significant time is saved in job execution.
In real life it is very useful for data engineers working with a star-schema data warehouse, where we generally join a big fact table with a comparatively small dimension table.
Let's look at the example below:
Tables: withPartition_fact (fact table), partitioned by age; withoutPartition_dim (dimension table), not partitioned
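A hedged PySpark sketch of this setup with toy data (row contents and sizes are illustrative; only the table names and the age partition key follow the slides):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

# Toy stand-ins for the slides' tables.
fact = spark.createDataFrame(
    [("emp%d" % i, 20 + i % 40, i * 1000) for i in range(1000)],
    ["name", "age", "salary"])
dim = spark.createDataFrame(
    [(25, "Bangalore"), (30, "Pune"), (45, "Delhi")],
    ["age", "city"])

# Fact table: large, partitioned by age. Dimension table: small, not partitioned.
fact.write.mode("overwrite").partitionBy("age").saveAsTable("withPartition_fact")
dim.write.mode("overwrite").saveAsTable("withoutPartition_dim")

# Join on the partition column and filter only the dimension side.
# With DPP, the dimension filter is reused at runtime to prune fact partitions.
spark.sql("""
    SELECT f.*
    FROM withPartition_fact f
    JOIN withoutPartition_dim d ON f.age = d.age
    WHERE d.city = 'Bangalore'
""").explain(True)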
Physical Plan
Now it's time to deep dive.
Dimension Table Phase
Spark uses the filter pushdown/predicate pushdown method, which skips reading irrelevant data from the dimension table before the actual data fetching phase.
Fact Table Phase
Here you can notice the partition filter that is applied: it is formed internally from the dimension table filter, and at the end a dynamic pruning expression is formed.
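For reference, with DPP active the fact-table scan in the physical plan typically carries a partition filter of the following shape (identifiers are illustrative):
PartitionFilters: [isnotnull(age#1), dynamicpruningexpression(age#1 IN dynamicpruning#5)]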
Internal Architecture and DAG Level
From the views above we can see that Spark picks the particular partitions of the fact table that are relevant to the dimension table's filter condition, and prunes the other partitions of the fact table.
According to our DataFrame query, the dimension table filter is "city = Bangalore". From the fact table dataset we can see that only 2 records match, with different ages (i.e. 2 partitions).
From the DAG we can also see that, using Dynamic Partition Pruning, Spark fetched only those 2 partitions (number of files read: 2) and all other irrelevant partitions were pruned out.
The Spark 3 config responsible for DPP: spark.sql.optimizer.dynamicPartitionPruning.enabled = true (by default)
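A quick way to inspect or toggle it from PySpark, for example to compare plans with and without DPP:

# Check the current value (defaults to "true" in Spark 3.x).
spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled")

# Disable it temporarily to compare the physical plans side by side.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")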
Spark Internal Dynamic Partition Pruning Steps
1. Spark builds a hash table from the dimension table and forms an inner subquery out of it, which is broadcast to all the executors (subquery broadcast).
2. Using the subquery broadcast, the join can be executed without requiring a shuffle.
3. Spark then probes that hash table with the rows coming from the fact table on each worker node during the scanning phase, so that no irrelevant data is carried into the join phase (see the sketch below).
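Conceptually, this behaves as if the dimension-side filter had been turned into a subquery filter on the fact table's partition column; a simplified illustration (not the literal plan Spark produces):

# The broadcast result of the dimension filter acts like an IN-filter on the
# fact table's partition column, applied while the fact table is being scanned.
pruned = spark.sql("""
    SELECT f.*
    FROM withPartition_fact f
    WHERE f.age IN (SELECT d.age
                    FROM withoutPartition_dim d
                    WHERE d.city = 'Bangalore')
""")
pruned.explain(True)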
Spark Dynamic Partition Pruning Factors
• It is enabled by default in Spark 3.
• The fact/big table must be partitioned by a key column.
• It is best suited to the star-schema data warehousing model.
Useful Resources
• https://www.youtube.com/watch?v=WSplTjBKijU - Spark+AI Summit North America 2020
• https://spark.apache.org/releases/spark-release-3-0-0.html - Spark 3 official docs
• https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark - Databricks