APACHE SPARK 3
NEW FEATURES
- APARUP CHATTERJEE
New Features in Spark 3.0
Spark 3.0.0 was released in June 2020.
Spark 3.0 brings many improvements aimed at faster execution.
Several of the changes improve SQL performance, such as:
• Adaptive Query Execution (AQE)
• New EXPLAIN format
• DataFrame tail function
• Join hints
• Dynamic partition pruning
Sources: Spark + AI Summit Europe 2019, the official Spark 3.0 docs, and Google Search
This session covers the first three features; the rest will follow in my next session.
GCP-Based Big Data Components Used
Spark 2.x environment:
Hadoop 2.9
Spark 2.3
Python 2.7.14
Spark 3.0 environment:
Hadoop 3.2
Spark 3.0
Python 3.7.4
Adaptive Query Execution (AQE)
Catalyst is one of the most important layers of Spark SQL; it performs all of
the query optimisation.
Although Catalyst does a lot of heavy lifting, all of it happens before query
execution. Once the physical plan has been created and execution has started,
Catalyst performs no further optimisation, so it cannot apply optimisations
that depend on metrics observed while the query is running.
Spark 3.0 introduces an additional layer of optimisation, known as Adaptive
Query Execution (AQE). This layer optimises queries using the metrics that are
collected during execution.
AQE sits on top of Catalyst and modifies the Spark plan on the fly. This lets
Spark do things that are not possible in Catalyst alone.
Adaptive Number of Shuffle Partitions or Reducers
In Spark SQL, the number of shuffle partitions is set with spark.sql.shuffle.partitions, which defaults to 200. In most cases this
number is too high for small data and too low for large data, so choosing the right value is always tricky for the developer.
We therefore need a way to coalesce the shuffle partitions by looking at the mapper output. If the map stage generates only a small
number of partitions, we want to reduce the overall number of shuffle partitions to improve performance.
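The idea of coalescing can be sketched in plain Python. This is a simplified illustration and not Spark's actual implementation: the function name coalesce_partitions and the target_size parameter are hypothetical, and Spark's real rule takes more inputs into account. The sketch merges adjacent shuffle partitions until each group approaches a target size.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target_size: int) -> List[List[int]]:
    """Group adjacent partition indices so each group stays near target_size bytes."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group when adding this partition would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 200 shuffle partitions, only three of which hold any data (sizes in bytes).
sizes = [0] * 200
sizes[3], sizes[57], sizes[120] = 80, 70, 76

groups = coalesce_partitions(sizes, target_size=64 * 1024 * 1024)
print(len(groups))  # 1: all 200 tiny partitions fit in a single group
```

With a tiny input, 200 reducers collapse into one, which is the effect the rest of this section observes in the Spark 3 job.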
Shuffle Partitions without AQE:
Before we see how to optimise the shuffle partitions, let's look at the problem we are trying to solve. Take the example below:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Spark Adaptive Query Execution")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext

# Read a small file (226 B) and repartition it to 500 partitions.
# The repartition forces Spark to use the maximum number of shuffle partitions.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("gs://aparup-files/sales.csv")
    .repartition(500)
)
df.show(4, False)

# groupBy triggers a shuffle
df.groupBy("customerId").count().count()
# sales_df = df.groupBy("customerId").count()
# sales_df.write.parquet("gs://aparup-files/spark2.parquet")
sc.stop()
Observing the Job: Spark 2 Does Not Have AQE
When I run this on the Spark 2 cluster, it throws an error. AQE is disabled by default, and
to let it choose the required number of partitions from runtime metrics we need to enable
'spark.sql.adaptive.coalescePartitions.enabled', a setting that is not present in Spark 2.
Spark 3 with AQE
Spark 2: Observing Stages
As the image shows, stage 14 ran 200 tasks even though the data was very small.
Spark 2: Observing DAGs
The image shows that a lot of shuffling took place.
Optimising Shuffle Partitions in AQE
Enabling the configuration
To use AQE, set spark.sql.adaptive.enabled to true.
conf.set("spark.sql.adaptive.enabled", "true")
To use the shuffle partition optimisation, set spark.sql.adaptive.coalescePartitions.enabled to true.
conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
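As a minimal sketch, the two settings can also be applied when the session is built. The names AQE_SETTINGS, with_aqe and build_session are hypothetical helpers introduced for illustration; only the two configuration keys come from the slides.

```python
# The two AQE settings named above, kept in one place.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def with_aqe(builder):
    """Apply every AQE setting to a SparkSession builder and return the builder."""
    for key, value in AQE_SETTINGS.items():
        builder = builder.config(key, value)
    return builder


def build_session(app_name: str = "aqe-demo"):
    """Create a SparkSession with AQE and shuffle partition coalescing enabled."""
    # Imported here so the settings above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    return with_aqe(SparkSession.builder.appName(app_name)).getOrCreate()
```

Calling build_session() on a Spark 3 cluster returns a session in which the groupBy from the earlier example runs with AQE enabled.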
Spark 3: Observing Stages
The image shows that most of the stages are skipped altogether, because Spark figured
out that most of the partitions are empty.
Spark 3: Observing DAGs
The image shows that most of the shuffle was skipped.
A CoalescedShuffleReader combines all the shuffle partitions into one.
So, by enabling just a few configuration settings, AQE can optimise the shuffle
partitions dynamically.
New EXPLAIN Format
In Spark, EXPLAIN returns the details of the execution stages of a Spark SQL query; in other words, it shows how the query is
optimised.
Challenge in Spark 2: it is not easy to understand how a query is optimised, because the output is too complex.
Key feature of EXPLAIN in Spark 3: an easy-to-read query execution plan, obtained by passing mode="formatted" to explain.
query="select customerId,max(amountPaid) from spark3.sample_tbl where customerId>0 group by customerId having max(amountPaid)>0"
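A minimal sketch of how the formatted plan might be requested from PySpark 3. The helper name explain_formatted is hypothetical; it assumes an active SparkSession is passed in and that the table spark3.sample_tbl from the slide exists.

```python
QUERY = (
    "select customerId, max(amountPaid) "
    "from spark3.sample_tbl "
    "where customerId > 0 "
    "group by customerId "
    "having max(amountPaid) > 0"
)


def explain_formatted(spark, query: str) -> None:
    """Print the plan for `query` in the Spark 3 formatted layout."""
    # DataFrame.explain accepts a `mode` argument from Spark 3.0 onwards;
    # "formatted" prints a plan outline followed by per-node details.
    spark.sql(query).explain(mode="formatted")
```

On a Spark 3 cluster this would be called as explain_formatted(spark, QUERY); on Spark 2 the mode argument is not available.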
Explain in Spark 2
It is not easy to understand how a query is optimised.
The output is too complex.
Explain in Spark 3
Easy to Read Query Plan
Output with Very Detailed Information
DataFrame tail Function
Often in our code we want to read just a few rows from a DataFrame.
For this we use the head function, which is implemented internally by reading
only the needed number of rows, accessing one partition at a time from the beginning.
Up to Spark 2, however, there was no straightforward way to access values from the
last partition of a DataFrame.
Spark 3 therefore introduces a new function, tail, for reading values from the last
partition of a DataFrame.
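In PySpark 3 the call itself is df.tail(n), which returns the last n rows as a list of Row objects. The difference between head and tail can be illustrated with a plain Python model in which a DataFrame is a list of partitions. This is a simplified sketch and not Spark's implementation; the function names and the sample data are hypothetical.

```python
from typing import List


def head(partitions: List[list], n: int) -> list:
    """Take the first n rows, visiting partitions from the front."""
    rows: list = []
    for part in partitions:
        if len(rows) >= n:
            break
        rows.extend(part[: n - len(rows)])
    return rows


def tail(partitions: List[list], n: int) -> list:
    """Take the last n rows, visiting partitions from the back."""
    rows: list = []
    for part in reversed(partitions):
        if len(rows) >= n:
            break
        need = n - len(rows)
        # Prepend so the original row order is preserved.
        rows = part[-need:] + rows
    return rows


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(head(partitions, 4))  # [1, 2, 3, 4]
print(tail(partitions, 5))  # [5, 6, 7, 8, 9]
```

Both functions stop as soon as they have enough rows, so only the partitions at one end are touched.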
Spark 2 does not have a tail function
Spark 3 introduces the new tail function
Useful Resources
https://spark.apache.org/releases/spark-release-3-0-0.html - Spark 3 official docs
https://www.youtube.com/watch?v=scM_WQMhB3A&t=1s - Spark + AI Summit Europe 2019
What's New in Apache Spark 3.0 !!
