Catalyst optimizer
Presented by Ayub Mohammad
Agenda
• What is catalyst optimizer
• Why is it used
• How does it optimize
• Fundamentals of Apache Spark Catalyst Optimizer
• References
What is catalyst optimizer
• It optimizes all queries written with Spark SQL and the DataFrame API, helping them run much faster than their RDD counterparts.
• Supports rule-based and cost-based optimization.
• In rule-based optimization, a fixed set of rules determines how to execute the query, while cost-based optimization finds the most suitable way to carry out the SQL statement: multiple plans are generated using rules, their costs are computed, and the cheapest plan is chosen.
What is catalyst optimizer
• It builds on Scala’s pattern matching and quasiquotes.
• It is designed as an extensible query optimizer.
Purpose of catalyst optimizer
Catalyst’s extensible design had two purposes.
• Make it easy to add new optimization techniques and features to Spark SQL, especially for tackling the various problems seen with big data (e.g., semi-structured data and advanced analytics).
• Enable external developers to extend the optimizer, for example by adding data-source-specific rules that can push filtering or aggregation into external storage systems, or support for new data types.
How catalyst optimizer works
val user = spark.read.option("header", true).option("delimiter", "\t").option("inferSchema", true).csv("user.txt")
val purchase = spark.read.option("header", true).option("delimiter", "\t").option("inferSchema", true).csv("purchase.txt")
val joined = purchase.join(user, Seq("userid"), "leftouter").select("pid", "location", "amount").filter("amount > 60").select("location")
joined.explain(true)
Spark SQL Execution Plan
Spark uses Catalyst’s general tree transformation framework in four phases, each of which can be inspected programmatically (see the sketch after this list):
• Analysis
• Logical Optimization
• Physical planning
• Code generation
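Each phase can be inspected directly on a DataFrame through its QueryExecution object. A minimal sketch, assuming the joined DataFrame built earlier and a Spark 2.x session:

// Each field corresponds to one phase of Catalyst's pipeline:
joined.queryExecution.logical        // parsed, unresolved logical plan
joined.queryExecution.analyzed       // analyzed logical plan (relations and attributes resolved)
joined.queryExecution.optimizedPlan  // logical plan after rule-based optimization
joined.queryExecution.executedPlan   // selected physical (Spark) plan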
Analysis
Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API. It starts with an unresolved logical plan and then applies the following steps (a small inspection sketch follows the list). For the SQL query:
joinedDF.registerTempTable("joinedTable")
spark.sql("select location from joinedTable where pid > 2").explain(true)
• Look up the relation by name in the catalog.
• Map named attributes, for example location, to the input columns provided by the operator’s children.
• Determine which attributes refer to the same value and assign them a unique ID.
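A minimal inspection sketch, assuming the temp table registered above and a Spark 2.x session (the Catalog API is shown only for illustration; it is not part of the analyzer itself):

// The analyzer resolves "joinedTable" by name from the session catalog:
spark.catalog.listTables().show()   // joinedTable appears as a temporary view

// The fully resolved logical plan, with attributes bound and given unique IDs:
spark.sql("select location from joinedTable where pid > 2").queryExecution.analyzed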
Analysis
[Diagram: the analyzed (resolved) logical plan for the query above.]
Logical Optimization
• In this phase of Spark SQL optimization, standard rule-based optimizations are applied to the logical plan. These include (two of them are sketched after this list):
1. Constant folding
2. Predicate pushdown
3. Projection pruning
4. Null propagation and other rules
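A minimal sketch of constant folding and predicate pushdown, assuming the purchase DataFrame loaded earlier (the exact operators in the output vary by Spark version):

val adjusted = purchase.selectExpr("amount + (1 + 2) AS adjusted", "pid").filter("pid > 2")
println(adjusted.queryExecution.optimizedPlan.numberedTreeString)
// Constant folding typically rewrites 1 + 2 to 3 in the optimized plan, and the
// pid > 2 predicate is evaluated before the projection (and pushed into the data
// source when the source supports it).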
Example:
[Diagram: logical plan before optimization. Scan(user) and Scan(purchase) feed a Join on user.userid == purchase.userid, followed by Project(pid, location, amount), Filter(purchase.amount > 60), and a final Project(location).]
Optimized logical plan
[Diagram: the optimized plan shown alongside the original. Projections are pushed below the join so each scan reads only the columns it contributes (purchase: userid; user: userid and location), with the Filter(purchase.amount > 60) and a final projection of the location column on top.]
Physical Planning
• After the optimized logical plan is generated, it is passed through a series of SparkStrategies that produce one or more physical plans.
• It then selects a plan using a cost model.
• Currently, cost-based optimization is only used to select join algorithms (see the sketch after this list).
• The framework supports broader use of cost-based optimization, however, because costs can be estimated recursively for a whole tree using a rule; richer cost-based optimization may therefore be implemented in the future.
• It also can push operations from the logical plan into data sources that support predicate or projection pushdown.
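A minimal sketch of join selection, assuming the user and purchase DataFrames from earlier (the threshold name and default describe Spark 2.x behaviour):

// With the default spark.sql.autoBroadcastJoinThreshold (~10 MB), a small user table
// is typically broadcast, and explain() shows a BroadcastHashJoin in the physical plan:
purchase.join(user, Seq("userid"), "leftouter").explain()

// Disabling the threshold usually makes the planner fall back to a SortMergeJoin:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
purchase.join(user, Seq("userid"), "leftouter").explain()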
Code Generation
• The final phase of query optimization involves generating Java bytecode to run on each machine.
• Because Spark SQL often operates on in-memory datasets, where processing is CPU-bound, supporting
code generation can speed up execution.
• Catalyst relies on a special feature of the Scala language, quasiquotes, to make code generation simpler.
• Quasiquotes allow the programmatic construction of abstract syntax trees (ASTs) in the Scala language,
which can then be fed to the Scala compiler at runtime to generate bytecode.
• Catalyst is used to transform a tree representing an expression in SQL to an AST for Scala code to evaluate that expression, and then compile and run the generated code (a small inspection sketch follows).
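A minimal sketch for inspecting the generated code, assuming a Spark 2.x session (the debug package is a developer API and may differ between versions):

import org.apache.spark.sql.execution.debug._
joined.debugCodegen()   // prints the Java source produced by whole-stage code generation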
Code Generation
[Diagram: code generation example.]
Fundamentals of Apache Spark Catalyst Optimizer
• At its core, Catalyst contains a general library for representing trees and applying rules to
manipulate them
Tree
• The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or
more children. New node types are defined in Scala as subclasses of the TreeNode class.
• Immutable.
• As a simple example, suppose we have the following three node classes for a very simple expression
language:
• Literal(value: Int): a constant value
• Attribute(name: String): an attribute from an input row, e.g.,“x”
• Add(left: TreeNode, right: TreeNode): sum of two expressions.
Tree example for the expression x + (1 + 2)
These classes can be used to build up trees; for example, the tree for the expression x + (1 + 2) would be represented in Scala code as follows:
Add(Attribute("x"), Add(Literal(1), Literal(2)))
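A minimal, illustrative sketch of these node classes in Scala (simplified stand-ins, not Catalyst’s actual TreeNode hierarchy):

sealed trait Expression
case class Literal(value: Int) extends Expression
case class Attribute(name: String) extends Expression
case class Add(left: Expression, right: Expression) extends Expression

// The tree for x + (1 + 2):
val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))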
Rules
• Trees can be manipulated using rules.
• Rules are functions from one tree to another tree. While a rule can run arbitrary code on its input tree, the most common approach is to use a set of pattern matching functions that find and replace subtrees with a specific structure.
• Pattern matching is a feature of many functional languages that allows extracting values from potentially nested structures.
• Rules can contain arbitrary Scala code, which gives users the flexibility to add new rules easily.
• In Catalyst, trees offer a transform method that applies a pattern matching function recursively on all nodes
of the tree, transforming the ones that match each pattern to a result. For example, we could implement a
rule that folds Add operations between constants as follows:
Rules
tree.transform {
case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
}
• Rules can match multiple patterns in the same transform call, making it very concise to implement multiple
transformations at once:
tree.transform {
case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
case Add(left, Literal(0)) => left
case Add(Literal(0), right) => right
}
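Applied to the x + (1 + 2) tree above, the first case rewrites Add(Literal(1), Literal(2)) to Literal(3). A minimal sketch of such a bottom-up transform for the toy Expression classes defined earlier (illustrative only, not Catalyst’s actual TreeNode.transform):

def transform(e: Expression)(rule: PartialFunction[Expression, Expression]): Expression = {
  // Rewrite the children first, then apply the rule to the node itself (bottom-up).
  val withNewChildren = e match {
    case Add(left, right) => Add(transform(left)(rule), transform(right)(rule))
    case leaf             => leaf
  }
  rule.applyOrElse(withNewChildren, identity[Expression])
}

val folded = transform(Add(Attribute("x"), Add(Literal(1), Literal(2)))) {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
// folded == Add(Attribute("x"), Literal(3))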
Sample CombineFilters rule from Spark source code
object CombineFilters extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case ff @ Filter(fc, nf @ Filter(nc, grandChild)) => Filter(And(nc, fc), grandChild)
}
}
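CombineFilters collapses two adjacent Filter nodes into one whose condition is the conjunction of both. A minimal illustration, assuming the purchase DataFrame from earlier:

// Two chained filters ...
val chained = purchase.filter("amount > 60").filter("pid > 2")
// ... typically show up as a single Filter((pid > 2) AND (amount > 60)) node here:
println(chained.queryExecution.optimizedPlan.numberedTreeString)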
Custom rules
object MultiplyOptimizationRule extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
case Multiply(left,right) if right.isInstanceOf[Literal] &&
right.asInstanceOf[Literal].value.asInstanceOf[Double] == 1.0 =>
println("optimization of one applied")
left
}
}
Custom rules
val purchase = spark.read.option("header", true).option("delimiter", "\t").csv("purchase.txt")
val purchaseamount = purchase.selectExpr("amount * 1")
println(purchaseamount.queryExecution.optimizedPlan.numberedTreeString)
00 Project [(cast(amount#3 as double) * 1.0) AS (amount * 1)#5]
01 +- Relation[tid#10,pid#11,userid#12,amount#3,itemdesc#14] csv
spark.experimental.extraOptimizations = Seq(MultiplyOptimizationRule)
val purchaseamount = purchase.selectExpr("amount * 1")
println(purchaseamount.queryExecution.optimizedPlan.numberedTreeString)
00 Project [cast(amount#3 as double) AS (amount * 1)#7]
01 +- Relation[tid#10,pid#11,userid#12,amount#3,itemdesc#14] csv
References
• https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
• http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/
• https://virtuslab.com/blog/spark-sql-hood-part-i/
• https://data-flair.training/blogs/spark-sql-optimization/
• https://www.tutorialkart.com/apache-spark/dag-and-physical-execution-plan/