Deep Dive into Spark
Eric Xiao, Data Developer | Storage and Query Technologies
1
Motivation
We’ll learn how to analyze a Spark application through the Spark UI,
then implement 2 optimizations that address 2 problems that are
common in Spark applications.
2
Overview:
1. Concrete Example Query.
2. From Code to Execution.
3. Optimization 1: Improving joins.
4. Optimization 2: Eliminate data spill.
5. Recap.
6. Questions.
3
Section 1:
Concrete Query
Example
4
Example Query
5
I want to know the count of order transactions per
credit card type.
SELECT
card_type,
count(*)
FROM order_transactions
LEFT JOIN payment_gateways USING (payment_gateway_id)
LEFT JOIN payment_details USING (payment_detail_id)
WHERE ….
GROUP BY 1
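For reference, a rough PySpark DataFrame equivalent of this query (a sketch only; the DataFrame names and the placeholder filter are assumptions, not the actual pipeline code):

from pyspark.sql import functions as F

# order_transactions, payment_gateways and payment_details are assumed to be
# DataFrames already loaded from the warehouse.
result = (
    order_transactions
    .join(payment_gateways, "payment_gateway_id", "left")
    .join(payment_details, "payment_detail_id", "left")
    .where(F.col("processed_at").isNotNull())  # placeholder for the elided WHERE clause
    .groupBy("card_type")
    .agg(F.count("*").alias("count"))
)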
Section 2:
From Code to
Execution
6
Lazy Evaluation
Actions
Spark Plans
Logical Plan
Optimized Logical Plan
Physical Plans
7
Lazy Evaluation
Spark doesn’t perform any
transformations until an “action” is called.
df
> DataFrame[card_type: string, count: bigint]
8
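A minimal sketch of lazy evaluation, reusing the DataFrames from the example query (names assumed): building the pipeline only records transformations, and evaluating df in a notebook just prints the schema shown above.

df = (
    order_transactions
    .join(payment_gateways, "payment_gateway_id", "left")
    .groupBy("card_type")
    .count()
)

df          # prints: DataFrame[card_type: string, count: bigint]  (nothing has run yet)
df.show(5)  # an action: only now does Spark plan and execute the job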
Actions
An “action” triggers Spark execution.
df.write()
df.head(n)
df.take(n)
df.collect()
df.show()
df.toPandas()
9
Spark Plans
• Spark generates different execution plans from
the Spark DataFrames/RDD code:
• Logical Plan
• Optimized Logical Plan
• Physical Plan
• Read bottom-up, the opposite of the plan order in the Spark UI.
10
df.explain(True)
Spark Plans
== Physical Plan ==
*(10) HashAggregate(keys=[card_type#229], functions=[sum(1)], output=[card_type#229, count#289L])
+- Exchange hashpartitioning(card_type#229, 200)
+- *(9) HashAggregate(keys=[card_type#229], functions=[partial_sum(1)], output=[card_type#229, sum#296L])
+- *(9) Project [card_type#229]
+- *(9) SortMergeJoin [order_transaction_id#202L], [order_transaction_id#160L], Inner
:- *(6) Sort [order_transaction_id#202L ASC NULLS FIRST], false, 0
…
11
Logical Plan
• A set of abstract expressions that represents the Spark code.
12
13
SQL QUERY LOGICAL PLAN
Optimized Logical Plan
• Checks that the set of expressions is valid.
• i.e. tables and columns exist.
• If valid, the plan is passed to the Spark Catalyst optimizer to be optimized.
14
15
LOGICAL PLAN
OPTIMIZED
LOGICAL PLAN
Physical Plan
• Specifies exactly how the logical plan will be executed.
• Multiple plans are generated and the most optimal is selected.
• Based on the physical attributes of the tables.
• i.e. table size and partition size.
16
17
OPTIMIZED
LOGICAL PLAN
PHYSICAL PLAN
Section 3:
Optimization 1 -
Improving Joins
18
SparkUI
Spark DAG
Shuffle
Broadcasting
Spark SQL tab
Join Skew
Join Strategies
19
20
SparkUI Tabs
21
Main tabs:
• Jobs
• SQL
• Storage
Spark DAG
• Directed Acyclic Graph.
• No cycles.
• Data flows in one direction.
• A Spark DAG represents a Spark job.
• A Spark job consists of multiple stages.
• Independent stages can run in parallel.
22
Shuffle
23
• Certain transformations in Spark trigger an event known as the Shuffle.
• A shuffle re-distributes data so that it’s grouped differently across partitions.
• Why? Not all values needed for a transformation exist on the same partition or machine at the
time of the transformation, but they must be co-located to perform the transformation.
• This involves copying data across executors and machines.
• This is why shuffling data is a complex and costly operation.
reference: http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/rdd-programming-guide.html#shuffle-operations
Shuffle
24
Operations that might cause a shuffle:
• Joins
• GroupBy
• Distinct
• Repartition
• Coalesce
• Window
Shuffle
25
When a shuffle isn’t performed:
• Data is sorted / bucketed by the partition key on disk.
• Data has already been shuffled from a previous operation.
• ex. When an aggregate happens after a join but both are performed on the same column.
Shuffle
26
Shuffle
27
1. Initial partitions read from disk.
Shuffle
28
2. Apply transformations on each partition.
ex.
(
df
.withColumn(x, x + 1)
)
Shuffle
29
3. Shuffle data from stage 1 for stage 2.
ex.
(
df
.groupBy(y)
.agg(…)
)
Shuffle
30
4. Perform stage 2 transformations.
ex.
(
df
.groupBy(y)
.agg(…)
)
Broadcast
• When data is small enough, Spark can broadcast it.
• The DataFrame to be broadcast is executed (materialized) first.
• Instead of shuffling data, Spark (the driver) will “broadcast” the DataFrame to every executor.
df = …
broadcasted_df = broadcast(df)
spark.sql.autoBroadcastJoinThreshold
31
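A sketch of an explicit broadcast join and the related threshold setting, assuming an active SparkSession named spark (the 128 MB value is illustrative, not the pipeline's actual configuration):

from pyspark.sql.functions import broadcast

# Ship the small dimension table to every executor instead of shuffling both sides.
joined = order_transactions.join(
    broadcast(payment_gateways), "payment_gateway_id", "left"
)

# Spark also broadcasts automatically when a table is smaller than this threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 128 * 1024 * 1024)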
Broadcast
32
Broadcast Gotchas
33
• As mentioned before, the broadcasted DataFrame is executed before being broadcast.
• If its transformations are complex or take long enough, this will cause a timeout.
• This wait time is defined by spark.sql.broadcastTimeout, which is 300 s (5 minutes) by default.
RED FLAGS:
• TimeoutException: Futures timed out after [300 seconds].
Solution:
• Increase the timeout, 10 minutes should be the upper limit.
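If you do need more time, the setting can be raised in code or config; a sketch (600 s matching the suggested upper limit):

# Raise the broadcast timeout from the default 300 s to 10 minutes.
spark.conf.set("spark.sql.broadcastTimeout", 600)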
SQL Tab
34
• Shows the physical plan of your Spark application.
• Shows aggregate-level stats for task duration and data volume.
• A details dropdown contains all the other plans.
1. Overall DAG
• Visual representation of physical plan.
• It flows from top (reading data) to bottom
(writing / outputting data).
RED FLAGS:
• Too large and wide DAG.
Solution:
• Reduce application complexity.
35
2. Spark Plans
36
3. Transformation Details
• Hover over light blue blobs
to see more code details.
• Gives hints to the line of code in the Spark application.
37
SELECT
card_type,
count(*)
FROM order_transactions
LEFT JOIN payment_gateways USING (payment_gateway_id)
LEFT JOIN payment_details USING (payment_detail_id)
WHERE ….
GROUP BY 1
38
SQL Query (Again)
39
40
SELECT
card_type,
count(*)
FROM order_transactions
JOIN payment_gateways USING (payment_gateway_id)
JOIN payment_details USING (payment_detail_id)
WHERE ….
GROUP BY 1
4. Aggregate Level Stats
41
• Spark provides overall min / median / max values.
• These values cover task duration and data volume.
• They are provided on a per stage / transformation level.
• Focus on the blocks around a join block.
RED FLAGS:
• Highly varying times or data volume.
Solution:
• Skewed join, use a join optimizer.
42
• 9 seconds vs 11.5 minutes.
• Around the join on payment_gateway_id.
Goal 1
Increase join performance by addressing slow-running tasks in a (skewed) join.
Motivation:
When Spark performs a join, it distributes the data to tasks based on the join key.
If the number of rows per task is not evenly distributed, the time it takes per task can
vary, causing the slower tasks to bottleneck the whole application.
Sometimes the skew is so large that the data struggles to fit, or cannot fit, in memory.
43
Joins (skew)
2 types of skewed joins:
1. NULL Skewed (only for left joins).
2. Key(s) Skewed (only for left and inner joins).
These problems can also appear in our resolvers, but they might be
implemented differently.
44
Determining the Skew Type
45
Steps:
1. Run an analysis (dev skew …) on the join key.
2. If NULL values make up a majority of the data, the data is NULL Skewed.
• Replace the join with our NullValueSkewHelper join class.
3. If a single non-null key makes up a large portion of the dataset, it is Key Skewed.
• Replace the join with our SkewPartitioner join class.
4. If multiple non-null keys make up a large portion of the dataset, it is Keys Skewed.
• Replace the join with our FrequentValuesSkewHelper join class.
Dev Skew
• Calculates the distribution of a provided column in a given dataset.
• Finds the minimum and maximum partition size.
dev skew analyze --field 'column' --path gcs_path
46
47
Dev Skew
• Shows the top 5 most frequent values, with their counts and percentages.
• Shows the min and max partition data sizes.
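dev skew is an internal tool; a roughly equivalent check can be sketched in plain PySpark (column and DataFrame names assumed):

from pyspark.sql import functions as F

total = order_transactions.count()

# Top 5 most frequent join keys and the fraction of rows they account for.
(
    order_transactions
    .groupBy("payment_gateway_id")
    .agg(F.count("*").alias("cnt"))
    .withColumn("pct", F.round(F.col("cnt") / F.lit(total), 4))
    .orderBy(F.col("cnt").desc())
    .show(5)
)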
NullValueSkewHelper Under the Hood
Motivation:
• Eliminate the shuffle of the null rows.
• There are too many null rows to fit into memory.
Solution:
• Split the dataset into null and not null.
• Perform the left join on the not null subset.
• Union result with null subset.
• Unions don’t cause a shuffle.
48
NullValueSkewHelper Under the Hood
• Filters the null-skewed dataset to non-null values.
• Left joins that subset to the right table.
• Filters the null-skewed dataset to null values.
• Creates the right table's columns as nulls for that subset.
• Unions the joined result with the null subset (sketched below).
49
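A minimal sketch of the same pattern in plain PySpark (this is not the NullValueSkewHelper source; the helper name and details are illustrative):

from pyspark.sql import functions as F

def left_join_null_skew(left, right, key):
    """Left join that avoids shuffling the (skewed) NULL-key rows."""
    not_null = left.where(F.col(key).isNotNull())
    nulls = left.where(F.col(key).isNull())

    joined = not_null.join(right, key, "left")

    # Give the null subset the right table's columns as NULLs so the union lines up
    # (in practice you would cast these literals to the right column types).
    for c in [c for c in joined.columns if c not in left.columns]:
        nulls = nulls.withColumn(c, F.lit(None))

    # Union does not trigger a shuffle.
    return joined.unionByName(nulls.select(joined.columns))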
50
Table Diagram
51
SQL Diagram
Downsides
• Spark will recognize that you want to do 2 separate transformations on the
skewed dataset:
• Filter for null values.
• Filter for not null values.
• Spark will do this read in parallel, so you'll read in twice as much data.
• We recognize this and “cache” the skewed dataset, which reduces the read,
but now we are storing the dataset in memory.
• This lowers the total memory available to our Spark application.
• This can cause memory issues.
52
Results
53
Results
54
SkewPartitioner
Motivation:
• Reduce the number of rows per task that contain the skewed key.
Solution:
• We randomly assign a number from 0 to n - 1, called a “salt”, to each row.
• This splits the rows for each key by a factor of n.
55
SkewPartitioner Under the Hood
• Assigns a “salt” number (random % num_partitions) to each row of the left
DataFrame.
• Duplicates the right DataFrame num_partitions times.
• Assigns a different salt value to every duplicate.
• Joins the two datasets on the (join key + salt number), as sketched below.
56
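A minimal salting sketch in plain PySpark (not the SkewPartitioner source; n is illustrative and an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

def salted_join(left, right, key, n=16, how="inner"):
    """Join on (key, salt) so a hot key is spread across n partitions."""
    # Left side: each row gets a random salt in [0, n).
    left_salted = left.withColumn("salt", F.floor(F.rand() * n).cast("int"))

    # Right side: replicate every row n times, once per salt value.
    salts = spark.range(n).withColumnRenamed("id", "salt")
    right_salted = right.crossJoin(salts)

    # Join on the original key plus the salt, then drop the helper column.
    return left_salted.join(right_salted, [key, "salt"], how).drop("salt")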
57
Table Diagram
58
SQL Diagram
Downsides
• The right side of the join (the dimension) grows linearly with n.
• Depending on how large the right side is, this can be expensive.
59
FrequentValuesSkewHelper
Motivation:
• Eliminate the shuffle of rows containing the subset of skewed keys.
• These rows are too big to fit into individual tasks.
60
FrequentValuesSkewHelper
Solution:
• Split the skewed dataset into 2 subsets.
• The first subset contains only the skewed keys.
• The second contains the non-skewed keys.
• Do the same with the right dataset.
• Perform a regular join on the non-skewed subsets.
• Do a broadcast join on the skewed subsets.
• Then union that with the joined result.
• Broadcast joins and unions do not produce a shuffle. **
** We can broadcast because only a small subset of keys (under the broadcast threshold) is considered skewed.
61
FrequentValuesSkewHelper
Under the Hood
• Calculates the keys that make up a large portion of the left dataset.
• Splits the left and right datasets into 2 parts:
1. All the rows with those high-frequency keys, call this the “high frequency” DataFrame.
2. Everything else, call this the “low frequency” DataFrame.
• Broadcasts the high frequency right side; this eliminates the shuffle of the left high frequency DataFrame.
• Performs the join on the two “high frequency” DataFrames.
• Performs the join on the two “low frequency” DataFrames.
• Unions the two joined datasets together (sketched below).
62
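A minimal sketch of the same idea in plain PySpark (not the FrequentValuesSkewHelper source; the 1% threshold is an assumption for illustration):

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

def skew_aware_join(left, right, key, how="left", threshold=0.01):
    """Broadcast-join the hot keys, shuffle-join everything else, then union."""
    total = left.count()

    # Keys accounting for more than `threshold` of the left rows.
    counts = left.groupBy(key).count()
    hot_keys = [
        row[key]
        for row in counts.where(F.col("count") > total * threshold).collect()
    ]

    # Split both sides into "high frequency" and "low frequency" subsets.
    # (Assumes the key is non-null; combine with the null-skew helper otherwise.)
    left_hot = left.where(F.col(key).isin(hot_keys))
    left_low = left.where(~F.col(key).isin(hot_keys))
    right_hot = right.where(F.col(key).isin(hot_keys))
    right_low = right.where(~F.col(key).isin(hot_keys))

    # Hot keys: broadcast the small right subset, so the skewed left rows never shuffle.
    hot_joined = left_hot.join(broadcast(right_hot), key, how)
    low_joined = left_low.join(right_low, key, how)

    return hot_joined.unionByName(low_joined)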
63
Table Diagram
64
SQL Diagram
65
SQL Diagram
Downsides
• The complexity of the job increases, as this is a more complex join.
• The broadcast might time out (see Broadcast Gotchas).
66
Key Skew (cont’d)
• The techniques can be used together.
• e.g. the largest skew key is null, followed by a non-null key.
• Remove the nulls, then salt.
67
SQL Tab Recap
68
1. Interpreted the overall Spark DAG.
2. Found the Spark plans in the details dropdown.
3. Mapped parts of the plan to the exact code line.
4. Analyzed aggregate-level stats.
Red Flags:
1. Long and wide DAGs.
2. Big variation in min/med/max times.
** Evaluate the plans in the order they are listed.
Join Optimization Recap
69
• Spot a join block that has widely varying task times.
• Run dev skew to analyze the distribution of keys.
• Pick the correct join optimization.
• Implement it.
Section 4:
Optimization 2 -
Eliminating Data Spill
70
Jobs, Stages, Tasks
Executors
Spill
Partitions / Tasks
Optimal Partition Calculations
71
Disclaimer
These screenshots were taken of the stages while we still had the join skew.
They illustrate more things to be aware of in your stages,
but they are not needed for the second optimization.
72
73
• Every Spark “action” triggers a “job”.
• Every job contains multiple “stages”.
• New stages are created when a shuffle operation is required.
• A stage is a collection of transformations.
• A stage divides the work into a number of “tasks”.
• Tasks run in parallel on “executors”.
Jobs, Stages, Tasks, Executors
Executors
• Read Karl’s discourse response to understand the architecture of Spark or go to
Michael Style’s talk.
• All we need to know is that executors are responsible for running tasks.
• The number of executors and memory per executor is set in the resource classes.
Karl’s discourse response: https://discourse.shopify.io/t/what-do-each-of-the-resource-class-arguments-mean/1920/3
74
75
Spill
76
• An executor has a set amount of memory, set at the start of the Spark
application.
• An executor is responsible for running tasks.
• A task is a unit of transformation(s) on a partition of data.
• If the partition of data exceeds the executor memory, the executor will spill the
data from memory to disk.
Spill
77
• Spilling is very bad: lots of I/O, serialization, and other costs.
• If we spill too much, we can potentially take the executors down.
Jobs
• A job in a Spark application corresponds
to a single “action” performed.
• A job consists of many stages.
• A new stage is created for every “shuffle
operation”.
# Review of actions in Spark:
df.write()
df.head(n)
df.take(n)
df.collect()
df.show()
df.toPandas()
78
79
80
81
82
A stage can be in one of multiple states:
• Pending
• Active
• Completed
• Failed
RED FLAGS:
• Failed stages should be the first thing to check.
83
RED FLAGS:
• Stages with failed tasks.
spark.task.maxFailures = 4
84
RED FLAGS:
• Stages with failed tasks.
• Long running stages.
85
RED FLAGS:
• Stages with failed tasks.
• Long running stages.
• Lots of input data and data being shuffled around.
86
RED FLAGS:
• Stages with failed tasks.
• Long running stages.
• Lots of input data and data being shuffled around.
• Low number of tasks.
Stages
• A stage is a group of transformations.
• Stages are separated by a shuffle of
data:
• join on x
• group by x
• window over x
• etc.
• A stage consists of many tasks.
# Review of transformations in Spark:
df.where(…)
df.withColumn(…)
df.withColumnRenamed(…)
88
89
90
RED FLAGS:
• Having spill in your stage.
91
RED FLAGS:
• Having spill in your stage.
• Big difference in the min and max time and memory.
Stage Level Metrics
92
RED FLAGS:
• Having spill in your stage.
• Big difference in the min and max time and memory.
• Uneven distribution of work per executor.
Executor Level Metrics
93
Task Level Metrics
94
RED FLAGS:
• Having spill in your stage.
• Big difference in the min and max time and memory.
• Uneven distribution of work per executor.
• Tasks with more than 128MB of Shuffle Read data.
After Skew Fix
95
Varying task times and data volumes have been fixed!
96
After Skew Fix
Still some spill of data though :(.
Goal 2
Reducing the amount of shuffle data spilled to disk.
Motivation:
“When the records destined for these aggregation operations do not easily fit in memory, some
mayhem can ensue. First, holding many records in these data structures puts pressure on garbage
collection, which can lead to pauses down the line. Second, when the records do not fit in memory,
Spark will spill them to disk, which causes disk I/O and sorting. This overhead during large shuffles is
probably the number one cause of job stalls I have seen at Cloudera customers.”
Reference: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
97
Tasks
• The smallest unit of transformation(s) applied on a unit of data.
• Tasks run on executors within a stage.
• Number of tasks per stage is determined by various settings.
98
Partitions & Tasks
Number of tasks for a given stage can be:
1. If partition number is not tweaked:
• If stage is reading input data:
• Number of tasks ~= total number of “blocks” in all the files.
• Else:
• Default value of 200 partitions / stage.
2. If partition number is tweaked:
• Partitions passed to the repartition / coalesce function.
• Min(cardinality of a dataset, number of partitions argument).
99
spark.sql.shuffle.partitions = 200
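Both knobs look roughly like this (the values are illustrative):

# Default number of partitions created after any shuffle (joins, groupBy, ...).
spark.conf.set("spark.sql.shuffle.partitions", 400)

# Tweaking the partition count for a single DataFrame.
df = df.repartition(400, "payment_gateway_id")  # full shuffle into 400 partitions
df = df.coalesce(50)                            # merge down to 50 partitions without a full shuffle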
Partitions & Tasks
Adding More Partitions / Tasks:
• Smaller, more manageable amounts of data to process at a time.
• Data (hopefully) fits in memory on the executor.
DISCLAIMER:
• More tasks also means more load on the driver.
RED FLAGS:
• Lots of shuffle spill.
• Driver OOMs -> too many partitions / tasks.
100
Review:
Tasks in a stage are queued and processed
by executors; each executor core runs one task at a time.
Choosing an Optimal Number of Tasks
For the most optimal setup, you would dynamically change the number of partitions at
the stage level, as the ideal number differs stage by stage.
But for simplicity's sake we will find one overarching number for the entire job.
Steps:
1. Find a non-input-reading stage with shuffle spill data.
2. Rule of thumb: the ideal task data size is 128 MB.
3. Add spark.sql.shuffle.partitions: ideal_number to the schedule file.
101
Calculating Ideal Task Number
(1) Memory available for a given task:
(Java memory per executor * margin of error) / (cores per executor)
Margin of error = 0.8 * 0.2 = 0.16
(2) In-memory size of the shuffle data:
(Shuffle spill (memory) / Shuffle spill (disk)) * Shuffle write
reference: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
102
Calculating Ideal Task Number
Putting the 2 equations together:
Number of partitions = (Shuffle spill (memory) * Shuffle write * cores per executor) /
(Shuffle spill (disk) * Java memory per executor * margin of error)
reference: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
103
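As a sketch, the heuristic can be wrapped in a small helper; the figures are read off the stage page, with all sizes in the same unit (an assumption of this example, not part of the original formula):

def ideal_shuffle_partitions(
    shuffle_spill_memory,        # "Shuffle spill (memory)" from the stage page
    shuffle_spill_disk,          # "Shuffle spill (disk)"
    shuffle_write,               # "Shuffle write"
    java_memory_per_executor,
    cores_per_executor,
    margin_of_error=0.16,        # 0.8 * 0.2, as above
):
    """Heuristic partition count so each task's shuffle data fits in memory."""
    # Estimated in-memory size of the shuffled data.
    in_memory_shuffle = (shuffle_spill_memory / shuffle_spill_disk) * shuffle_write

    # Memory available to each task.
    memory_per_task = (java_memory_per_executor * margin_of_error) / cores_per_executor

    return int(in_memory_shuffle / memory_per_task) + 1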
All else Fails…
But this equation is not bulletproof.
If it gives a smaller estimate than the current number of tasks, then we can
only guess and check: increase the number of tasks by 1.5x incrementally until no shuffle
spill is observed.
104
105
After Skew Fix
Notebooks resource class:
• 8 executors
• 12GB java memory
106
After Skew Fix
(Shuffle spill (memory) * Shuffle write * cores per executor) /
(Shuffle spill (disk) * Java memory per executor * margin of error)
= (73.6 GB * 287.4 KB * 8) / (18.7 GB * 12 GB * 0.8)
= 0.00094263101 partitions…
So let's try 200 * 1.5 = 300 partitions.
107
Results
Section 5:
Recap
108
Recap
1. Understood how a Spark DataFrames application gets executed.
2. Evaluated the Spark application by analyzing the SparkUI.
3. Identified bottlenecks.
4. Implemented optimizations to address the bottlenecks.
109
Questions?
110
Thank You
111