By: Amit Raj IIT Kharagpur
Apache Spark Performance
Tuning and Best Practices
Our Agenda
01 Spark Introduction
02 Code Level Optimization
03 Outside Code Technique
04 Demo
05 Summary
Introduction
● Apache Spark is an open-source, in-memory computation framework.
● It gives high performance for both batch and streaming jobs.
● It is built for big data processing.
● It is roughly 100 times faster than MapReduce because of in-memory computation.
Because big data applications make heavy use of resources such as CPU, RAM and storage, optimising
one or more of them together leads to significant cost savings.
In the next 40 minutes we will learn about the approaches that help us do so.
Ways to Optimise
Code Level:-
Here we will learn the best practices to follow in order to achieve high performance with minimal
resources, such as:- Caching, Broadcasting, Serialization, using DataSet/DataFrame over RDD, Avoiding
UDFs, Filtering Data at the Earliest, Reducing Shuffle
Beyond Code:-
Here we will learn to tune configuration parameters and cluster resources, such as:-
File Format, Level of Parallelism, Executor Config, Memory Tuning, Batch Interval
Major Bottleneck
● CPU
● Network Bandwidth
● Memory
Our goal is to optimise each of them as much as possible in order to reduce the resources used
and the computation time, and so achieve optimum performance.
Caching
Suppose in our analytics project we have a text file of flight records; we have to read it, get the number of flights leaving
from a particular country, and reuse that result multiple times.
● Raw data is in a text file
● Read the text file as DF1
● Group by origin country to get DF2
Caching
JOB1:- Number of flights leaving the US as DF3
JOB2:- Number of flights leaving Singapore as DF4
JOB3:- Number of flights leaving India as DF5
Execution plan for JOB1 :- DF1 > DF2 > DF3
Execution plan for JOB2 :- DF1 > DF2 > DF4; after caching it becomes DF2 > DF4, with no need for the DF1 > DF2 step.
Execution plan for JOB3 :- DF1 > DF2 > DF5; after caching it becomes DF2 > DF5, with no need for the DF1 > DF2 step.
Instead of recomputing DF1 and DF2 again, we cache the last reusable DataFrame in memory so that we can
use it in the other jobs, reducing computation resources and saving time.
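A minimal Scala sketch of this caching pattern (the file path and column names are assumptions for illustration, not from the slides): DF2 is cached once and all three jobs start from it instead of re-reading the text file.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count
val spark = SparkSession.builder().appName("CachingExample").master("local[*]").getOrCreate()
import spark.implicits._
// DF1: read the raw flight data
val df1 = spark.read.option("header", "true").csv("data/flights.csv")
// DF2: flights grouped by origin country; this result is reused by every job below
val df2 = df1.groupBy("origin_country").agg(count("*").as("num_flights"))
df2.cache() // keep DF2 in memory so the DF1 > DF2 step is not recomputed
// JOB1, JOB2, JOB3 all start from the cached DF2 instead of the text file
val df3 = df2.filter($"origin_country" === "United States")
val df4 = df2.filter($"origin_country" === "Singapore")
val df5 = df2.filter($"origin_country" === "India")
df3.show(); df4.show(); df5.show()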
Broadcasting
A broadcast variable allows us to keep a read-only variable cached on each executor, so we don’t have to send it with
every task, which reduces network bandwidth usage and time.
When to Use a Broadcast Variable:-
Suppose we have lookup data that needs to be used by each executor while performing its tasks.
We have 100 partitions on a 10-executor cluster (every executor takes care of 10 partitions), so
we need to execute at least 100 tasks and would have to send the lookup data 100 times to the executors (once with every task).
But if we use broadcast, we only need to send the lookup data to each executor once, so only 10 copies are
sent.
Benefit = sending 100 copies vs sending 10 copies
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
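As a sketch of how the broadcast values are then consumed (the sample rows and column layout are assumptions): each task reads the lookup maps through .value, and the maps were shipped to every executor only once when the broadcast variables were created.
import spark.implicits._ // assumes the existing SparkSession named spark
val people = Seq(("James", "NY", "USA"), ("Maria", "CA", "USA"), ("Ravi", "FL", "IN"))
  .toDF("name", "state", "country")
// Resolve the full names via the broadcast maps inside each task
val resolved = people.map { row =>
  val state = broadcastStates.value.getOrElse(row.getString(1), row.getString(1))
  val country = broadcastCountries.value.getOrElse(row.getString(2), row.getString(2))
  (row.getString(0), state, country)
}.toDF("name", "state", "country")
resolved.show()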
Broadcasting (continued)
In the referenced diagram (not reproduced here), m is the broadcast variable; it sits in the memory of each executor and is used during task execution.
Hence the driver does not need to ship the variable m with every task, which reduces network I/O and time.
Serialization
Serialization is needed when we write data to some storage, and de-serialization is needed
when we read data back from a source (the slide illustrates this with a diagram).
In the Spark ecosystem we deal with both of them constantly, for example while caching, broadcasting, and shuffling.
Hence it becomes very important to optimize the serialization process.
Serialization
Kryo serialization over Java serialization:-
Kryo is up to 10 times faster and more compact than Java serialization, but it doesn’t support all serializable types and requires
you to register the classes you use with it.
val spark = SparkSession.builder().appName("Broadcast").master("local").getOrCreate()
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
A further optimization is to register your classes with Kryo in advance, especially when rows are large: if you don’t register a class,
Kryo stores the full class name with each serialized object (for every row).
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[Foo]))
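One caveat worth noting: spark.serializer generally has to be in place before the executors are created, so a common approach is to set it on a SparkConf that is passed to the session builder. A minimal sketch, with Foo as a stand-in example class:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
case class Foo(id: Int, name: String) // example class to register with Kryo
val conf = new SparkConf()
  .setAppName("KryoExample")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[Foo]))
val spark = SparkSession.builder().config(conf).getOrCreate()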
DataSet/DataFrame over RDD
RDDs serialize and deserialize data whenever they distribute it across the cluster, such as during repartition
and shuffle, and serialization and de-serialization are very expensive operations in Spark.
DataFrames, on the other hand, store data as binary in off-heap storage, so there is no need to deserialize and serialize
the data when it is distributed across the cluster. This is why we see a big performance improvement in DataFrames over RDDs.
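A small sketch of the contrast (assuming an existing SparkSession named spark): the same aggregation done with an RDD shuffles serialized records, while the DataFrame version keeps rows in Spark's compact binary format and lets Catalyst optimise the plan.
import spark.implicits._
// RDD version: records are serialized/deserialized around the shuffle
val pairRdd = spark.sparkContext.parallelize(Seq(("US", 1), ("IN", 1), ("US", 1)))
val rddCounts = pairRdd.reduceByKey(_ + _)
// DataFrame version: data stays in binary (off-heap) form during the shuffle
val dfCounts = Seq(("US", 1), ("IN", 1), ("US", 1))
  .toDF("country", "cnt")
  .groupBy("country")
  .sum("cnt")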
Avoid UDF
When we use UDFs we end up losing the optimizations Spark applies to our DataFrame/Dataset. Hence,
whenever an inbuilt Spark function can do the job we should use it and avoid UDFs as much as possible.
If by any chance we do have to use one, we first define a function like a normal Scala function and then
register it with Spark's UDF facility:
● val plusOne = udf((x: Int) => x + 1) //defined function
● spark.udf.register("plusOne", plusOne) //register udf
● spark.sql("SELECT plusOne(5)").show() // calling udf
// +------+
// |UDF(5)|
// +------+
// |     6|
// +------+
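Where possible, the same logic can be expressed with built-in column functions so that Catalyst can still optimise it; a quick sketch (the sample DataFrame is an assumption):
import org.apache.spark.sql.functions.col
import spark.implicits._
val numbers = Seq(5, 10).toDF("value")
// Built-in column arithmetic: transparent to the optimizer
numbers.select((col("value") + 1).as("plus_one")).show()
// The UDF version from above gives the same result, but is a black box to the optimizer
numbers.select(plusOne(col("value")).as("plus_one")).show()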
Filter Data at Earliest
Example:- Suppose we have a data set of employees with columns like employee number, age, gender, salary, department, city, address,
past experience, marital status, etc.
But we only have to find the number of employees belonging to a particular city. In this case we perform a groupBy operation on the city column
and every other column becomes irrelevant, so select and filter only what the query needs as early as possible.
df.select("name", "city").groupBy("city").count().show()
df.groupBy("city").count().select("city", "count").show()
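A hedged sketch of the recommended shape (employeesDf and its column names are assumptions): keep only the needed columns and apply the row filter before the aggregation so that less data reaches the shuffle.
import org.apache.spark.sql.functions.col
val employeesByCity = employeesDf
  .select("name", "city")              // drop the irrelevant columns first
  .filter(col("city") === "Bangalore") // filter rows before the expensive groupBy
  .groupBy("city")
  .count()
employeesByCity.show()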
(Diagram: two query plans, each with Scan, Filter and Aggregate stages, comparing where the filter is applied.)
Shuffling
Shuffling is the mechanism Spark uses to redistribute data across different executors and even across
machines. A Spark shuffle is triggered when we perform certain transformation operations like groupByKey(),
reduceByKey(), or join() on an RDD or DataFrame. It involves:
● Disk I/O
● Data serialization and deserialization
● Network I/O
Reduce Shuffle Operation
We cannot completely avoid shuffle operations, but when possible we should try to reduce their number and
remove any unused operations.
Spark provides the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions; by tuning this property
you can improve Spark performance.
spark.conf.set("spark.sql.shuffle.partitions",100)
Here 100 is the shuffle partition count. We can tune this number by trial and error based on data size: if we have less data we
don’t need 100 shuffle partitions, and if we have much bigger data and can execute a large number of parallel tasks we can increase
it to 200 or more.
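Another common way to shrink what gets shuffled, shown here as a sketch rather than something from the slides, is to prefer pre-aggregating transformations: reduceByKey combines values on the map side before the shuffle, whereas groupByKey ships every individual record across the network.
val words = spark.sparkContext.parallelize(Seq("spark", "spark", "tuning"))
val pairs = words.map(w => (w, 1))
val heavy = pairs.groupByKey().mapValues(_.sum) // shuffles every (word, 1) pair
val light = pairs.reduceByKey(_ + _)            // pre-aggregates per partition, then shuffles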
File Format
Suppose we have a pipeline like this: DataSource > SparkJob1 > Database2 > SparkJob2 > Database3.
SparkJob1 reads the data from the source and writes it to Database2; SparkJob2 then reads
from Database2, performs its calculation and writes to Database3.
So Database2 involves both writing the data out and reading it back in.
In this scenario we should prefer writing the intermediate data in serialized, optimized formats like Avro or Parquet,
etc.
Any transformation on these formats performs better than on text, CSV, or JSON.
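A minimal sketch of the intermediate hand-off in Parquet (resultOfJob1 and the path are assumptions): SparkJob1 writes its output as Parquet and SparkJob2 reads it back for the next stage.
// In SparkJob1: write the intermediate result in a columnar, compressed format
resultOfJob1.write.mode("overwrite").parquet("warehouse/intermediate/flights")
// In SparkJob2: read it back; column pruning and predicate pushdown now apply
val inputOfJob2 = spark.read.parquet("warehouse/intermediate/flights")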
Executor Config
● JOB > Stage > Task
● One job can have multiple stages; one stage can have multiple tasks.
● Number of cores = number of parallel tasks per executor.
● We have to give a proper number of cores to each executor in order to optimise the resources.
● Allocating too many cores to each executor leads to more parallel tasks on each executor, which can
cause out-of-memory (OOM) errors.
● Allocating too few cores per executor reduces the parallelism and loses its benefit. The executor
memory will also not be fully utilised.
● After many iterations, people recommend allocating 5 cores per executor to get the maximum benefit of
parallelism with proper memory usage.
./bin/spark-submit --driver-memory 8G --executor-memory 16G --num-executors 3 --executor-cores 5
Memory Tuning
There are three considerations in tuning memory usage:
● the amount of memory used by your objects (you may want your entire dataset to fit in memory),
● the cost of accessing those objects, and
● the overhead of garbage collection.
● Simple data types such as strings use less storage space than linked lists and maps, since those objects have not only a
header but also pointers (typically 8 bytes each) to the next object in the list.
● We can also optimise memory usage by storing data in a serialized format.
● Java objects are fast to access but consume 2-5 times more space than the “raw” data inside their fields.
● Using data structures with fewer objects and caching data in serialized form helps reduce the garbage
collection cost. Broadcast variables also help us reduce GC.
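As a sketch of the serialized-caching point (assuming an existing SparkSession named spark): persisting an RDD with a serialized storage level stores each partition as one large byte buffer, which is more space-efficient and cheaper for the garbage collector, at the cost of deserializing on access.
import org.apache.spark.storage.StorageLevel
val numbers = spark.sparkContext.parallelize(1 to 1000000)
numbers.persist(StorageLevel.MEMORY_ONLY_SER) // serialized in memory: fewer, larger objects on the heap
println(numbers.count())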
Thank You !
Get in touch with us:
Amit Raj
Senior Data Engineer
IIT Kharagpur
amitraj.iitkgp@gmail.com / 7548095242