By: Amit Raj IIT Kharagpur
Apache Spark Performance
Tuning and Best Practices
Our Agenda
01 Spark Introduction
02 Code Level Optimization
03 Outside Code Technique
04 Demo
05 Summary
Introduction
● Apache Spark is an open-source, in-memory computation framework.
● It gives high performance for both batch and streaming jobs.
● It is built for big data processing.
● It is roughly 100 times faster than MapReduce because of in-memory computation.
Because big data applications make heavy use of resources such as CPU, RAM and storage, optimising
one or more of them together leads to significant cost savings.
In the next 40 minutes we will learn about the approaches that help us do so.
Ways to Optimise
Code Level:-
Here we will learn the best practices to follow in order to achieve high performance with minimal
resources, such as:- Caching, Broadcasting, Serialization, using DataSet/DataFrame over RDD, Avoiding
UDFs, Filtering Data at the Earliest, Reducing Shuffle
Beyond Code:-
Here we will learn to tune configuration parameters and cluster resources, such as:-
File Format, Level of Parallelism, Executor Config, Memory Tuning, Batch Interval
Major Bottleneck
● CPU
● Network Bandwidth
● Memory
Our goal is to optimise each of them as much as possible in order to reduce the resources used
and the computation time, and so achieve optimum performance.
Caching
Suppose in our analytics project we have a text file of flight records; we have to read it, get the number of flights leaving
from a particular country, and reuse that result multiple times.
● Raw data is in a text file
● Read the text file as DF1
● Group by origin country to get DF2
Caching
JOB1:- Number of flights leaving the US as DF3
JOB2:- Number of flights leaving Singapore as DF4
JOB3:- Number of flights leaving India as DF5
Execution plan for JOB1 :- DF1 > DF2 > DF3
Execution plan for JOB2 :- DF1 > DF2 > DF4; after caching it becomes DF2 > DF4, with no need for the DF1 > DF2 step.
Execution plan for JOB3 :- DF1 > DF2 > DF5; after caching it becomes DF2 > DF5, with no need for the DF1 > DF2 step.
Instead of recomputing DF1 and DF2 again, we cache the last reusable DataFrame in memory so that we can
use it in the other jobs, reducing computation resources and saving time.
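A minimal Scala sketch of this caching pattern (the file path and column names are assumptions for illustration, not from the slides): DF2 is cached once and all three jobs start from it instead of re-reading the text file.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count
val spark = SparkSession.builder().appName("CachingExample").master("local[*]").getOrCreate()
import spark.implicits._
// DF1: read the raw flight data
val df1 = spark.read.option("header", "true").csv("data/flights.csv")
// DF2: flights grouped by origin country; this result is reused by every job below
val df2 = df1.groupBy("origin_country").agg(count("*").as("num_flights"))
df2.cache() // keep DF2 in memory so the DF1 > DF2 step is not recomputed
// JOB1, JOB2, JOB3 all start from the cached DF2 instead of the text file
val df3 = df2.filter($"origin_country" === "United States")
val df4 = df2.filter($"origin_country" === "Singapore")
val df5 = df2.filter($"origin_country" === "India")
df3.show(); df4.show(); df5.show()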
Broadcasting
A broadcast variable allows us to keep a read-only variable cached on each executor, so we don’t have to send it with
every task, which reduces network bandwidth usage and time.
When to Use a Broadcast Variable:-
Suppose we have lookup data that needs to be used by each executor while performing its tasks.
We have 100 partitions on a 10-executor cluster (every executor takes care of 10 partitions), so
we need to execute at least 100 tasks and would have to send the lookup data 100 times to the executors (once with every task).
But if we use broadcast, we only need to send the lookup data to each executor once, so only 10 copies are
sent.
Benefit = sending 100 copies vs sending 10 copies
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
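As a sketch of how the broadcast values are then consumed (the sample rows and column layout are assumptions): each task reads the lookup maps through .value, and the maps were shipped to every executor only once when the broadcast variables were created.
import spark.implicits._ // assumes the existing SparkSession named spark
val people = Seq(("James", "NY", "USA"), ("Maria", "CA", "USA"), ("Ravi", "FL", "IN"))
  .toDF("name", "state", "country")
// Resolve the full names via the broadcast maps inside each task
val resolved = people.map { row =>
  val state = broadcastStates.value.getOrElse(row.getString(1), row.getString(1))
  val country = broadcastCountries.value.getOrElse(row.getString(2), row.getString(2))
  (row.getString(0), state, country)
}.toDF("name", "state", "country")
resolved.show()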
Broadcasting (continued)
In the referenced diagram (not reproduced here), m is the broadcast variable; it sits in the memory of each executor and is used during task execution.
Hence the driver does not need to ship the variable m with every task, which reduces network I/O and time.
Serialization
Serialization is needed when we write data to some storage, and de-serialization is needed
when we read data back from a source (the slide illustrates this with a diagram).
In the Spark ecosystem we deal with both of them constantly, for example while caching, broadcasting, and shuffling.
Hence it becomes very important to optimize the serialization process.
Serialization
Kryo serialization over Java serialization:-
Kryo is up to 10 times faster and more compact than Java serialization, but it doesn’t support all serializable types and requires
you to register the classes you use with it.
val spark = SparkSession.builder().appName("Broadcast").master("local").getOrCreate()
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
A further optimization is to register your classes with Kryo in advance, especially when rows are large: if you don’t register a class,
Kryo stores the full class name with each serialized object (for every row).
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[Foo]))
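One caveat worth noting: spark.serializer generally has to be in place before the executors are created, so a common approach is to set it on a SparkConf that is passed to the session builder. A minimal sketch, with Foo as a stand-in example class:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
case class Foo(id: Int, name: String) // example class to register with Kryo
val conf = new SparkConf()
  .setAppName("KryoExample")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[Foo]))
val spark = SparkSession.builder().config(conf).getOrCreate()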
DataSet/DataFrame over RDD
RDDs serialize and deserialize data whenever they distribute it across the cluster, such as during repartition
and shuffle, and serialization and de-serialization are very expensive operations in Spark.
DataFrames, on the other hand, store data as binary in off-heap storage, so there is no need to deserialize and serialize
the data when it is distributed across the cluster. This is why we see a big performance improvement in DataFrames over RDDs.
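A small sketch of the contrast (assuming an existing SparkSession named spark): the same aggregation done with an RDD shuffles serialized records, while the DataFrame version keeps rows in Spark's compact binary format and lets Catalyst optimise the plan.
import spark.implicits._
// RDD version: records are serialized/deserialized around the shuffle
val pairRdd = spark.sparkContext.parallelize(Seq(("US", 1), ("IN", 1), ("US", 1)))
val rddCounts = pairRdd.reduceByKey(_ + _)
// DataFrame version: data stays in binary (off-heap) form during the shuffle
val dfCounts = Seq(("US", 1), ("IN", 1), ("US", 1))
  .toDF("country", "cnt")
  .groupBy("country")
  .sum("cnt")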
Avoid UDF
When we use UDFs we end up losing the optimizations Spark applies to our DataFrame/Dataset. Hence,
whenever an inbuilt Spark function can do the job we should use it and avoid UDFs as much as possible.
If by any chance we do have to use one, we first define a function like a normal Scala function and then
register it with Spark's UDF facility:
● val plusOne = udf((x: Int) => x + 1) //defined function
● spark.udf.register("plusOne", plusOne) //register udf
● spark.sql("SELECT plusOne(5)").show() // calling udf
// +------+
// |UDF(5)|
// +------+
// |     6|
// +------+
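Where possible, the same logic can be expressed with built-in column functions so that Catalyst can still optimise it; a quick sketch (the sample DataFrame is an assumption):
import org.apache.spark.sql.functions.col
import spark.implicits._
val numbers = Seq(5, 10).toDF("value")
// Built-in column arithmetic: transparent to the optimizer
numbers.select((col("value") + 1).as("plus_one")).show()
// The UDF version from above gives the same result, but is a black box to the optimizer
numbers.select(plusOne(col("value")).as("plus_one")).show()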
Filter Data at Earliest
Example:- Suppose we have a data set of employees with columns like employee number, age, gender, salary, department, city, address,
past experience, marital status, etc.
But we only have to find the number of employees belonging to a particular city. In this case we perform a groupBy operation on the city column
and every other column becomes irrelevant, so select and filter only what the query needs as early as possible.
df.select("name", "city").groupBy("city").count().show()
df.groupBy("city").count().select("city", "count").show()
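A hedged sketch of the recommended shape (employeesDf and its column names are assumptions): keep only the needed columns and apply the row filter before the aggregation so that less data reaches the shuffle.
import org.apache.spark.sql.functions.col
val employeesByCity = employeesDf
  .select("name", "city")              // drop the irrelevant columns first
  .filter(col("city") === "Bangalore") // filter rows before the expensive groupBy
  .groupBy("city")
  .count()
employeesByCity.show()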
(Diagram: two query plans, each with Scan, Filter and Aggregate stages, comparing where the filter is applied.)
Shuffling
Shuffling is the mechanism Spark uses to redistribute data across different executors and even across
machines. A Spark shuffle is triggered when we perform certain transformation operations like groupByKey(),
reduceByKey(), or join() on an RDD or DataFrame. It involves:
● Disk I/O
● Data serialization and deserialization
● Network I/O
Reduce Shuffle Operation
We cannot completely avoid shuffle operations, but when possible we should try to reduce their number and
remove any unused operations.
Spark provides the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions; by tuning this property
you can improve Spark performance.
spark.conf.set("spark.sql.shuffle.partitions",100)
Here 100 is the shuffle partition count. We can tune this number by trial and error based on data size: if we have less data we
don’t need 100 shuffle partitions, and if we have much bigger data and can execute a large number of parallel tasks we can increase
it to 200 or more.
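Another common way to shrink what gets shuffled, shown here as a sketch rather than something from the slides, is to prefer pre-aggregating transformations: reduceByKey combines values on the map side before the shuffle, whereas groupByKey ships every individual record across the network.
val words = spark.sparkContext.parallelize(Seq("spark", "spark", "tuning"))
val pairs = words.map(w => (w, 1))
val heavy = pairs.groupByKey().mapValues(_.sum) // shuffles every (word, 1) pair
val light = pairs.reduceByKey(_ + _)            // pre-aggregates per partition, then shuffles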
File Format
Suppose we have a pipeline like this: DataSource > SparkJob1 > Database2 > SparkJob2 > Database3.
SparkJob1 reads the data from the source and writes it to Database2; SparkJob2 then reads
from Database2, performs its calculation and writes to Database3.
So Database2 involves both writing the data out and reading it back in.
In this scenario we should prefer writing the intermediate data in serialized, optimized formats like Avro or Parquet,
etc.
Any transformation on these formats performs better than on text, CSV, or JSON.
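A minimal sketch of the intermediate hand-off in Parquet (resultOfJob1 and the path are assumptions): SparkJob1 writes its output as Parquet and SparkJob2 reads it back for the next stage.
// In SparkJob1: write the intermediate result in a columnar, compressed format
resultOfJob1.write.mode("overwrite").parquet("warehouse/intermediate/flights")
// In SparkJob2: read it back; column pruning and predicate pushdown now apply
val inputOfJob2 = spark.read.parquet("warehouse/intermediate/flights")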
Executor Config
● JOB > Stage > Task
● One job can have multiple stages; one stage can have multiple tasks.
● Number of cores = number of parallel tasks per executor.
● We have to give a proper number of cores to each executor in order to optimise the resources.
● Allocating too many cores to each executor leads to more parallel tasks on each executor, which can
cause out-of-memory (OOM) errors.
● Allocating too few cores per executor reduces the parallelism and loses its benefit. The executor
memory will also not be fully utilised.
● After many iterations, people recommend allocating 5 cores per executor to get the maximum benefit of
parallelism with proper memory usage.
./bin/spark-submit --driver-memory 8G --executor-memory 16G --num-executors 3 --executor-cores 5
Memory Tuning
There are three considerations in tuning memory usage:
● the amount of memory used by your objects (you may want your entire dataset to fit in memory),
● the cost of accessing those objects, and
● the overhead of garbage collection.
● Simple data types such as strings use less storage space than linked lists and maps, since those objects have not only a
header but also pointers (typically 8 bytes each) to the next object in the list.
● We can also optimise memory usage by storing data in a serialized format.
● Java objects are fast to access but consume 2-5 times more space than the “raw” data inside their fields.
● Using data structures with fewer objects and caching data in serialized form helps reduce the garbage
collection cost. Broadcast variables also help us reduce GC.
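As a sketch of the serialized-caching point (assuming an existing SparkSession named spark): persisting an RDD with a serialized storage level stores each partition as one large byte buffer, which is more space-efficient and cheaper for the garbage collector, at the cost of deserializing on access.
import org.apache.spark.storage.StorageLevel
val numbers = spark.sparkContext.parallelize(1 to 1000000)
numbers.persist(StorageLevel.MEMORY_ONLY_SER) // serialized in memory: fewer, larger objects on the heap
println(numbers.count())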
Thank You !
Get in touch with us:
Amit Raj
Senior Data Engineer
IIT Kharagpur
amitraj.iitkgp@gmail.com / 7548095242