Apache Spark at Scale: A 60 TB+ production use case
Sital Kedia
Facebook
Agenda
• Use case: Entity ranking
• Previous Hive implementation
• Spark implementation
• Performance comparison
• Reliability improvements
• Performance improvements
• Configuration tuning
Use case: Entity ranking
• Used to serve real-time queries that rank entities
• Entities can be users, places, pages, etc.
• Raw features are generated offline using Hive and loaded onto the system for real-time querying
Previous Hive implementation
INSERT OVERWRITE TABLE tmp_table1
PARTITION ( . . .)
SELECT entity_id, target_id, feature_id, feature_value
FROM input_table
WHERE ...

INSERT OVERWRITE TABLE tmp_table2
PARTITION ( . . .)
SELECT entity_id, target_id, AGG(feature_id, feature_value)
FROM tmp_table1

SELECT TRANSFORM (entity_id % SHARDS as shard_id, ...)
USING 'indexer' -- writes indexed files to hdfs
AS shard_id, status
FROM tmp_table2
[Diagram: Input table → (Filter) → tmp_table1 → (Aggregate) → tmp_table2 → (Shard) → indexed hdfs_files]
• 60 TB+ compressed input data size
• Split into hundreds of smaller Hive jobs sharded by entity id
• Unmanageable and slow
Spark implementation
SELECT TRANSFORM (shard_id, . . .)
USING 'indexer'
AS shard_id, status
FROM (
  SELECT entity_id % SHARDS as shard_id, entity_id, target_id, AGG ( ...)
  FROM input_table
  WHERE ...
  GROUP BY shard_id, entity_id, feature_id, target_id
  CLUSTER BY shard_id
) AS T
[Diagram: Input table → indexed hdfs files]
• Single job with 2 stages
• Shuffles 90 TB+ compressed intermediate data
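As a rough illustration of what the single query computes, the plain-Python sketch below derives shard_id, groups by (shard_id, entity_id, feature_id, target_id), aggregates, and then clusters the rows by shard. The slides leave AGG, SHARDS, and the WHERE predicate unspecified, so the sum aggregate, NUM_SHARDS = 4, and the non-null filter are assumptions made only for this example.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical value; the slides only name the constant SHARDS


def rank_features(rows, num_shards=NUM_SHARDS):
    """rows: iterable of (entity_id, target_id, feature_id, feature_value)."""
    # GROUP BY shard_id, entity_id, feature_id, target_id,
    # using sum as a stand-in for the unspecified AGG.
    grouped = defaultdict(float)
    for entity_id, target_id, feature_id, value in rows:
        if value is None:  # stand-in for the elided WHERE clause
            continue
        shard_id = entity_id % num_shards
        grouped[(shard_id, entity_id, feature_id, target_id)] += value

    # CLUSTER BY shard_id: bring every row of a shard together.
    shards = defaultdict(list)
    for (shard_id, entity_id, feature_id, target_id), agg in sorted(grouped.items()):
        shards[shard_id].append((entity_id, target_id, feature_id, agg))
    return dict(shards)


rows = [
    (5, 100, 1, 0.5),
    (5, 100, 1, 1.5),
    (6, 200, 2, 3.0),
    (9, 300, 1, None),
]
result = rank_features(rows)
```

Each value of result would then be handed to the 'indexer' step for that shard.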
Performance comparison
CPU time
[Chart: CPU time (in cpu-days) for Job 1 and Job 2, Hive vs. Spark]
• Collected from the OS proc file system
• Aggregated across all executors
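The slides do not say which proc file was read. One plausible source on Linux is /proc/<pid>/stat, where fields 14 (utime) and 15 (stime) hold the process's CPU time in clock ticks. The sketch below parses such lines and sums them into cpu-days; the function names and the default of 100 ticks per second are assumptions.

```python
SECONDS_PER_DAY = 86_400


def cpu_seconds_from_stat(stat_line, ticks_per_second=100):
    """Return user + kernel CPU seconds from one /proc/<pid>/stat line."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split only the text that follows the last ')'.
    fields = stat_line[stat_line.rindex(")") + 2:].split()
    utime_ticks = int(fields[11])  # field 14: time spent in user mode
    stime_ticks = int(fields[12])  # field 15: time spent in kernel mode
    return (utime_ticks + stime_ticks) / ticks_per_second


def total_cpu_days(stat_lines, ticks_per_second=100):
    """Aggregate CPU time across all executors, in cpu-days."""
    total = sum(cpu_seconds_from_stat(line, ticks_per_second) for line in stat_lines)
    return total / SECONDS_PER_DAY


# Made-up sample line: 4,320,000 ticks of utime and of stime.
sample = "1234 (java exec) S 1 1234 1234 0 -1 4194304 100 0 0 0 4320000 4320000 0 0 20 0 50 0 100 0 0"
cpu_days = total_cpu_days([sample, sample])
```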
CPU Reservation time
[Chart: CPU Reservation time (in cpu-days) for Job 1 and Job 2, Hive vs. Spark]
• Executor run time * spark.executor.cores
• Aggregated across all executors
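The reservation metric above is simple arithmetic; a minimal sketch, assuming run times are available in seconds per executor (the function name and sample values are made up for illustration):

```python
SECONDS_PER_DAY = 86_400


def cpu_reservation_days(executor_run_times_s, executor_cores):
    """Sum of (executor run time * spark.executor.cores), in cpu-days."""
    reserved_core_seconds = sum(
        run_time * executor_cores for run_time in executor_run_times_s
    )
    return reserved_core_seconds / SECONDS_PER_DAY


# Three executors that ran for 12, 24, and 36 hours with 4 cores each.
run_times = [43_200, 86_400, 129_600]
reservation = cpu_reservation_days(run_times, executor_cores=4)
```

Unlike CPU time, this counts cores for as long as they are held, whether or not they are busy.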
Latency
[Chart: Latency (in hours) for Job 1 and Job 2, Hive vs. Spark]
• End-to-end latency of the job
Reliability improvements
Fix memory leak in the sorter
SPARK-14363
[Diagram labels: Record, Pointer Array, Sort, Spill to Disk, local disk, Free memory Pages, Memory Page, Memory leak]
Seamless cluster restart
[Diagram components: Driver (Task Scheduler, Backend Scheduler), Resource Manager, Zookeeper, and Workers, each with a Shuffle service, Executors, and a local disk]
[Diagram step labels: Ignore all task failures; Cluster Manager down notification; Register the shuffle files; Cluster Manager up notification; Stop ignoring task failures; Request resources; Launch executors]
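The "ignore task failures while the cluster manager is down" behaviour can be sketched as a small state machine. This is only an illustration of the idea in the diagram; TaskFailureTracker and its methods are hypothetical names, not Spark's scheduler classes, and the failure limit only loosely mirrors spark.task.maxFailures.

```python
class TaskFailureTracker:
    """Counts task failures, except while the cluster manager is down."""

    def __init__(self, max_failures=4):
        self.max_failures = max_failures
        self.cluster_manager_up = True
        self.failures = {}

    def on_cluster_manager_down(self):
        # Down notification: start ignoring task failures.
        self.cluster_manager_up = False

    def on_cluster_manager_up(self):
        # Up notification: stop ignoring task failures.
        self.cluster_manager_up = True

    def on_task_failure(self, task_id):
        """Return True if the job should be aborted."""
        if not self.cluster_manager_up:
            return False  # failure caused by the restart, do not count it
        self.failures[task_id] = self.failures.get(task_id, 0) + 1
        return self.failures[task_id] >= self.max_failures


tracker = TaskFailureTracker(max_failures=2)
tracker.on_cluster_manager_down()
during_restart = [tracker.on_task_failure("task-1") for _ in range(3)]
tracker.on_cluster_manager_up()
after_restart = [tracker.on_task_failure("task-1") for _ in range(2)]
```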
Other reliability improvements
• Various memory leak fixes (SPARK-13958 and SPARK-17113)
• Make PipedRDD robust to fetch failures (SPARK-13793)
• Configurable max number of fetch failures (SPARK-13369)
• Unresponsive driver (SPARK-13279)
• TimSort issue due to integer overflow for large buffer (SPARK-13850)
Performance improvements
Tools
• Spark UI metrics
• Thread dump from Spark UI
• Flame Graph
[Diagram labels: Worker, Executor, Periodic Jstack/Perf, Filter executor threads, Jstack aggregator service]
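The filter-and-aggregate step of the flame graph pipeline could look roughly like the sketch below, which keeps only executor task threads and collapses identical stacks into the semicolon-separated "folded" lines that flame graph tools commonly consume. The input shape, the function name, and the thread-name prefix are assumptions; the slides do not describe the aggregator's actual format.

```python
from collections import Counter


def fold_stacks(samples, thread_prefix="Executor task launch worker"):
    """samples: iterable of (thread_name, frames), frames listed leaf first
    (the order in which jstack prints them)."""
    counts = Counter()
    for thread_name, frames in samples:
        if not thread_name.startswith(thread_prefix):
            continue  # filter: keep executor task threads only
        # Folded format wants root first, so reverse the jstack order.
        counts[";".join(reversed(frames))] += 1
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


samples = [
    ("Executor task launch worker-0", ["sort", "insertRecord", "run"]),
    ("Executor task launch worker-1", ["sort", "insertRecord", "run"]),
    ("Executor task launch worker-0", ["write", "run"]),
    ("dispatcher-event-loop-0", ["park", "run"]),
]
folded = fold_stacks(samples)
```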
Reduce shuffle write latency
SPARK-5581 (Up to 50% speed-up)
[Diagram, two versions: a Map task sorts and spills, then writes three shuffle partitions; the first version is annotated with "file open" and "file close"]
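The "file open" and "file close" labels suggest the cost being removed is reopening the output file for every shuffle partition. The sketch below contrasts the two write patterns in plain Python; it illustrates the idea only and is not Spark's writer code, and both function names are made up.

```python
import os
import tempfile


def write_partitions_reopening(path, partitions):
    """Open and close the output file once per partition."""
    open(path, "wb").close()  # start from an empty file
    for data in partitions:
        with open(path, "ab") as f:
            f.write(data)


def write_partitions_single_handle(path, partitions):
    """Keep one handle open for all partitions and record their offsets."""
    offsets = [0]
    with open(path, "wb") as f:
        for data in partitions:
            f.write(data)
            offsets.append(offsets[-1] + len(data))
    return offsets


partitions = [b"abc", b"", b"hello"]
with tempfile.TemporaryDirectory() as tmp:
    reopened = os.path.join(tmp, "reopen.data")
    single = os.path.join(tmp, "single.data")
    write_partitions_reopening(reopened, partitions)
    offsets = write_partitions_single_handle(single, partitions)
    with open(reopened, "rb") as fa, open(single, "rb") as fb:
        same_bytes = fa.read() == fb.read()
```

Both variants produce the same bytes; the second simply avoids one open/close pair per partition, which adds up when a map task writes many partitions.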
Zero copy based Spill file reader
SPARK-17839 (Up to 7% speed-up)
[Diagram, before: Spill Reader using BufferedInputStream; labels: Disk, Buffer cache, Kernel context, Application context, Application Buffer, DMA copy, CPU copy]
[Diagram, after: Spill Reader using NioBufferedInputStream; labels: Disk, Kernel context, Application context, Application Buffer]
Cache index files on shuffle server
SPARK-15074
[Diagram, before: for each Shuffle fetch from a Reducer, the Shuffle service reads the index file and then reads the partition]
[Diagram, after: the Shuffle service reads and caches the index file once, then reads the partition for each Shuffle fetch]
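A minimal sketch of the caching idea, assuming an index file is a sequence of 8-byte big-endian offsets with one more entry than there are partitions. IndexCache, its LRU policy, and the sample file name are illustrative choices, not the shuffle service's actual implementation.

```python
import struct
from collections import OrderedDict


class IndexCache:
    """LRU cache of parsed index files, keyed by path."""

    def __init__(self, read_file, max_entries=1024):
        self._read_file = read_file  # callable: path -> bytes
        self._max_entries = max_entries
        self._cache = OrderedDict()
        self.disk_reads = 0

    def _offsets(self, index_path):
        if index_path in self._cache:
            self._cache.move_to_end(index_path)  # mark as recently used
            return self._cache[index_path]
        raw = self._read_file(index_path)
        self.disk_reads += 1
        offsets = struct.unpack(f">{len(raw) // 8}q", raw)
        self._cache[index_path] = offsets
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return offsets

    def partition_range(self, index_path, reduce_id):
        """Return (start, end) byte offsets of one partition in the data file."""
        offsets = self._offsets(index_path)
        return offsets[reduce_id], offsets[reduce_id + 1]


# In-memory stand-in for the index files on the shuffle server's disk.
files = {"shuffle_0_0_0.index": struct.pack(">4q", 0, 10, 10, 25)}
cache = IndexCache(files.__getitem__)
ranges = [cache.partition_range("shuffle_0_0_0.index", r) for r in range(3)]
```

Three reducers fetch from the same map output here, but the index file is read only once.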
Other performance improvements
• Snappy optimization (SPARK-14277)
• Fix duplicate task run issue due to fetch failure (SPARK-14649)
• Configurable buffer size for PipedRDD (SPARK-14542)
• Reduce update frequency of shuffle bytes written metrics (SPARK-15569)
• Configurable initial buffer size for Sorter (SPARK-15958)
Configuration tuning
• Memory configurations
  • spark.memory.offHeap.enabled = true
  • spark.executor.memory = 3g
  • spark.memory.offHeap.size = 3g
• Use parallel GC instead of G1GC
  • spark.executor.extraJavaOptions = -XX:+UseParallelGC
• Enable dynamic executor allocation
  • spark.dynamicAllocation.enabled = true
• Tune Shuffle service
  • spark.shuffle.io.serverThreads = 128
  • spark.shuffle.io.backLog = 8192
• Buffer size configurations
  • spark.unsafe.sorter.spill.reader.buffer.size = 2m
  • spark.shuffle.file.buffer = 1m
  • spark.shuffle.sort.initialBufferSize = 4194304
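One way to carry these settings into a job is as --conf pairs on the spark-submit command line. The sketch below builds that argument list from the values on the slides, with the GC flag written in the standard -XX:+UseParallelGC form. The values were tuned for this particular workload and are not general recommendations; the helper name is made up.

```python
# Settings as listed on the configuration tuning slides.
TUNED_CONF = {
    "spark.memory.offHeap.enabled": "true",
    "spark.executor.memory": "3g",
    "spark.memory.offHeap.size": "3g",
    "spark.executor.extraJavaOptions": "-XX:+UseParallelGC",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.io.serverThreads": "128",
    "spark.shuffle.io.backLog": "8192",
    "spark.unsafe.sorter.spill.reader.buffer.size": "2m",
    "spark.shuffle.file.buffer": "1m",
    "spark.shuffle.sort.initialBufferSize": "4194304",
}


def spark_submit_args(conf):
    """Turn a {key: value} mapping into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = spark_submit_args(TUNED_CONF)
```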
Resource
• Apache Spark @Scale: A 60 TB+ production use case
Questions?


Spark Summit EU talk by Sital Kedia
