Spark Pitfalls meetup UnderscoreIL

Spark Pitfalls
May 15, 2018
Lior Regev

#whoami
● Lior Regev
● Chief architect @ Endor
● Working on big data ETL jobs for the past 3
years
● Using Spark for 2 years
● Number of times I had to open up Spark’s
source: 42

Pitfall #1:
Spark is not
Kubernetes
● Types of distribution: Compute vs
Data
● Spark distributes data
● To run heterogeneous tasks:
○ Kubernetes
○ ECS
○ AWS Batch

Pitfall #2:
Partitioning -
repartitioning
● Repartitioning after load
● Pre-partition by group if multiple groupBy are used
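
The pre-partitioning idea can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark code: `hash_partition` and `aggregate_locally` are hypothetical helpers, and `zlib.crc32` stands in for whatever hash the engine uses. It shows that once rows are placed by group key a single time, several per-group aggregations can run inside each partition with no further data movement. In Spark the corresponding step would be a `repartition` on the grouping column before the groupBy calls.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Place every row in the partition chosen by a hash of row[key]."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used only to get a deterministic hash for this sketch
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def aggregate_locally(partitions, key, value, combine, initial):
    """Run a per-group aggregation inside each partition; no rows move."""
    result = {}
    for part in partitions:
        local = defaultdict(lambda: initial)
        for row in part:
            local[row[key]] = combine(local[row[key]], row[value])
        # A key lives in exactly one partition, so a plain update is safe
        result.update(local)
    return result


rows = [
    {"user": "a", "amount": 3},
    {"user": "b", "amount": 2},
    {"user": "a", "amount": 5},
    {"user": "c", "amount": 7},
]

# One "shuffle" up front, then two group-wise aggregations reuse the layout
partitions = hash_partition(rows, "user", 4)
totals = aggregate_locally(partitions, "user", "amount", lambda x, y: x + y, 0)
maxima = aggregate_locally(partitions, "user", "amount", max, 0)
print(totals, maxima)
```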

Pitfall #2:
Partitioning -
partition size
● Partitioning should be guided by task
time
● I go for 1 minute per task:
○ Too short => too much overhead
○ Too long => costly retries
● Check the Spark UI for your task time
statistics
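
As a rough illustration of sizing partitions by task time, the sketch below rescales a partition count so that the median task lands near the 1-minute target. `suggest_partitions` is a hypothetical helper, and it assumes task time grows roughly linearly with partition size, which will not hold for every job.

```python
import math


def suggest_partitions(current_partitions, median_task_seconds, target_seconds=60.0):
    """Scale the partition count so the median task lands near the target.

    Assumes task time is roughly proportional to the data in a partition.
    """
    if current_partitions < 1 or median_task_seconds <= 0 or target_seconds <= 0:
        raise ValueError("partition count and durations must be positive")
    # Total work stays the same; only the number of slices changes
    total_work_seconds = current_partitions * median_task_seconds
    return max(1, math.ceil(total_work_seconds / target_seconds))


# Median task time as read from the Spark UI (values here are made up)
print(suggest_partitions(200, 300))  # 5-minute tasks: use more partitions
print(suggest_partitions(200, 6))    # 6-second tasks: use fewer partitions
```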

Pitfall #2: Partitioning - skewed data
● Sometimes one task takes significantly longer
● This is usually because of skewed data
● Try to repartition
● Use UDAFs and internal aggregation functions
● Watch for that one “null” user: a single key that holds a large share of the rows
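
One reason aggregation functions help with skew is that they can combine rows inside each partition before anything is shuffled, so a hot key contributes one partial result per partition instead of all of its rows. The following is a minimal pure-Python sketch of that two-step pattern, with made-up data and hypothetical helper names; it is not Spark's implementation.

```python
from collections import Counter


def partial_counts(partition):
    """Step 1: count rows per user inside one partition (no data movement)."""
    return Counter(row["user"] for row in partition)


def merge_counts(partials):
    """Step 2: merge the small per-partition results into final counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up data: the "null" user (None) dominates both partitions
partitions = [
    [{"user": None}] * 1000 + [{"user": "a"}],
    [{"user": None}] * 1000 + [{"user": "b"}],
]

partials = [partial_counts(p) for p in partitions]
raw_rows = sum(len(p) for p in partitions)
shuffled_records = sum(len(p) for p in partials)
totals = merge_counts(partials)

print(raw_rows, shuffled_records, totals[None])
```

If the “null” key carries no meaning for the job, filtering it out before the aggregation is usually the simpler fix.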

Pitfall #3: Non-optimized algorithms are OK
● Optimizing a process to run 10 times
faster is fun.
● Scaling that process to a bazillion (50)
nodes is even better.
● The built-in optimized functions might
not be the best fit for your case

Pitfall #4:
Not reading docs
and source
When running into an issue, try reading the
docs.
Examples:
● Spark’s default number of shuffle partitions
(spark.sql.shuffle.partitions, 200 by default)
may not fit your job
● When reading files, use
spark.sql.files.maxPartitionBytes to
control how much data, and therefore how
many files, is packed into each partition.
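
To get a feel for how spark.sql.files.maxPartitionBytes affects the number of read partitions, here is a simplified pure-Python model. `estimate_read_partitions` is a hypothetical helper that loosely follows the way Spark packs file splits: large files are cut into chunks, and each file adds an "open cost" when packed. The defaults of 128 MB and 4 MB match Spark's documented defaults for spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, but the real planner also takes the number of available cores into account, so treat the result as an estimate only.

```python
MB = 1024 * 1024


def estimate_read_partitions(file_sizes, max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Estimate how many partitions a file scan would produce (simplified)."""
    # Cut every file into chunks no larger than max_partition_bytes
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunk = min(max_partition_bytes, size - offset)
            chunks.append(chunk)
            offset += chunk

    # Pack chunks, largest first, closing a partition when the next chunk
    # would push it past the limit
    chunks.sort(reverse=True)
    partitions = 0
    current_bytes = 0
    has_chunks = False
    for chunk in chunks:
        if has_chunks and current_bytes + chunk > max_partition_bytes:
            partitions += 1
            current_bytes = 0
            has_chunks = False
        # Each packed chunk also pays the estimated cost of opening a file
        current_bytes += chunk + open_cost_bytes
        has_chunks = True
    if has_chunks:
        partitions += 1
    return partitions


small_files = [1 * MB] * 100
print(estimate_read_partitions(small_files))                               # default limit
print(estimate_read_partitions(small_files, max_partition_bytes=32 * MB))  # lower limit
print(estimate_read_partitions([300 * MB]))                                # one large file
```

In this model, lowering the limit spreads the same small files over more partitions, and a single large file is split into several.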

Pitfall #5:
Debugging big
data - local
driver
At some point you will run into data you did not
expect:
● Remote cluster + local driver
● Works, but:
○ Executors are not debuggable
○ Impact on performance
○ You want a small test-case anyway

Pitfall #5:
Debugging big
data - the right
way
● Use a Spark notebook to find the bad
data:
○ EMR
○ Databricks
○ Dataproc
○ etc...
● Create a test-case repro
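
Once the bad data has been located, the repro can be as small as a handful of offending rows run through the failing step locally. The sketch below is one possible shape for that, in plain Python: `collect_failures` is a hypothetical helper and `parse_age` is a stand-in for whatever transformation fails in the real job.

```python
def collect_failures(rows, transform, limit=5):
    """Run transform on each row and keep the first few rows that raise."""
    failures = []
    for row in rows:
        try:
            transform(row)
        except Exception as error:
            # Keep the row and the error so both can go into a test fixture
            failures.append((row, repr(error)))
            if len(failures) >= limit:
                break
    return failures


def parse_age(row):
    """Hypothetical step that breaks on unexpected data."""
    return int(row["age"])


# A small sample pulled from the notebook (values here are made up)
sample = [{"age": "31"}, {"age": "n/a"}, {"age": "27"}, {"age": None}]
bad = collect_failures(sample, parse_age)
for row, error in bad:
    print(row, error)
```

The collected rows can then be saved as a fixture for a unit test that runs without a cluster.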

Thank you
Lior Regev - lioregev@gmail.com
● All of the code is available at: https://github.com/liorregev/SparkPitfallMeetup
● Questions? Your pitfalls?
