Migrating to
Spark at Netflix
Ryan Blue
Spark Summit 2019
Spark at Netflix
● ETL was mostly written in Pig, with some in Hive
● Pipelines required data engineering
● Data engineers had to understand the processing engine
Long ago . . .
[Charts: daily job executions, cluster runtime, S3 bytes read, and S3 bytes written]
Today
● Spark is > 90% of job executions – high tens-of-thousands daily
● Data platform is easier to use and more efficient
● Customers from all parts of the business
Today
How did we get there?
● High-profile Spark features: DataFrames, codegen, etc.
● S3 optimizations and committers
● Parquet filtering, tuning, and compression
● Notebook environment
Not included
Spark deployments
● Rebase
○ Pull in a new version
○ Easy to get new features
○ Easy to break things
Following upstream Spark
● Backport
○ Pick only what’s needed
○ Time consuming
○ Safe?
● Maintain supported versions in parallel using backports
● Periodic rebase to add new minor versions: 1.6, 2.0, 2.1, 2.3
● Recommend version based on actual use and experience
● Requires patching job submission
Netflix: Parallel branches
● Easy to test a job on another branch before investing migration effort
● Avoids coordinating versions across major applications
● Fast iteration: deploy changes several times per week
Benefits of parallel branches
● Unstable branches
● Nightly canaries for stable and unstable
● CI runs unit tests for unstable
● Integration tests validate every deployment
Testing
● 1.6 – scale problems
● 2.0 – a little too unpolished
● 2.1 – solid, with some additional love
● 2.3 – slow migration, faster in some cases
Supported versions
Challenges
● 1.6 is unstable above 500 executors
○ Use of the Actor model caused coarse locking
○ RPC dependencies make lock issues worse
○ Runaway retry storms
● Spark needs distributed tracing
Stability
● Much better in 2.1, plus patches
○ Remove block status data from heartbeats (SPARK-20084)
○ Multi-threaded listener bus (SPARK-18838)
○ Unstable executor requests (SPARK-20540)
● 2.1 and 2.3 still have problems with 100,000+ tasks
○ Applications hang after shutdown
○ Increase job maxPartitionBytes or coalesce
Stability
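The 100,000+ task mitigations above are a couple of standard settings; a minimal sketch, with illustrative values and a made-up input path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fewer-tasks")
  // Larger input splits produce fewer map tasks (value in bytes; default is 128 MB).
  .config("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)
  .getOrCreate()

// Or merge read partitions without a shuffle before the wide stage.
val events = spark.read.parquet("s3://bucket/path/").coalesce(2000)
```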
● Happen all the time at scale
● Scale in several dimensions
○ Large clusters, lots of disks to fail
○ High tens-of-thousands of executions
○ Many executors, many tasks, diverse workloads
Unlikely problems
● Fix CommitCoordinator and OutputCommitter problems
● Turn off YARN preemption in production
● Use cgroups to contain greedy apps
● Use general-purpose features
○ Blacklisting to avoid cascading failure
○ Speculative execution to tolerate slow nodes
○ Adaptive execution reduces risk
Unlikely problems
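The general-purpose features above are plain Spark settings; a hedged sketch with illustrative values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Blacklist executors and nodes with repeated task failures (Spark 2.1+).
  .config("spark.blacklist.enabled", true)
  // Re-launch stragglers elsewhere instead of waiting on a failing disk or node.
  .config("spark.speculation", true)
  .config("spark.speculation.quantile", 0.9) // only after 90% of tasks finish
  .getOrCreate()
```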
● Fix persistent OOM causes
○ Use less driver memory for broadcast joins (SPARK-22170)
○ Add PySpark memory region and limits (SPARK-25004)
○ Base stats on row count, not size on disk
Memory management
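SPARK-25004 landed upstream as a separate memory limit for Python workers; a hedged example (the 2g value is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Cap Python worker memory separately from the JVM heap so a heavy
  // pandas UDF cannot push the container past its YARN limit.
  .config("spark.executor.pyspark.memory", "2g")
  .getOrCreate()
```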
● Educate users about memory regions
○ Spark memory vs JVM memory vs overhead
○ Know what region fixes your problem (e.g., spilling)
○ Never set spark.executor.memory without also setting spark.memory.fraction
Memory management
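A minimal sketch of the regions above and the pairing rule; values are illustrative and property names are as of Spark 2.3:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // JVM heap, shared by Spark's unified memory and user objects.
  .config("spark.executor.memory", "8g")
  // Share of the heap reserved for execution/storage; raise this
  // (not the heap) when the symptom is spilling.
  .config("spark.memory.fraction", 0.8)
  // Off-heap overhead (native buffers, Python processes) counted by YARN.
  .config("spark.executor.memoryOverhead", "2g")
  .getOrCreate()
```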
Best practices
● Avoid RDDs
○ Kryo problems plagued 1.6 apps
○ Let the optimizer improve jobs over time
● Aggressively broadcast
○ Remove the broadcast timeout
○ Set broadcast threshold much higher
Basics
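A hedged sketch of the broadcast settings above; the threshold is illustrative, and "removing" the timeout here just means setting it far beyond any realistic build time:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Default is 300s; a very large value effectively removes the timeout.
  .config("spark.sql.broadcastTimeout", 36000L)
  // Default is 10 MB; broadcast much larger dimension tables (value in bytes).
  .config("spark.sql.autoBroadcastJoinThreshold", 512L * 1024 * 1024)
  .getOrCreate()
```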
● 3 rules:
○ Don’t copy configuration
○ If you don’t know what it does, don’t change it
○ Never change timeouts
● Document defaults and recommendations
Configuration
● Know how to control parallelism
○ spark.sql.shuffle.partitions, spark.sql.files.maxPartitionBytes
○ repartition vs coalesce
● Use the least-intrusive option
○ Set shuffle parallelism high and use adaptive execution
○ Allow Spark to improve
Parallelism
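A sketch of the least-intrusive approach above: set the shuffle-partition ceiling high and let adaptive execution coalesce at runtime. The property names are standard; the numbers and table name are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.shuffle.partitions", 5000L)
  .config("spark.sql.adaptive.enabled", true)
  .getOrCreate()

// repartition(n) shuffles to exactly n partitions; coalesce(n) only merges
// existing partitions (no shuffle), so use it to cheaply reduce parallelism.
val events     = spark.table("db.events")
val rebalanced = events.repartition(2000)
val fewerFiles = rebalanced.coalesce(200)
```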
● Keep tasks in low tens-of-thousands
○ Too many tasks and the driver can’t handle heartbeats
○ Jobs hang for 10+ minutes after shutdown
● Reduce pressure on shuffle service
○ map tasks * reduce tasks = shuffle shards
Avoid wide stages
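The shuffle-shard arithmetic from the slide above, with illustrative task counts:

```scala
// Every reducer fetches a block from every map output, so the shuffle
// service has to serve roughly mapTasks * reduceTasks tiny reads.
val mapTasks      = 40000L
val reduceTasks   = 40000L
val shuffleShards = mapTasks * reduceTasks // 1.6 billion shuffle blocks
```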
● Fixed --num-executors accidents (SPARK-13723)
● Use materialize instead of caching
○ Materialize: convert to RDD, back to DF, and count
○ Stores cache data in shuffle servers
○ Also avoids over-optimization
Dynamic Allocation
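A hedged sketch of the materialize trick above; the helper name is ours, not a Spark API. The RDD round-trip cuts the logical plan, and the count forces execution, so downstream jobs reuse shuffle outputs held by the shuffle service instead of data cached in executor memory.

```scala
import org.apache.spark.sql.DataFrame

def materialize(df: DataFrame): DataFrame = {
  // Convert to RDD and back to put a barrier in the plan (this also stops
  // the optimizer from pushing work across this point).
  val barrier = df.sparkSession.createDataFrame(df.rdd, df.schema)
  // Force execution; upstream shuffle files stay on the shuffle service
  // and are reused when `barrier` is referenced again.
  barrier.count()
  barrier
}
```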
● Add ORDER BY
○ Partition columns, filter columns, and one high cardinality column
● Benefits
○ Cluster by partition columns – minimize output files
○ Cluster by common filter columns – faster reads
○ Automatic skew estimation – faster writes (wall time)
● Needs adaptive execution support
Sort before writing
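Illustrative only: the tables and columns (dateint, event_type, user_id) are made up. The sort lists partition columns first, then a common filter column, then one high-cardinality column so skewed ranges still split across tasks.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

spark.table("db.events_staging")
  // partition column, filter column, high-cardinality column
  .orderBy("dateint", "event_type", "user_id")
  .write
  .partitionBy("dateint")
  .mode("overwrite")
  .saveAsTable("db.events_sorted")
```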
Current problems
● Easy to overload one node
○ Skewed data, not enough threads, GC
● Prevents graceful shrink
● Causes huge runtime variance
Shuffle service
● Collect is wasteful
○ Iterate through compressed result blocks to collect
● Configuration is confusing
○ Memory fraction is often ignored
○ Simpler is better
● Should build broadcast tables on executors
Memory management
● Forked the write path for 2.x releases
○ Consistent rules across “datasource” and Hive tables
○ Remove unsafe operations, like implicit unsafe casts
○ Dynamic partition overwrites and Netflix “batch” pattern
● Fix upstream behavior and consistency with DSv2
● Fix table usability with Iceberg
○ Schema evolution and partitioning
DataSourceV2
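For reference, upstream Spark 2.3+ exposes the dynamic-partition-overwrite behavior mentioned above as a session config for file-based tables; a hedged example with made-up table names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Only partitions present in the incoming data are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql("""
  INSERT OVERWRITE TABLE db.daily_agg PARTITION (dateint)
  SELECT title_id, view_hours, dateint
  FROM db.daily_agg_staging
""")
```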
Thank you!
Questions?
