Migrating to
Spark at Netflix
Ryan Blue
Spark Summit 2019
Spark at Netflix
● ETL was mostly written in Pig, with some in Hive
● Pipelines required data engineering
● Data engineers had to understand the processing engine
Long ago . . .
[Charts: daily job executions, cluster runtime, S3 bytes read, and S3 bytes written]
Today
● Spark is > 90% of job executions – high tens-of-thousands daily
● Data platform is easier to use and more efficient
● Customers from all parts of the business
Today
How did we get there?
● High-profile Spark features: DataFrames, codegen, etc.
● S3 optimizations and committers
● Parquet filtering, tuning, and compression
● Notebook environment
Not included
Spark deployments
● Rebase
○ Pull in a new version
○ Easy to get new features
○ Easy to break things
Following upstream Spark
● Backport
○ Pick only what’s needed
○ Time consuming
○ Safe?
● Maintain supported versions in parallel using backports
● Periodic rebase to add new minor versions: 1.6, 2.0, 2.1, 2.3
● Recommend version based on actual use and experience
● Requires patching job submission
Netflix: Parallel branches
● Easy to test a job on another branch before investing migration effort
● Avoids coordinating versions across major applications
● Fast iteration: deploy changes several times per week
Benefits of parallel branches
● Unstable branches
● Nightly canaries for stable and unstable
● CI runs unit tests for unstable
● Integration tests validate every deployment
Testing
● 1.6 – scale problems
● 2.0 – a little too unpolished
● 2.1 – solid, with some additional love
● 2.3 – slow migration, faster in some cases
Supported versions
Challenges
● 1.6 is unstable above 500 executors
○ Use of the Actor model caused coarse locking
○ RPC dependencies make lock issues worse
○ Runaway retry storms
● Spark needs distributed tracing
Stability
● Much better in 2.1, plus patches
○ Remove block status data from heartbeats (SPARK-20084)
○ Multi-threaded listener bus (SPARK-18838)
○ Unstable executor requests (SPARK-20540)
● 2.1 and 2.3 still have problems with 100,000+ tasks
○ Applications hang after shutdown
○ Increase job maxPartitionBytes or coalesce
Stability
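The 100,000+ task mitigations above are a couple of standard settings; a minimal sketch, with illustrative values and a made-up input path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fewer-tasks")
  // Larger input splits produce fewer map tasks (value in bytes; default is 128 MB).
  .config("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)
  .getOrCreate()

// Or merge read partitions without a shuffle before the wide stage.
val events = spark.read.parquet("s3://bucket/path/").coalesce(2000)
```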
● Happen all the time at scale
● Scale in several dimensions
○ Large clusters, lots of disks to fail
○ High tens-of-thousands of executions
○ Many executors, many tasks, diverse workloads
Unlikely problems
● Fix CommitCoordinator and OutputCommitter problems
● Turn off YARN preemption in production
● Use cgroups to contain greedy apps
● Use general-purpose features
○ Blacklisting to avoid cascading failure
○ Speculative execution to tolerate slow nodes
○ Adaptive execution reduces risk
Unlikely problems
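The general-purpose features above are plain Spark settings; a hedged sketch with illustrative values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Blacklist executors and nodes with repeated task failures (Spark 2.1+).
  .config("spark.blacklist.enabled", true)
  // Re-launch stragglers elsewhere instead of waiting on a failing disk or node.
  .config("spark.speculation", true)
  .config("spark.speculation.quantile", 0.9) // only after 90% of tasks finish
  .getOrCreate()
```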
● Fix persistent OOM causes
○ Use less driver memory for broadcast joins (SPARK-22170)
○ Add PySpark memory region and limits (SPARK-25004)
○ Base stats on row count, not size on disk
Memory management
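SPARK-25004 landed upstream as a separate memory limit for Python workers; a hedged example (the 2g value is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Cap Python worker memory separately from the JVM heap so a heavy
  // pandas UDF cannot push the container past its YARN limit.
  .config("spark.executor.pyspark.memory", "2g")
  .getOrCreate()
```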
● Educate users about memory regions
○ Spark memory vs JVM memory vs overhead
○ Know what region fixes your problem (e.g., spilling)
○ Never set spark.executor.memory without also setting spark.memory.fraction
Memory management
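A minimal sketch of the regions above and the pairing rule; values are illustrative and property names are as of Spark 2.3:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // JVM heap, shared by Spark's unified memory and user objects.
  .config("spark.executor.memory", "8g")
  // Share of the heap reserved for execution/storage; raise this
  // (not the heap) when the symptom is spilling.
  .config("spark.memory.fraction", 0.8)
  // Off-heap overhead (native buffers, Python processes) counted by YARN.
  .config("spark.executor.memoryOverhead", "2g")
  .getOrCreate()
```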
Best practices
● Avoid RDDs
○ Kryo problems plagued 1.6 apps
○ Let the optimizer improve jobs over time
● Aggressively broadcast
○ Remove the broadcast timeout
○ Set broadcast threshold much higher
Basics
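A hedged sketch of the broadcast settings above; the threshold is illustrative, and "removing" the timeout here just means setting it far beyond any realistic build time:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Default is 300s; a very large value effectively removes the timeout.
  .config("spark.sql.broadcastTimeout", 36000L)
  // Default is 10 MB; broadcast much larger dimension tables (value in bytes).
  .config("spark.sql.autoBroadcastJoinThreshold", 512L * 1024 * 1024)
  .getOrCreate()
```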
● 3 rules:
○ Don’t copy configuration
○ If you don’t know what it does, don’t change it
○ Never change timeouts
● Document defaults and recommendations
Configuration
● Know how to control parallelism
○ spark.sql.shuffle.partitions, spark.sql.files.maxPartitionBytes
○ repartition vs coalesce
● Use the least-intrusive option
○ Set shuffle parallelism high and use adaptive execution
○ Allow Spark to improve
Parallelism
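A sketch of the least-intrusive approach above: set the shuffle-partition ceiling high and let adaptive execution coalesce at runtime. The property names are standard; the numbers and table name are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.shuffle.partitions", 5000L)
  .config("spark.sql.adaptive.enabled", true)
  .getOrCreate()

// repartition(n) shuffles to exactly n partitions; coalesce(n) only merges
// existing partitions (no shuffle), so use it to cheaply reduce parallelism.
val events     = spark.table("db.events")
val rebalanced = events.repartition(2000)
val fewerFiles = rebalanced.coalesce(200)
```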
● Keep tasks in low tens-of-thousands
○ Too many tasks and the driver can’t handle heartbeats
○ Jobs hang for 10+ minutes after shutdown
● Reduce pressure on shuffle service
○ map tasks * reduce tasks = shuffle shards
Avoid wide stages
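The shuffle-shard arithmetic from the slide above, with illustrative task counts:

```scala
// Every reducer fetches a block from every map output, so the shuffle
// service has to serve roughly mapTasks * reduceTasks tiny reads.
val mapTasks      = 40000L
val reduceTasks   = 40000L
val shuffleShards = mapTasks * reduceTasks // 1.6 billion shuffle blocks
```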
● Fixed --num-executors accidents (SPARK-13723)
● Use materialize instead of caching
○ Materialize: convert to RDD, back to DF, and count
○ Stores cache data in shuffle servers
○ Also avoids over-optimization
Dynamic Allocation
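A hedged sketch of the materialize trick above; the helper name is ours, not a Spark API. The RDD round-trip cuts the logical plan, and the count forces execution, so downstream jobs reuse shuffle outputs held by the shuffle service instead of data cached in executor memory.

```scala
import org.apache.spark.sql.DataFrame

def materialize(df: DataFrame): DataFrame = {
  // Convert to RDD and back to put a barrier in the plan (this also stops
  // the optimizer from pushing work across this point).
  val barrier = df.sparkSession.createDataFrame(df.rdd, df.schema)
  // Force execution; upstream shuffle files stay on the shuffle service
  // and are reused when `barrier` is referenced again.
  barrier.count()
  barrier
}
```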
● Add ORDER BY
○ Partition columns, filter columns, and one high cardinality column
● Benefits
○ Cluster by partition columns – minimize output files
○ Cluster by common filter columns – faster reads
○ Automatic skew estimation – faster writes (wall time)
● Needs adaptive execution support
Sort before writing
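Illustrative only: the tables and columns (dateint, event_type, user_id) are made up. The sort lists partition columns first, then a common filter column, then one high-cardinality column so skewed ranges still split across tasks.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

spark.table("db.events_staging")
  // partition column, filter column, high-cardinality column
  .orderBy("dateint", "event_type", "user_id")
  .write
  .partitionBy("dateint")
  .mode("overwrite")
  .saveAsTable("db.events_sorted")
```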
Current problems
● Easy to overload one node
○ Skewed data, not enough threads, GC
● Prevents graceful shrink
● Causes huge runtime variance
Shuffle service
● Collect is wasteful
○ Iterate through compressed result blocks to collect
● Configuration is confusing
○ Memory fraction is often ignored
○ Simpler is better
● Should build broadcast tables on executors
Memory management
● Forked the write path for 2.x releases
○ Consistent rules across “datasource” and Hive tables
○ Remove unsafe operations, like implicit unsafe casts
○ Dynamic partition overwrites and Netflix “batch” pattern
● Fix upstream behavior and consistency with DSv2
● Fix table usability with Iceberg
○ Schema evolution and partitioning
DataSourceV2
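For reference, upstream Spark 2.3+ exposes the dynamic-partition-overwrite behavior mentioned above as a session config for file-based tables; a hedged example with made-up table names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Only partitions present in the incoming data are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql("""
  INSERT OVERWRITE TABLE db.daily_agg PARTITION (dateint)
  SELECT title_id, view_hours, dateint
  FROM db.daily_agg_staging
""")
```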
Thank you!
Questions?
