Spark Pitfalls
May 15, 2018
Lior Regev
#whoami
● Lior Regev
● Chief architect @ Endor
● Working on big data ETL jobs for the past 3 years
● Using Spark for 2 years
● Number of times I had to open up Spark’s source: 42
Recap
Pitfall #1: Spark is not Kubernetes
● Types of distribution: compute vs. data
● Spark distributes data
● To run heterogeneous tasks:
○ Kubernetes
○ ECS
○ AWS Batch
Pitfall #2: Partitioning - repartitioning
● Repartitioning after load
● Pre-partition by group if multiple groupBy are used
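The idea behind pre-partitioning can be modelled in plain Python. This is a minimal sketch, not Spark code: `hash_partition` and `aggregate_locally` are hypothetical helpers, and the rows are made-up sample data. It shows that once rows are routed by the grouping key, several aggregations over that key can run inside each partition without moving rows again. In Spark the routing step would roughly correspond to calling `repartition` on the grouping column once after load; whether a later groupBy actually reuses that partitioning depends on the query plan.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Route each row to a partition chosen by a stable hash of its grouping key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[index].append(row)
    return partitions


def aggregate_locally(partitions, key, value, fn):
    """Group and reduce inside each partition; no row leaves its partition."""
    result = {}
    for part in partitions:
        groups = defaultdict(list)
        for row in part:
            groups[row[key]].append(row[value])
        for k, values in groups.items():
            # Holds only because every row for k lives in exactly one partition.
            assert k not in result
            result[k] = fn(values)
    return result


# Made-up sample data for illustration.
rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
    {"user": "c", "amount": 1},
    {"user": "b", "amount": 2},
]

# Partition once, then run two different aggregations on the same layout.
partitions = hash_partition(rows, "user", num_partitions=3)
totals = aggregate_locally(partitions, "user", "amount", sum)
largest = aggregate_locally(partitions, "user", "amount", max)
```

Both aggregations read the same partitioned layout, which is the saving the slide points at: the expensive routing step happens once instead of once per groupBy.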
Pitfall #2: Partitioning - partition size
● Partitioning should be guided by task time
● I aim for about 1 minute per task:
○ Too short => too much overhead
○ Too long => costly retries
● Check the Spark UI for your task time statistics
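The one-minute rule of thumb above can be turned into a rough calculation. This is a sketch under a strong assumption: that task time scales linearly with partition size, which is often only approximately true. `suggest_partitions` is a hypothetical helper, and the numbers stand in for values you would read off the Spark UI.

```python
import math


def suggest_partitions(current_partitions, median_task_seconds, target_task_seconds=60.0):
    """Scale the partition count so the median task lands near the target duration.

    Assumes task time is roughly proportional to partition size.
    """
    if current_partitions <= 0 or median_task_seconds <= 0 or target_task_seconds <= 0:
        raise ValueError("all inputs must be positive")
    total_work_seconds = current_partitions * median_task_seconds
    return max(1, math.ceil(total_work_seconds / target_task_seconds))


# 200 tasks of about 5 minutes each: split into more, smaller partitions.
print(suggest_partitions(200, 300))  # 1000

# 2000 tasks of about 3 seconds each: merge into fewer, larger partitions.
print(suggest_partitions(2000, 3))  # 100
```

Treat the result as a starting point and re-check the task time statistics after the change.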
Pitfall #2: Partitioning - skewed data
● Sometimes one task takes significantly longer
● This is usually because of skewed data
● Try to repartition
● Use UDAFs and internal aggregation functions
● Watch out for that one “null” user: a single key that collects a disproportionate share of the rows
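A quick way to confirm that skew is the cause is to count rows per key on a sample. The sketch below is plain Python over made-up data; `find_skewed_keys` and its `factor` threshold are illustrative assumptions, not part of Spark. It flags keys whose row count is far above the median count per key, which is how a "null" user typically shows up.

```python
from collections import Counter
from statistics import median


def find_skewed_keys(keys, factor=10):
    """Return keys whose row count exceeds `factor` times the median count per key."""
    counts = Counter(keys)
    typical = median(counts.values())
    return {k: n for k, n in counts.items() if n > factor * typical}


# Made-up sample: three ordinary users and one placeholder key with 500 rows.
keys = ["u1"] * 3 + ["u2"] * 4 + ["u3"] * 2 + [None] * 500

skewed = find_skewed_keys(keys)

# Once identified, the hot key can be handled on its own path.
regular_rows = [k for k in keys if k not in skewed]
hot_rows = [k for k in keys if k in skewed]
```

Whether the hot key should be dropped, processed separately, or kept is a data question; the point here is only to find it before tuning partitions.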
Pitfall #3: Non-optimized algorithms are OK
● Optimizing a process to run 10 times faster is fun.
● Scaling that process to a bazillion (50) nodes is even better.
● Spark's internal optimized functions may not be the best fit for your case
Pitfall #4: Not reading docs and source
When running into an issue, try reading the docs.
Examples:
● Spark’s default number of shuffle partitions (spark.sql.shuffle.partitions, 200 by default) may not fit your data
● When reading files, use spark.sql.files.maxPartitionBytes to cap the number of bytes packed into each partition, which in turn determines how many files are processed in each partition.
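Both settings named above can be passed when the job is submitted. The sketch below only builds the command line; `spark_submit_command`, the application name `etl_job.py`, and the chosen values are hypothetical. The `--conf key=value` flag of spark-submit and the two configuration keys are real.

```python
import shlex


def spark_submit_command(app, conf):
    """Build a spark-submit command line with one --conf flag per setting."""
    cmd = ["spark-submit"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Illustrative values only; pick yours from the task time statistics in the Spark UI.
conf = {
    "spark.sql.shuffle.partitions": 1000,                   # Spark's default is 200
    "spark.sql.files.maxPartitionBytes": 64 * 1024 * 1024,  # Spark's default is 128 MB
}

command = spark_submit_command("etl_job.py", conf)
print(shlex.join(command))
```

Keeping the settings in one dictionary makes it easy to see, per job, which defaults were overridden and why.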
Pitfall #5: Debugging big data - local driver
At some point you will run into data you did not expect:
● Remote cluster + local driver
● Works, but:
○ Executors are not debuggable
○ Impact on performance
○ You want a small test-case anyway
Pitfall #5: Debugging big data - the right way
● Use a Spark notebook to find the bad data:
○ EMR
○ Databricks
○ Dataproc
○ etc...
● Create a test-case repro
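The two steps above, finding the bad data and then pinning it in a small test case, can be sketched in plain Python. Everything here is hypothetical: `parse_amount` stands in for whatever per-record logic runs inside your Spark tasks, and the sample records are made up. The same pattern applies once a notebook has narrowed the input down to a handful of offending records.

```python
def parse_amount(raw):
    """Hypothetical per-record transformation that would run inside a Spark task."""
    return float(raw.replace(",", ""))


def find_bad_records(records, transform):
    """Apply the transformation to each record and collect the ones that fail."""
    bad = []
    for record in records:
        try:
            transform(record)
        except Exception as exc:
            bad.append((record, type(exc).__name__))
    return bad


# Step 1: narrow a sample down to the records that break the job.
sample = ["12.50", "1,200", "N/A", "7"]
bad = find_bad_records(sample, parse_amount)


# Step 2: pin the offending record in a small test that runs without a cluster.
def test_parse_amount_rejects_placeholder():
    try:
        parse_amount("N/A")
    except ValueError:
        return
    raise AssertionError("expected ValueError for the placeholder value")


test_parse_amount_rejects_placeholder()
```

Once the repro exists, the fix can be developed and debugged locally, and the test stays behind as a regression check.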
Thank you
Lior Regev - lioregev@gmail.com
● All of the code is available at: https://github.com/liorregev/SparkPitfallMeetup
● Questions? Your pitfalls?

Spark Pitfalls meetup UnderscoreIL
