Optimizing Spark
Greg Novak
Proprietary and confidential 2
If you’ve thought about this at all, you won’t
learn anything from me today
If you haven’t thought about this, you’ll learn a
few principles to organize your thinking
Know what you want to measure
You don’t want to measure run times
You want to measure effective performance of
some machine characteristic: network
bandwidth, file access latency, or CPU
operations per second
You do this with carefully constructed data sets
To measure network bandwidth, construct data sets with
the same number of files (so file access latency is
constant) and do the same operation on each (so that CPU
operations are constant), but force some extra data of
variable size (e.g. random 1-byte ints vs. random 8-byte
ints) to come along for the ride.
Then take the difference of the run times.
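The subtraction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the talk: the helper names `payload_bytes` and `effective_bandwidth`, the row count, and the run times are all hypothetical. Because both runs touch the same files and perform the same operation, the latency and CPU terms are assumed to cancel, leaving only the time spent moving the extra bytes.

```python
def payload_bytes(num_rows: int, bytes_per_value: int) -> int:
    """Size of the extra column that comes along for the ride."""
    return num_rows * bytes_per_value


def effective_bandwidth(small_bytes: int, small_seconds: float,
                        large_bytes: int, large_seconds: float) -> float:
    """Bytes per second implied by the difference between two runs.

    Both runs are assumed to read the same number of files and do the
    same operation, so latency and CPU time cancel in the subtraction.
    """
    extra_bytes = large_bytes - small_bytes
    extra_seconds = large_seconds - small_seconds
    if extra_seconds <= 0:
        raise ValueError("larger payload did not take longer; "
                         "the difference is within the noise")
    return extra_bytes / extra_seconds


# Hypothetical numbers for illustration only (not measurements from the talk).
rows = 1_000_000_000
small = payload_bytes(rows, 1)  # random 1-byte ints
large = payload_bytes(rows, 8)  # random 8-byte ints
bandwidth = effective_bandwidth(small, 100.0, large, 170.0)
print(f"effective bandwidth: {bandwidth / 1e6:.0f} MB/s")
```

In practice each run time would probably be the median of several repetitions, since a single difference of two noisy timings can easily be dominated by noise.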
Case Study: Effective Network Bandwidth
Everything seemed to run slowly under
Spark 2.0...
Latency and CPU performance looked
fine
But we got terrible network bandwidth
from Spark 2.0
Not necessarily intrinsic to Spark 2.0…
could have been some detail of our
setup
However, Spark 2.1 worked fine, so we
just decommissioned our Spark 2.0
setup
How do you know if you’re getting your
money’s worth out of parallelization?
Run time vs. Number of Executors
Probably the first
plot you draw…
but doesn’t really
tell you what you
want to know
Overall Cost (in dollars if possible) vs. executors
In a perfect world
(linear speed-ups)
cost is independent
of parallelism
In the real world
costs generally rise
with parallelism
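A minimal sketch of the cost curve, assuming you are billed for every executor for the full wall time of the job. The price per executor-hour and the "real world" timings below are made up for illustration; substitute your own measurements.

```python
def job_cost(executors: int, walltime_hours: float,
             price_per_executor_hour: float) -> float:
    """Dollar cost of one run: every executor is paid for the whole wall time."""
    return executors * walltime_hours * price_per_executor_hour


PRICE = 0.50  # hypothetical dollars per executor-hour

# Perfect world: doubling the executors halves the wall time.
ideal = {n: job_cost(n, 8.0 / n, PRICE) for n in (1, 2, 4, 8)}

# Real world (made-up timings): speed-ups fall short of linear.
measured_hours = {1: 8.0, 2: 4.4, 4: 2.6, 8: 1.7}
real = {n: job_cost(n, hours, PRICE) for n, hours in measured_hours.items()}

for n in measured_hours:
    print(f"{n} executors: ideal ${ideal[n]:.2f}, real ${real[n]:.2f}")
```

With linear speed-ups the executor count and the wall time cancel, so the ideal cost is flat. Any shortfall from linear scaling shows up directly as a rising cost.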
Benefit: 1/walltime = answers per hour
1 hour vs. 2 hours:
Probably not a big deal
1 week vs. 2 weeks:
Probably is a big deal
1 minute vs. 10 minutes
is a huge deal:
Too easy to get
distracted if your debug
cycle is 10 minutes.
Once you are crisp on the costs and benefits, you will be in
a position to say things like:
“If I double the amount of parallelism for this job, my AWS
bill will rise by 30 percent and the job will run in 45 minutes
instead of 60 minutes. Does that seem worth it to me?”
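The arithmetic behind that question can be worked through using the benefit defined earlier (1/walltime, i.e. answers per hour). The 30 percent cost increase and the 60 and 45 minute run times come from the quote; the function name `answers_per_hour` is only an illustrative choice.

```python
def answers_per_hour(walltime_minutes: float) -> float:
    """Benefit metric from the talk: 1 / walltime, expressed per hour."""
    return 60.0 / walltime_minutes


baseline = answers_per_hour(60)  # current parallelism
doubled = answers_per_hour(45)   # after doubling parallelism

benefit_gain = doubled / baseline - 1.0  # fractional increase in answers per hour
cost_gain = 0.30                         # bill increase stated in the quote

print(f"benefit: +{benefit_gain:.0%} answers per hour")
print(f"cost:    +{cost_gain:.0%} on the bill")
```

Here the job delivers about a third more answers per hour for 30 percent more money. Whether that trade is worth it still depends on how much a faster answer matters for this particular job.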
Recap
Characterize performance by measuring intrinsic machine
characteristics, like network bandwidth, rather than raw
run times
Do it with carefully constructed data sets that change one
and only one thing
Be crisp on costs (dollars) and benefits (essentially debug
cycles per hour) of parallelism to make informed choices
about whether you want more or less of it.
