OPTIMIZING SPARK DEPLOYMENTS
FOR CONTAINERS: ISOLATION,
SAFETY, AND PERFORMANCE
William Benton • @willb
Red Hat, Inc.
Forecast
Background and definitions
Architectural concerns
Security concerns
Performance concerns
Conclusions and takeaways
Preliminaries
What is a container?
…a lightweight VM?
…a way to totally isolate applications?
…a packaging format for a container runtime or orchestration platform?
[Diagram: a Spark worker process with its pid, root, and net settings. Across frames, its root changes from / to /tmp/foo, a container runtime appears beside it, and a "SPEED LIMIT 55" sign is added. The process command is:]
$SPARK_HOME/bin/spark-class \
  org.apache.spark.deploy.worker.Worker \
  master:7077
What is a container?
…a lightweight VM?
…a way to totally isolate applications?
…a packaging format for a container runtime or orchestration platform?
…a lightweight means to address some of the same use cases as VMs.
…a way to provide reasonable, not exhaustive application isolation.
…yes, but really just any Linux process with some special settings!
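The "special settings" on a containerized process include the namespaces it belongs to, which Linux exposes as links under /proc. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name ns_id is hypothetical and not part of any tool mentioned in the talk.

```shell
#!/bin/sh
# ns_id PID NAMESPACE: print the namespace identifier of a process,
# in the kernel's "type:[inode]" form. Prints "unavailable" when /proc
# does not expose the link (non-Linux host, no such process, or
# insufficient permissions).
ns_id() {
    link="/proc/$1/ns/$2"
    if [ -L "$link" ]; then
        readlink "$link" 2>/dev/null || echo "unavailable"
    else
        echo "unavailable"
    fi
}

# Show the pid, mnt (filesystem view), and net namespaces of the
# current process.
for ns in pid mnt net; do
    echo "$ns: $(ns_id self "$ns")"
done
```

Run on a host, calling ns_id with the PID of a containerized process and comparing the result with the values for self is one way to check which namespaces that process does not share with the host shell.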
Architectural considerations
Microservice architectures
High-level app architecture
[Diagram, built up over several frames. Inputs: events, databases, and file / object storage. Processing: federate, transform (three stages), train models, archive. Consumers: management, web and mobile, reporting, developer UI.]
Spark is a natural fit for
microservice architectures, since
executors are microservices!
Monolithic Spark clusters
[Diagram: app 1 through app 4 share a single pool of six Spark executors, together with a cluster scheduler, a resource manager, databases, and a shared FS / object store.]
One cluster per application
[Diagram: app 1 through app 6 run under a resource manager alongside databases and a shared FS / object store; a later frame adds further instances of app 2 and app 4.]
Security
[Diagrams: a host running systemd with three qemu virtual machines, followed by a host running systemd with nginx, mongodb, and spark-class processes in containers rooted at /tmp/foo, /tmp/bar, and /tmp/blah.]
Use SELinux
SELinux limits your exposure to an exploit in a container or a bug in a container runtime.
Root is root
[Diagram: a containerized process whose root is /tmp/foo, then /.]
Denials of service
Kernel panics
Keeping secrets
[Diagram: a containerized process rooted at /tmp/foo accessing a shared FS / object store with ACCESS_KEY=… and SECRET_KEY=….]

cat <<EOF > secret.txt
ACCESS_KEY=…
SECRET_KEY=…
EOF
git add secret.txt

export ACCESS_KEY=…
export SECRET_KEY=…

kubectl create secret \
  generic mysecrets \
  --from-file=… \
  --from-file=…
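A secret created with kubectl create secret can be mounted into a pod as a volume, in which case each key appears as a file. The sketch below shows one way an entrypoint script might read such a file at startup instead of relying on a committed file. The helper read_secret, the SECRETS_DIR mount point, and the use of ACCESS_KEY as a file name are all assumptions made for illustration.

```shell
#!/bin/sh
# read_secret DIR KEY: print the contents of the file DIR/KEY.
# Returns non-zero and prints nothing on stdout if the file is
# missing or unreadable.
read_secret() {
    secret_file="$1/$2"
    if [ -r "$secret_file" ]; then
        cat "$secret_file"
    else
        echo "secret '$2' not found under $1" >&2
        return 1
    fi
}

# SECRETS_DIR is a hypothetical mount point; use whatever path the
# secret volume is mounted at in your pod specification.
SECRETS_DIR="${SECRETS_DIR:-/etc/secrets}"

# The value stays in a shell variable and is never echoed.
if ACCESS_KEY=$(read_secret "$SECRETS_DIR" ACCESS_KEY 2>/dev/null); then
    echo "loaded ACCESS_KEY (${#ACCESS_KEY} characters)"
else
    echo "ACCESS_KEY not available; continuing without it"
fi
```

Printing only the length of the value, rather than the value itself, keeps the secret out of container logs.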
Performance
Potential performance pitfalls
Hypervisors introduce overhead.
Use more lightweight isolation
mechanisms to preserve performance.
Virtualized networking likely
has minimal impact on overall
application performance!
…but measure the
performance of your
I/O configuration!
Quotas mean some ubiquitous techniques can have
surprising performance impact. Consider in particular
parallel GC and disk buffer cache use.
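One reason parallel GC can be surprising under a CPU quota is that the JVM may size its GC thread pool from the host's core count rather than from the quota. The sketch below derives a core count from the quota and suggests matching flags. It assumes the cgroup v1 file layout shown in the comments; the helper names cpus_from_quota and gc_flags are hypothetical. -XX:+UseSerialGC and -XX:ParallelGCThreads are standard HotSpot options.

```shell
#!/bin/sh
# cpus_from_quota QUOTA_US PERIOD_US: round the CPU quota up to whole
# cores. Prints 0 when no quota is set (cgroup v1 reports -1).
cpus_from_quota() {
    if [ "$1" -le 0 ] || [ "$2" -le 0 ]; then
        echo 0
    else
        echo $(( ($1 + $2 - 1) / $2 ))
    fi
}

# gc_flags CPUS: suggest GC flags for a JVM limited to CPUS cores.
# An empty result means "leave the JVM defaults alone".
gc_flags() {
    if [ "$1" -le 0 ]; then
        echo ""
    elif [ "$1" -eq 1 ]; then
        echo "-XX:+UseSerialGC"
    else
        echo "-XX:ParallelGCThreads=$1"
    fi
}

# cgroup v1 paths; these are assumptions about the container's layout.
quota_file=/sys/fs/cgroup/cpu/cpu.cfs_quota_us
period_file=/sys/fs/cgroup/cpu/cpu.cfs_period_us
if [ -r "$quota_file" ] && [ -r "$period_file" ]; then
    cpus=$(cpus_from_quota "$(cat "$quota_file")" "$(cat "$period_file")")
else
    cpus=0
fi
echo "suggested GC flags: $(gc_flags "$cpus")"
```

For Spark executors, flags like these could be supplied through spark.executor.extraJavaOptions; whether they help is something to measure rather than assume.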
Be sure you set your heap sizes based on your
resource limits…or wait for OpenJDK 9!
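Setting heap sizes from resource limits can be sketched as reading the container's memory limit and reserving a fraction of it for the heap. The example below assumes the cgroup v2 and v1 file paths shown in the comments and an illustrative 75% fraction, which is not a recommendation from the talk; the helper names heap_mb and memory_limit_bytes are hypothetical.

```shell
#!/bin/sh
# heap_mb LIMIT_BYTES PERCENT: print PERCENT% of LIMIT_BYTES in whole MiB.
heap_mb() {
    echo $(( $1 / 1048576 * $2 / 100 ))
}

# memory_limit_bytes: print the cgroup memory limit, trying the cgroup v2
# path first and then the cgroup v1 path. Prints nothing when no file is
# readable, when cgroup v2 reports "max", or when the value has 19 or more
# digits (cgroup v1 reports a very large number when no limit is set).
memory_limit_bytes() {
    for f in /sys/fs/cgroup/memory.max \
             /sys/fs/cgroup/memory/memory.limit_in_bytes; do
        if [ -r "$f" ]; then
            limit=$(cat "$f")
            if [ "$limit" != "max" ] && [ "${#limit}" -lt 19 ]; then
                echo "$limit"
            fi
            return 0
        fi
    done
}

limit=$(memory_limit_bytes)
if [ -n "$limit" ]; then
    echo "suggested heap: $(heap_mb "$limit" 75)m"
else
    echo "no memory limit detected; leaving heap size at the JVM default"
fi
```

The resulting value could feed a JVM's -Xmx option or Spark's spark.executor.memory setting. The remainder of the limit is left for non-heap JVM memory and anything else running in the container.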
Conclusions and takeaways
Architectural takeaways
Spark executors are already microservices.
Consider using a single Spark cluster per application for flexible
scheduling and easy deployments.
Persistent storage lives outside of containers and is probably best
accessed via service interfaces rather than through filesystem interfaces.
Security takeaways
It isn’t safe to run arbitrary code just because you put it in a container.
Use SELinux to minimize your exposure to error and malice.
Don’t run as root unless you absolutely have to (and you probably don’t).
Ad hoc mechanisms for configuring secrets are likely to leak information
and are almost always a bad idea.
Performance takeaways
Avoid hypervisor overhead by using different approaches to isolation.
Measure everything, but virtualized networking likely has a minimal
performance impact on real applications.
Artificially throttled performance can be a real problem. Experiment with
JVM settings, including serial GC, to reduce your chance of getting limited.
Configuration takeaways
If you consume logs from standard output and error, consider using an
alternate stack trace formatter to get exceptions in a single log record.
If you use ephemeral user IDs, set SPARK_USER or use nss_wrapper so
Hadoop file libraries won’t get confused.
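The SPARK_USER suggestion can be sketched as a guard in a container entrypoint: when the current UID has no passwd entry, id -un fails, and the script supplies a name through SPARK_USER. The helper name ensure_spark_user and the fallback name spark are assumptions made for illustration.

```shell
#!/bin/sh
# ensure_spark_user DEFAULT: if SPARK_USER is unset and the current UID
# has no passwd entry (so `id -un` fails), export SPARK_USER=DEFAULT.
# An existing SPARK_USER is left untouched.
ensure_spark_user() {
    if [ -z "${SPARK_USER:-}" ] && ! id -un >/dev/null 2>&1; then
        SPARK_USER="$1"
        export SPARK_USER
    fi
}

ensure_spark_user spark

if [ -n "${SPARK_USER:-}" ]; then
    echo "SPARK_USER=$SPARK_USER"
else
    echo "SPARK_USER not needed; running as $(id -un)"
fi
```

Calling this before launching spark-class means the fallback only takes effect on platforms that assign ephemeral user IDs.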
Thanks!
willb@redhat.com • @willb
http://radanalytics.io
https://chapeau.freevariable.com

Optimizing Spark Deployments for Containers: Isolation, Safety, and Performance: Spark Summit East talk by William Benton