Haohai Ma, IBM
Khalid Ahmed, IBM
RUNNING SPARK INSIDE
CONTAINERS
Myself
• “How High”
• Software Architect
• IBM Spectrum Computing
• Toronto, Canada
IBM Spectrum Computing
Agenda
• Why containers?
• Migrating Spark workloads to containers
• Spark instance on Kubernetes
– Architecture
– Workflow
– Multi-tenancy
• Future work
Why use containers?
• To enforce the CPU and memory bounds.
– CPU shares are proportional to the allocated slots
– spark.driver.memory & spark.executor.memory
• To completely isolate the file system
– Solve the dependency conflicts
• To create and ship images
– Develop once and run everywhere
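As a rough illustration of the first bullet, the sketch below turns an allocated slot count and a Spark memory setting into `docker run` resource flags (`--cpu-shares` and `--memory`). The weight of 1024 shares per slot and the memory overhead rule (10% of the heap, at least 384 MiB) are assumptions made for this example; the slides do not say how the product computes either value.

```python
import re
from typing import List

# Assumption: one allocated slot maps to Docker's default weight of 1024 CPU shares.
SHARES_PER_SLOT = 1024
# Assumption: headroom on top of the JVM heap, 10% with a 384 MiB floor.
MIN_OVERHEAD_BYTES = 384 * 1024**2
OVERHEAD_FRACTION = 0.10

_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_spark_memory(value: str) -> int:
    """Convert a Spark-style size such as '4g' or '512m' into bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def docker_limits(slots: int, spark_memory: str) -> List[str]:
    """Return docker run flags that bound CPU weight and memory for one container."""
    if slots < 1:
        raise ValueError("a container needs at least one slot")
    heap = parse_spark_memory(spark_memory)
    overhead = max(MIN_OVERHEAD_BYTES, int(heap * OVERHEAD_FRACTION))
    return [
        "--cpu-shares", str(slots * SHARES_PER_SLOT),
        "--memory", str(heap + overhead),  # a bare integer is read as bytes
    ]


if __name__ == "__main__":
    # Example: an executor with 2 slots and spark.executor.memory=4g
    print(" ".join(docker_limits(2, "4g")))
```

CPU shares are relative weights, so they only matter when containers compete for CPU, whereas the memory flag is a hard limit.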
No prebuilt Spark image
• A running container needs an application image
– Independent of Spark versions
• Seamlessly migrate Spark workloads to a
container-based environment
– Assumption: Spark is distributed onto the host file system
Regular Spark workload
[Diagram labels: Host Filesystem with the Spark installation; JVM processes: Spark Master, Spark Submit, Spark Driver, Spark Executor]
Running in containers
[Diagram labels: Host Filesystem with the Spark installation; Spark Master; JVM; Spark Submit; Spark Driver; Spark Executor; container: ubuntu (twice); container: image]
Creating a container definition for an application image
[Screenshot callout: Extra dependency from host file system]
Submitting workload with the container definition
Cluster mode: define container specifications for the drivers and executors
spark-submit --class <main-class> --master <master-url> --deploy-mode cluster \
  --conf spark.ego.driver.docker.definition=MyAppDef \
  --conf spark.ego.executor.docker.definition=MyAppDef \
  <application-jar> \
  [application-arguments]
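The submission above can also be assembled programmatically, which avoids spacing and quoting mistakes in the `--conf` values. This is a minimal sketch: the two `spark.ego.*.docker.definition` keys and the `MyAppDef` definition name come from the slide, while the class name, master URL, JAR path, and application arguments are hypothetical placeholders.

```python
import shlex
from typing import List, Sequence


def build_submit_command(
    main_class: str,
    master_url: str,
    app_jar: str,
    definition: str,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Build a cluster-mode spark-submit argv that runs the driver and
    executors with the same container definition."""
    return [
        "spark-submit",
        "--class", main_class,
        "--master", master_url,
        "--deploy-mode", "cluster",
        "--conf", f"spark.ego.driver.docker.definition={definition}",
        "--conf", f"spark.ego.executor.docker.definition={definition}",
        app_jar,
        *app_args,
    ]


if __name__ == "__main__":
    cmd = build_submit_command(
        main_class="com.example.MyApp",                # hypothetical
        master_url="spark://master.example.com:7077",  # hypothetical
        app_jar="/opt/apps/myapp.jar",                 # hypothetical
        definition="MyAppDef",
        app_args=["--input", "/data/in"],              # hypothetical
    )
    # Print a shell-safe command line; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```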
Running in containers
[Diagram labels: Host Filesystem with the Spark installation; Spark Master; Spark Submit; Spark Driver (container: myappimage:v1); Spark Executor (container: myappimage:v1); Infobatch lib]
Spark Instance on Kubernetes
• Increase resource utilization
– Share nodes between Spark and surrounding ecosystem
• Isolate tenants and enforce resource limits
– Each tenant gets a dedicated Spark working instance
– Tenant price plan can directly map to its resource quota
• Simplify deployment and roll out
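To make the "price plan maps to resource quota" point concrete, here is a small sketch of such a mapping. The plan names, the quota sizes, and the per-container resource figures are all hypothetical; the slides only state that a plan can map directly to a quota.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    cpu_cores: int
    memory_gib: int


# Hypothetical price plans and their resource quotas.
PLAN_QUOTAS = {
    "small": Quota(cpu_cores=8, memory_gib=32),
    "medium": Quota(cpu_cores=32, memory_gib=128),
    "large": Quota(cpu_cores=128, memory_gib=512),
}


def max_containers(plan: str, cores_per_container: int, gib_per_container: int) -> int:
    """Return how many containers of the given size fit into the plan's quota.

    The tighter of the CPU and memory bounds wins.
    """
    quota = PLAN_QUOTAS[plan]
    return min(
        quota.cpu_cores // cores_per_container,
        quota.memory_gib // gib_per_container,
    )


if __name__ == "__main__":
    for plan in PLAN_QUOTAS:
        print(plan, max_containers(plan, cores_per_container=2, gib_per_container=8))
```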
Architecture
[Diagram: IBM Spectrum Conductor with Spark at the SparkaaS level (labels: Spark instance group, Admin Portal, Spark Information Hub, Image and deployment) and two Spark Instance Groups at the tenant level, each containing a Spark Master, History Server, Notebook, and Shuffle Service]
Architecture (continued)
[Diagram repeated from the previous Architecture slide]
• A Spark instance group is an independent
deployment in Kubernetes.
• A Docker image is built automatically based on the
Spark version, configuration, notebook edition, and
user application dependencies.
• Initially, a single container runs the Spark services of the
Spark instance group.
• Dynamic scalability based on workload demand.
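The bullets above can be pictured as one Kubernetes workload object per Spark instance group, starting with a single replica. The sketch below assumes a standard `apps/v1` Deployment, which the slides do not confirm, and uses a hypothetical registry address and image tag. The slide diagrams label containers as `spaas4bu1_tenant1`; since Kubernetes object names cannot contain underscores, this example uses a hyphen instead.

```python
import json


def instance_group_deployment(
    spaas: str, tenant: str, namespace: str, image: str, replicas: int = 1
) -> dict:
    """Build an apps/v1 Deployment manifest for one Spark instance group."""
    name = f"{spaas}-{tenant}"  # hyphen, because underscores are not valid in names
    labels = {"app": name, "tenant": tenant}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": replicas,  # one container initially; scaled later on demand
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": tenant, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    manifest = instance_group_deployment(
        spaas="spaas4bu1",
        tenant="tenant1",
        namespace="ns4bu1",
        image="registry.example.com/tenant1:latest",  # hypothetical registry and tag
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```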
Architecture (continued)
[Diagram repeated from the previous Architecture slides]
• IBM Spectrum Conductor with Spark endpoints
– Admin manages the Spark instance group life cycle
– Tenant accesses Spark workloads and notebooks
• Deployed by a Helm chart and exposed as a service with
a single container: CWS master
• Multiple deployments in a Kubernetes cluster
– Cloud: one deployment per SpaaS
– On-prem: one deployment per BU
Creating a master container
[Diagram: Kubernetes; Namespace: ns4bu1 with container: CWS (spaas4bu1_cwsmaster)]
Creating a Spark instance group
[Diagram: Kubernetes; Namespace: ns4bu1 with container: CWS (spaas4bu1_cwsmaster); Registry holding the tenant1 image]
Deploying a Spark instance group
[Diagram: Kubernetes; Namespace: ns4bu1 with container: CWS (spaas4bu1_cwsmaster) and container: tenant1 (spaas4bu1_tenant1); Registry holding the tenant1 image]
Scaling the Spark instance group based on workload demands
[Diagram: Kubernetes; Namespace: ns4bu1 with four container: tenant1 instances; K8s master with API Server and scheduler; Spark Master, Spark Driver, and two Spark Executors]
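The slides do not describe the scaling algorithm itself. As a purely hypothetical sketch of demand-based scaling, the function below derives a target container count from the slots the workload currently needs, clamped between a minimum of one container and a tenant-specific maximum. All parameter names are illustrative.

```python
import math


def desired_containers(
    pending_slots: int,
    running_slots: int,
    slots_per_container: int,
    max_containers: int,
    min_containers: int = 1,
) -> int:
    """Return the container count needed to serve the current slot demand."""
    if slots_per_container < 1:
        raise ValueError("slots_per_container must be positive")
    demand = pending_slots + running_slots
    needed = math.ceil(demand / slots_per_container)
    # Never drop below the initial container, never exceed the tenant's cap.
    return max(min_containers, min(needed, max_containers))


if __name__ == "__main__":
    # Idle group: stays at the single initial container.
    print(desired_containers(0, 0, slots_per_container=4, max_containers=10))
    # 13 slots demanded at 4 slots per container: scale out.
    print(desired_containers(10, 3, slots_per_container=4, max_containers=10))
```

The result could then be written to the replica count of the instance group's workload object, for example through the Kubernetes API server shown in the diagram.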
Performance
[Charts: with dynamic scaling; without dynamic scaling]
Multitenancy with Spark instance groups
[Diagram: Kubernetes; Namespace: ns4bu1 with container: CWS (spaas4bu1_cwsmaster), four container: tenant1 instances (spaas4bu1_tenant1), and three container: tenant2 instances (spaas4bu1_tenant2); Registry holding the tenant1 and tenant2 images]
Multi-Spaas
[Diagram: Kubernetes; Registry holding the tenant1, tenant2, tenant3, and tenant4 images]
[Namespace: ns4bu1 with container: CWS (spaas4bu1_cwsmaster), four container: tenant1 instances (spaas4bu1_tenant1), and two container: tenant2 instances (spaas4bu1_tenant2)]
[Namespace: ns4bu2 with container: CWS (spaas4bu2_cwsmaster), container: tenant3 (spaas4bu2_tenant3), and container: tenant4 (spaas4bu2_tenant4)]
Survey: Spark on Kubernetes
Feature | SPARK-18278 | Standalone | IBM Spectrum Conductor with Spark on Kubernetes
Dynamic allocation on demand | Yes | Static | Yes
K8s interaction granularity | Job level | Instance level – static | Instance level – dynamic
Deployment automation (simple deploy by Helm charts) | No | Yes | Yes
Spark instance per tenant (multi-job/workflow/user; image with user applications; security) | No | Limited | Yes
Future work
• Integration with Kubernetes batch workload
scheduler
– Kube-arbitrator (https://github.com/kubernetes-incubator/kube-arbitrator)
• Performance comparison with other Spark on
Kubernetes solutions
www.ibm.com/spectrum-conductor
hma@ca.ibm.com
Thank You
