Bo Yang
DoK Day Europe 2022 @ KubeCon
Apache Spark on Kubernetes - Challenges and Solutions
Agenda
● Quick Introduction
● Challenges and Solutions
● Conclusion
● One Click To Run
Introduction
● Apache Spark: powerful tool to run data processing and machine learning jobs
● Kubernetes: extensible platform to run containerized workloads and services
● Apache Spark + Kubernetes: 1 + 1 > 2
● It works, but only after solving various challenges
Challenge: Complexity
● A Lot of Moving Pieces
○ Create Kubernetes cluster
○ Create service account and set up permissions
○ Run spark-submit
○ Use kubectl to check driver/executor pods
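The moving pieces above can be written down as the commands someone would otherwise type by hand. The sketch below only assembles those command lines as Python lists and runs nothing. Cluster creation is left out because it depends on the provider. The namespace, service account name, API server URL, image, job name, and jar path are placeholder assumptions, not values from the talk.

```python
import shlex

# Sketch: assemble (but do not run) the manual steps for one Spark job on Kubernetes.
# Every concrete value passed in is a placeholder chosen for illustration.

def manual_steps(namespace, service_account, api_server, image, app_jar, main_class):
    """Return the manual commands, in order, as argument lists."""
    create_account = ["kubectl", "create", "serviceaccount", service_account,
                      "--namespace", namespace]
    # Bind the built-in "edit" ClusterRole to the account inside one namespace.
    grant_permission = ["kubectl", "create", "rolebinding", f"{service_account}-edit",
                        "--clusterrole=edit",
                        f"--serviceaccount={namespace}:{service_account}",
                        "--namespace", namespace]
    submit = ["spark-submit",
              "--master", f"k8s://{api_server}",
              "--deploy-mode", "cluster",
              "--name", "example-job",  # hypothetical job name
              "--class", main_class,
              "--conf", f"spark.kubernetes.namespace={namespace}",
              "--conf", f"spark.kubernetes.container.image={image}",
              "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName="
                        + service_account,
              "--conf", "spark.executor.instances=2",
              app_jar]
    # Spark labels its pods with spark-role=driver or spark-role=executor.
    check_pods = ["kubectl", "get", "pods", "--namespace", namespace,
                  "--selector", "spark-role in (driver,executor)"]
    return [create_account, grant_permission, submit, check_pods]


for step in manual_steps("spark", "spark", "https://example.invalid:6443",
                         "example/spark:latest",
                         "local:///opt/spark/examples/jars/spark-examples.jar",
                         "org.apache.spark.examples.SparkPi"):
    print(shlex.join(step))
```

Four separate commands for a single job, each with its own failure modes, is the complexity this slide points at.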
Make it Simple: Operator
● Operator Pattern: simplify operation work
● Spark Operator: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
○ Spark Application CRD
○ No need to deal with details of spark-submit
○ Monitor and manage the application status
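With the operator, the job is described declaratively instead of through spark-submit flags. The sketch below builds such a description as a Python dictionary and prints it as JSON, which kubectl can apply like YAML. The field names follow the operator's published v1beta2 examples as best I recall them, so they should be checked against the CRD version actually installed. All values (names, image, jar path, versions, sizes) are placeholders.

```python
import json

# Sketch of a SparkApplication custom resource; values are illustrative only.

def spark_application(name, namespace, image, main_class, app_file,
                      spark_version, service_account, executors):
    """Return a SparkApplication manifest as a plain dictionary."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_file,
            "sparkVersion": spark_version,
            "restartPolicy": {"type": "Never"},
            "driver": {"cores": 1, "memory": "1g",
                       "serviceAccount": service_account},
            "executor": {"cores": 1, "instances": executors, "memory": "1g"},
        },
    }


manifest = spark_application("example-job", "spark", "example/spark:latest",
                             "org.apache.spark.examples.SparkPi",
                             "local:///opt/spark/examples/jars/spark-examples.jar",
                             "3.2.1", "spark", executors=2)
print(json.dumps(manifest, indent=2))
```

The printed document could be saved to a file and applied with kubectl; the operator then performs the submission and tracks the application status.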
Spark UI
● Challenge: Cannot Access Spark UI from Outside
● Solution: Spark Operator + Nginx Ingress Controller
○ Automate ingress rule to expose Spark UI
○ Leverage Nginx to modify Spark UI web page on the fly
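To make "automate ingress rule" concrete, here is a rough sketch of the kind of Ingress object that could route an external path to a driver's UI. It is not the operator's own output. The URL prefix, the Ingress name, the ingress class, and the UI service name passed in are assumptions for illustration; port 4040 is Spark's default UI port, and the rewrite annotation is the one documented by the Nginx Ingress Controller.

```python
import json

# Sketch: an Ingress that exposes one application's Spark UI under /spark-ui/<app>.
# The service name must match whatever Service fronts the driver UI in your setup.

def spark_ui_ingress(app_name, namespace, ui_service, ui_port=4040):
    """Return an Ingress manifest (networking.k8s.io/v1) as a dictionary."""
    prefix = f"/spark-ui/{app_name}"  # hypothetical URL layout
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": f"{app_name}-ui",
            "namespace": namespace,
            "annotations": {
                # Strip the prefix so the UI sees requests rooted at "/".
                "nginx.ingress.kubernetes.io/rewrite-target": "/$2",
            },
        },
        "spec": {
            "ingressClassName": "nginx",
            "rules": [{
                "http": {
                    "paths": [{
                        "path": f"{prefix}(/|$)(.*)",
                        "pathType": "ImplementationSpecific",
                        "backend": {"service": {"name": ui_service,
                                                "port": {"number": ui_port}}},
                    }]
                }
            }],
        },
    }


print(json.dumps(spark_ui_ingress("example-job", "spark", "example-job-ui"), indent=2))
```

Rewriting the path alone does not fix links inside the UI pages, which is presumably why the slide also mentions modifying the page on the fly.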
Spark Logs
● Challenges
○ No log aggregation
○ Executor logs are gone after the application finishes
● Solutions
○ Log shipping tools (Fluentd, Fluent Bit, Logstash, etc.)
○ spark.kubernetes.executor.deleteOnTermination=false
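As a small illustration of the second solution: once executor pods are retained, their logs can still be read with kubectl after the job ends. The sketch below assembles the extra spark-submit flag and a kubectl logs command without running either. It assumes Spark's standard pod labels (spark-role and spark-app-selector); the namespace and application ID are placeholders.

```python
import shlex

# Sketch: keep executor pods around, then fetch their logs by label.

def retain_executor_pods_flags():
    """Extra spark-submit flags that stop Spark from deleting finished executor pods."""
    return ["--conf", "spark.kubernetes.executor.deleteOnTermination=false"]


def executor_logs_command(namespace, app_id):
    """kubectl command that prints full logs of all executor pods of one application."""
    return ["kubectl", "logs",
            "--namespace", namespace,
            "--selector", f"spark-role=executor,spark-app-selector={app_id}",
            "--tail=-1",   # with a selector, kubectl shows only the last lines by default
            "--prefix"]    # prefix each line with the pod it came from


print(shlex.join(retain_executor_pods_flags()))
print(shlex.join(executor_logs_command("spark", "spark-0123456789abcdef")))
```

Retained pods pile up over time, so this works best as a debugging aid next to a log shipper rather than as a replacement for one.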
CPU/Memory Overhead
● Challenges
○ Executor memory is not the same as pod memory
○ Non-trivial overhead
● Solutions
○ Spark conf: spark.kubernetes.memoryOverheadFactor
○ Use larger nodes in the Kubernetes cluster
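A short worked calculation shows why executor memory and pod memory differ. The sketch assumes Spark's usual rule for JVM jobs: the pod request is the executor memory plus an overhead of max(factor × executor memory, 384 MiB), with a default factor of 0.1. It ignores other additions such as PySpark or off-heap memory, and the node sizes in the usage lines are made-up numbers.

```python
# Sketch: simplified pod memory calculation for a Spark executor on Kubernetes.

MIN_OVERHEAD_MIB = 384  # minimum overhead Spark applies


def pod_memory_mib(executor_memory_mib, overhead_factor=0.1):
    """Memory requested for the executor pod, in MiB (simplified)."""
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead


def executors_per_node(node_allocatable_mib, executor_memory_mib, overhead_factor=0.1):
    """How many executor pods fit on one node by memory alone."""
    return node_allocatable_mib // pod_memory_mib(executor_memory_mib, overhead_factor)


print(pod_memory_mib(4096))             # 4 GiB executor, default factor
print(pod_memory_mib(4096, 0.4))        # higher factor, e.g. for non-JVM jobs
print(executors_per_node(16384, 4096))  # hypothetical smaller node
print(executors_per_node(65536, 4096))  # hypothetical larger node
```

A 4 GiB executor therefore asks Kubernetes for noticeably more than 4 GiB, and on small nodes the rounding loss per node is a larger share of capacity, which is the motivation for the "larger nodes" suggestion.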
Dynamic Allocation
● Why
○ Dynamically allocate and terminate workers (executors)
○ Increase cluster utilization and reduce cost
● Challenges
○ Issue on Kubernetes: shuffle data prevents terminating executors
○ No External Shuffle Service (vs. Spark on YARN)
● Solutions
○ Decouple shuffle data from executor
○ Uber Remote Shuffle Service
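The reasoning behind "decouple shuffle data from executor" can be shown with a toy model. This is not Spark's allocation logic, only an illustration: an idle executor can be released if no later stage still needs shuffle files stored on its local disk, and a remote shuffle service removes that dependency. The class and function names are invented for this sketch.

```python
from dataclasses import dataclass

# Toy model: which idle executors can be terminated?


@dataclass(frozen=True)
class Executor:
    name: str
    idle: bool
    local_shuffle_files: int  # shuffle outputs that later stages may still read


def removable(executors, remote_shuffle):
    """Names of executors that can be terminated without losing shuffle data."""
    result = []
    for ex in executors:
        # With a remote shuffle service, shuffle data does not live on the executor.
        pinned_by_shuffle = ex.local_shuffle_files > 0 and not remote_shuffle
        if ex.idle and not pinned_by_shuffle:
            result.append(ex.name)
    return result


fleet = [Executor("exec-1", idle=True, local_shuffle_files=12),
         Executor("exec-2", idle=True, local_shuffle_files=0),
         Executor("exec-3", idle=False, local_shuffle_files=5)]
print(removable(fleet, remote_shuffle=False))
print(removable(fleet, remote_shuffle=True))
```

In the model, the idle executor holding shuffle files becomes removable only when shuffle storage is remote, which is where the utilization and cost gains come from.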
Kubernetes Default Scheduler
● Originally designed to orchestrate long-running services
● Problems
○ Missing Job Queue
○ Static Resource Limit
○ Driver Deadlock
Batch Friendly Scheduler
● Kubernetes Scheduling Framework
○ Support plugins to enhance behavior
○ Able to totally replace the default scheduler
● Options
○ Volcano: a batch scheduler inspired by machine learning workloads
○ Apache YuniKorn: inspired by the YARN scheduler from Hadoop
○ Scheduler Plugin: a scheduler plugin that supports gang scheduling and FIFO ordering
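The driver deadlock listed among the default scheduler's problems, and the way gang scheduling with a FIFO queue avoids it, can be illustrated with a toy simulation. It models the cluster as a number of identical pod slots and is not the code of Volcano, YuniKorn, or any scheduler plugin. All names are invented for this sketch.

```python
from dataclasses import dataclass

# Toy model: pod-at-a-time placement versus gang admission in FIFO order.


@dataclass(frozen=True)
class Job:
    name: str
    executors: int  # executor pods the driver asks for once it is running


def schedule_pod_by_pod(jobs, slots):
    """Place pods one at a time; return the executors granted to each job."""
    free = slots
    running_drivers = []
    for job in jobs:             # every submitted driver grabs a slot if one is free
        if free > 0:
            free -= 1
            running_drivers.append(job)
    granted = {job.name: 0 for job in jobs}
    for job in running_drivers:  # executors only get what the drivers left over
        take = min(free, job.executors)
        granted[job.name] = take
        free -= take
    return granted


def schedule_gang_fifo(jobs, slots):
    """Admit whole jobs (driver plus all executors) in FIFO order; return admitted names."""
    free = slots
    admitted = []
    for job in jobs:
        need = 1 + job.executors
        if need > free:
            break                # the head of the queue waits; nobody jumps ahead
        free -= need
        admitted.append(job.name)
    return admitted


queue = [Job(f"job{i}", executors=2) for i in range(4)]
print(schedule_pod_by_pod(queue, slots=4))
print(schedule_gang_fifo(queue, slots=4))
```

With four slots and four jobs, pod-at-a-time placement fills the cluster with drivers that have no executors, so nothing progresses. Gang admission runs one complete job and leaves the rest queued.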
Auto Scaling
● Automatically scale up/out
● Cluster Autoscaler: horizontally scale the cluster
● Usage Example
○ Create different node groups for different teams in the same cluster
○ Enable Cluster Autoscaler for those node groups
○ Adjust different min and max sizes for different node groups
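The usage example can be sketched as a toy calculation: each team's node group grows with demand but stays inside its own minimum and maximum. This is not the Cluster Autoscaler's actual algorithm, which reacts to unschedulable pods and node utilization. The team names, sizes, and pods-per-node figures are invented.

```python
import math
from dataclasses import dataclass

# Toy model: per-team node groups with their own scaling bounds.


@dataclass(frozen=True)
class NodeGroup:
    team: str
    min_size: int
    max_size: int
    pods_per_node: int  # how many executor pods fit on one node of this group


def desired_nodes(group, pods_requested):
    """Nodes needed for the requested pods, clamped to the group's bounds."""
    needed = math.ceil(pods_requested / group.pods_per_node)
    return max(group.min_size, min(group.max_size, needed))


groups = [NodeGroup("analytics", min_size=1, max_size=10, pods_per_node=4),
          NodeGroup("ml", min_size=0, max_size=3, pods_per_node=2)]
demand = {"analytics": 9, "ml": 20}
for g in groups:
    print(g.team, desired_nodes(g, demand[g.team]))
```

Separate bounds per group let one team's burst scale its own nodes without consuming another team's budget.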
Conclusion
● Spark on Kubernetes: low operational cost (after getting it working)
● A lot of work!
● Make Deployment Easy
○ Automate
○ Repeatable
Data Punch Project
● One Click to Create Spark Environment on Kubernetes
○ Create IAM Role for EKS Control Plane
○ Create IAM Role for EKS Node Instance
○ Create the actual EKS cluster
○ Create a node group
○ Install Nginx Ingress Controller
○ Install Spark Operator
○ Install Spark REST Service
● Command Example: punch install SparkOnK8s
● https://github.com/datapunchorg/punch
Thank you!

One Click to Run Apache Spark as a Service on Kubernetes
