Bo Yang
https://www.linkedin.com/in/hiboyang/
1
Run Apache Spark as a Service on Kubernetes
Challenges and Solutions
2
Agenda
● Quick Introduction
● Challenges
● Spark As A Service
● One Click Deployment
3
Self Introduction
● Bo Yang
● Worked in Big Data for 10+ Years (Uber, Apple, …)
● Open Source Projects
○ JVM Profiler for Spark
○ Remote Shuffle Service for Spark
○ Data Punch - One Click to Deploy Spark Service
● Now Working at ZettaBlock.com - Data Infra for Web3 (Easy Access to On-Chain / Off-Chain Data)
4
Introduction
● Apache Spark: powerful tool for data processing and machine learning
● Kubernetes: extensible platform to run containers
● Apache Spark + Kubernetes: 1 + 1 > 2
● It works, after solving various challenges
5
Why Run Spark on Kubernetes
● Industry Trend
● Low Operation Cost
○ No Need to Maintain Hadoop/YARN Stacks
● Unified Compute Platform
○ Online Service
○ Offline Data Processing
(Figure: screenshot from a Google search results page)
6
Challenge: Complexity
😨
● Create EKS/Kubernetes
● Set up IAM Policy
● Create Service Account
● Learn Kubectl
● Set up Auto Scaling
● Install Spark Operator
● Add Node Groups
7
Spark Operator
● Operator Pattern
● Spark Operator: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
○ Spark Application CRD (Custom Resource)
○ Monitor Application Status
○ Restart Application on Failure
photo copied from https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/design.md
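A minimal sketch of what a Spark Application custom resource might look like, built as a plain Python dict that can be dumped to JSON and applied with kubectl. The field names follow the operator's v1beta2 SparkApplication API as I understand it; the namespace, service account, image, Spark version, and retry count are all assumptions for illustration.

```python
import json


def spark_application(name, image, main_class, jar, executors=2):
    """Build a SparkApplication custom resource as a plain dict."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        # The "spark" namespace is an assumption, not from the slides.
        "metadata": {"name": name, "namespace": "spark"},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": jar,
            "sparkVersion": "3.2.1",  # assumed version
            # Ask the operator to restart the application on failure.
            "restartPolicy": {"type": "OnFailure", "onFailureRetries": 3},
            # The "spark" service account is an assumption.
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"cores": 1, "memory": "2g", "instances": executors},
        },
    }


def to_manifest(app):
    """Serialize the resource; kubectl accepts JSON as well as YAML."""
    return json.dumps(app, indent=2)
```

The operator watches resources of this kind, submits the application, monitors its status, and applies the restart policy.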
8
Spark UI
● Challenge:
○ Spark UI Runs Inside the Driver Pod
○ Cannot Access Spark UI from Outside the Kubernetes Cluster
● Solution: Nginx Ingress Controller / Reverse Proxy
○ Automate ingress rule to expose Spark UI
○ Use Reverse Proxy to route traffic to Spark Driver Pod
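One way the ingress rule could be automated is to generate an Ingress resource per application. The sketch below builds a networking.k8s.io/v1 Ingress as a dict. The driver UI service name (`<app>-ui-svc`), the `nginx` ingress class, the host, and the namespace are assumptions; 4040 is the default Spark UI port.

```python
def spark_ui_ingress(app_name, host, namespace="spark"):
    """Build an Ingress that routes a host name to the driver's Spark UI."""
    # Assumed name of the Service that fronts the driver's UI port.
    service = f"{app_name}-ui-svc"
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": f"{app_name}-ui", "namespace": namespace},
        "spec": {
            "ingressClassName": "nginx",  # assumes an Nginx Ingress Controller
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service,
                                        "port": {"number": 4040},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }
```

Using one host per application keeps the rule simple; a shared host with per-application paths would also need URL rewriting, which is not shown here.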
9
Spark Logs
● Challenges
○ No log aggregation
○ Executor logs are gone after the application finishes
● Solutions
○ Log shipping tools (Fluentd, Fluent Bit, Logstash, etc.)
○ Keep executor pods after they terminate: spark.kubernetes.executor.deleteOnTermination=false
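A small sketch of how the pod-retention setting might be passed to spark-submit, and how logs of the retained executor pods could then be read with kubectl. The helper names, pod names, and namespace are hypothetical; the Spark property and the `kubectl logs` command are real.

```python
def keep_executor_pods_conf():
    """Spark conf that leaves executor pods in place after they terminate."""
    return {"spark.kubernetes.executor.deleteOnTermination": "false"}


def to_submit_flags(conf):
    """Render a conf dict as spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags


def log_commands(pod_names, namespace="spark"):
    """One kubectl command per retained executor pod."""
    return [["kubectl", "logs", pod, "--namespace", namespace] for pod in pod_names]
```

Retained pods still have to be cleaned up eventually, so this is a stopgap next to a log shipping tool rather than a replacement for one.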
10
Dynamic Allocation
● Benefits
○ Dynamically allocate and terminate workers (executors)
○ Increase cluster utilization and reduce cost
● Challenges
○ Issue on Kubernetes: shuffle data prevents terminating executors
○ No External Shuffle Service (which depends on YARN)
● Solutions
○ Decouple shuffle data from executor
○ Uber Remote Shuffle Service
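For comparison, a sketch of the Spark configuration for dynamic allocation without any external shuffle service. Note that this uses Spark's built-in shuffle tracking (available since Spark 3.0), not the Remote Shuffle Service named above: it keeps executors that still hold shuffle data alive instead of decoupling the data from them. The helper name and the executor bounds are hypothetical.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Spark conf enabling dynamic allocation with shuffle tracking."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("require 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Built-in alternative to an external or remote shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
```

Because executors with live shuffle data cannot be released under this setting, a remote shuffle service can still free resources sooner.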
11
Kubernetes Default Scheduler
● Originally designed to orchestrate long running services
● Problems
○ Missing Job Queue
○ No Dynamic Resource Limit
○ Driver Deadlock
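The driver deadlock can be illustrated with a toy simulation: pods are placed one at a time in arrival order, with no notion of which job they belong to. This is a simplified model under assumed numbers (one slot per pod, a fixed slot capacity), not the real scheduler's logic.

```python
def schedule_pod_by_pod(capacity, pending_pods):
    """Place pods one at a time in arrival order until slots run out."""
    running = []
    for pod in pending_pods:
        if len(running) < capacity:
            running.append(pod)
    return running


def jobs_making_progress(running):
    """Jobs that have both a driver and at least one executor running."""
    drivers = {job for job, role in running if role == "driver"}
    executors = {job for job, role in running if role == "executor"}
    return drivers & executors


# Three jobs submit their drivers first; executors are requested afterwards.
pending = [(job, "driver") for job in "abc"] + [(job, "executor") for job in "abc"]
running = schedule_pod_by_pod(capacity=3, pending_pods=pending)
```

With three slots, all of them go to drivers, every driver waits for executors that can never start, and no job makes progress.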
12
Batch Friendly Scheduler
● Kubernetes Scheduling Framework
○ Support plugins to enhance behavior
○ Able to totally replace the default scheduler
● Options
○ Volcano: a batch scheduler inspired by machine learning workloads
○ Apache YuniKorn: inspired by the YARN scheduler from Hadoop
○ Scheduler Plugin - scheduler plugin to support gang scheduling and FIFO
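A toy sketch of gang scheduling with a FIFO queue: a job is admitted only if its driver and all of its executors fit at once, otherwise it waits. This illustrates the idea only; it is not how Volcano, YuniKorn, or the scheduler plugins are implemented, and the one-slot-per-pod model is an assumption.

```python
from collections import deque


def gang_schedule(capacity, jobs):
    """Admit jobs in FIFO order, whole gang or nothing.

    jobs is a list of (name, executor_count); each job also needs one driver.
    Returns (admitted job names, waiting job names).
    """
    queue = deque(jobs)
    free = capacity
    admitted = []
    while queue:
        name, executors = queue[0]
        needed = 1 + executors  # driver plus all executors
        if needed > free:
            break  # strict FIFO: the head of the queue blocks later jobs
        queue.popleft()
        free -= needed
        admitted.append(name)
    return admitted, [name for name, _ in queue]
```

For example, with four slots and jobs that each need three, only the first job runs and the rest queue, so no driver sits idle holding a slot.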
13
Auto Scaling
● Automatically scale up/out
● Cluster Autoscaler: horizontally scale the cluster
● Usage Example
○ Create different node groups for different teams in the same cluster
○ Enable Cluster Autoscaler for those node groups
○ Adjust different min and max sizes for different node groups
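The effect of per-group min and max sizes can be sketched as a clamp on the node count a group would otherwise need. This is a simplified model: the real Cluster Autoscaler reacts to unschedulable pods and node utilization, and the group names and sizes below are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeGroup:
    name: str
    min_size: int
    max_size: int
    cpus_per_node: int


def desired_nodes(group, requested_cpus):
    """Nodes needed for the requested CPUs, clamped to the group's bounds."""
    needed = math.ceil(requested_cpus / group.cpus_per_node)
    return max(group.min_size, min(group.max_size, needed))


# Different teams get different bounds in the same cluster.
team_a = NodeGroup("team-a", min_size=1, max_size=10, cpus_per_node=8)
team_b = NodeGroup("team-b", min_size=0, max_size=50, cpus_per_node=16)
```

The max size caps what one team can spend, while a min size of zero lets an idle group scale all the way down.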
14
Make it User Friendly: Spark As a Service
[Diagram: Spark Application Submission]
● User (CLI / Curl) → API Gateway → Kubernetes Clusters (Dynamic Cluster Routing)
● Each Kubernetes Cluster: Spark Operator + Spark Applications (Spark Driver Pod, Spark Executor Pods)
● Diagram regions: User Environment, Service Environment
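A minimal sketch of dynamic cluster routing inside a gateway: clusters can be added or removed at any time, and each submission goes to whichever registered cluster currently has the fewest applications. The least-loaded policy, the class name, and the cluster names are assumptions; the slides do not say how routing decisions are made.

```python
class ClusterRouter:
    """Tracks available clusters and picks one per submission."""

    def __init__(self):
        self._clusters = {}  # cluster name -> applications routed so far

    def add_cluster(self, name):
        self._clusters.setdefault(name, 0)

    def remove_cluster(self, name):
        self._clusters.pop(name, None)

    def submit(self, app_name):
        """Return the cluster chosen for this application."""
        if not self._clusters:
            raise RuntimeError("no clusters available")
        # Least-loaded cluster; ties go to the alphabetically first name.
        target = min(sorted(self._clusters), key=self._clusters.get)
        self._clusters[target] += 1
        return target
```

Because users only ever talk to the gateway, a cluster can be drained and removed without any change on the user side.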
15
Summary
● Spark on Kubernetes: Low Operation Cost
● A Lot of Work to Get Started!
● Need Easy Deployment
○ Automate
○ Repeatable
○ Zero Time to Market
16
DataPunch Project - One Click Deploy
https://github.com/datapunchorg/punch
😃
● One Click To Deploy
○ Create EKS/Kubernetes
○ Set up IAM Policy
○ Create Service Account
○ Set up Auto Scaling
○ Install Spark Operator
○ Spark API Gateway
● Curl / CLI To Submit Application
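To give a feel for what submitting through a REST gateway might look like, the sketch below builds (but does not send) an HTTP POST request with the Python standard library. The `/submissions` path and the JSON field names are hypothetical and are not taken from the DataPunch API; check the project documentation for the real endpoint.

```python
import json
import urllib.request


def build_submission(gateway_url, main_class, app_file, arguments=()):
    """Build a POST request describing one Spark application."""
    body = {
        # Hypothetical payload fields, for illustration only.
        "mainClass": main_class,
        "mainApplicationFile": app_file,
        "arguments": list(arguments),
    }
    return urllib.request.Request(
        url=f"{gateway_url}/submissions",  # hypothetical path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it; the user needs no kubectl access or Kubernetes credentials for this step.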
17
DataPunch Architecture
[Diagram: DataPunch architecture]
● Punch Command → Topology → Topology Deployment (Step 1, Step 2, …, Step N)
● Topologies: EKS, SparkOnEks, Hive Metastore
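The idea of a topology deployed as an ordered list of steps can be sketched as follows. This is a hypothetical illustration in Python, not the punch project's actual code; the class, the step names, and the stop-on-first-failure behavior are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Topology:
    name: str
    steps: List[Tuple[str, Callable[[], None]]] = field(default_factory=list)

    def add_step(self, step_name, action):
        self.steps.append((step_name, action))

    def deploy(self):
        """Run steps in order; an exception in a step stops the deployment."""
        completed = []
        for step_name, action in self.steps:
            action()
            completed.append(step_name)
        return completed


# Hypothetical steps mirroring the manual checklist from the complexity slide.
log = []
topology = Topology("SparkOnEks")
topology.add_step("createEksCluster", lambda: log.append("eks"))
topology.add_step("installSparkOperator", lambda: log.append("operator"))
topology.add_step("deployApiGateway", lambda: log.append("gateway"))
```

Encoding the steps once is what makes the deployment repeatable: running the same topology again performs the same actions in the same order.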
18
DataPunch Benefits
● Learning Curve: Days -> Minutes
● Time to Deploy: Hours -> Minutes
● Operation: Manual -> Automated and Repeatable
19
Take-Away
● Decouple End User from Kubernetes Resources
○ Reduce Learning Curve
● REST API Gateway to Submit Application
○ Dynamically Add/Remove Kubernetes Clusters without User Impact
● Repeatable/Easy Deployment - DataPunch Project
○ Simplify Operation
○ https://github.com/datapunchorg/punch
● References
○ Challenges of Running Spark on Kubernetes - Part 1
○ Challenges of Running Spark on Kubernetes - Part 2
