Auto-scaling
Apache Spark cluster using
Deep Reinforcement Learning
Kundjanasith Thonglek¹, Kohei Ichikawa¹, Chatchawal Sangkeettrakan², Apivadee Piyatumrong²
¹ Nara Institute of Science and Technology (NAIST), Japan
² National Electronics and Computer Technology Center (Nectec), Thailand
OLA’2019 : International Conference on Optimization and Learning
Agenda
Introduction
Methodology
Evaluation
Conclusion
2
Introduction
3
Big data and advanced analytics technologies are attracting much attention, not just because the data is big, but because the potential impact is big.
A real-time application may have to handle input data of different sizes at different times, as well as different machine learning techniques for different purposes at the same time.
Engineers need to handle large-scale data processing systems efficiently. However, data processing science is a relatively new field that requires advanced knowledge of a huge variety of techniques, tools, and theories.
Apache Spark
Apache Spark is a fast, in-memory data processing engine with elegant and
expressive development APIs to allow data workers to efficiently execute
streaming, machine learning or SQL workloads that require fast iterative access
to datasets.
Spark operations:
- Transformation: passes each dataset element through a function and returns a new RDD representing the results.
- Action: aggregates all the elements of the RDD using some function and returns the final result to the driver program.
4
[Diagram: transformations map RDD → RDD; an action reduces an RDD to a value returned to the driver]
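The lazy-transformation versus eager-action distinction above can be sketched in plain Python. This is a toy stand-in, not the Spark API: `MiniRDD` records transformations without computing anything, and only an action materializes the pipeline and returns a value.

```python
from functools import reduce as _reduce

class MiniRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    actions trigger evaluation and return a value to the 'driver'."""

    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []   # deferred transformations

    def map(self, fn):                    # transformation: returns a new MiniRDD
        return MiniRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):               # transformation
        return MiniRDD(self._data, self._pipeline + [("filter", pred)])

    def _materialize(self):
        items = self._data
        for kind, fn in self._pipeline:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def reduce(self, fn):                 # action: forces evaluation
        return _reduce(fn, self._materialize())

    def collect(self):                    # action
        return self._materialize()


rdd = MiniRDD([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)          # nothing computed yet
total = doubled.reduce(lambda a, b: a + b)  # evaluation happens here
print(total)  # 20
```

In real Spark the same shape appears as `rdd.map(...).reduce(...)`, with the scheduler deferring work until the action runs.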
Apache Spark cluster
5
The Key Components of Apache Spark cluster
[Diagram: the Driver Program with its Spark Context runs on the Master Node; the Cluster Manager dispatches work to Executors on the Worker Nodes, which can be scaled; data is stored on the Data Node]
Master Node
- Spark Context: essentially a client of Spark's execution environment; it acts as the master of the Spark application.
Worker Node
- Executor: a distributed agent responsible for executing tasks.
Problem statement
When should the Apache Spark cluster scale out or scale in its worker nodes to complete a task within the limit execution time constraint and the maximum number of worker nodes constraint?
6
[Diagram: scale-out adds worker-node resources over time; scale-in releases them]
The system supports real-time processing to handle input data of different sizes at different times.
The system can complete the task within the bounded time and resource constraints.
Objectives
We create an auto-scaling system that scales an Apache Spark cluster automatically on the OpenStack platform using a deep reinforcement learning technique.
Auto-Scaling system
8
SCALING TECHNIQUE
- Rule-Based Scaling Technique: the cluster management system compares the cluster's current state against a fixed rule and sends a scaling command to the cluster.
- Data-Driven Scaling Technique: the cluster management system feeds the current state and task status into data modeling; the resulting data model issues the scaling command.
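As a point of contrast with the learned approach, a rule-based scaler can be sketched in a few lines. The thresholds and the `ClusterState` fields below are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    cpu_usage: float      # fraction of CPU in use, 0.0-1.0
    memory_usage: float   # fraction of memory in use, 0.0-1.0
    workers: int          # current number of worker nodes

def rule_based_decision(state: ClusterState, max_workers: int) -> str:
    """Return a scaling command based on fixed rules (hypothetical thresholds)."""
    if (state.cpu_usage > 0.8 or state.memory_usage > 0.8) and state.workers < max_workers:
        return "scale-out"
    if state.cpu_usage < 0.2 and state.memory_usage < 0.2 and state.workers > 1:
        return "scale-in"
    return "no-op"

print(rule_based_decision(ClusterState(0.9, 0.5, 2), max_workers=3))  # scale-out
```

Fixed rules like these cannot adapt to changing workload patterns, which is the gap the data-driven technique targets.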
Methodology
Auto-scaling Apache Spark cluster using Deep Reinforcement Learning
Set up Environment → Feature Selection → Applied DQN → Auto-scaling system
Set up Environment
- Set up the Apache Spark cluster on the OpenStack platform by configuring the Apache Spark cluster template
Feature Selection
- Analyse the features from the logs that we collect through the system API
Applied DQN
- DQN is a deep reinforcement learning technique that is suitable for this problem
Auto-scaling system
- Design our auto-scaling system to connect the compute and scaling modules
9
Set up Environment
10
The OpenStack system is prepared with the necessary Apache Spark cluster templates, such as the master node, worker node, and data node templates. One cluster must have at least one master node and one worker node.
OpenStack platform
Apache Spark cluster
Apache Spark cluster is launched on the OpenStack platform in
homogeneous mode.
Node specification:
- CPU: 4 vCPUs
- Memory: 8 GB
- Storage disk: 20 GB
Feature Selection
11
The six selected features are collected and analyzed from the cluster:
- ma : the percentage of memory usage when Apache Spark operates an action
- mt : the percentage of memory usage when Apache Spark operates a transformation
- cu : the percentage of CPU usage for user processes
- cs : the percentage of CPU usage for system processes
- bi : the percentage of network usage for the inbound network
- bo : the percentage of network usage for the outbound network
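For illustration, the six percentage features could be packed into a normalized state vector before being fed to the agent. The function name and the [0, 1] scaling are assumptions for the sketch, not the paper's code:

```python
def state_vector(ma, mt, cu, cs, bi, bo):
    """Pack the six percentage features (0-100) into a normalized state vector."""
    features = [ma, mt, cu, cs, bi, bo]
    if not all(0.0 <= f <= 100.0 for f in features):
        raise ValueError("each feature is a percentage in [0, 100]")
    return [f / 100.0 for f in features]   # scale to [0, 1]

# Illustrative sample values, not measured data
print(state_vector(ma=40, mt=55, cu=70, cs=10, bi=5, bo=8))
```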
Deep Reinforcement Learning
12
[Diagram: the deep reinforcement learning agent observes the Apache Spark cluster on the OpenStack platform and scales it, guided by the constraints and the reward function]
State: the current state of the Apache Spark cluster, acquired as the features mt, ma, cu, cs, bi, and bo.
Action: the scaling action Ay (scale-out, neutral, or scale-in) together with the number y of worker nodes to scale.
Agent: a Deep Q-Network (DQN) serves as the network that learns the features and takes the action.
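The agent's learning loop can be sketched in miniature. The paper uses a Deep Q-Network; here a tabular Q-learning update stands in for the network so the loop stays self-contained, and the toy environment, reward, and hyperparameters are illustrative assumptions only:

```python
import random

ACTIONS = [-1, 0, +1]        # scale-in one node, neutral, scale-out one node
N_MAX = 3                    # resource constraint: at most 3 worker nodes
TARGET_LOAD = 2              # toy workload: 2 workers is the "right" size

def toy_reward(workers):
    """Toy reward: penalize distance from the workload's ideal cluster size."""
    return -abs(TARGET_LOAD - workers)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, N_MAX + 1) for a in ACTIONS}
    for _ in range(episodes):
        workers = rng.randint(1, N_MAX)             # random starting state
        for _ in range(10):
            # epsilon-greedy action selection over the Q estimates
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: q[(workers, x)])
            nxt = min(max(workers + a, 1), N_MAX)   # keep 1..N_MAX workers
            target = toy_reward(nxt) + gamma * max(q[(nxt, x)] for x in ACTIONS)
            q[(workers, a)] += alpha * (target - q[(workers, a)])
            workers = nxt
    return q

q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(1, N_MAX + 1)}
print(policy)   # scale out below the target size, scale in above it
```

The DQN replaces the Q table with a neural network over the six-feature state, but the epsilon-greedy selection and temporal-difference update have the same shape.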
13
States & Constraints
The states are the possible environment statuses of the system under study. In our scenario, the Apache Spark cluster is spawned with at least one master node and one worker node, based on the pre-configured OpenStack template for scaling purposes.
If the maximum number of worker nodes is N, then the number of possible states is N.
Assumption: the maximum number of worker nodes is 3, giving the states S1[T,3], S2[T,3], and S3[T,3] (one state per possible number of worker nodes).
[ T, N ] are the environment constraints:
- Time constraint [ T ]: the expected bound on execution time.
- Resource constraint [ N ]: the maximum number of worker nodes.
Actions
14
The actions enable deep reinforcement learning to scale the Apache Spark cluster. There are three possible scaling actions: (1) scaling-out, (2) not-scaling, and (3) scaling-in.
If the maximum number of worker nodes is N, then the number of possible actions is 2(N - 1) + 1.
Assumption: the maximum number of worker nodes is 3, giving five actions: A1o and A2o (scale out by one or two nodes), A0neutral (no scaling), and A1i and A2i (scale in by one or two nodes).
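The 2(N - 1) + 1 count above can be checked by enumerating the action space directly. The action labels are illustrative renderings of the slide's notation:

```python
def action_space(n_max):
    """Enumerate scaling actions for a cluster capped at n_max worker nodes."""
    actions = ["A0_neutral"]                 # not-scaling
    for y in range(1, n_max):                # y = number of nodes to scale
        actions.append(f"A{y}_out")          # scaling-out by y nodes
        actions.append(f"A{y}_in")           # scaling-in by y nodes
    return actions

acts = action_space(3)
print(len(acts), acts)   # 5 actions when N = 3, matching 2(N - 1) + 1
```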
Reward Function
15
The reward function gives a reward r to the agent when it makes a decision to scale the cluster, which must keep at least one worker node. It uses the features selected and explained earlier as well as the constraints of the cluster state (ma, mt, cu, cs, bi, bo, T, N). Furthermore, it takes into account the number of worker nodes y scaled by the action:

w(y) = +y, when Ay_o ; the agent takes the scaling-out action
w(y) = 0, when A0_neutral ; the agent takes the not-scaling action
w(y) = -y, when Ay_i ; the agent takes the scaling-in action

The reward function is defined as

r = ( 1 - (ma + mt + cu + cs + bi + bo) / U ) + ( 1 + (T - t) / T ) · w / (N - 1)

where t is the execution time of this round and U is the number of features.
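The reward function can be sketched directly from the definitions above. The grouping of terms is reconstructed from the slide layout, so treat the exact arrangement as an assumption; features are fractions in [0, 1] here:

```python
def w(y, action):
    """Signed number of scaled worker nodes for the chosen action."""
    return {"out": +y, "neutral": 0, "in": -y}[action]

def reward(ma, mt, cu, cs, bi, bo, T, t, N, y, action):
    """Reward for one round, as reconstructed from the slide's formula."""
    U = 6                                     # number of features
    feature_term = 1 - (ma + mt + cu + cs + bi + bo) / U
    time_factor = 1 + (T - t) / T             # bonus for finishing early
    scaling_term = time_factor * w(y, action) / (N - 1)
    return feature_term + scaling_term

# Illustrative values: T and t in seconds, features as fractions
r = reward(ma=0.4, mt=0.5, cu=0.6, cs=0.1, bi=0.05, bo=0.05,
           T=600, t=480, N=5, y=1, action="out")
print(round(r, 3))
```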
System Architecture
16
OpenStack platform
Apache Spark cluster Deep Reinforcement Learning node
Learning & Scaling Engine
Scaling-Mode Web Interface
Data Publishing Engine
Evaluation
17
The auto-scaling system on the Apache Spark cluster using deep reinforcement learning is evaluated with a 5 GB data size, processed via streaming. Each environment constraint is tested 100 times.
It is evaluated under two constraints:
(1) the limit execution time constraint ( T )
(2) the maximum number of worker nodes constraint ( N )
T = { 5, 6, 7, 8, 9, 10 } minutes
N = { 5, 6, 7, 8, 9, 10 } nodes
The Percentage of Job Failure with Different Optimization Models
18
[Chart: percentage of job failure for our model, Deep Q-Network (DQN), versus the baseline, Linear Regression (LR)]
The Sacrifice and Stabilize period of DQN and LR
19
| # Experiment | T=5 LR | T=5 DQN | T=6 LR | T=6 DQN | T=7 LR | T=7 DQN | T=8 LR | T=8 DQN | T=9 LR | T=9 DQN |
|---|---|---|---|---|---|---|---|---|---|---|
| 1-25 | 4 | 5, L=9 | 4 | 5, L=7 | 2 | 2, L=3 | 0 | 0 | 0 | 0 |
| 26-50 | 2 | 0 | 3 | 0 | 1 | 0 | 1, L=34 | 0 | 0 | 0 |
| 51-75 | 2 | 0 | 2, L=73 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 76-100 | 2, L=90 | 0 | 0 | 0 | 1, L=84 | 0 | 0 | 0 | 0 | 0 |
The maximum number of worker nodes constraint is 5 worker nodes.
Let L be the experiment round in which the last failure happened.
Conclusion
● We study how to automatically optimize the scaling of computing nodes in an Apache Spark cluster using a deep reinforcement learning technique.
20
● We found six significant features that have a direct impact on the performance of real-time applications running on an Apache Spark cluster.
● We improved the performance of the cluster under two constraints: the limit on execution time and the maximum number of worker nodes per cluster.
Implementation
We provide Docker images on Docker Hub and the source code on GitHub
21
https://hub.docker.com/r/kundjanasith/kitwai-engine/
https://hub.docker.com/r/kundjanasith/kitwai-ai/
https://github.com/Kundjanasith/scaling-sparkcluster/
Email : thonglek.kundjanasith.ti7@is.naist.jp
Thank You
Q & A
Kundjanasith Thonglek
Software Design & Analysis Laboratory, NAIST
22