Dynamic Resource Allocation for Spark on YARN
Tsuyoshi Ozawa
ozawa@apache.org
What's YARN?
• A resource manager implementation for computer clusters
Hadoop Stack
[Diagram: MapReduce, Spark, and Tez run on top of YARN, which runs on top of HDFS.]
YARN overview
• All resources are managed by the ResourceManager
• All tasks are launched on NodeManagers
• Clients submit jobs via the ResourceManager
[Diagram: a client talks to the ResourceManager, which manages two NodeManagers.]
Spark on YARN
• 2 modes
• yarn-cluster
• yarn-client
yarn-cluster mode
• Launches the Spark driver in a YARN container
• Works well with spark-submit
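[Diagram: (1) the client submits the job to the ResourceManager; (2) the Spark AppMaster, which hosts the Spark driver, is launched in a container on a NodeManager; (3) executors are launched in containers (container1, container2) on other NodeManagers.]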
yarn-client mode
• Launches the Spark driver on the client side
• Works well with spark-shell
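[Diagram: (1) the client, which hosts the Spark driver, submits the job to the ResourceManager; (2) the Spark AppMaster is launched in a container on a NodeManager; (3) executors are launched in containers (container1, container2) on other NodeManagers; (4) the driver sends commands to the executors.]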
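The two modes above differ mainly in where the driver runs, and that shows up in how a job is started. A minimal Python sketch that only builds the command lines for Spark 1.x; the class name and jar path are hypothetical placeholders, and nothing is executed:

```python
import shlex


def spark_command(mode, main_class=None, app_jar=None):
    """Build (but do not run) a Spark 1.x command line for the given YARN mode."""
    if mode == "cluster":
        # The driver runs inside the AppMaster container, so a packaged
        # application is handed to spark-submit.
        return ["spark-submit", "--master", "yarn-cluster",
                "--class", main_class, app_jar]
    if mode == "client":
        # The driver stays on the client, which suits the interactive shell.
        return ["spark-shell", "--master", "yarn-client"]
    raise ValueError(f"unknown mode: {mode!r}")


# "com.example.MyApp" and "my-app.jar" are hypothetical names.
cluster_cmd = spark_command("cluster", "com.example.MyApp", "my-app.jar")
client_cmd = spark_command("client")
print(shlex.join(cluster_cmd))
print(shlex.join(client_cmd))
```

The `yarn-cluster` and `yarn-client` master values match the Spark 1.x releases these slides cover; later releases express the same choice as `--master yarn` plus `--deploy-mode`.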
Spark on YARN
• yarn-cluster mode
[Diagram: three nodes (Node1, Node2, Node3) hosting the AppMaster and the executor containers.]
Problem
• Inefficient resource management
• Containers cannot exit until the job exits
[Diagram: four containers on Node1 and Node2. In stage1 all four are 100% utilized; in stage2 one container is at 100% and the other three are at 0%.]
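The cost of holding idle containers can be put in numbers. A minimal Python sketch using the utilization figures from this slide, and assuming purely for illustration that both stages take the same amount of time:

```python
# Per-container utilization from the slide: four containers, two stages.
stage1 = [1.0, 1.0, 1.0, 1.0]
stage2 = [1.0, 0.0, 0.0, 0.0]


def average_utilization(stages):
    """Mean utilization over all containers and stages.

    Assumes every stage lasts equally long (an illustrative simplification).
    """
    samples = [u for stage in stages for u in stage]
    return sum(samples) / len(samples)


def idle_containers(stage):
    """Number of containers that are allocated but doing no work."""
    return sum(1 for u in stage if u == 0.0)


static_avg = average_utilization([stage1, stage2])
print(f"average utilization with static allocation: {static_avg:.1%}")
print(f"containers idle during stage2: {idle_containers(stage2)}")
```

Under these assumptions, three of the four containers do nothing during stage2 but still occupy cluster resources until the job exits.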
Dynamic resource allocation (since v1.2)
• Allocates containers more dynamically
• The number of executors is decided by the workload
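[Diagram: (1) the client submits the job to the ResourceManager; (2) the Spark AppMaster, which hosts the Spark driver, is launched in a container on a NodeManager; (3) executors are launched in, or killed from, containers (container1, container2) on other NodeManagers as the workload changes.]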
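How a workload-driven executor count might evolve can be sketched with a toy model. This is not Spark's code: the real policy is driven by timeouts such as spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.executorIdleTimeout, which this Python sketch replaces with discrete rounds. The class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyAllocator:
    """Toy model of workload-driven executor allocation (not Spark's code)."""
    min_executors: int
    max_executors: int
    executors: int = 0
    step: int = 1  # how many executors to add in the next backlogged round

    def tick(self, pending_tasks: int, idle_executors: int) -> int:
        """Advance one round and return the new executor count."""
        if pending_tasks > 0:
            # Backlog: ramp up, doubling the request size every round.
            self.executors = min(self.executors + self.step, self.max_executors)
            self.step *= 2
        else:
            # No backlog: release idle executors and reset the ramp-up.
            self.executors = max(self.executors - idle_executors, self.min_executors)
            self.step = 1
        return self.executors


alloc = ToyAllocator(min_executors=1, max_executors=10)
# Four rounds with a task backlog, then one round where nine executors sit idle.
history = [alloc.tick(pending_tasks=100, idle_executors=0) for _ in range(4)]
history.append(alloc.tick(pending_tasks=0, idle_executors=9))
print(history)
```

The executor count grows while tasks are waiting and shrinks back toward the minimum once executors go idle, which is the behavior the slide describes.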
Yak shaving
• Where should we hold the state of Spark RDDs?
• If executors are killed, it'll be lost…
[Diagram: a NodeManager running two executors, each holding its own RDD data.]
external shuffle
• Saves Spark RDD intermediate files on the NodeManager
• The NodeManager has an interface for this: the external shuffle plugin
• Now executors are stateless!
[Diagram: a NodeManager running two executors; the RDD intermediate files are held by the external shuffle plugin inside the NodeManager rather than by the executors.]
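Why this makes executors safe to kill can be shown with a toy model. In this Python sketch all class and method names are hypothetical, and none of them are the YARN or Spark API. The point is only that the intermediate data is owned by the node, not by the executor process:

```python
class ToyNodeManager:
    """Toy stand-in for a NodeManager with an external shuffle plugin."""

    def __init__(self):
        self.executors = set()
        self.shuffle_files = {}  # partition id -> data, owned by the node

    def launch_executor(self, executor_id):
        self.executors.add(executor_id)

    def write_shuffle_output(self, executor_id, partition, data):
        # The executor produces the data, but the node-level service keeps it.
        if executor_id not in self.executors:
            raise RuntimeError(f"executor {executor_id} is not running")
        self.shuffle_files[partition] = data

    def kill_executor(self, executor_id):
        # Only the process goes away; its intermediate files stay on the node.
        self.executors.discard(executor_id)

    def fetch(self, partition):
        return self.shuffle_files[partition]


nm = ToyNodeManager()
nm.launch_executor("executor-1")
nm.write_shuffle_output("executor-1", partition=0, data=b"stage1 output")
nm.kill_executor("executor-1")
print(nm.fetch(0))  # still served although the executor is gone
```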
How to install
(with Apache Hadoop)
• Copy the shuffle plugin to the NodeManager's classpath
• Edit yarn-site.xml
• Edit spark-defaults.conf
Copy shuffle jar to NodeManager's classpath
$ cp lib/spark-*-yarn-shuffle.jar /home/ubuntu/hadoop/share/hadoop/yarn/
Edit yarn-site.xml
• Add the shuffle plugin
• Note that the documentation for 1.2 includes a typo - I sent a PR :-)
• See the documentation for 1.4
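A sketch of the yarn-site.xml change, based on the Spark documentation's instructions for the external shuffle service rather than on the slide itself. The Python below only renders the two properties. It assumes the cluster already runs the `mapreduce_shuffle` auxiliary service, so `spark_shuffle` is appended to that list; adjust the value if your cluster differs.

```python
import xml.etree.ElementTree as ET

# Property names and the service class follow the Spark-on-YARN documentation.
AUX_SERVICES = {
    "yarn.nodemanager.aux-services": "mapreduce_shuffle,spark_shuffle",
    "yarn.nodemanager.aux-services.spark_shuffle.class":
        "org.apache.spark.network.yarn.YarnShuffleService",
}


def to_yarn_site(properties):
    """Render a dict of settings as a yarn-site.xml <configuration> element."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


yarn_site_xml = to_yarn_site(AUX_SERVICES)
print(yarn_site_xml)
```

In practice these properties are merged into the existing yarn-site.xml on every node, and the NodeManagers are restarted so that they load the plugin.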
Edit spark-defaults.conf
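A sketch of the settings this step needs, taken from the Spark configuration documentation rather than from the slide itself. The two `enabled` flags turn on dynamic allocation and the external shuffle service. The executor bounds are illustrative values only, not recommendations.

```python
SETTINGS = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    # Illustrative bounds; pick values that fit your cluster.
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def render_spark_defaults(settings):
    """Render settings as whitespace-separated 'key value' lines."""
    width = max(len(key) for key in settings)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in settings.items())


spark_defaults = render_spark_defaults(SETTINGS)
print(spark_defaults)
```

The printed lines can be appended to conf/spark-defaults.conf as they are.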
We're ready!!
• num-executors is now determined automatically
Demo
Summary
• Spark on YARN
• yarn-client mode
• yarn-cluster mode
• Spark can launch jobs efficiently on YARN with dynamic allocation
