GoodFit: Multi-Resource Packing of Tasks with Dependencies
Cluster Scheduling for Jobs
Jobs (tasks + dependencies) — e.g., BigData (Hive, SCOPE, Spark), CloudBuild
Machines, file-system, network
Cluster Scheduler: matches tasks to resources
Goals
• High cluster utilization
• Fast job completion time
• Predictable perf. / fairness
Compared to VM placement:
• Need not keep resource “buffers”
• More dynamic than VM placement (tasks last seconds)
• Aggregate properties are important (e.g., all tasks in a job should finish)
Need careful multi-resource planning
Problem 1
• Fragmentation — current schedulers vs. a packer scheduler: 2 tasks/T → 3 tasks/T (+50%)
• Over-allocation of net/disk — current schedulers vs. a packer scheduler: 2 tasks/2T → 2 tasks/T (+100%)
… worse with dependencies
Problem 2
[Figure: example DAG, each node labeled {duration, resource demand} — e.g. Tt, (1/n)·r; t, r; t, (1−r); (T−2)t, (1/n)·r; (T−4)t, (1/n)·r; ~Tt, (1/n)·r.
Critical-path scheduling of this DAG uses ~nT·t resource·time, while the best schedule uses only ~T·t.]
Critical path scheduling is n times off since it ignores resource demands
Packers can be d times off since they ignore future work [d resources]
Typical job scheduler infrastructure
+ packing, + bounded unfairness, + merge schedules, + overbook
[Figure: per-DAG application masters (AM), each with a Schedule Constructor, talk to a central resource manager (RM); node managers (NM) send node heartbeats and receive task assignments.]
Main ideas in multi-resource packing
Task packing ~ multi-dimensional bin packing, but:
• a very hard problem (“APX-hard”)
• available heuristics do not directly apply [task demands change with placement]
A packing heuristic:
• Fit — task’s resource demand vector D ≤ machine’s free resource vector R
• Alignment score (A) = D · R
A job completion time heuristic:
• shortest remaining work, P = remaining # tasks × tasks’ avg. duration × tasks’ avg. resource demand
Trade-offs — packing efficiency vs. job completion time vs. fairness:
• packing alone delays job completion; completion time alone loses packing efficiency; fairness alone loses both
We show that: {best “perf” | bounded unfairness} ~ best “perf”
Main ideas in packing dependent tasks
1. Identify troublesome tasks (the “meat”) and place them first
2. Systematically place other tasks without deadlocks
3. At runtime, use a precedence order from the computed schedule + heuristics to (a) overbook, (b) apply the packing/completion-time/fairness heuristics from the previous slide
4. Better lower bounds for DAG completion time
[Figure: resource–time space with the meat (M) placed between meat-begin and meat-end; parent tasks (P) before it, children (C) after, other tasks (O) fill the holes.]
Results - 1
[Chart: DAG completion times for Packing vs. Packing + Deps. vs. a lower bound; 20K DAGs from Cosmos]
Results - 2
[Chart: Tez + Packing vs. Tez + Pack + Deps.; 200 jobs from TPC-DS on a 200-server cluster]
Bundling
Temporal relaxation of fairness
Map (disk) vs. Reduce (netw.): fair share among two identical jobs
[Figure: under instantaneous fairness both jobs run at 50% of the disk (map) and the network (reduce) and finish by 4T; relaxing fairness lets each phase run at 100%, finishing the jobs by 2T and 3T.]
Problem: instantaneous fairness can be up to d× worse on makespan (d resources)
1) Temporal relaxation of fairness
a job will finish within (1 + f)× the time it takes given its strict share
2) Optimal trade-off with performance
(1 + f)× fairness costs (2 + 2f − 2√(f + f²))× on makespan
3) A simple (offline) algorithm that achieves the above trade-off
Fairness slack f   | Perf. loss vs. best
0 (perfectly fair) | 2×
1 (<2× longer)     | 1.1×
2 (<3× longer)     | 1.07×
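As a sanity check on the trade-off formula above (the square-root term is an assumption reconstructed from the garbled slide text and the table), evaluating the makespan cost at the three slack values gives numbers close to the table:

cost(f) = 2 + 2f - 2\sqrt{f + f^2}
cost(0) = 2,   cost(1) = 4 - 2\sqrt{2} ≈ 1.17,   cost(2) = 6 - 2\sqrt{6} ≈ 1.10

in rough agreement with the 2×, 1.1×, and 1.07× entries listed above.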
Bare metal → VM Allocation → Data-parallel Jobs (Job = tasks + dependencies)
• VM Allocation — e.g., HDInsight, AzureBatch
• BigData (Yarn, Cosmos, Spark) — ~100K servers (40K at Yahoo), >50K servers, >2EB stored, >6K devs
• CloudBuild — 3500 servers, 3500 users, >20M targets/day
• Tasks are short-lived (10s of seconds)
• Tasks have peculiarly shaped demands
• Composites are important (a job needs all of its tasks to finish)
• OK to kill and restart tasks
• Locality matters
1) Job scheduling has specific aspects; 2) better scheduling will speed up the average job (and reduce resource cost); 3) research + practice
Resource aware scheduling improves SLOs and Return/$
Cluster Scheduling for Jobs
Jobs (tasks + dependencies) — e.g., HDInsight, AzureBatch; BigData (Hive, SCOPE, Spark); CloudBuild
Machines, file-system, network
Cluster Scheduler: matches tasks to resources
Goals
• High cluster utilization
• Fast job completion time
• Predictable perf. / fairness
• Efficient (milliseconds…)
Need careful multi-resource planning
Problem 1
• Fragmentation — current schedulers vs. a packer scheduler: 2 tasks/T → 3 tasks/T (+50%)
• Over-allocation of net/disk — current schedulers vs. a packer scheduler: 2 tasks/2T → 2 tasks/T (+100%)
… worse with dependencies
Problem 2
[Figure: example DAG, each node labeled {duration, resource demand} — e.g. Tt, (1/n)·r; t, r; t, (1−r); (T−2)t, (1/n)·r; (T−4)t, (1/n)·r; ~Tt, (1/n)·r.
Critical-path scheduling of this DAG uses ~nT·t resource·time, while the best schedule uses only ~T·t.]
Critical path scheduling is n times off since it ignores resource demands
Packers can be d times off since they ignore future work [d resources]
Typical job scheduler infrastructure
+ packing, + bounded unfairness, + merge schedules, + overbook
[Figure: per-DAG application masters (AM), each with a Schedule Constructor, talk to a central resource manager (RM); node managers (NM) send node heartbeats and receive task assignments.]
Main ideas in packing dependent tasks
1. Identify troublesome tasks (T) and place them first
2. Systematically place other tasks without dead-ends
3. At runtime, enforce the computed schedule + heuristics to (a) overbook, (b) apply the packing/completion-time/fairness heuristics from the previous slide
4. Better lower bounds for DAG completion time
[Figure: resource–time space with the troublesome tasks (T) placed between trouble-begin and trouble-end; parents (P) before them, children (C) after, other tasks (O) fill the holes.]
Results - 1
[Chart: DAG completion times for Packing vs. Packing + Deps. vs. a lower bound (annotated ~2X and ~1.5X); 20K DAGs from Cosmos]
Results - 2
[Chart: Tez + Packing vs. Tez + Pack + Deps.; 200 jobs from TPC-DS on a 200-server cluster]
Multi-Resource Packing for Cluster Schedulers
Performance of cluster schedulers
We observe that:
• Resources are fragmented, i.e. machines are running below capacity
• Even at 100% usage, goodput is much smaller due to over-allocation
• Even Pareto-efficient multi-resource fair schemes result in much lower performance
Tetris: up to 40% improvement in makespan¹ and job completion time with near-perfect fairness
¹ Time to finish a set of jobs
Findings from Bing and Facebook traces analysis
Diversity in multi-resource requirements:
• Tasks need varying amounts of each resource
• Demands for resources are weakly correlated
This matters because there is no single bottleneck resource:
• Multiple resources become tight
• Enough cross-rack network bandwidth to use all CPU cores
Upper bounding potential gains:
• reduce makespan¹ by up to 49%
• reduce avg. job completion time by up to 46%
Why so bad #1
Production schedulers neither pack tasks nor consider all their relevant resource demands:
#1 Resource Fragmentation
#2 Over-allocation
Resource Fragmentation (RF)
[Figure: Machines A and B each have 4 GB memory; tasks T1: 2 GB, T2: 2 GB, T3: 4 GB.
Current schedulers allocate resources in terms of slots, so free resources are left that cannot be assigned to tasks (T3 must wait): avg. task completion time = 1.33t.
A “packer” scheduler places T1 and T2 together and T3 on the other machine: avg. task completion time = 1t.]
RF increases with the number of resources being allocated!
Over-Allocation
[Figure: Machine A has 4 GB memory and 20 MB/s network. Tasks T1 and T2 each need 2 GB memory + 20 MB/s network; T3 needs 2 GB memory only.
Current schedulers: not all of a task’s resource demands are explicitly allocated — disk and network are over-allocated, so T1 and T2 run together and contend for the network: avg. task completion time = 2.33t.
A “packer” scheduler runs T1 with T3, then T2: avg. task completion time = 1.33t.]
Why so bad #2
Work conserving != no fragmentation, no over-allocation
Multi-resource fairness schemes do not help either:
• They treat the cluster as a big bag of resources — hides the impact of resource fragmentation
• They assume a job has a fixed resource profile — but different tasks in the same job have different demands
• The schedule impacts a job’s current resource profile; one can schedule to create complementary profiles
Pareto¹-efficient != performant — packer scheduler vs. DRF: avg. job completion time 50% better, makespan 33% better
¹ no job can increase its share without decreasing the share of another
Competing objectives: job completion time vs. cluster efficiency vs. fairness
Current schedulers:
1. Resource fragmentation
2. Over-allocation
3. Fair allocations sacrifice performance
#1 Pack tasks along multiple resources to improve cluster efficiency and reduce makespan
Theory vs. Practice
Multi-resource packing of tasks is similar to multi-dimensional bin packing (balls = tasks; bins = machine × time), which is APX-hard¹.
Existing heuristics do not directly apply here:
• they assume balls of a fixed size — in practice task demands vary with time / machine placed and are elastic
• they assume balls are known a priori — in practice the scheduler must cope with online arrival of jobs, dependencies, and cluster activity
Avoiding fragmentation looks like tight bin packing: reducing the # of bins used → reduced makespan.
¹ APX-hard is a strict subset of NP-hard
#1 Packing heuristic
• Fit: the task’s resource demand vector ≤ the machine’s free resource vector
• Alignment score (A): dot product of the task’s demand vector and the machine’s free resource vector
“A” works because:
1. the check for fit ensures no over-allocation
2. bigger balls get bigger scores
3. abundant resources get used first, reducing resource fragmentation
4. it can spread load across machines
(A minimal sketch of this heuristic follows below.)
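A minimal sketch of the fit check and alignment score (Python; the dict-based resource vectors and function names are illustrative, not Tetris’s actual code):

def fits(task_demand, machine_free):
    # Fit: every demanded resource must be available on the machine,
    # which is exactly the no-over-allocation check.
    return all(task_demand[r] <= machine_free.get(r, 0) for r in task_demand)

def alignment_score(task_demand, machine_free):
    # A = dot product of the task's demand vector and the machine's free vector.
    # Bigger tasks get bigger scores, and machines whose abundant resources
    # match the task's large demands are preferred, reducing fragmentation.
    if not fits(task_demand, machine_free):
        return None  # infeasible placement
    return sum(task_demand[r] * machine_free.get(r, 0) for r in task_demand)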
#2 Faster average job completion time
Challenge #2: Job completion time heuristic
Shortest Remaining Time First¹ (SRTF) schedules jobs in ascending order of their remaining time.
Q: What is the shortest “remaining time”?
“Remaining work” = remaining # tasks × tasks’ durations × tasks’ resource demands
A job completion time heuristic (sketched below):
• gives a score P to every job
• extends SRTF to incorporate multiple resources
¹ SRTF – M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS ’99]
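One way to read the “remaining work” definition above as code (a sketch; collapsing the multi-dimensional demand into a scalar by averaging is an assumption, not necessarily Tetris’s normalization):

def remaining_work(job):
    # P ~ remaining # tasks x avg. task duration x avg. task resource demand.
    # Jobs with less remaining work should be scheduled sooner (multi-resource SRTF).
    pending = [t for t in job.tasks if not t.finished]
    if not pending:
        return 0.0
    avg_duration = sum(t.est_duration for t in pending) / len(pending)
    avg_demand = sum(sum(t.demand.values()) / len(t.demand) for t in pending) / len(pending)
    return len(pending) * avg_duration * avg_demand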
Challenge #2: Combine the A and P scores!
Packing efficiency vs. completion time: A alone delays job completion time; P alone loses packing efficiency.
1: among J runnable jobs
2: score(j) = A(t, R) + P(j)
3:   where t is the task in j with max A such that demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)
(A sketch of this selection loop follows below.)
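A sketch of that selection loop, reusing the illustrative alignment_score and remaining_work helpers from the previous sketches (how the two scores are weighted and normalized is an assumption; the slide only says they are combined):

def pick_task(runnable_jobs, machine_free, eta=1.0):
    # On each heartbeat: for every runnable job take its best-aligned task that
    # fits, score the job by A + P, and pick the argmax over jobs.
    best = None
    for job in runnable_jobs:
        scored = [(alignment_score(t.demand, machine_free), t)
                  for t in job.tasks if t.runnable]
        scored = [(a, t) for a, t in scored if a is not None]
        if not scored:
            continue  # nothing from this job fits on this machine right now
        a, task = max(scored, key=lambda at: at[0])
        p = eta / (1.0 + remaining_work(job))  # smaller remaining work -> larger P
        if best is None or a + p > best[0]:
            best = (a + p, job, task)
    return best  # (score, job*, task*) or None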
#3 Achieve performance and fairness
#3 Fairness heuristic
• A says: “task i should go here to improve packing efficiency”
• P says: “schedule job j next to improve job completion time”
• Fairness says: “this set of jobs should be scheduled next”
There is typically a feasible solution that can satisfy all of them.
Performance and fairness do not mix well in general, but …
we can get “perfect fairness” and much better performance.
#3 Fairness heuristic
Fairness is not a tight constraint:
• long-term fairness, not short-term fairness
• lose a bit of fairness for a lot of gain in performance
Fairness knob, F ∈ [0, 1):
• F = 0: most efficient scheduling
• F → 1: close to perfect fairness
Heuristic: pick the best-for-performance task from among the 1−F fraction of jobs furthest from their fair share (sketched below).
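A sketch of the fairness knob (fair_share_deficit is an assumed helper measuring how far a job is below its fair share; pick_task is the illustrative helper from the earlier sketch):

def pick_task_with_fairness(runnable_jobs, machine_free, F=0.25):
    # Pick the best-for-performance task, but only from the (1 - F) fraction of
    # jobs that are furthest below their fair share.
    #   F = 0   -> all jobs are eligible (most efficient scheduling)
    #   F -> 1  -> only the most-starved jobs are eligible (close to perfect fairness)
    ranked = sorted(runnable_jobs, key=fair_share_deficit, reverse=True)
    k = max(1, int(round((1.0 - F) * len(ranked))))
    return pick_task(ranked[:k], machine_free)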
Putting it all together
We saw:
• packing efficiency
• prefer small remaining work
• fairness knob
Other things in the paper:
• estimate task demands
• deal with inaccuracies, barriers
• ingestion / evacuation
Yarn architecture, with the changes that add Tetris (shown in orange):
• Job Manager: sends multi-resource asks and a barrier hint
• Node Manager: tracks resource usage, enforces allocations, sends resource availability reports
• Cluster-wide Resource Manager: new logic to match tasks to machines (+packing, +SRTF, +fairness); exchanges asks, offers, and allocations with the job managers
Evaluation
• Pluggable scheduler in Yarn 2.4
• 250-machine cluster deployment
• Replay Bing and Facebook traces
Efficiency
Tetris vs. Capacity Scheduler: 29% better makespan, 30% better avg. job completion time
Tetris vs. DRF: 28% better makespan, 35% better avg. job completion time
Gains come from avoiding fragmentation and avoiding over-allocation.
[Figure: cluster utilization (%) over time for CPU, memory, network-in, and storage under Tetris and under the Capacity Scheduler; values above 100% indicate over-allocation, and lower values indicate higher resource fragmentation.]
Fairness
The fairness knob F quantifies the extent to which Tetris adheres to fair allocation. Gains (and slowdown of impacted jobs) as F varies:

                                   | No fairness (F = 0) | F = 0.25 | Full fairness (F → 1)
Makespan                           | 50%                 | 25%      | 10%
Avg. job completion time           | 40%                 | 35%      | 23%
Avg. slowdown [over impacted jobs] | 25%                 | 5%       | 2%
Summary: pack efficiently along multiple resources; prefer jobs with less “remaining work”; incorporate fairness.
• combine heuristics that improve packing efficiency with those that lower average job completion time
• achieving desired amounts of fairness can coexist with improving cluster performance
• implemented inside YARN; trace-driven simulations and deployment show encouraging initial results
We are working towards a Yarn check-in.
http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
Backup slides
Estimating resource demands
Peak usage demand estimates come from:
• finished tasks in the same phase
• statistics collected from recurring jobs
• the input size / location of tasks
Resource Tracker:
• reports unused resources (under-utilization)
• is aware of other cluster activities: ingestion and evacuation
Placement impacts network/disk requirements.
[Figure: used vs. free incoming network bandwidth (MBytes/sec) over time on Machine1.]
Packer Scheduler vs. DRF
Cluster: [18 cores, 36 GB memory]
Job: [task profile], # tasks — A [1 core, 2 GB], 18; B [3 cores, 1 GB], 6; C [3 cores, 1 GB], 6
Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users:
  max (qA, qB, qC)            (maximize allocations)
  qA + 3 qB + 3 qC ≤ 18       (CPU constraint)
  2 qA + qB + qC ≤ 36         (memory constraint)
  qA/18 = qB/6 = qC/6         (equalize DS)  →  DS = 1/3
DRF schedule: 6 A, 2 B, and 2 C tasks per unit time (18 cores, 16 GB used); durations A: 3t, B: 3t, C: 3t.
Packer schedule: all 18 A tasks first (18 cores, 36 GB), then 6 B, then 6 C (18 cores, 6 GB each); durations A: t, B: 2t, C: 3t — a 33% improvement in avg. job completion time. (A worked solution of the DRF program follows below.)
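A worked solution of the DRF program above, confirming the slide’s numbers:

Let q_A/18 = q_B/6 = q_C/6 = k.
CPU: 18k + 3(6k) + 3(6k) = 54k ≤ 18  ⇒  k = 1/3
⇒ q_A = 6, q_B = q_C = 2 tasks per unit time
Memory: 2(6) + 2 + 2 = 16 ≤ 36;  dominant share DS = 1/3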
Packing efficiency does not achieve everything
Machines 1, 2: [2 cores, 4 GB]. Job: [task profile], # tasks — A [2 cores, 3 GB], 6; B [1 core, 2 GB], 2.
Pack: 2 A tasks per unit time for 3t (4 cores, 6 GB used), then B’s 2 tasks (2 cores, 4 GB); durations A: 3t, B: 4t.
No pack: B’s 2 tasks first (2 cores, 4 GB), then 2 A tasks per unit time for 3t (4 cores, 6 GB); durations A: 4t, B: t — a 29% improvement in avg. job completion time.
Achieving packing efficiency does not necessarily improve job completion time.
Ingestion / evacuation
Other cluster activities produce background traffic:
• ingestion = storing incoming data for later analytics (e.g., some clusters report volumes of up to 10 TB per hour)
• evacuation = data evacuated and re-replicated before maintenance operations (e.g., rack decommission for machine re-imaging)
Resource Tracker reports are used by Tetris to avoid contention between its tasks and these activities.
Workload analysis
Alternative Packing Heuristics
Fairness vs. Efficiency
Fairness vs. Efficiency
Virtual Machine Packing != Tetris
VM packing consolidates VMs, with multi-dimensional resource requirements, onto the fewest number of servers, but it focuses on different challenges and not on task packing:
• balance load across servers
• ensure VM availability in spite of failures
• allow for quick software and hardware updates
• there is NO corresponding entity to a job, hence job completion time is inexpressible
• explicit resource requirements (e.g., a “small” VM) make VM packing simpler
Barrier knob, b ∈ [0, 1)
Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a fraction b of that stage’s tasks have finished (a sketch follows below).
• b = 1: no tasks are preferentially treated
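A minimal sketch of how the barrier knob could gate preferential offers (the stage bookkeeping names are illustrative):

def prefer_stage(stage, b=0.9):
    # Barrier knob b in [0, 1): give scheduling preference to the remaining tasks
    # of a stage that precedes a barrier once a fraction b of that stage's tasks
    # have finished. With b = 1 the condition only holds once the stage is already
    # done, i.e. no task is preferentially treated.
    if not stage.precedes_barrier:
        return False
    finished = sum(1 for t in stage.tasks if t.finished)
    return finished >= b * len(stage.tasks)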
Starvation Prevention
Could it take a long time to accommodate large tasks? But …
1. most tasks have demands within one order of magnitude of one another
2. machines report resource availability to the scheduler periodically, so the scheduler learns about all the resources freed up by tasks that finished in the preceding period at once → it can make reservations for large tasks
Cluster load vs. Tetris performance
Packing and Dependency-aware Scheduling for
Data-Parallel Clusters
Performance of cluster schedulers
We observe that:
• Cluster schedulers typically do dependency-aware scheduling OR multi-resource packing
• None of the existing solutions are close to optimal for more than 50% of the production jobs
Graphene: > 30% improvement in makespan¹ and job completion time for more than 50% of the jobs
¹ Time to finish a set of jobs
Findings from Bing traces analysis
Job structures have evolved into complex DAGs of tasks.
The median job’s DAG has depth 7 and 103 tasks.
A good cluster scheduler should be aware of dependencies.
Findings from Bing traces analysis
Applications have (very) diverse resource needs across CPU, memory, network, and disk:
• high coefficient of variation (~1) for many resources
• demands for resources are weakly correlated
This matters because there is no single bottleneck resource:
• multiple resources become tight
• there is enough cross-rack network bandwidth to use all CPU cores
A good cluster scheduler should pack resources.
Why so bad
Production schedulers DON’T pack tasks AND consider dependencies — they do one OR the other.
Dependency-aware OR Packing
Schedulers that consider the DAG structure — Critical Path Scheduling (CPSched), Breadth First Search (BFS):
• do not account for tasks’ resource demands, or assume tasks have homogeneous demands
• any scheduler that is not packing is up to n × OPTIMAL (n = number of tasks)
Schedulers that pack — Tetris:
• handle tasks with multiple resource requirements
• ignore dependencies and take local greedy choices
• any scheduler that ignores dependencies is up to d × OPTIMAL (d = number of resource dimensions)
Where does the “work” lie in a DAG?
“Work” = the stages in a DAG where the most resources × time is spent.
For large DAGs that are neither a bunch of unrelated stages nor a chain of stages:
• > 40% of the DAGs have most of the “work” on the critical path, where CPSched performs well
• > 30% of the DAGs have most of the “work” laid out such that packers perform well
• for ~50% of the DAGs, neither packers nor criticality-based schedulers may perform well
Pack tasks along multiple resources while considering task dependencies
• State-of-the-art techniques are suboptimal
• Key ideas in Graphene
• Conclusion
State-of-the-art scheduling techniques are suboptimal: CPSched / Tetris can be 3× Optimal
Example DAG (task: duration {rsrc.1, rsrc.2}; total capacity in any dimension = 1):
t0: 1 {.7, .31}    t1: .01 {.95, .01}   t2: .01 {.1, .7}
t3: .96 {.2, .68}  t4: .98 {.1, .01}    t5: .01 {.01, .01}
[Figure: Gantt charts for CPSched (time ≈ 3T), Tetris (time ≈ 3T), and Optimal (time ≈ T) on this DAG.]
Key insights: t0, t2, t5 are troublesome tasks — schedule them as soon as possible.
#1 Schedule construction: identify troublesome tasks and place them accordingly on a virtual resource–time space.
Schedule Construction
• Identify tasks that can lead to a poor schedule (troublesome tasks), T:
  – more likely to be on the critical path
  – more difficult to pack
• Break the other tasks into P, C, O sets based on their relationship with the tasks in T
• Place tasks in T on a virtual time space; overlay the others to fill any resultant holes in this space
[Figure: the troublesome tasks T anchor the resource–time space; parents P go before them, children C after, and other tasks O fill the remaining holes.]
Nearly optimal for over three quarters of our analyzed production DAGs. (A rough sketch of this construction follows below.)
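A rough sketch of the construction (the helpers is_troublesome, VirtualSpace, and the placement calls are simplified stand-ins for Graphene’s actual rules):

def build_schedule(dag, capacity):
    # 1) Pick troublesome tasks T (likely on the critical path / hard to pack).
    T = {t for t in dag.tasks if is_troublesome(t, dag)}
    # 2) Split the remaining tasks by their relationship to T.
    P = {t for t in dag.tasks - T if dag.has_path(t, T)}  # ancestors of T
    C = {t for t in dag.tasks - T if dag.has_path(T, t)}  # descendants of T
    O = dag.tasks - T - P - C
    # 3) Place T first on a virtual resource-time space, then overlay the rest
    #    so that capacity is respected in every dimension.
    space = VirtualSpace(capacity)
    space.place_compactly(T)          # troublesome tasks anchor the plan
    space.place_before(P, anchor=T)   # parents must finish before T starts
    space.place_after(C, anchor=T)    # children start after T finishes
    space.fill_holes(O)               # everything else fills the remaining gaps
    return space.preference_order()   # handed to the online component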
#2 Online component: enforces the desired schedule of the various DAGs.
Runtime component
[Figure: each DAG’s Schedule Construction produces a preference order; the Resource Manager merges these schedules and, on each node heartbeat, makes task assignments.]
Online scheduling:
• Job completion time: prefer jobs with less remaining work; enforce the priority ordering
• Makespan: local placement, multi-resource packing, judicious overbooking of malleable resources
• Being fair: deficit counters to bound unfairness; enables implementation of different fairness schemes
(A small sketch of the deficit-counter idea follows below.)
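A small sketch of the deficit-counter idea used to bound unfairness (the counter update and threshold are assumptions; Graphene’s actual bookkeeping may differ):

def next_job(jobs, fair_share, allocated, max_deficit):
    # Track how far each job has fallen behind its fair share; if any job's
    # deficit exceeds the bound, serve it next, otherwise follow the
    # performance-oriented preference (less remaining work first).
    deficit = {j: fair_share[j] - allocated[j] for j in jobs}
    starved = [j for j in jobs if deficit[j] > max_deficit]
    if starved:
        return max(starved, key=lambda j: deficit[j])  # repay the largest debt
    return min(jobs, key=lambda j: j.remaining_work)   # prefer less remaining work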
Evaluation
• Implemented in Yarn and Tez
• 250-machine cluster deployment
• Replay Bing traces and TPC-DS / TPC-H workloads
Graphene vs.    | Makespan | Avg. job completion time
Tetris          | 29%      | 27%
Critical Path   | 31%      | 33%
BFS             | 23%      | 24%
Gains come from a view of the entire DAG and placing the troublesome tasks first.
Efficiency: a more compact schedule, better packing, overbooking.
Graphene:
• combines various mechanisms to improve packing efficiency and to consider task dependencies
• constructs a good schedule by placing tasks on a virtual resource–time space
• online heuristics softly enforce the desired schedules
• implemented inside YARN and Tez; trace-driven simulations and deployment show encouraging initial results
Graphene vs.    | Makespan | Avg. job completion time
Tetris          | 29%      | 27%
Critical Path   | 31%      | 33%
BFS             | 23%      | 24%
Gains come from a view of the entire DAG and placing the troublesome tasks first.
Efficiency: a more compact schedule, better packing, overbooking.
[Chart: running tasks over time, Graphene vs. BFS]