GoodFit: Multi-Resource Packing of Tasks with Dependencies
Cluster Scheduling for Jobs
Jobs (tasks + dependencies) — e.g., BigData (Hive, SCOPE, Spark), CloudBuild
Machines, file-system, network
Cluster Scheduler: matches tasks to resources
Goals
• High cluster utilization
• Fast job completion time
• Predictable perf. / fairness
Compared to VM placement:
• Need not keep resource “buffers”
• More dynamic than VM placement (tasks last seconds)
• Aggregate properties are important (e.g., all tasks in a job should finish)
Need careful multi-resource planning
Problem 1
• Fragmentation — current schedulers vs. a packer scheduler: 2 tasks/T → 3 tasks/T (+50%)
• Over-allocation of net/disk — current schedulers vs. a packer scheduler: 2 tasks/2T → 2 tasks/T (+100%)
… worse with dependencies
Problem 2
[Figure: example DAG, each node labeled {duration, resource demand} — e.g. Tt, (1/n)·r; t, r; t, (1−r); (T−2)t, (1/n)·r; (T−4)t, (1/n)·r; ~Tt, (1/n)·r.
Critical-path scheduling of this DAG uses ~nT·t resource·time, while the best schedule uses only ~T·t.]
Critical path scheduling is n times off since it ignores resource demands
Packers can be d times off since they ignore future work [d resources]
Typical job scheduler infrastructure
+ packing, + bounded unfairness, + merge schedules, + overbook
[Figure: per-DAG application masters (AM), each with a Schedule Constructor, talk to a central resource manager (RM); node managers (NM) send node heartbeats and receive task assignments.]
Main ideas in multi-resource packing
Task packing ~ multi-dimensional bin packing, but:
• a very hard problem (“APX-hard”)
• available heuristics do not directly apply [task demands change with placement]
A packing heuristic:
• Fit — task’s resource demand vector D ≤ machine’s free resource vector R
• Alignment score (A) = D · R
A job completion time heuristic:
• shortest remaining work, P = remaining # tasks × tasks’ avg. duration × tasks’ avg. resource demand
Trade-offs — packing efficiency vs. job completion time vs. fairness:
• packing alone delays job completion; completion time alone loses packing efficiency; fairness alone loses both
We show that: {best “perf” | bounded unfairness} ~ best “perf”
Main ideas in packing dependent tasks
1. Identify troublesome tasks (the “meat”) and place them first
2. Systematically place other tasks without deadlocks
3. At runtime, use a precedence order from the computed schedule + heuristics to (a) overbook, (b) apply the packing/completion-time/fairness heuristics from the previous slide
4. Better lower bounds for DAG completion time
[Figure: resource–time space with the meat (M) placed between meat-begin and meat-end; parent tasks (P) before it, children (C) after, other tasks (O) fill the holes.]
Results - 1
[Chart: DAG completion times for Packing vs. Packing + Deps. vs. a lower bound; 20K DAGs from Cosmos]
Results - 2
[Chart: Tez + Packing vs. Tez + Pack + Deps.; 200 jobs from TPC-DS on a 200-server cluster]
Bundling
Temporal relaxation of fairness
Map (disk) vs. Reduce (netw.): fair share among two identical jobs
[Figure: under instantaneous fairness both jobs run at 50% of the disk (map) and the network (reduce) and finish by 4T; relaxing fairness lets each phase run at 100%, finishing the jobs by 2T and 3T.]
Problem: instantaneous fairness can be up to d× worse on makespan (d resources)
1) Temporal relaxation of fairness
a job will finish within (1 + f)× the time it takes given its strict share
2) Optimal trade-off with performance
(1 + f)× fairness costs (2 + 2f − 2√(f + f²))× on makespan
3) A simple (offline) algorithm that achieves the above trade-off
Fairness slack f   | Perf. loss vs. best
0 (perfectly fair) | 2×
1 (<2× longer)     | 1.1×
2 (<3× longer)     | 1.07×
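As a sanity check on the trade-off formula above (the square-root term is an assumption reconstructed from the garbled slide text and the table), evaluating the makespan cost at the three slack values gives numbers close to the table:

cost(f) = 2 + 2f - 2\sqrt{f + f^2}
cost(0) = 2,   cost(1) = 4 - 2\sqrt{2} ≈ 1.17,   cost(2) = 6 - 2\sqrt{6} ≈ 1.10

in rough agreement with the 2×, 1.1×, and 1.07× entries listed above.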
Bare metal → VM Allocation → Data-parallel Jobs (Job = tasks + dependencies)
• VM Allocation — e.g., HDInsight, AzureBatch
• BigData (Yarn, Cosmos, Spark) — ~100K servers (40K at Yahoo), >50K servers, >2EB stored, >6K devs
• CloudBuild — 3500 servers, 3500 users, >20M targets/day
• Tasks are short-lived (10s of seconds)
• Tasks have peculiarly shaped demands
• Composites are important (a job needs all of its tasks to finish)
• OK to kill and restart tasks
• Locality matters
1) Job scheduling has specific aspects; 2) better scheduling will speed up the average job (and reduce resource cost); 3) research + practice
Resource aware scheduling improves SLOs and Return/$
Cluster Scheduling for Jobs
Jobs (tasks + dependencies) — e.g., HDInsight, AzureBatch; BigData (Hive, SCOPE, Spark); CloudBuild
Machines, file-system, network
Cluster Scheduler: matches tasks to resources
Goals
• High cluster utilization
• Fast job completion time
• Predictable perf. / fairness
• Efficient (milliseconds…)
Need careful multi-resource planning
Problem 1
• Fragmentation — current schedulers vs. a packer scheduler: 2 tasks/T → 3 tasks/T (+50%)
• Over-allocation of net/disk — current schedulers vs. a packer scheduler: 2 tasks/2T → 2 tasks/T (+100%)
… worse with dependencies
Problem 2
[Figure: example DAG, each node labeled {duration, resource demand} — e.g. Tt, (1/n)·r; t, r; t, (1−r); (T−2)t, (1/n)·r; (T−4)t, (1/n)·r; ~Tt, (1/n)·r.
Critical-path scheduling of this DAG uses ~nT·t resource·time, while the best schedule uses only ~T·t.]
Critical path scheduling is n times off since it ignores resource demands
Packers can be d times off since they ignore future work [d resources]
Typical job scheduler infrastructure
+ packing, + bounded unfairness, + merge schedules, + overbook
[Figure: per-DAG application masters (AM), each with a Schedule Constructor, talk to a central resource manager (RM); node managers (NM) send node heartbeats and receive task assignments.]
Main ideas in packing dependent tasks
1. Identify troublesome tasks (T) and place them first
2. Systematically place other tasks without dead-ends
3. At runtime, enforce the computed schedule + heuristics to (a) overbook, (b) apply the packing/completion-time/fairness heuristics from the previous slide
4. Better lower bounds for DAG completion time
[Figure: resource–time space with the troublesome tasks (T) placed between trouble-begin and trouble-end; parents (P) before them, children (C) after, other tasks (O) fill the holes.]
Results - 1
[Chart: DAG completion times for Packing vs. Packing + Deps. vs. a lower bound (annotated ~2X and ~1.5X); 20K DAGs from Cosmos]
Results - 2
[Chart: Tez + Packing vs. Tez + Pack + Deps.; 200 jobs from TPC-DS on a 200-server cluster]
Multi-Resource Packing for Cluster Schedulers
Performance of cluster schedulers
We observe that:
• Resources are fragmented, i.e. machines are running below capacity
• Even at 100% usage, goodput is much smaller due to over-allocation
• Even Pareto-efficient multi-resource fair schemes result in much lower performance
Tetris: up to 40% improvement in makespan¹ and job completion time with near-perfect fairness
¹ Time to finish a set of jobs
Findings from Bing and Facebook traces analysis
Diversity in multi-resource requirements:
• Tasks need varying amounts of each resource
• Demands for resources are weakly correlated
This matters because there is no single bottleneck resource:
• Multiple resources become tight
• Enough cross-rack network bandwidth to use all CPU cores
Upper bounding potential gains:
• reduce makespan¹ by up to 49%
• reduce avg. job completion time by up to 46%
Why so bad #1
Production schedulers neither pack tasks nor consider all their relevant resource demands:
#1 Resource Fragmentation
#2 Over-allocation
Resource Fragmentation (RF)
[Figure: Machines A and B each have 4 GB memory; tasks T1: 2 GB, T2: 2 GB, T3: 4 GB.
Current schedulers allocate resources in terms of slots, so free resources are left that cannot be assigned to tasks (T3 must wait): avg. task completion time = 1.33t.
A “packer” scheduler places T1 and T2 together and T3 on the other machine: avg. task completion time = 1t.]
RF increases with the number of resources being allocated!
Over-Allocation
[Figure: Machine A has 4 GB memory and 20 MB/s network. Tasks T1 and T2 each need 2 GB memory + 20 MB/s network; T3 needs 2 GB memory only.
Current schedulers: not all of a task’s resource demands are explicitly allocated — disk and network are over-allocated, so T1 and T2 run together and contend for the network: avg. task completion time = 2.33t.
A “packer” scheduler runs T1 with T3, then T2: avg. task completion time = 1.33t.]
Why so bad #2
Work conserving != no fragmentation, no over-allocation
Multi-resource fairness schemes do not help either:
• They treat the cluster as a big bag of resources — hides the impact of resource fragmentation
• They assume a job has a fixed resource profile — but different tasks in the same job have different demands
• The schedule impacts a job’s current resource profile; one can schedule to create complementary profiles
Pareto¹-efficient != performant — packer scheduler vs. DRF: avg. job completion time 50% better, makespan 33% better
¹ no job can increase its share without decreasing the share of another
Competing objectives: job completion time vs. cluster efficiency vs. fairness
Current schedulers:
1. Resource fragmentation
2. Over-allocation
3. Fair allocations sacrifice performance
#1 Pack tasks along multiple resources to improve cluster efficiency and reduce makespan
Theory vs. Practice
Multi-resource packing of tasks is similar to multi-dimensional bin packing (balls = tasks; bins = machine × time), which is APX-hard¹.
Existing heuristics do not directly apply here:
• they assume balls of a fixed size — in practice task demands vary with time / machine placed and are elastic
• they assume balls are known a priori — in practice the scheduler must cope with online arrival of jobs, dependencies, and cluster activity
Avoiding fragmentation looks like tight bin packing: reducing the # of bins used → reduced makespan.
¹ APX-hard is a strict subset of NP-hard
#1 Packing heuristic
• Fit: the task’s resource demand vector ≤ the machine’s free resource vector
• Alignment score (A): dot product of the task’s demand vector and the machine’s free resource vector
“A” works because:
1. the check for fit ensures no over-allocation
2. bigger balls get bigger scores
3. abundant resources get used first, reducing resource fragmentation
4. it can spread load across machines
(A minimal sketch of this heuristic follows below.)
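A minimal sketch of the fit check and alignment score (Python; the dict-based resource vectors and function names are illustrative, not Tetris’s actual code):

def fits(task_demand, machine_free):
    # Fit: every demanded resource must be available on the machine,
    # which is exactly the no-over-allocation check.
    return all(task_demand[r] <= machine_free.get(r, 0) for r in task_demand)

def alignment_score(task_demand, machine_free):
    # A = dot product of the task's demand vector and the machine's free vector.
    # Bigger tasks get bigger scores, and machines whose abundant resources
    # match the task's large demands are preferred, reducing fragmentation.
    if not fits(task_demand, machine_free):
        return None  # infeasible placement
    return sum(task_demand[r] * machine_free.get(r, 0) for r in task_demand)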
#2 Faster average job completion time
Challenge #2: Job completion time heuristic
Shortest Remaining Time First¹ (SRTF) schedules jobs in ascending order of their remaining time.
Q: What is the shortest “remaining time”?
“Remaining work” = remaining # tasks × tasks’ durations × tasks’ resource demands
A job completion time heuristic (sketched below):
• gives a score P to every job
• extends SRTF to incorporate multiple resources
¹ SRTF – M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS ’99]
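One way to read the “remaining work” definition above as code (a sketch; collapsing the multi-dimensional demand into a scalar by averaging is an assumption, not necessarily Tetris’s normalization):

def remaining_work(job):
    # P ~ remaining # tasks x avg. task duration x avg. task resource demand.
    # Jobs with less remaining work should be scheduled sooner (multi-resource SRTF).
    pending = [t for t in job.tasks if not t.finished]
    if not pending:
        return 0.0
    avg_duration = sum(t.est_duration for t in pending) / len(pending)
    avg_demand = sum(sum(t.demand.values()) / len(t.demand) for t in pending) / len(pending)
    return len(pending) * avg_duration * avg_demand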
Challenge #2: Combine the A and P scores!
Packing efficiency vs. completion time: A alone delays job completion time; P alone loses packing efficiency.
1: among J runnable jobs
2: score(j) = A(t, R) + P(j)
3:   where t is the task in j with max A such that demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)
(A sketch of this selection loop follows below.)
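A sketch of that selection loop, reusing the illustrative alignment_score and remaining_work helpers from the previous sketches (how the two scores are weighted and normalized is an assumption; the slide only says they are combined):

def pick_task(runnable_jobs, machine_free, eta=1.0):
    # On each heartbeat: for every runnable job take its best-aligned task that
    # fits, score the job by A + P, and pick the argmax over jobs.
    best = None
    for job in runnable_jobs:
        scored = [(alignment_score(t.demand, machine_free), t)
                  for t in job.tasks if t.runnable]
        scored = [(a, t) for a, t in scored if a is not None]
        if not scored:
            continue  # nothing from this job fits on this machine right now
        a, task = max(scored, key=lambda at: at[0])
        p = eta / (1.0 + remaining_work(job))  # smaller remaining work -> larger P
        if best is None or a + p > best[0]:
            best = (a + p, job, task)
    return best  # (score, job*, task*) or None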
#3 Achieve performance and fairness
#3 Fairness heuristic
• A says: “task i should go here to improve packing efficiency”
• P says: “schedule job j next to improve job completion time”
• Fairness says: “this set of jobs should be scheduled next”
There is typically a feasible solution that can satisfy all of them.
Performance and fairness do not mix well in general, but …
we can get “perfect fairness” and much better performance.
#3 Fairness heuristic
Fairness is not a tight constraint:
• long-term fairness, not short-term fairness
• lose a bit of fairness for a lot of gain in performance
Fairness knob, F ∈ [0, 1):
• F = 0: most efficient scheduling
• F → 1: close to perfect fairness
Heuristic: pick the best-for-performance task from among the 1−F fraction of jobs furthest from their fair share (sketched below).
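A sketch of the fairness knob (fair_share_deficit is an assumed helper measuring how far a job is below its fair share; pick_task is the illustrative helper from the earlier sketch):

def pick_task_with_fairness(runnable_jobs, machine_free, F=0.25):
    # Pick the best-for-performance task, but only from the (1 - F) fraction of
    # jobs that are furthest below their fair share.
    #   F = 0   -> all jobs are eligible (most efficient scheduling)
    #   F -> 1  -> only the most-starved jobs are eligible (close to perfect fairness)
    ranked = sorted(runnable_jobs, key=fair_share_deficit, reverse=True)
    k = max(1, int(round((1.0 - F) * len(ranked))))
    return pick_task(ranked[:k], machine_free)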
Putting it all together
We saw:
• packing efficiency
• prefer small remaining work
• fairness knob
Other things in the paper:
• estimate task demands
• deal with inaccuracies, barriers
• ingestion / evacuation
Yarn architecture, with the changes that add Tetris (shown in orange):
• Job Manager: sends multi-resource asks and a barrier hint
• Node Manager: tracks resource usage, enforces allocations, sends resource availability reports
• Cluster-wide Resource Manager: new logic to match tasks to machines (+packing, +SRTF, +fairness); exchanges asks, offers, and allocations with the job managers
Evaluation
• Pluggable scheduler in Yarn 2.4
• 250-machine cluster deployment
• Replay Bing and Facebook traces
Efficiency
Tetris vs. Capacity Scheduler: 29% better makespan, 30% better avg. job completion time
Tetris vs. DRF: 28% better makespan, 35% better avg. job completion time
Gains come from avoiding fragmentation and avoiding over-allocation.
[Figure: cluster utilization (%) over time for CPU, memory, network-in, and storage under Tetris and under the Capacity Scheduler; values above 100% indicate over-allocation, and lower values indicate higher resource fragmentation.]
Fairness
The fairness knob F quantifies the extent to which Tetris adheres to fair allocation. Gains (and slowdown of impacted jobs) as F varies:

                                   | No fairness (F = 0) | F = 0.25 | Full fairness (F → 1)
Makespan                           | 50%                 | 25%      | 10%
Avg. job completion time           | 40%                 | 35%      | 23%
Avg. slowdown [over impacted jobs] | 25%                 | 5%       | 2%
Summary: pack efficiently along multiple resources; prefer jobs with less “remaining work”; incorporate fairness.
• combine heuristics that improve packing efficiency with those that lower average job completion time
• achieving desired amounts of fairness can coexist with improving cluster performance
• implemented inside YARN; trace-driven simulations and deployment show encouraging initial results
We are working towards a Yarn check-in.
http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
Backup slides
Estimating resource demands
Peak usage demand estimates come from:
• finished tasks in the same phase
• statistics collected from recurring jobs
• the input size / location of tasks
Resource Tracker:
• reports unused resources (under-utilization)
• is aware of other cluster activities: ingestion and evacuation
Placement impacts network/disk requirements.
[Figure: used vs. free incoming network bandwidth (MBytes/sec) over time on Machine1.]
Packer Scheduler vs. DRF
Cluster: [18 cores, 36 GB memory]
Job: [task profile], # tasks — A [1 core, 2 GB], 18; B [3 cores, 1 GB], 6; C [3 cores, 1 GB], 6
Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users:
  max (qA, qB, qC)            (maximize allocations)
  qA + 3 qB + 3 qC ≤ 18       (CPU constraint)
  2 qA + qB + qC ≤ 36         (memory constraint)
  qA/18 = qB/6 = qC/6         (equalize DS)  →  DS = 1/3
DRF schedule: 6 A, 2 B, and 2 C tasks per unit time (18 cores, 16 GB used); durations A: 3t, B: 3t, C: 3t.
Packer schedule: all 18 A tasks first (18 cores, 36 GB), then 6 B, then 6 C (18 cores, 6 GB each); durations A: t, B: 2t, C: 3t — a 33% improvement in avg. job completion time. (A worked solution of the DRF program follows below.)
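A worked solution of the DRF program above, confirming the slide’s numbers:

Let q_A/18 = q_B/6 = q_C/6 = k.
CPU: 18k + 3(6k) + 3(6k) = 54k ≤ 18  ⇒  k = 1/3
⇒ q_A = 6, q_B = q_C = 2 tasks per unit time
Memory: 2(6) + 2 + 2 = 16 ≤ 36;  dominant share DS = 1/3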
Packing efficiency does not achieve everything
Machines 1, 2: [2 cores, 4 GB]. Job: [task profile], # tasks — A [2 cores, 3 GB], 6; B [1 core, 2 GB], 2.
Pack: 2 A tasks per unit time for 3t (4 cores, 6 GB used), then B’s 2 tasks (2 cores, 4 GB); durations A: 3t, B: 4t.
No pack: B’s 2 tasks first (2 cores, 4 GB), then 2 A tasks per unit time for 3t (4 cores, 6 GB); durations A: 4t, B: t — a 29% improvement in avg. job completion time.
Achieving packing efficiency does not necessarily improve job completion time.
Ingestion / evacuation
Other cluster activities produce background traffic:
• ingestion = storing incoming data for later analytics (e.g., some clusters report volumes of up to 10 TB per hour)
• evacuation = data evacuated and re-replicated before maintenance operations (e.g., rack decommission for machine re-imaging)
Resource Tracker reports are used by Tetris to avoid contention between its tasks and these activities.
Workload analysis
Alternative Packing Heuristics
Fairness vs. Efficiency
Fairness vs. Efficiency
Virtual Machine Packing != Tetris
VM packing consolidates VMs, with multi-dimensional resource requirements, onto the fewest number of servers, but it focuses on different challenges and not on task packing:
• balance load across servers
• ensure VM availability in spite of failures
• allow for quick software and hardware updates
• there is NO corresponding entity to a job, hence job completion time is inexpressible
• explicit resource requirements (e.g., a “small” VM) make VM packing simpler
Barrier knob, b ∈ [0, 1)
Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a fraction b of that stage’s tasks have finished (a sketch follows below).
• b = 1: no tasks are preferentially treated
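A minimal sketch of how the barrier knob could gate preferential offers (the stage bookkeeping names are illustrative):

def prefer_stage(stage, b=0.9):
    # Barrier knob b in [0, 1): give scheduling preference to the remaining tasks
    # of a stage that precedes a barrier once a fraction b of that stage's tasks
    # have finished. With b = 1 the condition only holds once the stage is already
    # done, i.e. no task is preferentially treated.
    if not stage.precedes_barrier:
        return False
    finished = sum(1 for t in stage.tasks if t.finished)
    return finished >= b * len(stage.tasks)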
Starvation Prevention
Could it take a long time to accommodate large tasks? But …
1. most tasks have demands within one order of magnitude of one another
2. machines report resource availability to the scheduler periodically, so the scheduler learns about all the resources freed up by tasks that finished in the preceding period at once → it can make reservations for large tasks
Cluster load vs. Tetris performance
Packing and Dependency-aware Scheduling for
Data-Parallel Clusters
Performance of cluster schedulers
We observe that:
• Cluster schedulers typically do dependency-aware scheduling OR multi-resource packing
• None of the existing solutions are close to optimal for more than 50% of the production jobs
Graphene: > 30% improvement in makespan¹ and job completion time for more than 50% of the jobs
¹ Time to finish a set of jobs
Findings from Bing traces analysis
Job structures have evolved into complex DAGs of tasks.
The median job’s DAG has depth 7 and 103 tasks.
A good cluster scheduler should be aware of dependencies.
Findings from Bing traces analysis
Applications have (very) diverse resource needs across CPU, memory, network, and disk:
• high coefficient of variation (~1) for many resources
• demands for resources are weakly correlated
This matters because there is no single bottleneck resource:
• multiple resources become tight
• there is enough cross-rack network bandwidth to use all CPU cores
A good cluster scheduler should pack resources.
Why so bad
Production schedulers DON’T pack tasks AND consider dependencies — they do one OR the other.
Dependency-aware OR Packing
Schedulers that consider the DAG structure — Critical Path Scheduling (CPSched), Breadth First Search (BFS):
• do not account for tasks’ resource demands, or assume tasks have homogeneous demands
• any scheduler that is not packing is up to n × OPTIMAL (n = number of tasks)
Schedulers that pack — Tetris:
• handle tasks with multiple resource requirements
• ignore dependencies and take local greedy choices
• any scheduler that ignores dependencies is up to d × OPTIMAL (d = number of resource dimensions)
Where does the “work” lie in a DAG?
“Work” = the stages in a DAG where the most resources × time is spent.
For large DAGs that are neither a bunch of unrelated stages nor a chain of stages:
• > 40% of the DAGs have most of the “work” on the critical path, where CPSched performs well
• > 30% of the DAGs have most of the “work” laid out such that packers perform well
• for ~50% of the DAGs, neither packers nor criticality-based schedulers may perform well
Pack tasks along multiple resources while considering task dependencies
• State-of-the-art techniques are suboptimal
• Key ideas in Graphene
• Conclusion
State-of-the-art scheduling techniques are suboptimal: CPSched / Tetris can be 3× Optimal
Example DAG (task: duration {rsrc.1, rsrc.2}; total capacity in any dimension = 1):
t0: 1 {.7, .31}    t1: .01 {.95, .01}   t2: .01 {.1, .7}
t3: .96 {.2, .68}  t4: .98 {.1, .01}    t5: .01 {.01, .01}
[Figure: Gantt charts for CPSched (time ≈ 3T), Tetris (time ≈ 3T), and Optimal (time ≈ T) on this DAG.]
Key insights: t0, t2, t5 are troublesome tasks — schedule them as soon as possible.
#1 Schedule construction: identify troublesome tasks and place them accordingly on a virtual resource–time space.
Schedule Construction
• Identify tasks that can lead to a poor schedule (troublesome tasks), T:
  – more likely to be on the critical path
  – more difficult to pack
• Break the other tasks into P, C, O sets based on their relationship with the tasks in T
• Place tasks in T on a virtual time space; overlay the others to fill any resultant holes in this space
[Figure: the troublesome tasks T anchor the resource–time space; parents P go before them, children C after, and other tasks O fill the remaining holes.]
Nearly optimal for over three quarters of our analyzed production DAGs. (A rough sketch of this construction follows below.)
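A rough sketch of the construction (the helpers is_troublesome, VirtualSpace, and the placement calls are simplified stand-ins for Graphene’s actual rules):

def build_schedule(dag, capacity):
    # 1) Pick troublesome tasks T (likely on the critical path / hard to pack).
    T = {t for t in dag.tasks if is_troublesome(t, dag)}
    # 2) Split the remaining tasks by their relationship to T.
    P = {t for t in dag.tasks - T if dag.has_path(t, T)}  # ancestors of T
    C = {t for t in dag.tasks - T if dag.has_path(T, t)}  # descendants of T
    O = dag.tasks - T - P - C
    # 3) Place T first on a virtual resource-time space, then overlay the rest
    #    so that capacity is respected in every dimension.
    space = VirtualSpace(capacity)
    space.place_compactly(T)          # troublesome tasks anchor the plan
    space.place_before(P, anchor=T)   # parents must finish before T starts
    space.place_after(C, anchor=T)    # children start after T finishes
    space.fill_holes(O)               # everything else fills the remaining gaps
    return space.preference_order()   # handed to the online component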
#2 Online component: enforces the desired schedule of the various DAGs.
Runtime component
[Figure: each DAG’s Schedule Construction produces a preference order; the Resource Manager merges these schedules and, on each node heartbeat, makes task assignments.]
Online scheduling:
• Job completion time: prefer jobs with less remaining work; enforce the priority ordering
• Makespan: local placement, multi-resource packing, judicious overbooking of malleable resources
• Being fair: deficit counters to bound unfairness; enables implementation of different fairness schemes
(A small sketch of the deficit-counter idea follows below.)
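A small sketch of the deficit-counter idea used to bound unfairness (the counter update and threshold are assumptions; Graphene’s actual bookkeeping may differ):

def next_job(jobs, fair_share, allocated, max_deficit):
    # Track how far each job has fallen behind its fair share; if any job's
    # deficit exceeds the bound, serve it next, otherwise follow the
    # performance-oriented preference (less remaining work first).
    deficit = {j: fair_share[j] - allocated[j] for j in jobs}
    starved = [j for j in jobs if deficit[j] > max_deficit]
    if starved:
        return max(starved, key=lambda j: deficit[j])  # repay the largest debt
    return min(jobs, key=lambda j: j.remaining_work)   # prefer less remaining work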
Evaluation
• Implemented in Yarn and Tez
• 250-machine cluster deployment
• Replay Bing traces and TPC-DS / TPC-H workloads
Graphene vs.    | Makespan | Avg. job completion time
Tetris          | 29%      | 27%
Critical Path   | 31%      | 33%
BFS             | 23%      | 24%
Gains come from a view of the entire DAG and placing the troublesome tasks first.
Efficiency: a more compact schedule, better packing, overbooking.
Graphene:
• combines various mechanisms to improve packing efficiency and to consider task dependencies
• constructs a good schedule by placing tasks on a virtual resource–time space
• online heuristics softly enforce the desired schedules
• implemented inside YARN and Tez; trace-driven simulations and deployment show encouraging initial results
Graphene vs.    | Makespan | Avg. job completion time
Tetris          | 29%      | 27%
Critical Path   | 31%      | 33%
BFS             | 23%      | 24%
Gains come from a view of the entire DAG and placing the troublesome tasks first.
Efficiency: a more compact schedule, better packing, overbooking.
[Chart: running tasks over time, Graphene vs. BFS]