CPU Scheduling for
Virtual Desktop Infrastructure
PhD Defense
Hwanju Kim
2012-11-16
Virtual Desktop Infrastructure (VDI)
• Desktop provisioning
Dedicated workstations:
- Energy wastage by idle desktops
- Resource underutilization
- High management cost
- High maintenance cost
- Low level of security
VM-based shared environments:
+ Energy savings by consolidation
+ High resource utilization
+ Low management cost (flexible HW/SW provisioning)
+ Low maintenance cost (dynamic HW/SW upgrade)
+ High level of security (centralized data containment)
2/35
Desktop Consolidation
• Distinctive workload characteristics
• High consolidation ratio
• 4:1~15:1 [VMware VDI], 6~8 per core [Botelho’08]
• Diverse user-dependent workloads
• Light users and knowledgeable workers coexist
• Multi-layer mixed workloads
• Multi-tasking (interactive+background) in a consolidated VM
(Diagram: VMs consolidated on a Virtual Machine Monitor (VMM) over shared hardware, hosting mixed, interactive, CPU-intensive, and parallel workloads)
3/35
Challenges on CPU Scheduling
• Challenges due to the primary principles of
VMM, compared to OS scheduling research
(Diagram: tasks are scheduled by per-VM OS schedulers onto vCPUs, which the VMM scheduler in turn maps onto pCPUs)
1. Semantic gap (from OS independence): two independent scheduling layers
2. Scarce information (from a small TCB): difficulty in extracting workload characteristics
   • VMM-visible: I/O operations, privileged instructions
   • OS-internal, invisible to the VMM: process and thread information, inter-process communications, I/O operations and their semantics, system calls, etc.
3. Inter-VM fairness (from performance isolation): favoring a VM must not compromise inter-VM fairness
Each VM is virtualized as a black box ("I believe I'm on a dedicated machine")
Goals in tension: lightweightness (no cross-layer optimization) and efficiency (an intelligent VMM)
4/35
The Goals of This Thesis
• The enlightened CPU scheduling of VMM for
consolidated desktops
• Efficient CPU management with lightweight VMM
extensions
(Diagram: the VMM scheduler manages the vCPUs of VMs hosting interactive, background, and communicating workloads)
Enlightening the VMM about the diverse workload demands inside each VM
Base: CPU bandwidth partitioning for performance isolation
Design principles
1. OS-independence: VMM-level solutions without OS-dependent optimizations
2. Diversity: Identifying the computing demands of diverse workloads (including mixed workloads)
3. Inter-VM fairness: Performance isolation for multi-tenant environments
5/35
Related Work
Design principles per proposal: OS-independence / Diversity / Inter-VM fairness
• Proportional-share scheduling (Xen, KVM, VMware ESX): O / X / O
• Interactive & soft real-time scheduling ([Lin et al., SC’05], [Lee et al., VEE’10], [Masrur et al., RTCSA’10]): O / X (user-directed; no mixed & communicating workloads) / X
• OS-assisted scheduling ([Kim et al., EuroPar’08], [Xia et al., ICPADS’09]): X (OS-dependent optimization) / X (no communicating workloads) / O
• I/O-friendly scheduling ([Govindan et al., VEE’07], [Ongaro et al., VEE’08], [Liao et al., ANCS’08], [Hu et al., HPDC’10]): O / X (only I/O-intensive workloads) / O
• Multiprocessor VM scheduling
  • Relaxed coscheduling ([VMware ESXi’10], [Sukwong et al., EuroSys’11]): O / X (no mixed workloads) / O
  • Spinlock-aware scheduling ([Uhlig et al., VM’04], [Weng et al., HPDC’11]): X (OS-dependent optimization) / X (only spinlock-intensive workloads) / O
  • Hybrid scheduling ([Weng et al., VEE’09]): O / X (user-involved; no mixed workloads) / O
Overview
• Introduction to “Task-aware VM scheduling” [Kim et al., VEE’09], [Kim et al., JPDC’11]
  + The first solution to mixed workloads in a consolidated VM
  + Simple and effective for I/O-bound interactive workloads
  - No consideration of multiprocessor VMs
  - Lacking ability to support modern interactive workloads
• Proposal for multiprocessor VM scheduling
  → Efficient scheduling for multithreaded workloads hosted on multiprocessor VMs
  • Defense: “Demand-based coordinated scheduling” for multithreaded (communicating or parallel) workloads, and “Virtual asymmetric multiprocessor” for user-interactive workloads mixed with background workloads
  • Implementation extension: task-based priority boosting
7/35
Demand-Based Coordinated Scheduling
for Multiprocessor VMs
How to effectively schedule multithreaded workloads hosted in
multiprocessor VMs?
(Diagram: a multithreaded, communicating or parallel, workload whose threads run on sibling vCPUs scheduled by the VMM over the pCPUs)
Why Coordinated Scheduling?
• Uncoordinated vs. coordinated scheduling
  • Uncoordinated scheduling: each vCPU is treated as an independent entity, time-shared on its pCPU regardless of its sibling vCPUs
  • Coordinated scheduling: sibling vCPUs are coordinated as a group by the VMM scheduler
• Why is coordination needed?
  • Many applications are multithreaded and parallelized → multiple threads perform a job communicating with each other to arbitrate accesses to shared resources
  • Under uncoordinated scheduling, a lock holder’s vCPU can be inactive while the lock waiters’ vCPUs are active, making inter-thread communication ineffective
  • Similar to traditional job scheduling issues in distributed environments: multicore resembles a distributed environment
9/35
Coordination Space
• Space and time domains
• Space domain
• pCPU assignment policy
• Where is each sibling vCPU assigned?
• Time domain
• Preemptive scheduling policy
• When and which sibling vCPUs are preemptively scheduled
• e.g., Co-scheduling
(Diagram: a coordinated group of sibling vCPUs over the pCPUs; the space domain decides where to schedule, the time domain decides when to schedule)
10/35
Space Domain: pCPU Assignment
• A naïve method: “Balance scheduling” [Sukwong et al., EuroSys’11]
  • Spread sibling vCPUs on separate pCPUs
  • Probabilistic co-scheduling: spreading raises the likelihood that siblings run concurrently
  • No coordination in the time domain
• Limitation
  • An unrealistic assumption: “CPU load is well balanced”
  • In practice, VMs with equal CPU shares have
    • Different numbers of vCPUs
    • Different thread-level parallelism
    • Phase-changed multithreaded workloads
  (Diagram: under load imbalance, balance scheduling stacks sibling vCPUs behind a highly contended pCPU while VMs with larger CPU shares occupy others)
11/35
Space Domain: pCPU Assignment
• Proposed scheme: “Load-conscious balance scheduling”
  • A hybrid of balance scheduling and load-based assignment: if all candidate pCPUs are not overloaded, apply balance scheduling; otherwise, fall back to load-based assignment (a sketch follows below)
• Example
  • Candidate pCPU set = {pCPU0, pCPU1, pCPU2, pCPU3}; the scheduler assigns the lowest-loaded pCPU in this set
  • pCPU3 is overloaded (i.e., its CPU load > the average CPU load), so the woken sibling vCPU is assigned elsewhere instead of stacking on pCPU3’s wait queue
• How about contention between sibling vCPUs? → Passed to coordination in the time domain!
12/35
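Below is a minimal C sketch of the load-conscious pick described above; pcpu_load(), avg_pcpu_load(), and runs_sibling_of() are hypothetical helpers standing in for the scheduler’s bookkeeping, not Xen or KVM APIs.

```c
struct vcpu;                                          /* opaque vCPU handle */
extern unsigned long pcpu_load(int cpu);              /* hypothetical */
extern unsigned long avg_pcpu_load(void);             /* hypothetical */
extern int runs_sibling_of(struct vcpu *v, int cpu);  /* hypothetical */

int lc_balance_pick(struct vcpu *v, int ncpus)
{
    int cpu, pick = -1;
    unsigned long best = (unsigned long)-1;
    unsigned long avg = avg_pcpu_load();

    /* Balance scheduling: lowest-loaded pCPU that is neither overloaded
     * nor already running a sibling vCPU. */
    for (cpu = 0; cpu < ncpus; cpu++) {
        if (runs_sibling_of(v, cpu) || pcpu_load(cpu) > avg)
            continue;
        if (pcpu_load(cpu) < best) { best = pcpu_load(cpu); pick = cpu; }
    }
    if (pick >= 0)
        return pick;

    /* Load-based fallback: every balanced candidate is overloaded, so
     * take the lowest-loaded pCPU overall, siblings allowed. */
    for (cpu = 0; cpu < ncpus; cpu++)
        if (pcpu_load(cpu) < best) { best = pcpu_load(cpu); pick = cpu; }
    return pick;
}
```

The fallback deliberately tolerates sibling stacking: the contention it may cause is handled by coordination in the time domain, described next.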
Time Domain: Preemption Policy
• What type of contention demands coordination?
• Busy-waiting for communication (or synchronization)
• Unnecessary CPU consumption by busy-waiting for a
descheduled (inactive) vCPU
• Significant performance degradation
• Why serious in multiprocessor VMs?
• Semantic gap
• OSes make liberal use of busy-waiting (e.g., spinlock) since they
believe their vCPUs are always online (i.e., dedicated)
• “Demand-based coordinated scheduling”
• Issues
  • When and where is coordination demanded?
  • Does busy-waiting really matter?
  • How to detect coordination demand?
13/35
Time Domain: Preemption Policy
• When and where to demand coordination?
• Experimental analysis
• 13 emerging multithreaded applications in the PARSEC suite
• Diverse characteristics
• Kernel time ratio in the case of consolidation
• Busy-waiting occurs in kernel space
(Charts: per-application CPU time split into kernel time and user time for a VM with 8 vCPUs on 8 pCPUs, solo run (no consolidation) vs. corun (with one VM running streamcluster), across the 13 PARSEC applications)
The kernel time ratio is largely amplified, by 1.3x~30x, under consolidation
14/35
Time Domain: Preemption Policy
• Where is the kernel time amplified?
(Per-application CPU cycles %, with the application’s total kernel CPU cycles % in parentheses)
• TLB shootdown: dedup 43% (83%), ferret 9% (11%), vips 41% (47%)
• Lock spinning: bodytrack 5% (8%), canneal 4% (5%), dedup 36% (83%), facesim 4% (5%), streamcluster 10% (11%), swaptions 5% (6%), vips 4% (47%), x264 7% (8%)
15/35
Time Domain: Preemption Policy
• TLB shootdown
  • Notification of TLB invalidation to a remote CPU via an inter-processor interrupt (IPI)
  • TLB (Translation Lookaside Buffer): a per-CPU cache for virtual address mappings
  • When a thread modifies or unmaps a shared mapping (e.g., V->P1 becomes V->P2 or V->Null), the initiating CPU busy-waits until all corresponding remote TLB entries are invalidated
  → Efficient in native systems, but not in virtualized systems if the target vCPUs are not scheduled
“A TLB shootdown IPI is a signal for coordination demand!”
→ Co-schedule IPI-recipient vCPUs with the sender vCPU (see the sketch below)
(Chart: TLB shootdown IPI traffic, in TLB IPIs/sec/vCPU, across the 13 PARSEC applications)
16/35
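A minimal sketch of how the VMM could act on a trapped TLB shootdown IPI; vcpu_is_running() and mark_urgent() are hypothetical hooks, and the thesis’s implementation details may differ.

```c
struct vcpu;                                   /* opaque vCPU handle  */
extern int  vcpu_is_running(struct vcpu *v);   /* hypothetical helper */
extern void mark_urgent(struct vcpu *v);       /* hypothetical helper */

void on_tlb_shootdown_ipi(struct vcpu **recipients, int n)
{
    /* Promote every descheduled recipient to the urgent state so it
     * runs alongside the sender, shortening the sender's busy-wait. */
    for (int i = 0; i < n; i++)
        if (!vcpu_is_running(recipients[i]))
            mark_urgent(recipients[i]);
}
```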
Time Domain: Preemption Policy
• Lock spinning
• Which spinlocks show dominant wait time?
(Chart: breakdown of spinlock wait time per application; the futex wait-queue lock dominates, up to 89% and 81% in some applications, followed by semaphore wait-queue, runqueue, pagetable, wait-queue, and other locks)
• Futex: kernel support for user-level synchronization (e.g., mutex, barrier, condvar)
vCPU0:
  mutex_lock(mutex)
  /* critical section */
  mutex_unlock(mutex)
  futex_wake(mutex) {
    spin_lock(queue->lock)
    thread = dequeue(queue)
    wake_up(thread)          /* wakes vCPU1 */
    spin_unlock(queue->lock)
  }

vCPU1:
  mutex_lock(mutex)
  futex_wait(mutex) {
    spin_lock(queue->lock)
    enqueue(queue, me)
    spin_unlock(queue->lock)
    schedule() /* blocked */
  }
  /* wake-up */
  /* critical section */
  mutex_unlock(mutex)
  futex_wake(mutex) {
    spin_lock(queue->lock)   /* busy-waits if vCPU0 still holds the lock */
    ...

If vCPU0 is preempted while waking vCPU1 up, vCPU1 busy-waits on the preempted spinlock: the so-called lock-holder preemption (LHP)
“A reschedule IPI is a signal for coordination demand!”
→ Delay preemption of an IPI-sender vCPU until the likely-held spinlock is released
17/35
Time Domain: Preemption Policy
• Proposed scheme
• Urgent vCPU first (UVF) scheduling
• Urgent time slice (utslice)
• Long enough for a reschedule IPI sender to release a spinlock
• Short enough to quickly serve multiple urgent vCPUs
(Diagram: each pCPU serves a FIFO-ordered urgent queue ahead of its proportional-shares runqueue; a vCPU in the urgent state is protected from preemption during its urgent time slice (utslice), as long as inter-VM fairness is kept; a minimal sketch follows below)
18/35
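A minimal sketch of the UVF pick-next decision, with hypothetical queue helpers and fairness check; the real scheduler additionally enforces the utslice and urgent-allowance accounting.

```c
struct vcpu;
struct queue;
extern struct vcpu *peek_fifo(struct queue *q);            /* hypothetical */
extern struct vcpu *dequeue_fifo(struct queue *q);         /* hypothetical */
extern struct vcpu *dequeue_by_shares(struct queue *q);    /* hypothetical */
extern int fairness_allows_boost(struct vcpu *v);          /* hypothetical */

struct vcpu *uvf_pick_next(struct queue *urgentq, struct queue *runq)
{
    struct vcpu *v = peek_fifo(urgentq);     /* urgent vCPUs, FIFO order */

    /* Serve an urgent vCPU first, but only while inter-VM fairness is
     * kept; it then runs preemption-protected for one utslice. */
    if (v && fairness_allows_boost(v))
        return dequeue_fifo(urgentq);
    return dequeue_by_shares(runq);          /* proportional-shares order */
}
```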
Evaluation
• Utslice parameter
• 1. Utslice for reducing LHP
• 2. Utslice for quickly serving multiple urgent vCPUs
(Chart: number of futex-queue LHPs vs. utslice for bodytrack, facesim, and streamcluster)
Workloads: a futex-intensive workload in one VM + dedup in another VM as a preempting VM
A utslice above 300 us yields a 2x~3.8x LHP reduction
The remaining LHPs occur during local wake-up or before reschedule-IPI transmission → not likely to lead to lock contention
19/35
Evaluation
• Utslice parameter
• 1. utslice for reducing LHP
• 2. utslice for quickly serving multiple urgent vCPUs
(Chart: spinlock cycles (%), TLB shootdown cycles (%), and average execution time (sec) vs. utslice)
Workloads: 3 VMs, each running vips (a TLB-IPI-intensive application)
As the utslice increases, TLB shootdown cycles increase (up to ~11% execution-time degradation)
A 500 us utslice is appropriate for both LHP reduction and quickly serving multiple urgent vCPUs
20/35
Evaluation
• Workload consolidation
• One 8-vCPU VM + four 1-vCPU VMs (x264)
(Charts: normalized execution time of the 8-vCPU VM’s workloads and of the co-running 1-vCPU VMs (x264) under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co)
Multiprocessor VMs need coordination in the time domain (up to ~90% improvement)
Singleprocessor VMs: balance scheduling degrades the 1-vCPU VMs by incurring unnecessary contention
21/35
Summary
• Contributions
• Load-conscious balance scheduling
• Essential for heterogeneously consolidated environments
where load imbalance usually takes place
• IPI-driven coordinated scheduling
• Effective for VMM to alleviate unnecessary CPU contention
based on IPIs between sibling vCPUs
• Future work
• Combining the “scheduling-based method” with
“contention management methods”
• Contention management methods
• Paravirtual spinlock, HW-based spin detection
22/35
Virtual Asymmetric Multiprocessor for
User-Interactive Performance
How to improve the performance of user-interactive workloads mixed with background workloads in multiprocessor VMs?
(Diagram: user-interactive and background workloads sharing a multiprocessor VM over the VMM scheduler)
Motivation
• Background & idea
  • The initial proposal of “task-aware scheduling” did not consider multiprocessor VMs
  • Existing VMM schedulers give an illusion of a symmetric multiprocessor (SMP) to each VM
    • Due to the absence of mixed-workload tracking, a VM’s vCPUs are equally contended regardless of user interactions
• Proposal: virtual AMP (vAMP)
  • The size of a vCPU = the amount of its CPU shares
  • Fast vCPUs host the interactive workload, slow vCPUs the background workload
24/35
Workload Classification
• Previous methods
  • Time-quanta-based classification: “interactive workloads typically show short time quanta”
    + Clear classification between I/O-bound and CPU-bound tasks
    - Modern interactive workloads show mixed behaviors
    - A multithreaded CPU-bound job shows short time quanta due to inter-thread communication
  • OS technique: user-I/O-driven IPC tracking [Zheng et al., SIGMETRICS’10]
    • e.g., X server → Terminal → Firefox, connected by IPCs, form an interactive task group for a user I/O
    + Identifies the set of tasks involved in a user interaction (I/O)
    - Relies on various OS-level IPC structures (e.g., socket, pipe, signal) → the VMM cannot access OS-level IPCs
25/35
Workload Classification
• Proposed scheme: “background workload identification”
  • Instead of tracking interactive workloads, identify “background CPU noise” at the time of a user I/O
• Rationales
  • Interactive CPU load is typically initiated by user I/O
  • The VMM can unobtrusively monitor user I/O and per-task CPU load
• Exceptional case: multimedia workloads (e.g., video playback)
  • Filter multimedia tasks (tasks requesting audio I/O) out of the background workload set
26/35
Virtual Asymmetric Multiprocessor
• vAMP
  • Dynamically adjusting the CPU shares of a vCPU according to the task it currently hosts (a minimal sketch follows below)
  1. Maintain per-task CPU load during the pre-I/O period → the pre-I/O period is set shorter than general user think time (1 second by default)
  2. Tag tasks that have generated nontrivial CPU loads as background tasks → the threshold can be set to filter daemon tasks that possibly serve interactive workloads
  3. Dynamically adjust each vCPU’s shares based on the weight ratio (e.g., background : non-background = 1:5)
  4. Provide vAMP during an interactive episode → an interactive episode is restarted when another user I/O occurs, and finishes if the maximum time elapses without user I/O
27/35
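A minimal sketch of the share adjustment in step 3, assuming hypothetical fields and a set_vcpu_shares() hook; the 1:5 ratio mirrors the example above.

```c
#define BG_WEIGHT     1
#define NONBG_WEIGHT  5

struct task { int is_background; };
struct vcpu { struct task *curr_task; unsigned base_shares; };

extern void set_vcpu_shares(struct vcpu *v, unsigned shares); /* hypothetical */

void vamp_on_task_switch(struct vcpu *v, struct task *next, int in_episode)
{
    v->curr_task = next;
    if (!in_episode) {                      /* outside an interactive    */
        set_vcpu_shares(v, v->base_shares); /* episode: symmetric vSMP   */
        return;
    }
    /* Slow vCPU while hosting a background task, fast vCPU otherwise. */
    unsigned w = next->is_background ? BG_WEIGHT : NONBG_WEIGHT;
    set_vcpu_shares(v, v->base_shares * w / NONBG_WEIGHT);
}
```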
Limitation
• An intrinsic limitation of VMM-only approach
• Manipulating only a single scheduling layer
(i.e., VMM scheduler)
• A vAMP-oblivious OS scheduler
• Agnostic about underlying vAMP (i.e., all vCPUs are identical)
• Possibly multiplexing interactive and background tasks on the
same vCPU
• A slow vCPU has higher scheduling latency
• “Frequent multiplexing” might offset the benefit of vAMP
Example: a scheduling trace during a Google Chrome launch, with background and non-background tasks multiplexed on the same vCPU
“An aggressive weight ratio is not always effective if multiplexing happens frequently”
→ The weight ratio is an important parameter for interactive performance
28/35
Guest OS Extension
• Guest OS extension for vAMP
• OS enlightenment about vAMP
• To avoid ineffective multiplexing of interactive and background tasks on the same vCPU → isolation
• Design principles
• Keeping VMM OS-independent
• Optional extension for further enhancement of interactive
performance
• Keeping extension OS-independent
• No reliance on specific OS functionality
• Isolating tasks on separate CPUs is a general interface of
commodity OSes (e.g., modifying CPU affinity)
• Small kernel changes for low maintenance cost
29/35
Guest OS Extension
• Linux extension for vAMP
  • A user-level vAMP-daemon isolates background tasks, exposed by the VMM, from non-background tasks
  • Small kernel changes expose the VMM-identified background tasks (e.g., T1, T2) to user space via a procfs interface
  • Flow: 1. event-driven notification → 2. the daemon reads the background-task list → 3. it isolates tasks via the cpuset interface
• Isolation procedure:
  1. Initially dedicate nr_fast_vcpus to interactive tasks (i.e., non-background tasks)
  2. Periodically increase nr_fast_vcpus when the fast vCPUs become fully utilized (also periodically check for the end of the interactive episode → stop isolation)
• Default nr_fast_vcpus = 1, due to the low thread-level parallelism of interactive workloads [Blake et al., ISCA’10]
30/35
Evaluation
• Application launch
• Background workload
• Data mining application (freqmine) with 8 threads
• Weight ratio (background : non-background)
• vAMP(L)=1:3, vAMP(M)=1:9, vAMP(H)=1:18
Setup: two 8-vCPU VMs on 8 pCPUs, each running freqmine; applications are launched through a remote desktop client
(Chart: normalized average launch time of Impress, Firefox, Chrome, and Gimp under Baseline, vAMP(L/M/H), and vAMP(L/M/H) w/ Ext)
vAMP improves launch performance by 7~40%; a high weight ratio is ineffective because of the negative effect of multiplexing
The guest OS extension further improves interactive performance by up to 70%
Why did Gimp show significant improvement even without the guest OS extension?
31/35
Evaluation
• Application launch
• Chrome vs. Gimp (without guest OS extension)
Chrome (web browser) → many threads are cooperatively scheduled in a fine-grained manner
Gimp (image editing program) → a single thread dominantly performs the computation with little communication
(Traces show background vs. non-background tasks)
32/35
Evaluation
• Media player
• VLC media player
• 1920x800 HD video with 23.976 frames per second (FPS)
• Mult: multimedia workload filtering
Without multimedia workload filtering,
VLC is misidentified as a background task
vAMP improves playback quality by up to 22.3 FPS,
but high weight ratio still degrades the quality
Guest OS extension achieves 23.8 FPS
Setup: two 8-vCPU VMs on 8 pCPUs, each running freqmine; one also runs the media player
(Chart: average frames per second under Baseline, vAMP(L) w/o Mult, vAMP(L), vAMP(M), vAMP(H), and vAMP(L/M/H) w/ Ext)
33/35
Summary
• vAMP
• Dynamically varying vCPU performance based on the workloads the vCPUs host
• A feasible method of improving interactive performance
• Assisted by a simple guest OS extension
• Isolation of different types of workloads enhances the
effectiveness of vAMP
• Future work
• Collaboration of VMM and OSes for vAMP
• Standard & well-defined API
34/35
Conclusions
• Lessons learned from the thesis
• In-depth analysis of OSes and workloads can realize
intelligent CPU scheduling based only on VMM-
visible events
• Both lightweightness and efficiency are achieved
• Task-awareness is an essential ability for VMM to
effectively handle mixed workloads
• Multi-tasking is ubiquitous inside every VM
• Coordinated scheduling improves CPU efficiency of
multiprocessor VMs
• Resolving unnecessary CPU contention is crucial
35/35
Publications
• Task-aware VM scheduling
• [VEE’09] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, “Task-aware Virtual Machine Scheduling for I/O
Performance”
• [JPDC’11] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, Seungryoul Maeng, “Transparently Bridging
Semantic Gap in CPU Management for Virtualized Environments”
• [MMSys’12] Hwanju Kim, Jinkyu Jeong, Jaeho Hwang, Joonwon Lee, Seungryoul Maeng, “Scheduler Support for Video-oriented
Multimedia on Client-side Virtualization”
• [ApSys’12] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, “Virtual Asymmetric Multiprocessor for
Interactive Performance of Consolidated Desktops”
• Demand-based coordinated scheduling
• [ASPLOS’13] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, “Demand-Based Coordinated
Scheduling for SMP VMs”
• Other work on virtualization
• [IEEE TC’11] Hwanju Kim, Heeseung Jo, and Joonwon Lee, “XHive: Efficient Cooperative Caching for Virtual Machines”
• [IEEE TC’10] Heeseung Jo, Hwanju Kim, Jae-Wan Jang, Joonwon Lee, and Seungryoul Maeng, “Transparent Fault Tolerance of Device
Drivers for Virtual Machines”
• [MICRO’10] Daehoon Kim, Hwanju Kim, and Jaehyuk Huh, “Virtual Snooping: Filtering Snoops in Virtualized Multi-cores”
• [VHPC’11] Sangwook Kim, Hwanju Kim, and Joonwon Lee, “Group-Based Memory Deduplication for Virtualized Clouds”
• [Euro-Par’08] Dongsung Kim, Hwanju Kim, Myeongjae Jeon, Euiseong Seo, Joonwon Lee, “Guest-Aware Priority-based Virtual
Machine Scheduling for Highly Consolidated Server”
• [VHPC’09] Heeseung Jo, Youngjin Kwon, Hwanju Kim, Euiseong Seo, Joonwon Lee, Seungryoul Maeng, “SSD-HDD-Hybrid Virtual Disk
in Consolidated Environments”
• Other work on embedded and mobile systems
• [ACM TECS’12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, “Rigorous Rental Memory
Management for Embedded Systems”
• [CASES’12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, “DaaC: Device-reserved Memory as an
Eviction-based File Cache”
• [IEEE TCE’09] Heeseung Jo, Hwanju Kim, Hyun-Gul Roh, Joonwon Lee, “Improving the Startup Time of Digital TV”
• [IEEE TCE’09] Heeseung Jo, Hwanju Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, “Optimizing the Startup Time of
Embedded Systems: A Case Study of Digital TV”
• [IEEE TCE’10] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jin-Soo Kim, and Joonwon Lee, “AppWatch: Detecting Kernel Bug for
Protecting Consumer Electronics Applications”
• [IEEE TCE’12] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jeonghwan Choi, and Joonwon Lee, “Compressed Memory Swap for QoS of
Virtualized Embedded Systems”
• [SPE’10] Jinkyu Jeong, Euiseong Seo, Jeonghwan Choi, Hwanju Kim, Heeseung Jo, and Joonwon Lee, “KAL: Kernel-assisted Non-
invasive Memory Leak Tolerance with a General-purpose Memory Allocator”
Thank
You !
References
[Blake et al., ISCA’10] Evolution of thread-level parallelism in desktop applications
[Botelho’08] Virtual machines per server, a viable metric for hardware selection?
(http://itknowledgeexchange.techtarget.com/server-farm/virtual-machines-per-server-a-viable-metric-for-hardware-selection/)
[Govindan et al., VEE’07] Xen and co.: communication-aware CPU scheduling for consolidated xen-based hosting
platforms
[Hu et al., HPDC’10] I/O scheduling model of virtual machine based on multi-core dynamic partitioning
[Kim et al., EuroPar’08] Guest-Aware Priority-Based Virtual Machine Scheduling for Highly Consolidated Server
[Kim et al., VEE’09] Task-aware virtual machine scheduling for I/O performance
[Kim et al., JPDC’11] Transparently Bridging Semantic Gap in CPU Management for Virtualized Environments
[Lee et al., VEE’10] Supporting Soft Real-Time Tasks in the Xen Hypervisor
[Liao et al., ANCS’08] Software techniques to improve virtualized I/O performance on multi-core systems
[Lin et al., SC’05] VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling
[Masrur et al., RTCSA’10] VM-Based Real-Time Services for Automotive Control Applications
[Ongaro et al., VEE’08] Scheduling I/O in virtual machine monitors
[Sukwong et al., EuroSys’11] Is co-scheduling too expensive for SMP VMs?
[Uhlig et al., VM’04] Towards scalable multiprocessor virtual machines
[VMware ESXi’10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1
[VMware VDI] Enabling your end-to-end virtualization solution.
(http://www.vmware.com/solutions/partners/alliances/hp-vmware-customers.html)
[Weng et al., HPDC’11] Dynamic adaptive scheduling for virtual machines
[Weng et al., VEE’09] The hybrid scheduling framework for virtual machine systems
[Xia et al., ICPADS’09] PaS: A Preemption-aware Scheduling Interface for Improving Interactive Performance in
Consolidated Virtual Machine Environment
[Zheng et al., SIGMETRICS’10] RSIO: automatic user interaction detection and scheduling
EXTRA SLIDES
Demand-Based Coordinated
Scheduling for Multiprocessor VMs
Proportional-Share Scheduler
• Proportional-share scheduler for SMP VMs
• Common scheduler for commodity VMMs
• Employed by KVM, Xen, VMware, etc.
• VM’s shares (S) =
Total shares x (weight / total weight)
• VCPU’s shares = S / # of active VCPUs
• Active vCPU: Non-idle vCPU
• Example: a 4-vCPU VM with S = 1024
  • Single-threaded workload: one active vCPU holds all 1024 shares
  • Multithreaded (programmed) workload: four active vCPUs hold 256 shares each
• Symmetric vCPUs: existing schedulers view active vCPUs as containers with identical power (see the sketch below)
41/35
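The shares arithmetic above, written out as plain C helpers (a sketch, not any VMM’s actual API):

```c
unsigned vm_shares(unsigned total_shares, unsigned weight,
                   unsigned total_weight)
{
    return total_shares * weight / total_weight;   /* VM's shares S */
}

unsigned vcpu_shares(unsigned S, unsigned n_active_vcpus)
{
    return S / n_active_vcpus;  /* each active (non-idle) vCPU gets S/n */
}
/* e.g., a 4-vCPU VM with S = 1024: 4 active vCPUs -> 256 shares each;
 * a single-threaded workload (1 active vCPU) -> all 1024 shares. */
```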
Helping Lock
• Spin-then-block lock [AMD, XenSummit’08]
• Block after spin during a certain period
• + Reducing unnecessary spinning
• - Still LHP and unnecessary spinning
• - Profiling required to find a suitable spin threshold
• - Kernel instrumentation
• Still, it is the most popular paravirtualized approach for open-source kernels like Linux
• Paravirt-spinlock for Xen Linux (mainline)
• Paravirt-spinlock for KVM Linux (patch)
42/35
Coordination for User-level Contention
• User-level synchronization
• Pure spin-based synchronization is rarely used in user space
• Block-based or spin-then-block synchronization
• Reschedule IPI driven coscheduling
• With regard to spin-then-block synchronization, less contention
occurs by coscheduling cooperative threads
(Charts: reschedule IPI traffic of streamcluster; execution time of streamcluster consolidated with bodytrack)
Streamcluster intensively uses spin-then-block barriers; Resched-Co alleviates the spin phase of lock wait time
43/35
Performance on PLE
• PLE (Pause-Loop Exiting)
  • A HW mechanism to notify the VMM of spinning beyond a predefined threshold (i.e., pathological busy-waiting)
  • In response to this notification, the VMM lets the currently running vCPU yield its pCPU
(Charts: facesim (futex-intensive) and ferret (TLB-IPI-intensive))
IPI-driven scheduling proactively alleviates unnecessary contention, whereas PLE reactively relieves contention that has already happened
44/35
Evaluation: Urgent Allowance
• Urgent allowance
• Trading short-term fairness with CPU efficiency
• How much short-term fairness is traded?
Workloads: 1 vips VM + 2 facesim VMs
Trading short-term fairness improves overall efficiency
without negative impact on long-term fairness 45/35
Evaluation: Two Multiprocessor VMs
Workloads: an 8-vCPU VM corun with dedup or with freqmine
(Charts: corun vs. solorun execution time; a = baseline, b = balance, c = LC-balance, d = LC-balance+Resched-DP, e = LC-balance+Resched-DP+TLB-Co)
46/35
TLB Shootdown IPIs of Windows 7
• Heavy use of TLB shootdown IPIs by Windows 7 desktop application launches
• Most TLB shootdown IPIs are sent with multi/broadcasting
• TLB-IPI-driven coscheduling improves PowerPoint launch time by 23% when consolidated with 4 VMs, each running streamcluster

App               Explorer  IE    PowerPoint  Word  Excel
# of triggers     102       262   166         179   77
# of IPIs         608       1230  782         990   418
Launch time (ms)  622       982   975         1108  1011
47/35
Virtual Asymmetric Multiprocessor for
User-Interactive Performance
Multimedia Workload Filtering
• Tracking audio-requesting tasks
• Tracking tasks that access a virtual audio device
• Excluding audio access in an interrupt context
• Checking audio Interrupt Service Register (ISR)
• Server-client sound system
• A user-level task to serve all audio requests (e.g., pulseaudio)
• Remote wake-up tracking
Setup: one VM runs VLC+facesim, another runs freqmine (facesim severely interferes with remote wake-up tracking)
49/35
Measurement Methodology
• Spiceplay
• Snapshot-based record/replay
• Robust replay for varying loads
• Similar to VNCPlay [USENIX’05] and Deskbench [IM’09]
• Extension on the SPICE remote desktop client
• Record
  • Snapshot at an input point → input recording → snapshot at a user-perceived point
• Replay
  • Snapshot comparison & start timer → input replaying → snapshot comparison & stop timer
50/35
vAMP Parameters
• Default vAMP parameters
Parameter: background load threshold
  Role: tagging background tasks
  Default value: 50%
  Rationale: large enough to filter general daemon tasks such as an X server
Parameter: maximum time of an interactive episode
  Role: duration of distributing asymmetric CPU shares
  Default value: 5 sec
  Rationale: large enough to cover a general interactive episode (2 sec was used in previous research based on HCI work, but a larger value is needed to cover long-launched applications)
(Chart: video playback FPS with vAMP(L) w/ Ext for bgload_thresh = 5 vs. 50; with a 5% threshold, the X server is misclassified as a background task)
(Chart: normalized Gimp launch time with vAMP(L) w/ Ext for max_intr_episode = 2 sec vs. 5 sec; with 2 sec, the interactive episode finishes prematurely before the end of the launch)
51/35
Evaluation: Background Performance
• Performance of background workloads
• With repeated launches at a 1-second interval (an intensively interactive workload)
• 3-28% degradation
(Chart: normalized average execution time of the background workload during repeated launches of Impress, Firefox, Chrome, and Gimp under Baseline, vAMP(L/M/H), and vAMP(L/M/H) w/ Ext)
52/35
Evaluation: Guest OS Extension
• Interrupt pinning
• An interactive workload can accompany I/O
• Even warm launch can involve synchronous disk writes
• During an interactive episode, pinning I/O interrupts
on fast vCPUs
• In Linux, manipulate /proc/irq/<irq number>/smp_affinity
53/35
(Chart: average Chrome launch time with vAMP(L/M/H) w/ Ext, with and without interrupt pinning)
A Chrome launch entails some synchronous writes; if a disk I/O interrupt is delivered to a slow vCPU, scheduling latency increases
Evaluation: Guest OS Extension
• nr_fast_vcpus parameter
• Initial number of fast vCPUs
(Chart: normalized average launch time of Impress, Firefox, Chrome, and Gimp with nr_fast_vcpus = 1, 2, and 4)
Interactive workloads with low thread-level parallelism do not require a large number of initial fast vCPUs
A workload with low thread-level parallelism is adversely affected by multiple fast vCPUs, since unnecessary vCPU-level scheduling latency is involved
54/35
Task-aware VM Scheduling for I/O
Performance
Problem of VM Scheduling
• Task-agnostic scheduling
(Diagram: the VMM run queue, sorted by CPU fairness, holds VM1’s and VM2’s vCPUs, each VM mixing I/O-bound, CPU-bound, and mixed tasks; when an I/O event arrives for a low-priority VM’s I/O-bound task, the VMM neither knows which task the event is for nor schedules that VM immediately)
56/35
Task-agnostic scheduling
• The worst-case example for 6 consolidated VMs: network response time
  • Native Linux: non-consolidated OS; XenoLinux: consolidated OS on Xen
  <Workloads>
  • I/O+CPU: 1 VM runs a server & a CPU-bound task; 5 VMs run a CPU-bound task
  • I/O: 1 VM runs a server; 5 VMs run a CPU-bound task
  • The I/O-only case is handled well by the boosting mechanism of the Xen Credit scheduler
  • Poor responsiveness in the I/O+CPU case → the boosting mechanism recognizes I/O-boundness only at vCPU-level granularity
57/35
Task-aware VM Scheduling
• Goals
  • Tracking I/O-boundness at task granularity
  • Improving the response time of I/O-bound tasks
  • Keeping inter-VM fairness
• Challenges (for VMs mixing I/O-bound, CPU-bound, and mixed tasks)
  1. I/O-bound task identification
  2. I/O event correlation
  3. Partial boosting
58/35
Task-aware VM Scheduling
1. I/O-bound Task Identification
• Observable information at the VMM
• I/O events
• Task switching events [Jones et al., USENIX’06]
• CPU time quantum of each task
• Inference based on common OS techniques
• General OS techniques (Linux, Windows, FreeBSD,
…) to infer and handle I/O-bound tasks
• 1. Small CPU time quantum (main)
• 2. Preemptive scheduling in response to I/O events
(supportive)
• Example (Intel x86): task switches are observed as CR3 updates, so a task’s time quantum is the interval between consecutive CR3 updates, and I/O events can be ordered within it (see the sketch below)
59/35
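A minimal sketch of this inference, assuming a hypothetical CR3-write trap hook and an account_quantum() helper (cf. Antfarm-style task tracking):

```c
struct task_info { unsigned long cr3; unsigned long long last_switch; };

extern void account_quantum(struct task_info *t,
                            unsigned long long quantum);   /* hypothetical */

void on_cr3_write(struct task_info *prev, struct task_info *next,
                  unsigned long new_cr3, unsigned long long now)
{
    /* The interval between CR3 updates is the departing task's CPU
     * time quantum; short quanta are evidence of I/O-boundness. */
    account_quantum(prev, now - prev->last_switch);
    next->cr3 = new_cr3;
    next->last_switch = now;
}
```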
• Three disjoint observation classes
  • Positive evidence: supports I/O-boundness (if both 1. small CPU time quantum (main) and 2. preemptive scheduling in response to I/O events (supportive) are satisfied)
  • Negative evidence: supports non-I/O-boundness (if 1 is violated; more penalty for a longer time quantum)
  • Ambiguity: no evidence (otherwise)
• Weighted evidence accumulation
  • The degree of belief grows with the number of sequential observations; once it is high enough, the task is believed to be an I/O-bound task (a sketch follows below)
60/35
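A minimal sketch of the accumulation, with illustrative weights and threshold (the thesis’s actual parameters are not shown here):

```c
enum evidence { POSITIVE, NEGATIVE, AMBIGUITY };

struct task_belief { int score; };       /* degree of belief */

#define IO_BOUND_THRESHOLD 8             /* illustrative value */

void observe(struct task_belief *b, enum evidence e, int quantum_ms)
{
    switch (e) {
    case POSITIVE:  b->score += 1; break;
    case NEGATIVE:  b->score -= 1 + quantum_ms / 10; /* longer quantum,  */
                    break;                           /* bigger penalty   */
    case AMBIGUITY: break;                           /* no change        */
    }
    if (b->score < 0) b->score = 0;
}

int is_io_bound(const struct task_belief *b)
{
    return b->score >= IO_BOUND_THRESHOLD;
}
```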
Task-aware VM Scheduling
2. I/O Event Correlation
• I/O event correlation
• To distinguish an incoming event for I/O-bound tasks
• Why?
• To selectively prioritize I/O-bound tasks in a VM
• CPU-bound tasks also conduct I/O operations
• Goal
  • Best-effort correlation: lightweight rather than accurate
• I/O types
• Block I/O: disk read
• Network I/O: packet reception
61/35
Task-aware VM Scheduling
2. I/O Event Correlation: Block I/O
• Request-response correlation
• Window-based correlation
• Correlation for delayed read events by guest OS
• e.g., block I/O scheduler
• Overhead per VCPU = window size x 4bytes (task ID)
(Diagram: task T1’s read may be issued to the VMM later, e.g., delayed by the guest’s block I/O scheduler, while tasks T2-T4 run; on completion, the VMM credits the event to any I/O-bound task within the inspection window; a sketch follows below)
62/35
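A minimal sketch of the inspection window as a ring buffer of task IDs; is_io_bound_task() is a hypothetical lookup into the identification state above.

```c
#define WINDOW 3   /* the inspection window size used in the evaluation */

extern int is_io_bound_task(unsigned task_id);   /* hypothetical lookup */

struct io_window { unsigned ids[WINDOW]; unsigned head; };

void window_record(struct io_window *w, unsigned task_id)
{
    w->ids[w->head] = task_id;          /* remember the last W readers */
    w->head = (w->head + 1) % WINDOW;
}

int window_has_io_bound(const struct io_window *w)
{
    /* Credit a read completion to any I/O-bound task in the window. */
    for (int i = 0; i < WINDOW; i++)
        if (is_io_bound_task(w->ids[i]))
            return 1;
    return 0;
}
```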
Task-aware VM Scheduling
2. I/O Event Correlation: Network I/O
• History-based prediction
• Asynchronous packet reception
• Monitoring “the first woken task” in response to an incoming packet
  • An N-bit saturating counter per destination port number: incremented if the first woken task is I/O-bound, decremented otherwise
  • Example: a 2-bit counter stepping through non-I/O-bound (00), weak I/O-bound (01), I/O-bound (10), and strong I/O-bound (11)
  • If the port’s counter MSB is set, the packet is considered destined for I/O-bound tasks
  • Overhead per VM = N x 8 KB (a sketch follows below)
63/35
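A minimal sketch of the 2-bit (N = 2) saturating-counter portmap; for 64K ports at 2 bits each this is 16 KB, consistent with the N x 8 KB figure above.

```c
#include <stdint.h>

static uint8_t portmap[65536 / 4];       /* 2 bits x 64K ports = 16 KB */

static unsigned get2(uint16_t port)
{ return (portmap[port >> 2] >> ((port & 3) * 2)) & 3; }

static void set2(uint16_t port, unsigned v)
{
    unsigned shift = (port & 3) * 2;
    portmap[port >> 2] =
        (uint8_t)((portmap[port >> 2] & ~(3u << shift)) | (v << shift));
}

void update_portmap(uint16_t port, int woken_task_is_io_bound)
{
    unsigned c = get2(port);
    if (woken_task_is_io_bound) { if (c < 3) set2(port, c + 1); }
    else                        { if (c > 0) set2(port, c - 1); }
}

int packet_for_io_bound(uint16_t port)
{
    return get2(port) & 2;               /* MSB set => I/O-bound */
}
```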
Task-aware VM Scheduling
3. Partial Boosting
• Priority boosting with task-level granularity
• Borrowing future time slice to promptly handle an
incoming I/O event as long as fairness is kept
• Partial boosting lasts during the run of I/O-bound
tasks
(Diagram: if an I/O event destined for VM3 is inferred to be handled by its I/O-bound task, the VMM initiates partial boosting for VM3’s vCPU, moving it ahead of the fairness-sorted run queue holding VM1 and VM2)
64/35
Evaluation (1/4)
• Implementation on Xen 3.2
• Experimental setup
• Intel Pentium D for Linux (single core enabled)
• Intel Q6600 (VT-x) for Windows XP (single core
enabled)
• Correlation parameters
• Chosen for >90% accuracy and low overheads
by stressful tests with synthetic workloads
• Block I/O: Inspection window size = 3
• Network I/O: Portmap bit width = 2
65/35
Evaluation (2/4)
• Network response time
<Schedulers> Baseline = Xen Credit scheduler; TAVS = Task-aware VM scheduler
<Workloads> 1 VM: server & CPU-bound task; 5 VMs: CPU-bound task
(Charts: response time improvement with the fairness guarantee kept)
66/35
Evaluation (3/4)
• Real workloads
• I/O-bound tasks mixed with CPU-bound tasks on Ubuntu Linux and Windows XP
<Workloads> 1 VM: I/O-bound & CPU-bound task; 5 VMs: CPU-bound task
12-50% I/O performance improvement with inter-VM fairness
67/35
Evaluation (4/4)
• I/O-bound task identification
68/35
Client-side Scheduler Support for
Multimedia Workloads
Client-side Virtualization
• Multiple OS instances on a local device
• Primary use cases
• Different OSes for application compatibility
• Consolidating business and personal
computing environments on a single device
• BYOD: Bring Your Own Device
(Diagram: business VM and personal VM over a hypervisor with a managed domain)
70/35
Multimedia on Virtualized Clients
• Multimedia is ubiquitous on any VM
(Diagram: video playback, compilation, data processing, 3D games, video conferencing, and downloading spread across Windows/Linux, business, and personal VMs on hypervisors)
1. Multimedia workloads are dominant on virtualized clients
2. Interactive systems can have concurrently mixed workloads
71/35
Issues on Multi-layer Scheduling
• A multimedia-agnostic hypervisor invalidates OS policies for multimedia
  • OS schedulers give multimedia tasks a larger CPU proportion & timely dispatching: BVT [SOSP’99], SMART [TOCS’03], Rialto [SOSP’97], BEST [MMCN’02], HuC [TOMCCAP’06], Redline [OSDI’08], RSIO [SIGMETRICS’10], Windows MMCSS
  • The hypervisor scheduler, however, sees each VM as a black box and is unaware of any multimedia-specific OS policies inside it
  • The virtual CPU is an additional abstraction → semantic gap!
72/35
Multimedia-agnostic Hypervisor
• Multimedia QoS degradation
  • Two VMs with equal CPU shares (a multimedia VM + a competing VM) on the Xen Credit scheduler
(Charts: average FPS of 720p video playback on the VLC media player and of Quake III Arena (demo1) against competing workloads in another VM)
73/35
Possible Solutions to Semantic Gap
• Explicit vs. Implicit
• Explicit OS cooperation: + accurate; - OS modification; - infeasible without multimedia-friendly OS schedulers
• Explicit user involvement: + simple; - inconvenient; - unsuitable for dynamic workloads
• Implicit hypervisor-only workload monitor: + transparent; - difficult to identify workload demands at the hypervisor
74/35
Proposed Approach
• Multimedia-aware hypervisor scheduler
• Transparent scheduler support for multimedia
• No modifications to upper layer SW (OS & apps)
• “Feedback-driven VM scheduling”
(Diagram: a multimedia monitor estimates multimedia QoS from audio, video, and CPU events; a feedback-driven multimedia manager issues scheduling commands, e.g., CPU share or priority, to the CPU scheduler)
Challenges
1. How to estimate multimedia QoS based on a small set of HW events?
2. How to control the CPU scheduler based on the estimated information?
75/35
Multimedia QoS Estimation
• What is estimated as multimedia QoS?
• “Display rate” (i.e., frame rate)
• Used by HuC scheduler [TOMCCAP’06]
• How is a display rate captured at the
hypervisor?
• Two types of display
  1. Memory-mapped display (e.g., video playback): the graphics library writes frames into a memory-mapped framebuffer exposed through the display interface
  2. GPU-accelerated display (e.g., 3D games): frames are produced through the video device’s acceleration unit
76/35
Memory-mapped Display (1/2)
• How to estimate a display update rate on the
memory-mapped framebuffer
• Write-protection for the virtual address space mapped to the framebuffer
  • Each sampled write traps into the hypervisor page fault handler, which updates the display rate
  • The hypervisor can inspect any attempt to map framebuffer memory
  • Sampling reduces trap overheads (1/128 pages, by default; a sketch follows below)
77/35
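A minimal sketch of the sampled write-protection path, with hypothetical hypervisor hooks (write_protect_page, emulate_and_resume):

```c
#define SAMPLE 128   /* protect 1 of every 128 framebuffer pages */

extern void write_protect_page(unsigned long gfn);   /* hypothetical */
extern void emulate_and_resume(void);                /* hypothetical */

struct task_stats { unsigned long fb_writes; };

void protect_framebuffer(unsigned long first_gfn, unsigned long npages)
{
    for (unsigned long i = 0; i < npages; i += SAMPLE)
        write_protect_page(first_gfn + i);    /* sampled protection */
}

void on_fb_write_fault(struct task_stats *curr)
{
    curr->fb_writes++;      /* scaled by SAMPLE when computing FPS */
    emulate_and_resume();   /* complete the faulting write         */
}
```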
Memory-mapped Display (2/2)
• Accurate estimation
• Maintaining display rate per task
• An aggregated display rate does not
represent multimedia QoS
• Tracking guest OS tasks at the hypervisor by inspecting address space switches (Antfarm [USENIX’06])
• Monitoring audio access (RSIO [SIGMETRICS’10]) by inspecting audio buffer accesses with write-protection
• A task with a high display rate and audio access → a multimedia task
(Diagram: per-task display rates, e.g., 25 FPS vs. 10 FPS)
78/35
GPU-accelerated Display (1/2)
• Naïve method
• Inspecting GPU command buffer with
write-protection or polling
  • Too heavy due to the huge volume of GPU commands
• Lightweight method
  • Little overhead, but less accuracy
  • 3D games are less sensitive to frame rate degradation than video playback
• GPU interrupt-based estimation
• An interrupt is typically used for an application to
manage buffer memory
• Hypothesis
• “A GPU interrupt rate is in proportion to a display rate”
79/35
GPU-accelerated Display (2/2)
• Linear relationship between display rates and
GPU interrupt rates
• Exponential weighted moving average (EWMA) is used to reduce fluctuation: EWMA_t = (1 - w) x EWMA_{t-1} + w x current value
(Charts: GPU interrupts/sec vs. FPS for Quake3 demos at several resolutions on Intel GMA 950 (Apple MacBook), Nvidia 6150 Go (HP Pavilion tablet), and PowerVR (Samsung Galaxy S))
A GPU interrupt rate can be used to estimate a display rate without additional overheads
without additional overheads 80/35
Multimedia Manager
• A feedback-driven CPU allocator
• Base assumption
• “Additional CPU share (or higher priority) improves a display
rate”
• Desired frame rate (DFR)
• A currently achievable display rate
• Multiplied by a tolerable ratio (0.8)
• Feedback algorithm:
  IF current FPS < previous FPS AND current FPS < DFR THEN
      increase the CPU share
      (exponential increase in the initial phase, linear increase afterwards)
  IF no FPS improvement after 3 CPU-share increases THEN
      decrease the CPU share by half
      /* exceptional cases: 1) no relationship between CPU share and FPS,
         2) FPS saturates below DFR, 3) local CPU contention in a VM */
81/35
Priority Boosting
• Responsive dispatching
• Problem
• The hypervisor does not distinguish the types of events for
priority boosting
• A VM that will handle a multimedia event cannot preempt a
currently running VM handling a normal event.
• Higher priority for multimedia-related events
• e.g., video, audio, one-shot timer
• Priority order: MMBOOST > IOBOOST > normal priority (normal priority is based on remaining CPU shares); multimedia events trigger MMBOOST, other events IOBOOST
82/35
Evaluation
• Experimental environment
• Intel MacBook with Intel GMA 950
• Xen 3.4.0 with Ubuntu 8.04
• Implementation based on Xen Credit scheduler
• Two-VM scenario
• One with direct I/O + one with indirect (hosted) I/O
• Presenting the case of direct I/O in this talk
• See the paper for the details of the indirect I/O case
83/35
Estimation Accuracy
• Error rates: 0.55%~3.05%
(Charts: real vs. estimated FPS over time — 720p video playback w/ a CPU-bound VM; Quake 3 w/ a CPU-bound VM, estimated with EWMA (w = 0.2); the multimedia manager is disabled)
84/35
Estimation Overhead
• CPU overhead caused by page faults
• Video playback
• 0.3~1% with sampling
• Less than 5% with tracking all pages
Overhead                    All pages   Sampling: 1/8 pages   1/32 pages   1/128 pages
Low resolution (640x354)    4.95%       1.10%                 0.54%        0.58%
High resolution (1280x720)  3.91%       1.04%                 0.69%        0.33%
85/35
Multimedia Manager
• Video playback (720p) + CPU-bound VM
(Chart: FPS, the desired frame rate (DFR), and the CPU share (%) over time for 720p video playback co-run with a CPU-bound VM, with zoomed-in views of the adaptation phases)
86/35
Performance Improvement
• Performance improvement: close to the maximum achievable frame rates
(Charts: average FPS of 720p video playback on the VLC media player and of Quake III Arena (demo1) against competing workloads in another VM, Credit scheduler vs. Credit scheduler w/ multimedia support)
87/35
Limitations & Discussion
• Network-streamed multimedia
• Additional preemption support required for
multimedia-related network packets
• Multiple multimedia workloads in a VM
• Multimedia manager algorithm should be refined
to satisfy QoS of mixed multimedia workloads in the
same VM
• Adaptive management for SMP VMs
• Adaptive vCPU allocation based on hosted
multimedia workloads
88/35
Conclusions
• Demands for multimedia-aware hypervisor
• Multimedia workloads are increasingly dominant in virtualized systems
• “Multimedia-friendly hypervisor scheduler”
• Transparent and lightweight multimedia support on
client-side virtualization
• Future directions
• Multimedia for server-side VDI
• Multicore extension for SMP VMs
• Considerations for network-streamed multimedia
89/35

More Related Content

PPTX
6. Live VM migration
PPTX
3. CPU virtualization and scheduling
PPTX
Demand-Based Coordinated Scheduling for SMP VMs
PPTX
2. OS vs. VMM
PPTX
Hyper-V High Availability and Live Migration
PDF
Yabusame: postcopy live migration for qemu/kvm
PDF
Scheduler Support for Video-oriented Multimedia on Client-side Virtualization
PPTX
4. Memory virtualization and management
6. Live VM migration
3. CPU virtualization and scheduling
Demand-Based Coordinated Scheduling for SMP VMs
2. OS vs. VMM
Hyper-V High Availability and Live Migration
Yabusame: postcopy live migration for qemu/kvm
Scheduler Support for Video-oriented Multimedia on Client-side Virtualization
4. Memory virtualization and management

What's hot (20)

PPTX
5. IO virtualization
PDF
Live VM Migration
PPTX
Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compl...
PPTX
1.Introduction to virtualization
PDF
Memory Virtualization
PDF
Xen Memory Management
PDF
VM Live Migration Speedup in Xen
PPT
Application Live Migration in LAN/WAN Environment
PDF
Virtual Machine Migration Techniques in Cloud Environment: A Survey
PPTX
Vm migration techniques
PPTX
webinar vmware v-sphere performance management Challenges and Best Practices
PDF
Virtualization and cloud Computing
PPTX
Virtual Machine Migration & Hypervisors
PPSX
Redesigning Xen Memory Sharing (Grant) Mechanism
PDF
Virtual Asymmetric Multiprocessor for Interactive Performance of Consolidated...
PPTX
Virtualization & Network Connectivity
PPTX
Introduction to Virtualization, Virsh and Virt-Manager
PDF
Virtualization Technology Overview
PPTX
Virtualization 101 - DeepDive
PPTX
cloud computing: Vm migration
5. IO virtualization
Live VM Migration
Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compl...
1.Introduction to virtualization
Memory Virtualization
Xen Memory Management
VM Live Migration Speedup in Xen
Application Live Migration in LAN/WAN Environment
Virtual Machine Migration Techniques in Cloud Environment: A Survey
Vm migration techniques
webinar vmware v-sphere performance management Challenges and Best Practices
Virtualization and cloud Computing
Virtual Machine Migration & Hypervisors
Redesigning Xen Memory Sharing (Grant) Mechanism
Virtual Asymmetric Multiprocessor for Interactive Performance of Consolidated...
Virtualization & Network Connectivity
Introduction to Virtualization, Virsh and Virt-Manager
Virtualization Technology Overview
Virtualization 101 - DeepDive
cloud computing: Vm migration
Ad

Viewers also liked (18)

PDF
Introduction to virtualization
PPTX
Master VMware Performance and Capacity Management
PDF
GPU Virtualization on VMware's Hosted I/O Architecture
PDF
Task-aware Virtual Machine Scheduling for I/O Performance
PDF
가상화와 보안 발표자료
PPT
GPU Virtualization in Embedded Automotive Solutions
PPTX
VDI and Application Virtualization
PPTX
프로그래머가 몰랐던 멀티코어 CPU 이야기 - 15, 16장
PPTX
Virtual machines and their architecture
PDF
Dave Gilbert - KVM and QEMU
PPTX
QEMU - Binary Translation
PPTX
Virtual desktop infrastructure
PDF
Virtualization with KVM (Kernel-based Virtual Machine)
PPTX
PPSX
Virtualization basics
PDF
Virtualization presentation
PPT
Virtualization in cloud computing ppt
PDF
Presentation
Introduction to virtualization
Master VMware Performance and Capacity Management
GPU Virtualization on VMware's Hosted I/O Architecture
Task-aware Virtual Machine Scheduling for I/O Performance
가상화와 보안 발표자료
GPU Virtualization in Embedded Automotive Solutions
VDI and Application Virtualization
프로그래머가 몰랐던 멀티코어 CPU 이야기 - 15, 16장
Virtual machines and their architecture
Dave Gilbert - KVM and QEMU
QEMU - Binary Translation
Virtual desktop infrastructure
Virtualization with KVM (Kernel-based Virtual Machine)
Virtualization basics
Virtualization presentation
Virtualization in cloud computing ppt
Presentation
Ad

Similar to CPU Scheduling for Virtual Desktop Infrastructure (20)

PDF
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
PDF
An Updated Performance Comparison of Virtual Machines and Linux Containers
PDF
Latest (storage IO) patterns for cloud-native applications
PPTX
Operating System
PDF
VMworld 2013: How SRP Delivers More Than Power to Their Customers
PDF
AIST Super Green Cloud: lessons learned from the operation and the performanc...
PPTX
ClickOS_EE80777777777777777777777777777.pptx
PPTX
24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs
PDF
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
PDF
load-balancing-method-for-embedded-rt-system-20120711-0940
PPTX
Performance of Microservice frameworks on different JVMs
PDF
Exchange 2010 New England Vmug
PPTX
OS for AI: Elastic Microservices & the Next Gen of ML
PDF
My network functions are virtualized, but are they cloud-ready
PDF
Mastering Real-time Linux
PPTX
Surviving the Crisis With the Help of Oracle Database Resource Manager
PDF
VMworld 2013: Low-Cost, High-Performance Storage for VMware Horizon Desktops
PDF
A Review of Storage Specific Solutions for Providing Quality of Service in St...
PDF
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
PDF
A Performance Comparison of Container-based Virtualization Systems for MapRed...
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
An Updated Performance Comparison of Virtual Machines and Linux Containers
Latest (storage IO) patterns for cloud-native applications
Operating System
VMworld 2013: How SRP Delivers More Than Power to Their Customers
AIST Super Green Cloud: lessons learned from the operation and the performanc...
ClickOS_EE80777777777777777777777777777.pptx
24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
load-balancing-method-for-embedded-rt-system-20120711-0940
Performance of Microservice frameworks on different JVMs
Exchange 2010 New England Vmug
OS for AI: Elastic Microservices & the Next Gen of ML
My network functions are virtualized, but are they cloud-ready
Mastering Real-time Linux
Surviving the Crisis With the Help of Oracle Database Resource Manager
VMworld 2013: Low-Cost, High-Performance Storage for VMware Horizon Desktops
A Review of Storage Specific Solutions for Providing Quality of Service in St...
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
A Performance Comparison of Container-based Virtualization Systems for MapRed...

Recently uploaded (20)

PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
Sustainable Sites - Green Building Construction
PPTX
Lecture Notes Electrical Wiring System Components
PPT
Project quality management in manufacturing
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
composite construction of structures.pdf
PPTX
additive manufacturing of ss316l using mig welding
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
Welding lecture in detail for understanding
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
PPT on Performance Review to get promotions
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
Digital Logic Computer Design lecture notes
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Geodesy 1.pptx...............................................
PPTX
Internet of Things (IOT) - A guide to understanding
DOCX
573137875-Attendance-Management-System-original
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Sustainable Sites - Green Building Construction
Lecture Notes Electrical Wiring System Components
Project quality management in manufacturing
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Foundation to blockchain - A guide to Blockchain Tech
composite construction of structures.pdf
additive manufacturing of ss316l using mig welding
UNIT-1 - COAL BASED THERMAL POWER PLANTS
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Welding lecture in detail for understanding
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT on Performance Review to get promotions
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Digital Logic Computer Design lecture notes
Automation-in-Manufacturing-Chapter-Introduction.pdf
Geodesy 1.pptx...............................................
Internet of Things (IOT) - A guide to understanding
573137875-Attendance-Management-System-original

CPU Scheduling for Virtual Desktop Infrastructure

  • 1. CPU Scheduling for Virtual Desktop Infrastructure PhD Defense Hwanju Kim 2012-11-16
  • 2. Virtual Desktop Infrastructure (VDI) • Desktop provisioning Dedicated workstations VM VM VM VM VM - Energy wastage by idle desktops - Resource underutilization - High management cost - High maintenance cost - Low level of security + Energy savings by consolidation + High resource utilization + Low management cost (flexible HW/SW provisioning) + Low maintenance cost (dynamic HW/SW upgrade) + High level of security (centralized data containment) VM-based shared environments 2/35
  • 3. Hardware Virtual Machine Monitor (VMM) Desktop Consolidation • Distinctive workload characteristics • High consolidation ratio • 4:1~15:1 [VMware VDI], 6~8 per core [Botelho’08] • Diverse user-dependent workloads • Light users and knowledgeable workers coexist • Multi-layer mixed workloads • Multi-tasking (interactive+background) in a consolidated VM VM VM VM VM VM VM VM VM VM Mixed Interactive CPU-intensive Parallel 3/35
  • 4. VM Challenges on CPU Scheduling • Challenges due to the primary principles of VMM, compared to OS scheduling research pCPU VMM scheduler pCPU vCPU vCPU OS scheduler vCPU OS scheduler VMM vCPU vCPU OS scheduler Task Task Task Task Task TaskTask Task VMVM 1. Semantic gap ( OS independence) : Two independent scheduling layers 2. Scarce Information ( Small TCB) : Difficulty in extracting workload characteristics 3. Inter-VM fairness ( Performance isolation) : Favoring a VM must not compromise inter-VM fairness • I/O operations • Privileged instructions • Process and thread information • Inter-process communications • I/O operations and semantics • System calls • etc… Each VM is virtualized as a black box I believe I’m on a dedicated machine Lightweightness (No cross-layer optimization) Efficiency (Intelligent VMM) 4/35
  • 5. VMVM The Goals of This Thesis • The enlightened CPU scheduling of VMM for consolidated desktops • Efficient CPU management with lightweight VMM extensions VMM scheduler VMM vCPU vCPU vCPU vCPU VM Interactive workload ThreadThreadThread Background workload ThreadThreadThread VM Communicating workload Thread Thread Enlightening about diverse workload demands inside a VM Base: CPU bandwidth partitioning for performance isolation Design principles 1. OS-independence: VMM-level solutions without OS-dependent optimizations 2. Diversity: Identifying the computing demands of diverse workloads (including mixed workloads) 3. Inter-VM fairness: Performance isolation for multi-tenant environments 5/35
  • 6. Related Work Proposals References Design principles OS- independence Diversity Inter-VM fairness Proportional-share scheduling Xen, KVM, VMware ESX O X O Interactive & soft real-time scheduling [Lin et al., SC’05] [Lee et al., VEE’10] [Masrur et al., RTCSA’10] O X (User-directed, no mixed & communicating workloads) X OS-assisted scheduling [Kim et al., EuroPar’08] [Xia et al., ICPADS’09] X (OS-dependent optimization) X (No communicating workloads) O I/O-friendly scheduling [Govindan et al., VEE’07] [Ongaro et al., VEE’08] [Liao et al., ANCS’08] [Hu et al., HPDC’10] O X (Only I/O-intensive workloads) O Multiprocessor VM scheduling Relaxed coscheduling [VMware ESXi’10] [Sukwong et al., EuroSys’11] O X (No mixed workloads) O Spinlock-aware scheduling [Uhlig et al., VM’04] [Weng et al., HPDC’11] X (OS-dependent optimization) X (Only spinlock- intensive workloads) O Hybrid scheduling [Weng et al., VEE’09] O X (User-involved, no mixed workloads) O
  • 7. Overview
    • Introduction to "Task-aware VM scheduling" [Kim et al., VEE'09], [Kim et al., JPDC'11]
      + The first solution to mixed workloads in a consolidated VM
      + Simple and effective for I/O-bound interactive workloads
      - No consideration of multiprocessor VMs
      - Lacks the ability to support modern interactive workloads
    • Proposals for multiprocessor VM scheduling: efficient scheduling for multithreaded workloads hosted on multiprocessor VMs
      - "Demand-based coordinated scheduling"
      - "Virtual asymmetric multiprocessor" (an implementation extension of task-based priority boosting)
  • 8. Demand-Based Coordinated Scheduling for Multiprocessor VMs
    • How can multithreaded (communicating or parallel) workloads hosted in multiprocessor VMs be scheduled effectively?
  • 9. Why Coordinated Scheduling?
    • Uncoordinated scheduling: each vCPU is treated as an independent, time-shared entity regardless of its sibling vCPUs
    • Coordinated scheduling: sibling vCPUs are scheduled as a coordinated group by the VMM scheduler
    • Why is coordination needed?
      • Many applications are multithreaded and parallelized: multiple threads perform a job, communicating with one another to arbitrate access to shared resources
      • Uncoordinated scheduling makes inter-thread communication ineffective: a lock holder can be inactive (descheduled) while its lock waiters remain active
      • Similar to traditional job-scheduling issues in distributed environments: a multicore machine resembles a distributed environment
  • 10. Coordination Space
    • Space domain: pCPU assignment policy
      • Where is each sibling vCPU of a coordinated group assigned?
    • Time domain: preemptive scheduling policy
      • When, and which, sibling vCPUs are preemptively scheduled (e.g., co-scheduling)
  • 11. Space Domain: pCPU Assignment
    • A naive method: "balance scheduling" [Sukwong et al., EuroSys'11]
      • Spread sibling vCPUs on separate pCPUs, which raises the likelihood of co-scheduling (probabilistic co-scheduling)
      • No coordination in the time domain
    • Limitation: the unrealistic assumption that "CPU load is well balanced"
      • In practice, VMs with equal CPU shares have different numbers of vCPUs, different thread-level parallelism, and phase-changing multithreaded workloads, so some pCPUs become highly contended while others carry larger CPU shares
  • 12. Space Domain: pCPU Assignment
    • Proposed scheme: "load-conscious balance scheduling", a hybrid of balance scheduling and load-based assignment (a minimal sketch follows)
      • If no candidate pCPU is overloaded, use balance scheduling
      • Otherwise, use load-based assignment
    • Example: candidate pCPU set = {pCPU0, pCPU1, pCPU2, pCPU3}; pCPU3 is overloaded (i.e., its CPU load > average CPU load), so the scheduler assigns the waking vCPU to the lowest-loaded pCPU among the rest
    • How about contention between sibling vCPUs? Passed on to coordination in the time domain
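    To make the assignment policy concrete, here is a minimal C sketch of load-conscious balance scheduling. The helpers (pcpu_load(), avg_pcpu_load(), sibling_on()) are hypothetical stand-ins, not Xen's actual scheduler API:

        /* Minimal sketch of load-conscious balance scheduling; helper
         * functions are hypothetical stand-ins for the VMM's internals. */
        #define NR_PCPUS 4

        extern unsigned long pcpu_load(int pcpu);    /* current load of a pCPU */
        extern unsigned long avg_pcpu_load(void);    /* system-wide average */
        extern int sibling_on(int pcpu);             /* 1 if a sibling vCPU of
                                                        the waking VM is there */

        /* Pick a pCPU for a waking vCPU. */
        int assign_pcpu(void)
        {
            int p, best = -1;

            /* Balance scheduling: prefer non-overloaded pCPUs that hold no
             * sibling vCPU, so siblings spread out and co-scheduling of the
             * whole VM becomes likely. */
            for (p = 0; p < NR_PCPUS; p++) {
                if (sibling_on(p) || pcpu_load(p) > avg_pcpu_load())
                    continue;
                if (best < 0 || pcpu_load(p) < pcpu_load(best))
                    best = p;
            }
            if (best >= 0)
                return best;

            /* Load-based assignment: every balanced candidate is overloaded,
             * so take the lowest-loaded pCPU overall, even if a sibling runs
             * there; sibling contention is deferred to the time domain. */
            for (p = 0; p < NR_PCPUS; p++)
                if (best < 0 || pcpu_load(p) < pcpu_load(best))
                    best = p;
            return best;
        }

    The fallback accepts sibling stacking rather than queueing a vCPU behind an overloaded pCPU, which is exactly why the time-domain coordination on the following slides is needed.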
  • 13. Time Domain: Preemption Policy
    • What type of contention demands coordination?
      • Busy-waiting for communication (or synchronization): unnecessary CPU consumption while busy-waiting for a descheduled (inactive) vCPU causes significant performance degradation
    • Why is this serious in multiprocessor VMs? The semantic gap: OSes make liberal use of busy-waiting (e.g., spinlocks) because they believe their CPUs are always online (i.e., dedicated)
    • "Demand-based coordinated scheduling" must answer three questions:
      • When and where is coordination demanded?
      • Does busy-waiting really matter?
      • How can the coordination demand be detected?
  • 14. Time Domain: Preemption Policy
    • When and where is coordination demanded? Experimental analysis
      • 13 emerging multithreaded applications from the PARSEC suite, with diverse characteristics
      • Metric: kernel time ratio under consolidation, since busy-waiting occurs in kernel space
      • Setup: a VM with 8 vCPUs on 8 pCPUs; solorun (no consolidation) vs. corun (with one VM running streamcluster)
    • (Figure: kernel vs. user CPU time per application, solorun and corun) The kernel time ratio is amplified by 1.3x~30x under consolidation
  • 15. Time Domain: Preemption Policy
    • Where is the kernel time amplified? CPU cycles per function (share of total kernel CPU cycles in parentheses):
      • TLB shootdown: dedup 43% (of 83%), ferret 9% (of 11%), vips 41% (of 47%)
      • Lock spinning: bodytrack 5% (of 8%), canneal 4% (of 5%), dedup 36% (of 83%), facesim 4% (of 5%), streamcluster 10% (of 11%), swaptions 5% (of 6%), vips 4% (of 47%), x264 7% (of 8%)
  • 16. Time Domain: Preemption Policy
    • TLB shootdown: notification of TLB invalidation to a remote CPU
      • The TLB (Translation Lookaside Buffer) is a per-CPU cache of virtual address mappings
      • When a thread modifies or unmaps a shared mapping (V->P1 becomes V->P2 or V->Null), it sends an inter-processor interrupt (IPI) and busy-waits until all corresponding remote TLB entries are invalidated
      • Efficient in native systems, but not in virtualized systems when the target vCPUs are not scheduled
    • "A TLB shootdown IPI is a signal for coordination demand!" Co-schedule the IPI-recipient vCPUs with the sender vCPU (see the sketch below)
    • (Figure: TLB shootdown IPI traffic, IPIs/sec/vCPU, across the PARSEC applications)
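    As a hedged illustration of the detection path, the sketch below treats an intercepted guest IPI carrying a TLB-shootdown vector as a co-scheduling request. The vector value and all helpers are assumptions standing in for the VMM's virtual-interrupt emulation (Linux's actual invalidate vectors vary by kernel version):

        /* Minimal sketch: a TLB-shootdown IPI as a co-scheduling signal. */
        #define TLB_SHOOTDOWN_VECTOR 0xfd   /* assumed guest vector number */

        struct vcpu;
        extern int  vcpu_is_running(struct vcpu *v);
        extern int  pcpu_of(struct vcpu *v);
        extern void schedule_urgently_on(int pcpu, struct vcpu *v);

        /* Called when the VMM emulates delivery of a guest-to-guest IPI. */
        void on_guest_ipi(struct vcpu *sender, struct vcpu *dest, int vector)
        {
            (void)sender;
            if (vector != TLB_SHOOTDOWN_VECTOR)
                return;
            /* The sender busy-waits until every recipient acknowledges the
             * flush, so a descheduled recipient stalls the sender; run the
             * recipient immediately alongside the sender. */
            if (!vcpu_is_running(dest))
                schedule_urgently_on(pcpu_of(dest), dest);
        }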
  • 17. Time Domain: Preemption Policy
    • Lock spinning: which spinlocks show dominant wait time?
      • (Figure: spinlock wait time breakdown per application) The futex wait-queue lock dominates (81~89%), followed by the semaphore wait-queue lock
      • Futex: kernel support for user-level synchronization (e.g., mutex, barrier, condvar)
    • Lock-holder preemption (LHP): in futex_wake(), the waking vCPU grabs the wait-queue spinlock (spin_lock(queue->lock)) while waking the waiter; if that vCPU is preempted inside this critical section, the woken vCPU later busy-waits on the preempted spinlock when it enters the futex code itself
    • "A reschedule IPI is a signal for coordination demand!" Delay preemption of an IPI-sending vCPU until the likely-held spinlock is released
  • 18. Time Domain: Preemption Policy
    • Proposed scheme: urgent vCPU first (UVF) scheduling (a sketch follows)
      • Urgent vCPUs are placed in a per-pCPU urgent queue and served in FIFO order ahead of the runqueue's proportional-shares order, as long as inter-VM fairness is kept
      • A vCPU in the urgent state is protected from preemption during an urgent time slice (utslice)
    • The utslice must be long enough for a reschedule-IPI sender to release its spinlock, yet short enough to quickly serve multiple urgent vCPUs
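    A minimal sketch of the UVF pick-next decision, assuming hypothetical queue helpers and a simplified inter-VM fairness check:

        /* Minimal sketch of urgent vCPU first (UVF) scheduling. */
        #include <stdbool.h>

        struct vcpu { unsigned long utslice_left_us; };
        struct queue;
        extern bool         queue_empty(struct queue *q);
        extern struct vcpu *dequeue_fifo(struct queue *q);      /* urgent queue */
        extern struct vcpu *dequeue_by_shares(struct queue *q); /* runqueue */
        extern void         enqueue_by_shares(struct queue *q, struct vcpu *v);
        extern bool         within_fair_share(struct vcpu *v);  /* fairness check */

        struct queue *urgent_q, *run_q;     /* per-pCPU queues */

        struct vcpu *pick_next(void)
        {
            /* Urgent vCPUs run first, in FIFO order, each protected from
             * preemption for one utslice, but only while their VM stays
             * within its fair CPU share. */
            while (!queue_empty(urgent_q)) {
                struct vcpu *v = dequeue_fifo(urgent_q);
                if (within_fair_share(v)) {
                    v->utslice_left_us = 500;   /* 500us: long enough to release
                                                   a spinlock, short enough to
                                                   serve several urgent vCPUs */
                    return v;
                }
                enqueue_by_shares(run_q, v);    /* over its share: demote */
            }
            return dequeue_by_shares(run_q);    /* proportional-shares order */
        }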
  • 19. Evaluation
    • Utslice parameter, part 1: utslice for reducing LHP
      • Workloads: a futex-intensive workload (bodytrack, facesim, or streamcluster) in one VM, plus dedup in another VM as a preempting VM
      • (Figure: futex wait-queue LHP count vs. utslice) A utslice above 300us reduces LHP by about 2x~3.8x
      • The remaining LHPs occur during local wake-up or before reschedule-IPI transmission, and are unlikely to lead to lock contention
  • 20. Evaluation
    • Utslice parameter, part 2: utslice for quickly serving multiple urgent vCPUs
      • Workloads: 3 VMs, each running vips (a TLB-IPI-intensive application)
      • (Figure: spinlock cycles, TLB cycles, and execution time vs. utslice) As the utslice increases, TLB shootdown cycles increase, degrading execution time by up to ~11%
      • 500usec is an appropriate utslice for both LHP reduction and serving multiple urgent vCPUs
  • 21. Evaluation
    • Workload consolidation: one 8-vCPU VM plus four 1-vCPU VMs (x264)
    • Compared schemes: Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, LC-Balance+Resched-DP+TLB-Co
    • (Figure: normalized execution time of the 8-vCPU VM's workloads) Multiprocessor VMs need coordination in the time domain (~90% improvement)
    • (Figure: normalized execution time of the co-running 1-vCPU VMs) Plain balance scheduling degrades the 1-vCPU VMs by incurring unnecessary contention
  • 22. Summary
    • Contributions
      • Load-conscious balance scheduling: essential for heterogeneously consolidated environments, where load imbalance usually takes place
      • IPI-driven coordinated scheduling: lets the VMM alleviate unnecessary CPU contention based on IPIs between sibling vCPUs
    • Future work
      • Combining this scheduling-based method with contention-management methods such as paravirtual spinlocks and HW-based spin detection
  • 23. Virtual Asymmetric Multiprocessor for User-Interactive Performance
    • How can the performance of user-interactive workloads mixed into multiprocessor VMs be improved?
  • 24. Motivation
    • Background and idea
      • The initial proposal of task-aware scheduling did not consider multiprocessor VMs
      • Existing VMM schedulers give each VM an illusion of a symmetric multiprocessor (virtual SMP): in the absence of mixed-workload tracking, vCPUs hosting interactive and background tasks are equally contended regardless of user interactions
    • Proposal: virtual AMP (vAMP), where the "size" of a vCPU is its amount of CPU shares; vCPUs hosting interactive work become fast vCPUs, and those hosting background work become slow vCPUs
  • 25. Workload Classification
    • Previous methods
      • Time-quanta-based classification: "interactive workloads typically show short time quanta"
        + Clear classification between I/O-bound and CPU-bound tasks
        - Modern interactive workloads show mixed behaviors
        - A multithreaded CPU-bound job also shows short time quanta due to inter-thread communication
      • OS technique: user-I/O-driven IPC tracking [Zheng et al., SIGMETRICS'10] (e.g., X server -> Terminal -> Firefox forming an interactive task group)
        + Identifies the set of tasks involved in a user interaction (I/O)
        - Relies on various OS-level IPC structures (e.g., sockets, pipes, signals), which the VMM cannot access
  • 26. Workload Classification
    • Proposed scheme: "background workload identification"
      • Instead of tracking interactive workloads, identify "background CPU noise" at the time of a user I/O
    • Rationales
      • Interactive CPU load is typically initiated by user I/O
      • The VMM can unobtrusively monitor user I/O and per-task CPU load
    • Exceptional case: multimedia workloads (e.g., video playback)
      • Filter multimedia tasks out of the background set: tasks requesting audio I/O are exempted
  • 27. Virtual Asymmetric Multiprocessor
    • vAMP: dynamically adjust the CPU shares of a vCPU according to the task it currently hosts
      1. Maintain per-task CPU load during the pre-I/O period, which is set shorter than general user think time (1 second by default)
      2. Tag tasks that have generated nontrivial CPU load as background tasks; the threshold can be set to filter daemon tasks that may serve interactive workloads
      3. Dynamically adjust each vCPU's shares based on a weight ratio (e.g., background : non-background = 1:5)
      4. Provide vAMP during an interactive episode, which restarts on another user I/O and finishes when a maximum time elapses without user I/O
    • A minimal sketch of step 3 follows
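    A minimal sketch of step 3: the weight adjustment the VMM could apply whenever it observes a guest task switch (e.g., via a CR3 update). The structures, the 1:5 weights, and the hooks are illustrative, not the thesis implementation:

        /* Minimal sketch of vAMP share adjustment on a guest task switch. */
        struct task_info { int is_background; };     /* tagged in step 2 */
        struct vcpu_s    { unsigned int weight; };

        #define BG_WEIGHT     1
        #define NON_BG_WEIGHT 5     /* background : non-background = 1 : 5 */

        extern int interactive_episode_active(void);

        void on_guest_task_switch(struct vcpu_s *v, struct task_info *next)
        {
            if (!interactive_episode_active()) {
                v->weight = NON_BG_WEIGHT;   /* vSMP: all vCPUs identical */
                return;
            }
            /* vAMP: a vCPU hosting a tagged background task becomes slow,
             * any other vCPU becomes fast. */
            v->weight = next->is_background ? BG_WEIGHT : NON_BG_WEIGHT;
        }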
  • 28. Limitation
    • An intrinsic limitation of the VMM-only approach: it manipulates only a single scheduling layer (the VMM scheduler)
    • The guest OS scheduler is vAMP-oblivious: agnostic of the underlying asymmetry (it sees all vCPUs as identical), it may multiplex interactive and background tasks on the same vCPU
    • A slow vCPU has higher scheduling latency, so frequent multiplexing can offset the benefit of vAMP
    • (Figure: a scheduling trace of background and non-background tasks during a Google Chrome launch) An aggressive weight ratio is not always effective when multiplexing happens frequently; the weight ratio is an important parameter for interactive performance
  • 29. Guest OS Extension
    • OS enlightenment about vAMP: avoid ineffective multiplexing of interactive and background tasks on the same vCPU through isolation
    • Design principles
      • Keep the VMM OS-independent: the extension is optional, for further enhancement of interactive performance
      • Keep the extension OS-independent: no reliance on specific OS functionality; isolating tasks on separate CPUs (e.g., by modifying CPU affinity) is a general interface of commodity OSes
      • Small kernel changes, for low maintenance cost
  • 30. Guest OS Extension
    • Linux extension for vAMP
      • A user-level vAMP-daemon isolates the background tasks exposed by the VMM from non-background tasks
      • Small kernel changes expose the background task list to user space: the daemon is woken through an event-driven input interface, reads the list via procfs, and isolates the tasks via the cpuset interface
    • Isolation procedure (a sketch follows)
      1. Initially dedicate nr_fast_vcpus to interactive (i.e., non-background) tasks; the default nr_fast_vcpus is 1, owing to the low thread-level parallelism of interactive workloads [Blake et al., ISCA'10]
      2. Periodically increase nr_fast_vcpus when the fast vCPUs become fully utilized, and periodically check for the end of the interactive episode to stop isolation
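    The isolation step might look like this user-level sketch. The procfs path written by the kernel change is hypothetical; sched_setaffinity() is Linux's standard affinity interface, used here in place of the cpuset interface for brevity:

        /* Minimal user-level sketch of one vAMP-daemon isolation pass. */
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>
        #include <sys/types.h>

        #define NR_VCPUS      8
        #define NR_FAST_VCPUS 1   /* default: interactive workloads have low TLP */

        static void pin_to_slow_vcpus(pid_t pid)
        {
            cpu_set_t set;
            int cpu;

            CPU_ZERO(&set);
            for (cpu = NR_FAST_VCPUS; cpu < NR_VCPUS; cpu++)
                CPU_SET(cpu, &set);          /* every vCPU beyond the fast ones */
            sched_setaffinity(pid, sizeof(set), &set);
        }

        int main(void)
        {
            /* Hypothetical file through which the small kernel change relays
             * the background-task IDs received from the VMM. */
            FILE *f = fopen("/proc/vamp/background_tasks", "r");
            int pid;

            if (!f)
                return 1;
            while (fscanf(f, "%d", &pid) == 1)
                pin_to_slow_vcpus(pid);      /* fast vCPUs stay interactive-only */
            fclose(f);
            return 0;
        }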
  • 31. Evaluation
    • Application launch with a background workload: a data-mining application (freqmine) with 8 threads in an 8-vCPU VM on 8 pCPUs, launch times measured through a remote desktop client
    • Weight ratios (background : non-background): vAMP(L)=1:3, vAMP(M)=1:9, vAMP(H)=1:18
    • (Figure: normalized average launch time of Impress, Firefox, Chrome, and Gimp)
      • vAMP improves launch performance by 7~40%
      • A high weight ratio is ineffective because of the negative effect of multiplexing
      • The guest OS extension improves interactive performance further, by up to 70%
      • Why did Gimp show significant improvement even without the guest OS extension? (next slide)
  • 32. Evaluation
    • Application launch: Chrome vs. Gimp (without the guest OS extension)
      • Chrome (web browser): many threads are cooperatively scheduled in a fine-grained manner, so background and non-background tasks are frequently multiplexed
      • Gimp (image-editing program): a single thread dominates the computation with little communication, so it benefits from a fast vCPU even without isolation
  • 33. Evaluation
    • Media player: VLC playing a 1920x800 HD video at 23.976 frames per second (FPS), against freqmine in another 8-vCPU VM on 8 pCPUs; "Mult" denotes multimedia workload filtering
    • (Figure: average FPS per scheme)
      • Without multimedia workload filtering, VLC is misidentified as a background task
      • vAMP improves playback quality to up to 22.3 FPS, but a high weight ratio still degrades the quality
      • The guest OS extension achieves 23.8 FPS
  • 34. Summary
    • vAMP: dynamically varying vCPU performance based on the workloads the vCPUs currently host is a feasible method of improving interactive performance
    • Assisted by a simple guest OS extension: isolating the different types of workloads enhances the effectiveness of vAMP
    • Future work: collaboration between the VMM and OSes for vAMP through a standard, well-defined API
  • 35. Conclusions
    • Lessons learned from the thesis
      • In-depth analysis of OSes and workloads can realize intelligent CPU scheduling based only on VMM-visible events, achieving both a lightweight VMM and efficiency
      • Task-awareness is an essential ability for the VMM to handle mixed workloads effectively: multi-tasking is ubiquitous inside every VM
      • Coordinated scheduling improves the CPU efficiency of multiprocessor VMs: resolving unnecessary CPU contention is crucial
  • 36. Publications
    • Task-aware VM scheduling
      • [VEE'09] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, "Task-aware Virtual Machine Scheduling for I/O Performance"
      • [JPDC'11] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, Seungryoul Maeng, "Transparently Bridging Semantic Gap in CPU Management for Virtualized Environments"
      • [MMSys'12] Hwanju Kim, Jinkyu Jeong, Jaeho Hwang, Joonwon Lee, Seungryoul Maeng, "Scheduler Support for Video-oriented Multimedia on Client-side Virtualization"
      • [ApSys'12] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, "Virtual Asymmetric Multiprocessor for Interactive Performance of Consolidated Desktops"
    • Demand-based coordinated scheduling
      • [ASPLOS'13] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, "Demand-Based Coordinated Scheduling for SMP VMs"
    • Other work on virtualization
      • [IEEE TC'11] Hwanju Kim, Heeseung Jo, and Joonwon Lee, "XHive: Efficient Cooperative Caching for Virtual Machines"
      • [IEEE TC'10] Heeseung Jo, Hwanju Kim, Jae-Wan Jang, Joonwon Lee, and Seungryoul Maeng, "Transparent Fault Tolerance of Device Drivers for Virtual Machines"
      • [MICRO'10] Daehoon Kim, Hwanju Kim, and Jaehyuk Huh, "Virtual Snooping: Filtering Snoops in Virtualized Multi-cores"
      • [VHPC'11] Sangwook Kim, Hwanju Kim, and Joonwon Lee, "Group-Based Memory Deduplication for Virtualized Clouds"
      • [Euro-Par'08] Dongsung Kim, Hwanju Kim, Myeongjae Jeon, Euiseong Seo, Joonwon Lee, "Guest-Aware Priority-based Virtual Machine Scheduling for Highly Consolidated Server"
      • [VHPC'09] Heeseung Jo, Youngjin Kwon, Hwanju Kim, Euiseong Seo, Joonwon Lee, Seungryoul Maeng, "SSD-HDD-Hybrid Virtual Disk in Consolidated Environments"
    • Other work on embedded and mobile systems
      • [ACM TECS'12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, "Rigorous Rental Memory Management for Embedded Systems"
      • [CASES'12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, "DaaC: Device-reserved Memory as an Eviction-based File Cache"
      • [IEEE TCE'09] Heeseung Jo, Hwanju Kim, Hyun-Gul Roh, Joonwon Lee, "Improving the Startup Time of Digital TV"
      • [IEEE TCE'09] Heeseung Jo, Hwanju Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, "Optimizing the Startup Time of Embedded Systems: A Case Study of Digital TV"
      • [IEEE TCE'10] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jin-Soo Kim, and Joonwon Lee, "AppWatch: Detecting Kernel Bug for Protecting Consumer Electronics Applications"
      • [IEEE TCE'12] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jeonghwan Choi, and Joonwon Lee, "Compressed Memory Swap for QoS of Virtualized Embedded Systems"
      • [SPE'10] Jinkyu Jeong, Euiseong Seo, Jeonghwan Choi, Hwanju Kim, Heeseung Jo, and Joonwon Lee, "KAL: Kernel-assisted Non-invasive Memory Leak Tolerance with a General-purpose Memory Allocator"
  • 38. References
    • [Blake et al., ISCA'10] Evolution of thread-level parallelism in desktop applications
    • [Botelho'08] Virtual machines per server, a viable metric for hardware selection? (http://itknowledgeexchange.techtarget.com/server-farm/virtual-machines-per-server-a-viable-metric-for-hardware-selection/)
    • [Govindan et al., VEE'07] Xen and co.: communication-aware CPU scheduling for consolidated xen-based hosting platforms
    • [Hu et al., HPDC'10] I/O scheduling model of virtual machine based on multi-core dynamic partitioning
    • [Kim et al., EuroPar'08] Guest-Aware Priority-Based Virtual Machine Scheduling for Highly Consolidated Server
    • [Kim et al., VEE'09] Task-aware virtual machine scheduling for I/O performance
    • [Kim et al., JPDC'11] Transparently Bridging Semantic Gap in CPU Management for Virtualized Environments
    • [Lee et al., VEE'10] Supporting Soft Real-Time Tasks in the Xen Hypervisor
    • [Liao et al., ANCS'08] Software techniques to improve virtualized I/O performance on multi-core systems
    • [Lin et al., SC'05] VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling
    • [Masrur et al., RTCSA'10] VM-Based Real-Time Services for Automotive Control Applications
    • [Ongaro et al., VEE'08] Scheduling I/O in virtual machine monitors
    • [Sukwong et al., EuroSys'11] Is co-scheduling too expensive for SMP VMs?
    • [Uhlig et al., VM'04] Towards scalable multiprocessor virtual machines
    • [VMware ESXi'10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1
    • [VMware VDI] Enabling your end-to-end virtualization solution (http://www.vmware.com/solutions/partners/alliances/hp-vmware-customers.html)
    • [Weng et al., HPDC'11] Dynamic adaptive scheduling for virtual machines
    • [Weng et al., VEE'09] The hybrid scheduling framework for virtual machine systems
    • [Xia et al., ICPADS'09] PaS: A Preemption-aware Scheduling Interface for Improving Interactive Performance in Consolidated Virtual Machine Environment
    • [Zheng et al., SIGMETRICS'10] RSIO: automatic user interaction detection and scheduling
  • 41. Proportional-Share Scheduler
    • Proportional-share scheduling for SMP VMs: the common scheduler of commodity VMMs, employed by KVM, Xen, VMware ESX, etc.
      • VM's shares S = total shares x (weight / total weight)
      • vCPU's shares = S / number of active (non-idle) vCPUs
    • Example: a 4-vCPU VM with S = 1024 gives 1024 shares to one vCPU under a single-threaded workload, or 256 shares to each of four vCPUs under a multi-threaded (programmed) workload
    • Existing schedulers thus view active vCPUs as containers of identical power: symmetric vCPUs
  • 42. Helping Lock
    • Spin-then-block lock [AMD, XenSummit'08]: block after spinning for a certain period
      + Reduces unnecessary spinning
      - Still suffers LHP and some unnecessary spinning
      - Requires profiling to find a suitable spin threshold, and kernel instrumentation
    • Nevertheless, the most popular paravirtualized approach for open-source kernels like Linux: paravirt-spinlock for Xen Linux (mainline) and for KVM Linux (patch)
  • 43. Coordination for User-level Contention
    • User-level synchronization: pure spin-based synchronization is rarely used in user space; block-based or spin-then-block synchronization dominates
    • Reschedule-IPI-driven co-scheduling: with spin-then-block synchronization, co-scheduling cooperative threads reduces contention
    • (Figures: reschedule IPI traffic of streamcluster; execution time of streamcluster consolidated with bodytrack) Streamcluster intensively uses spin-then-block barriers, and Resched-Co alleviates the spin phase of the lock wait time
  • 44. Performance on PLE
    • PLE (Pause-Loop Exiting): a HW mechanism that notifies the VMM of spinning beyond a predefined threshold (i.e., pathological busy-waiting); in response, the VMM makes the currently running vCPU yield its pCPU
    • (Figures: facesim, futex-intensive; ferret, TLB-IPI-intensive) IPI-driven scheduling proactively alleviates unnecessary contention, whereas PLE reactively relieves contention that has already happened
  • 45. Evaluation: Urgent Allowance
    • Urgent allowance: trading short-term fairness for CPU efficiency
    • How much short-term fairness is traded? Workload: 1 vips VM + 2 facesim VMs
    • (Figure) Trading short-term fairness improves overall efficiency without a negative impact on long-term fairness
  • 46. Evaluation: Two Multiprocessor VMs
    • Corun with dedup and with freqmine, compared against solorun
    • Schemes: a: baseline, b: balance, c: LC-balance, d: LC-balance+Resched-DP, e: LC-balance+Resched-DP+TLB-Co
    • (Figures: execution timelines per scheme)
  • 47. TLB Shootdown IPIs of Windows 7
    • Windows 7 desktop application launches make heavy use of TLB shootdown IPIs, mostly sent by multi/broadcasting:
      • Explorer: 102 triggers, 608 IPIs, 622 ms launch time
      • IE: 262 triggers, 1230 IPIs, 982 ms launch time
      • PowerPoint: 166 triggers, 782 IPIs, 975 ms launch time
      • Word: 179 triggers, 990 IPIs, 1108 ms launch time
      • Excel: 77 triggers, 418 IPIs, 1011 ms launch time
    • TLB-IPI-driven co-scheduling improves PowerPoint launch time by 23% when consolidated with 4 VMs, each running streamcluster
  • 48. Virtual Asymmetric Multiprocessor for User-Interactive Performance
  • 49. Multimedia Workload Filtering
    • Tracking audio-requesting tasks: tasks that access a virtual audio device, excluding audio access in an interrupt context (checked via the audio Interrupt Service Register, ISR)
    • Server-client sound systems: a user-level task serves all audio requests (e.g., pulseaudio), so remote wake-up tracking is needed
    • (Figure: one VM running VLC+facesim and one VM running freqmine; facesim severely interferes with remote wake-up tracking)
  • 50. Measurement Methodology
    • Spiceplay: snapshot-based record/replay, robust against varying loads; similar to VNCPlay [USENIX'05] and Deskbench [IM'09], implemented as an extension of the SPICE remote desktop client
      • Record: snapshot at an input point -> input recording -> snapshot at a user-perceived completion point
      • Replay: snapshot comparison and start timer -> input replaying -> snapshot comparison and stop timer
  • 51. vAMP Parameters
    • Default vAMP parameters
      • Background load threshold (for tagging background tasks): 50%, large enough to filter general daemon tasks such as an X server
      • Maximum time of an interactive episode (duration of distributing asymmetric CPU shares): 5 sec, large enough to cover a general interactive episode (2 sec was used in previous HCI-based research, but a larger value is needed to cover long-launching applications)
    • (Figure: video playback under vAMP(L) w/ Ext) With bgload_thresh=5, the X server is misclassified as a background task
    • (Figure: Gimp launch under vAMP(L) w/ Ext) With max_intr_episode=2sec, the interactive episode finishes prematurely before the end of the launch
  • 52. Evaluation: Background Performance
    • Performance of the background workloads under repeated launches at 1-second intervals, i.e., intensively interactive workloads
    • (Figure: normalized average execution time for Impress, Firefox, Chrome, and Gimp) 3~28% degradation
  • 53. Evaluation: Guest OS Extension
    • Interrupt pinning: an interactive workload can accompany I/O; even a warm launch can involve synchronous disk writes
    • During an interactive episode, pin I/O interrupts on fast vCPUs; in Linux, manipulate /proc/irq/<irq number>/smp_affinity (a sketch follows)
    • (Figure: Chrome launch time with and without pinning) Chrome's launch entails some synchronous writes; if a disk I/O interrupt is delivered to a slow vCPU, scheduling latency increases
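    A small sketch of the pinning step from inside the guest. The IRQ number and mask are illustrative; /proc/irq/<n>/smp_affinity itself is Linux's standard IRQ-affinity interface:

        /* Minimal sketch: steer an I/O interrupt onto the fast vCPU(s). */
        #include <stdio.h>

        int pin_irq_to_fast_vcpus(int irq, unsigned int mask)
        {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
            f = fopen(path, "w");
            if (!f)
                return -1;
            fprintf(f, "%x\n", mask);   /* e.g., 0x1 = vCPU0, the fast vCPU */
            fclose(f);
            return 0;
        }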
  • 54. Evaluation: Guest OS Extension
    • nr_fast_vcpus parameter: the initial number of fast vCPUs
    • (Figure: normalized average launch time for nr_fast_vcpus = 1, 2, 4) Interactive workloads with low thread-level parallelism do not require many initial fast vCPUs; such a workload is even adversely affected by multiple fast vCPUs, since unnecessary vCPU-level scheduling latency is involved
  • 55. Task-aware VM Scheduling for I/O Performance
  • 56. Problem of VM Scheduling
    • Task-agnostic scheduling: the VMM's run queue is sorted only by CPU fairness
    • When an I/O event arrives for an I/O-bound task in a VM whose priority is currently low (because a CPU-bound task in the same VM consumed its share), the VMM does not even know the event is destined for an I/O-bound task and cannot schedule that VM promptly
  • 57. Task-agnostic Scheduling
    • A worst-case example with 6 consolidated VMs: network response time
      • Native Linux: non-consolidated OS; XenoLinux: consolidated OS on Xen
      • I/O+CPU workload: 1 VM runs a server and a CPU-bound task; 5 VMs run CPU-bound tasks
      • I/O workload: 1 VM runs a server; 5 VMs run CPU-bound tasks
    • The boosting mechanism of the Xen Credit scheduler helps the I/O-only case, but responsiveness remains poor for the mixed case: boosting recognizes I/O-boundness only at vCPU granularity
  • 58. Task-aware VM Scheduling
    • Goals
      • Track I/O-boundness at task granularity
      • Improve the response time of I/O-bound tasks
      • Keep inter-VM fairness
    • Challenges
      1. I/O-bound task identification
      2. I/O event correlation
      3. Partial boosting
  • 59. Task-aware VM Scheduling: 1. I/O-bound Task Identification
    • Observable information at the VMM: I/O events, task switching events [Jones et al., USENIX'06], and the CPU time quantum of each task (e.g., on Intel x86, a task's quantum is delimited by CR3 updates)
    • Inference based on common OS techniques: general OS techniques (Linux, Windows, FreeBSD, ...) infer and favor I/O-bound tasks by
      1. A small CPU time quantum (main evidence)
      2. Preemptive scheduling in response to I/O events (supportive evidence)
  • 60. Task-aware VM Scheduling: 1. I/O-bound Task Identification
    • Three disjoint observation classes
      • Positive evidence (supports I/O-boundness): a small CPU time quantum followed by I/O-driven preemptive scheduling
      • Negative evidence (supports non-I/O-boundness): the small-quantum condition is violated, with more penalty for longer time quanta
      • Ambiguity: no evidence either way
    • Weighted evidence accumulation: the degree of belief grows with the number of sequential positive observations; once it crosses a threshold, the task is believed to be I/O-bound (a sketch follows)
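    A minimal sketch of the evidence accumulation; the quantum cutoff, weights, and belief threshold are illustrative, since the exact constants are not given here:

        /* Minimal sketch of weighted evidence accumulation for I/O-boundness. */
        struct task_belief { long score; };

        #define QUANTUM_SHORT_US 1000   /* assumed "small time quantum" cutoff */
        #define BELIEF_THRESHOLD 8      /* assumed degree-of-belief threshold  */

        /* Called at each task switch the VMM observes (e.g., a CR3 update). */
        void observe(struct task_belief *t, long quantum_us, int preempted_by_io)
        {
            if (quantum_us <= QUANTUM_SHORT_US) {
                if (preempted_by_io)
                    t->score += 2;   /* positive: short quantum + I/O preemption */
                /* else: ambiguity, no evidence either way */
            } else {
                /* Negative evidence, weighted: the longer the quantum, the
                 * larger the penalty against I/O-boundness. */
                t->score -= quantum_us / QUANTUM_SHORT_US;
                if (t->score < 0)
                    t->score = 0;
            }
        }

        int is_io_bound(const struct task_belief *t)
        {
            return t->score >= BELIEF_THRESHOLD;
        }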
  • 61. Task-aware VM Scheduling: 2. I/O Event Correlation
    • Purpose: distinguish incoming events destined for I/O-bound tasks, so that the VMM can selectively prioritize those tasks within a VM (CPU-bound tasks also conduct I/O operations)
    • Goal: best-effort correlation, favoring lightweight operation over accuracy
    • I/O types: block I/O (disk read) and network I/O (packet reception)
  • 62. Task-aware VM Scheduling: 2. I/O Event Correlation: Block I/O
    • Request-response correlation with an inspection window
      • The guest OS can delay read requests (e.g., in its block I/O scheduler), so the VMM remembers the last few tasks that issued reads and, when a read event completes, checks whether any I/O-bound task is in the window
      • Overhead per vCPU = window size x 4 bytes (task ID); a sketch follows
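    A minimal sketch of the inspection window, assuming a hypothetical task_is_io_bound() backed by the identification scheme above:

        /* Minimal sketch of window-based block I/O correlation. */
        #define WINDOW_SIZE 3      /* the value chosen in the evaluation */

        static unsigned int window[WINDOW_SIZE];   /* recent requesters' task IDs */
        static int widx;

        extern int task_is_io_bound(unsigned int task_id);

        /* Record each actual read request the VMM sees. */
        void on_read_request(unsigned int task_id)
        {
            window[widx] = task_id;                /* ring buffer of requesters */
            widx = (widx + 1) % WINDOW_SIZE;
        }

        /* On a disk-read completion: is any recent requester I/O-bound?
         * The window covers reads the guest's block I/O scheduler delayed. */
        int read_event_is_for_io_bound(void)
        {
            int i;
            for (i = 0; i < WINDOW_SIZE; i++)
                if (task_is_io_bound(window[i]))
                    return 1;                      /* trigger partial boosting */
            return 0;
        }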
  • 63. Task-aware VM Scheduling: 2. I/O Event Correlation: Network I/O
    • History-based prediction for asynchronous packet reception: monitor "the first task woken up" in response to an incoming packet
    • An N-bit saturating counter per destination port number (the portmap): the counter moves toward "strong I/O-bound" when the first woken task is I/O-bound, and toward "non-I/O-bound" otherwise; if the counter's MSB is set, the packet is predicted to be for an I/O-bound task
    • Example: a 2-bit counter with states 00 (non-I/O-bound), 01 (weak I/O-bound), 10 (I/O-bound), 11 (strong I/O-bound); overhead per VM = N x 8KB (a sketch follows)
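    A minimal sketch of a 2-bit portmap. Each counter occupies a byte here for clarity; packing two bits per port would match the N x 8KB footprint:

        /* Minimal sketch of the per-port 2-bit saturating counter (portmap). */
        #include <stdint.h>

        static uint8_t portmap[65536];   /* one 2-bit counter per dest port */

        /* Train: after a packet arrives, observe the first woken task. */
        void train(uint16_t dst_port, int first_woken_is_io_bound)
        {
            uint8_t *c = &portmap[dst_port];
            if (first_woken_is_io_bound) {
                if (*c < 3) (*c)++;      /* saturate at 11 (strong I/O-bound) */
            } else {
                if (*c > 0) (*c)--;      /* saturate at 00 (non-I/O-bound)    */
            }
        }

        /* Predict: boost only when the MSB is set (states 10 and 11). */
        int packet_is_for_io_bound(uint16_t dst_port)
        {
            return (portmap[dst_port] & 0x2) != 0;
        }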
  • 64. Task-aware VM Scheduling: 3. Partial Boosting
    • Priority boosting with task-level granularity: borrow future time slices to promptly handle an incoming I/O event, as long as fairness is kept
    • If an I/O event is destined for a VM and is inferred to be handled by one of its I/O-bound tasks, the VMM initiates partial boosting for that VM's vCPU
    • Partial boosting lasts only while the I/O-bound task runs
  • 65. Evaluation (1/4)
    • Implementation on Xen 3.2
    • Experimental setup: Intel Pentium D for Linux (single core enabled); Intel Q6600 (VT-x) for Windows XP (single core enabled)
    • Correlation parameters, chosen for >90% accuracy and low overhead via stress tests with synthetic workloads: block I/O inspection window size = 3; network I/O portmap bit width = 2
  • 66. Evaluation (2/4)
    • Network response time; schedulers: Baseline = Xen Credit scheduler, TAVS = task-aware VM scheduler
    • Workloads: 1 VM runs a server and a CPU-bound task; 5 VMs run CPU-bound tasks
    • (Figure) Response time improves while fairness is guaranteed
  • 67. Evaluation (3/4)
    • Real workloads on Ubuntu Linux and Windows XP, mixing I/O-bound and CPU-bound tasks
    • Workloads: 1 VM runs an I/O-bound and a CPU-bound task; 5 VMs run CPU-bound tasks
    • (Figure) 12~50% I/O performance improvement with inter-VM fairness
  • 68. Evaluation (4/4)
    • I/O-bound task identification (figure)
  • 69. Client-side Scheduler Support for Multimedia Workloads
  • 70. Client-side Virtualization
    • Multiple OS instances on a local device
    • Primary use cases
      • Different OSes for application compatibility
      • Consolidating business and personal computing environments on a single device (BYOD: Bring Your Own Device), e.g., a business VM and a personal VM atop a hypervisor with a managed domain
  • 71. Multimedia on Virtualized Clients
    • Multimedia is ubiquitous in every VM: video playback, 3D games, and video conferencing run alongside compilation, data processing, and downloading
      1. Multimedia workloads are dominant on virtualized clients
      2. Interactive systems can have concurrently mixed workloads
  • 72. Issues on Multi-layer Scheduling
    • A multimedia-agnostic hypervisor invalidates OS policies for multimedia: OS schedulers give multimedia tasks a larger CPU proportion and timely dispatching (BVT [SOSP'99], SMART [TOCS'03], Rialto [SOSP'97], BEST [MMCN'02], HuC [TOMCCAP'06], Redline [OSDI'08], RSIO [SIGMETRICS'10], Windows MMCSS)
    • The hypervisor scheduler, however, is unaware of any multimedia-specific OS policies in a VM, since it sees each VM as a black box: the virtual CPU abstraction introduces a semantic gap
  • 73. Multimedia-agnostic Hypervisor
    • Multimedia QoS degradation with two VMs of equal CPU shares: a multimedia VM plus a competing VM on the Xen Credit scheduler
    • (Figures: average FPS of 720p video playback on VLC and of Quake III Arena (demo1) under various competing workloads in the other VM) Both degrade substantially
  • 74. Possible Solutions to Semantic Gap
    • Explicit OS cooperation: + accurate; - requires OS modification and is infeasible without multimedia-friendly OS schedulers
    • Explicit user involvement: + simple; - inconvenient and unsuitable for dynamic workloads
    • Implicit, hypervisor-only workload monitoring: + transparent; - difficult to identify workload demands at the hypervisor
  • 75. Proposed Approach
    • A multimedia-aware hypervisor scheduler: transparent scheduler support for multimedia with no modifications to upper-layer SW (OS and applications)
    • "Feedback-driven VM scheduling": a multimedia monitor estimates multimedia QoS from audio/video activity, and a feedback-driven multimedia manager issues scheduling commands (e.g., CPU share or priority) to the CPU scheduler
    • Challenges
      1. How to estimate multimedia QoS from a small set of HW events
      2. How to control the CPU scheduler based on the estimated information
  • 76. Multimedia QoS Estimation
    • What is estimated as multimedia QoS? The "display rate" (i.e., frame rate), as used by the HuC scheduler [TOMCCAP'06]
    • How is the display rate captured at the hypervisor? Two types of display:
      1. Memory-mapped display (e.g., video playback): the application writes frames into a memory-mapped framebuffer
      2. GPU-accelerated display (e.g., 3D games): the application renders through a graphics library and an acceleration unit
  • 77. Memory-mapped Display (1/2)
    • Estimating the display update rate on the memory-mapped framebuffer: write-protect the virtual address space mapped to the framebuffer, so each write traps into the hypervisor's page fault handler, which updates the display rate
    • The hypervisor can inspect any attempt to map framebuffer memory
    • Sampling (1/128 pages by default) reduces trap overheads (a sketch follows)
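    A minimal sketch of the sampling scheme, with hypothetical helpers standing in for the hypervisor's real shadow-page-table and fault-handling code:

        /* Minimal sketch of sampled write-protection over the framebuffer. */
        #define SAMPLE_RATIO 128   /* protect 1 of every 128 pages (default) */

        struct task_stats { unsigned long fb_writes; };
        extern void write_protect_page(unsigned long gfn);   /* hypothetical */
        extern struct task_stats *current_guest_task(void);  /* via CR3      */

        /* Protect a sample of framebuffer pages so only some writes trap. */
        void protect_framebuffer(unsigned long fb_start_gfn, unsigned long nr_pages)
        {
            unsigned long i;
            for (i = 0; i < nr_pages; i += SAMPLE_RATIO)
                write_protect_page(fb_start_gfn + i);
        }

        /* Page-fault path for a protected page: count the write against the
         * current guest task, feeding its per-task display-rate estimate;
         * the page is re-protected later so sampling continues. */
        void on_fb_write_fault(unsigned long gfn)
        {
            (void)gfn;
            current_guest_task()->fb_writes++;
        }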
  • 78. Memory-mapped Display (2/2)
    • Accurate estimation requires a display rate per task: an aggregated display rate does not represent multimedia QoS (e.g., one task at 25 FPS and another at 10 FPS)
    • Tracking guest OS tasks at the hypervisor: inspect address space switches (Antfarm [USENIX'06])
    • Monitoring audio access (RSIO [SIGMETRICS'10]): inspect audio buffer accesses with write-protection
    • A task with a high display rate and audio access is deemed a multimedia task
  • 79. GPU-accelerated Display (1/2)
    • Naive method: inspect the GPU command buffer with write-protection or polling; too heavy, due to the huge volume of GPU commands
    • Lightweight method: little overhead at the cost of some accuracy; 3D games are less sensitive to frame-rate degradation than video playback
    • GPU-interrupt-based estimation: an interrupt is typically used by an application to manage buffer memory, leading to the hypothesis that "the GPU interrupt rate is proportional to the display rate"
  • 80. GPU-accelerated Display (2/2)
    • (Figures: GPU interrupts/sec vs. FPS for Quake 3 demos on Intel GMA 950 (Apple MacBook), Nvidia 6150 Go (HP Pavilion tablet), and PowerVR (Samsung Galaxy S)) Display rates and GPU interrupt rates show a linear relationship, so a GPU interrupt rate can be used to estimate a display rate without additional overheads
    • An exponentially weighted moving average (EWMA) reduces fluctuation: EWMA_t = (1-w) x EWMA_(t-1) + w x current value
  • 81. Multimedia Manager
    • A feedback-driven CPU allocator, under the base assumption that "additional CPU share (or higher priority) improves the display rate"
    • Desired frame rate (DFR): the currently achievable display rate multiplied by a tolerable ratio (0.8)
    • Control loop: if the current FPS dropped below both the previous FPS and the DFR, increase the CPU share (exponentially in the initial phase, linearly afterwards); if three consecutive share increases bring no FPS improvement, halve the CPU share
    • Exceptional cases handled by the back-off: 1) no relationship between CPU and FPS, 2) FPS saturates below the DFR, 3) local CPU contention within the VM
    • A minimal sketch of this loop follows
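    A minimal C sketch of the feedback loop under the rules above; the step sizes and the set_cpu_share() hook are illustrative:

        /* Minimal sketch of the feedback-driven CPU allocator. */
        #define MAX_NO_IMPROVE 3

        struct mm_vm {
            double fps, prev_fps;
            double dfr;               /* achievable frame rate x 0.8 */
            int    share;             /* starts at a small nonzero value */
            int    no_improve;
            int    initial_phase;
        };

        extern void set_cpu_share(struct mm_vm *vm, int share);

        /* Called periodically with the estimated display rate. */
        void feedback_tick(struct mm_vm *vm, double estimated_fps)
        {
            vm->prev_fps = vm->fps;
            vm->fps = estimated_fps;

            if (vm->fps < vm->prev_fps && vm->fps < vm->dfr) {
                if (++vm->no_improve >= MAX_NO_IMPROVE) {
                    vm->share /= 2;      /* more CPU did not help: back off
                                            (no CPU-FPS relation, saturation,
                                            or in-VM contention) */
                    vm->no_improve = 0;
                } else if (vm->initial_phase) {
                    vm->share *= 2;      /* exponential probe */
                } else {
                    vm->share += 5;      /* linear increase */
                }
                set_cpu_share(vm, vm->share);
            } else {
                vm->no_improve = 0;
                vm->initial_phase = 0;   /* FPS recovered or met the DFR */
            }
        }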
  • 82. Priority Boosting
    • Problem: the hypervisor does not distinguish the types of events when priority-boosting, so a VM about to handle a multimedia event cannot preempt a currently running VM that is handling a normal event
    • Solution: responsive dispatching via higher priority for multimedia-related events (e.g., video, audio, one-shot timers): MMBOOST > IOBOOST > normal priority, with ties broken by remaining CPU shares
  • 83. Evaluation
    • Experimental environment: Intel MacBook with Intel GMA 950; Xen 3.4.0 with Ubuntu 8.04; implementation based on the Xen Credit scheduler
    • Two-VM scenario: one VM with direct I/O plus one with indirect (hosted) I/O; this talk presents the direct I/O case (see the paper for the indirect I/O case)
  • 84. Estimation Accuracy
    • Error rates: 0.55%~3.05%
    • (Figures: real vs. estimated FPS over time, with the multimedia manager disabled, for 720p video playback and for Quake 3 (EWMA, w=0.2), each co-running with a CPU-bound VM)
  • 85. Estimation Overhead
    • CPU overhead caused by page faults during video playback: 0.3~1% with sampling, less than 5% when tracking all pages
      • Low resolution (640x354): all pages 4.95%; sampling 1/8 pages 1.10%, 1/32 pages 0.54%, 1/128 pages 0.58%
      • High resolution (1280x720): all pages 3.91%; sampling 1/8 pages 1.04%, 1/32 pages 0.69%, 1/128 pages 0.33%
  • 86. Multimedia Manager
    • Video playback (720p) co-running with a CPU-bound VM
    • (Figure: FPS, DFR, and CPU share (%) over time)
  • 87. Performance Improvement
    • (Figures: average FPS of 720p VLC playback and Quake III Arena (demo1) under competing workloads, Credit scheduler with and without multimedia support) Performance improves to close to the maximum achievable frame rates
  • 88. Limitations & Discussion
    • Network-streamed multimedia: additional preemption support is required for multimedia-related network packets
    • Multiple multimedia workloads in a VM: the multimedia manager algorithm should be refined to satisfy the QoS of mixed multimedia workloads in the same VM
    • Adaptive management for SMP VMs: adaptive vCPU allocation based on the hosted multimedia workloads
  • 89. Conclusions
    • Demands for a multimedia-aware hypervisor: multimedia workloads are increasingly dominant in virtualized systems
    • A "multimedia-friendly hypervisor scheduler" provides transparent and lightweight multimedia support for client-side virtualization
    • Future directions: multimedia for server-side VDI, a multicore extension for SMP VMs, and considerations for network-streamed multimedia