CPU Scheduling for
Virtual Desktop Infrastructure
PhD Defense
Hwanju Kim
2012-11-16
Virtual Desktop Infrastructure (VDI)
• Desktop provisioning
Dedicated workstations:
- Energy wastage by idle desktops
- Resource underutilization
- High management cost
- High maintenance cost
- Low level of security
VM-based shared environments:
+ Energy savings by consolidation
+ High resource utilization
+ Low management cost (flexible HW/SW provisioning)
+ Low maintenance cost (dynamic HW/SW upgrade)
+ High level of security (centralized data containment)
2/35
Desktop Consolidation
• Distinctive workload characteristics
• High consolidation ratio
• 4:1~15:1 [VMware VDI], 6~8 per core [Botelho’08]
• Diverse user-dependent workloads
• Light users and knowledgeable workers coexist
• Multi-layer mixed workloads
• Multi-tasking (interactive+background) in a consolidated VM
(Diagram: VMs consolidated on a Virtual Machine Monitor (VMM) over shared hardware, hosting mixed, interactive, CPU-intensive, and parallel workloads)
3/35
Challenges on CPU Scheduling
• Challenges due to the primary principles of
VMM, compared to OS scheduling research
(Diagram: tasks are scheduled by per-VM OS schedulers onto vCPUs, which the VMM scheduler in turn maps onto pCPUs)
1. Semantic gap (from OS independence): two independent scheduling layers
2. Scarce information (from a small TCB): difficulty in extracting workload characteristics
   • VMM-visible: I/O operations, privileged instructions
   • OS-internal, invisible to the VMM: process and thread information, inter-process communications, I/O operations and their semantics, system calls, etc.
3. Inter-VM fairness (from performance isolation): favoring a VM must not compromise inter-VM fairness
Each VM is virtualized as a black box ("I believe I'm on a dedicated machine")
Goals in tension: lightweightness (no cross-layer optimization) and efficiency (an intelligent VMM)
4/35
The Goals of This Thesis
• The enlightened CPU scheduling of VMM for
consolidated desktops
• Efficient CPU management with lightweight VMM
extensions
(Diagram: the VMM scheduler manages the vCPUs of VMs hosting interactive, background, and communicating workloads)
Enlightening the VMM about the diverse workload demands inside each VM
Base: CPU bandwidth partitioning for performance isolation
Design principles
1. OS-independence: VMM-level solutions without OS-dependent optimizations
2. Diversity: Identifying the computing demands of diverse workloads (including mixed workloads)
3. Inter-VM fairness: Performance isolation for multi-tenant environments
5/35
Related Work
Design principles per proposal: OS-independence / Diversity / Inter-VM fairness
• Proportional-share scheduling (Xen, KVM, VMware ESX): O / X / O
• Interactive & soft real-time scheduling ([Lin et al., SC’05], [Lee et al., VEE’10], [Masrur et al., RTCSA’10]): O / X (user-directed; no mixed & communicating workloads) / X
• OS-assisted scheduling ([Kim et al., EuroPar’08], [Xia et al., ICPADS’09]): X (OS-dependent optimization) / X (no communicating workloads) / O
• I/O-friendly scheduling ([Govindan et al., VEE’07], [Ongaro et al., VEE’08], [Liao et al., ANCS’08], [Hu et al., HPDC’10]): O / X (only I/O-intensive workloads) / O
• Multiprocessor VM scheduling
  • Relaxed coscheduling ([VMware ESXi’10], [Sukwong et al., EuroSys’11]): O / X (no mixed workloads) / O
  • Spinlock-aware scheduling ([Uhlig et al., VM’04], [Weng et al., HPDC’11]): X (OS-dependent optimization) / X (only spinlock-intensive workloads) / O
  • Hybrid scheduling ([Weng et al., VEE’09]): O / X (user-involved; no mixed workloads) / O
Overview
• Introduction to “Task-aware VM scheduling” [Kim et al., VEE’09], [Kim et al., JPDC’11]
  + The first solution to mixed workloads in a consolidated VM
  + Simple and effective for I/O-bound interactive workloads
  - No consideration of multiprocessor VMs
  - Lacking ability to support modern interactive workloads
• Proposal for multiprocessor VM scheduling
  → Efficient scheduling for multithreaded workloads hosted on multiprocessor VMs
  • Defense: “Demand-based coordinated scheduling” for multithreaded (communicating or parallel) workloads, and “Virtual asymmetric multiprocessor” for user-interactive workloads mixed with background workloads
  • Implementation extension: task-based priority boosting
7/35
Demand-Based Coordinated Scheduling
for Multiprocessor VMs
How to effectively schedule multithreaded workloads hosted in
multiprocessor VMs?
(Diagram: a multithreaded, communicating or parallel, workload whose threads run on sibling vCPUs scheduled by the VMM over the pCPUs)
Why Coordinated Scheduling?
• Uncoordinated vs. coordinated scheduling
  • Uncoordinated scheduling: each vCPU is treated as an independent entity, time-shared on its pCPU regardless of its sibling vCPUs
  • Coordinated scheduling: sibling vCPUs are coordinated as a group by the VMM scheduler
• Why is coordination needed?
  • Many applications are multithreaded and parallelized → multiple threads perform a job communicating with each other to arbitrate accesses to shared resources
  • Under uncoordinated scheduling, a lock holder’s vCPU can be inactive while the lock waiters’ vCPUs are active, making inter-thread communication ineffective
  • Similar to traditional job scheduling issues in distributed environments: multicore resembles a distributed environment
9/35
Coordination Space
• Space and time domains
• Space domain
• pCPU assignment policy
• Where is each sibling vCPU assigned?
• Time domain
• Preemptive scheduling policy
• When and which sibling vCPUs are preemptively scheduled
• e.g., Co-scheduling
(Diagram: a coordinated group of sibling vCPUs over the pCPUs; the space domain decides where to schedule, the time domain decides when to schedule)
10/35
Space Domain: pCPU Assignment
• A naïve method: “Balance scheduling” [Sukwong et al., EuroSys’11]
  • Spread sibling vCPUs on separate pCPUs
  • Probabilistic co-scheduling: spreading raises the likelihood that siblings run concurrently
  • No coordination in the time domain
• Limitation
  • An unrealistic assumption: “CPU load is well balanced”
  • In practice, VMs with equal CPU shares have
    • Different numbers of vCPUs
    • Different thread-level parallelism
    • Phase-changed multithreaded workloads
  (Diagram: under load imbalance, balance scheduling stacks sibling vCPUs behind a highly contended pCPU while VMs with larger CPU shares occupy others)
11/35
Space Domain: pCPU Assignment
• Proposed scheme: “Load-conscious balance scheduling”
  • A hybrid of balance scheduling and load-based assignment: if all candidate pCPUs are not overloaded, apply balance scheduling; otherwise, fall back to load-based assignment (a sketch follows below)
• Example
  • Candidate pCPU set = {pCPU0, pCPU1, pCPU2, pCPU3}; the scheduler assigns the lowest-loaded pCPU in this set
  • pCPU3 is overloaded (i.e., its CPU load > the average CPU load), so the woken sibling vCPU is assigned elsewhere instead of stacking on pCPU3’s wait queue
• How about contention between sibling vCPUs? → Passed to coordination in the time domain!
12/35
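Below is a minimal C sketch of the load-conscious pick described above; pcpu_load(), avg_pcpu_load(), and runs_sibling_of() are hypothetical helpers standing in for the scheduler’s bookkeeping, not Xen or KVM APIs.

```c
struct vcpu;                                          /* opaque vCPU handle */
extern unsigned long pcpu_load(int cpu);              /* hypothetical */
extern unsigned long avg_pcpu_load(void);             /* hypothetical */
extern int runs_sibling_of(struct vcpu *v, int cpu);  /* hypothetical */

int lc_balance_pick(struct vcpu *v, int ncpus)
{
    int cpu, pick = -1;
    unsigned long best = (unsigned long)-1;
    unsigned long avg = avg_pcpu_load();

    /* Balance scheduling: lowest-loaded pCPU that is neither overloaded
     * nor already running a sibling vCPU. */
    for (cpu = 0; cpu < ncpus; cpu++) {
        if (runs_sibling_of(v, cpu) || pcpu_load(cpu) > avg)
            continue;
        if (pcpu_load(cpu) < best) { best = pcpu_load(cpu); pick = cpu; }
    }
    if (pick >= 0)
        return pick;

    /* Load-based fallback: every balanced candidate is overloaded, so
     * take the lowest-loaded pCPU overall, siblings allowed. */
    for (cpu = 0; cpu < ncpus; cpu++)
        if (pcpu_load(cpu) < best) { best = pcpu_load(cpu); pick = cpu; }
    return pick;
}
```

The fallback deliberately tolerates sibling stacking: the contention it may cause is handled by coordination in the time domain, described next.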
Time Domain: Preemption Policy
• What type of contention demands coordination?
• Busy-waiting for communication (or synchronization)
• Unnecessary CPU consumption by busy-waiting for a
descheduled (inactive) vCPU
• Significant performance degradation
• Why serious in multiprocessor VMs?
• Semantic gap
• OSes make liberal use of busy-waiting (e.g., spinlock) since they
believe their vCPUs are always online (i.e., dedicated)
• “Demand-based coordinated scheduling”
• Issues
  • When and where is coordination demanded?
  • Does busy-waiting really matter?
  • How to detect coordination demand?
13/35
Time Domain: Preemption Policy
• When and where to demand coordination?
• Experimental analysis
• 13 emerging multithreaded applications in the PARSEC suite
• Diverse characteristics
• Kernel time ratio in the case of consolidation
• Busy-waiting occurs in kernel space
(Charts: per-application CPU time split into kernel time and user time for a VM with 8 vCPUs on 8 pCPUs, solo run (no consolidation) vs. corun (with one VM running streamcluster), across the 13 PARSEC applications)
The kernel time ratio is largely amplified, by 1.3x~30x, under consolidation
14/35
Time Domain: Preemption Policy
• Where is the kernel time amplified?
(Per-application CPU cycles %, with the application’s total kernel CPU cycles % in parentheses)
• TLB shootdown: dedup 43% (83%), ferret 9% (11%), vips 41% (47%)
• Lock spinning: bodytrack 5% (8%), canneal 4% (5%), dedup 36% (83%), facesim 4% (5%), streamcluster 10% (11%), swaptions 5% (6%), vips 4% (47%), x264 7% (8%)
15/35
Time Domain: Preemption Policy
• TLB shootdown
  • Notification of TLB invalidation to a remote CPU via an inter-processor interrupt (IPI)
  • TLB (Translation Lookaside Buffer): a per-CPU cache for virtual address mappings
  • When a thread modifies or unmaps a shared mapping (e.g., V->P1 becomes V->P2 or V->Null), the initiating CPU busy-waits until all corresponding remote TLB entries are invalidated
  → Efficient in native systems, but not in virtualized systems if the target vCPUs are not scheduled
“A TLB shootdown IPI is a signal for coordination demand!”
→ Co-schedule IPI-recipient vCPUs with the sender vCPU (see the sketch below)
(Chart: TLB shootdown IPI traffic, in TLB IPIs/sec/vCPU, across the 13 PARSEC applications)
16/35
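A minimal sketch of how the VMM could act on a trapped TLB shootdown IPI; vcpu_is_running() and mark_urgent() are hypothetical hooks, and the thesis’s implementation details may differ.

```c
struct vcpu;                                   /* opaque vCPU handle  */
extern int  vcpu_is_running(struct vcpu *v);   /* hypothetical helper */
extern void mark_urgent(struct vcpu *v);       /* hypothetical helper */

void on_tlb_shootdown_ipi(struct vcpu **recipients, int n)
{
    /* Promote every descheduled recipient to the urgent state so it
     * runs alongside the sender, shortening the sender's busy-wait. */
    for (int i = 0; i < n; i++)
        if (!vcpu_is_running(recipients[i]))
            mark_urgent(recipients[i]);
}
```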
Time Domain: Preemption Policy
• Lock spinning
• Which spinlocks show dominant wait time?
(Chart: breakdown of spinlock wait time per application; the futex wait-queue lock dominates, up to 89% and 81% in some applications, followed by semaphore wait-queue, runqueue, pagetable, wait-queue, and other locks)
• Futex: kernel support for user-level synchronization (e.g., mutex, barrier, condvar)
vCPU0:
  mutex_lock(mutex)
  /* critical section */
  mutex_unlock(mutex)
  futex_wake(mutex) {
    spin_lock(queue->lock)
    thread = dequeue(queue)
    wake_up(thread)          /* wakes vCPU1 */
    spin_unlock(queue->lock)
  }

vCPU1:
  mutex_lock(mutex)
  futex_wait(mutex) {
    spin_lock(queue->lock)
    enqueue(queue, me)
    spin_unlock(queue->lock)
    schedule() /* blocked */
  }
  /* wake-up */
  /* critical section */
  mutex_unlock(mutex)
  futex_wake(mutex) {
    spin_lock(queue->lock)   /* busy-waits if vCPU0 still holds the lock */
    ...

If vCPU0 is preempted while waking vCPU1 up, vCPU1 busy-waits on the preempted spinlock: the so-called lock-holder preemption (LHP)
“A reschedule IPI is a signal for coordination demand!”
→ Delay preemption of an IPI-sender vCPU until the likely-held spinlock is released
17/35
Time Domain: Preemption Policy
• Proposed scheme
• Urgent vCPU first (UVF) scheduling
• Urgent time slice (utslice)
• Long enough for a reschedule IPI sender to release a spinlock
• Short enough to quickly serve multiple urgent vCPUs
(Diagram: each pCPU serves a FIFO-ordered urgent queue ahead of its proportional-shares runqueue; a vCPU in the urgent state is protected from preemption during its urgent time slice (utslice), as long as inter-VM fairness is kept; a minimal sketch follows below)
18/35
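A minimal sketch of the UVF pick-next decision, with hypothetical queue helpers and fairness check; the real scheduler additionally enforces the utslice and urgent-allowance accounting.

```c
struct vcpu;
struct queue;
extern struct vcpu *peek_fifo(struct queue *q);            /* hypothetical */
extern struct vcpu *dequeue_fifo(struct queue *q);         /* hypothetical */
extern struct vcpu *dequeue_by_shares(struct queue *q);    /* hypothetical */
extern int fairness_allows_boost(struct vcpu *v);          /* hypothetical */

struct vcpu *uvf_pick_next(struct queue *urgentq, struct queue *runq)
{
    struct vcpu *v = peek_fifo(urgentq);     /* urgent vCPUs, FIFO order */

    /* Serve an urgent vCPU first, but only while inter-VM fairness is
     * kept; it then runs preemption-protected for one utslice. */
    if (v && fairness_allows_boost(v))
        return dequeue_fifo(urgentq);
    return dequeue_by_shares(runq);          /* proportional-shares order */
}
```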
Evaluation
• Utslice parameter
• 1. Utslice for reducing LHP
• 2. Utslice for quickly serving multiple urgent vCPUs
(Chart: number of futex-queue LHPs vs. utslice for bodytrack, facesim, and streamcluster)
Workloads: a futex-intensive workload in one VM + dedup in another VM as a preempting VM
A utslice above 300 us yields a 2x~3.8x LHP reduction
The remaining LHPs occur during local wake-up or before reschedule-IPI transmission → not likely to lead to lock contention
19/35
Evaluation
• Utslice parameter
• 1. utslice for reducing LHP
• 2. utslice for quickly serving multiple urgent vCPUs
(Chart: spinlock cycles (%), TLB shootdown cycles (%), and average execution time (sec) vs. utslice)
Workloads: 3 VMs, each running vips (a TLB-IPI-intensive application)
As the utslice increases, TLB shootdown cycles increase (up to ~11% execution-time degradation)
A 500 us utslice is appropriate for both LHP reduction and quickly serving multiple urgent vCPUs
20/35
Evaluation
• Workload consolidation
• One 8-vCPU VM + four 1-vCPU VMs (x264)
(Charts: normalized execution time of the 8-vCPU VM’s workloads and of the co-running 1-vCPU VMs (x264) under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co)
Multiprocessor VMs need coordination in the time domain (up to ~90% improvement)
Singleprocessor VMs: balance scheduling degrades the 1-vCPU VMs by incurring unnecessary contention
21/35
Summary
• Contributions
• Load-conscious balance scheduling
• Essential for heterogeneously consolidated environments
where load imbalance usually takes place
• IPI-driven coordinated scheduling
• Effective for VMM to alleviate unnecessary CPU contention
based on IPIs between sibling vCPUs
• Future work
• Combining the “scheduling-based method” with
“contention management methods”
• Contention management methods
• Paravirtual spinlock, HW-based spin detection
22/35
Virtual Asymmetric Multiprocessor for
User-Interactive Performance
How to improve the performance of user-interactive workloads mixed with background workloads in multiprocessor VMs?
(Diagram: user-interactive and background workloads sharing a multiprocessor VM over the VMM scheduler)
Motivation
• Background & idea
  • The initial proposal of “task-aware scheduling” did not consider multiprocessor VMs
  • Existing VMM schedulers give an illusion of a symmetric multiprocessor (SMP) to each VM
    • Due to the absence of mixed-workload tracking, a VM’s vCPUs are equally contended regardless of user interactions
• Proposal: virtual AMP (vAMP)
  • The size of a vCPU = the amount of its CPU shares
  • Fast vCPUs host the interactive workload, slow vCPUs the background workload
24/35
Workload Classification
• Previous methods
  • Time-quanta-based classification: “interactive workloads typically show short time quanta”
    + Clear classification between I/O-bound and CPU-bound tasks
    - Modern interactive workloads show mixed behaviors
    - A multithreaded CPU-bound job shows short time quanta due to inter-thread communication
  • OS technique: user-I/O-driven IPC tracking [Zheng et al., SIGMETRICS’10]
    • e.g., X server → Terminal → Firefox, connected by IPCs, form an interactive task group for a user I/O
    + Identifies the set of tasks involved in a user interaction (I/O)
    - Relies on various OS-level IPC structures (e.g., socket, pipe, signal) → the VMM cannot access OS-level IPCs
25/35
Workload Classification
• Proposed scheme: “background workload identification”
  • Instead of tracking interactive workloads, identify “background CPU noise” at the time of a user I/O
• Rationales
  • Interactive CPU load is typically initiated by user I/O
  • The VMM can unobtrusively monitor user I/O and per-task CPU load
• Exceptional case: multimedia workloads (e.g., video playback)
  • Filter multimedia tasks (tasks requesting audio I/O) out of the background workload set
26/35
Virtual Asymmetric Multiprocessor
• vAMP
  • Dynamically adjusting the CPU shares of a vCPU according to the task it currently hosts (a minimal sketch follows below)
  1. Maintain per-task CPU load during the pre-I/O period → the pre-I/O period is set shorter than general user think time (1 second by default)
  2. Tag tasks that have generated nontrivial CPU loads as background tasks → the threshold can be set to filter daemon tasks that possibly serve interactive workloads
  3. Dynamically adjust each vCPU’s shares based on the weight ratio (e.g., background : non-background = 1:5)
  4. Provide vAMP during an interactive episode → an interactive episode is restarted when another user I/O occurs, and finishes if the maximum time elapses without user I/O
27/35
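A minimal sketch of the share adjustment in step 3, assuming hypothetical fields and a set_vcpu_shares() hook; the 1:5 ratio mirrors the example above.

```c
#define BG_WEIGHT     1
#define NONBG_WEIGHT  5

struct task { int is_background; };
struct vcpu { struct task *curr_task; unsigned base_shares; };

extern void set_vcpu_shares(struct vcpu *v, unsigned shares); /* hypothetical */

void vamp_on_task_switch(struct vcpu *v, struct task *next, int in_episode)
{
    v->curr_task = next;
    if (!in_episode) {                      /* outside an interactive    */
        set_vcpu_shares(v, v->base_shares); /* episode: symmetric vSMP   */
        return;
    }
    /* Slow vCPU while hosting a background task, fast vCPU otherwise. */
    unsigned w = next->is_background ? BG_WEIGHT : NONBG_WEIGHT;
    set_vcpu_shares(v, v->base_shares * w / NONBG_WEIGHT);
}
```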
Limitation
• An intrinsic limitation of VMM-only approach
• Manipulating only a single scheduling layer
(i.e., VMM scheduler)
• A vAMP-oblivious OS scheduler
• Agnostic about underlying vAMP (i.e., all vCPUs are identical)
• Possibly multiplexing interactive and background tasks on the
same vCPU
• A slow vCPU has higher scheduling latency
• “Frequent multiplexing” might offset the benefit of vAMP
Example: a scheduling trace during a Google Chrome launch, with background and non-background tasks multiplexed on the same vCPU
“An aggressive weight ratio is not always effective if multiplexing happens frequently”
→ The weight ratio is an important parameter for interactive performance
28/35
Guest OS Extension
• Guest OS extension for vAMP
• OS enlightenment about vAMP
• To avoid ineffective multiplexing of interactive and background tasks on the same vCPU → isolation
• Design principles
• Keeping VMM OS-independent
• Optional extension for further enhancement of interactive
performance
• Keeping extension OS-independent
• No reliance on specific OS functionality
• Isolating tasks on separate CPUs is a general interface of
commodity OSes (e.g., modifying CPU affinity)
• Small kernel changes for low maintenance cost
29/35
Guest OS Extension
• Linux extension for vAMP
  • A user-level vAMP-daemon isolates background tasks, exposed by the VMM, from non-background tasks
  • Small kernel changes expose the VMM-identified background tasks (e.g., T1, T2) to user space via a procfs interface
  • Flow: 1. event-driven notification → 2. the daemon reads the background-task list → 3. it isolates tasks via the cpuset interface
• Isolation procedure:
  1. Initially dedicate nr_fast_vcpus to interactive tasks (i.e., non-background tasks)
  2. Periodically increase nr_fast_vcpus when the fast vCPUs become fully utilized (also periodically check for the end of the interactive episode → stop isolation)
• Default nr_fast_vcpus = 1, due to the low thread-level parallelism of interactive workloads [Blake et al., ISCA’10]
30/35
Evaluation
• Application launch
• Background workload
• Data mining application (freqmine) with 8 threads
• Weight ratio (background : non-background)
• vAMP(L)=1:3, vAMP(M)=1:9, vAMP(H)=1:18
Setup: two 8-vCPU VMs on 8 pCPUs, each running freqmine; applications are launched through a remote desktop client
(Chart: normalized average launch time of Impress, Firefox, Chrome, and Gimp under Baseline, vAMP(L/M/H), and vAMP(L/M/H) w/ Ext)
vAMP improves launch performance by 7~40%; a high weight ratio is ineffective because of the negative effect of multiplexing
The guest OS extension further improves interactive performance by up to 70%
Why did Gimp show significant improvement even without the guest OS extension?
31/35
Evaluation
• Application launch
• Chrome vs. Gimp (without guest OS extension)
Chrome (web browser) → many threads are cooperatively scheduled in a fine-grained manner
Gimp (image editing program) → a single thread dominantly performs the computation with little communication
(Traces show background vs. non-background tasks)
32/35
Evaluation
• Media player
• VLC media player
• 1920x800 HD video with 23.976 frames per second (FPS)
• Mult: multimedia workload filtering
Without multimedia workload filtering,
VLC is misidentified as a background task
vAMP improves playback quality by up to 22.3 FPS,
but high weight ratio still degrades the quality
Guest OS extension achieves 23.8 FPS
Setup: two 8-vCPU VMs on 8 pCPUs, each running freqmine; one also runs the media player
(Chart: average frames per second under Baseline, vAMP(L) w/o Mult, vAMP(L), vAMP(M), vAMP(H), and vAMP(L/M/H) w/ Ext)
33/35
Summary
• vAMP
• Dynamically varying vCPU performance based on the workloads the vCPUs host
• A feasible method of improving interactive performance
• Assisted by a simple guest OS extension
• Isolation of different types of workloads enhances the
effectiveness of vAMP
• Future work
• Collaboration of VMM and OSes for vAMP
• Standard & well-defined API
34/35
Conclusions
• Lessons learned from the thesis
• In-depth analysis of OSes and workloads can realize
intelligent CPU scheduling based only on VMM-
visible events
• Both lightweightness and efficiency are achieved
• Task-awareness is an essential ability for VMM to
effectively handle mixed workloads
• Multi-tasking is ubiquitous inside every VM
• Coordinated scheduling improves CPU efficiency of
multiprocessor VMs
• Resolving unnecessary CPU contention is crucial
35/35
Publications
• Task-aware VM scheduling
• [VEE’09] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, “Task-aware Virtual Machine Scheduling for I/O
Performance”
• [JPDC’11] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, Seungryoul Maeng, “Transparently Bridging
Semantic Gap in CPU Management for Virtualized Environments”
• [MMSys’12] Hwanju Kim, Jinkyu Jeong, Jaeho Hwang, Joonwon Lee, Seungryoul Maeng, “Scheduler Support for Video-oriented
Multimedia on Client-side Virtualization”
• [ApSys’12] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, “Virtual Asymmetric Multiprocessor for
Interactive Performance of Consolidated Desktops”
• Demand-based coordinated scheduling
• [ASPLOS’13] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, “Demand-Based Coordinated
Scheduling for SMP VMs”
• Other work on virtualization
• [IEEE TC’11] Hwanju Kim, Heeseung Jo, and Joonwon Lee, “XHive: Efficient Cooperative Caching for Virtual Machines”
• [IEEE TC’10] Heeseung Jo, Hwanju Kim, Jae-Wan Jang, Joonwon Lee, and Seungryoul Maeng, “Transparent Fault Tolerance of Device
Drivers for Virtual Machines”
• [MICRO’10] Daehoon Kim, Hwanju Kim, and Jaehyuk Huh, “Virtual Snooping: Filtering Snoops in Virtualized Multi-cores”
• [VHPC’11] Sangwook Kim, Hwanju Kim, and Joonwon Lee, “Group-Based Memory Deduplication for Virtualized Clouds”
• [Euro-Par’08] Dongsung Kim, Hwanju Kim, Myeongjae Jeon, Euiseong Seo, Joonwon Lee, “Guest-Aware Priority-based Virtual
Machine Scheduling for Highly Consolidated Server”
• [VHPC’09] Heeseung Jo, Youngjin Kwon, Hwanju Kim, Euiseong Seo, Joonwon Lee, Seungryoul Maeng, “SSD-HDD-Hybrid Virtual Disk
in Consolidated Environments”
• Other work on embedded and mobile systems
• [ACM TECS’12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, “Rigorous Rental Memory
Management for Embedded Systems”
• [CASES’12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, “DaaC: Device-reserved Memory as an
Eviction-based File Cache”
• [IEEE TCE’09] Heeseung Jo, Hwanju Kim, Hyun-Gul Roh, Joonwon Lee, “Improving the Startup Time of Digital TV”
• [IEEE TCE’09] Heeseung Jo, Hwanju Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, “Optimizing the Startup Time of
Embedded Systems: A Case Study of Digital TV”
• [IEEE TCE’10] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jin-Soo Kim, and Joonwon Lee, “AppWatch: Detecting Kernel Bug for
Protecting Consumer Electronics Applications”
• [IEEE TCE’12] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jeonghwan Choi, and Joonwon Lee, “Compressed Memory Swap for QoS of
Virtualized Embedded Systems”
• [SPE’10] Jinkyu Jeong, Euiseong Seo, Jeonghwan Choi, Hwanju Kim, Heeseung Jo, and Joonwon Lee, “KAL: Kernel-assisted Non-
invasive Memory Leak Tolerance with a General-purpose Memory Allocator”
Thank
You !
References
[Blake et al., ISCA’10] Evolution of thread-level parallelism in desktop applications
[Botelho’08] Virtual machines per server, a viable metric for hardware selection?
(http://itknowledgeexchange.techtarget.com/server-farm/virtual-machines-per-server-a-viable-metric-for-hardware-selection/)
[Govindan et al., VEE’07] Xen and co.: communication-aware CPU scheduling for consolidated xen-based hosting
platforms
[Hu et al., HPDC’10] I/O scheduling model of virtual machine based on multi-core dynamic partitioning
[Kim et al., EuroPar’08] Guest-Aware Priority-Based Virtual Machine Scheduling for Highly Consolidated Server
[Kim et al., VEE’09] Task-aware virtual machine scheduling for I/O performance
[Kim et al., JPDC’11] Transparently Bridging Semantic Gap in CPU Management for Virtualized Environments
[Lee et al., VEE’10] Supporting Soft Real-Time Tasks in the Xen Hypervisor
[Liao et al., ANCS’08] Software techniques to improve virtualized I/O performance on multi-core systems
[Lin et al., SC’05] VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling
[Masrur et al., RTCSA’10] VM-Based Real-Time Services for Automotive Control Applications
[Ongaro et al., VEE’08] Scheduling I/O in virtual machine monitors
[Sukwong et al., EuroSys’11] Is co-scheduling too expensive for SMP VMs?
[Uhlig et al., VM’04] Towards scalable multiprocessor virtual machines
[VMware ESXi’10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1
[VMware VDI] Enabling your end-to-end virtualization solution.
(http://www.vmware.com/solutions/partners/alliances/hp-vmware-customers.html)
[Weng et al., HPDC’11] Dynamic adaptive scheduling for virtual machines
[Weng et al., VEE’09] The hybrid scheduling framework for virtual machine systems
[Xia et al., ICPADS’09] PaS: A Preemption-aware Scheduling Interface for Improving Interactive Performance in
Consolidated Virtual Machine Environment
[Zheng et al., SIGMETRICS’10] RSIO: automatic user interaction detection and scheduling
EXTRA SLIDES
Demand-Based Coordinated
Scheduling for Multiprocessor VMs
Proportional-Share Scheduler
• Proportional-share scheduler for SMP VMs
• Common scheduler for commodity VMMs
• Employed by KVM, Xen, VMware, etc.
• VM’s shares (S) =
Total shares x (weight / total weight)
• VCPU’s shares = S / # of active VCPUs
• Active vCPU: Non-idle vCPU
• Example: a 4-vCPU VM with S = 1024
  • Single-threaded workload: one active vCPU holds all 1024 shares
  • Multithreaded (programmed) workload: four active vCPUs hold 256 shares each
• Symmetric vCPUs: existing schedulers view active vCPUs as containers with identical power (see the sketch below)
41/35
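The shares arithmetic above, written out as plain C helpers (a sketch, not any VMM’s actual API):

```c
unsigned vm_shares(unsigned total_shares, unsigned weight,
                   unsigned total_weight)
{
    return total_shares * weight / total_weight;   /* VM's shares S */
}

unsigned vcpu_shares(unsigned S, unsigned n_active_vcpus)
{
    return S / n_active_vcpus;  /* each active (non-idle) vCPU gets S/n */
}
/* e.g., a 4-vCPU VM with S = 1024: 4 active vCPUs -> 256 shares each;
 * a single-threaded workload (1 active vCPU) -> all 1024 shares. */
```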
Helping Lock
• Spin-then-block lock [AMD, XenSummit’08]
• Block after spin during a certain period
• + Reducing unnecessary spinning
• - Still LHP and unnecessary spinning
• - Profiling required to find a suitable spin threshold
• - Kernel instrumentation
• Still, it is the most popular paravirtualized approach for open-source kernels like Linux
• Paravirt-spinlock for Xen Linux (mainline)
• Paravirt-spinlock for KVM Linux (patch)
42/35
Coordination for User-level Contention
• User-level synchronization
• Pure spin-based synchronization is rarely used in user space
• Block-based or spin-then-block synchronization
• Reschedule IPI driven coscheduling
• With regard to spin-then-block synchronization, less contention
occurs by coscheduling cooperative threads
(Charts: reschedule IPI traffic of streamcluster; execution time of streamcluster consolidated with bodytrack)
Streamcluster intensively uses spin-then-block barriers; Resched-Co alleviates the spin phase of lock wait time
43/35
Performance on PLE
• PLE (Pause-Loop Exiting)
  • A HW mechanism to notify the VMM of spinning beyond a predefined threshold (i.e., pathological busy-waiting)
  • In response to this notification, the VMM lets the currently running vCPU yield its pCPU
(Charts: facesim (futex-intensive) and ferret (TLB-IPI-intensive))
IPI-driven scheduling proactively alleviates unnecessary contention, whereas PLE reactively relieves contention that has already happened
44/35
Evaluation: Urgent Allowance
• Urgent allowance
• Trading short-term fairness with CPU efficiency
• How much short-term fairness is traded?
Workloads: 1 vips VM + 2 facesim VMs
Trading short-term fairness improves overall efficiency
without negative impact on long-term fairness 45/35
Evaluation: Two Multiprocessor VMs
Workloads: an 8-vCPU VM corun with dedup or with freqmine
(Charts: corun vs. solorun execution time; a = baseline, b = balance, c = LC-balance, d = LC-balance+Resched-DP, e = LC-balance+Resched-DP+TLB-Co)
46/35
TLB Shootdown IPIs of Windows 7
• Heavy use of TLB shootdown IPIs by Windows 7 desktop application launches
• Most TLB shootdown IPIs are sent with multi/broadcasting
• TLB-IPI-driven coscheduling improves PowerPoint launch time by 23% when consolidated with 4 VMs, each running streamcluster

App               Explorer  IE    PowerPoint  Word  Excel
# of triggers     102       262   166         179   77
# of IPIs         608       1230  782         990   418
Launch time (ms)  622       982   975         1108  1011
47/35
Virtual Asymmetric Multiprocessor for
User-Interactive Performance
Multimedia Workload Filtering
• Tracking audio-requesting tasks
• Tracking tasks that access a virtual audio device
• Excluding audio access in an interrupt context
• Checking audio Interrupt Service Register (ISR)
• Server-client sound system
• A user-level task to serve all audio requests (e.g., pulseaudio)
• Remote wake-up tracking
Setup: one VM runs VLC+facesim, another runs freqmine (facesim severely interferes with remote wake-up tracking)
49/35
Measurement Methodology
• Spiceplay
• Snapshot-based record/replay
• Robust replay for varying loads
• Similar to VNCPlay [USENIX’05] and Deskbench [IM’09]
• Extension on the SPICE remote desktop client
• Record
  • Snapshot at an input point → input recording → snapshot at a user-perceived point
• Replay
  • Snapshot comparison & start timer → input replaying → snapshot comparison & stop timer
50/35
vAMP Parameters
• Default vAMP parameters
Parameter: background load threshold
  Role: tagging background tasks
  Default value: 50%
  Rationale: large enough to filter general daemon tasks such as an X server
Parameter: maximum time of an interactive episode
  Role: duration of distributing asymmetric CPU shares
  Default value: 5 sec
  Rationale: large enough to cover a general interactive episode (2 sec was used in previous research based on HCI work, but a larger value is needed to cover long-launched applications)
(Chart: video playback FPS with vAMP(L) w/ Ext for bgload_thresh = 5 vs. 50; with a 5% threshold, the X server is misclassified as a background task)
(Chart: normalized Gimp launch time with vAMP(L) w/ Ext for max_intr_episode = 2 sec vs. 5 sec; with 2 sec, the interactive episode finishes prematurely before the end of the launch)
51/35
Evaluation: Background Performance
• Performance of background workloads
• With repeated launches at a 1-second interval (an intensively interactive workload)
• 3-28% degradation
(Chart: normalized average execution time of the background workload during repeated launches of Impress, Firefox, Chrome, and Gimp under Baseline, vAMP(L/M/H), and vAMP(L/M/H) w/ Ext)
52/35
Evaluation: Guest OS Extension
• Interrupt pinning
• An interactive workload can accompany I/O
• Even warm launch can involve synchronous disk writes
• During an interactive episode, pinning I/O interrupts
on fast vCPUs
• In Linux, manipulate /proc/irq/<irq number>/smp_affinity
53/35
(Chart: average Chrome launch time with vAMP(L/M/H) w/ Ext, with and without interrupt pinning)
A Chrome launch entails some synchronous writes; if a disk I/O interrupt is delivered to a slow vCPU, scheduling latency increases
Evaluation: Guest OS Extension
• nr_fast_vcpus parameter
• Initial number of fast vCPUs
(Chart: normalized average launch time of Impress, Firefox, Chrome, and Gimp with nr_fast_vcpus = 1, 2, and 4)
Interactive workloads with low thread-level parallelism do not require a large number of initial fast vCPUs
A workload with low thread-level parallelism is adversely affected by multiple fast vCPUs, since unnecessary vCPU-level scheduling latency is involved
54/35
Task-aware VM Scheduling for I/O
Performance
Problem of VM Scheduling
• Task-agnostic scheduling
(Diagram: the VMM run queue, sorted by CPU fairness, holds VM1’s and VM2’s vCPUs, each VM mixing I/O-bound, CPU-bound, and mixed tasks; when an I/O event arrives for a low-priority VM’s I/O-bound task, the VMM neither knows which task the event is for nor schedules that VM immediately)
56/35
Task-agnostic scheduling
• The worst-case example for 6 consolidated VMs: network response time
  • Native Linux: non-consolidated OS; XenoLinux: consolidated OS on Xen
  <Workloads>
  • I/O+CPU: 1 VM runs a server & a CPU-bound task; 5 VMs run a CPU-bound task
  • I/O: 1 VM runs a server; 5 VMs run a CPU-bound task
  • The I/O-only case is handled well by the boosting mechanism of the Xen Credit scheduler
  • Poor responsiveness in the I/O+CPU case → the boosting mechanism recognizes I/O-boundness only at vCPU-level granularity
57/35
Task-aware VM Scheduling
• Goals
  • Tracking I/O-boundness at task granularity
  • Improving the response time of I/O-bound tasks
  • Keeping inter-VM fairness
• Challenges (for VMs mixing I/O-bound, CPU-bound, and mixed tasks)
  1. I/O-bound task identification
  2. I/O event correlation
  3. Partial boosting
58/35
Task-aware VM Scheduling
1. I/O-bound Task Identification
• Observable information at the VMM
• I/O events
• Task switching events [Jones et al., USENIX’06]
• CPU time quantum of each task
• Inference based on common OS techniques
• General OS techniques (Linux, Windows, FreeBSD,
…) to infer and handle I/O-bound tasks
• 1. Small CPU time quantum (main)
• 2. Preemptive scheduling in response to I/O events
(supportive)
• Example (Intel x86): task switches are observed as CR3 updates, so a task’s time quantum is the interval between consecutive CR3 updates, and I/O events can be ordered within it (see the sketch below)
59/35
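A minimal sketch of this inference, assuming a hypothetical CR3-write trap hook and an account_quantum() helper (cf. Antfarm-style task tracking):

```c
struct task_info { unsigned long cr3; unsigned long long last_switch; };

extern void account_quantum(struct task_info *t,
                            unsigned long long quantum);   /* hypothetical */

void on_cr3_write(struct task_info *prev, struct task_info *next,
                  unsigned long new_cr3, unsigned long long now)
{
    /* The interval between CR3 updates is the departing task's CPU
     * time quantum; short quanta are evidence of I/O-boundness. */
    account_quantum(prev, now - prev->last_switch);
    next->cr3 = new_cr3;
    next->last_switch = now;
}
```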
• Three disjoint observation classes
  • Positive evidence: supports I/O-boundness (if both 1. small CPU time quantum (main) and 2. preemptive scheduling in response to I/O events (supportive) are satisfied)
  • Negative evidence: supports non-I/O-boundness (if 1 is violated; more penalty for a longer time quantum)
  • Ambiguity: no evidence (otherwise)
• Weighted evidence accumulation
  • The degree of belief grows with the number of sequential observations; once it is high enough, the task is believed to be an I/O-bound task (a sketch follows below)
60/35
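A minimal sketch of the accumulation, with illustrative weights and threshold (the thesis’s actual parameters are not shown here):

```c
enum evidence { POSITIVE, NEGATIVE, AMBIGUITY };

struct task_belief { int score; };       /* degree of belief */

#define IO_BOUND_THRESHOLD 8             /* illustrative value */

void observe(struct task_belief *b, enum evidence e, int quantum_ms)
{
    switch (e) {
    case POSITIVE:  b->score += 1; break;
    case NEGATIVE:  b->score -= 1 + quantum_ms / 10; /* longer quantum,  */
                    break;                           /* bigger penalty   */
    case AMBIGUITY: break;                           /* no change        */
    }
    if (b->score < 0) b->score = 0;
}

int is_io_bound(const struct task_belief *b)
{
    return b->score >= IO_BOUND_THRESHOLD;
}
```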
Task-aware VM Scheduling
2. I/O Event Correlation
• I/O event correlation
• To distinguish an incoming event for I/O-bound tasks
• Why?
• To selectively prioritize I/O-bound tasks in a VM
• CPU-bound tasks also conduct I/O operations
• Goal
  • Best-effort correlation: lightweight rather than accurate
• I/O types
• Block I/O: disk read
• Network I/O: packet reception
61/35
Task-aware VM Scheduling
2. I/O Event Correlation: Block I/O
• Request-response correlation
• Window-based correlation
• Correlation for delayed read events by guest OS
• e.g., block I/O scheduler
• Overhead per VCPU = window size x 4bytes (task ID)
(Diagram: task T1’s read may be issued to the VMM later, e.g., delayed by the guest’s block I/O scheduler, while tasks T2-T4 run; on completion, the VMM credits the event to any I/O-bound task within the inspection window; a sketch follows below)
62/35
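A minimal sketch of the inspection window as a ring buffer of task IDs; is_io_bound_task() is a hypothetical lookup into the identification state above.

```c
#define WINDOW 3   /* the inspection window size used in the evaluation */

extern int is_io_bound_task(unsigned task_id);   /* hypothetical lookup */

struct io_window { unsigned ids[WINDOW]; unsigned head; };

void window_record(struct io_window *w, unsigned task_id)
{
    w->ids[w->head] = task_id;          /* remember the last W readers */
    w->head = (w->head + 1) % WINDOW;
}

int window_has_io_bound(const struct io_window *w)
{
    /* Credit a read completion to any I/O-bound task in the window. */
    for (int i = 0; i < WINDOW; i++)
        if (is_io_bound_task(w->ids[i]))
            return 1;
    return 0;
}
```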
Task-aware VM Scheduling
2. I/O Event Correlation: Network I/O
• History-based prediction
• Asynchronous packet reception
• Monitoring “the first woken task” in response to an incoming packet
  • An N-bit saturating counter per destination port number: incremented if the first woken task is I/O-bound, decremented otherwise
  • Example: a 2-bit counter stepping through non-I/O-bound (00), weak I/O-bound (01), I/O-bound (10), and strong I/O-bound (11)
  • If the port’s counter MSB is set, the packet is considered destined for I/O-bound tasks
  • Overhead per VM = N x 8 KB (a sketch follows below)
63/35
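A minimal sketch of the 2-bit (N = 2) saturating-counter portmap; for 64K ports at 2 bits each this is 16 KB, consistent with the N x 8 KB figure above.

```c
#include <stdint.h>

static uint8_t portmap[65536 / 4];       /* 2 bits x 64K ports = 16 KB */

static unsigned get2(uint16_t port)
{ return (portmap[port >> 2] >> ((port & 3) * 2)) & 3; }

static void set2(uint16_t port, unsigned v)
{
    unsigned shift = (port & 3) * 2;
    portmap[port >> 2] =
        (uint8_t)((portmap[port >> 2] & ~(3u << shift)) | (v << shift));
}

void update_portmap(uint16_t port, int woken_task_is_io_bound)
{
    unsigned c = get2(port);
    if (woken_task_is_io_bound) { if (c < 3) set2(port, c + 1); }
    else                        { if (c > 0) set2(port, c - 1); }
}

int packet_for_io_bound(uint16_t port)
{
    return get2(port) & 2;               /* MSB set => I/O-bound */
}
```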
Task-aware VM Scheduling
3. Partial Boosting
• Priority boosting with task-level granularity
• Borrowing future time slice to promptly handle an
incoming I/O event as long as fairness is kept
• Partial boosting lasts during the run of I/O-bound
tasks
(Diagram: if an I/O event destined for VM3 is inferred to be handled by its I/O-bound task, the VMM initiates partial boosting for VM3’s vCPU, moving it ahead of the fairness-sorted run queue holding VM1 and VM2)
64/35
Evaluation (1/4)
• Implementation on Xen 3.2
• Experimental setup
• Intel Pentium D for Linux (single core enabled)
• Intel Q6600 (VT-x) for Windows XP (single core
enabled)
• Correlation parameters
• Chosen for >90% accuracy and low overheads
by stressful tests with synthetic workloads
• Block I/O: Inspection window size = 3
• Network I/O: Portmap bit width = 2
65/35
Evaluation (2/4)
• Network response time
<Schedulers> Baseline = Xen Credit scheduler; TAVS = Task-aware VM scheduler
<Workloads> 1 VM: server & CPU-bound task; 5 VMs: CPU-bound task
(Charts: response time improvement with the fairness guarantee kept)
66/35
Evaluation (3/4)
• Real workloads
• I/O-bound tasks mixed with CPU-bound tasks on Ubuntu Linux and Windows XP
<Workloads> 1 VM: I/O-bound & CPU-bound task; 5 VMs: CPU-bound task
12-50% I/O performance improvement with inter-VM fairness
67/35
Evaluation (4/4)
• I/O-bound task identification
68/35
Client-side Scheduler Support for
Multimedia Workloads
Client-side Virtualization
• Multiple OS instances on a local device
• Primary use cases
• Different OSes for application compatibility
• Consolidating business and personal
computing environments on a single device
• BYOD: Bring Your Own Device
(Diagram: business VM and personal VM over a hypervisor with a managed domain)
70/35
Multimedia on Virtualized Clients
• Multimedia is ubiquitous on any VM
(Diagram: video playback, compilation, data processing, 3D games, video conferencing, and downloading spread across Windows/Linux, business, and personal VMs on hypervisors)
1. Multimedia workloads are dominant on virtualized clients
2. Interactive systems can have concurrently mixed workloads
71/35
Issues on Multi-layer Scheduling
• A multimedia-agnostic hypervisor invalidates OS policies for multimedia
  • OS schedulers give multimedia tasks a larger CPU proportion & timely dispatching: BVT [SOSP’99], SMART [TOCS’03], Rialto [SOSP’97], BEST [MMCN’02], HuC [TOMCCAP’06], Redline [OSDI’08], RSIO [SIGMETRICS’10], Windows MMCSS
  • The hypervisor scheduler, however, sees each VM as a black box and is unaware of any multimedia-specific OS policies inside it
  • The virtual CPU is an additional abstraction → semantic gap!
72/35
Multimedia-agnostic Hypervisor
• Multimedia QoS degradation
  • Two VMs with equal CPU shares (a multimedia VM + a competing VM) on the Xen Credit scheduler
(Charts: average FPS of 720p video playback on the VLC media player and of Quake III Arena (demo1) against competing workloads in another VM)
73/35
Possible Solutions to Semantic Gap
• Explicit vs. Implicit
• Explicit OS cooperation: + accurate; - OS modification; - infeasible without multimedia-friendly OS schedulers
• Explicit user involvement: + simple; - inconvenient; - unsuitable for dynamic workloads
• Implicit hypervisor-only workload monitor: + transparent; - difficult to identify workload demands at the hypervisor
74/35
Proposed Approach
• Multimedia-aware hypervisor scheduler
• Transparent scheduler support for multimedia
• No modifications to upper layer SW (OS & apps)
• “Feedback-driven VM scheduling”
(Diagram: a multimedia monitor estimates multimedia QoS from audio, video, and CPU events; a feedback-driven multimedia manager issues scheduling commands, e.g., CPU share or priority, to the CPU scheduler)
Challenges
1. How to estimate multimedia QoS based on a small set of HW events?
2. How to control the CPU scheduler based on the estimated information?
75/35
Multimedia QoS Estimation
• What is estimated as multimedia QoS?
• “Display rate” (i.e., frame rate)
• Used by HuC scheduler [TOMCCAP’06]
• How is a display rate captured at the
hypervisor?
• Two types of display
  1. Memory-mapped display (e.g., video playback): the graphics library writes frames into a memory-mapped framebuffer exposed through the display interface
  2. GPU-accelerated display (e.g., 3D games): frames are produced through the video device’s acceleration unit
76/35
Memory-mapped Display (1/2)
• How to estimate a display update rate on the
memory-mapped framebuffer
• Write-protection for the virtual address space mapped to the framebuffer
  • Each sampled write traps into the hypervisor page fault handler, which updates the display rate
  • The hypervisor can inspect any attempt to map framebuffer memory
  • Sampling reduces trap overheads (1/128 pages, by default; a sketch follows below)
77/35
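A minimal sketch of the sampled write-protection path, with hypothetical hypervisor hooks (write_protect_page, emulate_and_resume):

```c
#define SAMPLE 128   /* protect 1 of every 128 framebuffer pages */

extern void write_protect_page(unsigned long gfn);   /* hypothetical */
extern void emulate_and_resume(void);                /* hypothetical */

struct task_stats { unsigned long fb_writes; };

void protect_framebuffer(unsigned long first_gfn, unsigned long npages)
{
    for (unsigned long i = 0; i < npages; i += SAMPLE)
        write_protect_page(first_gfn + i);    /* sampled protection */
}

void on_fb_write_fault(struct task_stats *curr)
{
    curr->fb_writes++;      /* scaled by SAMPLE when computing FPS */
    emulate_and_resume();   /* complete the faulting write         */
}
```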
Memory-mapped Display (2/2)
• Accurate estimation
• Maintaining display rate per task
• An aggregated display rate does not
represent multimedia QoS
• Tracking guest OS tasks at the hypervisor by inspecting address space switches (Antfarm [USENIX’06])
• Monitoring audio access (RSIO [SIGMETRICS’10]) by inspecting audio buffer accesses with write-protection
• A task with a high display rate and audio access → a multimedia task
(Diagram: per-task display rates, e.g., 25 FPS vs. 10 FPS)
78/35
GPU-accelerated Display (1/2)
• Naïve method
• Inspecting GPU command buffer with
write-protection or polling
  • Too heavy due to the huge volume of GPU commands
• Lightweight method
  • Little overhead, but less accuracy
  • 3D games are less sensitive to frame rate degradation than video playback
• GPU interrupt-based estimation
• An interrupt is typically used for an application to
manage buffer memory
• Hypothesis
• “A GPU interrupt rate is in proportion to a display rate”
79/35
GPU-accelerated Display (2/2)
• Linear relationship between display rates and
GPU interrupt rates
• Exponential weighted moving average (EWMA) is used to reduce fluctuation: EWMA_t = (1 - w) x EWMA_{t-1} + w x current value
(Charts: GPU interrupts/sec vs. FPS for Quake3 demos at several resolutions on Intel GMA 950 (Apple MacBook), Nvidia 6150 Go (HP Pavilion tablet), and PowerVR (Samsung Galaxy S))
A GPU interrupt rate can be used to estimate a display rate without additional overheads
without additional overheads 80/35
Multimedia Manager
• A feedback-driven CPU allocator
• Base assumption
• “Additional CPU share (or higher priority) improves a display
rate”
• Desired frame rate (DFR)
• A currently achievable display rate
• Multiplied by a tolerable ratio (0.8)
• Feedback algorithm:
  IF current FPS < previous FPS AND current FPS < DFR THEN
      increase the CPU share
      (exponential increase in the initial phase, linear increase afterwards)
  IF no FPS improvement after 3 CPU-share increases THEN
      decrease the CPU share by half
      /* exceptional cases: 1) no relationship between CPU share and FPS,
         2) FPS saturates below DFR, 3) local CPU contention in a VM */
81/35
Priority Boosting
• Responsive dispatching
• Problem
• The hypervisor does not distinguish the types of events for
priority boosting
• A VM that will handle a multimedia event cannot preempt a
currently running VM handling a normal event.
• Higher priority for multimedia-related events
• e.g., video, audio, one-shot timer
• Priority order: MMBOOST > IOBOOST > normal priority (normal priority is based on remaining CPU shares); multimedia events trigger MMBOOST, other events IOBOOST
82/35
Evaluation
• Experimental environment
• Intel MacBook with Intel GMA 950
• Xen 3.4.0 with Ubuntu 8.04
• Implementation based on Xen Credit scheduler
• Two-VM scenario
• One with direct I/O + one with indirect (hosted) I/O
• Presenting the case of direct I/O in this talk
• See the paper for the details of the indirect I/O case
83/35
Estimation Accuracy
• Error rates: 0.55%~3.05%
(Charts: real vs. estimated FPS over time — 720p video playback w/ a CPU-bound VM; Quake 3 w/ a CPU-bound VM, estimated with EWMA (w = 0.2); the multimedia manager is disabled)
84/35
Estimation Overhead
• CPU overhead caused by page faults
• Video playback
• 0.3~1% with sampling
• Less than 5% with tracking all pages
Overhead                    All pages   Sampling: 1/8 pages   1/32 pages   1/128 pages
Low resolution (640x354)    4.95%       1.10%                 0.54%        0.58%
High resolution (1280x720)  3.91%       1.04%                 0.69%        0.33%
85/35
Multimedia Manager
• Video playback (720p) + CPU-bound VM
(Chart: FPS, the desired frame rate (DFR), and the CPU share (%) over time for 720p video playback co-run with a CPU-bound VM, with zoomed-in views of the adaptation phases)
86/35
Performance Improvement
• Performance improvement: close to the maximum achievable frame rates
(Charts: average FPS of 720p video playback on the VLC media player and of Quake III Arena (demo1) against competing workloads in another VM, Credit scheduler vs. Credit scheduler w/ multimedia support)
87/35
Limitations & Discussion
• Network-streamed multimedia
• Additional preemption support required for
multimedia-related network packets
• Multiple multimedia workloads in a VM
• Multimedia manager algorithm should be refined
to satisfy QoS of mixed multimedia workloads in the
same VM
• Adaptive management for SMP VMs
• Adaptive vCPU allocation based on hosted
multimedia workloads
88/35
Conclusions
• Demands for multimedia-aware hypervisor
• Multimedia workloads are increasingly dominant in virtualized systems
• “Multimedia-friendly hypervisor scheduler”
• Transparent and lightweight multimedia support on
client-side virtualization
• Future directions
• Multimedia for server-side VDI
• Multicore extension for SMP VMs
• Considerations for network-streamed multimedia
89/35

More Related Content

PPTX
6. Live VM migration
PPTX
3. CPU virtualization and scheduling
PPTX
Demand-Based Coordinated Scheduling for SMP VMs
PPTX
2. OS vs. VMM
PPTX
Hyper-V High Availability and Live Migration
PDF
Yabusame: postcopy live migration for qemu/kvm
PDF
Scheduler Support for Video-oriented Multimedia on Client-side Virtualization
PPTX
4. Memory virtualization and management
6. Live VM migration
3. CPU virtualization and scheduling
Demand-Based Coordinated Scheduling for SMP VMs
2. OS vs. VMM
Hyper-V High Availability and Live Migration
Yabusame: postcopy live migration for qemu/kvm
Scheduler Support for Video-oriented Multimedia on Client-side Virtualization
4. Memory virtualization and management

What's hot (20)

PPTX
5. IO virtualization
PDF
Live VM Migration
PPTX
Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compl...
PPTX
1.Introduction to virtualization
PDF
Memory Virtualization
PDF
Xen Memory Management
PDF
VM Live Migration Speedup in Xen
PPT
Application Live Migration in LAN/WAN Environment
PDF
Virtual Machine Migration Techniques in Cloud Environment: A Survey
PPTX
Vm migration techniques
PPTX
webinar vmware v-sphere performance management Challenges and Best Practices
PDF
Virtualization and cloud Computing
PPTX
Virtual Machine Migration & Hypervisors
PPSX
Redesigning Xen Memory Sharing (Grant) Mechanism
PDF
Virtual Asymmetric Multiprocessor for Interactive Performance of Consolidated...
PPTX
Virtualization & Network Connectivity
PPTX
Introduction to Virtualization, Virsh and Virt-Manager
PDF
Virtualization Technology Overview
PPTX
Virtualization 101 - DeepDive
PPTX
cloud computing: Vm migration
5. IO virtualization
Live VM Migration
Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compl...
1.Introduction to virtualization
Memory Virtualization
Xen Memory Management
VM Live Migration Speedup in Xen
Application Live Migration in LAN/WAN Environment
Virtual Machine Migration Techniques in Cloud Environment: A Survey
Vm migration techniques
webinar vmware v-sphere performance management Challenges and Best Practices
Virtualization and cloud Computing
Virtual Machine Migration & Hypervisors
Redesigning Xen Memory Sharing (Grant) Mechanism
Virtual Asymmetric Multiprocessor for Interactive Performance of Consolidated...
Virtualization & Network Connectivity
Introduction to Virtualization, Virsh and Virt-Manager
Virtualization Technology Overview
Virtualization 101 - DeepDive
cloud computing: Vm migration
Ad

Viewers also liked (18)

PDF
Introduction to virtualization
PPTX
Master VMware Performance and Capacity Management
PDF
GPU Virtualization on VMware's Hosted I/O Architecture
PDF
Task-aware Virtual Machine Scheduling for I/O Performance
PDF
가상화와 보안 발표자료
PPT
GPU Virtualization in Embedded Automotive Solutions
PPTX
VDI and Application Virtualization
PPTX
프로그래머가 몰랐던 멀티코어 CPU 이야기 - 15, 16장
PPTX
Virtual machines and their architecture
PDF
Dave Gilbert - KVM and QEMU
PPTX
QEMU - Binary Translation
PPTX
Virtual desktop infrastructure
PDF
Virtualization with KVM (Kernel-based Virtual Machine)
PPTX
PPSX
Virtualization basics
PDF
Virtualization presentation
PPT
Virtualization in cloud computing ppt
PDF
Presentation
Introduction to virtualization
Master VMware Performance and Capacity Management
GPU Virtualization on VMware's Hosted I/O Architecture
Task-aware Virtual Machine Scheduling for I/O Performance
가상화와 보안 발표자료
GPU Virtualization in Embedded Automotive Solutions
VDI and Application Virtualization
프로그래머가 몰랐던 멀티코어 CPU 이야기 - 15, 16장
Virtual machines and their architecture
Dave Gilbert - KVM and QEMU
QEMU - Binary Translation
Virtual desktop infrastructure
Virtualization with KVM (Kernel-based Virtual Machine)
Virtualization basics
Virtualization presentation
Virtualization in cloud computing ppt
Presentation
Ad

Similar to CPU Scheduling for Virtual Desktop Infrastructure (20)

PDF
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
PDF
An Updated Performance Comparison of Virtual Machines and Linux Containers
PDF
Latest (storage IO) patterns for cloud-native applications
PPTX
Operating System
PDF
VMworld 2013: How SRP Delivers More Than Power to Their Customers
PDF
AIST Super Green Cloud: lessons learned from the operation and the performanc...
PPTX
ClickOS_EE80777777777777777777777777777.pptx
PPTX
24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs
PDF
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
PDF
load-balancing-method-for-embedded-rt-system-20120711-0940
PPTX
Performance of Microservice frameworks on different JVMs
PDF
Exchange 2010 New England Vmug
PPTX
OS for AI: Elastic Microservices & the Next Gen of ML
PDF
My network functions are virtualized, but are they cloud-ready
PDF
Mastering Real-time Linux
PPTX
Surviving the Crisis With the Help of Oracle Database Resource Manager
PDF
VMworld 2013: Low-Cost, High-Performance Storage for VMware Horizon Desktops
PDF
A Review of Storage Specific Solutions for Providing Quality of Service in St...
PDF
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
PDF
A Performance Comparison of Container-based Virtualization Systems for MapRed...
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
An Updated Performance Comparison of Virtual Machines and Linux Containers
Latest (storage IO) patterns for cloud-native applications
Operating System
VMworld 2013: How SRP Delivers More Than Power to Their Customers
AIST Super Green Cloud: lessons learned from the operation and the performanc...
ClickOS_EE80777777777777777777777777777.pptx
24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
load-balancing-method-for-embedded-rt-system-20120711-0940
Performance of Microservice frameworks on different JVMs
Exchange 2010 New England Vmug
OS for AI: Elastic Microservices & the Next Gen of ML
My network functions are virtualized, but are they cloud-ready
Mastering Real-time Linux
Surviving the Crisis With the Help of Oracle Database Resource Manager
VMworld 2013: Low-Cost, High-Performance Storage for VMware Horizon Desktops
A Review of Storage Specific Solutions for Providing Quality of Service in St...
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
A Performance Comparison of Container-based Virtualization Systems for MapRed...

Recently uploaded (20)

PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
Sustainable Sites - Green Building Construction
PPTX
Lecture Notes Electrical Wiring System Components
PPT
Project quality management in manufacturing
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
composite construction of structures.pdf
PPTX
additive manufacturing of ss316l using mig welding
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
Welding lecture in detail for understanding
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
PPT on Performance Review to get promotions
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
Digital Logic Computer Design lecture notes
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Geodesy 1.pptx...............................................
PPTX
Internet of Things (IOT) - A guide to understanding
DOCX
573137875-Attendance-Management-System-original
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Sustainable Sites - Green Building Construction
Lecture Notes Electrical Wiring System Components
Project quality management in manufacturing
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Foundation to blockchain - A guide to Blockchain Tech
composite construction of structures.pdf
additive manufacturing of ss316l using mig welding
UNIT-1 - COAL BASED THERMAL POWER PLANTS
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Welding lecture in detail for understanding
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT on Performance Review to get promotions
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Digital Logic Computer Design lecture notes
Automation-in-Manufacturing-Chapter-Introduction.pdf
Geodesy 1.pptx...............................................
Internet of Things (IOT) - A guide to understanding
573137875-Attendance-Management-System-original

CPU Scheduling for Virtual Desktop Infrastructure

  • 1. CPU Scheduling for Virtual Desktop Infrastructure PhD Defense Hwanju Kim 2012-11-16
  • 2. Virtual Desktop Infrastructure (VDI) • Desktop provisioning Dedicated workstations VM VM VM VM VM - Energy wastage by idle desktops - Resource underutilization - High management cost - High maintenance cost - Low level of security + Energy savings by consolidation + High resource utilization + Low management cost (flexible HW/SW provisioning) + Low maintenance cost (dynamic HW/SW upgrade) + High level of security (centralized data containment) VM-based shared environments 2/35
  • 3. Hardware Virtual Machine Monitor (VMM) Desktop Consolidation • Distinctive workload characteristics • High consolidation ratio • 4:1~15:1 [VMware VDI], 6~8 per core [Botelho’08] • Diverse user-dependent workloads • Light users and knowledgeable workers coexist • Multi-layer mixed workloads • Multi-tasking (interactive+background) in a consolidated VM VM VM VM VM VM VM VM VM VM Mixed Interactive CPU-intensive Parallel 3/35
  • 4. VM Challenges on CPU Scheduling • Challenges due to the primary principles of VMM, compared to OS scheduling research pCPU VMM scheduler pCPU vCPU vCPU OS scheduler vCPU OS scheduler VMM vCPU vCPU OS scheduler Task Task Task Task Task TaskTask Task VMVM 1. Semantic gap ( OS independence) : Two independent scheduling layers 2. Scarce Information ( Small TCB) : Difficulty in extracting workload characteristics 3. Inter-VM fairness ( Performance isolation) : Favoring a VM must not compromise inter-VM fairness • I/O operations • Privileged instructions • Process and thread information • Inter-process communications • I/O operations and semantics • System calls • etc… Each VM is virtualized as a black box I believe I’m on a dedicated machine Lightweightness (No cross-layer optimization) Efficiency (Intelligent VMM) 4/35
  • 5. VMVM The Goals of This Thesis • The enlightened CPU scheduling of VMM for consolidated desktops • Efficient CPU management with lightweight VMM extensions VMM scheduler VMM vCPU vCPU vCPU vCPU VM Interactive workload ThreadThreadThread Background workload ThreadThreadThread VM Communicating workload Thread Thread Enlightening about diverse workload demands inside a VM Base: CPU bandwidth partitioning for performance isolation Design principles 1. OS-independence: VMM-level solutions without OS-dependent optimizations 2. Diversity: Identifying the computing demands of diverse workloads (including mixed workloads) 3. Inter-VM fairness: Performance isolation for multi-tenant environments 5/35
  • 6. Related Work Proposals References Design principles OS- independence Diversity Inter-VM fairness Proportional-share scheduling Xen, KVM, VMware ESX O X O Interactive & soft real-time scheduling [Lin et al., SC’05] [Lee et al., VEE’10] [Masrur et al., RTCSA’10] O X (User-directed, no mixed & communicating workloads) X OS-assisted scheduling [Kim et al., EuroPar’08] [Xia et al., ICPADS’09] X (OS-dependent optimization) X (No communicating workloads) O I/O-friendly scheduling [Govindan et al., VEE’07] [Ongaro et al., VEE’08] [Liao et al., ANCS’08] [Hu et al., HPDC’10] O X (Only I/O-intensive workloads) O Multiprocessor VM scheduling Relaxed coscheduling [VMware ESXi’10] [Sukwong et al., EuroSys’11] O X (No mixed workloads) O Spinlock-aware scheduling [Uhlig et al., VM’04] [Weng et al., HPDC’11] X (OS-dependent optimization) X (Only spinlock- intensive workloads) O Hybrid scheduling [Weng et al., VEE’09] O X (User-involved, no mixed workloads) O
  • 7. Overview
    • Introduction to "Task-aware VM scheduling" [Kim et al., VEE'09], [Kim et al., JPDC'11]
      + The first solution to mixed workloads in a consolidated VM
      + Simple and effective for I/O-bound interactive workloads
      - No consideration of multiprocessor VMs
      - Lacks the ability to support modern interactive workloads
    • Proposals for multiprocessor VM scheduling: efficient scheduling for multithreaded workloads hosted on multiprocessor VMs
      - "Demand-based coordinated scheduling"
      - "Virtual asymmetric multiprocessor" (an implementation extension of task-based priority boosting)
  • 8. Demand-Based Coordinated Scheduling for Multiprocessor VMs
    • How can multithreaded (communicating or parallel) workloads hosted in multiprocessor VMs be scheduled effectively?
  • 9. Why Coordinated Scheduling?
    • Uncoordinated scheduling: each vCPU is treated as an independent, time-shared entity regardless of its sibling vCPUs
    • Coordinated scheduling: sibling vCPUs are scheduled as a coordinated group by the VMM scheduler
    • Why is coordination needed?
      • Many applications are multithreaded and parallelized: multiple threads perform a job, communicating with one another to arbitrate access to shared resources
      • Uncoordinated scheduling makes inter-thread communication ineffective: a lock holder can be inactive (descheduled) while its lock waiters remain active
      • Similar to traditional job-scheduling issues in distributed environments: a multicore machine resembles a distributed environment
  • 10. Coordination Space
    • Space domain: pCPU assignment policy
      • Where is each sibling vCPU of a coordinated group assigned?
    • Time domain: preemptive scheduling policy
      • When, and which, sibling vCPUs are preemptively scheduled (e.g., co-scheduling)
  • 11. Space Domain: pCPU Assignment
    • A naive method: "balance scheduling" [Sukwong et al., EuroSys'11]
      • Spread sibling vCPUs on separate pCPUs, which raises the likelihood of co-scheduling (probabilistic co-scheduling)
      • No coordination in the time domain
    • Limitation: the unrealistic assumption that "CPU load is well balanced"
      • In practice, VMs with equal CPU shares have different numbers of vCPUs, different thread-level parallelism, and phase-changing multithreaded workloads, so some pCPUs become highly contended while others carry larger CPU shares
  • 12. Space Domain: pCPU Assignment
    • Proposed scheme: "load-conscious balance scheduling", a hybrid of balance scheduling and load-based assignment (a minimal sketch follows)
      • If no candidate pCPU is overloaded, use balance scheduling
      • Otherwise, use load-based assignment
    • Example: candidate pCPU set = {pCPU0, pCPU1, pCPU2, pCPU3}; pCPU3 is overloaded (i.e., its CPU load > average CPU load), so the scheduler assigns the waking vCPU to the lowest-loaded pCPU among the rest
    • How about contention between sibling vCPUs? Passed on to coordination in the time domain
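    To make the assignment policy concrete, here is a minimal C sketch of load-conscious balance scheduling. The helpers (pcpu_load(), avg_pcpu_load(), sibling_on()) are hypothetical stand-ins, not Xen's actual scheduler API:

        /* Minimal sketch of load-conscious balance scheduling; helper
         * functions are hypothetical stand-ins for the VMM's internals. */
        #define NR_PCPUS 4

        extern unsigned long pcpu_load(int pcpu);    /* current load of a pCPU */
        extern unsigned long avg_pcpu_load(void);    /* system-wide average */
        extern int sibling_on(int pcpu);             /* 1 if a sibling vCPU of
                                                        the waking VM is there */

        /* Pick a pCPU for a waking vCPU. */
        int assign_pcpu(void)
        {
            int p, best = -1;

            /* Balance scheduling: prefer non-overloaded pCPUs that hold no
             * sibling vCPU, so siblings spread out and co-scheduling of the
             * whole VM becomes likely. */
            for (p = 0; p < NR_PCPUS; p++) {
                if (sibling_on(p) || pcpu_load(p) > avg_pcpu_load())
                    continue;
                if (best < 0 || pcpu_load(p) < pcpu_load(best))
                    best = p;
            }
            if (best >= 0)
                return best;

            /* Load-based assignment: every balanced candidate is overloaded,
             * so take the lowest-loaded pCPU overall, even if a sibling runs
             * there; sibling contention is deferred to the time domain. */
            for (p = 0; p < NR_PCPUS; p++)
                if (best < 0 || pcpu_load(p) < pcpu_load(best))
                    best = p;
            return best;
        }

    The fallback accepts sibling stacking rather than queueing a vCPU behind an overloaded pCPU, which is exactly why the time-domain coordination on the following slides is needed.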
  • 13. Time Domain: Preemption Policy
    • What type of contention demands coordination?
      • Busy-waiting for communication (or synchronization): unnecessary CPU consumption while busy-waiting for a descheduled (inactive) vCPU causes significant performance degradation
    • Why is this serious in multiprocessor VMs? The semantic gap: OSes make liberal use of busy-waiting (e.g., spinlocks) because they believe their CPUs are always online (i.e., dedicated)
    • "Demand-based coordinated scheduling" must answer three questions:
      • When and where is coordination demanded?
      • Does busy-waiting really matter?
      • How can the coordination demand be detected?
  • 14. Time Domain: Preemption Policy
    • When and where is coordination demanded? Experimental analysis
      • 13 emerging multithreaded applications from the PARSEC suite, with diverse characteristics
      • Metric: kernel time ratio under consolidation, since busy-waiting occurs in kernel space
      • Setup: a VM with 8 vCPUs on 8 pCPUs; solorun (no consolidation) vs. corun (with one VM running streamcluster)
    • (Figure: kernel vs. user CPU time per application, solorun and corun) The kernel time ratio is amplified by 1.3x~30x under consolidation
  • 15. Time Domain: Preemption Policy
    • Where is the kernel time amplified? CPU cycles per function (share of total kernel CPU cycles in parentheses):
      • TLB shootdown: dedup 43% (of 83%), ferret 9% (of 11%), vips 41% (of 47%)
      • Lock spinning: bodytrack 5% (of 8%), canneal 4% (of 5%), dedup 36% (of 83%), facesim 4% (of 5%), streamcluster 10% (of 11%), swaptions 5% (of 6%), vips 4% (of 47%), x264 7% (of 8%)
  • 16. Time Domain: Preemption Policy
    • TLB shootdown: notification of TLB invalidation to a remote CPU
      • The TLB (Translation Lookaside Buffer) is a per-CPU cache of virtual address mappings
      • When a thread modifies or unmaps a shared mapping (V->P1 becomes V->P2 or V->Null), it sends an inter-processor interrupt (IPI) and busy-waits until all corresponding remote TLB entries are invalidated
      • Efficient in native systems, but not in virtualized systems when the target vCPUs are not scheduled
    • "A TLB shootdown IPI is a signal for coordination demand!" Co-schedule the IPI-recipient vCPUs with the sender vCPU (see the sketch below)
    • (Figure: TLB shootdown IPI traffic, IPIs/sec/vCPU, across the PARSEC applications)
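    As a hedged illustration of the detection path, the sketch below treats an intercepted guest IPI carrying a TLB-shootdown vector as a co-scheduling request. The vector value and all helpers are assumptions standing in for the VMM's virtual-interrupt emulation (Linux's actual invalidate vectors vary by kernel version):

        /* Minimal sketch: a TLB-shootdown IPI as a co-scheduling signal. */
        #define TLB_SHOOTDOWN_VECTOR 0xfd   /* assumed guest vector number */

        struct vcpu;
        extern int  vcpu_is_running(struct vcpu *v);
        extern int  pcpu_of(struct vcpu *v);
        extern void schedule_urgently_on(int pcpu, struct vcpu *v);

        /* Called when the VMM emulates delivery of a guest-to-guest IPI. */
        void on_guest_ipi(struct vcpu *sender, struct vcpu *dest, int vector)
        {
            (void)sender;
            if (vector != TLB_SHOOTDOWN_VECTOR)
                return;
            /* The sender busy-waits until every recipient acknowledges the
             * flush, so a descheduled recipient stalls the sender; run the
             * recipient immediately alongside the sender. */
            if (!vcpu_is_running(dest))
                schedule_urgently_on(pcpu_of(dest), dest);
        }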
  • 17. Time Domain: Preemption Policy
    • Lock spinning: which spinlocks show dominant wait time?
      • (Figure: spinlock wait time breakdown per application) The futex wait-queue lock dominates (81~89%), followed by the semaphore wait-queue lock
      • Futex: kernel support for user-level synchronization (e.g., mutex, barrier, condvar)
    • Lock-holder preemption (LHP): in futex_wake(), the waking vCPU grabs the wait-queue spinlock (spin_lock(queue->lock)) while waking the waiter; if that vCPU is preempted inside this critical section, the woken vCPU later busy-waits on the preempted spinlock when it enters the futex code itself
    • "A reschedule IPI is a signal for coordination demand!" Delay preemption of an IPI-sending vCPU until the likely-held spinlock is released
  • 18. Time Domain: Preemption Policy
    • Proposed scheme: urgent vCPU first (UVF) scheduling (a sketch follows)
      • Urgent vCPUs are placed in a per-pCPU urgent queue and served in FIFO order ahead of the runqueue's proportional-shares order, as long as inter-VM fairness is kept
      • A vCPU in the urgent state is protected from preemption during an urgent time slice (utslice)
    • The utslice must be long enough for a reschedule-IPI sender to release its spinlock, yet short enough to quickly serve multiple urgent vCPUs
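    A minimal sketch of the UVF pick-next decision, assuming hypothetical queue helpers and a simplified inter-VM fairness check:

        /* Minimal sketch of urgent vCPU first (UVF) scheduling. */
        #include <stdbool.h>

        struct vcpu { unsigned long utslice_left_us; };
        struct queue;
        extern bool         queue_empty(struct queue *q);
        extern struct vcpu *dequeue_fifo(struct queue *q);      /* urgent queue */
        extern struct vcpu *dequeue_by_shares(struct queue *q); /* runqueue */
        extern void         enqueue_by_shares(struct queue *q, struct vcpu *v);
        extern bool         within_fair_share(struct vcpu *v);  /* fairness check */

        struct queue *urgent_q, *run_q;     /* per-pCPU queues */

        struct vcpu *pick_next(void)
        {
            /* Urgent vCPUs run first, in FIFO order, each protected from
             * preemption for one utslice, but only while their VM stays
             * within its fair CPU share. */
            while (!queue_empty(urgent_q)) {
                struct vcpu *v = dequeue_fifo(urgent_q);
                if (within_fair_share(v)) {
                    v->utslice_left_us = 500;   /* 500us: long enough to release
                                                   a spinlock, short enough to
                                                   serve several urgent vCPUs */
                    return v;
                }
                enqueue_by_shares(run_q, v);    /* over its share: demote */
            }
            return dequeue_by_shares(run_q);    /* proportional-shares order */
        }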
  • 19. Evaluation
    • Utslice parameter, part 1: utslice for reducing LHP
      • Workloads: a futex-intensive workload (bodytrack, facesim, or streamcluster) in one VM, plus dedup in another VM as a preempting VM
      • (Figure: futex wait-queue LHP count vs. utslice) A utslice above 300us reduces LHP by about 2x~3.8x
      • The remaining LHPs occur during local wake-up or before reschedule-IPI transmission, and are unlikely to lead to lock contention
  • 20. Evaluation
    • Utslice parameter, part 2: utslice for quickly serving multiple urgent vCPUs
      • Workloads: 3 VMs, each running vips (a TLB-IPI-intensive application)
      • (Figure: spinlock cycles, TLB cycles, and execution time vs. utslice) As the utslice increases, TLB shootdown cycles increase, degrading execution time by up to ~11%
      • 500usec is an appropriate utslice for both LHP reduction and serving multiple urgent vCPUs
  • 21. Evaluation
    • Workload consolidation: one 8-vCPU VM plus four 1-vCPU VMs (x264)
    • Compared schemes: Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, LC-Balance+Resched-DP+TLB-Co
    • (Figure: normalized execution time of the 8-vCPU VM's workloads) Multiprocessor VMs need coordination in the time domain (~90% improvement)
    • (Figure: normalized execution time of the co-running 1-vCPU VMs) Plain balance scheduling degrades the 1-vCPU VMs by incurring unnecessary contention
  • 22. Summary
    • Contributions
      • Load-conscious balance scheduling: essential for heterogeneously consolidated environments, where load imbalance usually takes place
      • IPI-driven coordinated scheduling: lets the VMM alleviate unnecessary CPU contention based on IPIs between sibling vCPUs
    • Future work
      • Combining this scheduling-based method with contention-management methods such as paravirtual spinlocks and HW-based spin detection
  • 23. Virtual Asymmetric Multiprocessor for User-Interactive Performance
    • How can the performance of user-interactive workloads mixed into multiprocessor VMs be improved?
  • 24. Motivation
    • Background and idea
      • The initial proposal of task-aware scheduling did not consider multiprocessor VMs
      • Existing VMM schedulers give each VM an illusion of a symmetric multiprocessor (virtual SMP): in the absence of mixed-workload tracking, vCPUs hosting interactive and background tasks are equally contended regardless of user interactions
    • Proposal: virtual AMP (vAMP), where the "size" of a vCPU is its amount of CPU shares; vCPUs hosting interactive work become fast vCPUs, and those hosting background work become slow vCPUs
  • 25. Workload Classification
    • Previous methods
      • Time-quanta-based classification: "interactive workloads typically show short time quanta"
        + Clear classification between I/O-bound and CPU-bound tasks
        - Modern interactive workloads show mixed behaviors
        - A multithreaded CPU-bound job also shows short time quanta due to inter-thread communication
      • OS technique: user-I/O-driven IPC tracking [Zheng et al., SIGMETRICS'10] (e.g., X server -> Terminal -> Firefox forming an interactive task group)
        + Identifies the set of tasks involved in a user interaction (I/O)
        - Relies on various OS-level IPC structures (e.g., sockets, pipes, signals), which the VMM cannot access
  • 26. Workload Classification
    • Proposed scheme: "background workload identification"
      • Instead of tracking interactive workloads, identify "background CPU noise" at the time of a user I/O
    • Rationales
      • Interactive CPU load is typically initiated by user I/O
      • The VMM can unobtrusively monitor user I/O and per-task CPU load
    • Exceptional case: multimedia workloads (e.g., video playback)
      • Filter multimedia tasks out of the background set: tasks requesting audio I/O are exempted
  • 27. Virtual Asymmetric Multiprocessor
    • vAMP: dynamically adjust the CPU shares of a vCPU according to the task it currently hosts
      1. Maintain per-task CPU load during the pre-I/O period, which is set shorter than general user think time (1 second by default)
      2. Tag tasks that have generated nontrivial CPU load as background tasks; the threshold can be set to filter daemon tasks that may serve interactive workloads
      3. Dynamically adjust each vCPU's shares based on a weight ratio (e.g., background : non-background = 1:5)
      4. Provide vAMP during an interactive episode, which restarts on another user I/O and finishes when a maximum time elapses without user I/O
    • A minimal sketch of step 3 follows
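    A minimal sketch of step 3: the weight adjustment the VMM could apply whenever it observes a guest task switch (e.g., via a CR3 update). The structures, the 1:5 weights, and the hooks are illustrative, not the thesis implementation:

        /* Minimal sketch of vAMP share adjustment on a guest task switch. */
        struct task_info { int is_background; };     /* tagged in step 2 */
        struct vcpu_s    { unsigned int weight; };

        #define BG_WEIGHT     1
        #define NON_BG_WEIGHT 5     /* background : non-background = 1 : 5 */

        extern int interactive_episode_active(void);

        void on_guest_task_switch(struct vcpu_s *v, struct task_info *next)
        {
            if (!interactive_episode_active()) {
                v->weight = NON_BG_WEIGHT;   /* vSMP: all vCPUs identical */
                return;
            }
            /* vAMP: a vCPU hosting a tagged background task becomes slow,
             * any other vCPU becomes fast. */
            v->weight = next->is_background ? BG_WEIGHT : NON_BG_WEIGHT;
        }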
  • 28. Limitation
    • An intrinsic limitation of the VMM-only approach: it manipulates only a single scheduling layer (the VMM scheduler)
    • The guest OS scheduler is vAMP-oblivious: agnostic of the underlying asymmetry (it sees all vCPUs as identical), it may multiplex interactive and background tasks on the same vCPU
    • A slow vCPU has higher scheduling latency, so frequent multiplexing can offset the benefit of vAMP
    • (Figure: a scheduling trace of background and non-background tasks during a Google Chrome launch) An aggressive weight ratio is not always effective when multiplexing happens frequently; the weight ratio is an important parameter for interactive performance
  • 29. Guest OS Extension
    • OS enlightenment about vAMP: avoid ineffective multiplexing of interactive and background tasks on the same vCPU through isolation
    • Design principles
      • Keep the VMM OS-independent: the extension is optional, for further enhancement of interactive performance
      • Keep the extension OS-independent: no reliance on specific OS functionality; isolating tasks on separate CPUs (e.g., by modifying CPU affinity) is a general interface of commodity OSes
      • Small kernel changes, for low maintenance cost
  • 30. Guest OS Extension
    • Linux extension for vAMP
      • A user-level vAMP-daemon isolates the background tasks exposed by the VMM from non-background tasks
      • Small kernel changes expose the background task list to user space: the daemon is woken through an event-driven input interface, reads the list via procfs, and isolates the tasks via the cpuset interface
    • Isolation procedure (a sketch follows)
      1. Initially dedicate nr_fast_vcpus to interactive (i.e., non-background) tasks; the default nr_fast_vcpus is 1, owing to the low thread-level parallelism of interactive workloads [Blake et al., ISCA'10]
      2. Periodically increase nr_fast_vcpus when the fast vCPUs become fully utilized, and periodically check for the end of the interactive episode to stop isolation
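    The isolation step might look like this user-level sketch. The procfs path written by the kernel change is hypothetical; sched_setaffinity() is Linux's standard affinity interface, used here in place of the cpuset interface for brevity:

        /* Minimal user-level sketch of one vAMP-daemon isolation pass. */
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>
        #include <sys/types.h>

        #define NR_VCPUS      8
        #define NR_FAST_VCPUS 1   /* default: interactive workloads have low TLP */

        static void pin_to_slow_vcpus(pid_t pid)
        {
            cpu_set_t set;
            int cpu;

            CPU_ZERO(&set);
            for (cpu = NR_FAST_VCPUS; cpu < NR_VCPUS; cpu++)
                CPU_SET(cpu, &set);          /* every vCPU beyond the fast ones */
            sched_setaffinity(pid, sizeof(set), &set);
        }

        int main(void)
        {
            /* Hypothetical file through which the small kernel change relays
             * the background-task IDs received from the VMM. */
            FILE *f = fopen("/proc/vamp/background_tasks", "r");
            int pid;

            if (!f)
                return 1;
            while (fscanf(f, "%d", &pid) == 1)
                pin_to_slow_vcpus(pid);      /* fast vCPUs stay interactive-only */
            fclose(f);
            return 0;
        }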
  • 31. Evaluation
    • Application launch with a background workload: a data-mining application (freqmine) with 8 threads in an 8-vCPU VM on 8 pCPUs, launch times measured through a remote desktop client
    • Weight ratios (background : non-background): vAMP(L)=1:3, vAMP(M)=1:9, vAMP(H)=1:18
    • (Figure: normalized average launch time of Impress, Firefox, Chrome, and Gimp)
      • vAMP improves launch performance by 7~40%
      • A high weight ratio is ineffective because of the negative effect of multiplexing
      • The guest OS extension improves interactive performance further, by up to 70%
      • Why did Gimp show significant improvement even without the guest OS extension? (next slide)
  • 32. Evaluation
    • Application launch: Chrome vs. Gimp (without the guest OS extension)
      • Chrome (web browser): many threads are cooperatively scheduled in a fine-grained manner, so background and non-background tasks are frequently multiplexed
      • Gimp (image-editing program): a single thread dominates the computation with little communication, so it benefits from a fast vCPU even without isolation
  • 33. Evaluation
    • Media player: VLC playing a 1920x800 HD video at 23.976 frames per second (FPS), against freqmine in another 8-vCPU VM on 8 pCPUs; "Mult" denotes multimedia workload filtering
    • (Figure: average FPS per scheme)
      • Without multimedia workload filtering, VLC is misidentified as a background task
      • vAMP improves playback quality to up to 22.3 FPS, but a high weight ratio still degrades the quality
      • The guest OS extension achieves 23.8 FPS
  • 34. Summary
    • vAMP: dynamically varying vCPU performance based on the workloads the vCPUs currently host is a feasible method of improving interactive performance
    • Assisted by a simple guest OS extension: isolating the different types of workloads enhances the effectiveness of vAMP
    • Future work: collaboration between the VMM and OSes for vAMP through a standard, well-defined API
  • 35. Conclusions
    • Lessons learned from the thesis
      • In-depth analysis of OSes and workloads can realize intelligent CPU scheduling based only on VMM-visible events, achieving both a lightweight VMM and efficiency
      • Task-awareness is an essential ability for the VMM to handle mixed workloads effectively: multi-tasking is ubiquitous inside every VM
      • Coordinated scheduling improves the CPU efficiency of multiprocessor VMs: resolving unnecessary CPU contention is crucial
  • 36. Publications
    • Task-aware VM scheduling
      • [VEE'09] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, "Task-aware Virtual Machine Scheduling for I/O Performance"
      • [JPDC'11] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, Joonwon Lee, Seungryoul Maeng, "Transparently Bridging Semantic Gap in CPU Management for Virtualized Environments"
      • [MMSys'12] Hwanju Kim, Jinkyu Jeong, Jaeho Hwang, Joonwon Lee, Seungryoul Maeng, "Scheduler Support for Video-oriented Multimedia on Client-side Virtualization"
      • [ApSys'12] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, "Virtual Asymmetric Multiprocessor for Interactive Performance of Consolidated Desktops"
    • Demand-based coordinated scheduling
      • [ASPLOS'13] Hwanju Kim, Sangwook Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, "Demand-Based Coordinated Scheduling for SMP VMs"
    • Other work on virtualization
      • [IEEE TC'11] Hwanju Kim, Heeseung Jo, and Joonwon Lee, "XHive: Efficient Cooperative Caching for Virtual Machines"
      • [IEEE TC'10] Heeseung Jo, Hwanju Kim, Jae-Wan Jang, Joonwon Lee, and Seungryoul Maeng, "Transparent Fault Tolerance of Device Drivers for Virtual Machines"
      • [MICRO'10] Daehoon Kim, Hwanju Kim, and Jaehyuk Huh, "Virtual Snooping: Filtering Snoops in Virtualized Multi-cores"
      • [VHPC'11] Sangwook Kim, Hwanju Kim, and Joonwon Lee, "Group-Based Memory Deduplication for Virtualized Clouds"
      • [Euro-Par'08] Dongsung Kim, Hwanju Kim, Myeongjae Jeon, Euiseong Seo, Joonwon Lee, "Guest-Aware Priority-based Virtual Machine Scheduling for Highly Consolidated Server"
      • [VHPC'09] Heeseung Jo, Youngjin Kwon, Hwanju Kim, Euiseong Seo, Joonwon Lee, Seungryoul Maeng, "SSD-HDD-Hybrid Virtual Disk in Consolidated Environments"
    • Other work on embedded and mobile systems
      • [ACM TECS'12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, "Rigorous Rental Memory Management for Embedded Systems"
      • [CASES'12] Jinkyu Jeong, Hwanju Kim, Jeaho Hwang, Joonwon Lee, and Seungryoul Maeng, "DaaC: Device-reserved Memory as an Eviction-based File Cache"
      • [IEEE TCE'09] Heeseung Jo, Hwanju Kim, Hyun-Gul Roh, Joonwon Lee, "Improving the Startup Time of Digital TV"
      • [IEEE TCE'09] Heeseung Jo, Hwanju Kim, Jinkyu Jeong, Joonwon Lee, and Seungryoul Maeng, "Optimizing the Startup Time of Embedded Systems: A Case Study of Digital TV"
      • [IEEE TCE'10] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jin-Soo Kim, and Joonwon Lee, "AppWatch: Detecting Kernel Bug for Protecting Consumer Electronics Applications"
      • [IEEE TCE'12] Jeaho Hwang, Jinkyu Jeong, Hwanju Kim, Jeonghwan Choi, and Joonwon Lee, "Compressed Memory Swap for QoS of Virtualized Embedded Systems"
      • [SPE'10] Jinkyu Jeong, Euiseong Seo, Jeonghwan Choi, Hwanju Kim, Heeseung Jo, and Joonwon Lee, "KAL: Kernel-assisted Non-invasive Memory Leak Tolerance with a General-purpose Memory Allocator"
  • 38. References
    • [Blake et al., ISCA'10] Evolution of thread-level parallelism in desktop applications
    • [Botelho'08] Virtual machines per server, a viable metric for hardware selection? (http://itknowledgeexchange.techtarget.com/server-farm/virtual-machines-per-server-a-viable-metric-for-hardware-selection/)
    • [Govindan et al., VEE'07] Xen and co.: communication-aware CPU scheduling for consolidated xen-based hosting platforms
    • [Hu et al., HPDC'10] I/O scheduling model of virtual machine based on multi-core dynamic partitioning
    • [Kim et al., EuroPar'08] Guest-Aware Priority-Based Virtual Machine Scheduling for Highly Consolidated Server
    • [Kim et al., VEE'09] Task-aware virtual machine scheduling for I/O performance
    • [Kim et al., JPDC'11] Transparently Bridging Semantic Gap in CPU Management for Virtualized Environments
    • [Lee et al., VEE'10] Supporting Soft Real-Time Tasks in the Xen Hypervisor
    • [Liao et al., ANCS'08] Software techniques to improve virtualized I/O performance on multi-core systems
    • [Lin et al., SC'05] VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling
    • [Masrur et al., RTCSA'10] VM-Based Real-Time Services for Automotive Control Applications
    • [Ongaro et al., VEE'08] Scheduling I/O in virtual machine monitors
    • [Sukwong et al., EuroSys'11] Is co-scheduling too expensive for SMP VMs?
    • [Uhlig et al., VM'04] Towards scalable multiprocessor virtual machines
    • [VMware ESXi'10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1
    • [VMware VDI] Enabling your end-to-end virtualization solution (http://www.vmware.com/solutions/partners/alliances/hp-vmware-customers.html)
    • [Weng et al., HPDC'11] Dynamic adaptive scheduling for virtual machines
    • [Weng et al., VEE'09] The hybrid scheduling framework for virtual machine systems
    • [Xia et al., ICPADS'09] PaS: A Preemption-aware Scheduling Interface for Improving Interactive Performance in Consolidated Virtual Machine Environment
    • [Zheng et al., SIGMETRICS'10] RSIO: automatic user interaction detection and scheduling
  • 41. Proportional-Share Scheduler
    • Proportional-share scheduling for SMP VMs: the common scheduler of commodity VMMs, employed by KVM, Xen, VMware ESX, etc.
      • VM's shares S = total shares x (weight / total weight)
      • vCPU's shares = S / number of active (non-idle) vCPUs
    • Example: a 4-vCPU VM with S = 1024 gives 1024 shares to one vCPU under a single-threaded workload, or 256 shares to each of four vCPUs under a multi-threaded (programmed) workload
    • Existing schedulers thus view active vCPUs as containers of identical power: symmetric vCPUs
  • 42. Helping Lock
    • Spin-then-block lock [AMD, XenSummit'08]: block after spinning for a certain period
      + Reduces unnecessary spinning
      - Still suffers LHP and some unnecessary spinning
      - Requires profiling to find a suitable spin threshold, and kernel instrumentation
    • Nevertheless, the most popular paravirtualized approach for open-source kernels like Linux: paravirt-spinlock for Xen Linux (mainline) and for KVM Linux (patch)
  • 43. Coordination for User-level Contention
    • User-level synchronization: pure spin-based synchronization is rarely used in user space; block-based or spin-then-block synchronization dominates
    • Reschedule-IPI-driven co-scheduling: with spin-then-block synchronization, co-scheduling cooperative threads reduces contention
    • (Figures: reschedule IPI traffic of streamcluster; execution time of streamcluster consolidated with bodytrack) Streamcluster intensively uses spin-then-block barriers, and Resched-Co alleviates the spin phase of the lock wait time
  • 44. Performance on PLE
    • PLE (Pause-Loop Exiting): a HW mechanism that notifies the VMM of spinning beyond a predefined threshold (i.e., pathological busy-waiting); in response, the VMM makes the currently running vCPU yield its pCPU
    • (Figures: facesim, futex-intensive; ferret, TLB-IPI-intensive) IPI-driven scheduling proactively alleviates unnecessary contention, whereas PLE reactively relieves contention that has already happened
  • 45. Evaluation: Urgent Allowance
    • Urgent allowance: trading short-term fairness for CPU efficiency
    • How much short-term fairness is traded? Workload: 1 vips VM + 2 facesim VMs
    • (Figure) Trading short-term fairness improves overall efficiency without a negative impact on long-term fairness
  • 46. Evaluation: Two Multiprocessor VMs
    • Corun with dedup and with freqmine, compared against solorun
    • Schemes: a: baseline, b: balance, c: LC-balance, d: LC-balance+Resched-DP, e: LC-balance+Resched-DP+TLB-Co
    • (Figures: execution timelines per scheme)
  • 47. TLB Shootdown IPIs of Windows 7
    • Windows 7 desktop application launches make heavy use of TLB shootdown IPIs, mostly sent by multi/broadcasting:
      • Explorer: 102 triggers, 608 IPIs, 622 ms launch time
      • IE: 262 triggers, 1230 IPIs, 982 ms launch time
      • PowerPoint: 166 triggers, 782 IPIs, 975 ms launch time
      • Word: 179 triggers, 990 IPIs, 1108 ms launch time
      • Excel: 77 triggers, 418 IPIs, 1011 ms launch time
    • TLB-IPI-driven co-scheduling improves PowerPoint launch time by 23% when consolidated with 4 VMs, each running streamcluster
  • 48. Virtual Asymmetric Multiprocessor for User-Interactive Performance
  • 49. Multimedia Workload Filtering
    • Tracking audio-requesting tasks: tasks that access a virtual audio device, excluding audio access in an interrupt context (checked via the audio Interrupt Service Register, ISR)
    • Server-client sound systems: a user-level task serves all audio requests (e.g., pulseaudio), so remote wake-up tracking is needed
    • (Figure: one VM running VLC+facesim and one VM running freqmine; facesim severely interferes with remote wake-up tracking)
  • 50. Measurement Methodology
    • Spiceplay: snapshot-based record/replay, robust against varying loads; similar to VNCPlay [USENIX'05] and Deskbench [IM'09], implemented as an extension of the SPICE remote desktop client
      • Record: snapshot at an input point -> input recording -> snapshot at a user-perceived completion point
      • Replay: snapshot comparison and start timer -> input replaying -> snapshot comparison and stop timer
  • 51. vAMP Parameters
    • Default vAMP parameters
      • Background load threshold (for tagging background tasks): 50%, large enough to filter general daemon tasks such as an X server
      • Maximum time of an interactive episode (duration of distributing asymmetric CPU shares): 5 sec, large enough to cover a general interactive episode (2 sec was used in previous HCI-based research, but a larger value is needed to cover long-launching applications)
    • (Figure: video playback under vAMP(L) w/ Ext) With bgload_thresh=5, the X server is misclassified as a background task
    • (Figure: Gimp launch under vAMP(L) w/ Ext) With max_intr_episode=2sec, the interactive episode finishes prematurely before the end of the launch
  • 52. Evaluation: Background Performance
    • Performance of the background workloads under repeated launches at 1-second intervals, i.e., intensively interactive workloads
    • (Figure: normalized average execution time for Impress, Firefox, Chrome, and Gimp) 3~28% degradation
  • 53. Evaluation: Guest OS Extension
    • Interrupt pinning: an interactive workload can accompany I/O; even a warm launch can involve synchronous disk writes
    • During an interactive episode, pin I/O interrupts on fast vCPUs; in Linux, manipulate /proc/irq/<irq number>/smp_affinity (a sketch follows)
    • (Figure: Chrome launch time with and without pinning) Chrome's launch entails some synchronous writes; if a disk I/O interrupt is delivered to a slow vCPU, scheduling latency increases
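    A small sketch of the pinning step from inside the guest. The IRQ number and mask are illustrative; /proc/irq/<n>/smp_affinity itself is Linux's standard IRQ-affinity interface:

        /* Minimal sketch: steer an I/O interrupt onto the fast vCPU(s). */
        #include <stdio.h>

        int pin_irq_to_fast_vcpus(int irq, unsigned int mask)
        {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
            f = fopen(path, "w");
            if (!f)
                return -1;
            fprintf(f, "%x\n", mask);   /* e.g., 0x1 = vCPU0, the fast vCPU */
            fclose(f);
            return 0;
        }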
  • 54. Evaluation: Guest OS Extension
    • nr_fast_vcpus parameter: the initial number of fast vCPUs
    • (Figure: normalized average launch time for nr_fast_vcpus = 1, 2, 4) Interactive workloads with low thread-level parallelism do not require many initial fast vCPUs; such a workload is even adversely affected by multiple fast vCPUs, since unnecessary vCPU-level scheduling latency is involved
  • 55. Task-aware VM Scheduling for I/O Performance
  • 56. Problem of VM Scheduling
    • Task-agnostic scheduling: the VMM's run queue is sorted only by CPU fairness
    • When an I/O event arrives for an I/O-bound task in a VM whose priority is currently low (because a CPU-bound task in the same VM consumed its share), the VMM does not even know the event is destined for an I/O-bound task and cannot schedule that VM promptly
  • 57. Task-agnostic Scheduling
    • A worst-case example with 6 consolidated VMs: network response time
      • Native Linux: non-consolidated OS; XenoLinux: consolidated OS on Xen
      • I/O+CPU workload: 1 VM runs a server and a CPU-bound task; 5 VMs run CPU-bound tasks
      • I/O workload: 1 VM runs a server; 5 VMs run CPU-bound tasks
    • The boosting mechanism of the Xen Credit scheduler helps the I/O-only case, but responsiveness remains poor for the mixed case: boosting recognizes I/O-boundness only at vCPU granularity
  • 58. Task-aware VM Scheduling
    • Goals
      • Track I/O-boundness at task granularity
      • Improve the response time of I/O-bound tasks
      • Keep inter-VM fairness
    • Challenges
      1. I/O-bound task identification
      2. I/O event correlation
      3. Partial boosting
  • 59. Task-aware VM Scheduling: 1. I/O-bound Task Identification
    • Observable information at the VMM: I/O events, task switching events [Jones et al., USENIX'06], and the CPU time quantum of each task (e.g., on Intel x86, a task's quantum is delimited by CR3 updates)
    • Inference based on common OS techniques: general OS techniques (Linux, Windows, FreeBSD, ...) infer and favor I/O-bound tasks by
      1. A small CPU time quantum (main evidence)
      2. Preemptive scheduling in response to I/O events (supportive evidence)
  • 60. Task-aware VM Scheduling: 1. I/O-bound Task Identification
    • Three disjoint observation classes
      • Positive evidence (supports I/O-boundness): a small CPU time quantum followed by I/O-driven preemptive scheduling
      • Negative evidence (supports non-I/O-boundness): the small-quantum condition is violated, with more penalty for longer time quanta
      • Ambiguity: no evidence either way
    • Weighted evidence accumulation: the degree of belief grows with the number of sequential positive observations; once it crosses a threshold, the task is believed to be I/O-bound (a sketch follows)
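    A minimal sketch of the evidence accumulation; the quantum cutoff, weights, and belief threshold are illustrative, since the exact constants are not given here:

        /* Minimal sketch of weighted evidence accumulation for I/O-boundness. */
        struct task_belief { long score; };

        #define QUANTUM_SHORT_US 1000   /* assumed "small time quantum" cutoff */
        #define BELIEF_THRESHOLD 8      /* assumed degree-of-belief threshold  */

        /* Called at each task switch the VMM observes (e.g., a CR3 update). */
        void observe(struct task_belief *t, long quantum_us, int preempted_by_io)
        {
            if (quantum_us <= QUANTUM_SHORT_US) {
                if (preempted_by_io)
                    t->score += 2;   /* positive: short quantum + I/O preemption */
                /* else: ambiguity, no evidence either way */
            } else {
                /* Negative evidence, weighted: the longer the quantum, the
                 * larger the penalty against I/O-boundness. */
                t->score -= quantum_us / QUANTUM_SHORT_US;
                if (t->score < 0)
                    t->score = 0;
            }
        }

        int is_io_bound(const struct task_belief *t)
        {
            return t->score >= BELIEF_THRESHOLD;
        }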
  • 61. Task-aware VM Scheduling: 2. I/O Event Correlation
    • Purpose: distinguish incoming events destined for I/O-bound tasks, so that the VMM can selectively prioritize those tasks within a VM (CPU-bound tasks also conduct I/O operations)
    • Goal: best-effort correlation, favoring lightweight operation over accuracy
    • I/O types: block I/O (disk read) and network I/O (packet reception)
  • 62. Task-aware VM Scheduling: 2. I/O Event Correlation: Block I/O
    • Request-response correlation with an inspection window
      • The guest OS can delay read requests (e.g., in its block I/O scheduler), so the VMM remembers the last few tasks that issued reads and, when a read event completes, checks whether any I/O-bound task is in the window
      • Overhead per vCPU = window size x 4 bytes (task ID); a sketch follows
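    A minimal sketch of the inspection window, assuming a hypothetical task_is_io_bound() backed by the identification scheme above:

        /* Minimal sketch of window-based block I/O correlation. */
        #define WINDOW_SIZE 3      /* the value chosen in the evaluation */

        static unsigned int window[WINDOW_SIZE];   /* recent requesters' task IDs */
        static int widx;

        extern int task_is_io_bound(unsigned int task_id);

        /* Record each actual read request the VMM sees. */
        void on_read_request(unsigned int task_id)
        {
            window[widx] = task_id;                /* ring buffer of requesters */
            widx = (widx + 1) % WINDOW_SIZE;
        }

        /* On a disk-read completion: is any recent requester I/O-bound?
         * The window covers reads the guest's block I/O scheduler delayed. */
        int read_event_is_for_io_bound(void)
        {
            int i;
            for (i = 0; i < WINDOW_SIZE; i++)
                if (task_is_io_bound(window[i]))
                    return 1;                      /* trigger partial boosting */
            return 0;
        }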
  • 63. Task-aware VM Scheduling: 2. I/O Event Correlation: Network I/O
    • History-based prediction for asynchronous packet reception: monitor "the first task woken up" in response to an incoming packet
    • An N-bit saturating counter per destination port number (the portmap): the counter moves toward "strong I/O-bound" when the first woken task is I/O-bound, and toward "non-I/O-bound" otherwise; if the counter's MSB is set, the packet is predicted to be for an I/O-bound task
    • Example: a 2-bit counter with states 00 (non-I/O-bound), 01 (weak I/O-bound), 10 (I/O-bound), 11 (strong I/O-bound); overhead per VM = N x 8KB (a sketch follows)
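    A minimal sketch of a 2-bit portmap. Each counter occupies a byte here for clarity; packing two bits per port would match the N x 8KB footprint:

        /* Minimal sketch of the per-port 2-bit saturating counter (portmap). */
        #include <stdint.h>

        static uint8_t portmap[65536];   /* one 2-bit counter per dest port */

        /* Train: after a packet arrives, observe the first woken task. */
        void train(uint16_t dst_port, int first_woken_is_io_bound)
        {
            uint8_t *c = &portmap[dst_port];
            if (first_woken_is_io_bound) {
                if (*c < 3) (*c)++;      /* saturate at 11 (strong I/O-bound) */
            } else {
                if (*c > 0) (*c)--;      /* saturate at 00 (non-I/O-bound)    */
            }
        }

        /* Predict: boost only when the MSB is set (states 10 and 11). */
        int packet_is_for_io_bound(uint16_t dst_port)
        {
            return (portmap[dst_port] & 0x2) != 0;
        }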
  • 64. Task-aware VM Scheduling: 3. Partial Boosting
    • Priority boosting with task-level granularity: borrow future time slices to promptly handle an incoming I/O event, as long as fairness is kept
    • If an I/O event is destined for a VM and is inferred to be handled by one of its I/O-bound tasks, the VMM initiates partial boosting for that VM's vCPU
    • Partial boosting lasts only while the I/O-bound task runs
  • 65. Evaluation (1/4)
    • Implementation on Xen 3.2
    • Experimental setup: Intel Pentium D for Linux (single core enabled); Intel Q6600 (VT-x) for Windows XP (single core enabled)
    • Correlation parameters, chosen for >90% accuracy and low overhead via stress tests with synthetic workloads: block I/O inspection window size = 3; network I/O portmap bit width = 2
  • 66. Evaluation (2/4)
    • Network response time; schedulers: Baseline = Xen Credit scheduler, TAVS = task-aware VM scheduler
    • Workloads: 1 VM runs a server and a CPU-bound task; 5 VMs run CPU-bound tasks
    • (Figure) Response time improves while fairness is guaranteed
  • 67. Evaluation (3/4)
    • Real workloads on Ubuntu Linux and Windows XP, mixing I/O-bound and CPU-bound tasks
    • Workloads: 1 VM runs an I/O-bound and a CPU-bound task; 5 VMs run CPU-bound tasks
    • (Figure) 12~50% I/O performance improvement with inter-VM fairness
  • 68. Evaluation (4/4)
    • I/O-bound task identification (figure)
  • 69. Client-side Scheduler Support for Multimedia Workloads
  • 70. Client-side Virtualization
    • Multiple OS instances on a local device
    • Primary use cases
      • Different OSes for application compatibility
      • Consolidating business and personal computing environments on a single device (BYOD: Bring Your Own Device), e.g., a business VM and a personal VM atop a hypervisor with a managed domain
  • 71. Multimedia on Virtualized Clients
    • Multimedia is ubiquitous in every VM: video playback, 3D games, and video conferencing run alongside compilation, data processing, and downloading
      1. Multimedia workloads are dominant on virtualized clients
      2. Interactive systems can have concurrently mixed workloads
  • 72. Issues on Multi-layer Scheduling
    • A multimedia-agnostic hypervisor invalidates OS policies for multimedia: OS schedulers give multimedia tasks a larger CPU proportion and timely dispatching (BVT [SOSP'99], SMART [TOCS'03], Rialto [SOSP'97], BEST [MMCN'02], HuC [TOMCCAP'06], Redline [OSDI'08], RSIO [SIGMETRICS'10], Windows MMCSS)
    • The hypervisor scheduler, however, is unaware of any multimedia-specific OS policies in a VM, since it sees each VM as a black box: the virtual CPU abstraction introduces a semantic gap
  • 73. Multimedia-agnostic Hypervisor
    • Multimedia QoS degradation with two VMs of equal CPU shares: a multimedia VM plus a competing VM on the Xen Credit scheduler
    • (Figures: average FPS of 720p video playback on VLC and of Quake III Arena (demo1) under various competing workloads in the other VM) Both degrade substantially
  • 74. Possible Solutions to Semantic Gap
    • Explicit OS cooperation: + accurate; - requires OS modification and is infeasible without multimedia-friendly OS schedulers
    • Explicit user involvement: + simple; - inconvenient and unsuitable for dynamic workloads
    • Implicit, hypervisor-only workload monitoring: + transparent; - difficult to identify workload demands at the hypervisor
  • 75. Proposed Approach
    • A multimedia-aware hypervisor scheduler: transparent scheduler support for multimedia with no modifications to upper-layer SW (OS and applications)
    • "Feedback-driven VM scheduling": a multimedia monitor estimates multimedia QoS from audio/video activity, and a feedback-driven multimedia manager issues scheduling commands (e.g., CPU share or priority) to the CPU scheduler
    • Challenges
      1. How to estimate multimedia QoS from a small set of HW events
      2. How to control the CPU scheduler based on the estimated information
  • 76. Multimedia QoS Estimation
    • What is estimated as multimedia QoS? The "display rate" (i.e., frame rate), as used by the HuC scheduler [TOMCCAP'06]
    • How is the display rate captured at the hypervisor? Two types of display:
      1. Memory-mapped display (e.g., video playback): the application writes frames into a memory-mapped framebuffer
      2. GPU-accelerated display (e.g., 3D games): the application renders through a graphics library and an acceleration unit
  • 77. Memory-mapped Display (1/2)
    • Estimating the display update rate on the memory-mapped framebuffer: write-protect the virtual address space mapped to the framebuffer, so each write traps into the hypervisor's page fault handler, which updates the display rate
    • The hypervisor can inspect any attempt to map framebuffer memory
    • Sampling (1/128 pages by default) reduces trap overheads (a sketch follows)
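    A minimal sketch of the sampling scheme, with hypothetical helpers standing in for the hypervisor's real shadow-page-table and fault-handling code:

        /* Minimal sketch of sampled write-protection over the framebuffer. */
        #define SAMPLE_RATIO 128   /* protect 1 of every 128 pages (default) */

        struct task_stats { unsigned long fb_writes; };
        extern void write_protect_page(unsigned long gfn);   /* hypothetical */
        extern struct task_stats *current_guest_task(void);  /* via CR3      */

        /* Protect a sample of framebuffer pages so only some writes trap. */
        void protect_framebuffer(unsigned long fb_start_gfn, unsigned long nr_pages)
        {
            unsigned long i;
            for (i = 0; i < nr_pages; i += SAMPLE_RATIO)
                write_protect_page(fb_start_gfn + i);
        }

        /* Page-fault path for a protected page: count the write against the
         * current guest task, feeding its per-task display-rate estimate;
         * the page is re-protected later so sampling continues. */
        void on_fb_write_fault(unsigned long gfn)
        {
            (void)gfn;
            current_guest_task()->fb_writes++;
        }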
  • 78. Memory-mapped Display (2/2)
    • Accurate estimation requires a display rate per task: an aggregated display rate does not represent multimedia QoS (e.g., one task at 25 FPS and another at 10 FPS)
    • Tracking guest OS tasks at the hypervisor: inspect address space switches (Antfarm [USENIX'06])
    • Monitoring audio access (RSIO [SIGMETRICS'10]): inspect audio buffer accesses with write-protection
    • A task with a high display rate and audio access is deemed a multimedia task
  • 79. GPU-accelerated Display (1/2)
    • Naive method: inspect the GPU command buffer with write-protection or polling; too heavy, due to the huge volume of GPU commands
    • Lightweight method: little overhead at the cost of some accuracy; 3D games are less sensitive to frame-rate degradation than video playback
    • GPU-interrupt-based estimation: an interrupt is typically used by an application to manage buffer memory, leading to the hypothesis that "the GPU interrupt rate is proportional to the display rate"
  • 80. GPU-accelerated Display (2/2)
    • (Figures: GPU interrupts/sec vs. FPS for Quake 3 demos on Intel GMA 950 (Apple MacBook), Nvidia 6150 Go (HP Pavilion tablet), and PowerVR (Samsung Galaxy S)) Display rates and GPU interrupt rates show a linear relationship, so a GPU interrupt rate can be used to estimate a display rate without additional overheads
    • An exponentially weighted moving average (EWMA) reduces fluctuation: EWMA_t = (1-w) x EWMA_(t-1) + w x current value
  • 81. Multimedia Manager
    • A feedback-driven CPU allocator, under the base assumption that "additional CPU share (or higher priority) improves the display rate"
    • Desired frame rate (DFR): the currently achievable display rate multiplied by a tolerable ratio (0.8)
    • Control loop: if the current FPS dropped below both the previous FPS and the DFR, increase the CPU share (exponentially in the initial phase, linearly afterwards); if three consecutive share increases bring no FPS improvement, halve the CPU share
    • Exceptional cases handled by the back-off: 1) no relationship between CPU and FPS, 2) FPS saturates below the DFR, 3) local CPU contention within the VM
    • A minimal sketch of this loop follows
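    A minimal C sketch of the feedback loop under the rules above; the step sizes and the set_cpu_share() hook are illustrative:

        /* Minimal sketch of the feedback-driven CPU allocator. */
        #define MAX_NO_IMPROVE 3

        struct mm_vm {
            double fps, prev_fps;
            double dfr;               /* achievable frame rate x 0.8 */
            int    share;             /* starts at a small nonzero value */
            int    no_improve;
            int    initial_phase;
        };

        extern void set_cpu_share(struct mm_vm *vm, int share);

        /* Called periodically with the estimated display rate. */
        void feedback_tick(struct mm_vm *vm, double estimated_fps)
        {
            vm->prev_fps = vm->fps;
            vm->fps = estimated_fps;

            if (vm->fps < vm->prev_fps && vm->fps < vm->dfr) {
                if (++vm->no_improve >= MAX_NO_IMPROVE) {
                    vm->share /= 2;      /* more CPU did not help: back off
                                            (no CPU-FPS relation, saturation,
                                            or in-VM contention) */
                    vm->no_improve = 0;
                } else if (vm->initial_phase) {
                    vm->share *= 2;      /* exponential probe */
                } else {
                    vm->share += 5;      /* linear increase */
                }
                set_cpu_share(vm, vm->share);
            } else {
                vm->no_improve = 0;
                vm->initial_phase = 0;   /* FPS recovered or met the DFR */
            }
        }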
  • 82. Priority Boosting
    • Problem: the hypervisor does not distinguish the types of events when priority-boosting, so a VM about to handle a multimedia event cannot preempt a currently running VM that is handling a normal event
    • Solution: responsive dispatching via higher priority for multimedia-related events (e.g., video, audio, one-shot timers): MMBOOST > IOBOOST > normal priority, with ties broken by remaining CPU shares
  • 83. Evaluation
    • Experimental environment: Intel MacBook with Intel GMA 950; Xen 3.4.0 with Ubuntu 8.04; implementation based on the Xen Credit scheduler
    • Two-VM scenario: one VM with direct I/O plus one with indirect (hosted) I/O; this talk presents the direct I/O case (see the paper for the indirect I/O case)
  • 84. Estimation Accuracy
    • Error rates: 0.55%~3.05%
    • (Figures: real vs. estimated FPS over time, with the multimedia manager disabled, for 720p video playback and for Quake 3 (EWMA, w=0.2), each co-running with a CPU-bound VM)
  • 85. Estimation Overhead
    • CPU overhead caused by page faults during video playback: 0.3~1% with sampling, less than 5% when tracking all pages
      • Low resolution (640x354): all pages 4.95%; sampling 1/8 pages 1.10%, 1/32 pages 0.54%, 1/128 pages 0.58%
      • High resolution (1280x720): all pages 3.91%; sampling 1/8 pages 1.04%, 1/32 pages 0.69%, 1/128 pages 0.33%
  • 86. Multimedia Manager
    • Video playback (720p) co-running with a CPU-bound VM
    • (Figure: FPS, DFR, and CPU share (%) over time)
  • 87. Performance Improvement
    • (Figures: average FPS of 720p VLC playback and Quake III Arena (demo1) under competing workloads, Credit scheduler with and without multimedia support) Performance improves to close to the maximum achievable frame rates
  • 88. Limitations & Discussion
    • Network-streamed multimedia: additional preemption support is required for multimedia-related network packets
    • Multiple multimedia workloads in a VM: the multimedia manager algorithm should be refined to satisfy the QoS of mixed multimedia workloads in the same VM
    • Adaptive management for SMP VMs: adaptive vCPU allocation based on the hosted multimedia workloads
  • 89. Conclusions
    • Demands for a multimedia-aware hypervisor: multimedia workloads are increasingly dominant in virtualized systems
    • A "multimedia-friendly hypervisor scheduler" provides transparent and lightweight multimedia support for client-side virtualization
    • Future directions: multimedia for server-side VDI, a multicore extension for SMP VMs, and considerations for network-streamed multimedia