XPDS16: Xen Scalability Analysis - Weidong Han, Zhichao Huang & Wei Yang, Huawei

HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Huawei Confidential
August, 2016
Weidong Han <hanweidong@huawei.com>
Wei Yang <richard.weiyang@huawei.com>
Zhichao Huang <huangzhichao@huawei.com>
Xen Scalability Analysis

Agenda
 Current Status
 Issues & Proposals
 Summary

What Is Scalability?
 Scalability is the capability of a system, network, or process to
handle a growing amount of work, or its potential to be enlarged in
order to accommodate that growth [Wikipedia]
 We care about the following dimensions of scalability on a single server:
 Horizontal scaling: more VMs, more users
 Vertical scaling: larger VMs
Source: https://en.wikipedia.org/wiki/Scalability

Current Status: Xen Max Limits
Xen 4.7
 Host Limits (x86)
 Up to 4095 physical CPUs
 Up to 16TB physical memory
 HVM Guest Limits (x86)
 Up to 512 virtual CPUs
 Up to 1TB virtual memory

Current Status: Scale to Thousands of VMs
 3,000 VMs were run successfully in an experiment
 Improving Scalability of Xen: the 3,000 domains experiment
(https://events.linuxfoundation.org/images/stories/slides/lfcs2013_liu.pdf)

Scalability has improved significantly upstream
 Grant table
 Split grant table lock into maptrack lock and grant table lock
 Per-vCPU maptrack free lists
 Read-write lock
 Per-active entry lock
 Persistent grants for virtual block scalability
 Avoid grant operations and TLB flushes
 Event channel
 Extend event channel limit: up to 131072
 Per-event channel lock for sending events
 Scheduler
 Node-affinity for vCPU load balance
 P2M
 Use unlocked p2m lookups in hvmemul_rep_movs
 Defer the invalidation until the p2m lock is released

Agenda
 Current Status
 Bottlenecks & Proposals
 Summary

More Cores Are Integrated
More and more cores are being integrated into each CPU, and 8P servers
now have hundreds of cores. This demands better many-core scalability.

Bottleneck 1: ticket spinlock is non-scalable
 Xen uses a ticket spinlock by default
 The ticket spinlock is fair (FIFO)
 But it does not scale on many-core systems
 All waiters spin on one globally shared variable
 Expensive cache-line invalidations
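To make the shared-variable problem concrete, here is a minimal C11 sketch of a ticket lock. It is illustrative only: the type and function names are assumptions and do not mirror Xen's actual spinlock implementation.

```c
#include <stdatomic.h>

/* Illustrative ticket lock (hypothetical names, not Xen's spinlock_t). */
typedef struct {
    atomic_uint next;   /* next ticket to hand out       */
    atomic_uint owner;  /* ticket currently being served */
} ticket_lock_t;

static inline void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

/* Take a ticket, then spin until it is served. Returns the ticket taken. */
static inline unsigned ticket_lock(ticket_lock_t *l)
{
    unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

    /*
     * Every waiter polls the same word. Each unlock writes that word,
     * which invalidates the cache line in every waiting CPU's cache.
     */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;
    return me;
}

static inline void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Tickets are served in the order they were taken, which gives the FIFO fairness noted above. The cost is that one release is observed by all waiters, not just the next one.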

Proposal: scalable lock
 The MCS lock is scalable
 Each waiter spins on its own local variable
 It generates a constant number of cache misses per acquisition, avoiding
the performance collapse seen with many cores.
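A minimal C11 sketch of an MCS queue lock follows, to show where the "local variable" lives. The names (`mcs_node_t`, `mcs_lock`, and so on) are assumptions chosen for illustration; the slides do not show the authors' implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiter, typically on the waiter's stack or per-CPU. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;  /* true while this waiter must keep spinning */
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;  /* last waiter in the queue, NULL if free */
} mcs_lock_t;

static inline void mcs_lock_init(mcs_lock_t *l)
{
    atomic_init(&l->tail, NULL);
}

static inline void mcs_lock(mcs_lock_t *l, mcs_node_t *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Append ourselves; the previous tail (if any) is our predecessor. */
    mcs_node_t *prev =
        atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;  /* the lock was free */

    atomic_store_explicit(&prev->next, me, memory_order_release);

    /* Spin only on our own node, so a release touches one waiter's line. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

static inline void mcs_unlock(mcs_lock_t *l, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);

    if (succ == NULL) {
        /* No visible successor: try to mark the lock free. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is mid-enqueue; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock to exactly one waiter. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Compared with a ticket lock, the caller has to supply a queue node for each acquisition, so the lock and unlock signatures change. That is a design cost to weigh when converting existing call sites.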

Bottleneck 2: call lock
 The call lock is a global lock, used in on_selected_cpus to protect
IPIs sent to the targeted CPUs
 Bottleneck case
 Frequent EPT entry changes require invalidating EPT on the related
PCPUs
 ept_sync_domain -> on_selected_cpus
 Heavy contention on the call lock

Proposal: scalable lock and finer-grained locking
 Replace the ticket lock with an MCS lock for the call lock
 Change the global lock to per-CPU locks
 Hold the per-CPU lock of each target CPU instead of a global lock
(the call lock)
 Then send the IPIs: smp_send_call_function_mask(&call_data.selected);
[Figure: two diagrams of a source CPU (src) and CPUs 1, 2, 3, …, N]
 Global lock: it locks all CPUs even though the IPI target CPUs are only CPU 1 and 2
 Per-CPU lock: it only locks the IPI target CPUs (1 and 2); other CPUs are still available for IPI
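The per-CPU idea can be sketched in C11 as below. Everything here is hypothetical: the 64-CPU mask, the `percpu_call_lock` array, and the helper names are assumptions for illustration. Taking the locks in ascending CPU order is this sketch's own choice to avoid deadlock between senders with overlapping masks; the slides do not specify an ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64  /* hypothetical: one bit per CPU in a uint64_t mask */

/* Hypothetical per-CPU call locks replacing one global call lock. */
static atomic_flag percpu_call_lock[NR_CPUS];

static inline void call_locks_init(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        atomic_flag_clear(&percpu_call_lock[cpu]);
}

/* Release the locks of all CPUs in mask. */
static inline void unlock_targets(uint64_t mask)
{
    for (int cpu = NR_CPUS - 1; cpu >= 0; cpu--)
        if (mask & (UINT64_C(1) << cpu))
            atomic_flag_clear_explicit(&percpu_call_lock[cpu],
                                       memory_order_release);
}

/* Lock every target CPU, lowest CPU number first (assumed lock order). */
static inline void lock_targets(uint64_t mask)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (mask & (UINT64_C(1) << cpu))
            while (atomic_flag_test_and_set_explicit(&percpu_call_lock[cpu],
                                                     memory_order_acquire))
                ;  /* spin until this target is free */
}

/* Non-blocking variant: shows that disjoint target sets do not contend. */
static inline bool trylock_targets(uint64_t mask)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(mask & (UINT64_C(1) << cpu)))
            continue;
        if (atomic_flag_test_and_set_explicit(&percpu_call_lock[cpu],
                                              memory_order_acquire)) {
            /* Target busy: roll back the lower-numbered locks we hold. */
            unlock_targets(mask & ((UINT64_C(1) << cpu) - 1));
            return false;
        }
    }
    return true;
}
```

A sender would call `lock_targets(mask)`, send the IPIs, wait for completion, and then call `unlock_targets(mask)`. A second sender targeting CPU 3 proceeds immediately while CPUs 1 and 2 are locked, which matches the second diagram.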

Evaluation Environment and Method
 Hardware
 8 sockets, 96 cores, 192 threads
 2TB RAM
 Host
 Xen 4.7.0
 Dom0: SLES 11SP3
 VM
 Windows 7 64-bit
 40GB RAM
 64 vCPUs
 Test case
 Launch 8 VMs with PoD (populate-on-demand) in parallel. This causes many EPT
changes because Windows zeroes pages during boot, and zeroed pages are then
scanned and reclaimed when VM memory usage exceeds the reserved threshold.

Evaluation Results
[Charts comparing Xen 4.7 and Xen 4.7 with Optimization; x-axis ticks 1 to 595]
 Number of Lock Wait (y-axis 0 to 70,000)
 Avg Consuming of Each Lock Wait (ns) (y-axis 0 to 10,000,000)
With the optimization, both the number of lock waits and the time spent in each wait are reduced significantly
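The slides do not say how these two metrics were collected. As a rough illustration of what "number of lock waits" and "average time per wait" can mean, here is a hypothetical C11 sketch of a spinlock that counts only contended acquisitions and accumulates their wait time. All names are assumptions. It uses `timespec_get` for portability, whereas a real profiler would more likely read a monotonic cycle counter.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical instrumented spinlock; not the authors' measurement code. */
typedef struct {
    atomic_flag held;
    uint64_t wait_count;     /* acquisitions that had to wait */
    uint64_t wait_ns_total;  /* total time spent waiting, ns  */
} profiled_lock_t;

static inline uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);  /* wall clock; fine for a sketch */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static inline void profiled_lock_init(profiled_lock_t *l)
{
    atomic_flag_clear(&l->held);
    l->wait_count = 0;
    l->wait_ns_total = 0;
}

static inline void profiled_lock(profiled_lock_t *l)
{
    /* Fast path: uncontended acquisitions are not counted as waits. */
    if (!atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        return;

    uint64_t start = now_ns();
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
    /* We own the lock now, so plain updates to the counters are safe. */
    l->wait_count++;
    l->wait_ns_total += now_ns() - start;
}

static inline void profiled_unlock(profiled_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Average wait per contended acquisition, or 0 if there were none. */
static inline uint64_t avg_wait_ns(const profiled_lock_t *l)
{
    return l->wait_count ? l->wait_ns_total / l->wait_count : 0;
}
```

Sampling `wait_count` and `avg_wait_ns()` at regular intervals would yield two time series of the kind plotted above.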

Bottleneck 3: vCPU load balance in credit scheduler
 vCPU load balance
 If the next highest priority local runnable vCPU has already eaten
through its credits, look on other PCPUs to see if we have more urgent
work
 Select non-idling CPUs
 Try to acquire the schedule lock of a non-idling CPU
 Then try to steal a task from this non-idling CPU
 If stealing does not succeed, go to the next non-idling CPU
 The bottleneck analysis
 On a many-core system there are lots of non-idling CPUs, so it is a big
waste if the schedule lock is often acquired but no task can be stolen.

Proposal: Add a check before acquiring the schedule lock
 Add a check for each non-idling CPU before acquiring its
schedule lock
 Add a bitmap to each PCPU that records which PCPUs the vCPUs in its runq
can run on (a vCPU may be pinned to some PCPUs)
 If this CPU (which wants to steal a task from other CPUs) is not in the
bitmap of a non-idling CPU, it cannot steal a task from that CPU, so skip
it and save the cost of acquiring its schedule lock.
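The check can be sketched as a small C model. It is not the credit scheduler's code: `pcpu_t`, `runq_affinity`, and `find_steal_candidate` are hypothetical names, and a counter stands in for actually taking the schedule lock so the saving can be observed.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PCPUS 8  /* hypothetical small machine */

/* Hypothetical per-PCPU state for the sketch. */
typedef struct {
    bool     idle;           /* true if nothing runnable is queued        */
    uint64_t runq_affinity;  /* bitmap: PCPUs its queued vCPUs can run on */
    unsigned lock_acquired;  /* times its schedule lock was taken (demo)  */
} pcpu_t;

/*
 * Look for a peer to steal from on behalf of PCPU 'self'.
 * Returns the peer's index, or -1 if nothing can be stolen.
 * With use_check set, peers whose bitmap excludes 'self' are skipped
 * before their schedule lock would be acquired.
 */
static inline int find_steal_candidate(pcpu_t pcpus[], int self,
                                       bool use_check)
{
    uint64_t self_bit = UINT64_C(1) << self;

    for (int peer = 0; peer < NR_PCPUS; peer++) {
        if (peer == self || pcpus[peer].idle)
            continue;

        /* Proposed check: read the bitmap without taking the lock. */
        if (use_check && !(pcpus[peer].runq_affinity & self_bit))
            continue;

        pcpus[peer].lock_acquired++;  /* stands in for the schedule lock */

        if (pcpus[peer].runq_affinity & self_bit)
            return peer;  /* some queued vCPU may run on 'self' */
        /* Stealing failed: the lock was taken for nothing; try the next. */
    }
    return -1;
}
```

With every vCPU pinned 1:1, each bitmap contains only its own PCPU. The unchecked loop then takes a lock on every non-idling peer and fails each time, while the checked loop takes none. Because the bitmap is read without the lock, it is only a hint, and the steal path still has to recheck under the lock.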

Evaluation Environment
 Hardware
 8 sockets, 96 cores, 192 threads
 2TB RAM
 Host
 Xen 4.7.0
 Dom0: SLES 11SP3
 VM
 Windows 7 64-bit
 40GB RAM
 24 vCPUs
 Test case
 Pin each vCPU 1:1 to a PCPU, and run NotMyFault in the VM

Evaluation Results: vCPUs pinned 1:1 to PCPUs
[Charts; x-axis ticks 1 to 295]
 Number of Lock Wait (y-axis 0 to 60,000)
 Avg. Consuming of Each Lock Wait (ns) (y-axis 0 to 20,000,000)
With the optimization, both the number of lock waits and the time spent in each wait are reduced significantly

Summary
 Some scalability bottlenecks and proposals
 Future work
 Virtualization scalability benchmark
 Scenarios
 Measurement
 Huge VM support (more than 128 vCPUs, more than 1TB of memory)
