HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Huawei Confidential
Security Level:
August, 2016
Weidong Han <hanweidong@huawei.com>
Wei Yang <richard.weiyang@huawei.com>
Zhichao Huang <huangzhichao@huawei.com>
Xen Scalability Analysis
Agenda
• Current Status
• Issues & Proposals
• Summary
What Is Scalability?
• Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged in order to accommodate that growth [Wikipedia]
• We care about the following dimensions of scalability on a single server:
  - Horizontal scaling: more VMs, more users
  - Vertical scaling: larger VMs
Source: https://en.wikipedia.org/wiki/Scalability
Current Status: Xen Max Limits
Xen 4.7
• Host limits (x86)
  - Up to 4095 physical CPUs
  - Up to 16 TB physical memory
• HVM guest limits (x86)
  - Up to 512 virtual CPUs
  - Up to 1 TB virtual memory
Current Status: Scale to Thousands of VMs
• 3,000 VMs have been run successfully in an experiment
  - "Improving Scalability of Xen: the 3,000 domains experiment" (https://events.linuxfoundation.org/images/stories/slides/lfcs2013_liu.pdf)
Upstream Has Already Improved Scalability Substantially
• Grant table
  - Split the grant table lock into a maptrack lock and a grant table lock
  - Per-vCPU maptrack free lists
  - Read-write lock
  - Per-active-entry lock
• Persistent grants for virtual block scalability
  - Avoid grant operations and TLB flushes
• Event channel
  - Extended event channel limit: up to 131072
  - Per-event-channel lock for sending events
• Scheduler
  - Node affinity for vCPU load balancing
• P2M
  - Use unlocked p2m lookups in hvmemul_rep_movs
  - Defer the invalidation until the p2m lock is released
Agenda
• Current Status
• Bottlenecks & Proposals
• Summary
More Cores Are Integrated
More and more cores are being integrated into each CPU, and 8P servers now have hundreds of cores. This calls for better many-core scalability.
Bottleneck 1: Ticket Spinlock Is Not Scalable
• Xen uses ticket spinlocks by default
• A ticket spinlock is fair (FIFO)
• But it does not scale to many cores
  - All waiters spin on a globally shared variable
  - Every update to that variable causes expensive cache-line invalidations on the waiting cores
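To make the contention pattern concrete, here is a minimal ticket lock sketch using C11 atomics. It is an illustration only, not Xen's spinlock code, and the names `ticket_lock`, `ticket_acquire`, and `ticket_release` are hypothetical. The point to notice is that every waiter polls the same `owner` word.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical names for illustration; this is not Xen's spinlock API. */
struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently being served */
};

static void ticket_acquire(struct ticket_lock *l)
{
    /* Take a ticket; tickets are served in FIFO order, hence the fairness. */
    unsigned int me =
        atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

    /* Every waiter polls the same shared word. Each release writes it,
     * which invalidates the cache line on all waiting cores at once. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;
}

static void ticket_release(struct ticket_lock *l)
{
    /* Sanity check: the lock is held, so owner has not caught up to next. */
    assert(atomic_load(&l->owner) != atomic_load(&l->next));

    /* Serve the next ticket. */
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

With N waiters, one release makes all N cores re-fetch the line even though only one of them can proceed, which is why the cost grows with the core count.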
Proposal: Scalable Lock
• The MCS lock is scalable
  - Each waiter spins on a local variable
  - It generates a constant number of cache misses per acquisition, which avoids the performance collapse seen with many cores
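The queue-based idea behind an MCS lock can be sketched as follows with C11 atomics. This is a textbook-style illustration under assumed names (`mcs_lock`, `mcs_node`, `mcs_acquire`, `mcs_release`), not the patch proposed in the talk and not an existing Xen interface. Each waiter brings its own node and spins only on that node's `locked` flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; this is not a Xen API. */
struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the wait queue */
    atomic_bool locked;                /* true while this waiter must spin */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;   /* last waiter, or NULL if free */
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Join the queue with a single exchange on the shared tail pointer. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                        /* queue was empty: lock acquired */

    /* Link behind the predecessor, then spin on our own node only. */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *me)
{
    /* Sanity check: a held lock always has a non-empty queue. */
    assert(atomic_load(&lock->tail) != NULL);

    struct mcs_node *succ =
        atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to mark the lock free. */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A waiter has swapped the tail but not linked itself yet. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand over with one store to the successor's private flag. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

A release touches only the successor's flag, so the number of cache-line transfers per handover stays constant regardless of how many CPUs are queued.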
Bottleneck 2: Call Lock
• The call lock is a global lock, used in on_selected_cpus to protect IPIs to the targeted CPUs
• Bottleneck case
  - Frequent EPT entry changes require invalidating the EPT on the related PCPUs
  - ept_sync_domain -> on_selected_cpus
  - Heavy contention on the call lock
Proposal: Scalable Lock and Finer-Grained Locks
• Replace the ticket lock with an MCS lock for the call lock
• Change the global lock to per-CPU locks
  - Hold the per-CPU lock of each target CPU instead of a global lock (the call lock)
  - Then send the IPIs: smp_send_call_function_mask(&call_data.selected);
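[Figure: two diagrams of a source CPU (src) sending IPIs to CPUs 1, 2, 3, ..., N. Global call lock: it locks all CPUs even though the IPI target CPUs are only CPU 1 and 2. Per-CPU locks: it locks only the IPI target CPUs (1 and 2); the other CPUs are still available for IPIs.]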
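The per-CPU locking step can be sketched as below. This is a simplified illustration with several assumptions that are not in the slides: the target set fits in a 64-bit mask, the locks are plain test-and-set flags rather than MCS locks, and the locks are taken in ascending CPU order so that two senders with overlapping targets cannot deadlock. All names (`percpu_call_lock`, `call_locks_acquire`, `call_locks_release`, `call_lock_held`) are hypothetical, and sending the IPIs and waiting for completion is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64   /* assumption: the CPU mask fits in one 64-bit word */

/* Hypothetical per-CPU call locks replacing the single global call lock.
 * Static storage, so every lock starts out false (unlocked). */
static atomic_bool percpu_call_lock[NR_CPUS];

/* Lock every CPU in `targets`. Ascending order is an assumption made here
 * so that senders with overlapping target sets cannot deadlock. */
static void call_locks_acquire(uint64_t targets)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(targets & (UINT64_C(1) << cpu)))
            continue;
        while (atomic_exchange_explicit(&percpu_call_lock[cpu], true,
                                        memory_order_acquire))
            ;   /* spin until the current holder releases this CPU */
    }
    /* The IPIs to `targets` would be sent at this point. */
}

/* Unlock every CPU in `targets` once the remote calls have completed. */
static void call_locks_release(uint64_t targets)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (targets & (UINT64_C(1) << cpu))
            atomic_store_explicit(&percpu_call_lock[cpu], false,
                                  memory_order_release);
    }
}

/* Helper for inspection only. */
static bool call_lock_held(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    return atomic_load_explicit(&percpu_call_lock[cpu],
                                memory_order_relaxed);
}
```

A sender targeting CPUs 1 and 2 leaves CPU 3 unlocked, so a second sender targeting only CPU 3 proceeds without waiting, which is the behavior the second diagram describes.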
Evaluation Environment and Method
• Hardware
  - 8 sockets, 96 cores, 192 threads
  - 2 TB RAM
• Host
  - Xen 4.7.0
  - Dom0: SLES 11 SP3
• VM
  - Windows 7 64-bit
  - 40 GB RAM
  - 64 vCPUs
• Test case
  - Launch 8 VMs with PoD (populate-on-demand) in parallel. This causes many EPT changes because Windows zeroes pages during boot, and zeroed pages are then scanned for and reclaimed when VM memory usage exceeds the reserved threshold.
Evaluation Results
[Figure: two charts, each comparing "Xen 4.7" with "Xen 4.7 with Optimization". Chart 1: "Number of Lock Wait" (y-axis 0 to 70000, x-axis ticks 1 to 595). Chart 2: "Avg Consuming of Each Lock Wait (ns)" (y-axis 0 to 10000000, x-axis ticks 1 to 595).]
With the optimization, both the number of lock waits and the average time spent in each wait are reduced significantly.
Bottleneck 3: vCPU Load Balancing in the Credit Scheduler
• vCPU load balancing
  - If the next-highest-priority local runnable vCPU has already used up its credits, look on other PCPUs for more urgent work
    - Select the non-idle CPUs
    - Try to acquire the schedule lock of a non-idle CPU
    - Then try to steal a task from that CPU
    - If stealing does not succeed, go on to the next non-idle CPU
• Bottleneck analysis
  - With many cores there are many non-idle CPUs, so a lot of time is wasted when the schedule lock is often acquired but no task can be stolen
Proposal: Add a Check Before Acquiring the Schedule Lock
• Check each non-idle CPU before acquiring its schedule lock
  - Give each PCPU a bitmap that records which PCPUs the vCPUs on its run queue can run on (a vCPU may be pinned to specific PCPUs)
  - If this CPU (the one that wants to steal a task) is not in the bitmap of a non-idle CPU, it cannot steal a task from that CPU, so it skips that CPU and saves the cost of acquiring the schedule lock
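The pre-check can be sketched as follows. This is a minimal illustration, not the credit scheduler's actual data structures: `pcpu_runq`, `runq_insert`, `may_steal_from`, and `steal_candidates` are hypothetical names, CPU masks are assumed to fit in 64 bits, and the sketch does not show how the bitmap is updated when a vCPU leaves the run queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64   /* assumption: the CPU mask fits in one 64-bit word */

/* Hypothetical per-PCPU summary: the union of the affinity masks of all
 * vCPUs currently on this PCPU's run queue. */
struct pcpu_runq {
    uint64_t runnable_on;   /* bit c set: some queued vCPU may run on PCPU c */
};

/* Record a newly queued vCPU's affinity in the summary bitmap. */
static void runq_insert(struct pcpu_runq *rq, uint64_t vcpu_affinity)
{
    rq->runnable_on |= vcpu_affinity;
}

/* Cheap, lock-free test: could any vCPU queued on `peer` run on this_cpu? */
static bool may_steal_from(const struct pcpu_runq *peer, unsigned int this_cpu)
{
    assert(this_cpu < NR_CPUS);
    return (peer->runnable_on >> this_cpu) & 1;
}

/* Count the peers whose schedule lock would still have to be taken. */
static unsigned int steal_candidates(const struct pcpu_runq *peers,
                                     unsigned int nr_peers,
                                     unsigned int this_cpu)
{
    unsigned int attempts = 0;

    for (unsigned int i = 0; i < nr_peers; i++) {
        if (!may_steal_from(&peers[i], this_cpu))
            continue;   /* nothing queued there can run here: skip the lock */
        attempts++;     /* the lock and steal attempt would happen here */
    }
    return attempts;
}
```

With 1:1 pinning, every peer's bitmap contains only that peer's own PCPU, so the check rejects all peers and no remote schedule lock is taken at all.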
Evaluation Environment
• Hardware
  - 8 sockets, 96 cores, 192 threads
  - 2 TB RAM
• Host
  - Xen 4.7.0
  - Dom0: SLES 11 SP3
• VM
  - Windows 7 64-bit
  - 40 GB RAM
  - 24 vCPUs
• Test case
  - Pin vCPUs to PCPUs 1:1 and run NotMyFault in the VM
Evaluation Results: 1:1 pin VCPU to PCPU
[Figure: two charts, each comparing "Xen 4.7" with "Xen 4.7 with Optimization". Chart 1: "Number of Lock Wait" (y-axis 0 to 60000, x-axis ticks 1 to 295). Chart 2: "Avg. Consuming of Each Lock Wait (ns)" (y-axis 0 to 20000000, x-axis ticks 1 to 295).]
With the optimization, both the number of lock waits and the average time spent in each wait are reduced significantly.
Summary
• Identified several scalability bottlenecks and proposed fixes for them
• Future work
  - Virtualization scalability benchmark
    - Scenarios
    - Measurement
  - Huge VM (vCPUs > 128, memory > 1 TB) support
Thank you
XPDS16: Xen Scalability Analysis - Weidong Han, Zhichao Huang & Wei Yang, Huawei
