The Rise and Fall of Assembler
and the VGIC from Hell
Marc Zyngier, ARM
Christoffer Dall, Linaro
KVM/ARM:
The rise and fall of assembly code
and the VGIC from hell
Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier <marc.zyngier@arm.com>
LCU16
September 14, 2016
© ARM 2016
KVM/ARM: Absolute Beginners
KVM/ARM, merged in v3.9
38 files changed, 6546 insertions(+), 20 deletions(-)
Not bad, for a start. But wait...
13 files changed, 2060 insertions(+), 12 deletions(-)
That’s the vgic...
12 files changed, 489 insertions(+), 1 deletion(-)
And here’s the timer
Over 9k LoC, with about 10% assembly code implementing the world switch (mostly).
KVM/arm64: Always Crashing In The Same Car
KVM/arm64, merged in v3.11
The arm64 port didn’t change this fine tradition
World switch in asm, the rest in C.
Not changing the structure made it easy to build on the initial work
When you have something that works, it is tempting not to reinvent the wheel...
The result: about 3400 lines of new code
About 1000 lines of assembly code
EL2: What In The World
What does this assembly code do?
It swaps two execution environments (between host and guest)
GPRs
FPSIMD
All system registers (including virtual memory)
Interrupt context
Handle all exceptions happening whilst a guest runs
Interrupts
Page faults
Paravirtualized services
Offers a small set of services to the host too
TLB invalidation, please run this guest...
Generally known as “the World Switch”.
World switch: Under Pressure
Initial code is fairly straightforward
Things quickly become more complicated
GICv3 support
Lazy FPSIMD
Debug support
Interactions between the various code paths are not obvious
Register allocation gets a bit hairy
Try throwing 31 balls in the air...
... and keep track of their individual positions...
... before catching them
Maintainers are feeling the pressure...
Optimizing is hard and makes the code more fragile
Bugs are hard to squash
EL2: Life on Mars
Let’s take a step back: why use assembly code?
HYP/EL2 is a separate exception level
Its own exceptions, its own page tables
Its own rules too...
Its VA space is at an offset from the kernel VA
Not the usual warm, cosy kernel environment
No printk, no tracing, no debug facility (omg, no printk!!)
More akin to being stranded on an iceberg. Naked.
Easier to write a standalone piece of code
Exception boundaries are well understood
Creates clear delimitations between kernel and HYP spaces
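The point about the HYP VA space being at an offset from the kernel VA means that a kernel pointer has to be translated before HYP code can dereference it. The arm64 KVM code has a `kern_hyp_va` helper for this. The sketch below is not that helper: the function name `kern_to_hyp_va` and both base addresses are made up, and only show the kind of arithmetic involved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical base addresses, chosen only to make the arithmetic visible. */
#define KERNEL_VA_BASE 0xffff000000000000ULL
#define HYP_VA_BASE    0x0000800000000000ULL

/*
 * Translate a kernel virtual address into the address at which the same
 * memory is assumed to be mapped in the HYP address space.
 */
static uint64_t kern_to_hyp_va(uint64_t kva)
{
    assert(kva >= KERNEL_VA_BASE);
    return kva - KERNEL_VA_BASE + HYP_VA_BASE;
}
```

Any pointer handed from the kernel to HYP code (a vcpu structure, for instance) would go through such a translation first. Forgetting it is exactly the kind of bug that is hard to find without printk.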
VHE: Breaking Glass
And then comes the feature that breaks everything: VHE.
VHE allows the kernel to run at EL2 on ARMv8.1 systems
Does so by aliasing _EL2 registers to their _EL1 counterparts
The kernel runs unmodified
The hypervisor needs to be heavily modified
But making the world switch code VHE compliant is ... interesting.
Entirely relies on code patching
Tons of system register renaming
Alternate sequences for some paths
See presentation at LCA15
The result, although functional, is not easily maintainable
Optimizing becomes extremely hard, wasting the VHE effort
Maybe it is time to reconsider how the world-switch code is architected.
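Why the aliasing forces so much renaming can be shown with a toy model. With VHE enabled, an access that the code spells as an _EL1 register while running at EL2 lands on the EL2 copy, so world-switch code that really wants the guest's EL1 state has to use a different access (the _EL12 encodings in the architecture). The register file, the `vhe_enabled` flag and the function names below are all illustrative assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum bank { BANK_EL1, BANK_EL2, NR_BANKS };

/* Toy register file: a single system register with one copy per level. */
static uint64_t reg_bank[NR_BANKS];
static bool vhe_enabled;

/* An access spelled "<REG>_EL1" by code that is running at EL2. */
static uint64_t *access_el1_name(void)
{
    /* With VHE the access is redirected to the EL2 copy. */
    return vhe_enabled ? &reg_bank[BANK_EL2] : &reg_bank[BANK_EL1];
}

/*
 * An access that must reach the guest's real EL1 register. Without VHE
 * the plain _EL1 name already does this; with VHE a different encoding
 * is needed, which is what forces the renaming in the world switch.
 */
static uint64_t *access_guest_el1(void)
{
    return &reg_bank[BANK_EL1];
}
```

The same source line therefore means two different things depending on the mode, which is why the assembly version ended up relying on code patching.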
WSinC: Changes
So what is actually required to use C at EL2? Surprisingly little:
Have a valid stack
Respect the AArch64 PCS (IHI 0055B)
Map the read-only data into EL2
And a few more things that are Linux specific:
Make sure the code ends up in a separate section
Do NOT call any kernel function from the HYP code
Unless you can guarantee they are inlined
Remember the bit about having a different VA space?
Make sure nothing gets traced or instrumented
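The Linux-specific rules above map fairly directly onto compiler attributes. The following is a minimal sketch using GCC/Clang attributes. The macro name `HYP_CODE`, the section names and the two functions are assumptions for illustration, not the annotations the kernel actually uses.

```c
#include <assert.h>

/* Hypothetical section name; Mach-O toolchains need "segment,section". */
#if defined(__APPLE__)
#define HYP_SECTION __attribute__((section("__TEXT,__hyp")))
#else
#define HYP_SECTION __attribute__((section(".hyp.text")))
#endif

/* Place the function in its own section and keep instrumentation out. */
#define HYP_CODE HYP_SECTION __attribute__((noinline, no_instrument_function))

/*
 * Helpers used from HYP code are forced inline, so that no call ever
 * leaves the HYP section for an address that is not mapped at EL2.
 */
static inline __attribute__((always_inline)) unsigned long hyp_mask(unsigned long v)
{
    return v & 0xffUL;
}

HYP_CODE unsigned long hyp_entry(unsigned long arg)
{
    return hyp_mask(arg) + 1;
}
```

Keeping everything in a dedicated section is what makes it possible to map exactly that code (and nothing else) into EL2.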
WSinC: Shape of Things
There are still some bits of assembly code that are required:
Calling into HYP
Marshalling the parameters across HVC
Just a function call for VHE
Entering the guest
Seen as a normal function from C code (__guest_enter)
Performs GPR save/restore
Taking exceptions (interrupt, hypercall, fault)
Save a very minimal context
Seen by the C code as __guest_enter returning
Everything else is written as C code.
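From the C side, the structure described above could look roughly like this. The sketch assumes a stub, `guest_enter_stub`, standing in for the assembly `__guest_enter`, and a two-register `struct sysregs` standing in for the full system register set. All names other than `__guest_enter` are hypothetical.

```c
#include <assert.h>

enum guest_exit { GUEST_EXIT_IRQ, GUEST_EXIT_HVC, GUEST_EXIT_FAULT };

/* Hypothetical, tiny subset of the system registers being switched. */
struct sysregs { unsigned long ttbr; unsigned long vbar; };

struct vcpu {
    struct sysregs host, guest;
    enum guest_exit next_exit;   /* lets the stub fake an exception */
};

static struct sysregs hw_sysregs;   /* stand-in for the real registers */

/* Stand-in for the assembly entry point: "returns" when the guest exits. */
static enum guest_exit guest_enter_stub(struct vcpu *vcpu)
{
    hw_sysregs.vbar += 1;   /* pretend the guest changed some state */
    return vcpu->next_exit;
}

/* The world switch in C: plain function calls around the guest entry. */
static enum guest_exit vcpu_run(struct vcpu *vcpu)
{
    enum guest_exit code;

    vcpu->host = hw_sysregs;        /* save host state */
    hw_sysregs = vcpu->guest;       /* restore guest state */
    code = guest_enter_stub(vcpu);  /* run until an exception is taken */
    vcpu->guest = hw_sysregs;       /* save guest state */
    hw_sysregs = vcpu->host;        /* restore host state */
    return code;
}
```

Because the guest entry looks like an ordinary function, the compiler handles register allocation for everything around it, which is where the juggling of 31 registers by hand goes away.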
WSinC: Get Real
So what?
A couple of weeks spent hacking the kernel
That’s what holidays are for!
28 files changed, 1532 insertions(+), 1612 deletions(-)
Yes, we actually removed a bit of code
Very few bugs (the compiler catches the silly stuff early)
The result is slightly faster, despite not being optimized
Turns out CPUs are optimized for compiled code...
32 files changed, 862 insertions(+), 377 deletions(-)
And then comes VHE
And then we can start optimizing, because it’s easy!
Up to 40% reduction in interrupt latency
VHE-specific optimizations on the way
We can now share some of the HYP code with the 32-bit port
Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM limited
(or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be
trademarks of their respective owners.
Copyright © 2016 ARM Limited
ENGINEERS
AND DEVICES
WORKING
TOGETHER
The VGIC from Hell
Generic Interrupt Controller (GIC) - simplified view
The Gist of the GIC
● Devices signal interrupts to the GIC
● CPUs can receive interrupts (ACK) and complete interrupts (EOI)
● CPUs can configure the GIC:
○ CPU affinity
○ IRQ priority
○ Level vs. Edge trigger
○ Enable/Disable IRQs
○ ...and more scary stuff
● CPUs can ask the GIC to interrupt other CPUs (IPIs)
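The ACK/EOI cycle can be made concrete with a toy model. This is a deliberately simplified sketch: it has a single CPU and ignores affinity, the priority mask, preemption and trigger mode. The structure and function names are invented for illustration. The two facts taken from the GIC architecture are that a lower priority value means a higher priority and that 1023 is the spurious interrupt ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_IRQS  32
#define SPURIOUS 1023   /* returned when nothing can be acknowledged */

struct toy_gic {
    bool enabled[NR_IRQS], pending[NR_IRQS], active[NR_IRQS];
    uint8_t priority[NR_IRQS];   /* lower value = higher priority */
};

/* A device raises an interrupt line. */
static void gic_signal(struct toy_gic *g, int irq)
{
    g->pending[irq] = true;
}

/* ACK: pick the highest-priority deliverable IRQ and make it active. */
static int gic_ack(struct toy_gic *g)
{
    int best = SPURIOUS;

    for (int i = 0; i < NR_IRQS; i++) {
        if (!g->enabled[i] || !g->pending[i] || g->active[i])
            continue;
        if (best == SPURIOUS || g->priority[i] < g->priority[best])
            best = i;
    }
    if (best != SPURIOUS) {
        g->pending[best] = false;
        g->active[best] = true;
    }
    return best;
}

/* EOI: the CPU has finished handling the interrupt. */
static void gic_eoi(struct toy_gic *g, int irq)
{
    g->active[irq] = false;
}
```

Even this toy version carries three booleans and a priority per interrupt, which hints at how much state a full emulation has to track.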
The V in VGIC
● Virtualization Extensions (Hardware Virtualization Support)
● Provides a virtual CPU interface that the VM can interact with directly
● Provides a hypervisor control interface to deliver virtual interrupts
● Benefit: No traps on ACK/EOI
Hardware takes care of priorities, masking, etc.
[Diagram: GIC, VM and Hypervisor; labels: VCPU Interface, CPU Interface, Hypervisor Control Interface, Virtual Interrupts, Physical Interrupts]
The Software Problem
● The GIC is split in two:
○ Distributor (configuration side)
○ CPU Interface (delivery side)
● No virtualization support for the distributor
● Must fully emulate distributor in software
● Emulated distributor drives delivery of virtual interrupts
Software Architecture
[Diagram: Linux, KVM, VGIC, QEMU, VM and GIC; label: Virtual IRQs]
Software Challenges
● Lots of state
○ Each IRQ has: enabled/disabled, priority, active, pending, soft_pending, affinity, and more...
○ Global state: enable/disable
○ Per-vcpu state: List Registers (LRs) in Hypervisor Control Interface
○ ...all of this is per-VM.
● Lots of transitions:
○ Userspace and vhost can make virtual IRQ lines go up and down
○ Virtual CPUs can make interrupts pending (IPIs)
○ Virtual CPUs can modify other individual IRQ state (e.g. affinity)
○ Hardware can change state without notifying software (GIC Virtualization Extensions)
● Everything happens asynchronously
The old VGIC
● ...was a mess, because
● Maintained per-IRQ state as global state based on many large bitmaps
● Made it possible to compute global state quickly
● Duplicated pre-computed distributor and VCPU state
● Level-triggered interrupts were shoe-horned into design
Symptoms:
● Made it very hard to ensure consistent state
● Required global lock on almost every operation (measurable!)
● Unintuitive code; calculate bit-positions to modify a boolean state
● Drove maintainers to point of insanity
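To illustrate the bitmap style described above, here is a minimal sketch of per-IRQ booleans stored in global bitmaps. The structure and function names are invented and the layout is not the old VGIC's. It shows both sides of the trade-off: the global question ("is anything deliverable?") is a few word-wide ANDs, while touching one interrupt means computing a word index and a bit mask.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_IRQS       256
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS      ((NR_IRQS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per IRQ in each bitmap. */
struct bitmap_dist {
    unsigned long enabled[NR_WORDS];
    unsigned long pending[NR_WORDS];
};

/* Changing one boolean means computing a word index and a bit mask. */
static void bitmap_assign(unsigned long *map, unsigned int irq, bool value)
{
    unsigned long mask = 1UL << (irq % BITS_PER_WORD);

    assert(irq < NR_IRQS);
    if (value)
        map[irq / BITS_PER_WORD] |= mask;
    else
        map[irq / BITS_PER_WORD] &= ~mask;
}

/* The global state, on the other hand, is cheap to compute. */
static bool bitmap_any_deliverable(const struct bitmap_dist *d)
{
    for (size_t w = 0; w < NR_WORDS; w++)
        if (d->enabled[w] & d->pending[w])
            return true;
    return false;
}
```

Since every interrupt shares the same words, any update touches shared data, which is consistent with the need for a global lock noted above.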
The New VGIC
● Was designed during a Linaro mini-sprint
● Covers GICv2, GICv3, and data structures for the ITS
● Key insight #1:
○ Most of the time, there are no IRQs in flight
● Key insight #2:
○ MMIO operations are rare, and not in the critical path
● The basic idea:
struct vgic_irq {
        int intid;
        struct list_head ap_list;
        bool pending;
        ...
};
The AP List
[Diagram: each VCPU has an AP List headed by a list_head, which links a chain of vgic_irq entries]
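A minimal sketch of the AP list idea follows. It is self-contained, so it carries its own tiny intrusive list instead of the kernel's `list_head` helpers. The `active` field, `struct vcpu`, `queue_irq`, `prune_ap_list` and `irq_from_node` are assumptions added for illustration. Only `intid`, `ap_list` and `pending` come from the structure shown earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);   /* leave the node in the "not queued" state */
}

struct vgic_irq {
    int intid;
    struct list_head ap_list;
    bool pending;
    bool active;    /* assumed field, not shown on the slide */
};

struct vcpu { struct list_head ap_list_head; };

/* Recover the vgic_irq that embeds a given list node. */
#define irq_from_node(ptr) \
    ((struct vgic_irq *)((char *)(ptr) - offsetof(struct vgic_irq, ap_list)))

/* Queue an IRQ on a VCPU when it becomes pending (once only). */
static void queue_irq(struct vcpu *v, struct vgic_irq *irq)
{
    irq->pending = true;
    if (list_empty(&irq->ap_list))
        list_add_tail(&irq->ap_list, &v->ap_list_head);
}

/* Drop every IRQ that is neither active nor pending any more. */
static int prune_ap_list(struct vcpu *v)
{
    struct list_head *pos = v->ap_list_head.next, *next;
    int removed = 0;

    for (; pos != &v->ap_list_head; pos = next) {
        struct vgic_irq *irq = irq_from_node(pos);

        next = pos->next;
        if (!irq->pending && !irq->active) {
            list_del(pos);
            removed++;
        }
    }
    return removed;
}
```

With this layout, the common case from key insight #1 (no IRQs in flight) reduces to a single `list_empty()` check on the VCPU's list head.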
Locking in the new world
● Historical data has shown we need more fine-grained locking than a per-VM lock.
● Locking scheme becomes:
○ One lock per struct vgic_irq to ensure consistency
○ One lock per AP list
○ Only the VCPU thread itself may remove IRQs from its AP list
● Sometimes you need to grab more than one lock
○ Solution: Define strict locking order
● Locking order:
○ AP List lock
■ IRQ lock
○ Lowest-numbered VCPU’s AP list lock first
● Documented in virt/kvm/arm/vgic/vgic.c
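The ordering rule for two AP list locks can be sketched as a small helper. The code below uses a toy lock that only records the order of acquisition, so the rule can be checked without threads. A real implementation would use spinlocks. `struct toy_lock`, `lock_ap_lists` and `unlock_ap_lists` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: stands in for a spinlock and records acquisition order. */
struct toy_lock { const char *name; bool held; };

#define LOG_MAX 16
static const char *lock_log[LOG_MAX];
static int lock_log_len;

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);   /* a real lock would wait here */
    l->held = true;
    if (lock_log_len < LOG_MAX)
        lock_log[lock_log_len++] = l->name;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

struct vcpu { int id; struct toy_lock ap_list_lock; };

/* Always take the lowest-numbered VCPU's AP list lock first. */
static void lock_ap_lists(struct vcpu *a, struct vcpu *b)
{
    struct vcpu *first = a->id < b->id ? a : b;
    struct vcpu *second = a->id < b->id ? b : a;

    assert(a->id != b->id);
    toy_lock_acquire(&first->ap_list_lock);
    toy_lock_acquire(&second->ap_list_lock);
}

/* Release in the reverse order of acquisition. */
static void unlock_ap_lists(struct vcpu *a, struct vcpu *b)
{
    struct vcpu *first = a->id < b->id ? a : b;
    struct vcpu *second = a->id < b->id ? b : a;

    toy_lock_release(&second->ap_list_lock);
    toy_lock_release(&first->ap_list_lock);
}
```

Because both callers agree on the order regardless of which VCPU they run on, two VCPUs moving interrupts towards each other cannot deadlock on these two locks.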
Worst Locking Example: Reassign pending IRQ
1. VCPU 3 takes its AP List lock
2. VCPU 3 takes the IRQ lock
3. VCPU 3 concludes that this interrupt is still pending and must now be handled by VCPU 1
4. VCPU 3 releases the IRQ lock
5. VCPU 3 releases the AP List lock
6. VCPU 3 takes the VCPU 1 AP List Lock
7. VCPU 3 takes its own AP List lock
8. VCPU 3 takes the IRQ lock
9. Re-check all conditions for the reassignment
10. Carry out reassignment if all conditions are still met
11. Release all locks in reverse order
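The eleven steps can be sketched as a single function. This is an illustrative, single-threaded model: the toy lock only checks that it is not taken twice, the AP lists are reduced to a counter, and the fields `queued_on`, `target` and `nr_queued` as well as the function name are assumptions. The important part is the shape: inspect, drop everything, retake in the documented order, then re-check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock standing in for a spinlock. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

struct vcpu { int id; struct toy_lock ap_list_lock; int nr_queued; };

struct vgic_irq {
    struct toy_lock irq_lock;
    bool pending;
    struct vcpu *queued_on;   /* VCPU whose AP list holds the IRQ */
    struct vcpu *target;      /* VCPU that should handle the IRQ */
};

/* Returns true if the IRQ was moved from self's AP list to its target. */
static bool reassign_pending_irq(struct vcpu *self, struct vgic_irq *irq)
{
    struct vcpu *target, *first, *second;
    bool needs_move, moved = false;

    /* Steps 1-5: inspect under own AP list lock and IRQ lock, then drop. */
    toy_lock_acquire(&self->ap_list_lock);
    toy_lock_acquire(&irq->irq_lock);
    target = irq->target;
    needs_move = irq->pending && irq->queued_on == self &&
                 target != NULL && target != self;
    toy_lock_release(&irq->irq_lock);
    toy_lock_release(&self->ap_list_lock);
    if (!needs_move)
        return false;

    /* Steps 6-8: lowest-numbered VCPU's AP list lock first, IRQ lock last. */
    first = self->id < target->id ? self : target;
    second = first == self ? target : self;
    toy_lock_acquire(&first->ap_list_lock);
    toy_lock_acquire(&second->ap_list_lock);
    toy_lock_acquire(&irq->irq_lock);

    /* Steps 9-10: the state may have changed while no lock was held. */
    if (irq->pending && irq->queued_on == self && irq->target == target) {
        self->nr_queued--;
        target->nr_queued++;
        irq->queued_on = target;
        moved = true;
    }

    /* Step 11: release in reverse order. */
    toy_lock_release(&irq->irq_lock);
    toy_lock_release(&second->ap_list_lock);
    toy_lock_release(&first->ap_list_lock);
    return moved;
}
```

The re-check in steps 9 and 10 is what makes it safe to drop the locks in steps 4 and 5: anything observed before the gap is treated as a hint, not as a fact.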
Status of the new VGIC
● Merged in v4.7
● Roughly 600 fewer lines of code
● Pretty stable since the merge
● Improved world switch performance
● Happier maintainers
● Huge thanks to: Andre Przywara, Eric Auger, Peter Maydell, Alex Bennée
Where this leaves us
● KVM/ARM is in really good shape!
● Highlighted new’ish features:
○ Virtual GICv3
○ Virtual ITS
○ VHOST with virtual MSIs and virtual ITS
○ VHE support on ARMv8.1
○ Reduced world-switch time
● In the pipeline:
○ GICv3 save/restore
○ ITS save/restore
○ PCIe with MSI passthrough
○ Cross CPU-Type Support (migration in heterogeneous datacenters)
○ Optimizations
○ GICv4 (direct virtual interrupt injection)
Thank You
#LAS16
For further information: www.linaro.org
LAS16 keynotes and videos on: connect.linaro.org

LAS16-500: The Rise and Fall of Assembler and the VGIC from Hell
