Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compliant System
National Chiao Tung University & National Tsing Hua University & National Taiwan University
Yu-Ju Huang, Hsuan-Heng Wu, Yeh-Ching Chung, Wei-Chung Hsu
Agenda
• Motivation
• Background
• HSA features
• AMD’s implementation on Kaveri, the HSA-compliant platform
• Design and Implementation
• Evaluation
• Conclusion
Motivation
• Problem of heterogeneous computing
• Data communication between CPU & GPU
• Inefficiency
• Programmability inconvenience
• Heterogeneous System Architecture (HSA)
• Developed by HSA Foundation
• Goal
• Improving computation efficiency for heterogeneous computing
• Reducing programmability barrier
• Make virtual machines also benefit from HSA!
[Figure: the goal is an HSA-aware hypervisor hosting multiple guest OSes whose applications also gain the benefits of HSA]
HSA Features
• Shared virtual memory
• I/O page faulting
• User-level queueing
• Memory-based signaling
[Figure: before HSA, the CPU and GPU have separate memories, so data must be copied between them, and tasks reach the GPU only through the operating system and GPU driver; with HSA, the CPU and GPU share one virtual address space backed by the same physical memory, and applications dispatch work to the GPU directly through user-level queues]
Shared Virtual Memory - IOMMU
• The process page table is set in the IOMMU, which carries out virtual-to-physical address translation for the GPU (a minimal sketch follows the figure below)
• CPU and GPU share the same process page table
[Figure: the GPU reaches system memory through the IOMMU and the CPU through the MMU; both walk the same process page table]
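To make this concrete, the sketch below shows how a kernel driver might bind a process address space to the IOMMU through Linux's amd_iommu_v2 interface, so the GPU walks the same page tables as the CPU. This is a minimal illustration rather than the code behind these slides, and the exact signatures (for example, the PASID type) vary across kernel versions.

    /* Minimal sketch: bind a process to the IOMMU so a peripheral shares its
     * page tables. Based on the amd_iommu_v2 API (<linux/amd-iommu.h>) from
     * 3.x-era kernels; signatures differ in later kernels. */
    #include <linux/amd-iommu.h>
    #include <linux/pci.h>
    #include <linux/sched.h>

    #define EXAMPLE_MAX_PASIDS 16   /* illustrative limit */

    static int example_bind_process(struct pci_dev *gpu_pdev, int pasid)
    {
        int err;

        /* Enable PASID/PRI/ATS support on the device. */
        err = amd_iommu_init_device(gpu_pdev, EXAMPLE_MAX_PASIDS);
        if (err)
            return err;

        /* Attach the current process's address space to this PASID: the IOMMU
         * now walks the same page table the CPU MMU uses for this task. */
        err = amd_iommu_bind_pasid(gpu_pdev, pasid, current);
        if (err)
            amd_iommu_free_device(gpu_pdev);
        return err;
    }

    static void example_unbind_process(struct pci_dev *gpu_pdev, int pasid)
    {
        amd_iommu_unbind_pasid(gpu_pdev, pasid);
        amd_iommu_free_device(gpu_pdev);
    }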
I/O Page Faulting - PPR
• A PPR (peripheral page request) is issued by the IOMMU as an interrupt
• PPR log entries contain the faulting process ID and the fault address
• The get_user_pages API can be used to fix the page fault (a hedged sketch follows the figure below)
[Figure: PPR handling flow: a faulting GPU access causes the IOMMU to raise a PPR interrupt; the CPU calls the PPR handler, which reads the PPR logs, fixes the fault, and returns a COMPLETE command to the IOMMU]
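A hedged sketch of how the host PPR handler might resolve such a fault with get_user_pages follows. The ppr_fault structure and helper name are hypothetical, and the get_user_pages signature shown is the older (roughly 3.x-era) one; later kernels use get_user_pages_remote with different arguments.

    /* Hedged sketch of fixing an I/O page fault reported in a PPR log entry. */
    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <linux/rwsem.h>
    #include <linux/sched.h>
    #include <linux/types.h>

    struct ppr_fault {                  /* hypothetical decoded PPR log entry */
        struct task_struct *task;       /* looked up from the faulting PASID */
        unsigned long       address;    /* faulting virtual address */
        bool                write;      /* write access? */
    };

    static int example_fix_ppr_fault(struct ppr_fault *f)
    {
        struct page *page;
        long pinned;

        down_read(&f->task->mm->mmap_sem);
        pinned = get_user_pages(f->task, f->task->mm,
                                f->address & PAGE_MASK, 1,
                                f->write, 0, &page, NULL);
        up_read(&f->task->mm->mmap_sem);

        if (pinned != 1)
            return -EFAULT;   /* report failure in the COMPLETE command */

        put_page(page);       /* the PTE is now populated; drop our reference */
        return 0;             /* report success in the COMPLETE command */
    }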
User Level Queueing - Kernel Fusion Driver (KFD)
• Helps applications register the address of their user-level queues with the GPU (an illustrative sketch follows the figure below)
[Figure: the application passes the address of its user-level queue to the KFD in kernel space; the KFD programs that address into the GPU, after which computation is dispatched from the userspace queues without further driver intervention]
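From userspace, queue creation boils down to handing the address of the ring buffer to the driver. The sketch below is illustrative only: it opens the real /dev/kfd node, but the ioctl number and argument struct are simplified stand-ins, not the actual amdkfd uAPI.

    /* Illustrative only: a runtime registering its user-level queue with a
     * KFD-like driver. The struct and ioctl number are stand-ins. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct example_create_queue_args {   /* hypothetical layout */
        uint64_t ring_base_address;      /* VA of the queue ring buffer */
        uint64_t ring_size;              /* bytes, assumed page-size multiple */
        uint32_t gpu_id;
        uint32_t queue_id;               /* filled in by the driver */
    };

    #define EXAMPLE_IOC_CREATE_QUEUE _IOWR('K', 1, struct example_create_queue_args)

    int example_create_queue(void **ring_out, size_t ring_size)
    {
        int fd = open("/dev/kfd", O_RDWR);
        if (fd < 0)
            return -1;

        void *ring = aligned_alloc(4096, ring_size);  /* queue lives in user memory */
        if (!ring) {
            close(fd);
            return -1;
        }

        struct example_create_queue_args args = {
            .ring_base_address = (uint64_t)(uintptr_t)ring,
            .ring_size         = ring_size,
            .gpu_id            = 0,
        };

        /* The driver records this address and programs it into the GPU, so later
         * packets written to the ring are fetched without kernel involvement. */
        if (ioctl(fd, EXAMPLE_IOC_CREATE_QUEUE, &args) < 0) {
            free(ring);
            close(fd);
            return -1;
        }

        *ring_out = ring;
        return fd;   /* keep the fd open while the queue is in use */
    }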
Design - How to Virtualize
• User-level queueing
• VirtIO-KFD
• Shared virtual memory
• Shadow page table
• Why not hardware-assisted nested paging?
• I/O Page faulting
• Shadow PPR
• VirtIO-IOMMU
Virtualize User Level Queueing - VirtIO-KFD
[Figure: VirtIO-KFD: guest applications call the HSA runtime library, which talks to the VirtIO-KFD front-end in the guest OS; requests travel over a shared virtqueue to the VirtIO-KFD back-end in QEMU on the host, which forwards them to the real KFD so the GPU learns the address of the guest application's user-level queue]
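The request that crosses the shared virtqueue could look roughly like the structs below; the layout and names are assumptions made for this sketch, not the authors' actual protocol. The front-end fills it with guest addresses, and the back-end in QEMU translates them to host virtual addresses before calling the real KFD.

    /* Illustrative VirtIO-KFD request/response as they might cross the shared
     * virtqueue. Field layout is an assumption for the sake of the example. */
    #include <stdint.h>

    enum example_vkfd_op {
        EXAMPLE_VKFD_CREATE_QUEUE  = 1,
        EXAMPLE_VKFD_DESTROY_QUEUE = 2,
    };

    struct example_vkfd_request {
        uint32_t op;             /* which KFD operation to perform */
        uint32_t guest_pasid;    /* identifies the guest process */
        uint64_t ring_base_gpa;  /* guest-physical address of the user-level queue */
        uint64_t ring_size;
    };

    struct example_vkfd_response {
        int32_t  status;         /* 0 on success, negative errno otherwise */
        uint32_t queue_id;       /* handle returned by the host KFD */
    };

    /* Guest front-end: put a request buffer and a response buffer into the
     * virtqueue and kick the device. Host back-end in QEMU: pop the request,
     * translate ring_base_gpa to the host virtual address of the same memory,
     * call the real KFD's create-queue path with that address, and fill in
     * the response. */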
Virtualize Shared Virtual Memory - Shadow Page Table
[Figure: shadow page table setup: as in the VirtIO-KFD path, the guest application's request reaches the host through the shared virtqueue; KVM maintains a shadow page table mapping guest virtual addresses directly to machine physical addresses, and the host IOMMU driver sets the address of this shadow page table in the IOMMU before the GPU runs]
IOMMU Snapshot During GPU Execution

  ID | System              | Page table
  1  | Host, process 1     | Addr of PT
  2  | Guest 1, process 1  | Addr of SPT

[Figure: the GPU accesses memory through the IOMMU, which selects a page table by address space ID; in the native scenario (ID=1) it walks the host process page table, translating host virtual addresses (HVA) to machine physical addresses (MPA), while in the guest scenario (ID=2) it walks the shadow page table (SPT), translating guest virtual addresses (GVA) to MPA]
 More guest processes in different guest OSes are also allowed.
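In other words, the shadow page table folds the usual two-step translation (guest virtual to guest physical through the guest's own page table, then guest physical to machine physical through KVM's memory mapping) into a single table, which is why the ID=2 entry lets the IOMMU go from GVA to MPA directly.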
Virtualize I/O Page Faulting - VirtIO-IOMMU, Shadow PPR
[Figure: virtualized I/O page fault handling: when a guest GPU program faults, the IOMMU raises a PPR interrupt on the host; the host PPR handler records the fault in the shadow PPR module and has KVM inject a virtual interrupt into the guest; the guest's VirtIO-IOMMU driver then reads the fault information from the shadow PPR and fixes the fault]
 PPR: Peripheral Page Request
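One plausible shape for the fault record that the shadow PPR module shares with the guest is sketched below; the structure, field names, and flow comments are illustrative assumptions rather than the implementation described on the slides.

    /* Illustrative shadow-PPR fault record shared between the host-side shadow
     * PPR module and the guest's VirtIO-IOMMU driver. */
    #include <stdint.h>

    struct example_shadow_ppr_entry {
        uint32_t guest_pasid;   /* guest process the fault belongs to */
        uint32_t flags;         /* e.g. write access, permission fault */
        uint64_t fault_gva;     /* faulting guest virtual address */
        uint64_t ppr_tag;       /* echoed back so the host can send the
                                   matching COMPLETE command to the IOMMU */
    };

    /* Host flow: the PPR handler sees that the faulting PASID belongs to a
     * guest, appends an entry to a ring shared with that guest, and asks KVM
     * to inject a virtual interrupt.
     * Guest flow: the VirtIO-IOMMU driver dequeues the entry, faults the page
     * in within the guest process, and returns ppr_tag to the host, which then
     * issues the COMPLETE command to the physical IOMMU. */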
System Architecture
[Figure: overall system architecture: in the guest OS, applications use the HSA runtime library together with the VirtIO-KFD and VirtIO-IOMMU front-ends; in QEMU (a host process) the corresponding back-ends work with KVM, the shadow PPR module, the host KFD, and the IOMMU driver; VirtIO-KFD provides user-level queueing, the shadow page table path provides shared virtual memory, and shadow PPR with VirtIO-IOMMU provides I/O page faulting, while the host KFD and IOMMU driver program the physical GPU and IOMMU]
 KFD: Kernel Fusion Driver
 PPR: Peripheral Page Request
Evaluation
• Queue initialization time
• Measuring overheads of VirtIO-KFD
• GPU execution time
• Measuring overheads of shadow page table and shadow PPR
  Configuration     | Native       | Guest
  Hardware platform | Kaveri       | Kaveri
  Memory            | 8 GB         | 4 GB
  Number of CPUs    | 4            | 4
  OS                | Ubuntu 13.10 | Ubuntu 13.10
Queue Initialization Time
Average 30% performance drop relative to native.
GPU Execution Time
Achieves an average of about 95% of native performance in most cases.
  GPU time (sec)       | Native | Guest
  BinarySearch         | 0.0108 | 0.0113
  FastWalshTransform   | 0.0018 | 0.0019
  BitonicSort          | 0.014  | 0.016
  FloydWarshall        | 16.094 | 16.603
  MatrixMultiplication | 8.012  | 8.286
  MatrixTranspose      | 0.502  | 0.538
  MonteCarloAsian      | 17.458 | 18.342
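As a quick check against the table above: FloydWarshall runs at 16.094/16.603 ≈ 97% of native speed, MatrixMultiplication at 8.012/8.286 ≈ 97%, MonteCarloAsian at 17.458/18.342 ≈ 95%, and MatrixTranspose at 0.502/0.538 ≈ 93%, while the very short BitonicSort drops to 0.014/0.016 ≈ 88%.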
[Figure: small benchmarks that enqueue many times repeatedly enqueue a task, kick the GPU, and wait for the completion signal; because kicking the GPU and delivering the signal each require a world switch between guest and host, the signal arrives with a delay, and for short GPU kernels this delay dominates the execution time]
Conclusion
• Successfully implemented a hypervisor that virtualizes the HSA features.
• Guest systems can benefit from HSA and carry out heterogeneous computing.
• The GPU in Kaveri can be shared among multiple guest OSes and the host OS.
Thanks!
Q&A
gic4107@gmail.com
Editor's Notes
  • #2: Hello everyone. My name is Yu-Ju Huang. Here is the author list: myself, my partner, and two professors; we are all from Taiwan. This is my topic today. It's a little long, right? I'm going to give you a brief introduction to this work and an overall picture of it, and I hope you enjoy it. In this work, our target is a hardware architecture called the Heterogeneous System Architecture, or HSA for short. HSA mainly focuses on making heterogeneous computing systems more powerful and more efficient. Given an HSA-compliant hardware platform, we implement a hypervisor running on top of it, and the hypervisor virtualizes the features provided by HSA so that virtual machines can also get the benefits of HSA.
  • #3: First, I'll introduce the motivation of this work, and then give a brief background on HSA, including the HSA features and AMD's implementation on Kaveri, which is the first HSA-compliant platform and also our target platform. After that, we'll talk about our design and implementation, followed by the evaluation and the conclusion.
  • #4: For the motivation, we start from heterogeneous computing. The heterogeneous computing programming model requires data communication between devices, and this communication causes inefficiency and programmability inconvenience, so the HSA Foundation proposed the HSA architecture to resolve these problems. As for the motivation of our work: if heterogeneous computing becomes more and more popular in the future, there must be a hypervisor that lets virtual machines get the benefits of HSA. Although our discussion is based on HSA and our implementation on AMD's platform, our design philosophy can also be applied to other platforms, or even other architectures, that try to improve heterogeneous computing systems.
  • #5: OK, let's introduce HSA. As described, HSA tries to solve the communication inefficiency and inconvenience, and it proposes many features. The list here covers the features that determine how a program executes; these are also what we need to virtualize. First, shared virtual memory: before HSA, the CPU and GPU used different memories and address spaces, so data copies were required. With HSA, all the computing resources, such as the CPU, the GPU, and other HSA-aware devices, see the same virtual address space and can access system memory with virtual addresses, which eliminates the data copies. I/O page faulting is a requirement for shared virtual memory: if we allow I/O devices to access system memory directly, the page fault service must support them as well. With user-level queueing, tasks no longer have to be dispatched to the GPU by the OS or the GPU driver; under HSA the GPU can see all the user-level queues, so dispatching jobs does not trap into the GPU driver any more, which reduces dispatch latency. Finally, memory-based signaling is also designed to reduce OS intervention latency. Before HSA, once the GPU finished its task it issued an interrupt to the CPU and let the CPU notify the user-space program, which incurs OS intervention overhead. HSA instead makes the GPU able to write to a particular memory address, assigned by the application when it dispatches the job, to signal completion. Of these four features, memory-based signaling is achieved for free once the GPU can access the process address space, so we actually only have to virtualize the first three.
  • #6: In the following pages, I will introduce AMD's implementation of the HSA features. For shared virtual memory, AMD provides an IOMMU through which the GPU and other HSA-aware devices translate virtual addresses to physical addresses. Since the CPU and GPU see the same process address space, the page table used by the IOMMU should be the same one the CPU MMU uses, so by setting the page table properly the shared virtual memory feature is achieved.
  • #7: For I/O page faulting, AMD designed a mechanism called PPR, the peripheral page request. This request is issued by the IOMMU as an interrupt to the CPU when address translation fails, for example because the page does not exist or there is insufficient permission to access it. The IOMMU also writes a log entry containing the faulting process ID and the fault address. With this information, the Linux API get_user_pages can be used to fix the I/O page fault. Here is the brief flow of I/O page fault handling.
  • #8: As for the user-level queueing feature, the key is making the GPU aware of the addresses of the user-level queues. AMD designed a driver called the kernel fusion driver, or KFD, for this purpose. During program initialization, the CREATE_QUEUE API sends the address of the user-level queue to the KFD, and the KFD sets this address in the GPU. After that, the driver's intervention is no longer needed: the driver is used only during initialization, and at computation time the GPU and the user program work together directly.
  • #9: In the previous slides I described what we need to virtualize; from here on I will introduce how we virtualize these HSA features. This page gives the overview, and I will elaborate on each item in the following pages. One thing I need to mention is that we use a shadow page table to virtualize shared virtual memory. You may wonder why an SPT is adopted rather than nested paging. This is due to a constraint of AMD's IOMMU; it is a little complicated, so I will not describe it in this talk, but you can find the explanation in the proceedings and the paper.
  • #10: As I described earlier, the key to supporting user-level queueing is letting the GPU know the address of the user-level queue. So we implemented VirtIO-KFD, as shown in the slide. VirtIO-KFD helps the guest application pass the address of its queue through to the real KFD, and the KFD sets it in the GPU. In this way, the GPU knows the address of the guest application's queue.
  • #11: Next, shared virtual memory. The shadow page table guides the MMU to translate guest virtual addresses to machine physical addresses, so in our work we just need to find the address of the shadow page table and set it in the IOMMU when a guest application tries to use the GPU.
  • #12: This is a snapshot of the GPU execution state. The IOMMU maintains a table mapping each process address space ID to the corresponding page table address. In this scenario, two processes use the GPU. For native execution, where the GPU runs a program dispatched by a host application, the IOMMU knows where to find the host application's page table. For guest execution, the GPU runs a program dispatched by a guest application, and that program is encoded in the guest virtual address space, so the IOMMU finds the corresponding SPT to translate GVAs to MPAs. As you can expect, this table can be extended, so in our design multiple processes from different guest OSes, and even the host OS, can share the GPU. In this way we also achieve GPU sharing in our work.
  • #13: The final feature is I/O page faulting. One challenge in virtualizing it is that the PPR log region, which stores the page fault information, lives inside a special I/O region that the guest system is normally not allowed to access. So we implemented a module called shadow PPR, which stores the information about guest GPU programs' page faults. When a PPR occurs, the PPR handler decides whether it was caused by a guest program; if so, it stores the information into the shadow PPR. The shadow PPR then kicks KVM to inject a virtual interrupt into the guest OS. Inside the guest OS, we implemented a VirtIO-IOMMU driver to handle the I/O page fault: it gets the page fault information from the shadow PPR and fixes the fault. This is how we virtualize I/O page faulting.
  • #14: The whole system architecture: VirtIO-KFD for user-level queueing, the shadow page table for shared virtual memory, and VirtIO-IOMMU with the shadow PPR for I/O page faulting.
  • #15: About the experiments: we use the AMD SDK samples as our benchmarks, and we report queue initialization time and GPU execution time to evaluate our design.
  • #16: The data is normalized against the native scenario; there is about a 30% performance drop. This drop is mainly caused by propagating requests from VirtIO-KFD to the real KFD, since that path incurs world-switch overhead. But an application usually performs this initialization only once, so the performance drop is not a great concern.
  • #17: For GPU execution time, the major cause of the performance drop is I/O page fault handling. As you can see, our design gets a good result, around 95% of native performance in most cases. As for the two poorer cases, FastWalshTransform and BitonicSort, these two benchmarks do perform a little worse. The reason can be seen in the figure, which shows the flow of an application dispatching a job, waiting for a signal, and getting notified when the GPU finishes the job. While the guest application waits for the signal, the CPU may switch to another process; if the GPU finishes the job and sends the notification at a moment when the CPU is owned by some process other than the guest system, the application gets the signal late. The red arrows show this delay. Why do only these two benchmarks suffer from it? Looking at the raw data, they are small benchmarks with only about 10 ms of GPU execution time; for long benchmarks this signal delay is amortized. Another reason is that these two benchmarks enqueue many times, so they stay inside this loop and the overhead accumulates. BinarySearch, although also a small benchmark, enqueues only once, so the overhead is invisible.
  • #18: To conclude our work: we implemented a hypervisor that lets guest systems get the benefits of HSA, and furthermore we also achieve GPU sharing.