qCUDA-ARM: Virtualization for
Embedded GPU Architectures
Jordan@NTHU
2019/07/25
Outline
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
2
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
3
Emergence of Internet of Things (IoT)
● The number of IoT devices is rising dramatically.
● Application domains: smart home, Industry 4.0, and autonomous cars.
● A new computing paradigm: edge computing.
4
Edge Computing
● Concept
Deploy computational devices close to the data source.
● Challenge: diversity of application scenarios
○ The design of edge node architectures is still an open
research problem.
○ Possible solutions follow.
5
Research Trend in System Design
● Virtualization
○ Adds a dimension of flexibility that eases edge node
architectural design.
○ Customizes specific hardware resources and creates an
isolated environment for each application.
● Heterogeneous computing
○ A heterogeneous system contains more than one type
of processor, so different applications can find the
proper resources to use. Ex: embedded GPUs.
6
About This Research
● Investigate the fusion of two important trends
in IoT, heterogeneity and virtualization, and present
qCUDA-ARM, a virtualization solution for embedded GPU
architectures.
● Evaluate qCUDA-ARM’s performance on an Nvidia TX2
development board using three different benchmarks and
two applications, and compare the results with rCUDA
and native CUDA.
7
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
8
Virtualization
● Software-simulated
platform/hardware.
● Multiple virtual machines on a
single physical machine.
● Benefits:
○ Improve hardware utilization
○ Isolation
○ Fault tolerance
9
http://techxposer.com/2018/01/30/understanding-virtualization-hypervisors/
GPGPU
10
● GPGPU
○ General-purpose computation on GPUs, beyond
computer graphics.
○ Acceleration through massively parallel execution. Use
cases include scientific computing, digital currency, and
machine learning.
● Requirements:
○ Needs CPU assistance.
○ Must be able to program the GPU.
GPGPU Working Model
11
● The GPU works in concert with the CPU.
● 4 steps: copy input data to GPU memory, instruct the GPU, execute in
parallel, and copy the results back (see the sketch after the next slide).
GPGPU Programming Model
12
● Numerous solutions, including
CUDA, OpenCL, OpenVIDIA and
MATLAB.
● CUDA is one of the most famous
solutions.
○ Developed by NVIDIA.
○ Provides high-level language
syntax for software
developers to access the
computation elements of the GPU.
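To make the working and programming models concrete, below is a minimal textbook CUDA vector-addition program (generic example code, not qCUDA's) that walks through the four steps:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each thread adds one element pair.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Step 1: copy input data from CPU memory to GPU memory.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Steps 2-3: the CPU launches the kernel; the GPU executes it in parallel.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Step 4: copy the result back from GPU memory to CPU memory.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```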
GPU Virtualization
13
● Provide GPU functionality to virtual machines.
● GPU virtualization methods
○ Hardware-assisted Virtualization
■ Direct Pass-through
■ Mediated Pass-through
○ Software Virtualization
■ Full Virtualization
■ Para Virtualization
● API-Remoting
Direct Pass-through
● The guest OS directly accesses GPUs
via hardware extension features
provided by either motherboard
chipsets or GPU manufacturers.
● Limitation
○ A single GPU cannot be shared
among multiple VMs.
● Ex:
○ Intel VT-d [Abramson et al. 2006]
○ AMD-Vi [Van Doorn 2006]
14
https://blogs.vmware.com/apps/2018/05/machine-learning-using-virtualized-gpus-on-vmware-vsphere.html
Mediated Passthrough (1/2)
● Multiple Virtual GPU (vGPU)
● Each VM owns a full-featured vGPU.
● VMs can directly access
performance-critical resources, without
hypervisor intervention in most cases.
● Privileged operations from the guest are
trapped, providing secure isolation.
● Ex:
○ gVirt [Tian et al. 2014]
○ KVMGT [Song et al. 2014]
○ gScale [Xue et al. 2016]
15
https://projectacrn.github.io/latest/developer-guides/hld/hld-APL_GVT-g.html
Full-Virtualization
● Software simulation
● The native driver can run on the guest without any
modification of GPU libraries and drivers.
● Incurs performance overhead, since software
traps and emulates the hardware interface of the existing GPU. Ex:
○ GPUvm [Suzuki et al. 2014]
○ G-KVM [Hong et al. 2016]
16
Para-Virtualization
● Slightly modifies a custom driver in the guest to deliver
sensitive operations directly to the host driver, avoiding
hypervisor intervention and hence improving performance.
● EX:
○ LoGV [Gottschlag et al. 2013]
17
API-Remoting (1/3)
18
● Wraps the GPU APIs as
a “front-end” in the guest.
● Mediates all accesses to the
GPU as a “back-end” in the host,
through a transport layer.
● Challenge: minimizing the
communication overhead
between back-end and front-end.
● EX:
○ GVirtuS [Giunta et al. 2010]
○ vCUDA [Shi et al. 2009]
○ rCUDA [Duato et al. 2010b]
○ qCUDA [Yu-Shiang et al. 2017]
API-Remoting (2/3)
● rCUDA [Duato et al. 2010b]
○ Offloads CUDA computation to remote GPUs.
○ Designed for HPC scenarios.
○ Client/server architecture.
○ Optimized for InfiniBand; achieves near-native performance
over it.
○ Communicates over TCP/IP, which may become the
performance bottleneck.
19
https://www.researchgate.net/figure/Overview-of-the-general-architecture-of-the-rCUDA-virtualization-solution_fig1_267514566
API-Remoting (3/3)
● qCUDA
○ Based on KVM, virtio
○ Avoids extra memory copies between the guest and host
machine.
○ The work that qCUDA-ARM is based on.
20
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
21
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
22
qCUDA Overview
23
[Figure: baseline stack — a CUDA program running in the guest, QEMU-KVM underneath, and the GPU driver plus Nvidia GPU on the host.]
qCUDA Architecture Overview
24
[Figure: the same stack with the qCUDA components added — qcu-library beside the guest CUDA program, qcu-driver in the guest, and qcu-device inside QEMU-KVM on the host.]
qCUDA Components
25
● qcu-library
○ The interposer library in the guest system (sketched below).
○ Provides the native CUDA API and wrapper functions
for dynamic memory allocation.
● qcu-driver
○ Memory management.
○ Parses the CUDA commands coming from qcu-library
and forwards them to qcu-device.
● qcu-device
○ GPU binary registration.
○ GPU context management.
○ Address space translation.
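As a rough illustration of how an interposer forwards a CUDA call from qcu-library to qcu-driver, here is a hypothetical sketch; the device node name (/dev/qcuda), the ioctl number, and the command struct are invented for illustration and are not qCUDA's actual interface:

```cuda
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/ioctl.h>

struct qcu_cmd { unsigned long long ptr; unsigned long long size; };
#define QCU_CUDAMALLOC _IOWR('q', 1, struct qcu_cmd)

// Built as a shared library that shadows libcudart in the guest:
// the application's cudaMalloc() call lands here and is forwarded
// to the guest kernel driver through an ioctl on the device node.
extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
    static int fd = -1;
    if (fd < 0) fd = open("/dev/qcuda", O_RDWR);
    struct qcu_cmd cmd;
    cmd.ptr = 0;
    cmd.size = size;
    if (fd < 0 || ioctl(fd, QCU_CUDAMALLOC, &cmd) < 0)
        return cudaErrorMemoryAllocation;
    *devPtr = (void *)cmd.ptr;  // device pointer chosen on the host side
    return cudaSuccess;
}
```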
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
26
Data Movement in GPU Virtualization
27
● In the GPU working model, data
needs to be transferred between
CPU and GPU memory.
● A guest CUDA program needs to
transfer its data to the host GPU.
● In GPU virtualization, this data
must travel through the
host and the hypervisor.
● Critical to virtualization
performance.
● Prior works copy data from guest
to host; this, however, requires extra
data copies and increases latency.
Data Movement in qCUDA
28
● Eliminate the extra data copy between guest and host
(zero-copy).
○ Guest and host are on the same machine.
○ Memory allocated by the guest should be visible to the host.
● Address space conversion does the job.
Memory Management in qCUDA
29
Two goals: zero copy and pinned host memory
Memory Management in qCUDA
30
1. The guest CUDA program allocates/frees
memory by calling the hooked allocation/
free functions in qcu-library.
2. qcu-library changes the behavior of
memory allocation/release.
3. Allocates/releases 4 MB chunks.
4. Address space conversion (GVA -> GPA),
sketched after this list.
5. Records/removes bookkeeping information.
6. Communicates with the host.
7. Address space conversion (GPA -> HVA).
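As a concrete (heavily simplified) illustration of step 4, a guest kernel driver can pin a user page and derive its guest-physical address roughly as follows; the helper is illustrative, not qcu-driver's actual code, and get_user_pages_fast() is shown with its 4.4-era signature (start, nr_pages, write, pages):

```c
#include <linux/mm.h>

/* Pin one user page and return its guest-physical address. */
static unsigned long gva_to_gpa(unsigned long gva)
{
    struct page *page;

    /* Pin the page so the guest kernel cannot page it out or move it. */
    if (get_user_pages_fast(gva, 1, 1, &page) != 1)
        return 0;

    /* Guest-physical address = page frame base + in-page offset. */
    return (page_to_pfn(page) << PAGE_SHIFT) | (gva & ~PAGE_MASK);
}
```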
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
31
Pinned Host Memory
32
● For CUDA, there are 2 kinds of host memory: pageable
and page-locked.
○ Pageable: can be paged out by the OS.
○ Page-locked: will not be paged out.
● Page-locked memory regions can be copied directly by DMA,
accelerating data copy performance.
● However, memory allocated by the host is pageable by default.
● Pageable memory must first be copied to a temporarily
allocated pinned memory region, then copied to the GPU.
Allocate Pinned Host Memory
33
● Pinned-memory-related functions (usage sketched below):
○ cudaHostRegister(): page-locks a pre-allocated memory
region.
○ cudaHostAlloc(): allocates a page-locked memory
region.
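A minimal usage sketch of the two APIs (standard CUDA runtime calls, independent of qCUDA):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = 4 << 20;

    // cudaHostAlloc(): allocate a new page-locked host region.
    float *pinned;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

    // cudaHostRegister(): page-lock an existing (pageable) region.
    float *pageable = (float *)malloc(bytes);
    cudaHostRegister(pageable, bytes, cudaHostRegisterDefault);

    // Both regions can now be used for direct DMA transfers.
    cudaHostUnregister(pageable);
    free(pageable);
    cudaFreeHost(pinned);
    return 0;
}
```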
Allocate Pinned Host Memory in qCUDA
34
● After step 7, guest-CUDA-
allocated memory can be
accessed from the host.
● cudaHostRegister() this region.
● Since both map to the same
physical frames, the guest-CUDA-
allocated memory becomes
page-locked.
Malfunction of cudaHostRegister() in
ARM Architecture
35
● cudaHostRegister() is not supported on the ARM architecture.
● On ARM, pinned host memory can only be allocated
with cudaHostAlloc().
● However, the original qCUDA implementation of cudaHostAlloc() is
broken too, since it is based on cudaHostRegister().
● qCUDA-ARM implements it in a different way.
qCUDA cudaHostAlloc() Internal
36
qCUDA-ARM Host Pinned Allocation Mechanism
37
qCUDA-ARM Host Pinned Allocation Mechanism
38
● Each memory allocation creates a unique device node
under /dev.
● The guest memory region in the host is mapped to one of these
device nodes; let’s call it GMR_fd.
qCUDA-ARM Host Pinned Allocation Mechanism
39
● Allocate another page-locked
memory region of the same size and
map it to GMR_fd.
● The content is then identical in both
regions and can be transparently
transferred from guest to host
GPU or vice versa.
qCUDA-ARM Host Pinned Allocation Mechanism
40
The content is identical in both regions and can be transparently transferred from
the guest to the host GPU or vice versa.
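One plausible way to realize this aliasing, sketched under the assumption that the per-allocation device node can simply be mmap'ed; the node name /dev/qcuda_gmr0 is invented for illustration:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t bytes = 4UL << 20;          /* one 4 MB guest chunk */
    int gmr_fd = open("/dev/qcuda_gmr0", O_RDWR);
    if (gmr_fd < 0) { perror("open"); return 1; }

    /* Map the device node again: this page-locked view is backed by
     * the same physical pages as the guest memory region, so guest
     * writes appear here without any copy. */
    void *pinned_view = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_LOCKED, gmr_fd, 0);
    if (pinned_view == MAP_FAILED) { perror("mmap"); return 1; }

    /* pinned_view can now serve as the source/destination of
     * DMA-capable cudaMemcpy() calls. */
    return 0;
}
```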
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
41
Pinned Memory Bandwidth Improvement
42
● The initial experiment shows low memory bandwidth
when copying data with cudaMemcpy() /
cudaMemcpyAsync().
● Only 1/10 of native bandwidth.
● Try to improve it.
cudaMemcpy() / cudaMemcpyAsync()
43
● cudaMemcpy(dst, src, count, kind)
○ dst - destination memory address
○ src - source memory address
○ count - size in bytes to copy
○ kind - direction of the transfer (e.g. cudaMemcpyHostToDevice)
qCUDA Pinned Memory Copy
Mechanism
44
qCUDA-ARM Pinned Memory Copy
Mechanism
45
qCUDA-ARM Pinned Memory Copy
Mechanism
46
● GMR_ptr can be obtained from the page-locked mapping of GMR_fd.
● The pinned source pointer can be calculated as cudaHostAllocPtr + (src -
hva_start_addr).
● Pinned host memory is then copied with only one call of cudaMemcpy()
(see the sketch below).
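Putting the two bullets together, the optimized copy path might look like the following sketch; pinned_h2d_copy and its parameters reuse the names from the slide but are otherwise illustrative:

```cuda
#include <cuda_runtime.h>

// `src` is assumed to be a host virtual address inside the original
// guest memory region; cudaHostAllocPtr and hva_start_addr are
// recorded when the pinned mirror region is set up.
cudaError_t pinned_h2d_copy(void *dst, char *src, size_t count,
                            char *cudaHostAllocPtr, char *hva_start_addr)
{
    // Locate the same bytes inside the page-locked mirror region.
    char *pinned_src = cudaHostAllocPtr + (src - hva_start_addr);
    // One DMA-capable copy replaces the chunk-by-chunk staged copy.
    return cudaMemcpy(dst, pinned_src, count, cudaMemcpyHostToDevice);
}
```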
Enable KVM on TX2
47
● KVM is a fundamental building block for qCUDA, but it is not
supported out of the box on the TX2 development board.
● To enable KVM on the ARM architecture, the CPU must be booted
into HYP mode, a CPU mode for running hypervisors.
Enable KVM on TX2
48
● The device tree of the Nvidia-released Linux kernel misses the
register settings required by the Virtual Generic Interrupt Controller.
● Rebuilding the Linux kernel with this information provided solves this
problem.
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
49
Experiment Environment (1/2)
50
● NVIDIA TX2 development board
○ 4 ARM Cortex-A57 cores, 2 Denver 64-bit
CPU cores, 8 GB DDR4 RAM
○ NVIDIA Pascal Architecture GPU with
compute-capability 6.2
● Host software stack
○ Ubuntu 16.04 LTS with patched Linux
kernel 4.4.38
○ CUDA toolkit 8.0
● Guest machine
○ 4 cores, 4GB RAM, 16GB disk image
○ Ubuntu 14.04.5 LTS
○ CUDA toolkit 8.0
Experiment Environment (2/2)
51
● Although rCUDA is not designed for embedded GPUs, it is the only
GPU virtualization package that works on embedded systems.
● rCUDA only supports an ARM front-end, so an x86 machine is needed as the back-end.
● DELL XPS 9550 laptop
○ Intel Core i7-6700HQ, 8 GB DDR4 RAM, 256 GB SSD
○ NVIDIA GeForce GTX 960M GPU with compute capability 5.0
● Host software stack
○ Ubuntu 16.04 LTS with Linux kernel 4.15.0-50-generic
○ CUDA toolkit 8.0
Benchmarks
52
● Three benchmarks and two applications are chosen.
● Goal: compare the performance with rCUDA and native CUDA.
● Since rCUDA computes wrong results or fails on all
benchmarks except bandwidthTest, we can only compare the remaining results
with native CUDA.
Memory Bandwidth
53
● Measures the transfer bandwidth between CPU memory
and GPU memory.
● bandwidthTest benchmark.
● Transfer directions
○ Host to device (H2D)
○ Device to host (D2H)
● Data sizes range from 1 KB to 1 GB.
Pageable Memory Bandwidth
54
Pinned Memory Bandwidth (original)
55
Pinned Memory Bandwidth (Optimized)
56
Pinned Memory Bandwidth Comparison
(Original v.s Optimized)
57
Pinned Memory Bandwidth
Compare with rCUDA and Native
58
Matrix Multiplication
59
● Measures qCUDA-ARM’s performance on tasks with
heavy computation and small data transfers.
● 3 steps:
○ Copy two matrices A, B from guest memory to GPU
memory.
○ Calculate matrix C (C = A×B) on the GPU.
○ Copy matrix C from GPU memory back to guest
machine memory.
● Matrix dimensions range from 32 to 4096.
Matrix Multiplication Performance
60
Matrix Multiplication Performance Profile
61
Matrix Multiplication CUDA APIs Time
62
The time of cudaEventSynchronize() increases as the matrix
dimension increases.
Vector Addition
63
● Compared to matrixMul, this is an I/O-bound benchmark.
● Measures qCUDA-ARM’s performance on tasks with small
computation and large data transfers.
● Contains 3 steps:
○ Copy two vectors A, B from guest memory to GPU memory.
○ Calculate vector C (C = A + B) on the GPU.
○ Copy vector C from GPU memory back to guest machine memory.
● Vector lengths range from 100M to 25600M.
Vector Addition Performance
64
Vector Addition Performance Profile
65
Vector Addition CUDA APIs Time
66
The time of both H2D and D2H gradually dominates the
total execution time.
Edge Detection
67
● This benchmark is a GPGPU implementation of the Sobel operator.
● The Sobel edge detector uses a pair of 3 × 3 convolution masks, one
estimating the gradient in the x-direction and the other estimating the gradient in
the y-direction (an illustrative kernel is sketched after this list).
● 3 input images:
○ 800 × 532 (1.3 MB)
○ 1920 × 1080 (6 MB)
○ 4928 × 3624 (47 MB)
● Contains 3 steps:
○ Copy the input image from guest memory to GPU memory.
○ Perform edge detection on the GPU.
○ Copy the processed image from GPU memory back to host memory.
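For reference, an illustrative Sobel kernel (not the benchmark's actual code): each thread applies the two 3 × 3 masks to one interior pixel.

```cuda
__global__ void sobel(const unsigned char *in, unsigned char *out,
                      int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

    // Gx estimates the horizontal gradient, Gy the vertical one.
    int gx = -in[(y-1)*w + (x-1)] + in[(y-1)*w + (x+1)]
             - 2*in[y*w + (x-1)]  + 2*in[y*w + (x+1)]
             - in[(y+1)*w + (x-1)] + in[(y+1)*w + (x+1)];
    int gy = -in[(y-1)*w + (x-1)] - 2*in[(y-1)*w + x] - in[(y-1)*w + (x+1)]
             + in[(y+1)*w + (x-1)] + 2*in[(y+1)*w + x] + in[(y+1)*w + (x+1)];

    int mag = abs(gx) + abs(gy);           // cheap gradient magnitude
    out[y*w + x] = mag > 255 ? 255 : mag;  // clamp to 8-bit
}
```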
Edge Detection Performance
68
Edge Detection Performance Profile
69
Edge Detection CUDA APIs Time
70
The time of both H2D and D2H gradually dominates the
total execution time.
Edge Detection CUDA APIs Efficiency
71
GPGPU Accelerated Cryptocurrency Miner
72
● Measures the performance of qCUDA-ARM with large computation and
small data transmission.
● This benchmark simulates the cryptocurrency mining process, which takes
○ an input string, which is hashed to form the target hash;
○ an ACCEPTED_ZEROS parameter to control the mining difficulty.
● Contains 4 steps:
○ The CPU reads the input string and preprocesses it.
○ Allocate GPU memory and copy the input string into it.
○ The GPU launches multiple threads to mine the hash.
○ Copy the result from GPU memory back to host memory.
● ACCEPTED_ZEROS ranges from 25 to 35, in steps of 2.
GPGPU Accelerated Cryptocurrency Miner
Performance
73
GPGPU Accelerated Cryptocurrency Miner
Performance Profile
74
GPGPU Accelerated Cryptocurrency Miner
CUDA APIs Time
75
The time of D2H gradually dominates the total execution
time.
Pinned Memory Bandwidth Performance
of Multiple VMs
76
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
77
Conclusion
78
● In this thesis, we present qCUDA-ARM, a GPGPU virtualization
solution for embedded GPU architectures.
● Due to the architectural differences, we redesigned qCUDA-ARM’s
memory management system to fit the platform.
● Improved the pinned memory copy bandwidth by 9x.
● Experiments on the Nvidia TX2 board achieve 90% of
native bandwidth.
Future Works
79
● Implement more CUDA API functions in order
to support more real-world applications.
● Fine-tune the pinned memory allocation
mechanism to improve memory utilization.
Thank you for your time and attention.
80
Appendix
81
Comparison
82
              Full            Para            Direct       Mediated     API
              virtualization  virtualization  passthrough  passthrough  remoting
Performance   Low             High            High         High         Depends
GPU sharing   Yes             Yes             No           Yes          Yes
Flexibility   High            Low             High         High         Low
qcu-device (2/2)
83
● Some CUDA runtime APIs could not be executed directly in the host, due to
implicit function calls for CUDA context creation or kernel launch.
● qcu-device implements these runtime APIs with low-level driver APIs for
context and kernel function management.
Full-Virtualization
● Software simulation
● The native driver can run on the guest without any
modification of GPU libraries and drivers.
● Incurs performance overhead, since software
traps and emulates the hardware interface of the existing GPU. Ex:
○ GPUvm [Suzuki et al. 2014]
○ G-KVM [Hong et al. 2016]
84
API-Remoting (2/4)
● GVirtuS [Giunta et al. 2010]
○ Virtualizes CUDA.
○ Supports Xen, KVM and
VMware.
○ Tries to minimize
communication overhead by
leveraging the high-
performance communication
channels provided by the
hypervisors.
85
● vCUDA [Shi et al. 2009]
○ Virtualizes CUDA.
○ Supports Xen and KVM.
○ One vGPU per application.
○ Avoids frequent context
switching between the guest
OS and the hypervisor,
improving communication
performance.
○ Batches CUDA calls via Lazy
RPC, which batches
specific CUDA calls that can
be delayed.
Observation
86
● Guest-CUDA-allocated memory can be seen from the host after
step 3.
● Another way is needed to keep these regions
page-locked.