qCUDA-ARM: Virtualization for
Embedded GPU Architectures
Jordan@NTHU
2019/07/25
Outline
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
2
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
3
Emergence of Internet of Things (IoT)
● The number of IoT devices is rising dramatically.
● Application domains: smart home, Industry 4.0, and autonomous cars.
● A new computing paradigm: edge computing.
4
Edge Computing
● Concept
Deploy computational devices close to the data source.
● Challenge: diversity of application scenarios
○ The design of edge node architectures is still an open
research problem.
○ Possible solutions follow.
5
Research Trend in System Design
● Virtualization
○ Adds a dimension of flexibility that eases edge node
architectural design.
○ Customizes specific hardware resources and creates an
isolated environment for each application.
● Heterogeneous computing
○ A heterogeneous system contains more than one type
of processor, so different applications can find the
proper resources to use. Ex: embedded GPUs.
6
About This Research
● Investigate the fusion of two important trends
in IoT, heterogeneity and virtualization, and present
qCUDA-ARM, a virtualization solution for embedded GPU
architectures.
● Evaluate qCUDA-ARM’s performance on an Nvidia TX2
development board using three different benchmarks and
two applications, and compare the results with rCUDA
and native CUDA.
7
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
8
Virtualization
● Software-simulated
platform/hardware.
● Multiple virtual machines on a
single physical machine.
● Benefits:
○ Improve hardware utilization
○ Isolation
○ Fault tolerance
9
http://techxposer.com/2018/01/30/understanding-virtualization-hypervisors/
GPGPU
10
● GPGPU
○ General-purpose computation on GPUs, beyond
computer graphics.
○ Acceleration through massively parallel execution. Use
cases include scientific computing, digital currency, and
machine learning.
● Requirements:
○ Needs CPU assistance.
○ Must be able to program the GPU.
GPGPU Working Model
11
● The GPU works in concert with the CPU.
● 4 steps: copy input data to GPU memory, instruct the GPU, execute in
parallel, and copy the results back (see the sketch after the next slide).
GPGPU Programming Model
12
● Numerous solutions, including
CUDA, OpenCL, OpenVIDIA and
MATLAB.
● CUDA is one of the most famous
solutions.
○ Developed by NVIDIA.
○ Provides high-level language
syntax for software
developers to access the
computation elements of the GPU.
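To make the working and programming models concrete, below is a minimal textbook CUDA vector-addition program (generic example code, not qCUDA's) that walks through the four steps:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each thread adds one element pair.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Step 1: copy input data from CPU memory to GPU memory.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Steps 2-3: the CPU launches the kernel; the GPU executes it in parallel.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Step 4: copy the result back from GPU memory to CPU memory.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```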
GPU Virtualization
13
● Provide GPU functionality to virtual machines.
● GPU virtualization methods
○ Hardware-assisted Virtualization
■ Direct Pass-through
■ Mediated Pass-through
○ Software Virtualization
■ Full Virtualization
■ Para Virtualization
● API-Remoting
Direct Pass-through
● The guest OS directly accesses GPUs
via hardware extension features
provided by either motherboard
chipsets or GPU manufacturers.
● Limitation
○ A single GPU cannot be shared
among multiple VMs.
● Ex:
○ Intel VT-d [Abramson et al. 2006]
○ AMD-Vi [Van Doorn 2006]
14
https://blogs.vmware.com/apps/2018/05/machine-learning-using-virtualized-gpus-on-vmware-vsphere.html
Mediated Passthrough (1/2)
● Multiple Virtual GPU (vGPU)
● Each VM owns a full-featured vGPU.
● VMs can directly access
performance-critical resources, without
hypervisor intervention in most cases.
● Privileged operations from the guest are
trapped, providing secure isolation.
● Ex:
○ gVirt [Tian et al. 2014]
○ KVMGT [Song et al. 2014]
○ gScale [Xue et al. 2016]
15
https://projectacrn.github.io/latest/developer-guides/hld/hld-APL_GVT-g.html
Full-Virtualization
● Software simulation
● The native driver can run on the guest without any
modification of GPU libraries and drivers.
● Incurs performance overhead, since software
traps and emulates the hardware interface of the existing GPU. Ex:
○ GPUvm [Suzuki et al. 2014]
○ G-KVM [Hong et al. 2016]
16
Para-Virtualization
● Slightly modifies a custom driver in the guest to deliver
sensitive operations directly to the host driver, avoiding
hypervisor intervention and hence improving performance.
● EX:
○ LoGV [Gottschlag et al. 2013]
17
API-Remoting (1/3)
18
● Wraps the GPU APIs as
a “front-end” in the guest.
● Mediates all accesses to the
GPU as a “back-end” in the host,
through a transport layer.
● Challenge: minimizing the
communication overhead
between back-end and front-end.
● EX:
○ GVirtuS [Giunta et al. 2010]
○ vCUDA [Shi et al. 2009]
○ rCUDA [Duato et al. 2010b]
○ qCUDA [Yu-Shiang et al. 2017]
API-Remoting (2/3)
● rCUDA [Duato et al. 2010b]
○ Offloads CUDA computation to remote GPUs.
○ Designed for HPC scenarios.
○ Client/server architecture.
○ Optimized for InfiniBand; achieves near-native performance
over it.
○ Communicates over TCP/IP, which may become the
performance bottleneck.
19
https://www.researchgate.net/figure/Overview-of-the-general-architecture-of-the-rCUDA-virtualization-solution_fig1_267514566
API-Remoting (3/3)
● qCUDA
○ Based on KVM, virtio
○ Avoids extra memory copies between the guest and host
machine.
○ The work that qCUDA-ARM is based on.
20
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
21
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
22
qCUDA Overview
23
[Figure: baseline stack — a CUDA program running in the guest, QEMU-KVM underneath, and the GPU driver plus Nvidia GPU on the host.]
qCUDA Architecture Overview
24
[Figure: the same stack with the qCUDA components added — qcu-library beside the guest CUDA program, qcu-driver in the guest, and qcu-device inside QEMU-KVM on the host.]
qCUDA Components
25
● qcu-library
○ The interposer library in the guest system (sketched below).
○ Provides the native CUDA API and wrapper functions
for dynamic memory allocation.
● qcu-driver
○ Memory management.
○ Parses the CUDA commands coming from qcu-library
and forwards them to qcu-device.
● qcu-device
○ GPU binary registration.
○ GPU context management.
○ Address space translation.
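As a rough illustration of how an interposer forwards a CUDA call from qcu-library to qcu-driver, here is a hypothetical sketch; the device node name (/dev/qcuda), the ioctl number, and the command struct are invented for illustration and are not qCUDA's actual interface:

```cuda
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/ioctl.h>

struct qcu_cmd { unsigned long long ptr; unsigned long long size; };
#define QCU_CUDAMALLOC _IOWR('q', 1, struct qcu_cmd)

// Built as a shared library that shadows libcudart in the guest:
// the application's cudaMalloc() call lands here and is forwarded
// to the guest kernel driver through an ioctl on the device node.
extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
    static int fd = -1;
    if (fd < 0) fd = open("/dev/qcuda", O_RDWR);
    struct qcu_cmd cmd;
    cmd.ptr = 0;
    cmd.size = size;
    if (fd < 0 || ioctl(fd, QCU_CUDAMALLOC, &cmd) < 0)
        return cudaErrorMemoryAllocation;
    *devPtr = (void *)cmd.ptr;  // device pointer chosen on the host side
    return cudaSuccess;
}
```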
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
26
Data Movement in GPU Virtualization
27
● In the GPU working model, data
needs to be transferred between
CPU and GPU memory.
● A guest CUDA program needs to
transfer its data to the host GPU.
● In GPU virtualization, this data
must travel through the
host and the hypervisor.
● Critical to virtualization
performance.
● Prior works copy data from guest
to host; this, however, requires extra
data copies and increases latency.
Data Movement in qCUDA
28
● Eliminate the extra data copy between guest and host
(zero-copy).
○ Guest and host are on the same machine.
○ Memory allocated by the guest should be visible to the host.
● Address space conversion does the job.
Memory Management in qCUDA
29
Two goals: zero copy and pinned host memory
Memory Management in qCUDA
30
1. The guest CUDA program allocates/frees
memory by calling the hooked allocation/
free functions in qcu-library.
2. qcu-library changes the behavior of
memory allocation/release.
3. Allocates/releases 4 MB chunks.
4. Address space conversion (GVA -> GPA),
sketched after this list.
5. Records/removes bookkeeping information.
6. Communicates with the host.
7. Address space conversion (GPA -> HVA).
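As a concrete (heavily simplified) illustration of step 4, a guest kernel driver can pin a user page and derive its guest-physical address roughly as follows; the helper is illustrative, not qcu-driver's actual code, and get_user_pages_fast() is shown with its 4.4-era signature (start, nr_pages, write, pages):

```c
#include <linux/mm.h>

/* Pin one user page and return its guest-physical address. */
static unsigned long gva_to_gpa(unsigned long gva)
{
    struct page *page;

    /* Pin the page so the guest kernel cannot page it out or move it. */
    if (get_user_pages_fast(gva, 1, 1, &page) != 1)
        return 0;

    /* Guest-physical address = page frame base + in-page offset. */
    return (page_to_pfn(page) << PAGE_SHIFT) | (gva & ~PAGE_MASK);
}
```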
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
31
Pinned Host Memory
32
● For CUDA, there are 2 kinds of host memory: pageable
and page-locked.
○ Pageable: can be paged out by the OS.
○ Page-locked: will not be paged out.
● Page-locked memory regions can be copied directly by DMA,
accelerating data copy performance.
● However, memory allocated by the host is pageable by default.
● Pageable memory must first be copied to a temporarily
allocated pinned memory region, then copied to the GPU.
Allocate Pinned Host Memory
33
● Pinned-memory-related functions (usage sketched below):
○ cudaHostRegister(): page-locks a pre-allocated memory
region.
○ cudaHostAlloc(): allocates a page-locked memory
region.
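A minimal usage sketch of the two APIs (standard CUDA runtime calls, independent of qCUDA):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = 4 << 20;

    // cudaHostAlloc(): allocate a new page-locked host region.
    float *pinned;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

    // cudaHostRegister(): page-lock an existing (pageable) region.
    float *pageable = (float *)malloc(bytes);
    cudaHostRegister(pageable, bytes, cudaHostRegisterDefault);

    // Both regions can now be used for direct DMA transfers.
    cudaHostUnregister(pageable);
    free(pageable);
    cudaFreeHost(pinned);
    return 0;
}
```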
Allocate Pinned Host Memory in qCUDA
34
● After step 7, guest-CUDA-
allocated memory can be
accessed from the host.
● cudaHostRegister() this region.
● Since both map to the same
physical frames, the guest-CUDA-
allocated memory becomes
page-locked.
Malfunction of cudaHostRegister() in
ARM Architecture
35
● cudaHostRegister() is not supported on the ARM architecture.
● On ARM, pinned host memory can only be allocated
with cudaHostAlloc().
● However, the original qCUDA implementation of cudaHostAlloc() is
broken too, since it is based on cudaHostRegister().
● qCUDA-ARM implements it in a different way.
qCUDA cudaHostAlloc() Internal
36
qCUDA-ARM Host Pinned Allocation Mechanism
37
qCUDA-ARM Host Pinned Allocation Mechanism
38
● Each memory allocation creates a unique device node
under /dev.
● The guest memory region in the host is mapped to one of these
device nodes; let’s call it GMR_fd.
qCUDA-ARM Host Pinned Allocation Mechanism
39
● Allocate another page-locked
memory region of the same size and
map it to GMR_fd.
● The content is then identical in both
regions and can be transparently
transferred from guest to host
GPU or vice versa.
qCUDA-ARM Host Pinned Allocation Mechanism
40
The content is identical in both regions and can be transparently transferred from
the guest to the host GPU or vice versa.
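One plausible way to realize this aliasing, sketched under the assumption that the per-allocation device node can simply be mmap'ed; the node name /dev/qcuda_gmr0 is invented for illustration:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t bytes = 4UL << 20;          /* one 4 MB guest chunk */
    int gmr_fd = open("/dev/qcuda_gmr0", O_RDWR);
    if (gmr_fd < 0) { perror("open"); return 1; }

    /* Map the device node again: this page-locked view is backed by
     * the same physical pages as the guest memory region, so guest
     * writes appear here without any copy. */
    void *pinned_view = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_LOCKED, gmr_fd, 0);
    if (pinned_view == MAP_FAILED) { perror("mmap"); return 1; }

    /* pinned_view can now serve as the source/destination of
     * DMA-capable cudaMemcpy() calls. */
    return 0;
}
```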
● Introduction
● Background
● Design and Implementation
○ qCUDA System Architecture
○ Memory Management of qCUDA
○ Memory Management of qCUDA-ARM
○ Memory Bandwidth Improvement
● Experiments
● Conclusion
41
Pinned Memory Bandwidth Improvement
42
● The initial experiment shows low memory bandwidth
when copying data with cudaMemcpy() /
cudaMemcpyAsync().
● Only 1/10 of native bandwidth.
● Try to improve it.
cudaMemcpy() / cudaMemcpyAsync()
43
● cudaMemcpy(dst, src, count, kind)
○ dst - destination memory address
○ src - source memory address
○ count - size in bytes to copy
○ kind - direction of the transfer (e.g. cudaMemcpyHostToDevice)
qCUDA Pinned Memory Copy
Mechanism
44
qCUDA-ARM Pinned Memory Copy
Mechanism
45
qCUDA-ARM Pinned Memory Copy
Mechanism
46
● GMR_ptr can be obtained from the page-locked mapping of GMR_fd.
● The pinned source pointer can be calculated as cudaHostAllocPtr + (src -
hva_start_addr).
● Pinned host memory is then copied with only one call of cudaMemcpy()
(see the sketch below).
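Putting the two bullets together, the optimized copy path might look like the following sketch; pinned_h2d_copy and its parameters reuse the names from the slide but are otherwise illustrative:

```cuda
#include <cuda_runtime.h>

// `src` is assumed to be a host virtual address inside the original
// guest memory region; cudaHostAllocPtr and hva_start_addr are
// recorded when the pinned mirror region is set up.
cudaError_t pinned_h2d_copy(void *dst, char *src, size_t count,
                            char *cudaHostAllocPtr, char *hva_start_addr)
{
    // Locate the same bytes inside the page-locked mirror region.
    char *pinned_src = cudaHostAllocPtr + (src - hva_start_addr);
    // One DMA-capable copy replaces the chunk-by-chunk staged copy.
    return cudaMemcpy(dst, pinned_src, count, cudaMemcpyHostToDevice);
}
```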
Enable KVM on TX2
47
● KVM is a fundamental building block for qCUDA, but it is not
supported out of the box on the TX2 development board.
● To enable KVM on the ARM architecture, the CPU must be booted
into HYP mode, a CPU mode for running hypervisors.
Enable KVM on TX2
48
● The device tree of the Nvidia-released Linux kernel misses the
register settings required by the Virtual Generic Interrupt Controller.
● Rebuilding the Linux kernel with this information provided solves this
problem.
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
49
Experiment Environment (1/2)
50
● NVIDIA TX2 development board
○ 4 ARM Cortex-A57 cores, 2 Denver 64-bit
CPU cores, 8 GB DDR4 RAM
○ NVIDIA Pascal Architecture GPU with
compute-capability 6.2
● Host software stack
○ Ubuntu 16.04 LTS with patched Linux
kernel 4.4.38
○ CUDA toolkit 8.0
● Guest machine
○ 4 cores, 4GB RAM, 16GB disk image
○ Ubuntu 14.04.5 LTS
○ CUDA toolkit 8.0
Experiment Environment (2/2)
51
● Although rCUDA is not designed for embedded GPUs, it is the only
GPU virtualization package that works on embedded systems.
● rCUDA only supports an ARM front-end, so an x86 machine is needed as the back-end.
● DELL XPS 9550 laptop
○ Intel Core i7-6700HQ, 8 GB DDR4 RAM, 256 GB SSD
○ NVIDIA GeForce GTX 960M GPU with compute capability 5.0
● Host software stack
○ Ubuntu 16.04 LTS with Linux kernel 4.15.0-50-generic
○ CUDA toolkit 8.0
Benchmarks
52
● Three benchmarks and two applications are chosen.
● Goal: compare the performance with rCUDA and native CUDA.
● Since rCUDA computes wrong results or fails on all
benchmarks except bandwidthTest, we can only compare the remaining results
with native CUDA.
Memory Bandwidth
53
● Measures the transfer bandwidth between CPU memory
and GPU memory.
● bandwidthTest benchmark.
● Transfer directions
○ Host to device (H2D)
○ Device to host (D2H)
● Data sizes range from 1 KB to 1 GB.
Pageable Memory Bandwidth
54
Pinned Memory Bandwidth (original)
55
Pinned Memory Bandwidth (Optimized)
56
Pinned Memory Bandwidth Comparison
(Original v.s Optimized)
57
Pinned Memory Bandwidth
Compare with rCUDA and Native
58
Matrix Multiplication
59
● Measures qCUDA-ARM’s performance on tasks with
heavy computation and small data transfers.
● 3 steps:
○ Copy two matrices A, B from guest memory to GPU
memory.
○ Calculate matrix C (C = A×B) on the GPU.
○ Copy matrix C from GPU memory back to guest
machine memory.
● Matrix dimensions range from 32 to 4096.
Matrix Multiplication Performance
60
Matrix Multiplication Performance Profile
61
Matrix Multiplication CUDA APIs Time
62
The time of cudaEventSynchronize() increases as the matrix
dimension increases.
Vector Addition
63
● Compared to matrixMul, this is an I/O-bound benchmark.
● Measures qCUDA-ARM’s performance on tasks with small
computation and large data transfers.
● Contains 3 steps:
○ Copy two vectors A, B from guest memory to GPU memory.
○ Calculate vector C (C = A + B) on the GPU.
○ Copy vector C from GPU memory back to guest machine memory.
● Vector lengths range from 100M to 25600M.
Vector Addition Performance
64
Vector Addition Performance Profile
65
Vector Addition CUDA APIs Time
66
The time of both H2D and D2H gradually dominates the
total execution time.
Edge Detection
67
● This benchmark is a GPGPU implementation of the Sobel operator.
● The Sobel edge detector uses a pair of 3 × 3 convolution masks, one
estimating the gradient in the x-direction and the other estimating the gradient in
the y-direction (an illustrative kernel is sketched after this list).
● 3 input images:
○ 800 × 532 (1.3 MB)
○ 1920 × 1080 (6 MB)
○ 4928 × 3624 (47 MB)
● Contains 3 steps:
○ Copy the input image from guest memory to GPU memory.
○ Perform edge detection on the GPU.
○ Copy the processed image from GPU memory back to host memory.
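For reference, an illustrative Sobel kernel (not the benchmark's actual code): each thread applies the two 3 × 3 masks to one interior pixel.

```cuda
__global__ void sobel(const unsigned char *in, unsigned char *out,
                      int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

    // Gx estimates the horizontal gradient, Gy the vertical one.
    int gx = -in[(y-1)*w + (x-1)] + in[(y-1)*w + (x+1)]
             - 2*in[y*w + (x-1)]  + 2*in[y*w + (x+1)]
             - in[(y+1)*w + (x-1)] + in[(y+1)*w + (x+1)];
    int gy = -in[(y-1)*w + (x-1)] - 2*in[(y-1)*w + x] - in[(y-1)*w + (x+1)]
             + in[(y+1)*w + (x-1)] + 2*in[(y+1)*w + x] + in[(y+1)*w + (x+1)];

    int mag = abs(gx) + abs(gy);           // cheap gradient magnitude
    out[y*w + x] = mag > 255 ? 255 : mag;  // clamp to 8-bit
}
```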
Edge Detection Performance
68
Edge Detection Performance Profile
69
Edge Detection CUDA APIs Time
70
The time of both H2D and D2H gradually dominates the
total execution time.
Edge Detection CUDA APIs Efficiency
71
GPGPU Accelerated Cryptocurrency Miner
72
● Measures the performance of qCUDA-ARM with large computation and
small data transmission.
● This benchmark simulates the cryptocurrency mining process, which takes
○ an input string, which is hashed to form the target hash;
○ an ACCEPTED_ZEROS parameter to control the mining difficulty.
● Contains 4 steps:
○ The CPU reads the input string and preprocesses it.
○ Allocate GPU memory and copy the input string into it.
○ The GPU launches multiple threads to mine the hash.
○ Copy the result from GPU memory back to host memory.
● ACCEPTED_ZEROS ranges from 25 to 35, in steps of 2.
GPGPU Accelerated Cryptocurrency Miner
Performance
73
GPGPU Accelerated Cryptocurrency Miner
Performance Profile
74
GPGPU Accelerated Cryptocurrency Miner
CUDA APIs Time
75
The time of D2H gradually dominates the total execution
time.
Pinned Memory Bandwidth Performance
of Multiple VMs
76
● Introduction
● Background
● Design and Implementation
● Experiments
● Conclusion
77
Conclusion
78
● In this thesis, we present qCUDA-ARM, a GPGPU virtualization
solution for embedded GPU architectures.
● Due to the architectural differences, we redesigned qCUDA-ARM’s
memory management system to fit the platform.
● Improved the pinned memory copy bandwidth by 9x.
● Experiments on the Nvidia TX2 board achieve 90% of
native bandwidth.
Future Works
79
● Implement more CUDA API functions in order
to support more real-world applications.
● Fine-tune the pinned memory allocation
mechanism to improve memory utilization.
Thank you for your time and attention.
80
Appendix
81
Comparison
82
              Full            Para            Direct       Mediated     API
              virtualization  virtualization  passthrough  passthrough  remoting
Performance   Low             High            High         High         Depends
GPU sharing   Yes             Yes             No           Yes          Yes
Flexibility   High            Low             High         High         Low
qcu-device (2/2)
83
● Some CUDA runtime APIs could not be executed directly in the host, due to
implicit function calls for CUDA context creation or kernel launch.
● qcu-device implements these runtime APIs with low-level driver APIs for
context and kernel function management.
Full-Virtualization
● Software simulation
● The native driver can run on the guest without any
modification of GPU libraries and drivers.
● Incurs performance overhead, since software
traps and emulates the hardware interface of the existing GPU. Ex:
○ GPUvm [Suzuki et al. 2014]
○ G-KVM [Hong et al. 2016]
84
API-Remoting (2/4)
● GVirtuS [Giunta et al. 2010]
○ Virtualizes CUDA.
○ Supports Xen, KVM and
VMware.
○ Tries to minimize
communication overhead by
leveraging the high-
performance communication
channels provided by the
hypervisors.
85
● vCUDA [Shi et al. 2009]
○ Virtualizes CUDA.
○ Supports Xen and KVM.
○ One vGPU per application.
○ Avoids frequent context
switching between the guest
OS and the hypervisor,
improving communication
performance.
○ Batches CUDA calls via Lazy
RPC, which batches
specific CUDA calls that can
be delayed.
Observation
86
● Guest-CUDA-allocated memory can be seen from the host after
step 3.
● Another way is needed to keep these regions
page-locked.