Storage Performance Tuning for FAST! Virtual Machines
Fam Zheng
Senior Software Engineer
LC3-2018
Outline
• Virtual storage provisioning
• NUMA pinning
• VM configuration options
• Summary
• Appendix
Virtual storage provisioning
Provisioning virtual disks

[Diagram: applications inside the virtual machine use a virtual block device driver on top of KVM; the backend behind that device ("???") is what the following slides compare.]

• Virtual storage provisioning is to expose host persistent storage to the guest for applications’ use
• A device of a certain type is presented on a system bus
• The guest uses a corresponding driver to do I/O
• The disk space is allocated from the storage available on the host
QEMU emulated devices
• Device types: virtio-blk, virtio-scsi, IDE, NVMe, ...
• QEMU block features
• qcow2, live snapshot
• throttling
• block migration
• incremental backup
• …
• Easy and flexible backend configuration
• Wide range of protocols: local file, NBD, iSCSI, NFS, Gluster,
Ceph, ...
• Image formats: qcow2, raw, LUKS, …
• Pushed hard for performance
• IOThread polling; userspace driver; multiqueue block layer (WIP)
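For example, a local qcow2 file and an iSCSI LUN can both be attached through -drive; the paths and the target name below are placeholders, not from the slides:
qemu-system-x86_64 … \
  -drive file=/var/lib/images/disk0.qcow2,format=qcow2,if=virtio,cache=none,aio=native \
  -drive file=iscsi://192.0.2.1/iqn.2018-06.example:target0/1,if=virtio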
QEMU emulated device I/O (file backed)

[Diagram: the vCPU/KVM hand requests to the QEMU main thread, which submits them through the host file system, block layer, SCSI layer and device driver.]

I/O request lifecycle:
Guest virtio driver
↓
KVM ioeventfd
↓
vdev vring handler
↓
QEMU block layer
↓
Linux AIO / POSIX syscall
↓
Host VFS/block/SCSI layer
↓
Host device driver
↓
Hardware
QEMU virtio IOThread

[Diagram: the IOThread handles the virtio queue between the guest (vCPU/KVM) and the host storage, separate from the QEMU main thread.]

● A dedicated thread to handle virtio vrings
● Now fully supports QEMU block layer features
  (previously known as x-data-plane of virtio-blk, which was limited to raw format and had no block jobs)
● Currently one IOThread per device; multi-queue support is being worked on
● Adaptive polling enabled: optimizes away the event notifiers from the critical path (Linux AIO, vring, ...) and reduces latency by up to 20%
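The polling window of an IOThread can also be tuned when the iothread object is created; a minimal sketch, with an arbitrary example value:
qemu-system-x86_64 … \
  -object iothread,id=iothread0,poll-max-ns=32768 \
  -device virtio-blk-pci,drive=drive0,iothread=iothread0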
QEMU userspace NVMe driver (new in QEMU 2.12)

[Diagram: the IOThread drives the NVMe controller directly through vfio-pci.ko, bypassing the host storage stack.]

With the help of VFIO, QEMU accesses the host controller’s submission and completion queues without doing any syscall. MSI/IRQ is delivered to the IOThread with an eventfd if adaptive polling of the completion queues doesn’t get a result.

No host file system, block layer or SCSI: the data path is shortened. The QEMU process uses the controller exclusively.
SPDK vhost-user

[Diagram: QEMU shares the virtio queues with the SPDK vhost process through hugepage-backed shared memory; SPDK drives the NVMe device with its poll mode driver (nvme pmd).]

Virtio queues are handled by a separate process, SPDK vhost, which is built on top of DPDK and has a userspace poll mode NVMe driver.

The QEMU IOThread and the host kernel are out of the data path. Latency is greatly reduced by busy polling.

No QEMU block features. No migration (with the NVMe poll mode driver).
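For reference, the QEMU side of such a setup looks roughly like the following sketch; the socket path must match the one the SPDK vhost target exposes, and the guest memory has to be hugepage-backed and shared:
qemu-system-x86_64 … \
  -object memory-backend-file,id=mem0,size=4G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=vhost0,path=/var/tmp/vhost.0 \
  -device vhost-user-blk-pci,chardev=vhost0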
vfio-pci device assignment

[Diagram: the guest NVMe driver (nvme.ko) talks to the assigned device directly through vfio-pci.ko; the QEMU main thread is not involved in I/O.]

Highly efficient: the guest driver accesses the device queues directly, without VMEXIT.

No block features of the host system or QEMU. Migration is not possible.
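A minimal sketch of assigning the controller, reusing the example PCI address from the userspace-driver slide; the device must first be bound to vfio-pci on the host (as shown later):
qemu-system-x86_64 … \
  -device vfio-pci,host=0000:44:00.0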
Provisioning virtual disks

Type              | Configuration           | QEMU block features | Migration | Special requirements        | Supported in current RHEL/RHV
QEMU emulated     | IDE                     | ✓                   | ✓         |                             | ✓
QEMU emulated     | NVMe                    | ✓                   | ✓         |                             | ✗
QEMU emulated     | virtio-blk, virtio-scsi | ✓                   | ✓         |                             | ✓
vhost             | vhost-scsi              | ✗                   | ✗         |                             | ✗
SPDK              | vhost-user              | ✗                   | ✓         | Hugepages                   | ✗
Device assignment | vfio-pci                | ✗                   | ✗         | Exclusive device assignment | ✓
Sometimes higher performance means less flexibility
[Chart: fio randread bs=4k iodepth=1 numjobs=1 (IOPS), comparing ahci; virtio-scsi w/ iothread; virtio-blk w/ iothread; virtio-blk w/ iothread + userspace driver; vhost-user-blk (SPDK) (**); vfio-pci; and the host /dev/nvme0n1 baseline.]

Backend: NVMe, Intel® SSD DC P3700 Series 400G
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, Fedora 28
Guest: Q35, 1 vCPU, Fedora 28
QEMU: 8e36d27c5a
(**): SPDK poll mode driver threads dedicate 100% of their host CPU cores
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
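The fio parameters quoted in these charts correspond to an invocation roughly like the following; the guest device name and runtime are assumptions:
fio --name=randread --filename=/dev/vdb --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based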
NUMA Pinning
NUMA (Non-uniform memory access)

[Diagram: a two-socket host; the vCPU, IOThread, vfio-pci.ko and NVMe driver all sit on one NUMA node together with the NVMe device, while the other node is out of the I/O path.]

Goal: put the vCPU, IOThread and virtual memory on the same NUMA node as the host device that undertakes the I/O.
Automatic NUMA balancing
• Kernel feature to achieve good NUMA locality
• Periodic NUMA unmapping of process memory
• NUMA hinting fault
• Migrate on fault - moves memory to where the program using it
runs
• Task NUMA placement - moves running programs closer to their
memory
• Enabled by default in RHEL:
cat /proc/sys/kernel/numa_balancing
1
• Decent performance in most cases
• Disable it if using manual pinning
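For example, to turn it off before applying manual pinning:
# echo 0 > /proc/sys/kernel/numa_balancing
(or: sysctl -w kernel.numa_balancing=0)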
Manual NUMA pinning
• Option 1: Allocate all vCPUs and virtual memory on the optimal
NUMA node
$ numactl -N 1 -m 1 qemu-system-x86_64 …
• Or use Libvirt (*)
• Restrictive on resource allocation:
• Cannot use all host cores
• NUMA-local memory is limited
• Option 2: Create a guest NUMA topology matching the host, pin
IOThread to host storage controller’s NUMA node
• Libvirt is your friend! (*)
• Relies on the guest to do the right NUMA tuning
* See appendix for Libvirt XML examples
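To find the right node, the storage controller’s locality can be read from sysfs and compared with the host topology; nvme0 below is an example device name:
$ cat /sys/class/nvme/nvme0/device/numa_node
$ numactl --hardware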
[Chart: fio randread bs=4k iodepth=1 numjobs=1 (IOPS), no NUMA pinning vs. NUMA pinning; pinning gains about +5%.]

Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA balancing disabled. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
VM Configuration Options
Raw block device vs image file
• Image file is more flexible, but slower
• Raw block device has better performance, but harder to
manage
• Note: snapshots are still possible with a raw block device, e.g. by formatting it as qcow2:
$ qemu-img create -f qcow2 -b /path/to/base/image.qcow2 \
  /dev/sdc
QEMU emulated device I/O (block device backed)

[Diagram: the IOThread submits I/O to /dev/nvme0n1 through nvme.ko; no host file system is involved.]

Using a raw block device may improve performance: there is no file system in the host data path.
Middle ground: use LVM

[Chart: fio randrw bs=4k iodepth=1 numjobs=1 (IOPS), raw file on xfs vs. LVM block device.]

Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a

LVM is much more flexible and easier to manage than raw block devices or partitions, and has good performance.
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
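A minimal sketch of carving a logical volume out of the NVMe disk and handing it to the guest; the volume group and LV names are made up:
# pvcreate /dev/nvme0n1
# vgcreate vg_guests /dev/nvme0n1
# lvcreate -L 100G -n vm1_disk vg_guests
qemu-system-x86_64 … \
  -drive file=/dev/vg_guests/vm1_disk,format=raw,if=none,id=drive0,cache=none,aio=native \
  -device virtio-blk-pci,drive=drive0,iothread=iothread0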
Using QEMU VirtIO IOThread
• When using virtio, it’s recommended to enable an IOThread:
qemu-system-x86_64 … \
  -object iothread,id=iothread0 \
  -device virtio-blk-pci,iothread=iothread0,id=… \
  -device virtio-scsi-pci,iothread=iothread0,id=…
• Or in Libvirt...
Using QEMU VirtIO IOThread (Libvirt)

<domain>
  ...
  <iothreads>1</iothreads>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' iothread='1'/>
      <target dev='vda' bus='virtio'/>
      ...
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver iothread='1'/>
      ...
    </controller>
  </devices>
</domain>
[Chart: virtio-blk with and without an IOThread, fio randread bs=4k iodepth=1 numjobs=1 (IOPS).]

Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk
QEMU: 8e36d27c5a
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
virtio-blk vs virtio-scsi
• Use virtio-scsi for many disks, or for full SCSI support (e.g.
unmap, write same, SCSI pass-through)
• virtio-blk DISCARD and WRITE ZEROES are being worked on
• Use virtio-blk for best performance
[Chart: fio blocksize=4k numjobs=1 (IOPS), virtio-blk vs. virtio-scsi at iodepth=1 and iodepth=4, for randread and randrw.]

Backend: Intel® SSD DC P3700 Series; QEMU userspace driver (nvme://)
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. IOThread enabled.
QEMU: 8e36d27c5a
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
Raw vs qcow2
• Don’t like the trade-off between features and performance?
• Try increasing qcow2 run-time cache size
qemu-system-x86_64 … \
  -drive file=my.qcow2,if=none,id=drive0,aio=native,cache=none,cache-size=16M \
  ...
• Or increase the cluster_size when creating qcow2 images:
qemu-img create -f qcow2 -o cluster_size=2M my.qcow2 100G
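As a rule of thumb, the qcow2 L2 cache needs roughly disk_size / cluster_size × 8 bytes to cover the whole image:
# 100G image, 64k clusters: 100G / 64k × 8 ≈ 12.5 MiB → cache-size=16M covers it all
# 100G image, 2M clusters:  100G / 2M × 8 ≈ 400 KiB → the default cache is already enough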
Raw vs qcow2

[Chart: fio blocksize=4k numjobs=1 iodepth=1 (IOPS) for qcow2 (64k cluster), qcow2 (64k cluster, 16M cache), qcow2 (2M cluster) and raw, each for randrw and randread.]

Backend: Intel® SSD DC P3700 Series, formatted as xfs; Virtual disk size: 100G; Preallocation: full
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
AIO: native vs threads
• aio=native is usually better than aio=threads
• May depend on the file system and workload
• ext4 with aio=native is slower because its io_submit is not implemented asynchronously

[Chart: fio 4k randread numjobs=1 iodepth=16 (IOPS) for xfs, ext4 and a raw NVMe backend, each with aio=threads and aio=native.]

Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
Image preallocation
• Reserve space on file system for user data or metadata:
$ qemu-img create -f $fmt -o preallocation=$mode test.img 100G
• Common modes for raw and qcow2:
• off: no preallocation
• falloc: use posix_fallocate() to reserve space
• full: reserve by writing zeros
• qcow2 specific mode:
• metadata: fully create L1/L2/refcnt tables and pre-calculate cluster
offsets, but don’t allocate space for clusters
• Consider enabling preallocation when disk space is not a
concern (it may defeat the purpose of thin provisioning)
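For instance (file names are arbitrary):
$ qemu-img create -f raw -o preallocation=falloc disk.img 100G
$ qemu-img create -f qcow2 -o preallocation=metadata,cluster_size=2M disk.qcow2 100G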
Image preallocation
• Mainly affects the first pass of write performance after creating the VM

[Chart: fio 4k randwrite numjobs=1 iodepth=1 (IOPS) for raw, qcow2 (64k cluster) and qcow2 (2M cluster), each with preallocation off, metadata (qcow2 only), falloc and full.]

Backend: Intel® SSD DC P3700 Series; File system: xfs
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
Cache modes
• Cache modes consist of three separate semantics
• Usually cache=none is the optimal value
• To avoid redundant page cache in both host kernel and guest kernel
with O_DIRECT
• But feel free to experiment with writeback/directsync as well
• unsafe can be useful for throwaway VMs or guest installation
Cache mode          | Disk write cache | Host page cache bypassed (O_DIRECT) | Ignore flush (dangerous!)
writeback (default) | Y                | N                                   | N
none                | Y                | Y                                   | N
writethrough        | N                | N                                   | N
directsync          | N                | Y                                   | N
unsafe              | Y                | N                                   | Y
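On the QEMU command line the mode is selected per drive; a sketch with made-up file names:
qemu-system-x86_64 … -drive file=disk.qcow2,if=virtio,cache=none,aio=native    # long-lived VM
qemu-system-x86_64 … -drive file=scratch.qcow2,if=virtio,cache=unsafe          # throwaway / install VM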
NVMe userspace driver in QEMU
• Usage
• Bind the device to vfio-pci.ko:
# modprobe vfio
# modprobe vfio-pci
# echo 0000:44:00.0 > /sys/bus/pci/devices/0000:44:00.0/driver/unbind
# echo 8086 0953 > /sys/bus/pci/drivers/vfio-pci/new_id
• Use the nvme:// protocol for the disk backend:
qemu-system-x86_64 … \
  -drive file=nvme://0000:44:00.0/1,if=none,id=drive0 \
  -device virtio-blk,drive=drive0,id=vblk0,iothread=...
[Chart: userspace NVMe driver vs. linux-aio, fio randread bs=4k numjobs=1 (IOPS) at iodepth=1 and iodepth=4.]

Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
IO scheduler
• blk-mq has been enabled on virtio-blk and virtio-scsi
• Available schedulers: none, mq-deadline, kyber, bfq
• If using an SSD, select one of none, mq-deadline or kyber depending on your workload
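The scheduler can be inspected and switched per disk inside the guest; vda is an example virtio-blk device name:
$ cat /sys/block/vda/queue/scheduler
# echo kyber > /sys/block/vda/queue/scheduler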
[Chart: fio 4k randread IOPS in the guest with the none, mq-deadline, kyber and bfq schedulers, at iodepth=1 and 16, numjobs=1 and 4.]

[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
Summary
• Plan out your virtual machines based on your own constraints:
Live migration, IO throttling, live snapshot, incremental backup,
hardware availability, ...
• Workload characteristics must be accounted for
• Upgrade your QEMU and Kernel, play with the new features!
• Don’t make assumptions about performance, benchmark it!
• NVMe performance heavily depends on preconditioning, take
the numbers with a grain of salt
• Have fun tuning for your FAST! virtual machines :-)
THANK YOU
Appendix: vhost-scsi
[Diagram: the virtio-scsi queues are handled by the in-kernel vhost target with an LIO backend on top of nvme.ko; the QEMU main thread is out of the data path.]

I/O requests on the virtio queue are handled by the host kernel vhost LIO target.

The data path is efficient: no context switch to userspace is needed (the IOThread is out of the data path). Backend configuration with LIO is relatively flexible.

Not widely used. No migration support. No QEMU block layer features.
Appendix: QEMU SCSI pass-through

[Diagram: virtio-scsi requests are forwarded by the IOThread, via a worker thread, to the host SCSI device /dev/sdX with ioctl(fd, SG_IO, ...).]

SCSI commands are passed from the guest SCSI subsystem (or userspace SG_IO) to the device. Convenient for exposing a host device’s SCSI functions to the guest.

No asynchronous SG_IO interface is available, so aio=native has no effect.
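A sketch of how such a pass-through disk can be wired up on the command line, reusing the /dev/sdX placeholder from the diagram; scsi-block passes SCSI commands through to the host device:
qemu-system-x86_64 … \
  -object iothread,id=iothread0 \
  -device virtio-scsi-pci,id=scsi0,iothread=iothread0 \
  -drive file=/dev/sdX,if=none,id=drive_sg,format=raw \
  -device scsi-block,drive=drive_sg,bus=scsi0.0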
Appendix: NUMA - Libvirt XML syntax (1)
• vCPU pinning:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
</cputune>
Appendix: NUMA - Libvirt XML syntax (2)
• Memory allocation policy
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
<cpu>
  <numa>
    <cell id="0" cpus="0-1" memory="3" unit="GiB"/>
    <cell id="1" cpus="2-3" memory="3" unit="GiB"/>
  </numa>
</cpu>
Appendix: NUMA - Libvirt XML syntax (3)
• Pinning the whole emulator
<cputune>
<emulatorpin cpuset="1-3"/>
</cputune>
Appendix: NUMA - Libvirt XML syntax (4)
• Creating guest NUMA topology: Use pcie-expander-bus and
pcie-root-port to associate device to virtual NUMA node
<controller type='pci' index='3' model='pcie-expander-bus'>
<target busNr='180'>
<node>1</node>
</target>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='6' model='pcie-root-port'>
<model name='ioh3420'/>
<target chassis='6' port='0x0'/>
<address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
