Improving the performance of the qcow2 format
KVM Forum 2017
Alberto Garcia <berto@igalia.com>
Introduction to the qcow2 format
The qcow2 file format
qcow2: native file format for storing disk images in QEMU.
Multiple features:
Grows on demand.
Supports backing files.
Internal snapshots.
Compression.
Encryption.
Can achieve good performance (comparable to raw files),
but it depends on the scenario.
Making it faster may require:
A correct configuration.
Changes in the QEMU driver.
Changes in the format itself.
Structure of a qcow2 file
A qcow2 file is divided into clusters of equal size
(min: 512 bytes - default: 64 KB - max: 2 MB)
[Diagram: layout of a qcow2 file: QCOW2 header, refcount table, refcount blocks, L1 table, L2 tables and data clusters.]
Structure of a qcow2 file
The virtual disk as seen by the VM is divided
into guest clusters of the same size
[Diagram: guest clusters of the virtual disk (GUEST) mapped onto host clusters of the qcow2 file (HOST).]
L1 and L2 tables
The L1 and L2 tables map guest addresses as seen by the VM
into host addresses in the qcow2 file
[Diagram: the L1 table points to L2 tables, which point to the data clusters.]
Backing files
If QEMU tries to read data from a cluster that hasn’t been
allocated, it goes to the backing file in order to get the data.
Backing files don’t need to have the same format or cluster
size as the active image.
They can be chained: a backing file can have its own
backing file.
The problems of L1 and L2 tables
Cluster mapping: L1 and L2 tables
The L1 and L2 tables map guest clusters to host clusters.
There’s only one L1 table per image (per snapshot,
actually), but it’s small so it can be kept in RAM.
Several L2 tables, allocated on demand as the image grows.
Each time we need to access a data cluster (read or write)
we need to go to its L2 table.
This is one additional I/O operation per request: a severe
impact on performance.
Solution: keep the L2 tables in RAM too.
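To make the two-level lookup concrete, here is a minimal Python sketch based only on the description above (it assumes 8-byte L1/L2 entries; flag bits and error handling are omitted):

CLUSTER_SIZE = 64 * 1024          # default cluster size
L2_ENTRIES = CLUSTER_SIZE // 8    # entries per L2 table (one table = one cluster)

def guest_to_host(guest_offset, l1_table, l2_tables):
    """Translate a guest offset into an offset inside the qcow2 file."""
    cluster_index = guest_offset // CLUSTER_SIZE
    l1_index = cluster_index // L2_ENTRIES        # which L2 table
    l2_index = cluster_index % L2_ENTRIES         # which entry inside it
    l2_table = l2_tables[l1_table[l1_index]]      # the extra I/O if the table is not cached
    host_cluster = l2_table[l2_index]             # 0 means unallocated: go to the backing file
    if host_cluster == 0:
        return None
    return host_cluster + (guest_offset % CLUSTER_SIZE)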
The qcow2 L2 cache
QEMU keeps a cache of L2 tables to speed up disk access.
The maximum amount of L2 metadata depends on the
disk size and the cluster size.
Problem: large images need large amounts of metadata, so
we cannot keep everything in memory.
Cluster size (=L2 table size) Max. L2 size per TB
64 KB 128 MB (2048 tables)
128 KB 64 MB (512 tables)
256 KB 32 MB (128 tables)
512 KB 16 MB (32 tables)
1 MB 8 MB (8 tables)
2 MB 4 MB (2 tables)
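The numbers in this table follow from a simple calculation (a small Python sketch, assuming 8-byte L2 entries as in the qcow2 format):

# Maximum L2 metadata needed per TB of virtual disk.
# Each L2 entry is 8 bytes and maps one cluster; an L2 table is one cluster in size.
TB = 1024 ** 4

for cluster_size in (64 * 1024, 128 * 1024, 256 * 1024, 512 * 1024, 1024 ** 2, 2 * 1024 ** 2):
    l2_bytes = TB * 8 // cluster_size    # total L2 metadata for 1 TB of disk
    tables = l2_bytes // cluster_size    # number of L2 tables (each one is cluster_size bytes)
    print(f"{cluster_size // 1024:>5} KB clusters -> "
          f"{l2_bytes // 1024 // 1024:>4} MB of L2 metadata ({tables} tables)")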
Using the qcow2 L2 cache
The cache keeps full L2 tables in memory.
Default cache size: 1 MB.
It can be changed with the l2-cache-size option:
-drive file=img.qcow2,l2-cache-size=8M
With the default cluster size (64 KB) it’s enough for an 8 GB
disk image.
Setting the right cache size has a dramatic effect on
performance.
Example: random 4K read requests on a fully populated
20 GB image (SSD backend).
L2 cache size Average IOPS
1 MB 5100
1.5 MB 7300
2 MB 12700
2.5 MB 63600
How much cache do we need?
The amount of L2 metadata for a given disk image is:
l2_size = disk_size × 8 / cluster_size
Problem: this formula is too complicated. Why would the
user need to know about it?
QEMU should probably have a good default... but what’s a
good default?
Alternative: instead of saying how much memory we
want, we can say how much disk we want to cover.
This has already been discussed; see Red Hat bug #1377735.
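A sketch of that alternative (the helper below is hypothetical, not an existing QEMU option): derive the cache size from the amount of disk we want to cover, then pass it via l2-cache-size:

def l2_cache_for_coverage(disk_bytes, cluster_size=64 * 1024):
    # 8 bytes of L2 metadata per data cluster.
    return disk_bytes * 8 // cluster_size

# Fully covering a 200 GB image with the default 64 KB clusters:
cache_bytes = l2_cache_for_coverage(200 * 1024 ** 3)
print(cache_bytes // 1024 ** 2, "MB")   # -> 25 MB, e.g. -drive file=img.qcow2,l2-cache-size=25M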
How much cache do we need?: cluster sizes
Increasing the cluster size is an easy way to reduce the
metadata size.
l2_size = disk_size × 8 / cluster_size
Pros:
Same performance with a smaller cache.
Reduces fragmentation.
Cons:
Slower allocations.
Wastes more disk space.
How much cache do we need?: backing files
Problem: each qcow2 image has
its own cache. Backing images
also need theirs!
Things get worse: cached tables
in backing files might end up
being unnecessary.
How much cache do we need?: backing files
Solution: unused cache entries can be cleaned periodically
with the cache-clean-interval setting (value in seconds):
-drive file=hd.qcow2,cache-clean-interval=60
Large cluster sizes mean large L2 tables
An L2 table is always one cluster in size, and each cache
entry can only store one full L2 table. This means:
More I/O if we only need a few entries from an L2 table.
Inflexible and inefficient use of the cache memory.
[Diagram: a 512 GB disk mapped by 1 MB L2 tables; each cache entry holds one full table covering 128 GB of disk, even when only a few of its entries are needed.]
Solution: reduce the cache granularity
Instead of reading complete L2 tables, make the cache read
smaller portions: L2 slices.
Less disk I/O.
The size of the slice can be adjusted to match that of the
host filesystem.
The qcow2 on-disk format does not need to change.
The qcow2 driver in QEMU needs relatively few changes.
Patches available on the mailing list!
Example: random 4K reads (SSD backend).
Disk size Cluster size L2 cache QEMU master 4K slices
16 GB 64 KB 1 MB [8 GB] 5000 IOPS 12700 IOPS
2 TB 2 MB 4 MB [1 TB] 576 IOPS 11000 IOPS
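To see the difference in granularity, compare how much guest data one cache entry maps with a full L2 table versus a 4 KB slice (a Python sketch, assuming 8-byte entries and 2 MB clusters as in the second row above):

def coverage(entry_bytes, cluster_size):
    # entries held by the table/slice, times the data mapped by each entry
    return (entry_bytes // 8) * cluster_size

cluster_size = 2 * 1024 ** 2                          # 2 MB clusters
full_table = coverage(cluster_size, cluster_size)     # an L2 table is one cluster in size
slice_4k = coverage(4096, cluster_size)

print(full_table // 1024 ** 3, "GB per full L2 table")   # 512 GB read/cached at once
print(slice_4k // 1024 ** 3, "GB per 4 KB slice")        # 1 GB read/cached at once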
The problems of cluster allocation
Cluster allocation and copy-on-write
[Diagram: copy-on-write from the backing image into a newly allocated cluster of the active image.]
Allocating a cluster means filling it completely with data.
If the guest write request is small, the rest must be filled
with old data (e.g. from a backing image).
QEMU used up to five operations for this: 2 reads, 3 writes.
It can be done optimally with just two: 1 read, 1 write.
New algorithm already available in QEMU 2.10.
Average increase of IOPS: 60 % (HDD), 15 % (SSD).
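The following is an illustrative sketch of the two approaches (simplified Python, not the actual QEMU code); the comments mark what would be I/O operations in the driver:

CLUSTER = 64 * 1024
backing = bytearray(b"B" * CLUSTER)            # old data in the backing image
guest_data, guest_off = b"G" * 4096, 8192      # small guest write inside the cluster

def cow_old():
    # Up to five operations: two reads plus three writes.
    head = backing[:guest_off]                               # read 1 (old data before the request)
    tail = backing[guest_off + len(guest_data):]             # read 2 (old data after the request)
    cluster = bytearray(CLUSTER)
    cluster[:guest_off] = head                               # write 1
    cluster[guest_off:guest_off + len(guest_data)] = guest_data   # write 2 (guest data)
    cluster[guest_off + len(guest_data):] = tail             # write 3
    return bytes(cluster)

def cow_new():
    # Two operations: read the whole cluster once, merge in memory, write once.
    cluster = bytearray(backing)                             # read 1 (whole cluster)
    cluster[guest_off:guest_off + len(guest_data)] = guest_data
    return bytes(cluster)                                    # write 1 (whole cluster)

assert cow_old() == cow_new()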
Subcluster allocation
[Diagram: an L2 table whose entries point to data clusters divided into individually allocated subclusters.]
Divide each data cluster into subclusters and allocate each
one individually.
Reduces allocation overhead while keeping some benefits
of large clusters.
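As a purely hypothetical illustration of the idea (not the proposed on-disk format), an L2 entry could carry a bitmap recording which subclusters are allocated:

SUBCLUSTERS = 32   # subclusters per cluster (illustrative value only)

def subcluster_allocated(bitmap, cluster_size, guest_offset):
    """Is the subcluster containing guest_offset allocated?"""
    sub_size = cluster_size // SUBCLUSTERS
    index = (guest_offset % cluster_size) // sub_size
    return bool(bitmap & (1 << index))

# A 2 MB cluster (64 KB subclusters) with only its first two subclusters allocated:
print(subcluster_allocated(0b11, 2 * 1024 ** 2, 70 * 1024))    # True: subcluster 1 is allocated
print(subcluster_allocated(0b11, 2 * 1024 ** 2, 200 * 1024))   # False: subcluster 3, served by the backing file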
Subcluster allocation: benefits, problems and status
Last proposed in April 2017; a prototype shows 2 to 4 times
more IOPS during allocations.
If the subcluster size equals the request size, no copy-on-write
is needed: 10 times faster.
Other benefits: it would allow preallocation of images with
backing files.
Problems:
Incompatible changes to the on-disk format.
Increases the complexity of the qcow2 driver.
Increases data fragmentation in the image.
Space preallocation
When writing to a newly-allocated cluster we must
complete the request with the old data (copy-on-write).
If there was no old data, the request is padded with zeroes.
Instead of writing those zeroes, we can use fallocate() to
preallocate and empty the cluster first.
Requires support from the OS and the filesystems (ext4,
xfs, ...).
Patches on the mailing list (by Anton Nefedov).
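As a rough, Linux-only illustration of the idea (a Python sketch using ctypes; FALLOC_FL_ZERO_RANGE comes from <linux/falloc.h> and may not be supported by every kernel or filesystem):

import ctypes, os

FALLOC_FL_ZERO_RANGE = 0x10          # zero the range instead of writing zero buffers
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def zero_range(fd, offset, length):
    # int fallocate(int fd, int mode, off_t offset, off_t len)
    ret = libc.fallocate(fd, FALLOC_FL_ZERO_RANGE,
                         ctypes.c_int64(offset), ctypes.c_int64(length))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Example: pre-zero one 64 KB cluster at the start of a test file.
fd = os.open("test.img", os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 64 * 1024)
zero_range(fd, 0, 64 * 1024)
os.close(fd)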
Other considerations
qcow2 overlap checks
Sanity checks before writing to a qcow2
image.
They verify that a given offset doesn’t overlap
with existing metadata sections.
Available since QEMU 1.7.
Problem: some of these checks are relatively
expensive.
qcow2 overlap checks
Constant time: main-header, active-l1, refcount-table, snapshot-table
Cached data: active-l2, refcount-block, inactive-l1
Needs disk access: inactive-l2
inactive-l2 is disabled by default (it needs to
read all snapshots’ L1 tables).
refcount-block is particularly expensive even
with small images. Optimized in QEMU 2.9.
Checks can be configured with
overlap-check.<check-name>=[on|off]
overlap-check=[constant|cached|all|none]
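For example, following the syntax above (illustrative command lines, not from the original slides):
-drive file=img.qcow2,overlap-check=constant
-drive file=img.qcow2,overlap-check.refcount-block=off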
Status summary
Status summary
qcow2 L2 cache:
Size and cleanup timer are configurable.
Probably needs better defaults or configuration options.
L2 slices:
Patches on the mailing list.
COW with two I/O operations instead of five:
Available in QEMU 2.10.
COW with preallocation instead of writing zeroes:
Patches on the mailing list.
Subcluster allocation:
RFC status. Requires changes to the on-disk format.
Metadata overlap checks:
Slowest check optimized in QEMU 2.9.
Other checks can be disabled manually if needed.
Acknowledgments
Questions & Answers
Thank you!