Subcluster allocation for qcow2 images
KVM Forum 2020
Alberto Garcia <berto@igalia.com>
The qcow2 file format
qcow2: native file format for storing disk images in QEMU.
Many features: grows on demand, backing files, internal
snapshots, compression, encryption...
But why is it sometimes slower than a raw file?
Because it is not correctly configured.
Because the qcow2 driver in QEMU needs to be improved.
Check my presentation at KVM Forum 2017!
Because of the very design of the qcow2 file format.
Today we are going to focus on that.
Structure of a qcow2 file
A qcow2 file is divided into clusters of equal size
(min: 512 bytes - default: 64 KB - max: 2 MB).
[Diagram: the image file laid out as clusters: QCOW2 header, refcount
table, refcount block, L1 table, L2 tables, data clusters]
Structure of a qcow2 file
The virtual disk as seen by the VM is divided
into guest clusters of the same size.
[Diagram: guest clusters on the virtual disk mapped to host clusters
in the image file]
Problem 1: copy-on-write means more I/O
[Diagram: an active image on top of its backing file]
A data cluster is the smallest unit of allocation: writing to a
new data cluster means filling it completely with data.
If the guest write request is small, the rest must be filled
with data from the backing file, or with zeroes (if there is
no backing file).
Problem: QEMU needs to perform additional I/O to copy
the rest of the data.
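The extra I/O described above is easy to quantify. A minimal sketch (illustrative Python, not QEMU code; it assumes the request fits inside a single cluster):

```python
def cow_overhead(cluster_size: int, request_offset: int, request_size: int) -> int:
    """Bytes that must be copied (from the backing file or as zeroes)
    to fill the rest of a newly allocated cluster around a small write."""
    head = request_offset % cluster_size           # bytes before the request
    tail = cluster_size - (head + request_size)    # bytes after the request
    return head + tail

# A 4 KB write into a fresh 64 KB cluster forces 60 KB of copy-on-write:
print(cow_overhead(64 * 1024, 0, 4 * 1024))  # 61440
```

The bigger the cluster, the larger head + tail gets, which is why the IOPS drop as the cluster size grows in the table that follows.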
Problem 1: copy-on-write means more I/O
Example: random 4KB write requests to an empty 40GB image
(SSD backend)
Cluster size With a backing file Without a backing file∗
16 KB 3600 IOPS 5859 IOPS
32 KB 2557 IOPS 5674 IOPS
64 KB 1634 IOPS 2527 IOPS
128 KB 869 IOPS 1576 IOPS
256 KB 577 IOPS 976 IOPS
512 KB 364 IOPS 510 IOPS
(*): Worst case scenario. QEMU first tries fallocate() which is much faster
than writing zeroes
Problem 2: copy-on-write means more used space
The larger the cluster size, the more the image grows with
each allocation.
Example: how much does an image grow after...
...100 MB worth of random 4KB write requests?
...creating a filesystem on an empty 1 TB image?
Cluster size random writes mkfs.ext4
Raw file 101 MB 1.1 GB
4 KB 158 MB 1.1 GB
64 KB 1.6 GB 1.1 GB
512 KB 11 GB 1.3 GB
2 MB 29 GB 2.1 GB
The actual size difference in real-world scenarios depends
a lot on the usage.
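The growth in the table above can be bounded with simple arithmetic (an illustrative sketch; the worst case assumes every request lands in a distinct, previously unallocated cluster):

```python
def worst_case_growth(total_written: int, request_size: int, cluster_size: int) -> int:
    """Upper bound on image growth: each write allocates one full cluster."""
    n_requests = total_written // request_size
    return n_requests * cluster_size

MiB = 1024 * 1024
GiB = 1024 * MiB
# 100 MB of random 4 KB writes with 2 MB clusters can allocate up to 50 GB;
# the measured 29 GB is lower because some requests share a cluster:
print(worst_case_growth(100 * MiB, 4 * 1024, 2 * MiB) // GiB)  # 50
```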
Decreasing the cluster size
In summary: increasing the cluster size...
...results in less performance due to the additional I/O
needed for copy-on-write.
...produces larger images and duplicate data.
Then let’s just decrease the cluster size, right?
Not so easy: smaller clusters mean more metadata.
Problem 3: smaller clusters mean more metadata
Apart from the guest data itself, qcow2 images store some
important metadata:
Cluster mapping (L1 and L2 tables).
Reference counts.
If we have smaller clusters we’ll end up having more of
them, and this means additional metadata.
L1 and L2 tables
The L1 and L2 tables map guest addresses as seen by the VM
into host addresses in the qcow2 file.
[Diagram: the L1 table pointing to L2 tables, which point to data clusters]
The L1 table
There is only one L1 table per image (per snapshot,
actually).
The L1 table has a variable size but it’s usually small.
Example: 16KB of data for a 1TB image (using the default
settings).
It is stored contiguously in the image file.
QEMU keeps it in memory all the time.
64-bit entries: each contains a pointer to an L2 table.
L2 tables
There are multiple L2 tables and they are allocated on
demand as the image grows.
Each table is exactly one cluster in size.
64-bit entries: each contains a pointer to a data cluster.
If we reduce the cluster size by half we need twice as
many L2 entries.
Graphically:
[Diagram: an L2 table pointing to its data clusters]
L2 metadata size
This is the maximum amount of L2 metadata needed for an
image with a virtual size of 1 TB.
Cluster size Max. L2 metadata
8 KB 1 GB
16 KB 512 MB
32 KB 256 MB
64 KB 128 MB
128 KB 64 MB
256 KB 32 MB
512 KB 16 MB
1 MB 8 MB
2 MB 4 MB
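The numbers in this table follow from one 8-byte L2 entry per data cluster; a quick sketch of the arithmetic (illustrative Python, function name is mine):

```python
def max_l2_metadata(virtual_size: int, cluster_size: int) -> int:
    """Worst-case total size of all L2 tables for a fully allocated image."""
    n_clusters = virtual_size // cluster_size
    return n_clusters * 8  # one 64-bit entry per data cluster

TiB = 1024 ** 4
MiB = 1024 ** 2
# 1 TB image, 64 KB clusters -> 128 MB of L2 metadata, matching the table:
print(max_l2_metadata(TiB, 64 * 1024) // MiB)  # 128
```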
Accessing L2 metadata
Each time we need to access a data cluster (read or write)
we need to go to its L2 table to get its location.
This is one additional I/O operation per request: a severe
impact on performance.
We can mitigate that by keeping the L2 tables in RAM.
QEMU has an L2 cache for that purpose.
Example: random 4K reads on a 40GB image:
L2 cache size Average IOPS
1 MB 8068
2 MB 10606
5 MB 41187
Again, reducing the cluster size by half implies:
Twice as much L2 metadata.
Twice as much RAM for the L2 cache.
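Those cache sizes are not arbitrary: covering the whole image requires 8 bytes of L2 cache per cluster. A sketch of the formula (as documented for the qcow2 L2 cache; the function name is mine):

```python
def l2_cache_full_coverage(virtual_size: int, cluster_size: int) -> int:
    """L2 cache size (bytes) needed to keep every L2 entry in RAM."""
    return virtual_size * 8 // cluster_size

GiB = 1024 ** 3
MiB = 1024 ** 2
# The 40 GB test image with 64 KB clusters needs exactly the 5 MB that
# reaches peak IOPS in the table above:
print(l2_cache_full_coverage(40 * GiB, 64 * 1024) // MiB)  # 5
```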
Reference counts
Each cluster in a qcow2 image has a reference count (all
types, not just data clusters).
They are stored in a two-level structure called reference
table and reference blocks. Like L2 tables, the size of a
reference block is also one cluster.
Allocating clusters has the additional overhead of
updating their reference counts.
With smaller clusters we need to allocate more of them.
The overhead of having to allocate clusters
Overall, smaller clusters are faster to fill with data, but if they
get too small the overhead of the allocation process exceeds the
benefits.
Cluster size Write IOPS
512 KB 364 IOPS
256 KB 577 IOPS
128 KB 869 IOPS
64 KB 1634 IOPS
32 KB 2557 IOPS
16 KB 3600 IOPS
8 KB 758 IOPS
4 KB 97 IOPS
2 KB 77 IOPS
1 KB 62 IOPS
The situation so far
We cannot have clusters that are too big, because they
waste more space and increase the amount of I/O needed
for allocating clusters.
We cannot have clusters that are too small, because they
increase the amount of metadata, which has a negative
impact on performance and/or memory usage.
This is a direct consequence of the design of the qcow2
format.
Subcluster allocation
I’m presenting a mixed approach to mitigate this problem:
subcluster allocation.
In short:
We have big clusters in order to reduce the amount of
metadata in the image.
Each one of the clusters is divided into 32 subclusters that
can be allocated separately. This means faster allocations
and reduced disk usage.
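In other words, the unit of allocation shrinks by a factor of 32 while the amount of L2 metadata stays the same. A tiny illustrative sketch:

```python
SUBCLUSTERS_PER_CLUSTER = 32  # fixed by the extended L2 entry format

def subcluster_size(cluster_size: int) -> int:
    """Size of the new unit of allocation for a given cluster size."""
    return cluster_size // SUBCLUSTERS_PER_CLUSTER

# 128 KB clusters give 4 KB subclusters - matching a typical guest write:
print(subcluster_size(128 * 1024))  # 4096
```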
Subcluster allocation: what it looks like
[Diagram: a standard L2 table with entries and their data clusters]
Subcluster allocation: what it looks like
[Diagram: an extended L2 table where each entry tracks the subclusters
of its data cluster]
L2 tables in detail
Each L2 table contains a number of entries that look like
this:
[Diagram: a 64-bit L2 entry (bits 0-63) holding the cluster offset]
Each cluster has one of these states:
Unallocated.
Allocated (normal or compressed).
All zeroes.
Now we also need to store information for each subcluster.
Extended L2 entries
We are adding extended L2 entries, which contain a 64-bit
bitmap indicating the status of each subcluster.
[Diagram: a 128-bit extended L2 entry: the cluster offset in bits 64-127
and the subcluster allocation bitmap in bits 0-63]
Each individual subcluster can be allocated, unallocated or
“all zeroes”.
Compressed clusters don’t have subclusters and work the
same as before.
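A sketch of how such a bitmap could be decoded (illustrative Python; the split assumed here, low 32 bits for "allocated" and high 32 bits for "all zeroes", follows the extended L2 entry layout in the qcow2 specification):

```python
def subcluster_state(bitmap: int, index: int) -> str:
    """State of subcluster `index` (0-31) in a 64-bit allocation bitmap."""
    allocated = (bitmap >> index) & 1
    all_zeroes = (bitmap >> (32 + index)) & 1
    if allocated:
        return "allocated"
    return "all zeroes" if all_zeroes else "unallocated"

# Subcluster 0 has data, subcluster 1 reads as zeroes, subcluster 2 is untouched:
bitmap = (1 << 0) | (1 << (32 + 1))
print([subcluster_state(bitmap, i) for i in range(3)])
# ['allocated', 'all zeroes', 'unallocated']
```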
Two use cases for subcluster allocation
Case 1: Having very large clusters in order to minimize the
amount of metadata while reducing the amount of
duplicated data and I/O.
Case 2: Having smaller clusters to minimize the amount of
copy-on-write and get the maximum I/O performance.
Results 1: less copy-on-write means faster I/O
Having less copy-on-write improves the allocation
performance.
If subcluster size = request size, no copy-on-write is
needed!
Average IOPS of random 4KB writes:
With a backing file
Cluster size Without subclusters With subclusters
16 KB 3600 IOPS 8124 IOPS
32 KB 2557 IOPS 11575 IOPS
64 KB 1634 IOPS 13219 IOPS
128 KB 869 IOPS 12076 IOPS
256 KB 577 IOPS 9739 IOPS
512 KB 364 IOPS 4708 IOPS
1 MB 216 IOPS 2542 IOPS
2 MB 125 IOPS 1591 IOPS
Without a backing file∗
Cluster size Without subclusters With subclusters
16 KB 5859 IOPS 8063 IOPS
32 KB 5674 IOPS 11107 IOPS
64 KB 2527 IOPS 12731 IOPS
128 KB 1576 IOPS 11808 IOPS
256 KB 976 IOPS 9195 IOPS
512 KB 510 IOPS 7079 IOPS
1 MB 448 IOPS 3306 IOPS
2 MB 262 IOPS 2269 IOPS
(*): Worst case scenario. QEMU first tries fallocate() which is much
faster than writing zeroes
Results 2: less copy-on-write means less used space
Repeating the earlier test: how much does an image grow
after...
...100 MB worth of random 4KB write requests?
...creating a filesystem on an empty 1 TB image?
Cluster size random writes mkfs.ext4
Raw file 101 MB 1.1 GB
64 KB 111 MB (vs 158 MB) 1.1 GB
512 KB 404 MB (vs 11 GB) 1.1 GB (vs 1.3 GB)
2 MB 1.6 GB (vs 29 GB) 1.1 GB (vs 2.1 GB)
Results 3: larger clusters mean less metadata
Extended L2 entries are twice as large but each one of them
references 32 subclusters.
As a result we have 16 times less metadata for the same
unit of allocation.
This table compares the amount of L2 metadata for a 1TB
image.
Standard L2 entries
Cluster size Max. L2 size
4 KB 2 GB
8 KB 1 GB
16 KB 512 MB
32 KB 256 MB
64 KB 128 MB
Extended L2 entries
Subcluster size Max. L2 size
4 KB 128 MB
8 KB 64 MB
16 KB 32 MB
32 KB 16 MB
64 KB 8 MB
Caveats
This feature is useful during allocation. Writing to already
allocated areas won’t be faster.
Don’t use it with compressed images.
Extended L2 entries are twice as big but offer no benefits
for compressed clusters.
If your image does not have a backing file maybe you
won’t see any speed-up!
Copy-on-write of empty clusters is already fast if the
filesystem supports it.
However, you still get the other advantages of using
subclusters.
You won’t be able to read the image with older versions of
QEMU (and don’t expect backports!).
Implementation status
Not available in any QEMU release yet.
Expected in QEMU 5.2 around December.
The implementation is complete; it is already in the
repository and ready to be tested.
Simply build a recent QEMU from git and create a qcow2
image with -o extended_l2=on.
Note: the default cluster size is still 64 KB. You probably
want to create an image with cluster_size=128k or
more!
Feedback, bug reports, etc., are very much appreciated!
qemu-block@nongnu.org
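Put together, trying the feature could look like this (a sketch using the options named above; it needs a qemu-img build recent enough to know extended_l2):

```shell
# Create a 1 TB qcow2 image with extended L2 entries and 128 KB clusters
# (i.e. 4 KB subclusters):
qemu-img create -f qcow2 \
    -o extended_l2=on,cluster_size=128k \
    test.qcow2 1T

# "Format specific information" in the output should report the
# extended l2 setting:
qemu-img info test.qcow2
```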
Acknowledgments
This work was sponsored by
More Related Content

PPTX
Presentation v mware virtual san 6.0
PDF
Dell Technologies - The Complete ISG Hardware Portfolio
PPTX
Virtual SAN 6.2, hyper-converged infrastructure software
PPTX
9월 웨비나 - AWS 클라우드 보안의 이해 (양승도 솔루션즈 아키텍트)
PPTX
ファイルサーバを高速バックアップ!Veeam NASバックアップのここがスゴイ!
PPTX
re:Invent 2021のS3アップデート紹介 & Glacier Instant Retrieval試してみた
PDF
02B_AWS IoT Core for LoRaWANのご紹介
PDF
VMware Cloud on AWSネットワーク詳細解説
Presentation v mware virtual san 6.0
Dell Technologies - The Complete ISG Hardware Portfolio
Virtual SAN 6.2, hyper-converged infrastructure software
9월 웨비나 - AWS 클라우드 보안의 이해 (양승도 솔루션즈 아키텍트)
ファイルサーバを高速バックアップ!Veeam NASバックアップのここがスゴイ!
re:Invent 2021のS3アップデート紹介 & Glacier Instant Retrieval試してみた
02B_AWS IoT Core for LoRaWANのご紹介
VMware Cloud on AWSネットワーク詳細解説

What's hot (20)

PPTX
Rhel cluster basics 2
PDF
一歩先行く Azure Computing シリーズ(全3回) 第2回 Azure VM どれを選ぶの? Azure VM 集中講座
PDF
L2延伸を利用したクラウド移行とクラウド活用術
PDF
BlueStore, A New Storage Backend for Ceph, One Year In
PDF
Amazon s3へのデータ転送における課題とその対処法を一挙紹介
PDF
Ceph as software define storage
PPTX
Veeam Solutions for SMB_2022.pptx
PDF
Microsoft Azure Storage 概要
PDF
AWS Black Belt Online Seminar Amazon EC2
PDF
20180704 AWS Black Belt Online Seminar Amazon Elastic File System (Amazon EFS...
PPTX
Oracleからamazon auroraへの移行にむけて
PDF
V sphere 7 update 3 へのアップグレードについて
PDF
Amazon Pinpoint × グロースハック活用事例集
PDF
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介
PDF
[AWS Builders] AWS 서버리스 서비스를 활용한 웹 애플리케이션 구축 및 배포 방법 - 정창호, AWS 솔루션즈 아키텍트
PDF
Black Belt Online Seminar Amazon CloudWatch
DOC
AP 10 dll quarter 1 week 6 kontemporaryung isyu july 10 to 14
PDF
Amazon VPCトレーニング-VPCの説明
PPTX
DeNA の AWS アカウント管理とセキュリティ監査自動化
PDF
デバイス WebAPIによるスマートフォン周辺デバイスの活用
Rhel cluster basics 2
一歩先行く Azure Computing シリーズ(全3回) 第2回 Azure VM どれを選ぶの? Azure VM 集中講座
L2延伸を利用したクラウド移行とクラウド活用術
BlueStore, A New Storage Backend for Ceph, One Year In
Amazon s3へのデータ転送における課題とその対処法を一挙紹介
Ceph as software define storage
Veeam Solutions for SMB_2022.pptx
Microsoft Azure Storage 概要
AWS Black Belt Online Seminar Amazon EC2
20180704 AWS Black Belt Online Seminar Amazon Elastic File System (Amazon EFS...
Oracleからamazon auroraへの移行にむけて
V sphere 7 update 3 へのアップグレードについて
Amazon Pinpoint × グロースハック活用事例集
SaaS テナント毎のコストを把握するための「AWS Application Cost Profiler」のご紹介
[AWS Builders] AWS 서버리스 서비스를 활용한 웹 애플리케이션 구축 및 배포 방법 - 정창호, AWS 솔루션즈 아키텍트
Black Belt Online Seminar Amazon CloudWatch
AP 10 dll quarter 1 week 6 kontemporaryung isyu july 10 to 14
Amazon VPCトレーニング-VPCの説明
DeNA の AWS アカウント管理とセキュリティ監査自動化
デバイス WebAPIによるスマートフォン周辺デバイスの活用
Ad

Similar to Faster and Smaller qcow2 Files with Subcluster-based Allocation (20)

PDF
Improving the Performance of the qcow2 Format (KVM Forum 2017)
PPT
Threading Successes 06 Allegorithmic
PDF
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
PDF
CLFS 2010
PDF
Memory, Big Data, NoSQL and Virtualization
PPTX
SDC20 ScaleFlux.pptx
PPTX
Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!
PDF
Road show 2015 triangle meetup
PPTX
Accelerating hbase with nvme and bucket cache
PDF
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
PPTX
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
PDF
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
PPT
Design and implementation of a reliable and cost-effective cloud computing in...
PDF
Measuring Database Performance on Bare Metal AWS Instances
PPT
04 cache memory
PPTX
Storage and performance- Batch processing, Whiptail
PDF
Hot sec10 slide-suzaki
PDF
Optimizing MongoDB: Lessons Learned at Localytics
PDF
Designs, Lessons and Advice from Building Large Distributed Systems
PDF
Accelerating HBase with NVMe and Bucket Cache
Improving the Performance of the qcow2 Format (KVM Forum 2017)
Threading Successes 06 Allegorithmic
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
CLFS 2010
Memory, Big Data, NoSQL and Virtualization
SDC20 ScaleFlux.pptx
Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!
Road show 2015 triangle meetup
Accelerating hbase with nvme and bucket cache
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Design and implementation of a reliable and cost-effective cloud computing in...
Measuring Database Performance on Bare Metal AWS Instances
04 cache memory
Storage and performance- Batch processing, Whiptail
Hot sec10 slide-suzaki
Optimizing MongoDB: Lessons Learned at Localytics
Designs, Lessons and Advice from Building Large Distributed Systems
Accelerating HBase with NVMe and Bucket Cache
Ad

More from Igalia (20)

PDF
Life of a Kernel Bug Fix
PDF
Unlocking the Full Potential of WPE to Build a Successful Embedded Product
PDF
Advancing WebDriver BiDi support in WebKit
PDF
Jumping Over the Garden Wall - WPE WebKit on Android
PDF
Collective Funding, Governance and Prioritiation of Browser Engine Projects
PDF
Don't let your motivation go, save time with kworkflow
PDF
Solving the world’s (localization) problems
PDF
The Whippet Embeddable Garbage Collection Library
PDF
Nobody asks "How is JavaScript?"
PDF
Getting more juice out from your Raspberry Pi GPU
PDF
WebRTC support in WebKitGTK and WPEWebKit with GStreamer: Status update
PDF
Demystifying Temporal: A Deep Dive into JavaScript New Temporal API
PDF
CSS :has() Unlimited Power
PDF
Device-Generated Commands in Vulkan
PDF
Current state of Lavapipe: Mesa's software renderer for Vulkan
PDF
Vulkan Video is Open: Application showcase
PDF
Scheme on WebAssembly: It is happening!
PDF
EBC - A new backend compiler for etnaviv
PDF
RISC-V LLVM State of the Union
PDF
Device-Generated Commands in Vulkan
Life of a Kernel Bug Fix
Unlocking the Full Potential of WPE to Build a Successful Embedded Product
Advancing WebDriver BiDi support in WebKit
Jumping Over the Garden Wall - WPE WebKit on Android
Collective Funding, Governance and Prioritiation of Browser Engine Projects
Don't let your motivation go, save time with kworkflow
Solving the world’s (localization) problems
The Whippet Embeddable Garbage Collection Library
Nobody asks "How is JavaScript?"
Getting more juice out from your Raspberry Pi GPU
WebRTC support in WebKitGTK and WPEWebKit with GStreamer: Status update
Demystifying Temporal: A Deep Dive into JavaScript New Temporal API
CSS :has() Unlimited Power
Device-Generated Commands in Vulkan
Current state of Lavapipe: Mesa's software renderer for Vulkan
Vulkan Video is Open: Application showcase
Scheme on WebAssembly: It is happening!
EBC - A new backend compiler for etnaviv
RISC-V LLVM State of the Union
Device-Generated Commands in Vulkan

Recently uploaded (20)

PPTX
artificial intelligence overview of it and more
PDF
WebRTC in SignalWire - troubleshooting media negotiation
PDF
Sims 4 Historia para lo sims 4 para jugar
PPTX
Introuction about ICD -10 and ICD-11 PPT.pptx
PPTX
Introduction to Information and Communication Technology
PPTX
SAP Ariba Sourcing PPT for learning material
PDF
Slides PDF The World Game (s) Eco Economic Epochs.pdf
PPTX
Slides PPTX World Game (s) Eco Economic Epochs.pptx
PDF
SASE Traffic Flow - ZTNA Connector-1.pdf
PDF
The Internet -By the Numbers, Sri Lanka Edition
PDF
Best Practices for Testing and Debugging Shopify Third-Party API Integrations...
PPTX
June-4-Sermon-Powerpoint.pptx USE THIS FOR YOUR MOTIVATION
DOCX
Unit-3 cyber security network security of internet system
PPTX
presentation_pfe-universite-molay-seltan.pptx
PDF
FINAL CALL-6th International Conference on Networks & IOT (NeTIOT 2025)
PPTX
Internet___Basics___Styled_ presentation
PPTX
E -tech empowerment technologies PowerPoint
PPTX
Introduction about ICD -10 and ICD11 on 5.8.25.pptx
PDF
Unit-1 introduction to cyber security discuss about how to secure a system
PDF
Introduction to the IoT system, how the IoT system works
artificial intelligence overview of it and more
WebRTC in SignalWire - troubleshooting media negotiation
Sims 4 Historia para lo sims 4 para jugar
Introuction about ICD -10 and ICD-11 PPT.pptx
Introduction to Information and Communication Technology
SAP Ariba Sourcing PPT for learning material
Slides PDF The World Game (s) Eco Economic Epochs.pdf
Slides PPTX World Game (s) Eco Economic Epochs.pptx
SASE Traffic Flow - ZTNA Connector-1.pdf
The Internet -By the Numbers, Sri Lanka Edition
Best Practices for Testing and Debugging Shopify Third-Party API Integrations...
June-4-Sermon-Powerpoint.pptx USE THIS FOR YOUR MOTIVATION
Unit-3 cyber security network security of internet system
presentation_pfe-universite-molay-seltan.pptx
FINAL CALL-6th International Conference on Networks & IOT (NeTIOT 2025)
Internet___Basics___Styled_ presentation
E -tech empowerment technologies PowerPoint
Introduction about ICD -10 and ICD11 on 5.8.25.pptx
Unit-1 introduction to cyber security discuss about how to secure a system
Introduction to the IoT system, how the IoT system works

Faster and Smaller qcow2 Files with Subcluster-based Allocation

  • 1. Subcluster allocation for qcow2 images KVM Forum 2020 Alberto Garcia <berto@igalia.com>
  • 2. The qcow2 file format qcow2: native file format for storing disk images in QEMU. Many features: grows on demand, backing files, internal snapshots, compression, encryption... Subcluster allocation for qcow2 images KVM Forum 2020
  • 3. The qcow2 file format qcow2: native file format for storing disk images in QEMU. Many features: grows on demand, backing files, internal snapshots, compression, encryption... But why is it sometimes slower than a raw file? Subcluster allocation for qcow2 images KVM Forum 2020
  • 4. The qcow2 file format qcow2: native file format for storing disk images in QEMU. Many features: grows on demand, backing files, internal snapshots, compression, encryption... But why is it sometimes slower than a raw file? Because it is not correctly configured. Subcluster allocation for qcow2 images KVM Forum 2020
  • 5. The qcow2 file format qcow2: native file format for storing disk images in QEMU. Many features: grows on demand, backing files, internal snapshots, compression, encryption... But why is it sometimes slower than a raw file? Because it is not correctly configured. Because the qcow2 driver in QEMU needs to be improved. Subcluster allocation for qcow2 images KVM Forum 2020
  • 6. The qcow2 file format qcow2: native file format for storing disk images in QEMU. Many features: grows on demand, backing files, internal snapshots, compression, encryption... But why is it sometimes slower than a raw file? Because it is not correctly configured. Because the qcow2 driver in QEMU needs to be improved. Check my presentation at KVM Forum 2017! Subcluster allocation for qcow2 images KVM Forum 2020
  • 7. The qcow2 file format qcow2: native file format for storing disk images in QEMU. Many features: grows on demand, backing files, internal snapshots, compression, encryption... But why is it sometimes slower than a raw file? Because it is not correctly configured. Because the qcow2 driver in QEMU needs to be improved. Check my presentation at KVM Forum 2017! Because of the very design of the qcow2 file format. Today we are going to focus on that. Subcluster allocation for qcow2 images KVM Forum 2020
  • 8. Structure of a qcow2 file A qcow2 file is divided into clusters of equal size (min: 512 bytes - default: 64 KB - max: 2 MB) QCOW2 Header Refcount table Refcount block L1 table L2 table Data cluster L2 table Data cluster Data cluster Data cluster Data cluster Subcluster allocation for qcow2 images KVM Forum 2020
  • 9. Structure of a qcow2 file The virtual disk as seen by the VM is divided into guest clusters of the same size QCOW2 Header Refcount table Refcount block L1 table L2 table Data cluster L2 table Data cluster Data cluster Data cluster Data cluster GUEST HOST Subcluster allocation for qcow2 images KVM Forum 2020
  • 10. Problem 1: copy-on-write means more I/O Active Backing A data cluster is the smallest unit of allocation: writing to a new data cluster means filling it completely with data. If the guest write request is small, the rest must be filled with data from the backing file, or with zeroes (if there is no backing file). Problem: QEMU needs to perform additional I/O to copy the rest of the data. Subcluster allocation for qcow2 images KVM Forum 2020
  • 11. Problem 1: copy-on-write means more I/O Example: random 4KB write requests to an empty 40GB image (SSD backend) Cluster size With a backing file Without a backing file∗ 16 KB 3600 IOPS 5859 IOPS 32 KB 2557 IOPS 5674 IOPS 64 KB 1634 IOPS 2527 IOPS 128 KB 869 IOPS 1576 IOPS 256 KB 577 IOPS 976 IOPS 512 KB 364 IOPS 510 IOPS (*): Worst case scenario. QEMU first tries fallocate() which is much faster than writing zeroes Subcluster allocation for qcow2 images KVM Forum 2020
  • 12. Problem 2: copy-on-write means more used space The larger the cluster size, the more the image grows with each allocation. Example: how much does an image grow after... ...100 MB worth of random 4KB write requests? ...creating a filesystem on an empty 1 TB image? Cluster size random writes mkfs.ext4 Raw file 101 MB 1.1 GB 4 KB 158 MB 1.1 GB 64 KB 1.6 GB 1.1 GB 512 KB 11 GB 1.3 GB 2 MB 29 GB 2.1 GB The actual size difference in real-world scenarios depends a lot on the usage. Subcluster allocation for qcow2 images KVM Forum 2020
  • 13. Decreasing the cluster size In summary: increasing the cluster size... ...results in less performance due to the additional I/O needed for copy-on-write. ...produces larger images and duplicate data. Subcluster allocation for qcow2 images KVM Forum 2020
  • 14. Decreasing the cluster size In summary: increasing the cluster size... ...results in less performance due to the additional I/O needed for copy-on-write. ...produces larger images and duplicate data. Then let’s just decrease the cluster size, right? Subcluster allocation for qcow2 images KVM Forum 2020
  • 15. Decreasing the cluster size In summary: increasing the cluster size... ...results in less performance due to the additional I/O needed for copy-on-write. ...produces larger images and duplicate data. Then let’s just decrease the cluster size, right? Not so easy: smaller clusters means more metadata Subcluster allocation for qcow2 images KVM Forum 2020
  • 16. Problem 3: Smaller clusters means more metadata Apart from the guest data itself, qcow2 images store some important metadata: Cluster mapping (L1 and L2 tables). Reference counts. If we have smaller clusters we’ll end up having more of them, and this means additional metadata. Subcluster allocation for qcow2 images KVM Forum 2020
  • 17. L1 and L2 tables The L1 and L2 tables map guest addresses as seen by the VM into host addresses in the qcow2 file L1 Table L2 Tables Data clusters Subcluster allocation for qcow2 images KVM Forum 2020
  • 18. The L1 table There is only one L1 table per image (per snapshot, actually). The L1 table has a variable size but it’s usually small. Example: 16KB of data for a 1TB image (using the default settings). It is stored contiguous in the image file. QEMU keeps it in memory all the time. 64-bit entries: each contains a pointer to an L2 table. Subcluster allocation for qcow2 images KVM Forum 2020
  • 19. L2 tables There are multiple L2 tables and they are allocated on demand as the image grows. Each table is exactly one cluster in size. 64-bit entries: each contains a pointer to a data cluster. If we reducing the cluster size by half we need twice as many L2 entries. Graphically: L2 Table Data clusters Subcluster allocation for qcow2 images KVM Forum 2020
  • 20. L2 tables There are multiple L2 tables and they are allocated on demand as the image grows. Each table is exactly one cluster in size. 64-bit entries: each contains a pointer to a data cluster. If we reducing the cluster size by half we need twice as many L2 entries. Graphically: L2 Table Data clusters Subcluster allocation for qcow2 images KVM Forum 2020
  • 21. L2 metadata size This is the maximum amount of L2 metadata needed for an image with a virtual size of 1 TB. Cluster size Max. L2 metadata 8 K B 1 GB 16 KB 512 MB 32 KB 256 MB 64 KB 128 MB 128 KB 64 MB 256 KB 32 MB 512 KB 16 MB 1 MB 8 MB 2 MB 4 MB Subcluster allocation for qcow2 images KVM Forum 2020
  • 22. Accessing L2 metadata Each time we need to access a data cluster (read or write) we need to go to its L2 table to get its location. This is one additional I/O operation per request: severe impact in performance. We can mitigate that by keeping the L2 tables in RAM. QEMU has an L2 cache for that purpose. Example: random 4K reads on a 40GB image: L2 cache size Average IOPS 1 MB 8068 2 MB 10606 5 MB 41187 Again, reducing the cluster size by half implies: Twice as much L2 metadata. Twice as much RAM for the L2 cache. Subcluster allocation for qcow2 images KVM Forum 2020
  • 23. Reference counts Each cluster in a qcow2 image has a reference count (all types, not just data clusters). They are stored in a two-level structure called reference table and reference blocks. Like L2 tables, the size of a reference block is also one cluster. Allocating clusters has the additional overhead of updating their reference counts. With a smaller clusters we need to allocate more of them. Subcluster allocation for qcow2 images KVM Forum 2020
  • 24. The overhead of having to allocate clusters Overall, smaller clusters are faster to fill with data, but if they get too small the overhead of the allocation process exceeds the benefits. Cluster size Write IOPS 512 KB 364 IOPS 256 KB 577 IOPS 128 KB 869 IOPS 64 KB 1634 IOPS 32 KB 2557 IOPS 16 KB 3600 IOPS 8 KB 758 IOPS 4 KB 97 IOPS 2 KB 77 IOPS 1 KB 62 IOPS Subcluster allocation for qcow2 images KVM Forum 2020
  • 25. The situation so far We cannot have too big clusters because they waste more space and increase the amount of I/O needed for allocating clusters. We cannot have too small clusters because they increase the amount of metadata, which has a negative impact in performance and/or memory usage. This is a direct consequence of the design of the qcow2 format. Subcluster allocation for qcow2 images KVM Forum 2020
  • 26. Subcluster allocation

    I’m presenting a mixed approach to mitigate this problem: subcluster
    allocation. In short:
      We have big clusters in order to reduce the amount of metadata in
      the image.
      Each one of the clusters is divided into 32 subclusters that can be
      allocated separately.

    This means faster allocations and reduced disk usage.
  • 27. Subcluster allocation: what it looks like

    [Diagram: a standard L2 table, with its entries pointing to their
    data clusters]
  • 28. Subcluster allocation: what it looks like

    [Diagram: an extended L2 table with subcluster allocation, entries
    pointing to partially allocated data clusters]
  • 29. L2 tables in detail

    Each L2 table contains a number of entries. A standard entry is a
    64-bit word (bits 0-63) holding the cluster offset plus some flags.

    Each cluster has one of these states:
      Unallocated.
      Allocated (normal or compressed).
      All zeroes.

    Now we also need to store information for each subcluster.
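Assuming the bit layout from the qcow2 specification (host cluster offset in bits 9-55, compressed flag in bit 62, “all zeroes” flag in bit 0), the three states can be read off a standard entry like this. The constant and function names are illustrative, not QEMU’s:

```python
OFLAG_COMPRESSED = 1 << 62          # cluster stored compressed
OFLAG_ZERO       = 1 << 0           # cluster reads as all zeroes
L2E_OFFSET_MASK  = 0x00fffffffffffe00  # bits 9-55: host cluster offset

def classify_l2_entry(entry):
    """Map a standard 64-bit L2 entry to one of the cluster states."""
    if entry & OFLAG_COMPRESSED:
        return "compressed"
    if entry & OFLAG_ZERO:
        return "all zeroes"
    if entry & L2E_OFFSET_MASK:
        return "allocated"
    return "unallocated"

assert classify_l2_entry(0) == "unallocated"
assert classify_l2_entry(0x10000) == "allocated"  # offset bits set
```

There is no spare room in these 64 bits for per-subcluster state, which is why the next slide adds a second word.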
  • 30. Extended L2 entries

    We are adding extended L2 entries: 128 bits each, combining the
    cluster offset word with a 64-bit bitmap indicating the status of
    each subcluster.

    Each individual subcluster can be allocated, unallocated or
    “all zeroes”. Compressed clusters don’t have subclusters and work the
    same as before.
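Per the extended-L2 layout in the qcow2 specification, the low 32 bits of the bitmap mark subclusters that have data and the high 32 bits mark subclusters that read as zeroes. A sketch of how one subcluster’s status is derived (function name is illustrative):

```python
def subcluster_status(bitmap, i):
    """Status of subcluster i (0-31) from the 64-bit extended L2 bitmap."""
    alloc = (bitmap >> i) & 1         # bits 0-31: subcluster has data
    zero  = (bitmap >> (32 + i)) & 1  # bits 32-63: subcluster reads as zero
    if alloc:
        return "allocated"
    if zero:
        return "all zeroes"
    return "unallocated"              # read falls through to the backing file

# Subcluster 3 written, subcluster 7 explicitly zeroed, the rest untouched:
bitmap = (1 << 3) | (1 << (32 + 7))
assert subcluster_status(bitmap, 3) == "allocated"
assert subcluster_status(bitmap, 7) == "all zeroes"
assert subcluster_status(bitmap, 0) == "unallocated"
```

A write that covers exactly one subcluster only has to set its bit, with no copy-on-write of the surrounding cluster.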
  • 31. Two use cases for subcluster allocation

    Case 1: having very large clusters in order to minimize the amount of
    metadata, while reducing the amount of duplicated data and I/O.

    Case 2: having smaller clusters to minimize the amount of
    copy-on-write and get the maximum I/O performance.
  • 32. Results 1: less copy-on-write means faster I/O

    Having less copy-on-write improves the allocation performance. If the
    subcluster size equals the request size, no copy-on-write is needed!

    Average IOPS of random 4 KB writes, with a backing file:

      Cluster size   Without subclusters   With subclusters
      16 KB          3600 IOPS             8124 IOPS
      32 KB          2557 IOPS             11575 IOPS
      64 KB          1634 IOPS             13219 IOPS
      128 KB         869 IOPS              12076 IOPS
      256 KB         577 IOPS              9739 IOPS
      512 KB         364 IOPS              4708 IOPS
      1 MB           216 IOPS              2542 IOPS
      2 MB           125 IOPS              1591 IOPS
  • 33. Results 1: less copy-on-write means faster I/O

    Average IOPS of random 4 KB writes, without a backing file (*):

      Cluster size   Without subclusters   With subclusters
      16 KB          5859 IOPS             8063 IOPS
      32 KB          5674 IOPS             11107 IOPS
      64 KB          2527 IOPS             12731 IOPS
      128 KB         1576 IOPS             11808 IOPS
      256 KB         976 IOPS              9195 IOPS
      512 KB         510 IOPS              7079 IOPS
      1 MB           448 IOPS              3306 IOPS
      2 MB           262 IOPS              2269 IOPS

    (*): Worst-case scenario. QEMU first tries fallocate(), which is much
    faster than writing zeroes.
  • 34. Results 2: less copy-on-write means less used space

    Repeating the earlier test: how much does an image grow after...
      ...100 MB worth of random 4 KB write requests?
      ...creating a filesystem on an empty 1 TB image?

    (Values in parentheses: the same image without subclusters.)

      Cluster size   Random writes        mkfs.ext4
      Raw file       101 MB               1.1 GB
      64 KB          111 MB (vs 158 MB)   1.1 GB
      512 KB         404 MB (vs 11 GB)    1.1 GB (vs 1.3 GB)
      2 MB           1.6 GB (vs 29 GB)    1.1 GB (vs 2.1 GB)
  • 35. Results 3: larger clusters mean less metadata

    Extended L2 entries are twice as large, but each one of them
    references 32 subclusters. As a result we have 16 times less metadata
    for the same unit of allocation. This table compares the amount of L2
    metadata for a 1 TB image.

    Standard L2 entries:
      Cluster size   Max. L2 size
      4 KB           2 GB
      8 KB           1 GB
      16 KB          512 MB
      32 KB          256 MB
      64 KB          128 MB

    Extended L2 entries:
      Subcluster size   Max. L2 size
      4 KB              128 MB
      8 KB              64 MB
      16 KB             32 MB
      32 KB             16 MB
      64 KB             8 MB
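The 16x factor is just arithmetic: a 16-byte extended entry shared by 32 subclusters costs 0.5 bytes per allocation unit, versus 8 bytes per unit with standard entries. A sketch reproducing both table columns (function names are illustrative):

```python
def max_l2_standard(virtual_size, cluster_size):
    # One 8-byte entry per cluster; allocation unit = cluster.
    return (virtual_size // cluster_size) * 8

def max_l2_extended(virtual_size, subcluster_size):
    # One 16-byte entry per cluster of 32 subclusters;
    # allocation unit = subcluster.
    clusters = virtual_size // (32 * subcluster_size)
    return clusters * 16

TiB, GiB, MiB, KiB = 1024**4, 1024**3, 1024**2, 1024

assert max_l2_standard(TiB, 4 * KiB) == 2 * GiB    # first row, left table
assert max_l2_extended(TiB, 4 * KiB) == 128 * MiB  # first row, right table
# Same 4 KB allocation unit, 16 times less metadata:
assert max_l2_standard(TiB, 4 * KiB) // max_l2_extended(TiB, 4 * KiB) == 16
```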
  • 36. Caveats

    This feature is useful during allocation. Writing to already
    allocated areas won’t be faster.

    Don’t use it with compressed images: extended L2 entries are twice as
    big but offer no benefits for compressed clusters.

    If your image does not have a backing file, you may not see any
    speed-up: copy-on-write of empty clusters is already fast if the
    filesystem supports it. However, you still get the other advantages
    of using subclusters.

    You won’t be able to read the image with older versions of QEMU
    (and don’t expect backports!).
  • 37. Implementation status

    Not available in any QEMU release yet; expected in QEMU 5.2, around
    December. The implementation is complete: it is already in the
    repository and ready to be tested.

    Simply build a recent QEMU from git and create a qcow2 image with
    -o extended_l2=on. Note: the default cluster size is still 64 KB, so
    you probably want to create the image with cluster_size=128k or more!

    Feedback, bug reports, etc., are very much appreciated!
    qemu-block@nongnu.org
  • 38. Acknowledgments

    This work was sponsored by