Using a Hierarchical Data Format v5 (HDF5) file as a Zarr v3 Shard
1. Using an HDF5 file as a Zarr v3 Shard
Mark Kittisopikul, Ph.D.
Software Engineer III
Scientific Computing Software
Janelia Research Campus
Howard Hughes Medical Institute
ESIP, July 23rd, 2025
2. Why combine multiple file formats?
● Avoid data duplication
○ Large datasets could be terabytes, petabytes, or exabytes in scale
○ We cannot afford to have multiple copies
● Make it easier for users to read the data using their favorite APIs
○ Users may have access restricted to one API but still need to access the same data
https://xkcd.com/927
6. Outline
● What is Zarr v3?
● Are Zarr v3 shards similar to HDF5?
● Could a Zarr v3 shard be an HDF5 file?
● Could a Zarr v3 shard be both an HDF5 file and a TIFF file?
7. Zarr v3 is a cloud-optimized, chunk-based, hierarchical array storage specification
● Metadata are stored as JSON zarr.json files for each group and array
● Chunks are stored as individual keys (files) in a key-value store (a filesystem)
○ On a file system, chunks are individual files.
● For small chunks (e.g. 32x32x32) in a large array (e.g. 4096x4096x4096), this could result in millions of files
○ Efficient access requires optimized file systems
https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#
8. A Zarr v3 shard is a codec that subdivides a single chunk into smaller inner chunks
● Sharding allows many inner chunks to exist in a single file.
● Therefore, we can reduce the number of files required. Example:
○ Array size: 4096 x 4096 x 4096
○ Shard size: 1024 x 1024 x 1024
○ Inner chunk size: 32 x 32 x 32
○ Result: 4 x 4 x 4 = 64 shard files instead of 128 x 128 x 128 ≈ 2 million chunk files
● Inner chunks can be individually compressed (see the zarr.json sketch below)
https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/index.html
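A minimal sketch of the zarr.json an array with the sizes above might carry, written from Python. The field names follow the Zarr v3 core and sharding_indexed specifications; the uint16 data type and the gzip inner codec are illustrative assumptions, not details from the slides.

    import json

    # 4096^3 array stored as 1024^3 shards of 32^3 inner chunks. The outer
    # chunk_grid describes the shard; the sharding_indexed codec describes the
    # inner chunks and the trailing CRC32c-checksummed index.
    metadata = {
        "zarr_format": 3,
        "node_type": "array",
        "shape": [4096, 4096, 4096],
        "data_type": "uint16",                     # assumed data type
        "chunk_grid": {"name": "regular",
                       "configuration": {"chunk_shape": [1024, 1024, 1024]}},
        "chunk_key_encoding": {"name": "default",
                               "configuration": {"separator": "/"}},
        "fill_value": 0,
        "codecs": [{
            "name": "sharding_indexed",
            "configuration": {
                "chunk_shape": [32, 32, 32],       # inner chunk shape
                "codecs": [
                    {"name": "bytes", "configuration": {"endian": "little"}},
                    {"name": "gzip", "configuration": {"level": 5}},
                ],
                "index_codecs": [
                    {"name": "bytes", "configuration": {"endian": "little"}},
                    {"name": "crc32c"},
                ],
                "index_location": "end",
            },
        }],
    }

    with open("zarr.json", "w") as f:
        json.dump(metadata, f, indent=2)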
9. A Zarr v3 shard chunk index exists at either the beginning or end of the shard
● The size of the chunk index can be calculated directly from information in the zarr.json file
○ nChunks x 16 bytes + 4 bytes
○ The 4-byte checksum of the chunk index is calculated using CRC32c
○ The index is therefore retrievable with a single HTTP GET request using a byte-range header (see the sketch below)
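A minimal sketch of fetching and verifying a trailing shard index over HTTP, assuming the third-party requests and crc32c packages and a hypothetical shard URL. Each 16-byte entry is a little-endian (offset, nbytes) pair of 64-bit integers, with 2**64 - 1 marking an inner chunk that was never written.

    import struct
    import requests
    import crc32c

    shard_url = "https://example.com/data.zarr/c/0/0"   # hypothetical shard key
    inner_chunks = (1024 // 32) ** 3                   # 32768 inner chunks per shard
    index_nbytes = inner_chunks * 16 + 4               # nChunks x 16 bytes + 4 bytes

    # One ranged GET for the trailing index (index_location = "end").
    resp = requests.get(shard_url, headers={"Range": f"bytes=-{index_nbytes}"})
    index = resp.content

    # Verify the trailing little-endian CRC32c, then unpack (offset, nbytes) pairs.
    payload, checksum = index[:-4], struct.unpack("<I", index[-4:])[0]
    assert crc32c.crc32c(payload) == checksum
    entries = [struct.unpack_from("<QQ", payload, 16 * i)
               for i in range(inner_chunks)]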
10. The Zarr v3 shard index is similar to an HDF5 Fixed Array Data Block, differing by 14 or 18 bytes
[Diagram: byte-level layout of an HDF5 Fixed Array Data Block next to a Zarr v3 shard index. Both use 64-bit per-chunk fields and end in a 32-bit checksum; HDF5 carries an additional 14 bytes and a Jenkins checksum, while Zarr uses CRC32c.]
11. Formatting an HDF5 file as a Zarr v3 shard?
● We need to place the Zarr v3 shard index at the beginning or end of the file
● Options (see the diagram and sketch below):
○ A: Put the Zarr v3 shard index in the HDF5 user block at the beginning of the file
○ B: Put the Zarr v3 shard index into a dataset at the end of the file
○ C: Relocate the HDF5 Fixed Array Data Block to the end of the file
[Diagram of the three layouts. Option A: Zarr v3 shard index in the HDF5 user block, followed by HDF5 metadata and the shared data chunks. Option B: HDF5 metadata and shared data chunks, followed by a second HDF5 dataset holding the Zarr v3 shard index. Option C: HDF5 metadata and shared data chunks, with the HDF5 Fixed Array Data Block itself serving as the Zarr v3 shard index, raising the question of a CRC32c or Jenkins checksum.]
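A minimal sketch of Option A, assuming h5py: reserve a user block when creating the HDF5 file, then write the Zarr v3 shard index into it afterwards. The dataset, sizes, and placeholder index bytes are illustrative; a real index would be built from the actual chunk offsets (see the sketch under slide 21).

    import h5py
    import numpy as np

    index_bytes = b"\x00" * (4 * 16 + 4)   # placeholder index for the 2 x 2 = 4 chunks below

    # HDF5 user blocks must be a power of two and at least 512 bytes.
    userblock_size = 512
    while userblock_size < len(index_bytes):
        userblock_size *= 2

    with h5py.File("shard.hdf5", "w", userblock_size=userblock_size) as f:
        f.create_dataset("data", data=np.zeros((64, 64), dtype="uint16"),
                         chunks=(32, 32))

    # The HDF5 library ignores the user block, so the shard index can sit at
    # offset 0, where a Zarr reader configured with index_location = "start"
    # expects to find it.
    with open("shard.hdf5", "r+b") as f:
        f.write(index_bytes)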
12. Zarr v3 sharding is similar to an HDF5 Virtual Dataset
● Virtual datasets are an HDF5 feature that allows part of a dataset to exist as a dataset in another file (~a Zarr v3 shard)
● A Zarr v3 shard is analogous to a file with a single chunked source dataset
https://support.hdfgroup.org/releases/hdf5/documentation/rfc/HDF5-VDS-requirements-use-cases-2014-12-10.pdf
13. Combined Zarr Array as an HDF5 Virtual Dataset
[Diagram: four shard files stored under the Zarr chunk keys c/0/0, c/0/1, c/1/0, and c/1/1, each containing HDF5 metadata, shared data chunks, and a Zarr v3 shard index. A zarr.json file describes the combined Zarr array, and a zarr.hdf5 file indexes the same four files as an HDF5 Virtual Dataset.]
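A minimal sketch, assuming h5py, of how zarr.hdf5 could stitch the four shard files above into one HDF5 Virtual Dataset. The dataset name "data", the 2048 x 2048 shard shape, and the uint16 dtype are illustrative assumptions.

    import h5py

    shard_shape = (2048, 2048)
    full_shape = (4096, 4096)
    layout = h5py.VirtualLayout(shape=full_shape, dtype="uint16")

    # Map each shard file (stored under its Zarr chunk key) into the layout.
    for i in range(2):
        for j in range(2):
            source = h5py.VirtualSource(f"c/{i}/{j}", "data", shape=shard_shape)
            layout[i * 2048:(i + 1) * 2048, j * 2048:(j + 1) * 2048] = source

    with h5py.File("zarr.hdf5", "w", libver="latest") as f:
        f.create_virtual_dataset("data", layout, fillvalue=0)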
18. Demo: Reading the same file via distinct packages
I’ve created a single file “demo.hdf5.zarr.tiff” that can be read by distinct Python packages (sketched below):
● h5py (HDF5)
● libtiff (TIFF)
● tensorstore (Zarr v3)
https://github.com/mkitti/simple_image_formats
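A minimal sketch of reading the one file with all three packages. The dataset name "data" and the surrounding Zarr hierarchy (a zarr.json describing the array, with this file stored under a chunk key such as c/0/0) are assumptions about the demo layout, not details from the slides; see the linked repository for the actual notebook.

    import h5py
    import tensorstore as ts
    from libtiff import TIFF

    path = "demo.hdf5.zarr.tiff"

    # HDF5 view of the data
    with h5py.File(path, "r") as f:
        hdf5_data = f["data"][:]

    # TIFF view of the data
    tiff_data = TIFF.open(path).read_image()

    # Zarr v3 view of the data: tensorstore reads the array described by a
    # zarr.json in the current directory, whose single shard is this same file
    # stored under its chunk key.
    zarr_array = ts.open({"driver": "zarr3",
                          "kvstore": {"driver": "file", "path": "."}}).result()
    zarr_data = zarr_array.read().result()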
20. … Then read the modified data with all three libraries, as a TIFF file, an HDF5 file, or a Zarr v3 shard
We can make the latest standards cooperate rather than compete.
Is this the best approach? Should we address this via file systems (FUSE), APIs (N5), or services?
21. Implementation details…
● HDF5 metadata can be consolidated by using a large enough meta_block_size
● HDF5 chunk information (offset and nbytes) can be extracted efficiently using H5Dchunk_iter (see the sketch below)
● The Zarr v3 shard specification does not require that chunks be contiguous
○ There can be empty space between chunks
○ HDF5 also does not require that chunks be stored contiguously
● A key difference is that HDF5 allows OPTIONAL application of compression filters, hence the need for a 32-bit filter mask
○ If all filters are applied, the HDF5 32-bit filter mask bits are all 0
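A minimal sketch, assuming h5py, the third-party crc32c package, and an HDF5 chunk shape that matches the Zarr inner chunk shape, of turning HDF5 chunk locations into a trailing Zarr v3 shard index. It uses h5py's chunk-query API (get_num_chunks/get_chunk_info; newer h5py also exposes chunk_iter, which wraps H5Dchunk_iter). The dataset name "data" is a placeholder.

    import struct
    import crc32c
    import h5py
    import numpy as np

    EMPTY = 2**64 - 1   # the sharding spec's marker for a missing inner chunk

    with h5py.File("shard.hdf5", "r") as f:
        dset = f["data"]
        chunks_per_dim = [s // c for s, c in zip(dset.shape, dset.chunks)]
        n_chunks = int(np.prod(chunks_per_dim))

        # Default every entry to "missing", then fill in allocated chunks.
        entries = [(EMPTY, EMPTY)] * n_chunks
        for i in range(dset.id.get_num_chunks()):
            info = dset.id.get_chunk_info(i)
            linear = np.ravel_multi_index(
                [off // c for off, c in zip(info.chunk_offset, dset.chunks)],
                chunks_per_dim)
            entries[linear] = (info.byte_offset, info.size)

    # Pack (offset, nbytes) pairs in C order and append the CRC32c checksum.
    index = b"".join(struct.pack("<QQ", offset, nbytes)
                     for offset, nbytes in entries)
    index += struct.pack("<I", crc32c.crc32c(index))
    # Appending `index` to the HDF5 file yields a trailing Zarr v3 shard index
    # (index_location = "end").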
22. Summary
● Zarr v3 shards partition chunks into smaller inner chunks.
● The resulting arrangement is similar to an HDF5 Virtual Dataset.
● A “file” could be both a valid HDF5 file and a Zarr v3 shard
○ The Zarr v3 shard index could exist in an HDF5 file either as
■ A user block at the beginning of the file OR
■ An extra contiguous dataset at the end of the file
○ A merged FADB and Zarr v3 shard index would require alignment of the 32-bit checksums
■ HDF5 adopts Jenkins’ lookup3 as a checksum
■ Zarr adopts CRC32c as a codec
● Alternative: A Zarr v3 virtual file driver for HDF5?
● Bonus (time permitting): Combining TIFF, HDF5, and Zarr v3
○ Jupyter notebook demonstration
23. Closing Thoughts
● Introducing a new format is costly
● Multi-formatting could save resources
● Cloud optimization makes it easier to combine formats
● GeoTIFF, HDF5, and Zarr may be more similar than expected