Using an HDF5 File as a Zarr v3 Shard
Mark Kittisopikul, Ph.D.
Software Engineer III
Scientific Computing Software
Janelia Research Campus
Howard Hughes Medical Institute
ESIP, July 23rd, 2025
Why combine multiple file formats?
● Avoid data duplication
○ Large datasets could be terabytes, petabytes, or exabytes in scale
○ We cannot afford to have multiple copies
● Make it easier for users to read using their favorite APIs
○ Users may have restricted access to one API but still need to access the same data
https://xkcd.com/927
My boss’ preferred solution: single API
Use a single API to read all the files
HDF5? NetCDF?
What if this is all the same?
Outline
● What is Zarr v3?
● Are Zarr v3 shards similar to HDF5?
● Could a Zarr v3 shard be an HDF5 file?
● Could a Zarr v3 shard be both an HDF5 file and a TIFF file?
Zarr v3 is a cloud-optimized, chunk-based, hierarchical array storage specification
● Metadata are stored as JSON zarr.json files for each group and array (see the sketch below)
● Chunks are stored as individual keys (files) in a key-value store (e.g., a filesystem)
○ On a file system, chunks are individual files.
● For small chunks (e.g., 32x32x32) in a large array (e.g., 4096x4096x4096), this could result in millions of files
○ Efficient access requires optimized file systems
https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#
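For illustration, here is a minimal sketch of the zarr.json metadata for the array described above, written as a Python dict. The data type and fill value are assumptions; the field names follow the Zarr v3 core spec.

```python
# A minimal sketch of zarr.json metadata for the 4096^3 array described
# above, chunked at 32^3 with no sharding. The dtype and fill value are
# assumptions for illustration.
import json

zarr_json = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [4096, 4096, 4096],
    "data_type": "uint16",                       # assumed dtype
    "chunk_grid": {
        "name": "regular",
        "configuration": {"chunk_shape": [32, 32, 32]},
    },
    "chunk_key_encoding": {"name": "default"},   # keys like c/0/0/0
    "fill_value": 0,
    "codecs": [{"name": "bytes",
                "configuration": {"endian": "little"}}],
}
print(json.dumps(zarr_json, indent=2))
# (4096/32)^3 = 2,097,152 chunk keys -- millions of files on disk.
```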
A Zarr v3 shard is a codec that subdivides a single chunk into smaller inner chunks
● Shards allow many inner chunks to exist in a single file.
● Therefore, we can reduce the number of files required. Example (worked out in the sketch below):
○ Array size: 4096 x 4096 x 4096
○ Shard size: 1024 x 1024 x 1024
○ Chunk size: 32 x 32 x 32
● Inner chunks can be individually compressed
https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/index.html
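A hedged sketch of the sharding_indexed codec entry that realizes the example above: setting the chunk_grid chunk_shape to 1024^3 and adding this entry to the codecs list turns each chunk into a shard of 32^3 inner chunks, so 64 shard files replace roughly 2.1 million chunk files.

```python
# Sketch: with chunk_grid chunk_shape set to [1024, 1024, 1024], this
# codec entry (placed in the zarr.json "codecs" list) subdivides each
# shard into (1024/32)^3 = 32,768 inner chunks. The array then needs
# (4096/1024)^3 = 64 shard files instead of ~2.1M chunk files.
sharding_codec = {
    "name": "sharding_indexed",
    "configuration": {
        "chunk_shape": [32, 32, 32],          # inner chunk shape
        "codecs": [{"name": "bytes",
                    "configuration": {"endian": "little"}}],
        "index_codecs": [{"name": "bytes",
                          "configuration": {"endian": "little"}},
                         {"name": "crc32c"}],
        "index_location": "end",              # index at end of shard
    },
}
```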
A Zarr v3 shard chunk index exists at either the beginning or end of the shard
● The size of the chunk index can be calculated directly from information in the zarr.json file
○ nChunks × 16 bytes + 4 bytes
○ The 4-byte checksum of the chunk index is calculated using CRC32c
○ Retrievable with a single HTTP GET request using a byte-range header (see the sketch below)
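A minimal sketch of such a ranged read, assuming an end-located index and a hypothetical shard URL:

```python
# Hedged sketch: compute the shard index size from zarr.json values and
# fetch an end-located index with one ranged GET. The URL is hypothetical.
import requests

n_chunks = (1024 // 32) ** 3            # 32,768 inner chunks per shard
index_size = n_chunks * 16 + 4          # 16 bytes per entry + 4-byte CRC32c

url = "https://example.com/array/c/0/0/0"   # hypothetical shard key
resp = requests.get(url, headers={"Range": f"bytes=-{index_size}"})
resp.raise_for_status()
index_bytes = resp.content              # the trailing shard index
assert len(index_bytes) == index_size
```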
The Zarr v3 shard index is similar to an HDF5 Fixed Array Data Block; they differ by 14 or 18 bytes
(Diagram: both lay out per-chunk entries of 64-bit and 32-bit fields; the HDF5 block carries a 14-byte header and a Jenkins checksum, while the Zarr index carries a CRC32c checksum.)
Formatting an HDF5 File as a Zarr v3 shard?
● We need to place the Zarr v3 shard index at the beginning or end of the file
● Options
○ A: Put the Zarr v3 shard index in the HDF5 User Block at the beginning of the file (sketched below)
○ B: Put the Zarr v3 shard index into a dataset at the end of the file
○ C: Relocate the HDF5 Fixed Array Data Block to the end of the file
(Diagram of the three layouts:)
● Option A: HDF5 User Block holding the Zarr v3 shard index, followed by HDF5 metadata and shared data chunks
● Option B: HDF5 metadata and shared data chunks, followed by a 2nd HDF5 dataset serving as the Zarr v3 shard index
● Option C: HDF5 metadata and shared data chunks, with the HDF5 FADB itself serving as the Zarr v3 shard index? (CRC32c or Jenkins checksum?)
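A rough sketch of Option A under stated assumptions: h5py can reserve a user block at file creation, and the index bytes are written into it afterwards. Sizes are illustrative, and the placeholder write stands in for a real index (a builder is sketched under "Implementation details" below); a start-located index corresponds to index_location "start" in the sharding codec.

```python
# Hedged sketch of Option A: reserve an HDF5 user block large enough for
# a start-located Zarr v3 shard index, then write the index bytes into it.
import h5py
import numpy as np

index_size = 8 * 16 + 4            # e.g., 8 inner chunks + 4-byte CRC32c
userblock = 512                    # must be a power of two >= 512
while userblock < index_size:
    userblock *= 2

with h5py.File("shard.h5", "w", userblock_size=userblock) as f:
    f.create_dataset("data", data=np.arange(64, dtype="u2"), chunks=(8,))

with open("shard.h5", "r+b") as raw:   # the HDF5 library ignores the user block
    raw.write(b"\x00" * index_size)    # placeholder for the shard index bytes
```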
Zarr v3 sharding is similar to an HDF5 Virtual Dataset
● Virtual datasets are an HDF5 feature that allows part of a dataset to exist as a dataset in another file (~a Zarr v3 shard); a VDS sketch follows below
● A Zarr v3 shard is analogous to a file with a single chunked source dataset
https://support.hdfgroup.org/releases/hdf5/documentation/rfc/HDF5-VDS-requirements-use-cases-2014-12-10.pdf
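A minimal h5py sketch of this analogy, stitching four files named after Zarr chunk keys into one Virtual Dataset; the dataset name, dtype, and shapes are assumptions.

```python
# Hedged sketch: an HDF5 Virtual Dataset stitching four shard files into
# one 2x2 combined array, mirroring the figure on the next slide.
import h5py

layout = h5py.VirtualLayout(shape=(2048, 2048), dtype="u2")
for i in range(2):
    for j in range(2):
        # Each source file is named after its Zarr chunk key.
        src = h5py.VirtualSource(f"c/{i}/{j}", "data", shape=(1024, 1024))
        layout[i*1024:(i+1)*1024, j*1024:(j+1)*1024] = src

with h5py.File("zarr.hdf5", "w") as f:
    f.create_virtual_dataset("data", layout)
```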
Combined Zarr Array as an HDF5 Virtual Dataset
(Diagram: four files at the chunk keys c/0/0, c/0/1, c/1/0, and c/1/1, each containing HDF5 metadata, shared data chunks, and a Zarr v3 shard index; zarr.json describes the Zarr array, and zarr.hdf5 holds the HDF5 Virtual Dataset that indexes the four files.)
Could we combine TIFF, HDF5, and Zarr?
The added metadata is only ~KBs. Can microscopists implement this simply and efficiently?
Demo: Reading the same file via distinct packages
I’ve created a single file “demo.hdf5.zarr.tiff” that can be read by distinct Python packages (a reading sketch follows below):
● h5py (HDF5)
● libtiff (TIFF)
● tensorstore (Zarr v3)
https://github.com/mkitti/simple_image_formats
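A minimal reading sketch under stated assumptions: the dataset name and on-disk layout (zarr.json alongside the file at its chunk key) are inferred rather than confirmed, and tifffile stands in for libtiff.

```python
# Hedged sketch: read the same bytes through three APIs.
import h5py
import tifffile                          # stands in for libtiff here
import tensorstore as ts

path = "demo.hdf5.zarr.tiff"

with h5py.File(path, "r") as f:          # 1) as an HDF5 file
    hdf5_data = f["data"][:]             # assumed dataset name

tiff_data = tifffile.imread(path)        # 2) as a TIFF file

store = ts.open({                        # 3) as a Zarr v3 shard
    "driver": "zarr3",
    "kvstore": {"driver": "file", "path": "."},  # dir containing zarr.json
}).result()
zarr_data = store.read().result()
```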
We can modify the data with h5py (via HDF5) …
… then read the modified data back from all three libraries as a TIFF file, an HDF5 file, or a Zarr v3 shard
We can make the latest standards cooperate rather than compete.
Is this the best approach? Should we address this via file systems
(FUSE), APIs (N5), or services?
Implementation details…
● HDF5 metadata can be consolidated by using a large enough meta_block_size
● HDF5 chunk information (offset and nbytes) can be extracted efficiently using H5Dchunk_iter (see the sketch below)
● The Zarr v3 shard specification does not require that chunks be contiguous
○ There can be empty space between chunks
○ HDF5 also does not require that chunks be stored contiguously
● A key difference is that HDF5 allows for OPTIONAL application of compression filters, hence the need for a 32-bit filter mask
○ If all filters are applied, the HDF5 32-bit filter mask bits are all 0
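A hedged sketch of building a Zarr v3 shard index from HDF5 chunk information via h5py's chunk iteration (which wraps H5Dchunk_iter; requires h5py >= 3.8). The file and dataset names are assumed, and crc32c is an assumed extra dependency.

```python
# Hedged sketch: collect per-chunk (offset, nbytes) from an HDF5 file and
# pack them as a Zarr v3 shard index: little-endian uint64 pairs + CRC32c.
import struct
import h5py
from crc32c import crc32c                  # pip install crc32c (assumed)

entries = []
with h5py.File("demo.hdf5.zarr.tiff", "r") as f:
    dset = f["data"]                       # assumed dataset name
    def collect(info):                     # info is an h5py StoreInfo
        entries.append((info.chunk_offset, info.byte_offset, info.size))
    dset.id.chunk_iter(collect)

# Order by logical chunk position so the index matches the inner chunk grid.
entries.sort(key=lambda e: e[0])
index = b"".join(struct.pack("<QQ", off, n) for _, off, n in entries)
index += struct.pack("<I", crc32c(index))  # trailing 4-byte CRC32c checksum
```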
Summary
● Zarr v3 shards partition chunks into smaller inner chunks.
● The resulting arrangement is similar to an HDF5 Virtual Dataset.
● A “file” could be both a valid HDF5 file and a Zarr v3 shard
○ The Zarr v3 shard index could exist in an HDF5 file either as
■ A user block at the beginning of the file OR
■ An extra contiguous dataset at the end of the file
○ A merged FADB and Zarr v3 shard index would require alignment of 32-bit checksums
■ HDF5 adopts Jenkins’ lookup3 as a checksum
■ Zarr adopts CRC32c as a codec
● Alternative: A Zarr v3 virtual file driver for HDF5?
● Bonus (time permitting): Combining TIFF, HDF5, and Zarr v3
○ Jupyter notebook demonstration
Closing Thoughts
● Introducing a new format is costly
● Multi-formatting could save resources
● Cloud optimization makes it easier to combine formats
● GeoTIFF, HDF5, and Zarr may be more similar than expected