1. HDF5 2.0: Cloud Optimized from the Start
2025 ESIP Summer Meeting
Aleksandar Jelenak
2. Acknowledgments
• The features presented here are the work of open-source community contributors
and HDF Group (HDFG) staff supported by several projects.
• Project honorable mentions:
• NASA/GSFC Raytheon Company contract number 80GSFC21CA001
• Department of Energy, Office of Science, Office of Fusion Energy Sciences, Award Number DE-SC0024442
3. HDF5 Library 2.0
• Semantic versioning of releases
• CMake build system only
• Complex numbers and AI/ML datatypes
• Default settings changed to benefit cloud optimized HDF5
• New S3 backend for Read-Only S3 (ROS3) driver
• HDF5 filter availability improvements
4. Compliant semantic versioning of library releases
• MAJOR.MINOR.PATCH
• In all releases to date the MAJOR number has always been 1, making it effectively meaningless.
• Major version:
• HDF5 file format change.
• Any change that introduces application binary interface (ABI) incompatibility.
• Minor version:
• Any change not deemed worthy of a major release, e.g., new APIs or improved performance.
• Patch version:
• Security or memory leak fixes.
• Don’t worry, just upgrade.
• Minor and patch releases are intended to be drop-in upgrades.
5. CMake build system only
• Building the library with Autotools has been removed.
• Supporting both CMake and Autotools was a big drain on HDFG resources.
• Build configurations are available as CMake presets with the library source code.
• CMake presets are in a JSON document and define configurations for building, testing, and
packaging.
• You can personalize build settings with user CMake presets in a separate JSON file
that reuses or extends the presets already provided; a small sketch follows at the end of this slide.
• Blog about building HDF5 library 2.0-dev and h5py in a conda environment:
https://www.hdfgroup.org/2025/07/22/how-to-build-hdf5-library-and-h5py-in-a-conda-virtual-environment-update/
6. Complex number datatype
• The C99 standard float _Complex, double _Complex, and long double _Complex
datatypes are supported on platforms and compilers that provide them.
• New HDF5 datatypes: H5T_NATIVE_FLOAT_COMPLEX,
H5T_NATIVE_DOUBLE_COMPLEX, H5T_NATIVE_LDOUBLE_COMPLEX,
H5T_COMPLEX_IEEE_F32, H5T_COMPLEX_IEEE_F64.
• Existing complex number data stored with specific HDF5 compound datatypes is
grandfathered in (member names may be abbreviated, as indicated by the parentheses):
H5T_COMPOUND {
<float_type> "r(e)(a)(l)"; OFFSET 0
<float_type> "i(m)(a)(g)(i)(n)(a)(r)(y)"; OFFSET SIZEOF("r(e)(a)(l)")
}
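A minimal sketch of writing complex data with the new datatypes, assuming a compiler with C99 complex support and an HDF5 2.0 build (file and dataset names are arbitrary):

#include <complex.h>
#include "hdf5.h"

int main(void)
{
    double _Complex data[4] = {1.0 + 2.0 * I, 3.0 - 4.0 * I, 0.0 + 1.0 * I, 5.0};
    hsize_t dims[1] = {4};

    hid_t file  = H5Fcreate("complex.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* Store as IEEE 64-bit complex in the file; the in-memory buffer uses the
       native complex type, so the library converts as needed. */
    hid_t dset = H5Dcreate2(file, "z", H5T_COMPLEX_IEEE_F64, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE_COMPLEX, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}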
7. AI/ML datatypes
• New predefined floating-point datatypes defined in Open Compute Project
Microscaling (MX) Specification v1.0.
• H5T_FLOAT_F8E4M3, H5T_FLOAT_F8E5M2, H5T_FLOAT_F6E2M3,
H5T_FLOAT_F6E3M2, H5T_FLOAT_F4E2M1
• Google Brain 16-bit float (bfloat16): H5T_FLOAT_BFLOAT16
• Compiler support is still patchy, so for now the library provides soft (software)
conversion, which is slower.
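A sketch of how the soft conversion is typically used: the application keeps ordinary floats in memory and lets the library narrow them to bfloat16 in the file (dataset name and values are arbitrary):

#include "hdf5.h"

int main(void)
{
    float   data[4] = {0.15625f, -2.5f, 1024.0f, 3.14159f};
    hsize_t dims[1] = {4};

    hid_t file  = H5Fcreate("bf16.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* File datatype is bfloat16; memory datatype is native float, so the
       library's (soft) conversion narrows the values on write. */
    hid_t dset = H5Dcreate2(file, "weights", H5T_FLOAT_BFLOAT16, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}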
8. Changed default settings for cloud optimized HDF5
• Dataset chunk cache size increased to 8 MiB from 1 MiB.
• File page cache size set to 64 MiB when the ROS3 driver is used. Previously: zero (no page
caching).
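The same values can also be set explicitly on a file access property list, which is one way to get this behavior from older releases or to tune it further (a sketch; the numbers simply mirror the new defaults):

#include "hdf5.h"

hid_t make_cloud_fapl(void)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

    /* Raw data chunk cache: 8 MiB, 521 hash slots, preemption policy 0.75.
       The second argument (mdc_nelmts) is no longer used. */
    H5Pset_cache(fapl, 0, 521, 8 * 1024 * 1024, 0.75);

    /* File page cache: 64 MiB. Effective for files written with paged
       aggregation, e.g. cloud optimized HDF5 read through ROS3. */
    H5Pset_page_buffer_size(fapl, 64 * 1024 * 1024, 0, 0);

    return fapl;
}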
9. New ROS3 backend
• The ROS3 driver now uses the AWS-C-S3 library to communicate with the AWS S3
service.
• Built-in support for AWS config/credential files and AWS environment variables.
• Handling of non-fatal failed S3 requests.
• Smart network utilization.
• Support for S3 URIs (h5ls s3://mybucket/myfile.h5); a short C sketch appears at the end of this slide.
• Features already available in recent library releases:
• First 16 MiB of the file cached during its open operation.
• Setting a non-zero file page cache no longer causes a library error when opening typical (non-paged) HDF5 files.
• Bugfix: no longer makes an additional S3 request for the already cached first 16 MiB of the file.
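A C sketch of opening a file through the ROS3 driver, assuming credentials are picked up from the AWS config/credential files or environment variables mentioned above; whether H5Fopen accepts the same s3:// URI form as the command-line tools is an assumption here, and the bucket/object name is a placeholder:

#include "hdf5.h"

int main(void)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

    /* Select the ROS3 driver by name; a NULL configuration string means driver
       defaults, with credentials resolved from AWS config files or env. vars. */
    H5Pset_driver_by_name(fapl, "ros3", NULL);

    /* Assumption: the library accepts the same s3:// URI form as h5ls/h5dump. */
    hid_t file = H5Fopen("s3://mybucket/myfile.h5", H5F_ACC_RDONLY, fapl);

    /* ... read datasets as usual ... */

    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}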
10. Easier ROS3 logging
• HDF5_ROS3_VFD_DEBUG prints S3 requests and related info to stderr. Any value
except false, off, or 0 enables debug logging.
• HDF5_ROS3_VFD_LOG_LEVEL writes AWS-C-S3 library log info to a file; accepted values:
error, info, debug, trace.
• The two variables above can be used together because they produce different logging
information.
• HDF5_ROS3_VFD_LOG_FILE sets the AWS-C-S3 log file destination. Default file name:
hdf5_ros3_vfd.log.
12. HDF5_ROS3_VFD_LOG_LEVEL example
$ HDF5_ROS3_VFD_LOG_LEVEL=info h5dump -H -p s3://hdf5.sample/data/cohdf5/GEDI/PAGE08MiB_GEDI02_A_2023034194553_O23479_02_T00431_02_003_02_V002.h5 > /dev/null
$ tail -n +26 hdf5_ros3_vfd.log | head -n 30
[INFO] [2025-06-24T23:59:11Z] [000000016fd0b000] [event-loop] - id=0x14265ac30: main loop started
[INFO] [2025-06-24T23:59:11Z] [000000016fd0b000] [event-loop] - id=0x14265ac30: default timeout 100s, and max events to process per tick 100
[INFO] [2025-06-24T23:59:11Z] [00000001f2639f00] [event-loop] - id=0x1426599b0: starting event-loop thread.
[INFO] [2025-06-24T23:59:11Z] [00000001f2639f00] [standard-retry-strategy] - static: creating new standard retry strategy
[INFO] [2025-06-24T23:59:11Z] [00000001f2639f00] [standard-retry-strategy] - id=0x14265b610: creating backing exponential backoff strategy with max_retries of 5
[INFO] [2025-06-24T23:59:11Z] [000000016fe17000] [event-loop] - id=0x1426599b0: main loop started
[INFO] [2025-06-24T23:59:11Z] [000000016fe17000] [event-loop] - id=0x1426599b0: default timeout 100s, and max events to process per tick 100
[INFO] [2025-06-24T23:59:11Z] [00000001f2639f00] [exp-backoff-strategy] - id=0x142659df0: Initializing exponential backoff retry strategy with scale factor: 0 jitter
mode: 0 and max retries 5
[INFO] [2025-06-24T23:59:11Z] [00000001f2639f00] [S3Client] - id=0x142659f20 Initiating making of meta request
[INFO] [2025-06-24T23:59:11Z] [00000001f2639f00] [connection-manager] - id=0x14265d3e0: Successfully created
[INFO] [2025-06-24T23:59:11Z] [00000001f2639f00] [S3Client] - id=0x142659f20: Created meta request 0x14265c7c0
[INFO] [2025-06-24T23:59:11Z] [000000016f5b7000] [S3ClientStats] - id=0x142659f20 Requests-in-flight(approx/exact):1/1 Requests-preparing:1 Requests-queued:0
Requests-network(get/put/default/total):0/0/0/0 Requests-streaming-waiting:0 Requests-streaming-response:0 Endpoints(in-table/allocated):1/1
[INFO] [2025-06-24T23:59:11Z] [000000016fbff000] [AuthCredentialsProvider] - (id=0x1426594c0) Default chain credentials provider successfully sourced credentials
[INFO] [2025-06-24T23:59:11Z] [000000016fbff000] [AuthSigning] - (id=0x14270b3d0) Signing successfully built canonical request for algorithm SigV4, with contents
HEAD
/hdf5.sample/data/cohdf5/GEDI/PAGE08MiB_GEDI02_A_2023034194553_O23479_02_T00431_02_003_02_V002.h5
host:s3.us-west-2.amazonaws.com
x-amz-content-sha256:UNSIGNED-PAYLOAD
x-amz-date:20250624T235911Z
host;x-amz-content-sha256;x-amz-date
UNSIGNED-PAYLOAD
[INFO] [2025-06-24T23:59:11Z] [000000016fbff000] [AuthSigning] - (id=0x14270b3d0) Signing successfully built string-to-sign via algorithm SigV4, with contents
AWS4-HMAC-SHA256
20250624T235911Z
20250624/us-west-2/s3/aws4_request
a7cba9b6346d36a06596caad68be12da7e2424309bb11ed604df13a90cbdf2eb
13. Recap: HDF5 filters
• Any software that transforms a block of bytes representing one HDF5 dataset
chunk can become an HDF5 filter if a filter plugin for it is written.
• Each filter plugin has a unique identifier assigned by the HDF Group; a short sketch using such an identifier follows this list.
• Anyone can create a filter plugin. Popular ones (subjective!) are included in the
HDF Group’s GitHub repo (hdf5_plugins) for HDF5 library continuous integration
activities and binary release packages.
• Official information about registered filter plugins:
https://github.com/HDFGroup/hdf5_plugins/blob/master/docs/RegisteredFilterPlugins.md
• Most filters are for data compression.
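For example, an application asks for a dynamically loaded compression filter on a dataset creation property list by its registered identifier. A sketch using the identifier listed for the Zstandard filter (32015); treating the single cd_value as the compression level is an assumption about that plugin:

#include "hdf5.h"

#define FILTER_ZSTD 32015   /* registered identifier of the Zstandard filter */

hid_t make_compressed_dcpl(void)
{
    hsize_t  chunk[2]     = {1024, 1024};
    unsigned cd_values[1] = {5};   /* filter-specific parameter (assumed: level) */

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);                /* filters apply per chunk */
    H5Pset_filter(dcpl, FILTER_ZSTD, H5Z_FLAG_MANDATORY, 1, cd_values);
    return dcpl;
}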
14. HDF5 filter plugin news
• The official registered filter plugins document is constantly updated to improve
data interoperability.
• Documented custom LZ4-compressed HDF5 dataset chunk format.
• Detailed specification of filter plugin configuration parameters.
• LZ4 filter plugin bugfix if using LZ4 >= v1.9.
• Zlib-ng library can be used for the DEFLATE compression filter.
• It is about 2x faster than the zlib library used traditionally by the HDF5 library.
• This is a build option.
15. How to make advanced compression filters easily available?
• Except for DEFLATE and SZIP, all compression filters are dynamically loaded
plugins that must be available at runtime.
• Builds of the HDF5 library and of HDFG’s hdf5_plugins filter plugins are now
compatible, so no configuration is required to find the plugins at runtime; see the sketch at the end of this slide.
• Encourage community package repository maintainers to provide “integrated”
builds of both the library and the most popular filter plugins.
• The hdf5_plugins repo could become a community resource and centralize
maintenance of the popular plugins. Current offerings: LZ4, LZF, BZIP2, BLOSC,
BLOSC2, Zstandard, Bitshuffle, BitGroom, Granular BitRound, JPEG, and ZFP.
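A closing sketch: an application can check at runtime whether such a plugin is usable and, if necessary, add a non-default directory to the plugin search path (the directory is a placeholder; setting HDF5_PLUGIN_PATH before startup has the same effect):

#include <stdio.h>
#include "hdf5.h"

#define FILTER_ZSTD 32015   /* registered identifier of the Zstandard filter */

int main(void)
{
    /* Add a non-default directory to the filter plugin search path. */
    H5PLappend("/opt/hdf5-plugins/lib/plugin");   /* placeholder location */

    /* Report whether the Zstandard filter can be used by this process. */
    printf("Zstandard filter %savailable\n",
           H5Zfilter_avail(FILTER_ZSTD) > 0 ? "" : "NOT ");
    return 0;
}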