Aleksandar Jelenak
NASA EED-3 / HDF Group
2024 ESIP Summer Meeting
Cloud Optimized HDF5 Files:
Current Status
GOVERNMENT RIGHTS NOTICE
This work was authored by employees of The HDF Group under Contract No. 80GSFC21CA001 with the National Aeronautics and Space
Administration. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United
States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to reproduce, prepare derivative works, distribute copies to
the public, and perform publicly and display publicly, or allow others to do so, for United States Government purposes. All other rights are
reserved by the copyright owner.
©2024 Raytheon Company. All rights reserved.
Glossary
HDF5: Hierarchical Data Format Version 5
netCDF-4: Network Common Data Form version 4
COH5: Cloud optimized HDF5
S3: Simple Storage Service
EOSDIS: NASA Earth Observing System Data and Information System
MB: megabyte (10^6 bytes)
kB: kilobyte (10^3 bytes)
MiB: mebibyte (2^20 bytes)
kiB: kibibyte (2^10 bytes)
LIDAR: laser imaging, detection, and ranging
URI: uniform resource identifier
What are cloud optimized HDF5 files?
● Valid HDF5 files. Not a new file format or convention.
● Larger dataset chunk sizes.
● Internal file metadata consolidated into bigger contiguous blocks.
● Total number of required S3 requests is significantly reduced, which directly
improves performance.
● For detailed information, see my 2023 ESIP Summer talk.
From “HDF at the Speed of Zarr” by Luis Lopez, NASA NSIDC.
Current Advice for COH5
File Settings
Larger dataset chunk sizes
● Term clarification:
○ chunk shape = number of array elements in each chunk dimension
○ chunk size = number of bytes (number of array elements in a chunk
multiplied by the byte size of one array element)
● Chunk size is measured before any filters (compression, etc.) are applied.
● Not enough testing so far:
○ EOSDIS granules with larger dataset chunks are rare.
○ The h5repack tool is not easy to use for large rechunking jobs.
● Larger chunks = fewer of them = less internal file metadata.
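The chunk shape vs. chunk size distinction can be made concrete in a few lines of Python (the chunk shape and dtype below are illustrative, not a recommendation):

```python
import numpy as np

# Hypothetical chunk shape for a 3-D float32 dataset (illustrative values)
chunk_shape = (10, 256, 256)             # elements per chunk dimension
itemsize = np.dtype("float32").itemsize  # bytes per one array element

# Chunk size = product of the shape times the element size, BEFORE any filters
chunk_size = int(np.prod(chunk_shape)) * itemsize
print(chunk_size)  # 2621440 bytes (2.5 MiB), prior to compression
```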
Consolidation of internal metadata
● Three different consolidation methods (see the YouTube video on slide #3).
● Practically only one of them tested: files created with paged aggregation file
space management strategy. (Easier to pronounce: paged files.)
● An HDF5 file is divided into pages. Page size set at file creation.
● Each page holds either internal metadata or data (chunks).
Paged file: pros and cons
● HDF5 library reads entire pages, which yields its best cloud performance.
● It also has a special cache for these pages, called the page buffer. Its size must
be set prior to opening a file.
● One file page can hold more than one chunk = fewer S3 requests overall.
● Paged files tend to be larger than their non-paged versions due to unused
space in each page.
○ Think of a file page as a box filled with different sized objects.
Current Advice: Chunks
● Chunk size needs to account for the speed of applying filters (e.g.,
decompression) when chunks are read.
● NASA satellite data predominantly compressed with the zlib (a.k.a. gzip,
deflate) method.
● Need to explore other compression methods for optimal speed vs.
compression ratio.
● Smaller compressed chunks fill file pages better.
● Suggested chunk sizes: 100k(i)B to 2M(i)B.
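One way to act on this advice is to check a candidate chunk shape against the suggested size window before writing the file. A minimal sketch (the shapes, dtype, and bounds are illustrative):

```python
import numpy as np

def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed chunk size in bytes (before any filters are applied)."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

def within_advice(chunk_shape, dtype, lo=100 * 1024, hi=2 * 1024 * 1024):
    """True if the chunk size falls in the suggested ~100 kiB to 2 MiB window."""
    return lo <= chunk_nbytes(chunk_shape, dtype) <= hi

print(within_advice((256, 256), "float32"))    # 262144 bytes -> True
print(within_advice((2048, 2048), "float32"))  # 16 MiB -> False
```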
Current Advice: Paged files
● Tested file pages of 4, 8, and 16 MiB sizes.
● 8 MiB file page produced slightly better performance, with tolerable (<5%) file
size increase.
● Majority of tested files had their internal metadata in one 8 MiB file page.
● Don’t worry about unused space in that one internal metadata file page.
● Majority of datasets in the tested files were stored in a single file page.
● Consider a minimum of four chunks per file page when choosing a dataset’s
chunk size.
● If writing data to a paged file in more than one open-close session, enable
re-use of the file’s free space when creating it.
○ Otherwise, the file may end up much larger than needed.
○ h5repack can produce a defragmented version of the file.
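The repacking steps above can be sketched with the h5repack command-line tool (filenames and the dataset path are placeholders):

```shell
# Repack an existing file into a paged version with an 8 MiB page size
h5repack -S PAGE -G 8388608 input.h5 output_paged.h5

# Optionally rechunk and recompress a specific dataset while repacking
h5repack -S PAGE -G 8388608 \
         -l /group/dset:CHUNK=10x256x256 \
         -f /group/dset:GZIP=4 \
         input.h5 output_paged.h5

# Repacking copies objects into a fresh file, so a plain repack of a
# fragmented paged file writes a defragmented copy
h5repack -S PAGE -G 8388608 fragmented.h5 defragmented.h5
```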
What happens to chunks
in a paged file?
Example: GEDI Level 2A granule
● Global Ecosystem Dynamics Investigation (GEDI) instrument is on the
International Space Station.
● A full-waveform LIDAR system for high-resolution observations of forests’
vertical structure.
● Example granule:
○ 1,518,338,048 bytes
○ 136 contiguous datasets
○ 4,184 chunked datasets compressed with the zlib filter
● Repacked into a paged file version with 8 MiB file page size.
● No chunk was “hurt” (i.e., rechunked) during repacking.
Chunk sizes
Number of stored dataset chunks
Dataset chunk spread across file pages
Extra file pages compared to dataset total size
Dataset cache size for all chunks?
HDF5 Library Improvements
for Cloud Data Access
HDF5 library
● Applies to version 1.14.4 only.
● Released in May 2024.
● All other maintenance releases of the library – 1.8.*, 1.10.*, and 1.12.* – are
deprecated now.
● Native method for S3 data access: Read-Only S3 (ROS3) virtual file driver
(VFD).
○ Not always available – build dependent.
○ The conda-forge hdf5 package has it, but h5py wheels from PyPI do not.
● For Python users: fsspec via h5py.
○ fsspec connects to the library through its virtual file layer API.
○ Lacks communication of important information from the library.
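A minimal sketch of both access paths (the URL is a placeholder; ROS3 availability depends on how the HDF5 library was built, and in 1.14.4 ROS3 takes an HTTPS object URL rather than an s3:// URI):

```python
import h5py

# Placeholder HTTPS URL for an S3 object (not a real granule)
HTTPS_URL = "https://bucket-name.s3.us-west-2.amazonaws.com/object-name.h5"

def open_with_ros3(url):
    """Native path: Read-Only S3 (ROS3) virtual file driver, if built in."""
    return h5py.File(url, "r", driver="ros3", aws_region=b"us-west-2")

def open_with_fsspec(url):
    """Python-only path: fsspec hands h5py a file-like object through the
    library's virtual file layer."""
    import fsspec
    return h5py.File(fsspec.open(url, "rb").open(), "r")
```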
Notable improvements
● ROS3 caches first 16 MiB of the file on open.
● ROS3 support for AWS temporary session token.
● Set-and-forget page buffer size. Opening non-paged files will not cause an
error.
● Fixed chunk file location info to account for file’s user block size.
● Fixed an h5repack bug for datasets with variable-length data. Important when
repacking netCDF-4 string variables.
● Next release: Build with zlib-ng. This is a newer open-source implementation
of the standard zlib compression library and ~2x faster.
● Next release: h5repack, h5ls, h5dump, and h5stat new command-line
option for page buffer size. This will enable much improved performance for
cloud optimized files in S3.
● Next release: ROS3 support for relevant AWS environment variables.
● Next release: Support for S3 object URIs (s3://bucket-name/object-name).
This work was supported by NASA/GSFC under
Raytheon Company contract number
80GSFC21CA001
