Cloud Optimized HDF5
for the
ICESat-2 mission
ESIP Summer meeting 2024
Luis López
Research Software Engineer
NSIDC
Andrew Barrett
Aleksandar Jelenak
Lisa Kaser
Jeff Lee
Amy Steiker
Credit: NASA's Goddard Space Flight Center
Important questions about our planet can now be answered by
integrating years of data from different missions.
Global Sea Ice Concentration
Boreal Forest Biomass
The data coming from these missions is
now available in the cloud! **
NASA and other agencies have started to migrate their data to the cloud.
**Caveat: it is by and large in archival formats, HDF5 and NetCDF.
Problem: Accessing HDF5 in
the cloud is slow. How slow?
Improving performance of HDF5 in the cloud is key to
enabling science at scale.
● Data is becoming too large to work
locally.
● I/O libraries are optimized for local and
supercomputing workflows.
● HDF as a format was not designed for
the cloud.
The problem = Size + Tools + Format
Cloud-optimized HDF5?
https://www.hdfgroup.org/2024/01/strategies-and-software-to-optimize-hdf5-netcdf-4-files-for-the-cloud/
● Metadata is consolidated
● Custom caching buffer size
● Global API lock is still in place
Why HDF is not Performant in the Cloud
● Metadata is scattered throughout the file, and each nested group makes this
problem worse.
● By default, metadata is written to (and read from) the file in fixed blocks of
4 KB, so 1 MB of metadata takes roughly 250 requests.
● Because of the global API lock, those 250 requests are sequential!
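The request arithmetic above can be checked with a short sketch. The 50 ms per-request latency and the 8 MiB page size are illustrative assumptions, not figures from the slides; with binary units the count comes out at 256 rather than the rounded 250.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def requests_needed(metadata_bytes: int, block_bytes: int) -> int:
    """Number of fixed-size reads needed to fetch all the metadata."""
    return math.ceil(metadata_bytes / block_bytes)


def sequential_seconds(n_requests: int, latency_s: float) -> float:
    """Total wait when requests cannot overlap (global API lock)."""
    return n_requests * latency_s


# Assumption: 50 ms round trip per range request to object storage.
ASSUMED_LATENCY_S = 0.05

default_reqs = requests_needed(1 * MIB, 4 * KIB)  # default 4 KiB blocks
paged_reqs = requests_needed(1 * MIB, 8 * MIB)    # assumed 8 MiB pages

print(f"4 KiB blocks: {default_reqs} requests, "
      f"{sequential_seconds(default_reqs, ASSUMED_LATENCY_S):.1f} s")
print(f"8 MiB pages:  {paged_reqs} request, "
      f"{sequential_seconds(paged_reqs, ASSUMED_LATENCY_S):.2f} s")
```

Because the requests are sequential, the total wait scales linearly with the request count, which is why larger read units matter more than raw bandwidth here.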
Paged Aggregation (data + metadata)
Metadata Blocks (user or dedicated page)
Trying Big Files from the ICESat-2 Mission
Source: https://github.com/nsidc/earthaccess/discussions/251
Accidental Complexity
NASA Policies
file format library
I/O driver
data wrangling library
AWS S3
ds = xr.open_dataset("s3://nasa-data.hdf5")
(or ROS3)
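Each layer in that stack has its own tuning knobs. The sketch below shows one way they might be passed down from xarray through fsspec to h5py. The helper names `cloud_read_options` and `open_cloud_hdf5`, the sizes, and the choice of the h5netcdf engine are assumptions for illustration, not the authors' code. The caller supplies an already-authenticated fsspec filesystem, since NASA data requires credentials.

```python
from typing import Any, Dict, Optional

MIB = 1024 * 1024


def cloud_read_options(page_size_bytes: int = 8 * MIB) -> Dict[str, Dict[str, Any]]:
    """Per-layer read settings, sized relative to the file's page size."""
    return {
        # I/O driver layer: fsspec fetches and caches whole blocks.
        "fsspec": {"cache_type": "blockcache", "block_size": page_size_bytes},
        # File format layer: forwarded to h5py.File by the h5netcdf engine.
        "h5py": {
            "page_buf_size": 4 * page_size_bytes,
            "rdcc_nbytes": page_size_bytes,
        },
    }


def open_cloud_hdf5(fs, url: str, group: Optional[str] = None,
                    page_size_bytes: int = 8 * MIB):
    """Open one group of a remote HDF5 file with tuned caching.

    `fs` is an fsspec filesystem (for example an authenticated S3 one).
    """
    import xarray as xr  # imported lazily; needs xarray and h5netcdf

    opts = cloud_read_options(page_size_bytes)
    fileobj = fs.open(url, mode="rb", **opts["fsspec"])
    return xr.open_dataset(
        fileobj, engine="h5netcdf", group=group, driver_kwds=opts["h5py"]
    )


opts = cloud_read_options()
print(opts)
```

The page buffer only helps when the file was written with paged aggregation; for a file with the default layout, the fsspec block cache is the main lever.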
It’s APIs All the Way Down
Cloud-Optimized HDF5 Works!*
Code: https://gist.github.com/betolink/b545c364f80882c113b8cc27b763c729
Source: Andrew Barrett
Remote I/O Visualized
https://github.com/ajelenak/ros3vfd-log-info
● Cloud optimizations to HDF5
reduce requests by an order of
magnitude
● Data that's not cloud optimized, or
that is read with out-of-the-box
parameters, produces a lot of I/O
What could we do with CO-HDF?
Xpublish, Kerchunk, SlideRule, happy researchers
Improving performance of HDF5 in the cloud is key to
enabling science at scale.
Thanks!
