HDF for the Cloud
John Readey
The HDF5 data format
‱ Established 20 years ago, the HDF5 file format is the most commonly used format in Earth Science
‱ Note: NetCDF4 files are actually HDF5 “under the hood”
‱ HDF5 was designed with the (somewhat contradictory) goals of:
‱ Archival format – data that can be stored for decades
‱ Analysis ready – data that can be used directly for analytics (no conversion needed)
‱ There’s a rich set of tools and language SDKs (see the short example after this list):
‱ C/C++/Fortran
‱ Python
‱ Java, etc.
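
As a quick illustration of the Python support (and of NetCDF4 being HDF5 under the hood), here is a minimal h5py sketch; the file and dataset names are hypothetical:

import h5py

# NetCDF4 files are HDF5 under the hood, so h5py can open them directly
with h5py.File("example.nc", "r") as f:   # hypothetical file
    print(list(f.keys()))                 # top-level groups and datasets
    dset = f["temperature"]               # hypothetical dataset name
    print(dset.shape, dset.dtype)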
HDF5 File Format meets the Cloud
‱ Storing large HDF5 collections on AWS almost always means using S3:
‱ Cost effective
‱ Redundant
‱ Sharable
‱ It’s easy enough to store HDF5 files as S3 objects, but these files can’t be
read using the HDF5 library (which is expecting a POSIX filesystem)
‱ Experience using FUSE to read from S3 with the HDF5 library has not tended to work well
‱ In practice, users have been left to copy files to local disk first
‱ This has led to interest in alternative formats such as Zarr, TileDB, and
our own HSDS S3 Storage Schema (more on that later)
HDF5 meets S3 halfway
‱ For many years the HDF5 library has supported VFDs (Virtual File Drivers)
‱ VFDs are low-level plugins that can replace the standard POSIX I/O
methods with anything the developer of the VFD would like
‱ The HDF Group has developed a VFD specifically for S3 that will be included
in the next library release (coming soon!)
‱ How it works: each POSIX read call is replaced with an S3 Range GET (see the sketch after this list)
‱ Features:
‱ Can read any HDF5 file (write is not supported)
‱ No changes to the public API
‱ Compatible with higher-level libraries (h5py, netcdf, xarray, etc.)
‱ This is a first release and there are some ideas for improving performance in
subsequent releases
‱ It would be very helpful to come up with an objective set of benchmarks to
compare performance across S3VFD, HSDS, Zarr, etc.
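
To make the mechanism concrete, here is a minimal Python sketch of how a POSIX-style read maps onto an S3 Range GET, using boto3. This illustrates the idea only, not the S3VFD's actual implementation (which lives in the C library); the bucket and key names are hypothetical.

import boto3

s3 = boto3.client("s3")

def s3_read(bucket, key, offset, length):
    # Fetch `length` bytes starting at `offset`, like a POSIX pread().
    # HTTP Range headers are inclusive on both ends.
    byte_range = "bytes=%d-%d" % (offset, offset + length - 1)
    resp = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)
    return resp["Body"].read()

# e.g. read the 8-byte HDF5 signature from a (hypothetical) file on S3
signature = s3_read("my-bucket", "data/example.h5", 0, 8)
assert signature == b"\x89HDF\r\n\x1a\n"

Every metadata or chunk read the library would normally satisfy with a file seek/read becomes one such request, which is why reducing the request count matters for performance.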
Cloud Optimized HDF
‱ For anyone putting HDF5 files on S3 for in-place reading, there are a few things that
can be done to improve performance when accessed using the S3VFD (or FUSE)
‱ Most of these optimizations can be done using existing tools (e.g. h5repack)
‱ A Cloud Optimized HDF5 file is still an HDF5 file and can be downloaded and read
with the native VFD if desired
‱ Initial proposal (likely to be revised based on testing; a sketch follows this list):
‱ Use chunking for datasets larger than 1 MB
‱ Use “brick style” chunk layouts (enable slicing along any dimension)
‱ Use readily available compression filters
‱ Pack metadata at the front of the file
‱ Aggregate smaller files into larger ones
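
A hedged sketch of applying the chunking and compression items above with h5py (h5repack can achieve the same on existing files); the dataset name, shape, and chunk size are made-up examples, not recommendations:

import numpy as np
import h5py

data = np.random.rand(720, 1440, 100)   # e.g. lat x lon x time

with h5py.File("cloud_optimized.h5", "w") as f:
    f.create_dataset(
        "temperature",
        data=data,
        chunks=(90, 90, 10),    # "brick style": slicing along any
                                # dimension touches relatively few chunks
        compression="gzip",     # a readily available filter
        compression_opts=4,
    )

Roughly the same layout could be produced on an existing file with something like h5repack -l temperature:CHUNK=90x90x10 -f GZIP=4 in.h5 out.h5.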
HDF Server
‱ HSDS (now HDF Kita Server) is a REST-based service for HDF data developed by The HDF Group
‱ Think of it as HDF gone cloud native
‱ HSDS Features:
‱ Runs as a set of containers on Kubernetes – so it can scale beyond one machine
‱ Requests can be parallelized across multiple containers
‱ Feature-compatible with the HDF5 library, but an independent code base (see the client sketch after this list)
‱ Supports multiple readers/writers
‱ Uses S3 as data store
‱ Available now as part of HDF Kita Lab (our hosted Jupyter environment):
https://hdflab.hdfgroup.org
‱ Will be available on AWS Marketplace soon
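
Because HSDS is feature-compatible with the library, its Python client (h5pyd) mirrors the h5py API; a minimal sketch, assuming a running server and made-up endpoint, credentials, and domain path:

import h5pyd  # HSDS Python client, mirrors the h5py API

# Domains play the role of file paths; endpoint and credentials are hypothetical
f = h5pyd.File("/shared/sample/example.h5", "r",
               endpoint="http://hsds.example.com",
               username="reader", password="secret")

dset = f["temperature"]       # hypothetical dataset name
print(dset.shape, dset.dtype)
subset = dset[0:10, 0:10]     # the server reads only the chunks this slice touches
f.close()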
HDF Cloud Schema
Big idea: map individual HDF5 objects (datasets, groups, chunks) to object storage objects
‱ Limit maximum storage object size
‱ Support parallelism for read/write
‱ Only data that is modified needs to be
updated
‱ Multiple clients can be reading/updating
the same “file”
Legend:
‱ Dataset is partitioned into chunks
‱ Each chunk stored as an S3 object
‱ Dataset metadata (type, shape, attributes, etc.) stored in a separate object (as JSON text)
[Figure] How to store HDF5 content in S3? Each chunk (heavy outlines) gets persisted as a separate object.
Dataset JSON Example
creationProperties contains HDF5 dataset creation property list settings.
id is the object’s UUID.
layout describes the HDF5 chunk layout.
shape represents the HDF5 dataspace.
root points back to the root group.
created & lastModified are timestamps.
type represents the HDF5 datatype.
attributes holds a list of HDF5 attribute JSON objects.
{
  "creationProperties": {},
  "id": "d-9a097486-58dd-11e8-a964-0242ac110009",
  "layout": {"dims": [10], "class": "H5D_CHUNKED"},
  "root": "g-952b0bfa-58dd-11e8-a964-0242ac110009",
  "created": 1526456944,
  "lastModified": 1526456944,
  "shape": {"dims": [10], "class": "H5S_SIMPLE"},
  "type": {"base": "H5T_STD_I32LE", "class": "H5T_INTEGER"},
  "attributes": {}
}
Schema Details
‱ Key dispersal
‱ Objects are stored “flat” – no hierarchy
‱ UUIDs have a 5-character hash prefix added to the front
‱ The idea is to evenly distribute objects across S3 storage nodes to improve
performance
‱ S3 partitions objects by first few characters of the key name
‱ Each storage node is limited to about 300 req/s
‱ There’s no list of chunks
‱ The chunk key is determined by the chunk’s position in the dataspace
‱ E.g. c-<uuid>_0_0_0 is the corner chunk of a 3-dimensional dataset (see the sketch after this list)
‱ Chunk objects get created as needed on first write
‱ The schema is currently used only by HDF Server, but could just as easily be
used directly by clients (assuming that writes don’t conflict)
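
Since chunk keys are computed rather than listed, a client can locate any chunk directly; a minimal sketch of the mapping, where the hash function shown is an assumption for illustration (the schema document in the references has the authoritative rules):

import hashlib

def chunk_key(dset_uuid, chunk_indices):
    # Build the storage key for a chunk from its position in the dataspace,
    # e.g. c-<uuid>_0_0_0 for the corner chunk of a 3-dimensional dataset
    key = "c-" + dset_uuid + "_" + "_".join(str(i) for i in chunk_indices)
    # A 5-character hash prefix disperses keys across S3 partitions
    # (md5 here is an assumption; the schema defines the actual function)
    prefix = hashlib.md5(key.encode()).hexdigest()[:5]
    return prefix + "-" + key

print(chunk_key("9a097486-58dd-11e8-a964-0242ac110009", (0, 0, 0)))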
Supporting traditional HDF5 files
‱ A downside of the HDF S3 Schema is that data needs to be transmogrified
‱ Since the bulk of the data is usually the chunk data, it makes sense to
combine the ideas of the S3 Schema and the S3VFD:
‱ Convert just the metadata of the source HDF5 file to the S3 Schema
‱ Store the source file as an S3 object
‱ For data reads, the metadata provides the offset and length into the HDF5 file
‱ An S3 Range GET returns the needed data (see the sketch after this list)
‱ This approach can be used either directly or with HDF Server
‱ Compared with the pure S3VFD approach, you reduce the number of S3
requests needed
‱ Work on supporting this is planned for later this year
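
A hedged sketch of that read path: once the converted metadata supplies a chunk’s byte offset, length, and filter pipeline, a client needs only one Range GET plus decompression. The bucket, key, offset, length, and gzip filter here are assumptions for the example (the dtype and shape match the Dataset JSON Example above):

import zlib
import numpy as np
import boto3

s3 = boto3.client("s3")

def read_chunk(bucket, key, offset, length, dtype, chunk_shape):
    # Read one chunk of a traditional HDF5 file stored as an S3 object
    resp = s3.get_object(Bucket=bucket, Key=key,
                         Range="bytes=%d-%d" % (offset, offset + length - 1))
    raw = resp["Body"].read()
    raw = zlib.decompress(raw)   # assuming the chunk used the GZIP filter
    return np.frombuffer(raw, dtype=dtype).reshape(chunk_shape)

# The offset/length would come from the converted S3 Schema metadata
chunk = read_chunk("my-bucket", "data/example.h5",
                   offset=2048, length=52,
                   dtype="<i4", chunk_shape=(10,))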
References
‱ HDF Schema: https://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf
‱ SciPy 2017 talk: https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf
‱ AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/
‱ AWS S3 performance guidelines: https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Editor's Notes
  • #2: Many users of HDF5 are now migrating data archives to public or private cloud systems. The access approaches and performance characteristics of cloud storage are fundamentally different from those of traditional data storage systems because 1) the data are accessed over HTTP and 2) the data are stored in an object store and identified using unique keys. There are many different ways to organize and access data in the cloud. The HDF Group is currently exploring and developing approaches that will facilitate migration to the cloud and support many existing HDF5 data access use cases. Our goal is to protect data providers and users from disruption as their data and applications are migrated to the cloud.
  • #8: This idea has been kicking around for a while, but storing potentially millions of files on a Linux filesystem would be problematic. Using S3 as the storage vehicle is a natural fit since there’s no limit to the number of objects in a bucket. With NREL we’ve validated this approach with 50 TB of data across 27 million objects (see the AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/ )