HDF for the Cloud
John Readey
The HDF5 data format
‱ Established 20 years ago, the HDF5 file format is the most commonly used format in Earth Science
‱ Note: NetCDF4 files are actually HDF5 “under the hood”
‱ HDF5 was designed with the (somewhat contradictory) goals of:
‱ Archival format – data that can be stored for decades
‱ Analysis ready – data that can be used directly for analytics (no conversion needed)
‱ There’s a rich set of tools and language SDKs (see the short example after this list):
‱ C/C++/Fortran
‱ Python
‱ Java, etc.
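
As a quick illustration of the Python support (and of NetCDF4 being HDF5 under the hood), here is a minimal h5py sketch; the file and dataset names are hypothetical:

import h5py

# NetCDF4 files are HDF5 under the hood, so h5py can open them directly
with h5py.File("example.nc", "r") as f:   # hypothetical file
    print(list(f.keys()))                 # top-level groups and datasets
    dset = f["temperature"]               # hypothetical dataset name
    print(dset.shape, dset.dtype)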
HDF5 File Format meets the Cloud
‱ Storing large HDF5 collections on AWS almost always means using S3:
‱ Cost effective
‱ Redundant
‱ Sharable
‱ It’s easy enough to store HDF5 files as S3 objects, but these files can’t be
read using the HDF5 library (which is expecting a POSIX filesystem)
‱ Experience using FUSE to read from S3 with the HDF5 library has not tended to work well
‱ In practice, users have been left to copy files to local disk first
‱ This has led to interest in alternative formats such as Zarr, TileDB, and
our own HSDS S3 Storage Schema (more on that later)
HDF5 meets S3 halfway
‱ For many years the HDF5 library has supported VFDs (Virtual File Drivers)
‱ VFDs are low-level plugins that can replace the standard POSIX I/O
methods with anything the developer of the VFD would like
‱ The HDF Group has developed a VFD specifically for S3 that will be included
in the next library release (coming soon!)
‱ How it works: each POSIX read call is replaced with an S3 Range GET (see the sketch after this list)
‱ Features:
‱ Can read any HDF5 file (write is not supported)
‱ No changes to the public API
‱ Compatible with higher-level libraries (h5py, netcdf, xarray, etc.)
‱ This is a first release and there are some ideas for improving performance in
subsequent releases
‱ It would be very helpful to come up with an objective set of benchmarks to
compare performance across S3VFD, HSDS, Zarr, etc.
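
To make the mechanism concrete, here is a minimal Python sketch of how a POSIX-style read maps onto an S3 Range GET, using boto3. This illustrates the idea only, not the S3VFD's actual implementation (which lives in the C library); the bucket and key names are hypothetical.

import boto3

s3 = boto3.client("s3")

def s3_read(bucket, key, offset, length):
    # Fetch `length` bytes starting at `offset`, like a POSIX pread().
    # HTTP Range headers are inclusive on both ends.
    byte_range = "bytes=%d-%d" % (offset, offset + length - 1)
    resp = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)
    return resp["Body"].read()

# e.g. read the 8-byte HDF5 signature from a (hypothetical) file on S3
signature = s3_read("my-bucket", "data/example.h5", 0, 8)
assert signature == b"\x89HDF\r\n\x1a\n"

Every metadata or chunk read the library would normally satisfy with a file seek/read becomes one such request, which is why reducing the request count matters for performance.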
Cloud Optimized HDF
‱ For anyone putting HDF5 files on S3 for in-place reading, there are a few things that
can be done to improve performance when accessed using the S3VFD (or FUSE)
‱ Most of these optimizations can be done using existing tools (e.g. h5repack)
‱ A Cloud Optimized HDF5 file is still an HDF5 file and can be downloaded and read
with the native VFD if desired
‱ Initial proposal (likely to be revised based on testing; a sketch follows this list):
‱ Use chunking for datasets larger than 1 MB
‱ Use “brick style” chunk layouts (enable slicing along any dimension)
‱ Use readily available compression filters
‱ Pack metadata at the front of the file
‱ Aggregate smaller files into larger ones
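
A hedged sketch of applying the chunking and compression items above with h5py (h5repack can achieve the same on existing files); the dataset name, shape, and chunk size are made-up examples, not recommendations:

import numpy as np
import h5py

data = np.random.rand(720, 1440, 100)   # e.g. lat x lon x time

with h5py.File("cloud_optimized.h5", "w") as f:
    f.create_dataset(
        "temperature",
        data=data,
        chunks=(90, 90, 10),    # "brick style": slicing along any
                                # dimension touches relatively few chunks
        compression="gzip",     # a readily available filter
        compression_opts=4,
    )

Roughly the same layout could be produced on an existing file with something like h5repack -l temperature:CHUNK=90x90x10 -f GZIP=4 in.h5 out.h5.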
HDF Server
‱ HSDS (now HDF Kita Server) is a REST-based service for HDF data developed by The HDF Group
‱ Think of it as HDF gone cloud native
‱ HSDS Features:
‱ Runs as a set of containers on Kubernetes – so it can scale beyond one machine
‱ Requests can be parallelized across multiple containers
‱ Feature-compatible with the HDF5 library, but an independent code base (see the client sketch after this list)
‱ Supports multiple readers/writers
‱ Uses S3 as data store
‱ Available now as part of HDF Kita Lab (our hosted Jupyter environment):
https://hdflab.hdfgroup.org
‱ Will be available on AWS Marketplace soon
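
Because HSDS is feature-compatible with the library, its Python client (h5pyd) mirrors the h5py API; a minimal sketch, assuming a running server and made-up endpoint, credentials, and domain path:

import h5pyd  # HSDS Python client, mirrors the h5py API

# Domains play the role of file paths; endpoint and credentials are hypothetical
f = h5pyd.File("/shared/sample/example.h5", "r",
               endpoint="http://hsds.example.com",
               username="reader", password="secret")

dset = f["temperature"]       # hypothetical dataset name
print(dset.shape, dset.dtype)
subset = dset[0:10, 0:10]     # the server reads only the chunks this slice touches
f.close()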
HDF Cloud Schema
Big idea: map individual HDF5 objects (datasets, groups, chunks) to object storage objects
‱ Limit maximum storage object size
‱ Support parallelism for read/write
‱ Only data that is modified needs to be
updated
‱ Multiple clients can be reading/updating
the same “file”
Legend:
‱ Dataset is partitioned into chunks
‱ Each chunk stored as an S3 object
‱ Dataset metadata (type, shape, attributes, etc.) stored in a separate object (as JSON text)
[Figure] How to store HDF5 content in S3? Each chunk (heavy outlines) gets persisted as a separate object.
Dataset JSON Example
creationProperties contains HDF5 dataset creation property list settings.
id is the object’s UUID.
layout describes the HDF5 chunk layout.
shape represents the HDF5 dataspace.
root points back to the root group.
created & lastModified are timestamps.
type represents the HDF5 datatype.
attributes holds a list of HDF5 attribute JSON objects.
{
  "creationProperties": {},
  "id": "d-9a097486-58dd-11e8-a964-0242ac110009",
  "layout": {"dims": [10], "class": "H5D_CHUNKED"},
  "root": "g-952b0bfa-58dd-11e8-a964-0242ac110009",
  "created": 1526456944,
  "lastModified": 1526456944,
  "shape": {"dims": [10], "class": "H5S_SIMPLE"},
  "type": {"base": "H5T_STD_I32LE", "class": "H5T_INTEGER"},
  "attributes": {}
}
Schema Details
‱ Key dispersal
‱ Objects are stored “flat” – no hierarchy
‱ UUIDs have a 5-character hash prefix added to the front
‱ The idea is to evenly distribute objects across S3 storage nodes to improve
performance
‱ S3 partitions objects by first few characters of the key name
‱ Each storage node is limited to about 300 req/s
‱ There’s no list of chunks
‱ The chunk key is determined by the chunk’s position in the dataspace
‱ E.g. c-<uuid>_0_0_0 is the corner chunk of a 3-dimensional dataset (see the sketch after this list)
‱ Chunk objects get created as needed on first write
‱ The schema is currently used only by HDF Server, but could just as easily be
used directly by clients (assuming that writes don’t conflict)
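
Since chunk keys are computed rather than listed, a client can locate any chunk directly; a minimal sketch of the mapping, where the hash function shown is an assumption for illustration (the schema document in the references has the authoritative rules):

import hashlib

def chunk_key(dset_uuid, chunk_indices):
    # Build the storage key for a chunk from its position in the dataspace,
    # e.g. c-<uuid>_0_0_0 for the corner chunk of a 3-dimensional dataset
    key = "c-" + dset_uuid + "_" + "_".join(str(i) for i in chunk_indices)
    # A 5-character hash prefix disperses keys across S3 partitions
    # (md5 here is an assumption; the schema defines the actual function)
    prefix = hashlib.md5(key.encode()).hexdigest()[:5]
    return prefix + "-" + key

print(chunk_key("9a097486-58dd-11e8-a964-0242ac110009", (0, 0, 0)))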
Supporting traditional HDF5 files
‱ A downside of the HDF S3 Schema is that data needs to be transmogrified
‱ Since the bulk of the data is usually the chunk data, it makes sense to
combine the ideas of the S3 Schema and the S3VFD:
‱ Convert just the metadata of the source HDF5 file to the S3 Schema
‱ Store the source file as an S3 object
‱ For data reads, the metadata provides the offset and length into the HDF5 file
‱ An S3 Range GET returns the needed data (see the sketch after this list)
‱ This approach can be used either directly or with HDF Server
‱ Compared with the pure S3VFD approach, you reduce the number of S3
requests needed
‱ Work on supporting this is planned for later this year
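
A hedged sketch of that read path: once the converted metadata supplies a chunk’s byte offset, length, and filter pipeline, a client needs only one Range GET plus decompression. The bucket, key, offset, length, and gzip filter here are assumptions for the example (the dtype and shape match the Dataset JSON Example above):

import zlib
import numpy as np
import boto3

s3 = boto3.client("s3")

def read_chunk(bucket, key, offset, length, dtype, chunk_shape):
    # Read one chunk of a traditional HDF5 file stored as an S3 object
    resp = s3.get_object(Bucket=bucket, Key=key,
                         Range="bytes=%d-%d" % (offset, offset + length - 1))
    raw = resp["Body"].read()
    raw = zlib.decompress(raw)   # assuming the chunk used the GZIP filter
    return np.frombuffer(raw, dtype=dtype).reshape(chunk_shape)

# The offset/length would come from the converted S3 Schema metadata
chunk = read_chunk("my-bucket", "data/example.h5",
                   offset=2048, length=52,
                   dtype="<i4", chunk_shape=(10,))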
References
‱ HDF Schema: https://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf
‱ SciPy 2017 talk: https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf
‱ AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/
‱ AWS S3 performance guidelines: https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Editor's Notes
  • #2: Many users of HDF5 are now migrating data archives to public or private cloud systems. The access approaches and performance characteristics of cloud storage are fundamentally different from those of traditional data storage systems because 1) the data are accessed over HTTP and 2) the data are stored in an object store and identified using unique keys. There are many different ways to organize and access data in the cloud. The HDF Group is currently exploring and developing approaches that will facilitate migration to the cloud and support many existing HDF5 data access use cases. Our goal is to protect data providers and users from disruption as their data and applications are migrated to the cloud.
  • #8: This idea has been kicking around for a while, but storing potentially millions of files on a Linux filesystem would be problematic. Using S3 as the storage vehicle is a natural fit since there’s no limit to the number of objects in a bucket. With NREL we’ve validated this approach with 50 TB of data across 27 million objects (see the AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/ )