Proprietary and Confidential. Copyright 2018, The HDF Group.
HDF for the Cloud:
New HDF Server Features
John Readey
Overview
• HDF storage schema for the cloud
• HDF Server features
• What’s new
• What’s next
• Demo
What is HDF5?
Depends on your point of view:
• a C-API
• a data model
• a file format
Let’s imagine keeping the API and data model, but with a different (cloud-friendly) storage format.
HDF Sharded Schema
Big idea: map individual HDF5 objects (datasets, groups, chunks) to objects in object storage.
Why a sharded data format?
• Limit the maximum size of any object
• Support parallelism for read/write
• Only data that is modified needs to be updated
• Multiple clients can be reading/updating the same “file”
• Don’t need to manage free space
Legend (figure):
• The dataset is partitioned into chunks
• Each chunk is stored as an object (file)
• Dataset metadata (type, shape, attributes, etc.) is stored in a separate object (as JSON text)
Each chunk (heavy outlines in the figure) gets persisted as a separate object.
Sharded format example
root_obj_id/
    group.json
obj1_id/
    group.json
obj2_id/
    dataset.json
    0_0
    0_1
obj3_id/
    dataset.json
    0_0_2
    0_0_3
Observations:
• Metadata is stored as JSON (an illustrative sketch follows below)
• Chunk data is stored as binary blobs
• Self-explanatory
• One HDF5 file can translate to lots of objects
• Flat hierarchy – supports HDF5 multilinking
• Can limit the maximum size of an object
• Can be used with POSIX or object storage
Schema is documented here:
https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.md
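As a rough illustration (not the normative schema: the field names and values below are approximations, and the linked schema document is authoritative), a dataset's metadata object might look something like this, sketched in Python:

import json

# Illustrative only: approximate shape of a dataset metadata object.
# See the schema document above for the authoritative field names.
dataset_meta = {
    "type": {"class": "H5T_FLOAT", "base": "H5T_IEEE_F32LE"},  # element type
    "shape": {"class": "H5S_SIMPLE", "dims": [400, 800]},       # dataspace
    "layout": {"class": "H5D_CHUNKED", "dims": [200, 400]},     # chunk shape
    "attributes": {"units": "K"},                               # dataset attributes
}

# The metadata object is stored as JSON text under the dataset's id,
# e.g. "obj2_id/dataset.json" in the listing above.
print(json.dumps(dataset_meta, indent=2))

# Chunk objects are stored separately as raw binary blobs keyed by chunk
# index: with the layout above, "0_1" would hold rows 0 to 199, columns 400 to 799.

Because the metadata is plain JSON text, it can be inspected with ordinary text tools, without the HDF5 library.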
Implementations of the sharded schema
A storage format specification is nice, but it would be useful to have some software that can actually read and write the format…
As it happens, we’ve created a software service that uses the
schema: HSDS (Highly Scalable Data Service)
Software is available at:
https://github.com/HDFGroup/hsds
Note: HSDS was originally developed as a NASA ACCESS 2015 project:
https://earthdata.nasa.gov/esds/competitive-programs/access/hsds
Server Features
• Simple + familiar API
• Clients can interact with the service using its REST API (see the sketch after this list)
• SDKs provide language specific interface (e.g. h5pyd for Python)
• Can read/write just the data they need (as opposed to transferring entire files)
• Support for compression
• Container based
• Run in Docker or Kubernetes or DC/OS
• Scalable performance:
• Can cache recently accessed data in RAM
• Can parallelize requests across multiple nodes
• More nodes → better performance
• Cluster based – any number of machines can be used to constitute the server
• Multiple clients can read/write to same data source
• No limit to the amount of data that can be stored by the service
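A hedged sketch of what a direct REST interaction could look like from Python with the requests package (the endpoint, domain path, dataset id, and credentials below are placeholders; check the HDF REST API documentation for the exact routes and parameters):

import requests

endpoint = "http://hsds.example.com:5101"   # hypothetical server endpoint
domain = "/shared/sample/data.h5"           # hypothetical domain ("file") path
auth = ("myuser", "mypass")                 # placeholder credentials

# Fetch the domain's root metadata as JSON
r = requests.get(endpoint + "/", params={"domain": domain}, auth=auth)
r.raise_for_status()
root_id = r.json()["root"]

# Read just a hyperslab of a dataset; no need to transfer the whole file
dset_id = "d-..."                           # placeholder dataset id
r = requests.get(f"{endpoint}/datasets/{dset_id}/value",
                 params={"domain": domain, "select": "[0:10,0:10]"},
                 auth=auth)
r.raise_for_status()
values = r.json()["value"]

In practice most clients use an SDK such as h5pyd rather than issuing these requests by hand.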
Architecture
Legend:
• Client: any user of the service
• Load balancer: distributes requests to Service Nodes
• Service Nodes: process requests from clients (with help from Data Nodes)
• Data Nodes: each responsible for a partition of the object store
• Object Store: base storage service (e.g. AWS S3)
HDF API Compatibility
The sharded storage schema captures the HDF data model, and the REST service interface is nice, but it would be great if existing HDF-based applications and libraries could use the new storage format without requiring a bunch of code changes…
Two related projects provide a solution:
• H5pyd – h5py compatible package for Python
• REST VOL – HDF5 library plugin for C/C++
H5pyd – Python client
• H5py is a popular Python package that provides a Pythonic interface to the HDF5 library
• H5pyd (for “h5py distributed”) provides an h5py-compatible API for accessing the server (see the sketch after this list)
• Pure Python – uses the requests package to make HTTP calls to the server
• Includes several extensions to h5py:
• List content in folders
• Get/Set ACLs (access control list)
• Pytables-like query interface
• H5netcdf and xarray packages will use h5pyd when http:// is prepended to the file path
• Installable from PyPI: $ pip install h5pyd
• Source code: https://github.com/HDFGroup/h5pyd
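A minimal sketch of the h5py-style interface (the endpoint, domain path, and dataset name are hypothetical; the endpoint and credentials can also be configured once via hsconfigure / ~/.hscfg instead of being passed explicitly):

import h5pyd

# Open a server "domain" much as you would an h5py.File
with h5pyd.File("/home/myuser/sample.h5", "r",
                endpoint="http://hsds.example.com:5101") as f:
    dset = f["temperature"]        # hypothetical dataset name
    print(dset.shape, dset.dtype)
    block = dset[0:10, 0:10]       # only this selection is transferred over HTTP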
Supporting the Python Analytics Stack
Many Python users don’t use h5py directly, but tools higher up the stack: h5netcdf, xarray, pandas, etc.
Local stack: Xarray → h5netcdf → h5py → HDF5 library → disk
Since h5pyd is compatible with h5py, we should be able to support the same stack for HDF Cloud.
Cloud stack: Xarray → h5netcdf → h5pyd → HDF Server
Applications can switch between local and cloud access just by changing the file path, as sketched below.
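For instance, a sketch with xarray's h5netcdf engine (the file paths are hypothetical, and routing the http:// path through h5pyd assumes an h5netcdf/h5pyd installation wired up as described above):

import xarray as xr

# Local file: read through h5netcdf -> h5py -> HDF5 library
ds_local = xr.open_dataset("sample.nc", engine="h5netcdf")

# Same data served by HDF Server: read through h5netcdf -> h5pyd
ds_cloud = xr.open_dataset("http://hsds.example.com:5101/home/myuser/sample.nc",
                           engine="h5netcdf")

# The analysis code is identical either way
print(ds_local["temperature"].mean().values)
print(ds_cloud["temperature"].mean().values)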
REST VOL Plugin
• The HDF5 VOL architecture is a plugin layer for HDF5
• The public API stays the same, but different back ends can be implemented
• The REST VOL substitutes REST API requests for file I/O actions
• C/Fortran applications should be able to run as is
• Some features are not implemented yet:
• VLEN support
• Large read/write support (selections >100 MB)
• Downloadable from: https://github.com/HDFGroup/vol-rest
Command Line Interface (CLI)
• Accessing HDF via a service means one can’t use the usual shell commands: ls, rm, chmod, etc.
• The command line tools are a set of simple apps to use instead:
• hsinfo: display server version, connect info
• hsls: list contents of a folder or file
• hstouch: create a folder or file
• hsdel: delete a file
• hsload: upload an HDF5 file
• hsget: download content from the server to an HDF5 file
• hsacl: create/list/update ACLs (Access Control Lists)
• hsdiff: compare an HDF5 file with its sharded representation
• Implemented in Python & uses h5pyd
• Note: data is round-trippable (see the sketch after this list):
• HDF5 file → hsload → HSDS store → hsget → HDF5 file
Supporting traditional HDF5 files
We’ve discussed three aspects of HDF: the data model, API, and file format. With HSDS we’ve kept the data model and API, but the file format is radically different. But maybe you have a PB or two of HDF5 files you’d like to use…
• If you have HDF5 files already stored in the cloud, they can be accessed by HDF Server
• Rather than converting the entire file to the sharded schema, just the metadata needs to be imported (typically <1% of the file)
• Dataset reads are converted to S3 range GETs on the stored file
• The hsload CLI tool has an option (--link) for loading file metadata
• It is also possible to construct a server file that aggregates multiple stored files (similar to how the HDF5 library VDS feature works)
New HSDS features
HSDS version 0.6 is coming soon…
What’s new:
• POSIX Support – Store content on regular disk drives
• Azure
• Azure Blob support – Support for Azure’s object storage format
• AKS (Azure Kubernetes) – Run in Azure’s managed Kubernetes
• Active Directory authentication – Authenticate via AD
• AWS
• Added support for AWS Lambda
• DC/OS – support for DC/OS (Apache Mesos) distributed system
• Domain checksums – verify when any content changes
• Role Based Access Control (RBAC) – manage ACLs for user groups
Complete list is here: https://github.com/HDFGroup/hsds/issues/47
HSDS Platforms
HSDS can be run on most container management systems (e.g. Docker, Kubernetes, DC/OS), using any of the supported storage systems (e.g. AWS S3, Azure Blob Storage, POSIX filesystem).
AWS Lambda Functions
• HSDS can parallelize requests across all the available backend (“DN”) nodes on the server
• AWS Lambda is a service that lets you run requests “serverless”
• Pay only for the CPU-seconds the function runs
• By incorporating Lambda, some HDF Server requests can be parallelized across 1,000 Lambda functions (equivalent to a 1,000-container server)
• Will dramatically speed up time-series selections
Kita Lab
• Kita Lab is a JupyterLab and HDF Server environment hosted by the HDF Group on AWS
• Kita Lab users can create Python notebooks that use h5pyd to connect to HDF Server
• Each user gets the equivalent of a 2-core Xeon server and 10 GB of local storage
• Users can store up to 100 GB of data on HDF Server
• Sign up here: https://www.hdfgroup.org/hdfkitalab/
Kita Lab architecture: the user logs into JupyterHub, which spawns a new container (with an EBS volume) for that user at login; notebooks then connect to HSDS running on Kubernetes, backed by an S3 bucket.
Futures
• Sometimes you’d rather do without a server and talk to the storage system directly:
• You don’t want to deal with setting up the service
• You don’t want to worry about scaling the service up and down with client load
• You don’t need the synchronization (e.g. managing multiple clients writing to the same dataset) that a service provides
• HS Direct Access will be a new VOL connector that enables this for the HDF5 library
• Will take advantage of multiple cores
• Uses same schema as HSDS (and can be used in conjunction with HSDS)
Design doc is here:
https://github.com/HDFGroup/hsds/blob/master/docs/design/direct_access/direct_access.md
Questions?
Editor's Notes
• #9: Each node is implemented as a Docker container.