Proprietary and Confidential. Copyright 2018, The HDF Group.
ESIP 2023:
Accessing HDF5 data in
the cloud with HSDS
John Readey
jreadey@hdfgroup.org
2
Overview
• Overview of HSDS
• Importing HDF5 files
• ICESat-2 Case Study
• What's next
• Questions
3
HDF5 for the Cloud
• The HDF5 library has been around 20+ years, but is not optimized for cloud-native
applications
• Expects to access a filesystem (doesn’t work well with object storage)
• No remote API
• No way to scale beyond one-thread/one-process (other than using MPI)
• No support for multi-writer/multi-reader
4
Cloud Storage Models
• AWS S3 (or Azure Blob Storage, or Google Cloud Storage) is a natural choice for large data collections
• Redundant
• Can be accessed by multiple clients (compared with EBS volumes)
• Pay-as-you-go pricing (pay for just what you use)
• Low cost – compare 500 TB for one month:
• S3 - $11K
• EBS (EC2 attached volume, non-SSD) - $23K
• EBS (EC2 attached volume, SSD) - $51K
• EFS (sharable POSIX-compatible storage) - $41K
5
S3 Challenges for HDF5 data
• The HDF5 library either doesn't work at all (for writing) or can be very slow (for reading)
• The library was designed assuming POSIX compatibility & low latency
• Read operations can fetch small blocks of data at random locations
• Latency costs add up
• Writing to S3 requires the entire file to be rewritten (not practical for files of any size)
• Best throughput with S3 comes from ~16 in-flight requests with relatively large request sizes
• These considerations led us to develop an entirely new paradigm for persisting HDF5 data – the HDF object storage schema
6
HDF Cloud Native Schema
Big Idea: Map individual HDF5
objects (datasets, groups,
chunks) as Object Storage
Objects
• Limit maximum size of any object
• Support parallelism for read/write
• Only data that is modified needs to be
updated
• Multiple clients can be reading/updating
the same “file”
• Don’t need to manage free space
Legend:
• Dataset is partitioned into chunks
• Each chunk stored as an object (file)
• Dataset meta data (type, shape,
attributes, etc.) stored in a separate
object (as JSON text)
Why a sharded data format?
Each chunk (heavy outlines in the original figure) gets persisted as a separate object
7
Sharded format example
root_obj_id/
    group.json
obj1_id/
    group.json
obj2_id/
    dataset.json
    0_0
    0_1
obj3_id/
    dataset.json
    0_0_2
    0_0_3
Observations:
• Metadata is stored as JSON
• Chunk data stored as binary blobs
• Self-explanatory
• One HDF5 file can translate to lots of
objects
• Flat hierarchy – supports HDF5
multilinking
• Can limit maximum size of an object
• Can be used with Posix or object
storage
Schema is documented here:
https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.md
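As a rough illustration of what the JSON metadata objects contain (field names here are simplified placeholders; the schema document linked above is authoritative), a dataset.json object might look something like this:

```python
# Illustrative sketch only -- simplified from the documented schema linked above.
# A dataset.json object stores the dataset's metadata; the chunk objects (0_0, 0_1, ...)
# hold the binary data itself.
dataset_json = {
    "id": "d-obj2_id",                                   # placeholder dataset id
    "type": {"class": "H5T_FLOAT", "base": "H5T_IEEE_F64LE"},
    "shape": {"class": "H5S_SIMPLE", "dims": [4, 8]},
    "layout": {"class": "H5D_CHUNKED", "dims": [4, 4]},  # maps to chunk objects 0_0, 0_1
    "attributes": {"units": "meters"},                   # attributes stored inline as JSON
}
```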
8
Introduction to HSDS
While it's possible to read/write the storage format directly, it's easier and more performant to use HSDS (Highly Scalable Data Service), a REST-based service for HDF.
Note: in addition, there are serverless options for accessing the cloud-native format, but we'll focus on HSDS for this talk
Software is available at:
https://github.com/HDFGroup/hsds
9
Server Features
• Simple + familiar API
• Clients can interact with service using REST API
• SDKs provide language specific interface (e.g. h5pyd for Python)
• Can read/write just the data they need (as opposed to transferring entire files)
• Support for compression
• Container based
• Run in Docker or Kubernetes
• Scalable performance:
• Can cache recently accessed data in RAM
• Can parallelize requests across multiple nodes
• More nodes ➔ better performance
• Cluster based – any number of machines can be used to constitute the server
• Multiple clients can read/write to same data source
• No limit to the amount of data that can be stored by the service
10
Architecture
Legend:
• Client: Any user of the service
• Load balancer – distributes requests to Service nodes
• Service Nodes – process requests from clients (with help from Data Nodes)
• Data Nodes – each responsible for a partition of the object store
• Object Store: Base storage service (e.g. AWS S3)
11
HSDS Platforms
HSDS can be run on most container management systems (e.g. Docker or Kubernetes), using different supported storage systems (e.g. AWS S3, Azure Blob Storage, or a POSIX filesystem).
12
Accessing HSDS
• REST API
• Language neutral
• Support parallel and async
• H5pyd
• Python 3.8 and up
• Compatible with h5py
• No parallel/async support
• HDF5 Library + REST VOL
• C/C++
• Parallel ops via H5Dread_multi/H5Dwrite_multi calls
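As a concrete illustration of the h5pyd path, here is a minimal sketch of reading a hyperslab through HSDS, assuming a locally running server and placeholder domain/dataset names; the same calls work unchanged with h5py against a local file:

```python
import h5pyd  # pip install h5pyd -- h5py-compatible client for HSDS

# Domain path, endpoint, and dataset name are placeholders for illustration.
# The endpoint can also come from the HS_ENDPOINT environment variable or ~/.hscfg.
with h5pyd.File("/shared/sample/tall.h5", "r",
                endpoint="http://localhost:5101") as f:
    dset = f["/g1/g1.1/dset1.1.1"]
    print(dset.shape, dset.dtype)
    data = dset[0:4, 0:4]   # only the selected hyperslab is read from the server
```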
13
Command Line Interface (CLI)
• Accessing HDF via a service means one can’t utilize usual shell commands: ls, rm, chmod, etc.
• Command line tools are a set of simple apps to use instead:
• hsinfo: display server version, connect info
• hsls: list content of folder or file
• hstouch: create folder or file
• hsdel: delete a file
• hsload: upload an HDF5 file
• hsget: download content from server to an HDF5 file
• hsacl: create/list/update ACLs (Access Control Lists)
• hsdiff: compare HDF5 file with sharded representation
• Implemented in Python & uses h5pyd
• Note: data is round-trip-able:
• HDF5 File -> hsload -> HSDS store -> hsget -> (equivalent) HDF5 file
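Because these tools are thin wrappers over h5pyd, the same operations can be scripted directly. A minimal sketch of an hsls-style folder listing, assuming a configured server endpoint and a placeholder folder path:

```python
import h5pyd

# List the domains ("files") and subfolders under an HSDS folder,
# roughly what `hsls /shared/` would print.  "/shared/" is a placeholder path.
folder = h5pyd.Folder("/shared/")
for name in folder:
    print(name)
folder.close()
```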
14
Enabling HSDS access to HDF5 files
• Using hsload, HDF5 files can be transcoded to the sharded schema
• Will need roughly the same amount of storage as the original file(s)
• Multiple files can be aggregated into one HSDS "file"
• To save time and extra storage costs, use "hsload --link". This will copy just the metadata and create pointers to each chunk in the source file. The storage needed for metadata is typically <1% of the overall file size
• To save even more time, use "hsload --fastlink". This will copy the metadata, while chunk locations will be determined as needed. As chunk locations are found, they will be stored with the metadata for faster access in future requests
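The pointers recorded by hsload --link are essentially the byte offset and size of each chunk in the source HDF5 file. As a sketch of the underlying idea (not hsload's actual code), h5py can enumerate that chunk index directly; the file and dataset names below are placeholders:

```python
import h5py

# Illustrative: list the byte offset/size of every chunk in a source HDF5 dataset --
# the kind of information a linked HSDS "file" stores instead of copying the data.
with h5py.File("granule.h5", "r") as f:          # placeholder file name
    dset = f["/gt1l/heights/h_ph"]               # placeholder dataset path
    for i in range(dset.id.get_num_chunks()):
        info = dset.id.get_chunk_info(i)         # StoreInfo: chunk_offset, byte_offset, size
        print(info.chunk_offset, info.byte_offset, info.size)
```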
15
Case Study – NASA ICESat-2 data
• The ICESat-2 mission collects precise (~4 mm resolution) elevation measurements as the satellite orbits over polar and non-polar regions
• Level 2 data is stored as HDF5 files on AWS S3
• Roughly 3 GB and 1000 datasets per file
• As is often the case, the chunk size used (mostly 80 KB/chunk) is not optimal for cloud access
• Performance using the ros3 VFD, s3fs, or HSDS linked files is not very good
• Re-chunking PBs of HDF5 files is not very practical
• What to do?
16
Hyper-chunking
• In v0.8.0 of HSDS, the "hyper-chunking" feature was added
• The idea of hyper-chunking is to use a larger chunk size than the source HDF5 dataset
• Each HSDS chunk is composed of multiple adjacent HDF5 chunks in the HDF5 dataset
• The target size for an HSDS chunk is 4-8 MB. In the case of ICESat-2 data, this would mean ~100 HDF5 chunks for each hyper-chunk
• This reduces the number of requests the DN nodes need to handle
• The DN nodes determine the set of actual S3 rangegets needed to retrieve the HDF5 chunks
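As a back-of-the-envelope sketch of the sizing logic (not HSDS's actual implementation), the hyper-chunk shape can be formed by aggregating enough source chunks along the first dimension to reach the 4-8 MB target:

```python
import math

def hyper_chunk_shape(src_chunk, dtype_size, target_bytes=8 * 1024 * 1024):
    """Illustrative only: aggregate source HDF5 chunks along the first dimension
    until the hyper-chunk reaches roughly target_bytes."""
    chunk_bytes = dtype_size * math.prod(src_chunk)
    factor = max(1, target_bytes // chunk_bytes)   # source chunks per hyper-chunk
    return (src_chunk[0] * factor,) + tuple(src_chunk[1:]), factor

# Example: 80 KB chunks of float64 (10,000 elements each), upper end of the 4-8 MB target
shape, factor = hyper_chunk_shape((10_000,), dtype_size=8)
print(shape, factor)   # -> (1040000,), 104 -- about 100 source chunks per hyper-chunk
```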
17
Intelligent Rangegets
• The story so far...
• A DN node gets a request to read an HSDS hyper-chunk composed of multiple HDF5 chunks
• It's often the case that the HDF5 chunks are adjacent (or nearly so) in the HDF5 file
• The DN node will group nearby rangegets so that one request can be made rather than two or more (the extra data has minimal impact on latency to S3)
• In any case, requests to S3 are sent asynchronously, which greatly improves performance
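A simplified sketch of the coalescing idea (illustrative, not the DN node's actual code): byte ranges whose gaps are small enough that reading the extra bytes is cheaper than another S3 round trip get merged into a single range GET.

```python
def coalesce_ranges(ranges, max_gap=1024 * 1024):
    """Merge (offset, length) byte ranges that lie within max_gap bytes of each
    other so they can be fetched with one S3 range GET.  Illustrative only."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, offset + length - prev_off))
        else:
            merged.append((offset, length))
    return merged

# Three nearby 80 KB chunks collapse into one request; the distant chunk stays separate.
print(coalesce_ranges([(0, 80_000), (81_000, 80_000), (162_000, 80_000), (50_000_000, 80_000)]))
# -> [(0, 242000), (50000000, 80000)]
```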
18
Benchmarking
• To compare relative performance for accessing ICESat-2 data on S3, we created a
benchmark that replicates typical access patterns a science user would perform
• Python benchmark can be found here:
https://github.com/HDFGroup/nasa_cloud/blob/main/benchmarks/python/icesat2_selection.py
• Based on the file path argument, data can be accessed as:
• Local file on SSD
• On S3 using ros3 VFD
• On S3 using s3fs
• Using HSDS (which reads from S3)
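A condensed sketch of how the benchmark's access modes map to code (the actual benchmark is at the link above; function and argument names here are illustrative):

```python
import h5py    # local files and the ros3 VFD
import h5pyd   # HSDS
import s3fs    # S3 via a filesystem-like interface

def open_granule(path, method):
    """Open an ICESat-2 granule using the requested access method.
    Illustrative sketch only; see the linked benchmark for the real code."""
    if method == "local":
        return h5py.File(path, "r")                    # local file on SSD
    if method == "ros3":
        # ros3 VFD reads directly over HTTP(S); path would be an https:// S3 URL
        return h5py.File(path, "r", driver="ros3")
    if method == "s3fs":
        fs = s3fs.S3FileSystem(anon=True)              # anonymous access is an assumption
        return h5py.File(fs.open(path, "rb"), "r")     # path is an s3:// URL here
    if method == "hsds":
        return h5pyd.File(path, "r")                   # path is an HSDS domain path
    raise ValueError(f"unknown access method: {method}")
```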
19
Results
• Local file: 0.7 s
• ROS3: 73.7 s
• S3FS: 36.7 s
• HSDS native: 19.3 s
• HSDS link (w/out hyper-chunking): 42.9 s
• HSDS link (w/ hyper-chunking): 37.8 s
20
Conclusions
• Performance of the ROS3 VFD is not as good as S3FS
• There are several opportunities for improvement though
• Data transcoded to the sharded format has more than a 2x performance advantage
• But it takes time and requires double the storage (if you are keeping the original files)
• Hyper-chunking performance was only modestly better than without hyper-chunking
• Additional work is needed. We would like to see link performance closer to sharded times
21
Acknowledgement
• This work was supported by NASA contract 80NSSC22K1744
22
Questions?
