Overview of Parallel HDF5 and Performance Tuning in HDF5 Library

Elena Pourmal, Albert Cheng
Science Data Processing
Workshop
February 27, 2002
-1-
Outline
- Overview of Parallel HDF5 Design
- Setting up Parallel Environment
- Programming model for
  - Creating and Accessing a File
  - Creating and Accessing a Dataset
  - Writing and Reading Hyperslabs
- Parallel Tutorial available at
  - http://hdf.ncsa.uiuc.edu/HDF5/doc/Tutor
- Performance tuning in HDF5
-2-
PHDF5 Requirements
- PHDF5 files compatible with serial HDF5 files
  - Shareable between different serial or parallel platforms
- Single file image to all processes
  - One-file-per-process design is undesirable
    - Expensive post processing
    - Not usable by a different number of processes
- Standard parallel I/O interface
  - Must be portable to different platforms
-3-
PHDF5 Initial Target
- Support for MPI programming
- Not for shared memory programming
  - Threads
  - OpenMP
- Some experiments with
  - Thread-safe support for Pthreads
  - OpenMP if called correctly
-4-
Implementation Requirements
- No use of threads
  - Not commonly supported (1998)
- No reserved process
  - May interfere with parallel algorithms
- No spawned processes
  - Not commonly supported even now
-5-
PHDF5 Implementation Layers

[Layer diagram]
  User Applications:      Parallel Application (one per process)
  HDF library:            Parallel HDF5 + MPI
  Parallel I/O layer:     MPI-IO
  Parallel file systems:  O2K Unix I/O, SP GPFS, TFLOPS PFS

-6-
Programming Restrictions
- Most PHDF5 APIs are collective
  - MPI definition of collective: all processes of the communicator must participate in the right order
- PHDF5 opens a parallel file with a communicator
  - Returns a file-handle
  - Future access to the file via the file-handle
  - All processes must participate in collective PHDF5 APIs
  - Different files can be opened via different communicators
-7-
Examples of PHDF5 API
- Examples of PHDF5 collective APIs
  - File operations: H5Fcreate, H5Fopen, H5Fclose
  - Object creation: H5Dcreate, H5Dopen, H5Dclose
  - Object structure: H5Dextend (increase dimension sizes)
- Array data transfer can be collective or independent
  - Dataset operations: H5Dwrite, H5Dread
-8-
What Does PHDF5 Support?
- After a file is opened by the processes of a communicator
  - All parts of the file are accessible by all processes
  - All objects in the file are accessible by all processes
  - Multiple processes can write to the same data array
  - Each process can write to an individual data array
-9-
PHDF5 API Languages
- C and F90 language interfaces
- Platforms supported: IBM SP2 and SP3, Intel TFLOPS, SGI Origin 2000, HP-UX 11.00 System V, Alpha Compaq Clusters
- 10 -
Creating and Accessing a File
Programming model
- HDF5 uses an access template object to control the file access mechanism
- General model to access an HDF5 file in parallel:
  - Set up the access template
  - Open the file
  - Close the file
- 11 -
Setup access template
Each process of the MPI communicator creates an access template and sets it up with MPI parallel access information.

C:
  herr_t H5Pset_fapl_mpio(hid_t plist_id,
                          MPI_Comm comm, MPI_Info info);

F90:
  h5pset_fapl_mpio_f(plist_id, comm, info, hdferr)
    integer(hid_t) :: plist_id
    integer        :: comm, info, hdferr
- 12 -
C Example
Parallel File Create

  comm = MPI_COMM_WORLD;
  info = MPI_INFO_NULL;
  /*
   * Initialize MPI
   */
  MPI_Init(&argc, &argv);
  /*
   * Set up file access property list for MPI-IO access
   */
  plist_id = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(plist_id, comm, info);

  file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC,
                      H5P_DEFAULT, plist_id);
  /*
   * Close the file.
   */
  H5Fclose(file_id);
  MPI_Finalize();
- 13 -
F90 Example
Parallel File Create

  comm = MPI_COMM_WORLD
  info = MPI_INFO_NULL
  CALL MPI_INIT(mpierror)
  !
  ! Initialize FORTRAN predefined datatypes
  CALL h5open_f(error)
  !
  ! Setup file access property list for MPI-IO access.
  CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)
  CALL h5pset_fapl_mpio_f(plist_id, comm, info, error)
  !
  ! Create the file collectively.
  CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, &
                   error, access_prp = plist_id)
  !
  ! Close the file.
  CALL h5fclose_f(file_id, error)
  !
  ! Close FORTRAN interface
  CALL h5close_f(error)
  CALL MPI_FINALIZE(mpierror)
- 14 -
Creating and Opening Dataset
- All processes of the MPI communicator open/close a dataset by a collective call
  - C: H5Dcreate or H5Dopen; H5Dclose
  - F90: h5dcreate_f or h5dopen_f; h5dclose_f
- All processes of the MPI communicator extend a dataset with unlimited dimensions before writing to it
  - C: H5Dextend
  - F90: h5dextend_f
- 15 -
C Example
Parallel Dataset Create

  file_id = H5Fcreate(…);
  /*
   * Create the dataspace for the dataset.
   */
  dimsf[0] = NX;
  dimsf[1] = NY;
  filespace = H5Screate_simple(RANK, dimsf, NULL);

  /*
   * Create the dataset with default properties collectively.
   */
  dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,
                      filespace, H5P_DEFAULT);

  H5Dclose(dset_id);
  /*
   * Close the file.
   */
  H5Fclose(file_id);
- 16 -
F90 Example
Parallel Dataset Create

  CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, &
                   error, access_prp = plist_id)

  CALL h5screate_simple_f(rank, dimsf, filespace, error)
  !
  ! Create the dataset with default properties.
  !
  CALL h5dcreate_f(file_id, "dataset1", H5T_NATIVE_INTEGER, &
                   filespace, dset_id, error)
  !
  ! Close the dataset.
  CALL h5dclose_f(dset_id, error)
  !
  ! Close the file.
  CALL h5fclose_f(file_id, error)
- 17 -
Accessing a Dataset
- All processes that have opened the dataset may do collective I/O
- Each process may do an independent and arbitrary number of data I/O access calls
  - C: H5Dwrite and H5Dread
  - F90: h5dwrite_f and h5dread_f
- 18 -
Accessing a Dataset
Programming model
- Create and set dataset transfer property
  - C: H5Pset_dxpl_mpio
    - H5FD_MPIO_COLLECTIVE
    - H5FD_MPIO_INDEPENDENT
  - F90: h5pset_dxpl_mpio_f
    - H5FD_MPIO_COLLECTIVE_F
    - H5FD_MPIO_INDEPENDENT_F
- Access dataset with the defined transfer property
- 19 -
C Example: Collective write

  /*
   * Create property list for collective dataset write.
   */
  plist_id = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

  status = H5Dwrite(dset_id, H5T_NATIVE_INT,
                    memspace, filespace, plist_id, data);
- 20 -
F90 Example: Collective write

  ! Create property list for collective dataset write
  !
  CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
  CALL h5pset_dxpl_mpio_f(plist_id, &
                          H5FD_MPIO_COLLECTIVE_F, error)
  !
  ! Write the dataset collectively.
  !
  CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, &
                  error, &
                  file_space_id = filespace, &
                  mem_space_id = memspace, &
                  xfer_prp = plist_id)
- 21 -
Writing and Reading Hyperslabs
Programming model
- Each process defines memory and file hyperslabs
- Each process executes a partial write/read call
  - Collective calls
  - Independent calls
- 22 -
Hyperslab Example 1
Writing dataset by rows

[Figure: the file dataset is split into four blocks of contiguous rows, one block per process P0-P3]
- 23 -
Writing by rows
Output of h5dump utility
HDF5 "SDS_row.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
DATA {
10, 10, 10, 10, 10,
10, 10, 10, 10, 10,
11, 11, 11, 11, 11,
11, 11, 11, 11, 11,
12, 12, 12, 12, 12,
12, 12, 12, 12, 12,
13, 13, 13, 13, 13,
13, 13, 13, 13, 13
}
}
}
}
- 24 -
Example 1
Writing dataset by rows

[Figure: file layout and the P1 memory space, annotated with offset[0], offset[1], count[0], and count[1]]

  count[0]  = dimsf[0]/mpi_size;
  count[1]  = dimsf[1];
  offset[0] = mpi_rank * count[0];
  offset[1] = 0;
- 25 -
C Example 1

  /*
   * Each process defines dataset in memory and
   * writes it to the hyperslab in the file.
   */
  count[0]  = dimsf[0]/mpi_size;
  count[1]  = dimsf[1];
  offset[0] = mpi_rank * count[0];
  offset[1] = 0;
  memspace  = H5Screate_simple(RANK, count, NULL);

  /*
   * Select hyperslab in the file.
   */
  filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                      offset, NULL, count, NULL);
- 26 -
Hyperslab Example 2
Writing dataset by columns

[Figure: P0 and P1 write interleaved columns of the file dataset]
- 27 -
Writing by columns
Output of h5dump utility
HDF5 "SDS_col.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
DATA {
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200
}
}
}
}
- 28 -
Example 2
Writing dataset by column

[Figure: P0 and P1 memory spaces (dimsm[0] x dimsm[1]) mapped to file columns, annotated with offset[1], block[0], block[1], and stride[1]]
- 29 -
C Example 2

  /*
   * Each process defines hyperslab in
   * the file
   */
  count[0]  = 1;
  count[1]  = dimsm[1];
  offset[0] = 0;
  offset[1] = mpi_rank;
  stride[0] = 1;
  stride[1] = 2;
  block[0]  = dimsf[0];
  block[1]  = 1;
  /*
   * Each process selects hyperslab.
   */
  filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                      offset, stride, count, block);
- 30 -
Hyperslab Example 3
Writing dataset by pattern

[Figure: processes P0-P3 each write a regular interleaved pattern of elements across the file dataset]
- 31 -
Writing by Pattern
Output of h5dump utility
HDF5 "SDS_pat.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
DATA {
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4
}
}
}
}
- 32 -
Example 3
Writing dataset by pattern

[Figure: P2 memory buffer mapped to the file, annotated with offset[0], offset[1], count[1], stride[0], and stride[1]]

  offset[0] = 0;
  offset[1] = 1;
  count[0]  = 4;
  count[1]  = 2;
  stride[0] = 2;
  stride[1] = 2;
- 33 -
C Example 3: Writing by pattern

  /* Each process defines dataset in memory and
   * writes it to the hyperslab in the file.
   */
  count[0]  = 4;
  count[1]  = 2;
  stride[0] = 2;
  stride[1] = 2;
  if(mpi_rank == 0) {
      offset[0] = 0;
      offset[1] = 0;
  }
  if(mpi_rank == 1) {
      offset[0] = 1;
      offset[1] = 0;
  }
  if(mpi_rank == 2) {
      offset[0] = 0;
      offset[1] = 1;
  }
  if(mpi_rank == 3) {
      offset[0] = 1;
      offset[1] = 1;
  }
- 34 -
Hyperslab Example 4
Writing dataset by chunks

[Figure: the file dataset is divided into four chunks in a 2x2 layout, one chunk per process P0-P3]
- 35 -
Writing by Chunks
Output of h5dump utility
HDF5 "SDS_chnk.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
DATA {
1, 1, 2, 2,
1, 1, 2, 2,
1, 1, 2, 2,
1, 1, 2, 2,
3, 3, 4, 4,
3, 3, 4, 4,
3, 3, 4, 4,
3, 3, 4, 4
}
}
}
}
- 36 -
Example 4
Writing dataset by chunks

[Figure: P2 memory buffer mapped to one file chunk, annotated with offset[0], offset[1], block[0], block[1], chunk_dims[0], and chunk_dims[1]]

  block[0]  = chunk_dims[0];
  block[1]  = chunk_dims[1];
  offset[0] = chunk_dims[0];
  offset[1] = 0;
- 37 -
C Example 4
Writing by chunks

  count[0]  = 1;
  count[1]  = 1;
  stride[0] = 1;
  stride[1] = 1;
  block[0]  = chunk_dims[0];
  block[1]  = chunk_dims[1];
  if(mpi_rank == 0) {
      offset[0] = 0;
      offset[1] = 0;
  }
  if(mpi_rank == 1) {
      offset[0] = 0;
      offset[1] = chunk_dims[1];
  }
  if(mpi_rank == 2) {
      offset[0] = chunk_dims[0];
      offset[1] = 0;
  }
  if(mpi_rank == 3) {
      offset[0] = chunk_dims[0];
      offset[1] = chunk_dims[1];
  }
- 38 -
Performance Tuning in HDF5

- 39 -
File Level Knobs
- H5Pset_meta_block_size
- H5Pset_alignment
- H5Pset_fapl_split
- H5Pset_cache
- H5Pset_fapl_mpio
- 40 -
H5Pset_meta_block_size
- Sets the minimum metadata block size allocated for metadata aggregation. The larger the size, the fewer the number of small data objects in the file.
- The aggregated block of metadata is usually written in a single write action and always in a contiguous block, potentially significantly improving library and application performance.
- Default is 2 KB (a short C sketch of setting it follows below)
- 41 -
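A minimal C sketch of setting this knob on a file access property list; the 1 MB block size and the file name are illustrative choices, not values from the slides:

  #include "hdf5.h"

  int main(void)
  {
      /* The metadata block size is a file access property */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

      /* Aggregate metadata into blocks of at least 1 MB
         (illustrative value; the library default is 2 KB) */
      H5Pset_meta_block_size(fapl, (hsize_t)(1024 * 1024));

      hid_t file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      H5Fclose(file_id);
      return 0;
  }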
H5Pset_alignment
- Sets the alignment properties of a file access property list so that any file object greater than or equal in size to threshold bytes will be aligned on an address which is a multiple of alignment. This can significantly improve performance on file systems that are sensitive to data block alignment.
- Default values for threshold and alignment are one, implying no alignment. Generally the default values will result in the best performance for single-process access to the file. For MPI-IO and other parallel systems, choose an alignment which is a multiple of the disk block size (see the sketch below).
- 42 -
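A hedged C sketch of the call, assuming it is placed where a file access property list is being set up; the 4 KB threshold and 1 MB alignment are example values only and should be matched to the actual file system block size:

  /* Align any object of 4 KB or more on a 1 MB boundary
     (illustrative values; use a multiple of the disk block size) */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_alignment(fapl, (hsize_t)4096, (hsize_t)(1024 * 1024));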
H5Pset_fapl_split
- Sets the file driver to store metadata and raw data in two separate files: a metadata file and a raw data file.
- Significant I/O improvement if the metadata file is stored on a Unix file system (good for small I/O) while the raw data file is stored on a parallel file system (good for large I/O).
- Default is no split (a sketch of enabling the split driver follows below).
- 43 -
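A sketch of enabling the split driver; the ".meta" and ".raw" extensions and the default access properties for the two halves are illustrative choices:

  /* Metadata goes to "<name>.meta", raw data to "<name>.raw",
     each opened with default access properties */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_split(fapl, ".meta", H5P_DEFAULT, ".raw", H5P_DEFAULT);
  hid_t file_id = H5Fcreate("split_example.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, fapl);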
Writing contiguous dataset on Tflop at LANL
(MB/sec, various buffer sizes)

[Figure: measured write bandwidth in MB/sec (y-axis, up to about 20) versus number of processes (2, 4, 8, 16) for three configurations: HDF5 writing to a standard HDF5 file, writing directly with MPI I/O (no HDF5), and HDF5 writing with the split driver. Each process writes 10 MB of data.]
- 44 -
H5Pset_cache
- Sets:
  - the number of elements (objects) in the metadata cache
  - the number of elements, the total number of bytes, and the preemption policy value in the raw data chunk cache
- The right values depend on the individual application access pattern
- Default for the preemption value is 0.75 (see the sketch below)
- 45 -
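A sketch of enlarging the raw data chunk cache to 16 MB, as in the chunk cache test later in this deck; the metadata-cache and slot counts are illustrative, and the exact parameter types differ slightly between HDF5 releases:

  /* Make the chunk cache large enough to hold one 16 MB chunk */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_cache(fapl,
               8192,              /* metadata cache elements (illustrative) */
               521,               /* chunk cache slots (illustrative)       */
               16 * 1024 * 1024,  /* chunk cache size in bytes              */
               0.75);             /* preemption policy value (default 0.75) */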
H5Pset_fapl_mpio
- MPI-IO hints can be passed to the MPI-IO layer via the Info parameter (see the sketch below)
- E.g., telling ROMIO to use two-phase I/O speeds up collective I/O on the ASCI Red machine
- 46 -
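A sketch of passing a hint through the Info parameter; the "romio_cb_write" key asks ROMIO to enable collective buffering (two-phase I/O) for writes, and hint names like this are MPI-IO implementation specific:

  /* Build an Info object carrying an MPI-IO hint and hand it to the driver */
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");  /* ROMIO-specific hint */

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);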
Data Transfer Level Knobs
- H5Pset_buffer
- H5Pset_sieve_buf_size
- H5Pset_hyper_cache
- 47 -
H5Pset_buffer
- Sets the maximum size for the type conversion buffer and background buffer used during data transfer. The bigger the size, the better the performance.
- Default is 1 MB (see the sketch below).
- 48 -
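A sketch on a dataset transfer property list; the 4 MB size is an illustrative value, and passing NULL lets the library allocate the buffers itself:

  /* Raise the type conversion / background buffer limit to 4 MB
     (default is 1 MB) */
  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_buffer(dxpl, 4 * 1024 * 1024, NULL, NULL);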
H5Pset_sieve_buf_size
- Sets the maximum size of the data sieve buffer. The bigger the size, the fewer I/O requests issued for raw data access.
- Default is 64 KB (see the sketch below).
- 49 -
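A sketch; the 1 MB value is illustrative (the default noted above is 64 KB), and the setting lives on the file access property list:

  /* Use a 1 MB sieve buffer for raw data I/O */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_sieve_buf_size(fapl, 1024 * 1024);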
H5Pset_hyper_cache
- Indicates whether to cache hyperslab blocks during I/O, a process which can significantly increase I/O speeds.
- Default is to cache blocks with no limit on block size for serial I/O and to not cache blocks for parallel I/O.
- 50 -
Chunk Cache Effect
by H5Pset_cache
- Write one integer dataset of 256x256x1024 (256 MB)
- Using chunks of 256x16x1024 (16 MB)
- Two tests:
  - Default chunk cache size (1 MB)
  - Chunk cache size set to 16 MB
- 51 -
Chunk Cache
Time Definitions
- Total: time to open file, write dataset, close dataset and close file
- Dataset Write: time to write the whole dataset
- Chunk Write: best time to write a chunk
- User Time: total Unix user time of test
- System Time: total Unix system time of test
- 52 -
Chunk Cache Size Results

  Cache buffer   Chunk write   Dataset write   Total time   User time   System time
  size (MB)      time (sec)    time (sec)      (sec)        (sec)       (sec)
  1              132.58        2450.25         2453.09      14          2200.1
  16             0.376         7.83            8.27         3.45        6.21
- 53 -
Chunk Cache Size
Summary
- A big chunk cache size improves performance
- Poor performance was mostly due to increased system time
  - Many more I/O requests
  - Smaller I/O requests
- 54 -
