Overview of Parallel HDF5 and Performance Tuning in HDF5 Library

Elena Pourmal, Albert Cheng
Science Data Processing
Workshop
February 27, 2002
-1-
Outline
- Overview of Parallel HDF5 Design
- Setting up Parallel Environment
- Programming model for
  - Creating and Accessing a File
  - Creating and Accessing a Dataset
  - Writing and Reading Hyperslabs
- Parallel Tutorial available at
  - http://hdf.ncsa.uiuc.edu/HDF5/doc/Tutor
- Performance tuning in HDF5
-2-
PHDF5 Requirements
- PHDF5 files compatible with serial HDF5 files
  - Shareable between different serial or parallel platforms
- Single file image to all processes
  - One-file-per-process design is undesirable
    - Expensive post processing
    - Not usable by a different number of processes
- Standard parallel I/O interface
  - Must be portable to different platforms
-3-
PHDF5 Initial Target
- Support for MPI programming
- Not for shared memory programming
  - Threads
  - OpenMP
- Some experiments with
  - Thread-safe support for Pthreads
  - OpenMP if called correctly
-4-
Implementation Requirements
- No use of threads
  - Not commonly supported (1998)
- No reserved process
  - May interfere with parallel algorithms
- No spawned processes
  - Not commonly supported even now
-5-
PHDF5 Implementation Layers

[Layer diagram]
  User Applications:      Parallel Application (one per process)
  HDF library:            Parallel HDF5 + MPI
  Parallel I/O layer:     MPI-IO
  Parallel file systems:  O2K Unix I/O, SP GPFS, TFLOPS PFS

-6-
Programming Restrictions
- Most PHDF5 APIs are collective
  - MPI definition of collective: all processes of the communicator must participate in the right order
- PHDF5 opens a parallel file with a communicator
  - Returns a file-handle
  - Future access to the file via the file-handle
  - All processes must participate in collective PHDF5 APIs
  - Different files can be opened via different communicators
-7-
Examples of PHDF5 API
- Examples of PHDF5 collective APIs
  - File operations: H5Fcreate, H5Fopen, H5Fclose
  - Object creation: H5Dcreate, H5Dopen, H5Dclose
  - Object structure: H5Dextend (increase dimension sizes)
- Array data transfer can be collective or independent
  - Dataset operations: H5Dwrite, H5Dread
-8-
What Does PHDF5 Support?
- After a file is opened by the processes of a communicator
  - All parts of the file are accessible by all processes
  - All objects in the file are accessible by all processes
  - Multiple processes can write to the same data array
  - Each process can write to an individual data array
-9-
PHDF5 API Languages
- C and F90 language interfaces
- Platforms supported: IBM SP2 and SP3, Intel TFLOPS, SGI Origin 2000, HP-UX 11.00 System V, Alpha Compaq Clusters
- 10 -
Creating and Accessing a File
Programming model
- HDF5 uses an access template object to control the file access mechanism
- General model to access an HDF5 file in parallel:
  - Set up the access template
  - Open the file
  - Close the file
- 11 -
Setup access template
Each process of the MPI communicator creates an access template and sets it up with MPI parallel access information.

C:
  herr_t H5Pset_fapl_mpio(hid_t plist_id,
                          MPI_Comm comm, MPI_Info info);

F90:
  h5pset_fapl_mpio_f(plist_id, comm, info, hdferr)
    integer(hid_t) :: plist_id
    integer        :: comm, info, hdferr
- 12 -
C Example
Parallel File Create

  comm = MPI_COMM_WORLD;
  info = MPI_INFO_NULL;
  /*
   * Initialize MPI
   */
  MPI_Init(&argc, &argv);
  /*
   * Set up file access property list for MPI-IO access
   */
  plist_id = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(plist_id, comm, info);

  file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC,
                      H5P_DEFAULT, plist_id);
  /*
   * Close the file.
   */
  H5Fclose(file_id);
  MPI_Finalize();
- 13 -
F90 Example
Parallel File Create

  comm = MPI_COMM_WORLD
  info = MPI_INFO_NULL
  CALL MPI_INIT(mpierror)
  !
  ! Initialize FORTRAN predefined datatypes
  CALL h5open_f(error)
  !
  ! Setup file access property list for MPI-IO access.
  CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)
  CALL h5pset_fapl_mpio_f(plist_id, comm, info, error)
  !
  ! Create the file collectively.
  CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, &
                   error, access_prp = plist_id)
  !
  ! Close the file.
  CALL h5fclose_f(file_id, error)
  !
  ! Close FORTRAN interface
  CALL h5close_f(error)
  CALL MPI_FINALIZE(mpierror)
- 14 -
Creating and Opening Dataset
- All processes of the MPI communicator open/close a dataset by a collective call
  - C: H5Dcreate or H5Dopen; H5Dclose
  - F90: h5dcreate_f or h5dopen_f; h5dclose_f
- All processes of the MPI communicator extend a dataset with unlimited dimensions before writing to it
  - C: H5Dextend
  - F90: h5dextend_f
- 15 -
C Example
Parallel Dataset Create

  file_id = H5Fcreate(…);
  /*
   * Create the dataspace for the dataset.
   */
  dimsf[0] = NX;
  dimsf[1] = NY;
  filespace = H5Screate_simple(RANK, dimsf, NULL);

  /*
   * Create the dataset with default properties collectively.
   */
  dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,
                      filespace, H5P_DEFAULT);

  H5Dclose(dset_id);
  /*
   * Close the file.
   */
  H5Fclose(file_id);
- 16 -
F90 Example
Parallel Dataset Create

  CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, &
                   error, access_prp = plist_id)

  CALL h5screate_simple_f(rank, dimsf, filespace, error)
  !
  ! Create the dataset with default properties.
  !
  CALL h5dcreate_f(file_id, "dataset1", H5T_NATIVE_INTEGER, &
                   filespace, dset_id, error)
  !
  ! Close the dataset.
  CALL h5dclose_f(dset_id, error)
  !
  ! Close the file.
  CALL h5fclose_f(file_id, error)
- 17 -
Accessing a Dataset
- All processes that have opened the dataset may do collective I/O
- Each process may do an independent and arbitrary number of data I/O access calls
  - C: H5Dwrite and H5Dread
  - F90: h5dwrite_f and h5dread_f
- 18 -
Accessing a Dataset
Programming model
- Create and set dataset transfer property
  - C: H5Pset_dxpl_mpio
    - H5FD_MPIO_COLLECTIVE
    - H5FD_MPIO_INDEPENDENT
  - F90: h5pset_dxpl_mpio_f
    - H5FD_MPIO_COLLECTIVE_F
    - H5FD_MPIO_INDEPENDENT_F
- Access dataset with the defined transfer property
- 19 -
C Example: Collective write

  /*
   * Create property list for collective dataset write.
   */
  plist_id = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

  status = H5Dwrite(dset_id, H5T_NATIVE_INT,
                    memspace, filespace, plist_id, data);
- 20 -
F90 Example: Collective write

  ! Create property list for collective dataset write
  !
  CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
  CALL h5pset_dxpl_mpio_f(plist_id, &
                          H5FD_MPIO_COLLECTIVE_F, error)
  !
  ! Write the dataset collectively.
  !
  CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, &
                  error, &
                  file_space_id = filespace, &
                  mem_space_id = memspace, &
                  xfer_prp = plist_id)
- 21 -
Writing and Reading Hyperslabs
Programming model
- Each process defines memory and file hyperslabs
- Each process executes a partial write/read call
  - Collective calls
  - Independent calls
- 22 -
Hyperslab Example 1
Writing dataset by rows

[Figure: the file dataset is split into four blocks of contiguous rows, one block per process P0-P3]
- 23 -
Writing by rows
Output of h5dump utility
HDF5 "SDS_row.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
DATA {
10, 10, 10, 10, 10,
10, 10, 10, 10, 10,
11, 11, 11, 11, 11,
11, 11, 11, 11, 11,
12, 12, 12, 12, 12,
12, 12, 12, 12, 12,
13, 13, 13, 13, 13,
13, 13, 13, 13, 13
}
}
}
}
- 24 -
Example 1
Writing dataset by rows

[Figure: file layout and the P1 memory space, annotated with offset[0], offset[1], count[0], and count[1]]

  count[0]  = dimsf[0]/mpi_size;
  count[1]  = dimsf[1];
  offset[0] = mpi_rank * count[0];
  offset[1] = 0;
- 25 -
C Example 1

  /*
   * Each process defines dataset in memory and
   * writes it to the hyperslab in the file.
   */
  count[0]  = dimsf[0]/mpi_size;
  count[1]  = dimsf[1];
  offset[0] = mpi_rank * count[0];
  offset[1] = 0;
  memspace  = H5Screate_simple(RANK, count, NULL);

  /*
   * Select hyperslab in the file.
   */
  filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                      offset, NULL, count, NULL);
- 26 -
Hyperslab Example 2
Writing dataset by columns

[Figure: P0 and P1 write interleaved columns of the file dataset]
- 27 -
Writing by columns
Output of h5dump utility
HDF5 "SDS_col.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
DATA {
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200
}
}
}
}
- 28 -
Example 2
Writing dataset by column

[Figure: P0 and P1 memory spaces (dimsm[0] x dimsm[1]) mapped to file columns, annotated with offset[1], block[0], block[1], and stride[1]]
- 29 -
C Example 2

  /*
   * Each process defines hyperslab in
   * the file
   */
  count[0]  = 1;
  count[1]  = dimsm[1];
  offset[0] = 0;
  offset[1] = mpi_rank;
  stride[0] = 1;
  stride[1] = 2;
  block[0]  = dimsf[0];
  block[1]  = 1;
  /*
   * Each process selects hyperslab.
   */
  filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                      offset, stride, count, block);
- 30 -
Hyperslab Example 3
Writing dataset by pattern

[Figure: processes P0-P3 each write a regular interleaved pattern of elements across the file dataset]
- 31 -
Writing by Pattern
Output of h5dump utility
HDF5 "SDS_pat.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
DATA {
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4
}
}
}
}
- 32 -
Example 3
Writing dataset by pattern

[Figure: P2 memory buffer mapped to the file, annotated with offset[0], offset[1], count[1], stride[0], and stride[1]]

  offset[0] = 0;
  offset[1] = 1;
  count[0]  = 4;
  count[1]  = 2;
  stride[0] = 2;
  stride[1] = 2;
- 33 -
C Example 3: Writing by pattern

  /* Each process defines dataset in memory and
   * writes it to the hyperslab in the file.
   */
  count[0]  = 4;
  count[1]  = 2;
  stride[0] = 2;
  stride[1] = 2;
  if(mpi_rank == 0) {
      offset[0] = 0;
      offset[1] = 0;
  }
  if(mpi_rank == 1) {
      offset[0] = 1;
      offset[1] = 0;
  }
  if(mpi_rank == 2) {
      offset[0] = 0;
      offset[1] = 1;
  }
  if(mpi_rank == 3) {
      offset[0] = 1;
      offset[1] = 1;
  }
- 34 -
Hyperslab Example 4
Writing dataset by chunks

[Figure: the file dataset is divided into four chunks in a 2x2 layout, one chunk per process P0-P3]
- 35 -
Writing by Chunks
Output of h5dump utility
HDF5 "SDS_chnk.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
DATA {
1, 1, 2, 2,
1, 1, 2, 2,
1, 1, 2, 2,
1, 1, 2, 2,
3, 3, 4, 4,
3, 3, 4, 4,
3, 3, 4, 4,
3, 3, 4, 4
}
}
}
}
- 36 -
Example 4
Writing dataset by chunks

[Figure: P2 memory buffer mapped to one file chunk, annotated with offset[0], offset[1], block[0], block[1], chunk_dims[0], and chunk_dims[1]]

  block[0]  = chunk_dims[0];
  block[1]  = chunk_dims[1];
  offset[0] = chunk_dims[0];
  offset[1] = 0;
- 37 -
C Example 4
Writing by chunks

  count[0]  = 1;
  count[1]  = 1;
  stride[0] = 1;
  stride[1] = 1;
  block[0]  = chunk_dims[0];
  block[1]  = chunk_dims[1];
  if(mpi_rank == 0) {
      offset[0] = 0;
      offset[1] = 0;
  }
  if(mpi_rank == 1) {
      offset[0] = 0;
      offset[1] = chunk_dims[1];
  }
  if(mpi_rank == 2) {
      offset[0] = chunk_dims[0];
      offset[1] = 0;
  }
  if(mpi_rank == 3) {
      offset[0] = chunk_dims[0];
      offset[1] = chunk_dims[1];
  }
- 38 -
Performance Tuning in HDF5

- 39 -
File Level Knobs
- H5Pset_meta_block_size
- H5Pset_alignment
- H5Pset_fapl_split
- H5Pset_cache
- H5Pset_fapl_mpio
- 40 -
H5Pset_meta_block_size
- Sets the minimum metadata block size allocated for metadata aggregation. The larger the size, the fewer the number of small data objects in the file.
- The aggregated block of metadata is usually written in a single write action and always in a contiguous block, potentially significantly improving library and application performance.
- Default is 2 KB (a short C sketch of setting it follows below)
- 41 -
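A minimal C sketch of setting this knob on a file access property list; the 1 MB block size and the file name are illustrative choices, not values from the slides:

  #include "hdf5.h"

  int main(void)
  {
      /* The metadata block size is a file access property */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

      /* Aggregate metadata into blocks of at least 1 MB
         (illustrative value; the library default is 2 KB) */
      H5Pset_meta_block_size(fapl, (hsize_t)(1024 * 1024));

      hid_t file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      H5Fclose(file_id);
      return 0;
  }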
H5Pset_alignment
- Sets the alignment properties of a file access property list so that any file object greater than or equal in size to threshold bytes will be aligned on an address which is a multiple of alignment. This can significantly improve performance on file systems that are sensitive to data block alignment.
- Default values for threshold and alignment are one, implying no alignment. Generally the default values will result in the best performance for single-process access to the file. For MPI-IO and other parallel systems, choose an alignment which is a multiple of the disk block size (see the sketch below).
- 42 -
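A hedged C sketch of the call, assuming it is placed where a file access property list is being set up; the 4 KB threshold and 1 MB alignment are example values only and should be matched to the actual file system block size:

  /* Align any object of 4 KB or more on a 1 MB boundary
     (illustrative values; use a multiple of the disk block size) */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_alignment(fapl, (hsize_t)4096, (hsize_t)(1024 * 1024));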
H5Pset_fapl_split
- Sets the file driver to store metadata and raw data in two separate files: a metadata file and a raw data file.
- Significant I/O improvement if the metadata file is stored on a Unix file system (good for small I/O) while the raw data file is stored on a parallel file system (good for large I/O).
- Default is no split (a sketch of enabling the split driver follows below).
- 43 -
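A sketch of enabling the split driver; the ".meta" and ".raw" extensions and the default access properties for the two halves are illustrative choices:

  /* Metadata goes to "<name>.meta", raw data to "<name>.raw",
     each opened with default access properties */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_split(fapl, ".meta", H5P_DEFAULT, ".raw", H5P_DEFAULT);
  hid_t file_id = H5Fcreate("split_example.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, fapl);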
Writing contiguous dataset on Tflop at LANL
(MB/sec, various buffer sizes)

[Figure: measured write bandwidth in MB/sec (y-axis, up to about 20) versus number of processes (2, 4, 8, 16) for three configurations: HDF5 writing to a standard HDF5 file, writing directly with MPI I/O (no HDF5), and HDF5 writing with the split driver. Each process writes 10 MB of data.]
- 44 -
H5Pset_cache
- Sets:
  - the number of elements (objects) in the metadata cache
  - the number of elements, the total number of bytes, and the preemption policy value in the raw data chunk cache
- The right values depend on the individual application access pattern
- Default for the preemption value is 0.75 (see the sketch below)
- 45 -
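A sketch of enlarging the raw data chunk cache to 16 MB, as in the chunk cache test later in this deck; the metadata-cache and slot counts are illustrative, and the exact parameter types differ slightly between HDF5 releases:

  /* Make the chunk cache large enough to hold one 16 MB chunk */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_cache(fapl,
               8192,              /* metadata cache elements (illustrative) */
               521,               /* chunk cache slots (illustrative)       */
               16 * 1024 * 1024,  /* chunk cache size in bytes              */
               0.75);             /* preemption policy value (default 0.75) */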
H5Pset_fapl_mpio
- MPI-IO hints can be passed to the MPI-IO layer via the Info parameter (see the sketch below)
- E.g., telling ROMIO to use two-phase I/O speeds up collective I/O on the ASCI Red machine
- 46 -
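A sketch of passing a hint through the Info parameter; the "romio_cb_write" key asks ROMIO to enable collective buffering (two-phase I/O) for writes, and hint names like this are MPI-IO implementation specific:

  /* Build an Info object carrying an MPI-IO hint and hand it to the driver */
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");  /* ROMIO-specific hint */

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);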
Data Transfer Level Knobs
- H5Pset_buffer
- H5Pset_sieve_buf_size
- H5Pset_hyper_cache
- 47 -
H5Pset_buffer
- Sets the maximum size for the type conversion buffer and background buffer used during data transfer. The bigger the size, the better the performance.
- Default is 1 MB (see the sketch below).
- 48 -
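A sketch on a dataset transfer property list; the 4 MB size is an illustrative value, and passing NULL lets the library allocate the buffers itself:

  /* Raise the type conversion / background buffer limit to 4 MB
     (default is 1 MB) */
  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_buffer(dxpl, 4 * 1024 * 1024, NULL, NULL);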
H5Pset_sieve_buf_size
- Sets the maximum size of the data sieve buffer. The bigger the size, the fewer I/O requests issued for raw data access.
- Default is 64 KB (see the sketch below).
- 49 -
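A sketch; the 1 MB value is illustrative (the default noted above is 64 KB), and the setting lives on the file access property list:

  /* Use a 1 MB sieve buffer for raw data I/O */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_sieve_buf_size(fapl, 1024 * 1024);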
H5Pset_hyper_cache
- Indicates whether to cache hyperslab blocks during I/O, a process which can significantly increase I/O speeds.
- Default is to cache blocks with no limit on block size for serial I/O and to not cache blocks for parallel I/O.
- 50 -
Chunk Cache Effect
by H5Pset_cache
- Write one integer dataset of 256x256x1024 (256 MB)
- Using chunks of 256x16x1024 (16 MB)
- Two tests:
  - Default chunk cache size (1 MB)
  - Chunk cache size set to 16 MB
- 51 -
Chunk Cache
Time Definitions
- Total: time to open file, write dataset, close dataset and close file
- Dataset Write: time to write the whole dataset
- Chunk Write: best time to write a chunk
- User Time: total Unix user time of test
- System Time: total Unix system time of test
- 52 -
Chunk Cache Size Results

  Cache buffer   Chunk write   Dataset write   Total time   User time   System time
  size (MB)      time (sec)    time (sec)      (sec)        (sec)       (sec)
  1              132.58        2450.25         2453.09      14          2200.1
  16             0.376         7.83            8.27         3.45        6.21
- 53 -
Chunk Cache Size
Summary
- A big chunk cache size improves performance
- Poor performance was mostly due to increased system time
  - Many more I/O requests
  - Smaller I/O requests
- 54 -
