www.hdfgroup.org
The HDF Group
Parallel HDF5
March 4, 2015 HPC Oil & Gas Workshop
Quincey Koziol
Director of Core Software & HPC
The HDF Group
koziol@hdfgroup.org
http://bit.ly/QuinceyKoziol
1
www.hdfgroup.org
Recent Parallel HDF5 Success Story
• Performance of VPIC-IO on Bluewaters
• I/O Kernel of a Plasma Physics application
• 56 GB/s I/O rate in writing 5TB data using 5K
cores with multi-dataset write optimization
• VPIC-IO kernel running on 298,048 cores
• ~10 Trillion particles
• 291 TB, single file
• 1 GB stripe size and 160 Lustre OSTs
• 52 GB/s
• 53% of the peak performance
March 4, 2015 HPC Oil & Gas Workshop 3
www.hdfgroup.org
Outline
• Quick Intro to HDF5
• Overview of Parallel HDF5 design
• Parallel Consistency Semantics
• PHDF5 Programming Model
• Examples
• Performance Analysis
• Parallel Tools
• Details of upcoming features of HDF5
March 4, 2015 HPC Oil & Gas Workshop 4
http://bit.ly/ParallelHDF5-HPCOGW-2015
www.hdfgroup.org
QUICK INTRO TO HDF5
March 4, 2015 HPC Oil & Gas Workshop 5
www.hdfgroup.org
What is HDF5?
March 4, 2015 HPC Oil & Gas Workshop 6
• HDF5 == Hierarchical Data Format, v5
http://bit.ly/ParallelHDF5-HPCOGW-2015
• A flexible data model
• Structures for data organization and specification
• Open source software
• Works with data in the format
• Open file format
• Designed for high volume or complex data
www.hdfgroup.org
What is HDF5, in detail?
• A versatile data model that can represent very complex
data objects and a wide variety of metadata.
• An open source software library that runs on a wide
range of computational platforms, from cell phones to
massively parallel systems, and implements a high-level
API with C, C++, Fortran, and Java interfaces.
• A rich set of integrated performance features that allow
for access time and storage space optimizations.
• Tools and applications for managing, manipulating,
viewing, and analyzing the data in the collection.
• A completely portable file format with no limit on the
number or size of data objects stored.
March 4, 2015 HPC Oil & Gas Workshop 7
http://bit.ly/ParallelHDF5-HPCOGW-2015
www.hdfgroup.org
HDF5 is like …
March 4, 2015 HPC Oil & Gas Workshop 8
www.hdfgroup.org
Why use HDF5?
• Challenging data:
• Application data that pushes the limits of what can be
addressed by traditional database systems, XML
documents, or in-house data formats.
• Software solutions:
• For very large datasets, very fast access requirements,
or very complex datasets.
• To easily share data across a wide variety of
computational platforms using applications written in
different programming languages.
• That take advantage of the many open-source and
commercial tools that understand HDF5.
• Enabling long-term preservation of data.
March 4, 2015 HPC Oil & Gas Workshop 9
www.hdfgroup.org
Who uses HDF5?
• Examples of HDF5 user communities
• Astrophysics
• Astronomers
• NASA Earth Science Enterprise
• Dept. of Energy Labs
• Supercomputing Centers in US, Europe and Asia
• Synchrotrons and Light Sources in US and Europe
• Financial Institutions
• NOAA
• Engineering & Manufacturing Industries
• Many others
• For a more detailed list, visit
• http://www.hdfgroup.org/HDF5/users5.html
March 4, 2015 HPC Oil & Gas Workshop 10
www.hdfgroup.org
The HDF Group
• Established in 1988
• 18 years at University of Illinois’ National Center for
Supercomputing Applications
• 8 years as independent non-profit company:
“The HDF Group”
• The HDF Group owns HDF4 and HDF5
• HDF4 & HDF5 formats, libraries, and tools are open
source and freely available with BSD-style license
• Currently employ 37 people
• Always looking for more developers!
March 4, 2015 HPC Oil & Gas Workshop 11
www.hdfgroup.org
HDF5 Technology Platform
• HDF5 Abstract Data Model
• Defines the “building blocks” for data organization and
specification
• Files, Groups, Links, Datasets, Attributes, Datatypes,
Dataspaces
• HDF5 Software
• Tools
• Language Interfaces
• HDF5 Library
• HDF5 Binary File Format
• Bit-level organization of HDF5 file
• Defined by HDF5 File Format Specification
March 4, 2015 HPC Oil & Gas Workshop 12
www.hdfgroup.org
HDF5 Data Model
• File – Container for objects
• Groups – provide structure among objects
• Datasets – where the primary data goes
• Data arrays
• Rich set of datatype options
• Flexible, efficient storage and I/O
• Attributes, for metadata
Everything else is built essentially from
these parts.
March 4, 2015 HPC Oil & Gas Workshop 13
www.hdfgroup.org
Structures to organize objects
[Figure: an example HDF5 file with root group “/” and group “/TestData” containing datasets: two raster images (one with a palette), a 2-D array, a 3-D array, and a table of lat | lon | temp values]
March 4, 2015 HPC Oil & Gas Workshop 14
www.hdfgroup.org
HDF5 Dataset
March 4, 2015 HPC Oil & Gas Workshop 15
• HDF5 datasets organize and contain data elements.
• HDF5 datatype describes individual data elements.
• HDF5 dataspace describes the logical layout of the data elements.
[Figure: an HDF5 dataset is a multi-dimensional array of identically typed data elements; the HDF5 datatype (e.g., Integer: 32-bit, LE) specifies a single data element, and the HDF5 dataspace specifies the array dimensions (rank 3; Dim[0] = 4, Dim[1] = 5, Dim[2] = 7)]
www.hdfgroup.org
HDF5 Attributes
• Typically contain user metadata
• Have a name and a value
• Attributes “decorate” HDF5 objects
• Value is described by a datatype and a dataspace
• Analogous to a dataset, but do not support
partial I/O operations; nor can they be compressed
or extended
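For illustration, a small sketch of attaching a scalar integer attribute to an already-open dataset; the handle dset_id, the attribute name, and the value are assumptions, not taken from the slides:

hid_t aspace_id = H5Screate(H5S_SCALAR);
hid_t attr_id   = H5Acreate2(dset_id, "units_code", H5T_NATIVE_INT,
                             aspace_id, H5P_DEFAULT, H5P_DEFAULT);
int   units_code = 42;                      /* illustrative value */
H5Awrite(attr_id, H5T_NATIVE_INT, &units_code);
H5Aclose(attr_id);
H5Sclose(aspace_id);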
March 4, 2015 HPC Oil & Gas Workshop 16
www.hdfgroup.org
HDF5 Technology Platform
• HDF5 Abstract Data Model
• Defines the “building blocks” for data organization and
specification
• Files, Groups, Links, Datasets, Attributes, Datatypes,
Dataspaces
• HDF5 Software
• Tools
• Language Interfaces
• HDF5 Library
• HDF5 Binary File Format
• Bit-level organization of HDF5 file
• Defined by HDF5 File Format Specification
March 4, 2015 HPC Oil & Gas Workshop 17
www.hdfgroup.org
HDF5 Software Distribution
HDF5 home page: http://hdfgroup.org/HDF5/
• Latest release: HDF5 1.8.14 (1.8.15 coming in May
2015)
HDF5 source code:
• Written in C, and includes optional C++, Fortran 90
APIs, and High Level APIs
• Contains command-line utilities (h5dump, h5repack,
h5diff, ..) and compile scripts
HDF5 pre-built binaries:
• When possible, includes C, C++, F90, and High Level
libraries. Check ./lib/libhdf5.settings file.
• Built with and require the SZIP and ZLIB external
libraries
March 4, 2015 HPC Oil & Gas Workshop 18
www.hdfgroup.org
The General HDF5 API
• C, Fortran, Java, C++, and .NET bindings
• IDL, MATLAB, Python (h5py, PyTables)
• C routines begin with the prefix H5?, where
? is a character corresponding to the type of object the
function acts on
March 4, 2015 HPC Oil & Gas Workshop 19
Example Interfaces:
H5D : Dataset interface e.g., H5Dread
H5F : File interface e.g., H5Fopen
H5S : dataSpace interface e.g., H5Sclose
www.hdfgroup.org
The HDF5 API
• For flexibility, the API is extensive
- 300+ functions
• This can be daunting… but there is hope
- A few functions can do a lot
- Start simple
- Build up knowledge as more features are needed
March 4, 2015 HPC Oil & Gas Workshop 20
Victorinox
Swiss Army
Cybertool 34
www.hdfgroup.org
General Programming Paradigm
• Object is opened or created
• Object is accessed, possibly many times
• Object is closed
• Properties of object are optionally defined
Creation properties (e.g., use chunking storage)
Access properties
March 4, 2015 HPC Oil & Gas Workshop 21
www.hdfgroup.org
Basic Functions
H5Fcreate (H5Fopen) create (open) File
H5Screate_simple/H5Screate create Dataspace
H5Dcreate (H5Dopen) create (open) Dataset
H5Dread, H5Dwrite access Dataset
H5Dclose close Dataset
H5Sclose close Dataspace
H5Fclose close File
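Put together, these basic functions follow the open-access-close paradigm. A minimal serial sketch; the file name, dataset name, and buffer are illustrative, not from the slides:

#include "hdf5.h"

int main(void)
{
    hsize_t dims[2] = {4, 6};
    int     data[4][6] = {{0}};     /* illustrative buffer */

    hid_t file_id  = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);
    hid_t space_id = H5Screate_simple(2, dims, NULL);
    hid_t dset_id  = H5Dcreate(file_id, "/IntArray", H5T_NATIVE_INT, space_id,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset_id);
    H5Sclose(space_id);
    H5Fclose(file_id);
    return 0;
}

Compiled with h5cc, the resulting example.h5 can be inspected with h5dump.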
March 4, 2015 HPC Oil & Gas Workshop 22
www.hdfgroup.org
Useful Tools For New Users
March 4, 2015 HPC Oil & Gas Workshop 23
h5dump:
Tool to “dump” or display contents of HDF5 files
h5cc, h5c++, h5fc:
Scripts to compile applications
HDFView:
Java browser to view HDF5 files
http://www.hdfgroup.org/hdf-java-html/hdfview/
HDF5 Examples (C, Fortran, Java, Python, Matlab)
http://www.hdfgroup.org/ftp/HDF5/examples/
www.hdfgroup.org
OVERVIEW OF PARALLEL
HDF5 DESIGN
March 4, 2015 HPC Oil & Gas Workshop 24
www.hdfgroup.org
• Parallel HDF5 should allow multiple
processes to perform I/O to an HDF5 file at
the same time
• Single file image for all processes
• Compare with one file per process design:
• Expensive post processing
• Not usable by a different number of processes
• Too many files produced for file system
• Parallel HDF5 should use a standard
parallel I/O interface
• Must be portable to different platforms
Parallel HDF5 Requirements
March 4, 2015 HPC Oil & Gas Workshop 25
www.hdfgroup.org
Design requirements, cont
• Support Message Passing Interface
(MPI) programming
• Parallel HDF5 files compatible with
serial HDF5 files
• Shareable between different serial or
parallel platforms
March 4, 2015 HPC Oil & Gas Workshop 26
www.hdfgroup.org
Design Dependencies
• MPI with MPI-IO
• MPICH, OpenMPI
• Vendor’s MPI-IO
• Parallel file system
• IBM GPFS
• Lustre
• PVFS
March 4, 2015 HPC Oil & Gas Workshop 27
www.hdfgroup.org
PHDF5 implementation layers
HDF5 Application
Compute node Compute node Compute node
HDF5 Library
MPI Library
HDF5 file on Parallel File System
Switch network + I/O servers
Disk architecture and layout of data on disk
March 4, 2015 HPC Oil & Gas Workshop 28
www.hdfgroup.org
MPI-IO VS. HDF5
March 4, 2015 HPC Oil & Gas Workshop 29
www.hdfgroup.org
MPI-IO
• MPI-IO is an Input/Output API
• It treats the data file as a “linear byte
stream” and each MPI application needs
to provide its own file and data
representations to interpret those bytes
March 4, 2015 HPC Oil & Gas Workshop 30
www.hdfgroup.org
MPI-IO
• All data stored is machine dependent except
the “external32” representation
• External32 is defined as big-endian
• Little-endian machines have to do data
conversion in both read and write operations
• 64-bit sized data types may lose
information
March 4, 2015 HPC Oil & Gas Workshop 31
www.hdfgroup.org
MPI-IO vs. HDF5
• HDF5 is data management software
• It stores data and metadata according
to the HDF5 data format definition
• HDF5 file is self-describing
• Each machine can store the data in its own
native representation for efficient I/O
without loss of data precision
• Any necessary data representation
conversion is done by the HDF5 library
automatically
March 4, 2015 HPC Oil & Gas Workshop 32
www.hdfgroup.org
PARALLEL HDF5
CONSISTENCY SEMANTICS
March 4, 2015 HPC Oil & Gas Workshop 33
www.hdfgroup.org
Consistency Semantics
• Consistency Semantics: Rules that define the
outcome of multiple, possibly concurrent,
accesses to an object or data structure by one
or more processes in a computer system.
March 4, 2015 HPC Oil & Gas Workshop 34
www.hdfgroup.org
Parallel HDF5 Consistency Semantics
• Parallel HDF5 library defines a set of
consistency semantics to let users know what
to expect when processes access data
managed by the library.
• They define when the changes a process makes
become visible to itself (if it reads that data
back) and to other processes that access the
same file with independent or collective I/O
operations
March 4, 2015 HPC Oil & Gas Workshop 35
www.hdfgroup.org
Parallel HDF5 Consistency Semantics
• Same as MPI-I/O semantics
• Default MPI-I/O semantics don’t guarantee
atomicity or ordering of the calls shown below!
• Problems may occur (although we haven’t
seen any) when writing/reading HDF5
metadata or raw data
Process 0: MPI_File_write_at(); MPI_Barrier();
Process 1: MPI_Barrier(); MPI_File_read_at();
March 4, 2015 HPC Oil & Gas Workshop 36
www.hdfgroup.org
• MPI I/O provides atomicity and sync-barrier-
sync features to address the issue
• PHDF5 follows MPI I/O
• H5Fset_mpi_atomicity function to turn on
MPI atomicity
• H5Fsync function to transfer written data to
storage device (in implementation now)
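A hedged sketch of how an application can combine these pieces today, using H5Fflush plus a barrier in place of the still-in-progress H5Fsync; file_id, dset_id, buf, and mpi_rank are assumed from the surrounding program:

H5Fset_mpi_atomicity(file_id, 1);        /* turn on MPI-IO atomic mode (collective) */

if (mpi_rank == 0)
    H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

H5Fflush(file_id, H5F_SCOPE_GLOBAL);     /* push written data toward storage */
MPI_Barrier(MPI_COMM_WORLD);             /* sync point between writer and readers */

if (mpi_rank != 0)
    H5Dread(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);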
March 4, 2015 HPC Oil & Gas Workshop 37
Parallel HDF5 Consistency Semantics
www.hdfgroup.org
• For more information see “Enabling a strict
consistency semantics model in parallel
HDF5” linked from the HDF5
H5Fset_mpi_atomicity Reference Manual page:
http://www.hdfgroup.org/HDF5/doc/RM/Advanced/PHDF5FileConsistencySemantics/PHDF5FileConsistencySemantics.pdf
March 4, 2015 HPC Oil & Gas Workshop 38
Parallel HDF5 Consistency Semantics
www.hdfgroup.org
HDF5 PARALLEL
PROGRAMMING MODEL
March 4, 2015 HPC Oil & Gas Workshop 39
www.hdfgroup.org
How to compile PHDF5 applications
• h5pcc – HDF5 C compiler command
• Similar to mpicc
• h5pfc – HDF5 F90 compiler command
• Similar to mpif90
• To compile:
• % h5pcc h5prog.c
• % h5pfc h5prog.f90
March 4, 2015 HPC Oil & Gas Workshop 40
www.hdfgroup.org
Programming restrictions
• PHDF5 opens a parallel file with an MPI
communicator
• Returns a file handle
• Future access to the file via the file handle
• All processes must participate in collective
PHDF5 APIs
• Different files can be opened via different
communicators
March 4, 2015 HPC Oil & Gas Workshop 41
www.hdfgroup.org
Collective HDF5 calls
• All HDF5 APIs that modify structural
metadata are collective!
• File operations
- H5Fcreate, H5Fopen, H5Fclose, etc
• Object creation
- H5Dcreate, H5Dclose, etc
• Object structure modification (e.g., dataset
extent modification)
- H5Dset_extent, etc
• http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html
March 4, 2015 HPC Oil & Gas Workshop 42
www.hdfgroup.org
Other HDF5 calls
• Array data transfer can be collective or
independent
- Dataset operations: H5Dwrite, H5Dread
• Collectiveness is indicated by function
parameters, not by function names as in MPI API
March 4, 2015 HPC Oil & Gas Workshop 43
www.hdfgroup.org
What does PHDF5 support ?
• After a file is opened by the processes of a
communicator
• All parts of file are accessible by all processes
• All objects in the file are accessible by all
processes
• Multiple processes may write to the same data
array
• Each process may write to its own individual data array
March 4, 2015 HPC Oil & Gas Workshop 44
www.hdfgroup.org
PHDF5 API languages
• C and Fortran 90/2003 language interfaces
• Supported on most platforms with MPI-IO, e.g.,
• IBM BG/x
• Linux clusters
• Cray
March 4, 2015 HPC Oil & Gas Workshop 45
www.hdfgroup.org
Programming model
• HDF5 uses access property list to control
the file access mechanism
• General model to access HDF5 file in
parallel:
- Set up MPI-IO file access property list
- Open File
- Access Data
- Close File
March 4, 2015 HPC Oil & Gas Workshop 46
www.hdfgroup.org
Example of Serial HDF5 C program
1.
2.
3. file_id = H5Fcreate(FNAME,…, H5P_DEFAULT);
4. space_id = H5Screate_simple(…);
5. dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT,
space_id,…);
6.
7.
8. status = H5Dwrite(dset_id, H5T_NATIVE_INT, …, H5P_DEFAULT,
…);
March 4, 2015 HPC Oil & Gas Workshop 48
www.hdfgroup.org
Example of Parallel HDF5 C program
Parallel HDF5 program has extra calls:
MPI_Init(&argc, &argv);
1. fapl_id = H5Pcreate(H5P_FILE_ACCESS);
2. H5Pset_fapl_mpio(fapl_id, comm, info);
3. file_id = H5Fcreate(FNAME,…, fapl_id);
4. space_id = H5Screate_simple(…);
5. dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT,
space_id,…);
6. xf_id = H5Pcreate(H5P_DATASET_XFER);
7. H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
8. status = H5Dwrite(dset_id, H5T_NATIVE_INT, …, xf_id…);
MPI_Finalize();
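Fleshed out, the skeleton above becomes a complete program; a minimal sketch in which each rank writes its rank number into a shared 1-D dataset (file and dataset names are illustrative):

#include "hdf5.h"
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open the file for parallel access over MPI_COMM_WORLD */
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file_id = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

    /* One dataset element per rank */
    hsize_t dims[1] = {(hsize_t)size};
    hid_t space_id = H5Screate_simple(1, dims, NULL);
    hid_t dset_id  = H5Dcreate(file_id, "ranks", H5T_NATIVE_INT, space_id,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own element of the file dataspace */
    hsize_t offset[1] = {(hsize_t)rank}, count[1] = {1};
    hid_t filespace = H5Dget_space(dset_id);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* Collective write of one integer per rank */
    hid_t xf_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, xf_id, &rank);

    H5Pclose(xf_id); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset_id); H5Sclose(space_id);
    H5Fclose(file_id); H5Pclose(fapl_id);
    MPI_Finalize();
    return 0;
}

Build with h5pcc and launch with mpirun, exactly as shown on the earlier compilation slide.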
March 4, 2015 HPC Oil & Gas Workshop 49
www.hdfgroup.org
WRITING PATTERNS -
EXAMPLE
March 4, 2015 HPC Oil & Gas Workshop 50
www.hdfgroup.org
Parallel HDF5 tutorial examples
• For sample programs of how to write different
data patterns see:
http://www.hdfgroup.org/HDF5/Tutor/parallel.html
March 4, 2015 HPC Oil & Gas Workshop 51
www.hdfgroup.org
Programming model
• Each process defines memory and file
hyperslabs using H5Sselect_hyperslab
• Each process executes a write/read call using
hyperslabs defined, which can be either
collective or independent
• The hyperslab parameters define the portion of
the dataset to write to:
- Contiguous hyperslab
- Regularly spaced data (column or row)
- Pattern
- Blocks
March 4, 2015 HPC Oil & Gas Workshop 52
www.hdfgroup.org
Four processes writing by rows
HDF5 "SDS_row.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
DATA {
10, 10, 10, 10, 10,
10, 10, 10, 10, 10,
11, 11, 11, 11, 11,
11, 11, 11, 11, 11,
12, 12, 12, 12, 12,
12, 12, 12, 12, 12,
13, 13, 13, 13, 13,
13, 13, 13, 13, 13
March 4, 2015 HPC Oil & Gas Workshop 53
www.hdfgroup.org
Parallel HDF5 example code
/*
 * Each process defines dataset in memory and writes it to the
 * hyperslab in the file.
 */
count[0] = dims[0] / mpi_size;
count[1] = dims[1];
offset[0] = mpi_rank * count[0];
offset[1] = 0;
memspace = H5Screate_simple(RANK, count, NULL);

/*
 * Select hyperslab in the file.
 */
filespace = H5Dget_space(dset_id);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
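The selection above is typically followed by a collective write; a minimal continuation sketch, assuming dset_id and a data buffer from the surrounding tutorial example:

xf_id = H5Pcreate(H5P_DATASET_XFER);            /* transfer property list */
H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);  /* request collective I/O */
status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, xf_id, data);
H5Pclose(xf_id);
H5Sclose(filespace);
H5Sclose(memspace);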
March 4, 2015 HPC Oil & Gas Workshop 54
www.hdfgroup.org
Two processes writing by columns
HDF5 "SDS_col.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
DATA {
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200
March 4, 2015 HPC Oil & Gas Workshop 55
www.hdfgroup.org
Four processes writing by pattern
HDF5 "SDS_pat.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
DATA {
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4,
1, 3, 1, 3,
2, 4, 2, 4
March 4, 2015 HPC Oil & Gas Workshop 56
www.hdfgroup.org
Four processes writing by blocks
HDF5 "SDS_blk.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
DATA {
1, 1, 2, 2,
1, 1, 2, 2,
1, 1, 2, 2,
1, 1, 2, 2,
3, 3, 4, 4,
3, 3, 4, 4,
3, 3, 4, 4,
3, 3, 4, 4
March 4, 2015 HPC Oil & Gas Workshop 57
www.hdfgroup.org
Complex data patterns
[Figure: three 8 x 8 arrays of elements 1-64, each decomposed across processes in a different irregular pattern]
HDF5 doesn’t have restrictions on data patterns and data balance
March 4, 2015 HPC Oil & Gas Workshop 58
www.hdfgroup.org
Examples of irregular selection
• Internally, the HDF5 library creates an MPI
datatype for each lower dimension in the
selection and then combines those types into
one giant structured MPI datatype
March 4, 2015 HPC Oil & Gas Workshop 59
www.hdfgroup.org
PERFORMANCE ANALYSIS
March 4, 2015 HPC Oil & Gas Workshop 60
www.hdfgroup.org
Performance analysis
• Some common causes of poor performance
• Possible solutions
March 4, 2015 HPC Oil & Gas Workshop 61
www.hdfgroup.org
My PHDF5 application I/O is slow
“Tuning HDF5 for Lustre File Systems” by
Howison, Koziol, Knaak, Mainzer, and
Shalf1
- Chunking and hyperslab selection
- HDF5 metadata cache
- Specific I/O system hints (see the sketch after the footnote below)
March 4, 2015 HPC Oil & Gas Workshop 62
1 http://www.hdfgroup.org/pubs/papers/howison_hdf5_lustre_iasds2010.pdf
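One example from the “I/O system hints” category: Lustre striping hints can be passed through an MPI_Info object when the MPI-IO file access property list is set up. A sketch, with hint values that are assumptions to be tuned per system:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "160");    /* number of Lustre OSTs */
MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB stripe size */

hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, info);
hid_t file_id = H5Fcreate("striped.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

MPI_Info_free(&info);

Note that striping hints only take effect when the file is first created.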
www.hdfgroup.org
Collective vs. independent calls
• MPI definition of collective calls:
• All processes of the communicator must participate
in calls in the right order. E.g.,
• Process1 Process2
• call A(); call B(); call A(); call B(); **right**
• call A(); call B(); call B(); call A(); **wrong**
• Independent means not collective
• Collective is not necessarily synchronous, nor
does it necessarily require communication
March 4, 2015 HPC Oil & Gas Workshop 64
www.hdfgroup.org
Independent vs. collective access
• User reported
independent data
transfer mode was
much slower than
the collective data
transfer mode
• Data array was tall
and thin: 230,000
rows by 6 columns
March 4, 2015 HPC Oil & Gas Workshop 65
www.hdfgroup.org
Debug Slow Parallel I/O Speed(1)
• Writing to one dataset
- Using 4 processes == 4 columns
- HDF5 datatype is 8-byte doubles
- 4 processes, 1000 rows == 4x1000x8 = 32,000
bytes
• % mpirun -np 4 ./a.out 1000
- Execution time: 1.783798 s.
• % mpirun -np 4 ./a.out 2000
- Execution time: 3.838858 s.
• Difference of 2 seconds for 1000 more rows =
32,000 bytes.
• Speed of 16KB/sec!!! Way too slow.
March 4, 2015 HPC Oil & Gas Workshop 66
www.hdfgroup.org
Debug slow parallel I/O speed(2)
• Build a version of PHDF5 with
• ./configure --enable-debug --enable-parallel …
• This allows the tracing of MPIO I/O calls in the
HDF5 library.
• E.g., to trace
• MPI_File_read_xx and MPI_File_write_xx calls
• % setenv H5FD_mpio_Debug “rw”
March 4, 2015 HPC Oil & Gas Workshop 67
www.hdfgroup.org
Debug slow parallel I/O speed(3)
% setenv H5FD_mpio_Debug ’rw’
% mpirun -np 4 ./a.out 1000 # Indep.; contiguous.
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=2056 size_i=8
in H5FD_mpio_write mpi_off=2048 size_i=8
in H5FD_mpio_write mpi_off=2072 size_i=8
in H5FD_mpio_write mpi_off=2064 size_i=8
in H5FD_mpio_write mpi_off=2088 size_i=8
in H5FD_mpio_write mpi_off=2080 size_i=8
…
• Total of 4000 of these little 8-byte writes == 32,000 bytes.
March 4, 2015 HPC Oil & Gas Workshop 68
www.hdfgroup.org
Independent calls are many and small
• Each process writes
one element of one
row, skips to next
row, write one
element, so on.
• Each process issues
230,000 writes of 8
bytes each.
March 4, 2015 HPC Oil & Gas Workshop 69
www.hdfgroup.org
Debug slow parallel I/O speed (4)
% setenv H5FD_mpio_Debug ’rw’
% mpirun -np 4 ./a.out 1000 # Indep., Chunked by column.
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=0 size_i=96
in H5FD_mpio_write mpi_off=3688 size_i=8000
in H5FD_mpio_write mpi_off=11688 size_i=8000
in H5FD_mpio_write mpi_off=27688 size_i=8000
in H5FD_mpio_write mpi_off=19688 size_i=8000
in H5FD_mpio_write mpi_off=96 size_i=40
in H5FD_mpio_write mpi_off=136 size_i=544
in H5FD_mpio_write mpi_off=680 size_i=120
in H5FD_mpio_write mpi_off=800 size_i=272
…
Execution time: 0.011599 s.
March 4, 2015 HPC Oil & Gas Workshop 70
www.hdfgroup.org
Use collective mode or chunked storage
• Collective I/O will
combine many small
independent calls
into few but bigger
calls
• Chunks of columns
speeds up too
March 4, 2015 HPC Oil & Gas Workshop 71
www.hdfgroup.org
Collective vs. independent write
[Chart: seconds to write vs. data size in MB (0.25 to 2.75), comparing independent write with collective write]
March 4, 2015 HPC Oil & Gas Workshop 72
www.hdfgroup.org
Collective I/O in HDF5
• Set up using a Data Transfer Property List
(DXPL)
• All processes must participate in the I/O call
(H5Dread/write) with a selection (which could
be a NULL selection)
• Some cases where collective I/O is not used
even when the user asks for it:
• Data conversion
• Compressed Storage
• Chunking Storage:
• When the chunk is not selected by a certain
number of processes
March 4, 2015 HPC Oil & Gas Workshop 73
www.hdfgroup.org
Enabling Collective Parallel I/O with HDF5
/* Set up file access property list w/parallel I/O access */
fa_plist_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fa_plist_id, comm, info);
/* Create a new file collectively */
file_id = H5Fcreate(filename, H5F_ACC_TRUNC,
H5P_DEFAULT, fa_plist_id);
/* <omitted data decomposition for brevity> */
/* Set up data transfer property list w/collective MPI-IO */
dx_plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dx_plist_id, H5FD_MPIO_COLLECTIVE);
/* Write data elements to the dataset */
status = H5Dwrite(dset_id, H5T_NATIVE_INT,
memspace, filespace, dx_plist_id, data);
March 4, 2015 HPC Oil & Gas Workshop 74
www.hdfgroup.org
Collective I/O in HDF5
• Can query Data Transfer Property List (DXPL)
after I/O for collective I/O status:
• H5Pget_mpio_actual_io_mode
• Retrieves the type of I/O that HDF5 actually
performed on the last parallel I/O call
• H5Pget_mpio_no_collective_cause
• Retrieves local and global causes that broke
collective I/O on the last parallel I/O call
• H5Pget_mpio_actual_chunk_opt_mode
• Retrieves the type of chunk optimization that
HDF5 actually performed on the last parallel I/O
call. This is not necessarily the type of
optimization requested
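A short fragment showing these queries in use right after a write; dx_plist_id follows the naming of the code on the previous slide, and <stdio.h> is assumed for printf:

H5D_mpio_actual_io_mode_t io_mode;
uint32_t local_cause, global_cause;

H5Pget_mpio_actual_io_mode(dx_plist_id, &io_mode);
H5Pget_mpio_no_collective_cause(dx_plist_id, &local_cause, &global_cause);

if (io_mode == H5D_MPIO_NO_COLLECTIVE)
    printf("collective I/O was broken: local=0x%x global=0x%x\n",
           (unsigned)local_cause, (unsigned)global_cause);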
March 4, 2015 HPC Oil & Gas Workshop 75
www.hdfgroup.org
EFFECT OF HDF5 STORAGE
March 4, 2015 HPC Oil & Gas Workshop 76
www.hdfgroup.org
Contiguous storage
• Metadata header separate from dataset data
• Data stored in one contiguous block in HDF5 file
[Figure: dataset header (datatype, dataspace, attributes) held in the metadata cache in application memory; dataset data stored as a single contiguous block in the file]
March 4, 2015 HPC Oil & Gas Workshop 77
www.hdfgroup.org
On a parallel file system
[Figure: contiguous dataset data striped across OST 1, OST 2, OST 3, OST 4]
The file is striped over multiple OSTs depending on
the stripe size and stripe count that the file was
created with.
March 4, 2015 HPC Oil & Gas Workshop 78
www.hdfgroup.org
Chunked storage
• Data is stored in chunks of predefined size
• Two-dimensional instance may be referred to as data
tiling
• HDF5 library writes/reads the whole chunk
Contiguous Chunked
March 4, 2015 HPC Oil & Gas Workshop 79
www.hdfgroup.org
Chunked storage (cont.)
• Dataset data is divided into equally sized blocks (chunks).
• Each chunk is stored separately as a contiguous block in
HDF5 file.
[Figure: dataset header and chunk index held in the metadata cache in application memory; chunks A, B, C, D each stored as a separate contiguous block in the file]
March 4, 2015 HPC Oil & Gas Workshop 80
www.hdfgroup.org
On a parallel file system
[Figure: header, chunk index, and chunks A-D striped across OST 1, OST 2, OST 3, OST 4]
The file is striped over multiple OSTs depending on
the stripe size and stripe count that the file was
created with
March 4, 2015 HPC Oil & Gas Workshop 81
www.hdfgroup.org
Which is better for performance?
• It depends!!
• Consider two example selections (figure not reproduced):
• First selection: contiguous layout needs 2 seeks,
chunked layout needs 10 seeks
• Second selection: contiguous layout needs 16 seeks,
chunked layout needs 4 seeks
Add to that striping over a Parallel File System, which
makes this problem very hard to solve!
March 4, 2015 HPC Oil & Gas Workshop 82
www.hdfgroup.org
Chunking and hyperslab selection
• When writing or reading, try to use hyperslab
selections that coincide with chunk boundaries.
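For the row-wise decomposition used earlier, one way to do this is to make each chunk exactly one process's block, so every rank touches only whole chunks. A sketch; the dataset creation property list is an assumption added to the earlier H5Dcreate call, with dims, mpi_size, DNAME, file_id, and space_id reused from that example:

hsize_t chunk_dims[2] = {dims[0] / mpi_size, dims[1]};  /* one block per rank */
hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl_id, 2, chunk_dims);
dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id,
                    H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
H5Pclose(dcpl_id);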
[Figure: processes P1, P2, P3 writing hyperslab selections aligned with chunk boundaries]
March 4, 2015 HPC Oil & Gas Workshop 83
www.hdfgroup.org
EFFECT OF HDF5 METADATA
CACHE
March 4, 2015 HPC Oil & Gas Workshop 88
www.hdfgroup.org
Parallel HDF5 and Metadata
• Metadata operations:
• Creating/removing a dataset, group, attribute, etc…
• Extending a dataset’s dimensions
• Modifying group hierarchy
• etc …
• All operations that modify metadata are collective,
i.e., all processes have to call that operation:
• If you have 10,000 processes running your
application, and one process needs to create a
dataset, ALL processes must call H5Dcreate to
create 1 dataset.
March 4, 2015 HPC Oil & Gas Workshop 89
www.hdfgroup.org
Space allocation
• Allocating space at the file’s EOF is very simple in
serial HDF5 applications:
• The EOF value begins at offset 0 in the file
• When space is required, the EOF value is
incremented by the size of the block requested.
• Space allocation using the EOF value in parallel
HDF5 applications can result in a race condition if
processes do not synchronize with each other:
• Multiple processes believe that they are the sole
owner of a range of bytes within the HDF5 file.
• Solution: Make it Collective
March 4, 2015 HPC Oil & Gas Workshop 90
www.hdfgroup.org
Metadata cache
• To handle synchronization issues, all HDF5
operations that could potentially modify the
metadata in an HDF5 file are required to be
collective
• A list of these routines is available in the HDF5
reference
manual: http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html
March 4, 2015 HPC Oil & Gas Workshop 93
www.hdfgroup.org
Managing the metadata cache
• All operations that modify metadata in the HDF5
file are collective:
• All processes will have the same dirty metadata
entries in their cache (i.e., metadata that is
inconsistent with what is on disk).
• Processes are not required to have the same clean
metadata entries (i.e., metadata that is in sync with
what is on disk).
• Internally, the metadata cache running on process
0 is responsible for managing changes to the
metadata in the HDF5 file.
• All the other caches must retain dirty metadata until
the process 0 cache tells them that the metadata is
clean (i.e., on disk).
March 4, 2015 HPC Oil & Gas Workshop 94
www.hdfgroup.org
Flushing the cache
• Initiated when:
• The size of dirty entries in cache exceeds a
certain threshold
• The user calls a flush
• The actual flush of metadata entries to disk is
currently implemented in two ways:
• Single Process (Process 0) write
• Distributed write
March 4, 2015 HPC Oil & Gas Workshop 99
www.hdfgroup.org
PARALLEL TOOLS
March 4, 2015 HPC Oil & Gas Workshop 102
www.hdfgroup.org
Parallel tools
• h5perf
• Performance measuring tool showing
I/O performance for different I/O APIs
March 4, 2015 HPC Oil & Gas Workshop 103
www.hdfgroup.org
h5perf
• An I/O performance measurement tool
• Tests 3 File I/O APIs:
• POSIX I/O (open/write/read/close…)
• MPI-I/O (MPI_File_{open,write,read,close})
• HDF5 (H5Fopen/H5Dwrite/H5Dread/H5Fclose)
• An indication of I/O speed upper limits
March 4, 2015 HPC Oil & Gas Workshop 104
www.hdfgroup.org
Useful parallel HDF5 links
• Parallel HDF information site
http://www.hdfgroup.org/HDF5/PHDF5/
• Parallel HDF5 tutorial available at
http://www.hdfgroup.org/HDF5/Tutor/
• HDF Help email address
help@hdfgroup.org
March 4, 2015 HPC Oil & Gas Workshop 105
www.hdfgroup.org
UPCOMING FEATURES IN
HDF5
March 4, 2015 HPC Oil & Gas Workshop 106
www.hdfgroup.org
PHDF5 Improvements in Progress
• Multi-dataset read/write operations
• Allows single collective operation on multiple
datasets
• Similar to PnetCDF “write-combining” feature
• H5Dmulti_read/write(<array of datasets,
selections, etc>)
• Order of magnitude speedup
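The exact API was still being finalized at the time; the sketch below only illustrates the direction, with a signature assumed from the slide's description rather than taken from a released library:

/* Rough sketch only: one collective call covering several datasets. */
#define N_DSETS 4
hid_t dsets[N_DSETS], mem_types[N_DSETS], mem_spaces[N_DSETS], file_spaces[N_DSETS];
const void *bufs[N_DSETS];
/* ... fill the arrays, one entry per dataset ... */
H5Dwrite_multi(N_DSETS, dsets, mem_types, mem_spaces, file_spaces,
               collective_dxpl_id, bufs);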
March 4, 2015 HPC Oil & Gas Workshop 107
www.hdfgroup.org
H5Dwrite vs. H5Dwrite_multi
[Chart: write time in seconds vs. number of datasets (400, 800, 1600, 3200, 6400) for H5Dwrite and H5Dwrite_multi; rank = 1, dims = 200, contiguous floating-point datasets]
March 4, 2015 HPC Oil & Gas Workshop 109
www.hdfgroup.org
PHDF5 Improvements in Progress
• Avoid file truncation
• File format currently requires call to truncate
file, when closing
• Expensive in parallel (MPI_File_set_size)
• Change to file format will eliminate truncate call
March 4, 2015 HPC Oil & Gas Workshop 110
www.hdfgroup.org
PHDF5 Improvements in Progress
• Collective Object Open
• Currently, object open is independent
• All processes perform I/O to read metadata
from file, resulting in I/O storm at file system
• Change will allow a single process to read, then
broadcast metadata to other processes
March 4, 2015 HPC Oil & Gas Workshop 111
www.hdfgroup.org
Collective Object Open Performance
March 4, 2015 HPC Oil & Gas Workshop 112
www.hdfgroup.org
Other HDF5 Improvements in Progress
• Single-Writer/Multiple-Reader (SWMR)
• Virtual Object Layer (VOL)
• Virtual Datasets
March 4, 2015 HPC Oil & Gas Workshop 126
www.hdfgroup.org
Single-Writer/Multiple-Reader (SWMR)
• Improves HDF5 for Data Acquisition:
• Allows simultaneous data gathering and
monitoring/analysis
• Focused on storing data sequences for
high-speed data sources
• Supports ‘Ordered Updates’ to file:
• Crash-proofs accessing HDF5 file
• Possibly uses small amount of extra
space
January 21, 2015 Computing for Light and Neutron Sources Forum 127
www.hdfgroup.org
Virtual Object Layer (VOL)
• Goal
- Provide an application with the HDF5 data model
and API, but allow different underlying storage
mechanisms
• New layer below HDF5 API
- Intercepts all API calls that can touch the data on
disk and routes them to a VOL plugin
• Potential VOL plugins:
- Native HDF5 driver (writes to HDF5 file)
- Raw driver (maps groups to file system directories
and datasets to files in directories)
- Remote driver (the file exists on a remote machine)
March 4, 2015 HPC Oil & Gas Workshop 129
www.hdfgroup.org
VOL Plugins
March 4, 2015 HPC Oil & Gas Workshop 130
www.hdfgroup.org
Raw Plugin
• The flexibility of the virtual object layer
provides developers with the option to
abandon the single file, binary format like the
native HDF5 implementation.
• A “raw” file format could map HDF5 objects
(groups, datasets, etc …) to file system objects
(directories, files, etc …).
• The entire set of raw file system objects
created would represent one HDF5 container.
March 4, 2015 HPC Oil & Gas Workshop 139
www.hdfgroup.org
Remote Plugin
• A remote VOL plugin would allow access to
files located on a server.
• Prototyping two implementations:
• Web-services via RESTful access:
http://www.hdfgroup.org/projects/hdfserver/
• Native HDF5 file access over sockets:
http://svn.hdfgroup.uiuc.edu/h5netvol/trunk/
March 4, 2015 HPC Oil & Gas Workshop 140
www.hdfgroup.org
Virtual Datasets
• Mechanism for creating a composition of multiple
source datasets, while accessing through single
virtual dataset
• Modifications to source datasets are visible to
virtual dataset
• And writing to virtual dataset modifies source
datasets
• Can have subset within source dataset mapped to
subsets within virtual dataset
• Source and virtual datasets can have unlimited
dimensions
• Source datasets can be virtual datasets
themselves
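A sketch of how such a mapping could be declared through the dataset creation property list; the feature was still under development, so treat the call sequence as illustrative, and note that vspace_id, src_space_id, and the selection arrays are assumed handles:

hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
/* Map the selected region of the source dataset into the selected
 * region of the virtual dataspace. */
H5Sselect_hyperslab(vspace_id, H5S_SELECT_SET, v_offset, NULL, v_count, NULL);
H5Pset_virtual(dcpl_id, vspace_id, "source_file.h5", "/source_dset", src_space_id);
hid_t vdset_id = H5Dcreate(file_id, "/virtual_dset", H5T_NATIVE_INT, vspace_id,
                           H5P_DEFAULT, dcpl_id, H5P_DEFAULT);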
March 4, 2015 HPC Oil & Gas Workshop 145
www.hdfgroup.org
Virtual Datasets, Example 1
March 4, 2015 HPC Oil & Gas Workshop 146
www.hdfgroup.org
Virtual Datasets, Example 2
March 4, 2015 HPC Oil & Gas Workshop 147
www.hdfgroup.org
Virtual Datasets, Example 3
March 4, 2015 HPC Oil & Gas Workshop 148
www.hdfgroup.org
HDF5 Roadmap
March 4, 2015 149
• Concurrency
• Single-Writer/Multiple-
Reader (SWMR)
• Internal threading
• Virtual Object Layer (VOL)
• Data Analysis
• Query / View / Index APIs
• Native HDF5 client/server
• Performance
• Scalable chunk indices
• Metadata aggregation
and Page buffering
• Asynchronous I/O
• Variable-length
records
• Fault tolerance
• Parallel I/O
• I/O Autotuning
Extreme Scale Computing HDF5
“The best way to predict the
future is to invent it.”
– Alan Kay
www.hdfgroup.org
The HDF Group
Thank You!
Questions?
March 4, 2015 HPC Oil & Gas Workshop 150
www.hdfgroup.org
Codename “HEXAD”
• Excel is a great frontend with a not so great rear ;-)
• We’ve fixed that with an HDF5 Excel Add-in
• Let’s you do the usual things including:
• Display content (file structure, detailed object info)
• Create/read/write datasets
• Create/read/update attributes
• Plenty of ideas for bells and whistles, e.g., HDF5
image & PyTables support
• Send in* your Must Have/Nice To Have feature list!
• Stay tuned for the beta program * help@hdfgroup.org
March 4, 2015 HPC Oil & Gas Workshop 151
www.hdfgroup.org
HDF Server
• REST-based service for HDF5 data
• Reference Implementation for REST API
• Developed in Python using Tornado
Framework
• Supports Read/Write operations
• Clients can be Python/C/Fortran or Web Page
• Let us know what specific features you’d like to
see. E.g. VOL REST Client Plugin
March 4, 2015 HPC Oil & Gas Workshop 152
www.hdfgroup.org
HDF Server Architecture
March 4, 2015 HPC Oil & Gas Workshop 153
www.hdfgroup.org
Restless About HDF5/REST
March 4, 2015 HPC Oil & Gas Workshop 154
www.hdfgroup.org
HDF Compass
• “Simple” HDF5 Viewer application
• Cross platform (Windows/Mac/Linux)
• Native look and feel
• Can display extremely large HDF5 files
• View HDF5 files and OpenDAP resources
• Plugin model enables different file
formats/remote resources to be supported
• Community-based development model
March 4, 2015 HPC Oil & Gas Workshop 155
www.hdfgroup.org
Compass Architecture
March 4, 2015 HPC Oil & Gas Workshop 156
More Related Content

What's hot (20)

PPTX
Tools to improve the usability of NASA HDF Data
PDF
H5Coro: The Cloud-Optimized Read-Only Library
PPTX
HDF - Current status and Future Directions
PPTX
PPTX
MATLAB Modernization on HDF5 1.10
PPT
Caching and Buffering in HDF5
PPTX
Hierarchical Data Formats (HDF) Update
PPSX
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
PDF
Transitioning from HDF4 to HDF5
PPT
Ensuring Long Term Access to Remotely Sensed HDF4 Data with Layout Maps
PPSX
HDFEOS.org User Analsys, Updates, and Future
PPTX
HDF4 Mapping Project Update
PPTX
Parallel Computing with HDF Server
PPTX
HDF5 OPeNDAP project update and demo
PPTX
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 products
PPSX
GDAL Enhancement for ESDIS Project
Tools to improve the usability of NASA HDF Data
H5Coro: The Cloud-Optimized Read-Only Library
HDF - Current status and Future Directions
MATLAB Modernization on HDF5 1.10
Caching and Buffering in HDF5
Hierarchical Data Formats (HDF) Update
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Transitioning from HDF4 to HDF5
Ensuring Long Term Access to Remotely Sensed HDF4 Data with Layout Maps
HDFEOS.org User Analsys, Updates, and Future
HDF4 Mapping Project Update
Parallel Computing with HDF Server
HDF5 OPeNDAP project update and demo
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 products
GDAL Enhancement for ESDIS Project
Ad

Viewers also liked (20)

PPTX
Ecci 1
PDF
Megan- Constructivism theory
PDF
Qns5 sonal
PDF
USS New York LPD 21 Commisioning
DOCX
PDF
Fiscalizacion
DOCX
Conferencia Magistral: Lo humano, lo ético, la educación
DOCX
PDF
Assessment of river_plan_change_using_rs_and_gis_technique
PPTX
Gerencia industrial johnny la rovere
PDF
PDF
PDF
הכנה לפרישה תכנון פיננסי
PDF
Examples
PDF
WASJ JOURNAL
PDF
PDF
Autonome voertuigen
PPTX
Realtor Presentation
PPTX
Alright presentation media
Ecci 1
Megan- Constructivism theory
Qns5 sonal
USS New York LPD 21 Commisioning
Fiscalizacion
Conferencia Magistral: Lo humano, lo ético, la educación
Assessment of river_plan_change_using_rs_and_gis_technique
Gerencia industrial johnny la rovere
הכנה לפרישה תכנון פיננסי
Examples
WASJ JOURNAL
Autonome voertuigen
Realtor Presentation
Alright presentation media
Ad

Similar to Hdf5 parallel (20)

PPTX
Parallel HDF5 Developments
PPTX
Introduction to HDF5 Data and Programming Models
PDF
Introduction to HDF5 Data Model, Programming Model and Library APIs
PDF
Introduction to HDF5 Data Model, Programming Model and Library APIs
PPT
Introduction to HDF5 Data Model, Programming Model and Library APIs
PPTX
HDF for the Cloud - New HDF Server Features
PPTX
D Robinson - Using HDF5 to work with large quantities of rich biological data
PDF
Python and HDF5: Overview
PPTX
PPT
Hdf5 intro
PDF
LCI2009-Tutorial
PDF
LCI2009-Tutorial
Parallel HDF5 Developments
Introduction to HDF5 Data and Programming Models
Introduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIs
HDF for the Cloud - New HDF Server Features
D Robinson - Using HDF5 to work with large quantities of rich biological data
Python and HDF5: Overview
Hdf5 intro
LCI2009-Tutorial
LCI2009-Tutorial

Recently uploaded (20)

PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PDF
Design an Analysis of Algorithms II-SECS-1021-03
PDF
Digital Strategies for Manufacturing Companies
PDF
System and Network Administration Chapter 2
PDF
AI in Product Development-omnex systems
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
top salesforce developer skills in 2025.pdf
PPTX
VVF-Customer-Presentation2025-Ver1.9.pptx
PPTX
L1 - Introduction to python Backend.pptx
PDF
Nekopoi APK 2025 free lastest update
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PPTX
history of c programming in notes for students .pptx
PDF
How Creative Agencies Leverage Project Management Software.pdf
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
How to Choose the Right IT Partner for Your Business in Malaysia
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
Design an Analysis of Algorithms II-SECS-1021-03
Digital Strategies for Manufacturing Companies
System and Network Administration Chapter 2
AI in Product Development-omnex systems
Upgrade and Innovation Strategies for SAP ERP Customers
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Odoo Companies in India – Driving Business Transformation.pdf
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
top salesforce developer skills in 2025.pdf
VVF-Customer-Presentation2025-Ver1.9.pptx
L1 - Introduction to python Backend.pptx
Nekopoi APK 2025 free lastest update
Which alternative to Crystal Reports is best for small or large businesses.pdf
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
history of c programming in notes for students .pptx
How Creative Agencies Leverage Project Management Software.pdf

Hdf5 parallel

  • 1. www.hdfgroup.org The HDF Group Parallel HDF5 March 4, 2015 HPC Oil & Gas Workshop Quincey Koziol Director of Core Software & HPC The HDF Group koziol@hdfgroup.org http://bit.ly/QuinceyKoziol 1
  • 2. www.hdfgroup.org Recent Parallel HDF5 Success Story • Performance of VPIC-IO on Bluewaters • I/O Kernel of a Plasma Physics application • 56 GB/s I/O rate in writing 5TB data using 5K cores with multi-dataset write optimization • VPIC-IO kernel running on 298,048 cores • ~10 Trillion particles • 291 TB, single file • 1 GB stripe size and 160 Lustre OSTs • 52 GB/s • 53% of the peak performance March 4, 2015 HPC Oil & Gas Workshop 3
  • 3. www.hdfgroup.org Outline • Quick Intro to HDF5 • Overview of Parallel HDF5 design • Parallel Consistency Semantics • PHDF5 Programming Model • Examples • Performance Analysis • Parallel Tools • Details of upcoming features of HDF5 March 4, 2015 HPC Oil & Gas Workshop 4 http://bit.ly/ParallelHDF5-HPCOGW-2015
  • 4. www.hdfgroup.org QUICK INTRO TO HDF5 March 4, 2015 HPC Oil & Gas Workshop 5
  • 5. www.hdfgroup.org What is HDF5? March 4, 2015 HPC Oil & Gas Workshop 6 • HDF5 == Hierarchical Data Format, v5 http://bit.ly/ParallelHDF5-HPCOGW-2015 • A flexible data model • Structures for data organization and specification • Open source software • Works with data in the format • Open file format • Designed for high volume or complex data
  • 6. www.hdfgroup.org What is HDF5, in detail? • A versatile data model that can represent very complex data objects and a wide variety of metadata. • An open source software library that runs on a wide range of computational platforms, from cell phones to massively parallel systems, and implements a high-level API with C, C++, Fortran, and Java interfaces. • A rich set of integrated performance features that allow for access time and storage space optimizations. • Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection. • A completely portable file format with no limit on the number or size of data objects stored. March 4, 2015 7HPC Oil & Gas Workshop http://bit.ly/ParallelHDF5-HPCOGW-2015
  • 7. www.hdfgroup.org HDF5 is like … March 4, 2015 HPC Oil & Gas Workshop 8
  • 8. www.hdfgroup.org Why use HDF5? • Challenging data: • Application data that pushes the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. • Software solutions: • For very large datasets, very fast access requirements, or very complex datasets. • To easily share data across a wide variety of computational platforms using applications written in different programming languages. • That take advantage of the many open-source and commercial tools that understand HDF5. • Enabling long-term preservation of data. March 4, 2015 9HPC Oil & Gas Workshop
  • 9. www.hdfgroup.org Who uses HDF5? • Examples of HDF5 user communities • Astrophysics • Astronomers • NASA Earth Science Enterprise • Dept. of Energy Labs • Supercomputing Centers in US, Europe and Asia • Synchrotrons and Light Sources in US and Europe • Financial Institutions • NOAA • Engineering & Manufacturing Industries • Many others • For a more detailed list, visit • http://www.hdfgroup.org/HDF5/users5.html March 4, 2015 10HPC Oil & Gas Workshop
  • 10. www.hdfgroup.org The HDF Group • Established in 1988 • 18 years at University of Illinois’ National Center for Supercomputing Applications • 8 years as independent non-profit company: “The HDF Group” • The HDF Group owns HDF4 and HDF5 • HDF4 & HDF5 formats, libraries, and tools are open source and freely available with BSD-style license • Currently employ 37 people • Always looking for more developers! March 4, 2015 11HPC Oil & Gas Workshop
  • 11. www.hdfgroup.org HDF5 Technology Platform • HDF5 Abstract Data Model • Defines the “building blocks” for data organization and specification • Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces • HDF5 Software • Tools • Language Interfaces • HDF5 Library • HDF5 Binary File Format • Bit-level organization of HDF5 file • Defined by HDF5 File Format Specification 12March 4, 2015 HPC Oil & Gas Workshop
  • 12. www.hdfgroup.orgMarch 4, 2015 13 HDF5 Data Model • File – Container for objects • Groups – provide structure among objects • Datasets – where the primary data goes • Data arrays • Rich set of datatype options • Flexible, efficient storage and I/O • Attributes, for metadata Everything else is built essentially from these parts. HPC Oil & Gas Workshop
  • 13. www.hdfgroup.orgMarch 4, 2015 14 Structures to organize objects palette Raster image 3-D array 2-D arrayRaster image lat | lon | temp ----|-----|----- 12 | 23 | 3.1 15 | 24 | 4.2 17 | 21 | 3.6 Table “/” (root) “/TestData” “Groups” “Datasets” HPC Oil & Gas Workshop
  • 14. www.hdfgroup.org HDF5 Dataset March 4, 2015 15 • HDF5 datasets organize and contain data elements. • HDF5 datatype describes individual data elements. • HDF5 dataspace describes the logical layout of the data elements. Integer: 32-bit, LE HDF5 Datatype Multi-dimensional array of identically typed data elements Specifications for single data element and array dimensions 3 Rank Dim[2] = 7 Dimensions Dim[0] = 4 Dim[1] = 5 HDF5 Dataspace HPC Oil & Gas Workshop
  • 15. www.hdfgroup.org HDF5 Attributes • Typically contain user metadata • Have a name and a value • Attributes “decorate” HDF5 objects • Value is described by a datatype and a dataspace • Analogous to a dataset, but do not support partial I/O operations; nor can they be compressed or extended March 4, 2015 16HPC Oil & Gas Workshop
  • 16. www.hdfgroup.org HDF5 Technology Platform • HDF5 Abstract Data Model • Defines the “building blocks” for data organization and specification • Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces • HDF5 Software • Tools • Language Interfaces • HDF5 Library • HDF5 Binary File Format • Bit-level organization of HDF5 file • Defined by HDF5 File Format Specification 17March 4, 2015 HPC Oil & Gas Workshop
  • 17. www.hdfgroup.org HDF5 Software Distribution HDF5 home page: http://hdfgroup.org/HDF5/ • Latest release: HDF5 1.8.14 (1.8.15 coming in May 2015) HDF5 source code: • Written in C, and includes optional C++, Fortran 90 APIs, and High Level APIs • Contains command-line utilities (h5dump, h5repack, h5diff, ..) and compile scripts HDF5 pre-built binaries: • When possible, includes C, C++, F90, and High Level libraries. Check ./lib/libhdf5.settings file. • Built with and require the SZIP and ZLIB external libraries March 4, 2015 HPC Oil & Gas Workshop 18
  • 18. www.hdfgroup.org The General HDF5 API • C, Fortran, Java, C++, and .NET bindings • IDL, MATLAB, Python (h5py, PyTables) • C routines begin with prefix H5? ? is a character corresponding to the type of object the function acts on March 4, 2015 HPC Oil & Gas Workshop 19 Example Interfaces: H5D : Dataset interface e.g., H5Dread H5F : File interface e.g., H5Fopen H5S : dataSpace interface e.g., H5Sclose
  • 19. www.hdfgroup.org The HDF5 API • For flexibility, the API is extensive  300+ functions • This can be daunting… but there is hope  A few functions can do a lot  Start simple  Build up knowledge as more features are needed March 4, 2015 HPC Oil & Gas Workshop 20 Victorinox Swiss Army Cybertool 34
  • 20. www.hdfgroup.org General Programming Paradigm • Object is opened or created • Object is accessed, possibly many times • Object is closed • Properties of object are optionally defined Creation properties (e.g., use chunking storage) Access properties March 4, 2015 HPC Oil & Gas Workshop 21
  • 21. www.hdfgroup.org Basic Functions H5Fcreate (H5Fopen) create (open) File H5Screate_simple/H5Screate create Dataspace H5Dcreate (H5Dopen) create (open) Dataset H5Dread, H5Dwrite access Dataset H5Dclose close Dataset H5Sclose close Dataspace H5Fclose close File March 4, 2015 HPC Oil & Gas Workshop 22
  • 22. www.hdfgroup.org Useful Tools For New Users March 4, 2015 HPC Oil & Gas Workshop 23 h5dump: Tool to “dump” or display contents of HDF5 files h5cc, h5c++, h5fc: Scripts to compile applications HDFView: Java browser to view HDF5 files http://www.hdfgroup.org/hdf-java-html/hdfview/ HDF5 Examples (C, Fortran, Java, Python, Matlab) http://www.hdfgroup.org/ftp/HDF5/examples/
  • 23. www.hdfgroup.org OVERVIEW OF PARALLEL HDF5 DESIGN March 4, 2015 HPC Oil & Gas Workshop 24
  • 24. www.hdfgroup.org • Parallel HDF5 should allow multiple processes to perform I/O to an HDF5 file at the same time • Single file image for all processes • Compare with one file per process design: • Expensive post processing • Not usable by different number of processes • Too many files produced for file system • Parallel HDF5 should use a standard parallel I/O interface • Must be portable to different platforms Parallel HDF5 Requirements March 4, 2015 HPC Oil & Gas Workshop 25
  • 25. www.hdfgroup.org Design requirements, cont • Support Message Passing Interface (MPI) programming • Parallel HDF5 files compatible with serial HDF5 files • Shareable between different serial or parallel platforms March 4, 2015 HPC Oil & Gas Workshop 26
  • 26. www.hdfgroup.org Design Dependencies • MPI with MPI-IO • MPICH, OpenMPI • Vendor’s MPI-IO • Parallel file system • IBM GPFS • Lustre • PVFS March 4, 2015 HPC Oil & Gas Workshop 27
  • 27. www.hdfgroup.org PHDF5 implementation layers HDF5 Application Compute node Compute node Compute node HDF5 Library MPI Library HDF5 file on Parallel File System Switch network + I/O servers Disk architecture and layout of data on disk March 4, 2015 HPC Oil & Gas Workshop 28
  • 28. www.hdfgroup.org MPI-IO VS. HDF5 March 4, 2015 HPC Oil & Gas Workshop 29
  • 29. www.hdfgroup.org MPI-IO • MPI-IO is an Input/Output API • It treats the data file as a “linear byte stream” and each MPI application needs to provide its own file and data representations to interpret those bytes March 4, 2015 HPC Oil & Gas Workshop 30
  • 30. www.hdfgroup.org MPI-IO • All data stored are machine dependent except the “external32” representation • External32 is defined in Big Endianness • Little-endian machines have to do the data conversion in both read or write operations • 64-bit sized data types may lose information March 4, 2015 HPC Oil & Gas Workshop 31
  • 31. www.hdfgroup.org MPI-IO vs. HDF5 • HDF5 is data management software • It stores data and metadata according to the HDF5 data format definition • HDF5 file is self-describing • Each machine can store the data in its own native representation for efficient I/O without loss of data precision • Any necessary data representation conversion is done by the HDF5 library automatically March 4, 2015 HPC Oil & Gas Workshop 32
  • 33. www.hdfgroup.org Consistency Semantics • Consistency Semantics: Rules that define the outcome of multiple, possibly concurrent, accesses to an object or data structure by one or more processes in a computer system. March 4, 2015 HPC Oil & Gas Workshop 34
  • 34. www.hdfgroup.org Parallel HDF5 Consistency Semantics • Parallel HDF5 library defines a set of consistency semantics to let users know what to expect when processes access data managed by the library. • When the changes a process makes are actually visible to itself (if it tries to read back that data) or to other processes that access the same file with independent or collective I/O operations March 4, 2015 HPC Oil & Gas Workshop 35
  • 35. www.hdfgroup.org Parallel HDF5 Consistency Semantics • Same as MPI-I/O semantics • Default MPI-I/O semantics doesn’t guarantee atomicity or sequence of above calls! • Problems may occur (although we haven’t seen any) when writing/reading HDF5 metadata or raw data Process 0 Process 1 MPI_File_write_at() MPI_Barrier() MPI_Barrier() MPI_File_read_at() March 4, 2015 HPC Oil & Gas Workshop 36
  • 36. www.hdfgroup.org • MPI I/O provides atomicity and sync-barrier- sync features to address the issue • PHDF5 follows MPI I/O • H5Fset_mpio_atomicity function to turn on MPI atomicity • H5Fsync function to transfer written data to storage device (in implementation now) March 4, 2015 HPC Oil & Gas Workshop 37 Parallel HDF5 Consistency Semantics
  • 37. www.hdfgroup.org • For more information see “Enabling a strict consistency semantics model in parallel HDF5” linked from the HDF5 H5Fset_mpi_atomicity Reference Manual page1 1 http://www.hdfgroup.org/HDF5/doc/RM/Advanced/PHDF5FileConsistencySem antics/PHDF5FileConsistencySemantics.pdf March 4, 2015 HPC Oil & Gas Workshop 38 Parallel HDF5 Consistency Semantics
  • 38. www.hdfgroup.org HDF5 PARALLEL PROGRAMMING MODEL March 4, 2015 HPC Oil & Gas Workshop 39
  • 39. www.hdfgroup.org How to compile PHDF5 applications • h5pcc – HDF5 C compiler command • Similar to mpicc • h5pfc – HDF5 F90 compiler command • Similar to mpif90 • To compile: • % h5pcc h5prog.c • % h5pfc h5prog.f90 March 4, 2015 HPC Oil & Gas Workshop 40
  • 40. www.hdfgroup.org Programming restrictions • PHDF5 opens a parallel file with an MPI communicator • Returns a file handle • Future access to the file via the file handle • All processes must participate in collective PHDF5 APIs • Different files can be opened via different communicators March 4, 2015 HPC Oil & Gas Workshop 41
  • 41. www.hdfgroup.org Collective HDF5 calls • All HDF5 APIs that modify structural metadata are collective! • File operations - H5Fcreate, H5Fopen, H5Fclose, etc • Object creation - H5Dcreate, H5Dclose, etc • Object structure modification (e.g., dataset extent modification) - H5Dset_extent, etc • http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html March 4, 2015 HPC Oil & Gas Workshop 42
  • 42. www.hdfgroup.org Other HDF5 calls • Array data transfer can be collective or independent - Dataset operations: H5Dwrite, H5Dread • Collectiveness is indicated by function parameters, not by function names as in MPI API March 4, 2015 HPC Oil & Gas Workshop 43
  • 43. www.hdfgroup.org What does PHDF5 support ? • After a file is opened by the processes of a communicator • All parts of file are accessible by all processes • All objects in the file are accessible by all processes • Multiple processes may write to the same data array • Each process may write to individual data array March 4, 2015 HPC Oil & Gas Workshop 44
  • 44. www.hdfgroup.org PHDF5 API languages • C and F90, 2003 language interfaces • Most platforms with MPI-IO supported. e.g., • IBM BG/x • Linux clusters • Cray March 4, 2015 HPC Oil & Gas Workshop 45
  • 45. www.hdfgroup.org Programming model • HDF5 uses access property list to control the file access mechanism • General model to access HDF5 file in parallel: - Set up MPI-IO file access property list - Open File - Access Data - Close File March 4, 2015 HPC Oil & Gas Workshop 46
  • 46. www.hdfgroup.org Example of Serial HDF5 C program
  1.
  2.
  3. file_id = H5Fcreate(FNAME, …, H5P_DEFAULT);
  4. space_id = H5Screate_simple(…);
  5. dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, …);
  6.
  7.
  8. status = H5Dwrite(dset_id, H5T_NATIVE_INT, …, H5P_DEFAULT, …);
  March 4, 2015 HPC Oil & Gas Workshop 48
  • 47. www.hdfgroup.org Example of Parallel HDF5 C program
  Parallel HDF5 program has extra calls:
  MPI_Init(&argc, &argv);
  1. fapl_id = H5Pcreate(H5P_FILE_ACCESS);
  2. H5Pset_fapl_mpio(fapl_id, comm, info);
  3. file_id = H5Fcreate(FNAME, …, fapl_id);
  4. space_id = H5Screate_simple(…);
  5. dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, …);
  6. xf_id = H5Pcreate(H5P_DATASET_XFER);
  7. H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
  8. status = H5Dwrite(dset_id, H5T_NATIVE_INT, …, xf_id, …);
  MPI_Finalize();
  March 4, 2015 HPC Oil & Gas Workshop 49
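  A minimal, self-contained sketch expanding the skeleton above; the file name "example.h5" and the one-row-per-rank decomposition are assumptions for illustration, not from the slides. Every rank collectively creates the file and dataset, then writes its rank number into its own row.

  #include <mpi.h>
  #include "hdf5.h"

  int main(int argc, char **argv)
  {
      int mpi_size, mpi_rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
      MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

      /* File access property list with MPI-IO, then collective file create */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
      hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      /* Dataset of mpi_size rows x 5 columns, created collectively */
      hsize_t dims[2] = {(hsize_t)mpi_size, 5};
      hid_t filespace = H5Screate_simple(2, dims, NULL);
      hid_t dset = H5Dcreate(file, "IntArray", H5T_NATIVE_INT, filespace,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      /* Each rank selects its own row in the file */
      hsize_t count[2]  = {1, 5};
      hsize_t offset[2] = {(hsize_t)mpi_rank, 0};
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
      hid_t memspace = H5Screate_simple(2, count, NULL);

      int data[5];
      for (int i = 0; i < 5; i++)
          data[i] = mpi_rank;

      /* Collective write */
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
      H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, data);

      /* Close everything before finalizing MPI */
      H5Pclose(dxpl);
      H5Sclose(memspace);
      H5Sclose(filespace);
      H5Dclose(dset);
      H5Pclose(fapl);
      H5Fclose(file);
      MPI_Finalize();
      return 0;
  }

  With a parallel HDF5 build this would be compiled and run with the commands shown earlier, e.g. % h5pcc example.c and % mpirun -np 4 ./a.out.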
  • 48. www.hdfgroup.org WRITING PATTERNS - EXAMPLE March 4, 2015 HPC Oil & Gas Workshop 50
  • 49. www.hdfgroup.org Parallel HDF5 tutorial examples • For sample programs of how to write different data patterns see: http://www.hdfgroup.org/HDF5/Tutor/parallel.html March 4, 2015 HPC Oil & Gas Workshop 51
  • 50. www.hdfgroup.org Programming model • Each process defines memory and file hyperslabs using H5Sselect_hyperslab • Each process executes a write/read call using the hyperslabs defined; the call can be either collective or independent • The hyperslab parameters define the portion of the dataset to write to: - Contiguous hyperslab - Regularly spaced data (column or row) - Pattern - Blocks March 4, 2015 HPC Oil & Gas Workshop 52
  • 51. www.hdfgroup.org Four processes writing by rows
  HDF5 "SDS_row.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
        DATA {
           10, 10, 10, 10, 10,
           10, 10, 10, 10, 10,
           11, 11, 11, 11, 11,
           11, 11, 11, 11, 11,
           12, 12, 12, 12, 12,
           12, 12, 12, 12, 12,
           13, 13, 13, 13, 13,
           13, 13, 13, 13, 13
  March 4, 2015 HPC Oil & Gas Workshop 53
  • 52. www.hdfgroup.org Parallel HDF5 example code
  /*
   * Each process defines dataset in memory and writes it to the
   * hyperslab in the file.
   */
  count[0] = dims[0] / mpi_size;
  count[1] = dims[1];
  offset[0] = mpi_rank * count[0];
  offset[1] = 0;
  memspace = H5Screate_simple(RANK, count, NULL);

  /*
   * Select hyperslab in the file.
   */
  filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
  March 4, 2015 HPC Oil & Gas Workshop 54
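  The fragment stops just before the write; a hedged continuation (variable names follow the fragment above, the data buffer is assumed to be allocated and filled elsewhere) would look like:

  plist_id = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

  /* Collective write of each rank's rows into its file hyperslab */
  status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, plist_id, data);

  H5Pclose(plist_id);
  H5Sclose(memspace);
  H5Sclose(filespace);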
  • 53. www.hdfgroup.org Two processes writing by columns
  HDF5 "SDS_col.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
        DATA {
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200
  March 4, 2015 HPC Oil & Gas Workshop 55
  • 54. www.hdfgroup.org Four processes writing by pattern
  HDF5 "SDS_pat.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
        DATA {
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4
  March 4, 2015 HPC Oil & Gas Workshop 56
  • 55. www.hdfgroup.org Four processes writing by blocks
  HDF5 "SDS_blk.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
        DATA {
           1, 1, 2, 2,
           1, 1, 2, 2,
           1, 1, 2, 2,
           1, 1, 2, 2,
           3, 3, 4, 4,
           3, 3, 4, 4,
           3, 3, 4, 4,
           3, 3, 4, 4
  March 4, 2015 HPC Oil & Gas Workshop 57
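  A hedged fragment showing how the block pattern above can be produced for a 2 x 2 process grid over the 8 x 4 dataset; the grid shape and variable names are assumptions matching the dump, and filespace and mpi_rank are assumed to exist:

  hsize_t dims[2]   = {8, 4};
  hsize_t count[2]  = {dims[0] / 2, dims[1] / 2};                 /* 4 x 2 block  */
  hsize_t offset[2] = {(hsize_t)(mpi_rank / 2) * count[0],        /* block row    */
                       (hsize_t)(mpi_rank % 2) * count[1]};       /* block column */

  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
  memspace = H5Screate_simple(2, count, NULL);
  /* Rank 0 owns the top-left block, rank 1 top-right, rank 2 bottom-left,
     rank 3 bottom-right. */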
  • 57. www.hdfgroup.org Examples of irregular selection • Internally, the HDF5 library creates an MPI datatype for each lower dimension in the selection and then combines those types into one giant structured MPI datatype March 4, 2015 HPC Oil & Gas Workshop 59
  • 58. www.hdfgroup.org PERFORMANCE ANALYSIS March 4, 2015 HPC Oil & Gas Workshop 60
  • 59. www.hdfgroup.org Performance analysis • Some common causes of poor performance • Possible solutions March 4, 2015 HPC Oil & Gas Workshop 61
  • 60. www.hdfgroup.org My PHDF5 application I/O is slow • See "Tuning HDF5 for Lustre File Systems" by Howison, Koziol, Knaak, Mainzer, and Shalf1, which covers: • Chunking and hyperslab selection • HDF5 metadata cache • Specific I/O system hints March 4, 2015 HPC Oil & Gas Workshop 62 1 http://www.hdfgroup.org/pubs/papers/howison_hdf5_lustre_iasds2010.pdf
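  As an illustration of the "I/O system hints" item, a hedged sketch of passing Lustre-related MPI-IO hints through the file access property list; the hint names ("striping_factor", "striping_unit") are ROMIO conventions, the values are made-up examples rather than recommendations from the slides, and whether they are honored depends on the MPI-IO implementation:

  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "striping_factor", "32");        /* number of OSTs    */
  MPI_Info_set(info, "striping_unit",   "1048576");   /* 1 MiB stripe size */

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
  hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  MPI_Info_free(&info);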
  • 61. www.hdfgroup.org Collective vs. independent calls • MPI definition of collective calls: • All processes of the communicator must participate in the calls, in the same order. E.g.,
  Process 1                  Process 2
  call A(); call B();        call A(); call B();    **right**
  call A(); call B();        call B(); call A();    **wrong**
  • Independent means not collective :-) • Collective is not necessarily synchronous, nor does it necessarily require communication March 4, 2015 HPC Oil & Gas Workshop 64
  • 62. www.hdfgroup.org Independent vs. collective access • A user reported that independent data transfer mode was much slower than collective data transfer mode • The data array was tall and thin: 230,000 rows by 6 columns March 4, 2015 HPC Oil & Gas Workshop 65
  • 63. www.hdfgroup.org Debug Slow Parallel I/O Speed (1) • Writing to one dataset - Using 4 processes == 4 columns - HDF5 datatype is 8-byte doubles - 4 processes, 1000 rows == 4 x 1000 x 8 = 32,000 bytes • % mpirun -np 4 ./a.out 1000 - Execution time: 1.783798 s. • % mpirun -np 4 ./a.out 2000 - Execution time: 3.838858 s. • A difference of ~2 seconds for 1,000 more rows (32,000 bytes) • A speed of 16 KB/sec!!! Way too slow. March 4, 2015 HPC Oil & Gas Workshop 66
  • 64. www.hdfgroup.org Debug slow parallel I/O speed (2) • Build a version of PHDF5 with • ./configure --enable-debug --enable-parallel … • This allows tracing of the MPI-IO calls made by the HDF5 library. • E.g., to trace • MPI_File_read_xx and MPI_File_write_xx calls • % setenv H5FD_mpio_Debug "rw" March 4, 2015 HPC Oil & Gas Workshop 67
  • 65. www.hdfgroup.org Debug slow parallel I/O speed (3)
  % setenv H5FD_mpio_Debug 'rw'
  % mpirun -np 4 ./a.out 1000    # Indep.; contiguous.
  in H5FD_mpio_write mpi_off=0    size_i=96
  in H5FD_mpio_write mpi_off=0    size_i=96
  in H5FD_mpio_write mpi_off=0    size_i=96
  in H5FD_mpio_write mpi_off=0    size_i=96
  in H5FD_mpio_write mpi_off=2056 size_i=8
  in H5FD_mpio_write mpi_off=2048 size_i=8
  in H5FD_mpio_write mpi_off=2072 size_i=8
  in H5FD_mpio_write mpi_off=2064 size_i=8
  in H5FD_mpio_write mpi_off=2088 size_i=8
  in H5FD_mpio_write mpi_off=2080 size_i=8
  …
  • A total of 4,000 of these little 8-byte writes == 32,000 bytes.
  March 4, 2015 HPC Oil & Gas Workshop 68
  • 66. www.hdfgroup.org Independent calls are many and small • Each process writes one element of one row, skips to the next row, writes one element, and so on. • Each process issues 230,000 writes of 8 bytes each. March 4, 2015 HPC Oil & Gas Workshop 69
  • 67. www.hdfgroup.org Debug slow parallel I/O speed (4)
  % setenv H5FD_mpio_Debug 'rw'
  % mpirun -np 4 ./a.out 1000    # Indep., chunked by column.
  in H5FD_mpio_write mpi_off=0     size_i=96
  in H5FD_mpio_write mpi_off=0     size_i=96
  in H5FD_mpio_write mpi_off=0     size_i=96
  in H5FD_mpio_write mpi_off=0     size_i=96
  in H5FD_mpio_write mpi_off=3688  size_i=8000
  in H5FD_mpio_write mpi_off=11688 size_i=8000
  in H5FD_mpio_write mpi_off=27688 size_i=8000
  in H5FD_mpio_write mpi_off=19688 size_i=8000
  in H5FD_mpio_write mpi_off=96    size_i=40
  in H5FD_mpio_write mpi_off=136   size_i=544
  in H5FD_mpio_write mpi_off=680   size_i=120
  in H5FD_mpio_write mpi_off=800   size_i=272
  …
  Execution time: 0.011599 s.
  March 4, 2015 HPC Oil & Gas Workshop 70
  • 68. www.hdfgroup.org Use collective mode or chunked storage • Collective I/O will combine many small independent calls into a few bigger calls • Chunking by column speeds things up too (see the sketch below) March 4, 2015 HPC Oil & Gas Workshop 71
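  A hedged sketch of the chunked-by-column fix for the 230,000 x 6 case; the exact chunk shape is an assumption (the slides only say chunks of columns), and file_id is assumed to be an open parallel file:

  hsize_t dims[2]  = {230000, 6};
  hsize_t chunk[2] = {230000, 1};               /* one whole column per chunk */

  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 2, chunk);

  hid_t space = H5Screate_simple(2, dims, NULL);
  hid_t dset  = H5Dcreate(file_id, "tall_thin", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);
  /* Each rank now writes its column(s) as a few large contiguous chunks
     instead of 230,000 separate 8-byte writes. */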
  • 69. www.hdfgroup.org Collective vs. independent write • [Chart: seconds to write vs. data size in MB (0.25, 0.5, 1, 1.88, 2.29, 2.75), comparing independent write and collective write] March 4, 2015 HPC Oil & Gas Workshop 72
  • 70. www.hdfgroup.org Collective I/O in HDF5 • Set up using a Data Transfer Property List (DXPL) • All processes must participate in the I/O call (H5Dread/write) with a selection (which could be a NULL selection) • Some cases where collective I/O is not used even when the user asks for it: • Data conversion • Compressed storage • Chunked storage, when a chunk is not selected by a certain number of processes March 4, 2015 HPC Oil & Gas Workshop 73
  • 71. www.hdfgroup.org Enabling Collective Parallel I/O with HDF5
  /* Set up file access property list w/parallel I/O access */
  fa_plist_id = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fa_plist_id, comm, info);

  /* Create a new file collectively */
  file_id = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fa_plist_id);

  /* <omitted data decomposition for brevity> */

  /* Set up data transfer property list w/collective MPI-IO */
  dx_plist_id = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dx_plist_id, H5FD_MPIO_COLLECTIVE);

  /* Write data elements to the dataset */
  status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, dx_plist_id, data);
  March 4, 2015 HPC Oil & Gas Workshop 74
  • 72. www.hdfgroup.org Collective I/O in HDF5 • Can query Data Transfer Property List (DXPL) after I/O for collective I/O status: • H5Pget_mpio_actual_io_mode • Retrieves the type of I/O that HDF5 actually performed on the last parallel I/O call • H5Pget_mpio_no_collective_cause • Retrieves local and global causes that broke collective I/O on the last parallel I/O call • H5Pget_mpio_actual_chunk_opt_mode • Retrieves the type of chunk optimization that HDF5 actually performed on the last parallel I/O call. This is not necessarily the type of optimization requested March 4, 2015 HPC Oil & Gas Workshop 75
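  A minimal fragment showing these queries right after a write; dset_id, memspace, filespace, data, and a collective dxpl_id are assumed to be set up as above, <stdio.h> is assumed for printf, and the diagnostic message is illustrative:

  H5D_mpio_actual_io_mode_t io_mode;
  uint32_t local_cause, global_cause;

  H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, dxpl_id, data);

  /* Ask the same transfer property list what HDF5 actually did */
  H5Pget_mpio_actual_io_mode(dxpl_id, &io_mode);
  H5Pget_mpio_no_collective_cause(dxpl_id, &local_cause, &global_cause);

  if (io_mode == H5D_MPIO_NO_COLLECTIVE)
      printf("collective I/O was broken: local cause 0x%x, global cause 0x%x\n",
             (unsigned)local_cause, (unsigned)global_cause);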
  • 73. www.hdfgroup.org EFFECT OF HDF5 STORAGE March 4, 2015 HPC Oil & Gas Workshop 76
  • 74. www.hdfgroup.org Contiguous storage • Metadata header separate from dataset data • Data stored in one contiguous block in the HDF5 file • [Diagram: dataset header (datatype, dataspace, attributes, …) held in the metadata cache in application memory; dataset data stored as a single contiguous block in the file] March 4, 2015 HPC Oil & Gas Workshop 77
  • 75. www.hdfgroup.org On a parallel file system • [Diagram: the dataset's contiguous data striped across OST 1–4] • The file is striped over multiple OSTs depending on the stripe size and stripe count that the file was created with. March 4, 2015 HPC Oil & Gas Workshop 78
  • 76. www.hdfgroup.org Chunked storage • Data is stored in chunks of predefined size • The two-dimensional case is sometimes referred to as data tiling • The HDF5 library writes/reads whole chunks • [Diagram: contiguous vs. chunked layout] March 4, 2015 HPC Oil & Gas Workshop 79
  • 77. www.hdfgroup.org Chunked storage (cont.) • Dataset data is divided into equally sized blocks (chunks). • Each chunk is stored separately as a contiguous block in the HDF5 file. • [Diagram: dataset header and chunk index held in the metadata cache; chunks A–D stored as separate contiguous blocks in the file] March 4, 2015 HPC Oil & Gas Workshop 80
  • 78. www.hdfgroup.org On a parallel file system • [Diagram: header, chunk index, and chunks A–D striped across OST 1–4] • The file is striped over multiple OSTs depending on the stripe size and stripe count that the file was created with. March 4, 2015 HPC Oil & Gas Workshop 81
  • 79. www.hdfgroup.org Which is better for performance? • It depends!! • Consider two example selections (illustrated on the original slide): • For the first selection: contiguous layout needs 2 seeks, chunked needs 10 • For the second selection: contiguous needs 16 seeks, chunked needs 4 • Add to that striping over a parallel file system, which makes this problem very hard to solve! March 4, 2015 HPC Oil & Gas Workshop 82
  • 80. www.hdfgroup.org Chunking and hyperslab selection • When writing or reading, try to use hyperslab selections that coincide with chunk boundaries. • [Diagram: processes P1–P3 each selecting whole chunks] March 4, 2015 HPC Oil & Gas Workshop 83
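  A hedged fragment of what chunk-aligned selections can look like; the chunk and selection shapes are illustrative assumptions (each rank selects exactly one chunk), and dcpl, filespace, and mpi_rank are assumed to exist:

  hsize_t chunk[2]  = {1000, 6};                            /* chunk = 1000 rows     */
  hsize_t count[2]  = {1000, 6};                            /* selection = one chunk */
  hsize_t offset[2] = {(hsize_t)mpi_rank * count[0], 0};

  H5Pset_chunk(dcpl, 2, chunk);                             /* at dataset creation   */
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
  /* Each rank's selection starts and ends on a chunk boundary, so no chunk
     is shared between ranks. */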
  • 81. www.hdfgroup.org EFFECT OF HDF5 METADATA CACHE March 4, 2015 HPC Oil & Gas Workshop 88
  • 82. www.hdfgroup.org Parallel HDF5 and Metadata • Metadata operations: • Creating/removing a dataset, group, attribute, etc… • Extending a dataset’s dimensions • Modifying group hierarchy • etc … • All operations that modify metadata are collective, i.e., all processes have to call that operation: • If you have 10,000 processes running your application, and one process needs to create a dataset, ALL processes must call H5Dcreate to create 1 dataset. March 4, 2015 HPC Oil & Gas Workshop 89
  • 83. www.hdfgroup.org Space allocation • Allocating space at the file’s EOF is very simple in serial HDF5 applications: • The EOF value begins at offset 0 in the file • When space is required, the EOF value is incremented by the size of the block requested. • Space allocation using the EOF value in parallel HDF5 applications can result in a race condition if processes do not synchronize with each other: • Multiple processes believe that they are the sole owner of a range of bytes within the HDF5 file. • Solution: Make it Collective March 4, 2015 HPC Oil & Gas Workshop 90
  • 84. www.hdfgroup.org Metadata cache • To handle synchronization issues, all HDF5 operations that could potentially modify the metadata in an HDF5 file are required to be collective • A list of these routines is available in the HDF5 reference manual: http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html March 4, 2015 HPC Oil & Gas Workshop 93
  • 85. www.hdfgroup.org Managing the metadata cache • All operations that modify metadata in the HDF5 file are collective: • All processes will have the same dirty metadata entries in their cache (i.e., metadata that is inconsistent with what is on disk). • Processes are not required to have the same clean metadata entries (i.e., metadata that is in sync with what is on disk). • Internally, the metadata cache running on process 0 is responsible for managing changes to the metadata in the HDF5 file. • All the other caches must retain dirty metadata until the process 0 cache tells them that the metadata is clean (i.e., on disk). March 4, 2015 HPC Oil & Gas Workshop 94
  • 86. www.hdfgroup.org Flushing the cache • Initiated when: • The size of dirty entries in cache exceeds a certain threshold • The user calls a flush • The actual flush of metadata entries to disk is currently implemented in two ways: • Single Process (Process 0) write • Distributed write March 4, 2015 HPC Oil & Gas Workshop 99
  • 87. www.hdfgroup.org PARALLEL TOOLS March 4, 2015 HPC Oil & Gas Workshop 102
  • 88. www.hdfgroup.org Parallel tools • h5perf • Performance measuring tool showing I/O performance for different I/O APIs March 4, 2015 HPC Oil & Gas Workshop 103
  • 89. www.hdfgroup.org h5perf • An I/O performance measurement tool • Tests 3 File I/O APIs: • POSIX I/O (open/write/read/close…) • MPI-I/O (MPI_File_{open,write,read,close}) • HDF5 (H5Fopen/H5Dwrite/H5Dread/H5Fclose) • An indication of I/O speed upper limits March 4, 2015 HPC Oil & Gas Workshop 104
  • 90. www.hdfgroup.org Useful parallel HDF5 links • Parallel HDF information site http://www.hdfgroup.org/HDF5/PHDF5/ • Parallel HDF5 tutorial available at http://www.hdfgroup.org/HDF5/Tutor/ • HDF Help email address help@hdfgroup.org March 4, 2015 HPC Oil & Gas Workshop 105
  • 91. www.hdfgroup.org UPCOMING FEATURES IN HDF5 March 4, 2015 HPC Oil & Gas Workshop 106
  • 92. www.hdfgroup.org PHDF5 Improvements in Progress • Multi-dataset read/write operations • Allows single collective operation on multiple datasets • Similar to PnetCDF “write-combining” feature • H5Dmulti_read/write(<array of datasets, selections, etc>) • Order of magnitude speedup March 4, 2015 HPC Oil & Gas Workshop 107
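  For orientation only, a hedged sketch of what a multi-dataset write can look like. This API was still in progress when the talk was given; the call below follows the H5Dwrite_multi() signature that eventually shipped in HDF5 1.14, so treat the name and argument order as assumptions rather than the 2015 prototype, and all handles and buffers as placeholders:

  hid_t       dsets[2]   = {dset_a, dset_b};
  hid_t       mtypes[2]  = {H5T_NATIVE_INT, H5T_NATIVE_INT};
  hid_t       mspaces[2] = {memspace_a, memspace_b};
  hid_t       fspaces[2] = {filespace_a, filespace_b};
  const void *bufs[2]    = {buf_a, buf_b};

  /* One collective call covering both datasets, instead of two H5Dwrite calls */
  H5Dwrite_multi(2, dsets, mtypes, mspaces, fspaces, coll_dxpl, bufs);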
  • 93. www.hdfgroup.org H5Dwrite vs. H5Dwrite_multi • [Chart: write time in seconds vs. number of datasets (400–6400) for H5Dwrite and H5Dwrite_multi; rank = 1, dims = 200, contiguous floating-point datasets] March 4, 2015 HPC Oil & Gas Workshop 109
  • 94. www.hdfgroup.org PHDF5 Improvements in Progress • Avoid file truncation • The file format currently requires a call to truncate the file when closing • Expensive in parallel (MPI_File_set_size) • A change to the file format will eliminate the truncate call March 4, 2015 HPC Oil & Gas Workshop 110
  • 95. www.hdfgroup.org PHDF5 Improvements in Progress • Collective Object Open • Currently, object open is independent • All processes perform I/O to read metadata from the file, resulting in an I/O storm at the file system • The change will allow a single process to read, then broadcast the metadata to the other processes March 4, 2015 HPC Oil & Gas Workshop 111
  • 96. www.hdfgroup.org Collective Object Open Performance March 4, 2015 HPC Oil & Gas Workshop 112
  • 97. www.hdfgroup.org Other HDF5 Improvements in Progress • Single-Writer/Multiple-Reader (SWMR) • Virtual Object Layer (VOL) • Virtual Datasets March 4, 2015 HPC Oil & Gas Workshop 126
  • 98. www.hdfgroup.org Single-Writer/Multiple-Reader (SWMR) • Improves HDF5 for data acquisition: • Allows simultaneous data gathering and monitoring/analysis • Focused on storing data sequences from high-speed data sources • Supports 'ordered updates' to the file: • Crash-proofs access to the HDF5 file • May use a small amount of extra space January 21, 2015 127 Computing for Light and Neutron Sources Forum
  • 99. www.hdfgroup.org Virtual Object Layer (VOL) • Goal - Provide an application with the HDF5 data model and API, but allow different underlying storage mechanisms • New layer below HDF5 API - Intercepts all API calls that can touch the data on disk and routes them to a VOL plugin • Potential VOL plugins: - Native HDF5 driver (writes to HDF5 file) - Raw driver (maps groups to file system directories and datasets to files in directories) - Remote driver (the file exists on a remote machine) March 4, 2015 HPC Oil & Gas Workshop 129
  • 100. www.hdfgroup.org VOL Plugins March 4, 2015 HPC Oil & Gas Workshop 130
  • 101. www.hdfgroup.org Raw Plugin • The flexibility of the virtual object layer gives developers the option to abandon the single-file binary format used by the native HDF5 implementation. • A "raw" file format could map HDF5 objects (groups, datasets, etc.) to file system objects (directories, files, etc.). • The entire set of raw file system objects created would represent one HDF5 container. March 4, 2015 HPC Oil & Gas Workshop 139
  • 102. www.hdfgroup.org Remote Plugin • A remote VOL plugin would allow access to files located on a server. • Prototyping two implementations: • Web services via RESTful access: http://www.hdfgroup.org/projects/hdfserver/ • Native HDF5 file access over sockets: http://svn.hdfgroup.uiuc.edu/h5netvol/trunk/ March 4, 2015 HPC Oil & Gas Workshop 140
  • 103. www.hdfgroup.org Virtual Datasets • Mechanism for creating a composition of multiple source datasets, while accessing them through a single virtual dataset • Modifications to the source datasets are visible through the virtual dataset • And writing to the virtual dataset modifies the source datasets • Subsets within a source dataset can be mapped to subsets within the virtual dataset • Source and virtual datasets can have unlimited dimensions • Source datasets can themselves be virtual datasets March 4, 2015 HPC Oil & Gas Workshop 145
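  Virtual datasets were also an upcoming feature at the time; a hedged fragment using the H5Pset_virtual() call that later shipped in HDF5 1.10 maps the 100-element dataset "/A" from four source files into one 400-element virtual dataset. The file and dataset names are invented for illustration, file_id is assumed to be an open HDF5 file, and <stdio.h> is assumed for snprintf:

  hsize_t vdims[1] = {400}, sdims[1] = {100};
  hsize_t offset[1], count[1] = {100};
  hid_t   vspace   = H5Screate_simple(1, vdims, NULL);
  hid_t   srcspace = H5Screate_simple(1, sdims, NULL);   /* all elements selected */
  hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
  char    srcfile[64];

  for (int i = 0; i < 4; i++) {
      snprintf(srcfile, sizeof srcfile, "source%d.h5", i);
      offset[0] = (hsize_t)i * 100;
      H5Sselect_hyperslab(vspace, H5S_SELECT_SET, offset, NULL, count, NULL);
      /* Map 100 elements of "/A" in each source file into this slice of the VDS */
      H5Pset_virtual(dcpl, vspace, srcfile, "/A", srcspace);
  }
  hid_t vdset = H5Dcreate(file_id, "VDS", H5T_NATIVE_INT, vspace,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);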
  • 104. www.hdfgroup.org Virtual Datasets, Example 1 March 4, 2015 HPC Oil & Gas Workshop 146
  • 105. www.hdfgroup.org Virtual Datasets, Example 2 March 4, 2015 HPC Oil & Gas Workshop 147
  • 106. www.hdfgroup.org Virtual Datasets, Example 3 March 4, 2015 HPC Oil & Gas Workshop 148
  • 107. www.hdfgroup.org HDF5 Roadmap • Concurrency • Single-Writer/Multiple-Reader (SWMR) • Internal threading • Virtual Object Layer (VOL) • Data Analysis • Query / View / Index APIs • Native HDF5 client/server • Performance • Scalable chunk indices • Metadata aggregation and page buffering • Asynchronous I/O • Variable-length records • Fault tolerance • Parallel I/O • I/O Autotuning • Extreme Scale Computing HDF5 • "The best way to predict the future is to invent it." – Alan Kay March 4, 2015 149
  • 108. www.hdfgroup.org The HDF Group Thank You! Questions? March 4, 2015 HPC Oil & Gas Workshop 150
  • 109. www.hdfgroup.org Codename "HEXAD" • Excel is a great frontend with a not-so-great rear ;-) • We've fixed that with an HDF5 Excel add-in • Lets you do the usual things, including: • Display content (file structure, detailed object info) • Create/read/write datasets • Create/read/update attributes • Plenty of ideas for bells and whistles, e.g., HDF5 image & PyTables support • Send in* your Must Have/Nice To Have feature list! • Stay tuned for the beta program * help@hdfgroup.org March 4, 2015 151 HPC Oil & Gas Workshop
  • 110. www.hdfgroup.org HDF Server • REST-based service for HDF5 data • Reference implementation for the REST API • Developed in Python using the Tornado framework • Supports read/write operations • Clients can be Python/C/Fortran or a web page • Let us know what specific features you'd like to see, e.g., a VOL REST client plugin March 4, 2015 152 HPC Oil & Gas Workshop
  • 111. www.hdfgroup.org HDF Server Architecture March 4, 2015 153 HPC Oil & Gas Workshop
  • 112. www.hdfgroup.org Restless About HDF5/REST March 4, 2015 154 HPC Oil & Gas Workshop
  • 113. www.hdfgroup.org HDF Compass • "Simple" HDF5 viewer application • Cross platform (Windows/Mac/Linux) • Native look and feel • Can display extremely large HDF5 files • View HDF5 files and OPeNDAP resources • Plugin model enables different file formats/remote resources to be supported • Community-based development model March 4, 2015 155 HPC Oil & Gas Workshop
  • 114. www.hdfgroup.org Compass Architecture March 4, 2015 156 HPC Oil & Gas Workshop