Cluster 2010 Presentation

Optimization Techniques at the I/O Forwarding Layer

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo

                                                      Contact: kazuki.ohta@gmail.com
                                                                                   1
Background: Compute and Storage Imbalance

• Leadership-class computational scale:
  • 100,000+ processes
  • Advanced Multi-core architectures, Compute node OSs
• Leadership-class storage scale:
  • 100+ servers
  • Commercial storage hardware, Cluster file system


• Current leadership-class machines supply only 1 GB/s of storage
  throughput for every 10 TF of compute performance, and this gap has grown
  by a factor of 10 in recent years.
• Bridging this imbalance between compute and storage is a critical
  problem for large-scale computation.

                                                                      2
Previous Studies: Current I/O Software Stack


   [Figure: current I/O software stack]
     Parallel/Serial Applications
     High-Level I/O Libraries: storage abstraction and data portability (HDF5, NetCDF, ADIOS)
     MPI-IO: organizing accesses from many clients (ROMIO) | POSIX I/O (VFS, FUSE)
     Parallel File Systems: logical file system over many storage devices
       (PVFS2, Lustre, GPFS, PanFS, Ceph, etc.)
     Storage Devices
                                                                                3
Challenge: Millions of Concurrent Clients

• 1,000,000+ concurrent clients present a challenge to the current I/O stack
   • e.g. metadata performance, locking, the network incast problem, etc.
• The I/O forwarding layer is introduced to address this.
   • All I/O requests are delegated to a dedicated I/O forwarder process.
   • The I/O forwarder reduces the number of clients seen by the file system for all
     applications, without requiring collective I/O.

               [Figure: I/O path - many compute processes are funneled through a small
               number of I/O forwarders down to the parallel file system (PVFS2) servers]
                                                                                                                   4
I/O Software Stack with I/O Forwarding


   [Figure: I/O software stack with I/O forwarding]
     Parallel/Serial Applications
     High-Level I/O Libraries
     MPI-IO | POSIX I/O (VFS, FUSE)
     I/O Forwarding: bridge between compute processes and the storage system
       (IBM ciod, Cray DVS, IOFSL)
     Parallel File Systems
     Storage Devices
                                                                               5
Example I/O System: Blue Gene/P Architecture




                                               6
I/O Forwarding Challenges

• Large requests
  • Forwarding latency
  • Memory limit at the forwarding node
  • Variation in backend file system node performance
• Small requests
  • The current I/O forwarding mechanism reduces the number of clients, but
    does not reduce the number of requests.
  • Request processing overheads at the file systems
  • Request processing overheads at the file systems


• We proposed two optimization techniques for the I/O forwarding layer.
  • Out-Of-Order I/O Pipelining, for large requests.
  • I/O Request Scheduler, for small requests.
                                                                          7
Out-Of-Order I/O Pipelining
• Split large I/O requests into small, fixed-size chunks.
• The chunks are forwarded out of order (a minimal sketch follows this slide).

  [Figure: request timelines for Client, IOFSL, and file system threads,
  comparing No-Pipelining with Out-Of-Order Pipelining]

• Benefits
   • Reduces forwarding latency by overlapping the I/O requests with the
     network transfer.
   • I/O sizes are not limited by the memory size at the forwarding node.
   • Little effect from the slowest file system node.
                                                                                             8
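To make the chunking concrete, here is a minimal sketch under stated assumptions; it is not the IOFSL implementation. The names (chunk_queue, worker) and the chunk size and thread count are illustrative. A pool of worker threads claims fixed-size chunks of one large request from a shared cursor and services each chunk independently, so chunks complete in whatever order the network and file system allow.

    /*
     * Sketch of out-of-order I/O pipelining (illustrative, not IOFSL source):
     * one large client write is split into fixed-size chunks, and worker
     * threads service chunks independently of their order in the file.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define CHUNK_SIZE  (8UL * 1024 * 1024)   /* assumed pipeline buffer size   */
    #define NUM_WORKERS 4                     /* assumed forwarder thread count */

    struct chunk_queue {
        size_t total;           /* total bytes in the large client request */
        size_t next_offset;     /* next chunk not yet claimed              */
        pthread_mutex_t lock;
    };

    /* Each worker claims the next unclaimed chunk and services it on its own,
     * so no chunk waits for an earlier chunk to finish.                      */
    static void *worker(void *arg)
    {
        struct chunk_queue *q = arg;
        for (;;) {
            pthread_mutex_lock(&q->lock);
            size_t off = q->next_offset;
            if (off >= q->total) {                 /* all chunks claimed */
                pthread_mutex_unlock(&q->lock);
                return NULL;
            }
            q->next_offset += CHUNK_SIZE;
            pthread_mutex_unlock(&q->lock);

            size_t len = (off + CHUNK_SIZE > q->total) ? q->total - off : CHUNK_SIZE;
            /* In a real forwarder this step would receive the chunk from the
             * client over the network and write it to the file system at `off`. */
            printf("chunk at offset %zu (%zu bytes) forwarded\n", off, len);
        }
    }

    int main(void)
    {
        /* One 64 MiB client write split into 8 MiB chunks. */
        struct chunk_queue q = { .total = 64UL * 1024 * 1024, .next_offset = 0 };
        pthread_mutex_init(&q.lock, NULL);

        pthread_t tid[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, &q);
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(tid[i], NULL);

        pthread_mutex_destroy(&q.lock);
        return 0;
    }

Because only the chunk cursor is shared, the memory needed at the forwarder is bounded by (chunk size x thread count) rather than by the size of the client request.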
I/O Request Scheduler

• Schedule and merge small requests at the forwarder (a minimal sketch follows this slide)
  • Reduces the number of seeks
  • Reduces the number of requests the file system actually sees
• Scheduling overhead must be minimal
  • Handle-Based Round-Robin (HBRR) algorithm for fairness between files
  • Ranges are managed by an interval tree
     • Contiguous requests are merged

                [Figure: per-handle (H) queues of Read/Write requests in the scheduler
                queue Q; each turn picks N requests and issues the I/O]
                                                                          9
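The sketch below illustrates the two ideas on this slide under stated assumptions; it is not the IOFSL scheduler. Requests are queued per file handle, the scheduler visits handles in round-robin order for fairness, and contiguous byte ranges in each batch are merged before being issued. IOFSL tracks ranges with an interval tree; a sorted array is used here only to keep the example short, and all names (handle_queue, issue_merged) are illustrative.

    /*
     * Sketch of a Handle-Based Round-Robin (HBRR) request scheduler with
     * range merging (illustrative, not IOFSL source).
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct request { long offset, length; };

    struct handle_queue {
        int handle;              /* file handle id                 */
        struct request *reqs;    /* pending requests for this file */
        int count;
    };

    static int by_offset(const void *a, const void *b)
    {
        const struct request *x = a, *y = b;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    /* Sort one batch, merge contiguous/overlapping ranges, and issue one
     * I/O per merged range instead of one per original request.          */
    static void issue_merged(int handle, struct request *batch, int n)
    {
        qsort(batch, n, sizeof *batch, by_offset);
        long start = batch[0].offset, end = start + batch[0].length;
        for (int i = 1; i <= n; i++) {
            if (i < n && batch[i].offset <= end) {         /* extend current range */
                long e = batch[i].offset + batch[i].length;
                if (e > end) end = e;
            } else {
                printf("handle %d: issue [%ld, %ld) as one request\n",
                       handle, start, end);
                if (i < n) { start = batch[i].offset; end = start + batch[i].length; }
            }
        }
    }

    int main(void)
    {
        /* Two files, each with small writes arriving from many clients. */
        struct request f0[] = { {0, 64}, {64, 64}, {256, 64}, {128, 64} };
        struct request f1[] = { {512, 64}, {576, 64} };
        struct handle_queue queues[] = { {0, f0, 4}, {1, f1, 2} };

        /* Round-robin over handles: each turn drains one handle's batch,
         * so no single file can starve the others.                      */
        for (int turn = 0; turn < 2; turn++)
            issue_merged(queues[turn].handle, queues[turn].reqs, queues[turn].count);
        return 0;
    }

In this example the four small requests on handle 0 collapse into two contiguous issues ([0, 192) and [256, 320)), which is exactly the reduction in request count the slide describes.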
I/O Forwarding and Scalability Layer (IOFSL)

• IOFSL Project [Nawab 2009]
  • Open-Source I/O Forwarding Implementation
  • http://www.iofsl.org/
• Portable across most HPC environments
  • Network Independent
    • All network communication is done by BMI [Carns 2005]
       • TCP/IP, Infiniband, Myrinet, Blue Gene/P Tree,
         Portals, etc.
  • File System Independent
  • MPI-IO (ROMIO) / FUSE Client
                                                              10
IOFSL Software Stack

   [Figure: IOFSL software stack]
     Client side: Application -> POSIX interface (FUSE) or MPI-IO interface (ROMIO-ZOIDFS)
                  -> ZOIDFS client API -> BMI
     Server side: BMI -> IOFSL server -> file system dispatcher -> PVFS2 or libsysio
     I/O requests travel between client and server as the ZOIDFS protocol over BMI
     (TCP, Infiniband, Myrinet, etc.)

• Out-Of-Order I/O Pipelining and the I/O request scheduler have been implemented
  in IOFSL and evaluated in two environments:
   • T2K Tokyo (Linux cluster) and ANL Surveyor (Blue Gene/P)
                                                                                    11
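For intuition about the client-side interface in the stack above, the sketch below shows the kind of stateless, list-style write call a ZOIDFS-like client API exposes before the request is shipped over BMI to the IOFSL server. The function fwd_write() and every type and name in it are hypothetical and only modeled loosely on that style; they are not the actual zoidfs.h API.

    /*
     * Hypothetical list-I/O forwarding call (illustrative only): several
     * memory buffers map onto several, possibly non-contiguous, file
     * regions in a single request handed to the forwarding layer.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { uint8_t data[32]; } fwd_handle_t;   /* opaque file handle */

    static int fwd_write(const fwd_handle_t *h,
                         size_t mem_count, const void *mem_starts[], const size_t mem_sizes[],
                         size_t file_count, const uint64_t file_starts[], const uint64_t file_sizes[])
    {
        (void)h; (void)mem_starts; (void)file_starts; (void)file_sizes;
        size_t bytes = 0;
        for (size_t i = 0; i < mem_count; i++) bytes += mem_sizes[i];
        /* A real client would encode this request and post it via BMI; the
         * IOFSL server decodes it and dispatches to PVFS2 or libsysio.    */
        printf("forwarding %zu bytes as %zu memory regions / %zu file regions\n",
               bytes, mem_count, file_count);
        return 0;
    }

    int main(void)
    {
        fwd_handle_t h; memset(&h, 0, sizeof h);
        char a[4096], b[4096];
        const void *mem_starts[]    = { a, b };
        const size_t mem_sizes[]    = { sizeof a, sizeof b };
        const uint64_t file_starts[] = { 0, 1u << 20 };   /* two file regions */
        const uint64_t file_sizes[]  = { sizeof a, sizeof b };
        return fwd_write(&h, 2, mem_starts, mem_sizes, 2, file_starts, file_sizes);
    }

Keeping the request self-describing like this is what lets the server-side scheduler and pipeline reorder and merge work without per-client state.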
Evaluation on T2K: Spec

• T2K Open Supercomputer, Tokyo site
  • http://www.open-supercomputer.org/
  • 32-node research cluster
  • 16 cores per node: four 2.3 GHz quad-core Opterons
  • 32 GB memory
  • 10 Gbps Myrinet network
  • SATA disk (read: 49.52 MB/s, write: 39.76 MB/s)
• One IOFSL server, four PVFS2 servers, 128 MPI processes
• Software
  • MPICH2 1.1.1p1
  • PVFS2 CVS (approximately 2.8.2)
     • Both are configured to use Myrinet
                                                         12
Evaluation on T2K: IOR Benchmark

• Each process issues the same amount of I/O.
• Gradually increase the message size and observe how the bandwidth changes.
  • Note: IOR was modified to call fsync() on the MPI-IO path
    (a minimal sketch follows this slide).

  [Figure: illustration of the per-process message size]
                                                       13
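The sketch below shows what the modified benchmark's inner step amounts to on the MPI-IO path: each rank writes one "message" at its own offset and the file is flushed with MPI_File_sync() so the data reaches the servers before bandwidth is computed. This is a simplified stand-in, not the actual IOR source; the file name and the block size are illustrative.

    /* Minimal MPI-IO write-plus-sync sketch (illustrative, not IOR itself). */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const MPI_Offset block = 1 << 20;            /* 1 MiB message per rank */
        char *buf = malloc((size_t)block);
        for (MPI_Offset i = 0; i < block; i++) buf[i] = (char)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "ior.testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes one message of the chosen size at its own offset. */
        MPI_File_write_at(fh, (MPI_Offset)rank * block, buf, (int)block,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        /* The fsync() modification: force data out to the file system so the
         * measured bandwidth includes the actual storage transfer.           */
        MPI_File_sync(fh);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }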
Evaluation on T2K: IOR Benchmark, 128procs
                     [Figure: IOR bandwidth (MB/s) vs. message size (4 KB to 65536 KB)
                     for ROMIO PVFS2, IOFSL none, and IOFSL hbrr, 128 processes]
                                                                                                14
Evaluation on T2K: IOR Benchmark, 128procs
                     [Figure: same plot, annotated with the Out-Of-Order Pipelining
                     improvement (29.5%)]
                                                                                                14
Evaluation on T2K: IOR Benchmark, 128procs
                     [Figure: same plot, annotated with the Out-Of-Order Pipelining
                     improvement (29.5%) and the I/O scheduler improvement (40.0%)]
                                                                                                14
Evaluation on Blue Gene/P: Spec

• Argonne National Laboratory BG/P “Surveyor”
  • Blue Gene/P platform for research and development
  • 1,024 nodes, 4,096 cores
  • Four PVFS2 servers
  • DataDirect Networks S2A9550 SAN
• 256 compute nodes with 4 I/O nodes were used.




         [Figure: BG/P packaging - node card: 4 cores, node board: 128 cores, rack: 4,096 cores]   15
Evaluation on BG/P: BMI PingPong
                    [Figure: BMI ping-pong bandwidth (MB/s) vs. buffer size (1 KB to 4096 KB)
                    for BMI TCP/IP, BMI ZOID, and the CNK BG/P tree network]
                                                                                 16
Evaluation on BG/P: IOR Benchmark, 256nodes
                      [Figure: IOR bandwidth (MiB/s) vs. message size (1 KB to 65536 KB)
                      for CIOD and IOFSL FIFO (32 threads), 256 nodes]
                                                                                           17
Evaluation on BG/P: IOR Benchmark, 256nodes
                      [Figure: same plot, annotated with the performance improvement (42.0%)]
                                                                                             17
Evaluation on BG/P: IOR Benchmark, 256nodes
                      [Figure: same plot, annotated with the performance improvement (42.0%)
                      and the performance drop (-38.5%)]
                                                                                                17
Evaluation on BG/P: Thread Count Effect
                      [Figure: IOR bandwidth (MiB/s) vs. message size (1 KB to 65536 KB)
                      for IOFSL FIFO with 16 threads and 32 threads]
                                                                                            18
Evaluation on BG/P: Thread Count Effect
                      [Figure: same plot, annotated - 16 threads outperform 32 threads]
                                                                                                18
Related Work

• Computational Plant Project @ Sandia National Laboratory
  • First introduced the I/O forwarding layer
• IBM Blue Gene/L, Blue Gene/P
  • All I/O requests are forwarded to I/O nodes
     • The compute node OS can be stripped down to minimal
       functionality, reducing OS noise
  • ZOID: I/O forwarding project [Kamil 2008]
     • Runs only on Blue Gene
• Lustre Network Request Scheduler (NRS) [Qian 2009]
  • Request scheduler at the parallel file system nodes
  • Only simulation results
                                                             19
Future Work

• Event-driven server architecture
  • Reduced thread contention
• Collaborative caching at the I/O forwarding layer
  • Multiple I/O forwarders work collaboratively to cache data
    and metadata
• Hints from MPI-IO
  • Better cooperation with collective I/O
• Evaluation on other leadership-scale machines
  • ORNL Jaguar, Cray XT4, XT5 systems




                                                                    20
Conclusions
• Implementation and evaluation of two optimization techniques at the I/O
  Forwarding Layer
  • An I/O pipelining technique that overlaps file system requests with network
    communication.
  • An I/O scheduler that reduces the number of independent, non-contiguous
    file system accesses.
• Demonstrated a portable I/O forwarding layer and compared its performance
  with existing HPC I/O software stacks.
  • Two environments
     • T2K Tokyo Linux cluster
     • ANL Blue Gene/P Surveyor
  • First I/O forwarding evaluations on a Linux cluster and Blue Gene/P
  • First comparison of the IBM Blue Gene/P I/O stack with an open-source I/O stack

                                                                            21
Thanks!

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo
                                                      Contact: kazuki.ohta@gmail.com
                                                                                  22
