Cluster 2010 Presentation

Optimization Techniques at the I/O Forwarding Layer

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo

                                                      Contact: kazuki.ohta@gmail.com
                                                                                   1
Background: Compute and Storage Imbalance

• Leadership-class computational scale:
  • 100,000+ processes
  • Advanced Multi-core architectures, Compute node OSs
• Leadership-class storage scale:
  • 100+ servers
  • Commercial storage hardware, Cluster file system


• Current leadership-class machines supply only 1 GB/s of storage
  throughput for every 10 TF of compute performance, and this gap has grown
  by a factor of 10 in recent years.
• Bridging this imbalance between compute and storage is a critical
  problem for large-scale computation.

                                                                      2
Previous Studies: Current I/O Software Stack


   [Figure: current I/O software stack]
     Parallel/Serial Applications
     High-Level I/O Libraries: storage abstraction and data portability (HDF5, NetCDF, ADIOS)
     MPI-IO: organizing accesses from many clients (ROMIO) | POSIX I/O (VFS, FUSE)
     Parallel File Systems: logical file system over many storage devices
       (PVFS2, Lustre, GPFS, PanFS, Ceph, etc.)
     Storage Devices
                                                                                3
Challenge: Millions of Concurrent Clients

• 1,000,000+ concurrent clients present a challenge to the current I/O stack
   • e.g. metadata performance, locking, the network incast problem, etc.
• The I/O forwarding layer is introduced to address this.
   • All I/O requests are delegated to a dedicated I/O forwarder process.
   • The I/O forwarder reduces the number of clients seen by the file system for all
     applications, without requiring collective I/O.

               [Figure: I/O path - many compute processes are funneled through a small
               number of I/O forwarders down to the parallel file system (PVFS2) servers]
                                                                                                                   4
I/O Software Stack with I/O Forwarding


   [Figure: I/O software stack with I/O forwarding]
     Parallel/Serial Applications
     High-Level I/O Libraries
     MPI-IO | POSIX I/O (VFS, FUSE)
     I/O Forwarding: bridge between compute processes and the storage system
       (IBM ciod, Cray DVS, IOFSL)
     Parallel File Systems
     Storage Devices
                                                                               5
Example I/O System: Blue Gene/P Architecture




                                               6
I/O Forwarding Challenges

• Large requests
  • Forwarding latency
  • Memory limit at the forwarding node
  • Variation in backend file system node performance
• Small requests
  • The current I/O forwarding mechanism reduces the number of clients, but
    does not reduce the number of requests.
  • Request processing overheads at the file systems
  • Request processing overheads at the file systems


• We proposed two optimization techniques for the I/O forwarding layer.
  • Out-Of-Order I/O Pipelining, for large requests.
  • I/O Request Scheduler, for small requests.
                                                                          7
Out-Of-Order I/O Pipelining
• Split large I/O requests into small, fixed-size chunks.
• The chunks are forwarded out of order (a minimal sketch follows this slide).

  [Figure: request timelines for Client, IOFSL, and file system threads,
  comparing No-Pipelining with Out-Of-Order Pipelining]

• Benefits
   • Reduces forwarding latency by overlapping the I/O requests with the
     network transfer.
   • I/O sizes are not limited by the memory size at the forwarding node.
   • Little effect from the slowest file system node.
                                                                                             8
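To make the chunking concrete, here is a minimal sketch under stated assumptions; it is not the IOFSL implementation. The names (chunk_queue, worker) and the chunk size and thread count are illustrative. A pool of worker threads claims fixed-size chunks of one large request from a shared cursor and services each chunk independently, so chunks complete in whatever order the network and file system allow.

    /*
     * Sketch of out-of-order I/O pipelining (illustrative, not IOFSL source):
     * one large client write is split into fixed-size chunks, and worker
     * threads service chunks independently of their order in the file.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define CHUNK_SIZE  (8UL * 1024 * 1024)   /* assumed pipeline buffer size   */
    #define NUM_WORKERS 4                     /* assumed forwarder thread count */

    struct chunk_queue {
        size_t total;           /* total bytes in the large client request */
        size_t next_offset;     /* next chunk not yet claimed              */
        pthread_mutex_t lock;
    };

    /* Each worker claims the next unclaimed chunk and services it on its own,
     * so no chunk waits for an earlier chunk to finish.                      */
    static void *worker(void *arg)
    {
        struct chunk_queue *q = arg;
        for (;;) {
            pthread_mutex_lock(&q->lock);
            size_t off = q->next_offset;
            if (off >= q->total) {                 /* all chunks claimed */
                pthread_mutex_unlock(&q->lock);
                return NULL;
            }
            q->next_offset += CHUNK_SIZE;
            pthread_mutex_unlock(&q->lock);

            size_t len = (off + CHUNK_SIZE > q->total) ? q->total - off : CHUNK_SIZE;
            /* In a real forwarder this step would receive the chunk from the
             * client over the network and write it to the file system at `off`. */
            printf("chunk at offset %zu (%zu bytes) forwarded\n", off, len);
        }
    }

    int main(void)
    {
        /* One 64 MiB client write split into 8 MiB chunks. */
        struct chunk_queue q = { .total = 64UL * 1024 * 1024, .next_offset = 0 };
        pthread_mutex_init(&q.lock, NULL);

        pthread_t tid[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, &q);
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(tid[i], NULL);

        pthread_mutex_destroy(&q.lock);
        return 0;
    }

Because only the chunk cursor is shared, the memory needed at the forwarder is bounded by (chunk size x thread count) rather than by the size of the client request.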
I/O Request Scheduler

• Schedule and merge small requests at the forwarder (a minimal sketch follows this slide)
  • Reduces the number of seeks
  • Reduces the number of requests the file system actually sees
• Scheduling overhead must be minimal
  • Handle-Based Round-Robin (HBRR) algorithm for fairness between files
  • Ranges are managed by an interval tree
     • Contiguous requests are merged

                [Figure: per-handle (H) queues of Read/Write requests in the scheduler
                queue Q; each turn picks N requests and issues the I/O]
                                                                          9
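The sketch below illustrates the two ideas on this slide under stated assumptions; it is not the IOFSL scheduler. Requests are queued per file handle, the scheduler visits handles in round-robin order for fairness, and contiguous byte ranges in each batch are merged before being issued. IOFSL tracks ranges with an interval tree; a sorted array is used here only to keep the example short, and all names (handle_queue, issue_merged) are illustrative.

    /*
     * Sketch of a Handle-Based Round-Robin (HBRR) request scheduler with
     * range merging (illustrative, not IOFSL source).
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct request { long offset, length; };

    struct handle_queue {
        int handle;              /* file handle id                 */
        struct request *reqs;    /* pending requests for this file */
        int count;
    };

    static int by_offset(const void *a, const void *b)
    {
        const struct request *x = a, *y = b;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    /* Sort one batch, merge contiguous/overlapping ranges, and issue one
     * I/O per merged range instead of one per original request.          */
    static void issue_merged(int handle, struct request *batch, int n)
    {
        qsort(batch, n, sizeof *batch, by_offset);
        long start = batch[0].offset, end = start + batch[0].length;
        for (int i = 1; i <= n; i++) {
            if (i < n && batch[i].offset <= end) {         /* extend current range */
                long e = batch[i].offset + batch[i].length;
                if (e > end) end = e;
            } else {
                printf("handle %d: issue [%ld, %ld) as one request\n",
                       handle, start, end);
                if (i < n) { start = batch[i].offset; end = start + batch[i].length; }
            }
        }
    }

    int main(void)
    {
        /* Two files, each with small writes arriving from many clients. */
        struct request f0[] = { {0, 64}, {64, 64}, {256, 64}, {128, 64} };
        struct request f1[] = { {512, 64}, {576, 64} };
        struct handle_queue queues[] = { {0, f0, 4}, {1, f1, 2} };

        /* Round-robin over handles: each turn drains one handle's batch,
         * so no single file can starve the others.                      */
        for (int turn = 0; turn < 2; turn++)
            issue_merged(queues[turn].handle, queues[turn].reqs, queues[turn].count);
        return 0;
    }

In this example the four small requests on handle 0 collapse into two contiguous issues ([0, 192) and [256, 320)), which is exactly the reduction in request count the slide describes.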
I/O Forwarding and Scalability Layer (IOFSL)

• IOFSL Project [Nawab 2009]
  • Open-Source I/O Forwarding Implementation
  • http://www.iofsl.org/
• Portable across most HPC environments
  • Network Independent
    • All network communication is done by BMI [Carns 2005]
       • TCP/IP, Infiniband, Myrinet, Blue Gene/P Tree,
         Portals, etc.
  • File System Independent
  • MPI-IO (ROMIO) / FUSE Client
                                                              10
IOFSL Software Stack

   [Figure: IOFSL software stack]
     Client side: Application -> POSIX interface (FUSE) or MPI-IO interface (ROMIO-ZOIDFS)
                  -> ZOIDFS client API -> BMI
     Server side: BMI -> IOFSL server -> file system dispatcher -> PVFS2 or libsysio
     I/O requests travel between client and server as the ZOIDFS protocol over BMI
     (TCP, Infiniband, Myrinet, etc.)

• Out-Of-Order I/O Pipelining and the I/O request scheduler have been implemented
  in IOFSL and evaluated in two environments:
   • T2K Tokyo (Linux cluster) and ANL Surveyor (Blue Gene/P)
                                                                                    11
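For intuition about the client-side interface in the stack above, the sketch below shows the kind of stateless, list-style write call a ZOIDFS-like client API exposes before the request is shipped over BMI to the IOFSL server. The function fwd_write() and every type and name in it are hypothetical and only modeled loosely on that style; they are not the actual zoidfs.h API.

    /*
     * Hypothetical list-I/O forwarding call (illustrative only): several
     * memory buffers map onto several, possibly non-contiguous, file
     * regions in a single request handed to the forwarding layer.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { uint8_t data[32]; } fwd_handle_t;   /* opaque file handle */

    static int fwd_write(const fwd_handle_t *h,
                         size_t mem_count, const void *mem_starts[], const size_t mem_sizes[],
                         size_t file_count, const uint64_t file_starts[], const uint64_t file_sizes[])
    {
        (void)h; (void)mem_starts; (void)file_starts; (void)file_sizes;
        size_t bytes = 0;
        for (size_t i = 0; i < mem_count; i++) bytes += mem_sizes[i];
        /* A real client would encode this request and post it via BMI; the
         * IOFSL server decodes it and dispatches to PVFS2 or libsysio.    */
        printf("forwarding %zu bytes as %zu memory regions / %zu file regions\n",
               bytes, mem_count, file_count);
        return 0;
    }

    int main(void)
    {
        fwd_handle_t h; memset(&h, 0, sizeof h);
        char a[4096], b[4096];
        const void *mem_starts[]    = { a, b };
        const size_t mem_sizes[]    = { sizeof a, sizeof b };
        const uint64_t file_starts[] = { 0, 1u << 20 };   /* two file regions */
        const uint64_t file_sizes[]  = { sizeof a, sizeof b };
        return fwd_write(&h, 2, mem_starts, mem_sizes, 2, file_starts, file_sizes);
    }

Keeping the request self-describing like this is what lets the server-side scheduler and pipeline reorder and merge work without per-client state.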
Evaluation on T2K: Spec

• T2K Open Supercomputer, Tokyo site
  • http://www.open-supercomputer.org/
  • 32-node research cluster
  • 16 cores per node: four 2.3 GHz quad-core Opterons
  • 32 GB memory
  • 10 Gbps Myrinet network
  • SATA disk (read: 49.52 MB/s, write: 39.76 MB/s)
• One IOFSL server, four PVFS2 servers, 128 MPI processes
• Software
  • MPICH2 1.1.1p1
  • PVFS2 CVS (approximately 2.8.2)
     • Both are configured to use Myrinet
                                                         12
Evaluation on T2K: IOR Benchmark

• Each process issues the same amount of I/O.
• Gradually increase the message size and observe how the bandwidth changes.
  • Note: IOR was modified to call fsync() on the MPI-IO path
    (a minimal sketch follows this slide).

  [Figure: illustration of the per-process message size]
                                                       13
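The sketch below shows what the modified benchmark's inner step amounts to on the MPI-IO path: each rank writes one "message" at its own offset and the file is flushed with MPI_File_sync() so the data reaches the servers before bandwidth is computed. This is a simplified stand-in, not the actual IOR source; the file name and the block size are illustrative.

    /* Minimal MPI-IO write-plus-sync sketch (illustrative, not IOR itself). */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const MPI_Offset block = 1 << 20;            /* 1 MiB message per rank */
        char *buf = malloc((size_t)block);
        for (MPI_Offset i = 0; i < block; i++) buf[i] = (char)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "ior.testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes one message of the chosen size at its own offset. */
        MPI_File_write_at(fh, (MPI_Offset)rank * block, buf, (int)block,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        /* The fsync() modification: force data out to the file system so the
         * measured bandwidth includes the actual storage transfer.           */
        MPI_File_sync(fh);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }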
Evaluation on T2K: IOR Benchmark, 128procs
                     [Figure: IOR bandwidth (MB/s) vs. message size (4 KB to 65536 KB)
                     for ROMIO PVFS2, IOFSL none, and IOFSL hbrr, 128 processes]
                                                                                                14
Evaluation on T2K: IOR Benchmark, 128procs
                     [Figure: same plot, annotated with the Out-Of-Order Pipelining
                     improvement (29.5%)]
                                                                                                14
Evaluation on T2K: IOR Benchmark, 128procs
                     [Figure: same plot, annotated with the Out-Of-Order Pipelining
                     improvement (29.5%) and the I/O scheduler improvement (40.0%)]
                                                                                                14
Evaluation on Blue Gene/P: Spec

• Argonne National Laboratory BG/P “Surveyor”
  • Blue Gene/P platform for research and development
  • 1,024 nodes, 4,096 cores
  • Four PVFS2 servers
  • DataDirect Networks S2A9550 SAN
• 256 compute nodes with 4 I/O nodes were used.




         [Figure: BG/P packaging - node card: 4 cores, node board: 128 cores, rack: 4,096 cores]   15
Evaluation on BG/P: BMI PingPong
                    [Figure: BMI ping-pong bandwidth (MB/s) vs. buffer size (1 KB to 4096 KB)
                    for BMI TCP/IP, BMI ZOID, and the CNK BG/P tree network]
                                                                                 16
Evaluation on BG/P: IOR Benchmark, 256nodes
                      [Figure: IOR bandwidth (MiB/s) vs. message size (1 KB to 65536 KB)
                      for CIOD and IOFSL FIFO (32 threads), 256 nodes]
                                                                                           17
Evaluation on BG/P: IOR Benchmark, 256nodes
                      [Figure: same plot, annotated with the performance improvement (42.0%)]
                                                                                             17
Evaluation on BG/P: IOR Benchmark, 256nodes
                      [Figure: same plot, annotated with the performance improvement (42.0%)
                      and the performance drop (-38.5%)]
                                                                                                17
Evaluation on BG/P: Thread Count Effect
                      [Figure: IOR bandwidth (MiB/s) vs. message size (1 KB to 65536 KB)
                      for IOFSL FIFO with 16 threads and 32 threads]
                                                                                            18
Evaluation on BG/P: Thread Count Effect
                      [Figure: same plot, annotated - 16 threads outperform 32 threads]
                                                                                                18
Related Work

• Computational Plant Project @ Sandia National Laboratory
  • First introduced the I/O forwarding layer
• IBM Blue Gene/L, Blue Gene/P
  • All I/O requests are forwarded to I/O nodes
     • The compute node OS can be stripped down to minimal
       functionality, reducing OS noise
  • ZOID: I/O forwarding project [Kamil 2008]
     • Runs only on Blue Gene
• Lustre Network Request Scheduler (NRS) [Qian 2009]
  • Request scheduler at the parallel file system nodes
  • Only simulation results
                                                             19
Future Work

• Event-driven server architecture
  • Reduced thread contention
• Collaborative caching at the I/O forwarding layer
  • Multiple I/O forwarders work collaboratively to cache data
    and metadata
• Hints from MPI-IO
  • Better cooperation with collective I/O
• Evaluation on other leadership-scale machines
  • ORNL Jaguar, Cray XT4, XT5 systems




                                                                    20
Conclusions
• Implementation and evaluation of two optimization techniques at the I/O
  Forwarding Layer
  • An I/O pipelining technique that overlaps file system requests with network
    communication.
  • An I/O scheduler that reduces the number of independent, non-contiguous
    file system accesses.
• Demonstrated a portable I/O forwarding layer and compared its performance
  with existing HPC I/O software stacks.
  • Two environments
     • T2K Tokyo Linux cluster
     • ANL Blue Gene/P Surveyor
  • First I/O forwarding evaluations on a Linux cluster and Blue Gene/P
  • First comparison of the IBM Blue Gene/P I/O stack with an open-source I/O stack

                                                                            21
Thanks!

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo
                                                      Contact: kazuki.ohta@gmail.com
                                                                                  22
