ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Early application experiences on Summit
Wayne Joubert
Scientific Computing Group
Oak Ridge Leadership Computing Facility
3rd OpenPOWER Academia Discussion Group Workshop
Nov. 10, 2018
2
Summit – background
Officially launched June 8, 2018
World’s fastest supercomputer
Peak speed 200 PF
#1 on TOP500, @ 122.3 PF, June 2018
#1 Green500 level-3 measured system
#1 on HPCG benchmark
Used by 5 out of 6 Gordon Bell Finalist teams
Achieved world’s first ExaOp calculation by an
application, @ 2.36 ExaOps (ExaFlops16)
Not yet officially accepted, but already achieving
impressive results on conventional science and
machine learning applications
3 Slide courtesy Jack Wells
4 Slide courtesy Jack Wells
5
Summit early users
• 1,080 compute nodes have been available to users since December 2017; the system has since been built up to its present 4,608 nodes
• Used by 13 CAAR teams (Center
for Accelerated Application
Readiness)
• 65 Letters of Intent for the Summit Early Science program – these teams were allowed on Summit for application readiness work
• Gordon Bell teams (5)
• System Acceptance Test team – preparations for final system acceptance testing
Graphic courtesy Tjerk Straatsma
6
Summit early science applicants
• Received 65 LOIs in January, 47 full
proposals in June
• Awardees will be among the first users
to get access to Summit after
acceptance
• Notably, 12 of the 65 LOIs (~ 20%)
had a machine learning component –
remarkable growth in a short period of
time
• Tremendous interest in running on
Summit, from Early Science as well as
2019 INCITE projects (announcement
Monday)
7
Summit Gordon Bell Teams
Slide courtesy Jack Wells
8
Summit Gordon Bell Finalist Projects
• CoMet team used Tensor Cores to achieve 2.36 ExaOps performance on a
comparative genomics application
• Prabhat’s LBL team, deep learning application, 1.13 ExaOps peak, 0.999 ExaOps sustained performance for identification of extreme weather patterns
from high resolution climate simulation data
• University of Tokyo team used AI and transprecision computing for
earthquake simulation
• ORNL / Robert Patton team, MENNDL code, 152 PetaOps analyzing atomic-level materials properties from electron microscopy data
• LBL-led team using LQCD code with mixed precision multigrid solver to
study the physics of subatomic particles
• Full presentations @ SC18 sessions, Wed. 3:30-5:00, Thu. 10:30-12:00
9
Summit: first impressions
• Our early experience with Summit is that it is an
extremely powerful system
– Very strong GPUs
– Apps are often getting a higher fraction of peak than on
Titan – improvements to GPU hardware, software over time
– New features useful for some apps, e.g., Tensor Cores,
NVMe devices
– Low-congestion fat tree interconnect with adaptive routing
• Many apps have gotten impressive results already
• The early system was somewhat rough around the
edges, with a number of issues we have had to
work through with the vendors
• The system has progressively gotten better as all
parties have been working through the issues
[Chart: CAAR application performance to date – number of nodes scaled to (out of 4,608 Summit nodes) and performance vs. CPU-only. From “Early Application Results on Summit,” T.P. Straatsma, Smoky Mountain Conference 2018]
10
Summit node performance
• Summit nodes are achieving a high percentage of theoretical
peak performance characteristics
• For details see Vazhkudai et al., “The Design, Deployment,
and Evaluation of the CORAL Pre-Exascale Systems,” @
SC18, Wed. 3:30PM
11
Summit node performance: CPU memory subsystem
• Using the Stream benchmark to measure CPU memory bandwidth
• Theoretical peak 340 GB/sec, actual ~ 275 GB/sec, ~ 82% of peak
• Significant boost from previous Titan, JaguarPF nodes
SUMMIT: peak 170 × 2 = 340 GB/sec, actual ~ 275 GB/sec (~ 82% of peak)
TITAN: peak 25.6 × 2 = 51.2 GB/sec, actual ~ 34 GB/sec (~ 67% of peak)
JAGUARPF: peak 25.6 GB/sec, actual ~ 19 GB/sec (~ 75% of peak)
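For illustration, a minimal triad-style bandwidth kernel in the spirit of the Stream measurement above (a sketch only, not the official STREAM benchmark; the array size is an arbitrary placeholder):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const size_t n = size_t(1) << 27;          // ~128M doubles per array (placeholder size)
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
  const double scalar = 3.0;

  auto t0 = std::chrono::steady_clock::now();
  #pragma omp parallel for
  for (size_t i = 0; i < n; ++i)
    c[i] = a[i] + scalar * b[i];             // triad: 2 loads + 1 store per element
  auto t1 = std::chrono::steady_clock::now();

  const double sec = std::chrono::duration<double>(t1 - t0).count();
  const double gbytes = 3.0 * n * sizeof(double) / 1e9;  // bytes moved per element: 3 * 8
  std::printf("Triad bandwidth: %.1f GB/sec\n", gbytes / sec);
  return 0;
}
```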
12
Summit node performance: GPU HBM memory
• Theoretical peak bandwidth of 900 GB/sec
• Measured performance from GPU Stream benchmark: 789 (Copy), 788 (Mul), 831 (Add and
Triad) GB/sec, representing 88%-92% of peak.
• Compares extremely well to Titan K20X, ~ 181 GB/sec out of 250 GB/sec peak (72%)
• Innovations were made in the GPU memory to achieve a higher fraction of peak performance
13
Summit node performance: CPU-GPU NVLINK
• Previously relied on much slower PCIe-2 connection on Titan
• On-socket transfer rates are 92% and 86% of the respective 50 and 100 GB/sec peaks (unidirectional / bidirectional)
• Off-socket transfers go through the X-Bus, are slower
14
Summit node performance: Infiniband interconnect
• Node-to-node bandwidth, latency measured using IMB benchmark
• Achieving 22.36 GB/sec unidirectional and 44.29 GB/sec bidirectional (out of peaks of 25 and 50 GB/sec) for sufficiently large messages
• 89% of theoretical peak
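For illustration, a minimal MPI ping-pong sketch of the kind of bandwidth measurement IMB performs (not the IMB code itself; message size, repetition count, and the two-rank setup are placeholder choices):

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int nbytes = 1 << 26;                // 64 MB message (placeholder)
  const int reps = 20;
  std::vector<char> buf(nbytes, 0);

  MPI_Barrier(MPI_COMM_WORLD);
  const double t0 = MPI_Wtime();
  for (int r = 0; r < reps; ++r) {
    if (rank == 0) {
      MPI_Send(buf.data(), nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(buf.data(), nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(buf.data(), nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(buf.data(), nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
  }
  const double t1 = MPI_Wtime();

  if (rank == 0) {
    // Each repetition moves the message twice (there and back).
    const double gb = 2.0 * reps * double(nbytes) / 1e9;
    std::printf("Unidirectional bandwidth: %.2f GB/sec\n", gb / (t1 - t0));
  }
  MPI_Finalize();
  return 0;
}
```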
15
GTC application (CAAR; acceptance test code)
• GTC (Gyrokinetic Toroidal Code) is a particle-in-cell (PIC) application
to simulate magnetically confined plasmas in fusion reactors such as
ITER
• Written in Fortran 90
• Accelerated primarily with OpenACC
• OpenMP acceleration of CPU-only parts; also an OpenMP code
version for Intel Xeon Phi
• Project personnel: Zhihong Lin (co-PI), William Tang (co-PI), Jian
Bao, Wayne Joubert, Matthew Niemerg, Lei Shi, Sam Taimourzadeh,
Bei Wang, Peng Wang, Wenlu Zhang
• http://phoenix.ps.uci.edu/gtc_group
16
GTC application: experiences porting to Summit GPUs
• Expensive particle push and charge loops are mapped to the GPU using OpenACC, with persistent data on the GPUs (see the sketch at the end of this list)
• (aside: a number of codes since 2012 have used OpenACC on Titan and Summit, including
several Summit CAAR codes. Some codes are now starting to use OpenMP 4, e.g., PSDNS on
Titan)
• The “Shift” operation — moving particles to different MPI ranks — uses highly optimized custom CUDA code for high performance, taking advantage of OpenACC/CUDA interoperability
• Poisson field solver
– original code used PETSc ILU(0)+GMRES sparse solver (CPU-only)
– now uses NVIDIA’s AMGX algebraic multigrid solver using GPUs, > 20X faster
– also an option to use the Hypre algebraic multigrid solver (GPU support in development)
• Build option exists to use GPU Unified Memory, originally was much slower than explicit transfers,
now performance is near parity thanks to PGI compiler improvements
• GTC has significant I/O requirements. The GTC I/O behaviors uncovered some issues with the
Summit GPFS file system, which were addressed
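Below is a minimal sketch of the persistent-data OpenACC pattern referenced in the first bullet. GTC itself is Fortran 90; this C version, its array names, and the trivial push formula are illustrative only.

```cpp
// Sketch: keep particle data resident on the GPU across timesteps and offload
// the push loop with OpenACC. Array names and the push update are placeholders.
void push_particles(double* x, double* v, const double* efield,
                    int nparticles, int nsteps, double dt) {
  // Copy particle data to the GPU once and keep it resident ("persistent data").
  #pragma acc data copy(x[0:nparticles], v[0:nparticles]) \
                   copyin(efield[0:nparticles])
  {
    for (int step = 0; step < nsteps; ++step) {
      // The expensive per-particle push loop runs on the GPU every step;
      // no host/device transfers occur inside the time loop.
      #pragma acc parallel loop present(x, v, efield)
      for (int i = 0; i < nparticles; ++i) {
        v[i] += dt * efield[i];
        x[i] += dt * v[i];
      }
    }
  }
}
```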
17
GTC results
Weak scaling to 4500 nodes of Summit
18
CoMet application (INCITE, Early Science, Gordon Bell)
CoMet = Combinatorial Metrics code
A new biosciences application used to find genomic features within a
population
Not a “traditional” modeling and simulation code (e.g., continuum PDE
solver, PIC, Monte Carlo, etc.)
Also is not a deep learning app per se, though is part of an AI workflow
Best described as a data analytics application used in comparative
genomics studies
Gordon Bell Finalist -- see talk Thurs 11:30 AM
19
CoMet application
The primary computation is an all-to-all comparison of vectors
Computationally similar to a distributed DGEMM operation, as in the
ScaLAPACK library and PBLAS — very computationally intensive, but
also requires communication of very large matrices
Written in C++, uses CUDA, cuBLAS and modified MAGMA calls
Uses explicit calls for both asynchronous MPI point-to-point messages
and asynchronous CPU/GPU transfers, with pipelining to overlap
compute and transfer
OpenMP threading is used for CPU work, done concurrently with GPU
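A minimal sketch of the pipelining pattern described above, assuming a hypothetical compute_block kernel wrapper and a simple double-buffered block circulation (an illustration, not CoMet's actual code):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical wrapper around a GPU kernel launch (assumed for illustration).
void compute_block(double* dev_block, size_t n, cudaStream_t stream);

// Double-buffered pipeline sketch: overlap GPU compute on the current block
// with asynchronous MPI messages and the host-to-device copy of the next block.
// Host buffers are assumed to be pinned (cudaMallocHost) so copies are truly
// asynchronous; block layout and step count are placeholders.
void pipeline(double* host_buf[2], double* dev_buf[2], size_t n, int nsteps,
              int send_peer, int recv_peer, cudaStream_t stream) {
  const size_t nbytes = n * sizeof(double);
  for (int step = 0; step < nsteps; ++step) {
    const int cur = step % 2, nxt = 1 - cur;

    // Post asynchronous point-to-point messages circulating the blocks.
    MPI_Request reqs[2];
    MPI_Irecv(host_buf[nxt], (int)n, MPI_DOUBLE, recv_peer, step,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(host_buf[cur], (int)n, MPI_DOUBLE, send_peer, step,
              MPI_COMM_WORLD, &reqs[1]);

    // Compute on the block already resident on the GPU while messages fly.
    compute_block(dev_buf[cur], n, stream);

    // When the next block has arrived, stage it to the GPU asynchronously.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(dev_buf[nxt], host_buf[nxt], nbytes,
                    cudaMemcpyHostToDevice, stream);

    // Ensure both the kernel and the copy finish before buffers are reused.
    cudaStreamSynchronize(stream);
  }
}
```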
20
CoMet algorithm: Custom Correlation Coefficient (CCC)
Used to analyze allele data from a genome, encoded as 2-bit vector elements
Base implementation uses bitwise operations (AND, OR, NOT, shift, mask,
__popcll, etc.) to operate on this binary allelic data
[Figure: example vectors v1 and v2 composed of 2-bit entries; all combinations of bits from the left and right vector elements are taken and tallied into a 2×2 table representing how the two vectors are related]
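A simplified per-element sketch of the tally described above; the production implementation instead packs many 2-bit entries per 64-bit word and uses bitwise masks and __popcll, so the names and layout here are illustrative only:

```cpp
#include <cstdint>

// Tally the 2x2 CCC table for a pair of vectors of 2-bit allele codes.
void ccc_tally(const uint8_t* v1, const uint8_t* v2, int nelts, int table[2][2]) {
  table[0][0] = table[0][1] = table[1][0] = table[1][1] = 0;
  for (int e = 0; e < nelts; ++e) {
    // Each entry is a 2-bit allele code; extract its two bits.
    const int a[2] = { v1[e] & 1, (v1[e] >> 1) & 1 };
    const int b[2] = { v2[e] & 1, (v2[e] >> 1) & 1 };
    // Tally all combinations of a bit from the left element with a bit from
    // the right element into the 2x2 table.
    for (int i = 0; i < 2; ++i)
      for (int j = 0; j < 2; ++j)
        ++table[a[i]][b[j]];
  }
}
```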
21
CCC method: mapping to Tensor Cores
• Each vector is replaced by two vectors, one containing the number of 0 bits and the other the number of 1 bits in each element of the original vector, forming a new matrix of vectors V
• Taking the dense matrix-matrix product V^T V then generates all 2×2 tables for all vector pairs
• HGEMM is applied via a call to cublasGemmEx in the cuBLAS library and gives results identical to the original method
[Figure: an original 2-bit vector is expanded into two FP16 vectors holding, respectively, the number of 0s and the number of 1s in each element]
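A minimal sketch of the cuBLAS call described in the last bullet, assuming FP16 inputs with FP32 accumulation, column-major storage, and the CUDA 9/10-era cublasGemmEx signature (an illustrative single GEMM, not CoMet's production code path):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// V is a k x m column-major FP16 matrix whose columns are the expanded
// (# 0s / # 1s) vectors, two columns per original vector; C = V^T * V then
// contains, for each pair of original vectors, the four 2x2 table entries.
cublasStatus_t ccc_gemm(cublasHandle_t handle, const __half* d_V, float* d_C,
                        int m /* expanded vector count */, int k /* vector length */) {
  const float alpha = 1.0f, beta = 0.0f;
  cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // allow Tensor Core kernels
  return cublasGemmEx(handle,
                      CUBLAS_OP_T, CUBLAS_OP_N,      // C = V^T * V
                      m, m, k,
                      &alpha,
                      d_V, CUDA_R_16F, k,            // A = V (k x m), lda = k
                      d_V, CUDA_R_16F, k,            // B = V (k x m), ldb = k
                      &beta,
                      d_C, CUDA_R_32F, m,            // C (m x m), ldc = m
                      CUDA_R_32F,                    // accumulate in FP32
                      CUBLAS_GEMM_ALGO4_TENSOR_OP);  // algorithm the team found best
}
```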
22
CoMet performance
• Achieved 2.36 ExaOps (mixed
precision ExaFlops) at 4,560
nodes (99% of Summit) using the
Tensor Cores
• Near-perfect scaling made
possible by Summit’s Mellanox
Infiniband fat tree network with
adaptive routing
• Equivalent to 86.4 TF per GPU
for the whole computation
(including communications and
transfers)
• > 4X faster than original bitwise
implementation on Summit GPUs
W. Joubert, J. Nance, D. Weighill, D. Jacobson, “Parallel Accelerated Vector Similarity Calculations
for Genomics Applications,” Parallel Computing, vol. 75, July 2018, pp. 130-145,
https://www.sciencedirect.com/science/article/pii/S016781911830084X
W. Joubert, J. Nance, S. Climer, D. Weighill, D. Jacobson, “Parallel Accelerated Custom Correlation
Coefficient Calculations for Genomics Applications,” arXiv:1705.08213 [cs], Parallel Computing,
accepted.
Wayne Joubert, Deborah Weighill, David Kainer, Sharlee Climer, Amy Justice, Kjiersten Fagnan,
Daniel Jacobson, “Attacking the Opioid Epidemic: Determining the Epistatic and Pleiotropic Genetic
Architectures for Chronic Pain and Opioid Addiction,” SC18, Gordon Bell finalist, to appear.
23
Summit Power Consumption
• 2-way CCC/sp/tc @
4560 nodes
• Summit power usage
for 1 out of 4 phases of
the run, duration ~ 50
sec.
• Avg power: 11.45 MW
(20% higher than HPL)
• 206 GigaOps / Watt
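• (Arithmetic check: 2.36 × 10^18 ops / 11.45 × 10^6 W ≈ 2.06 × 10^11 ops per Watt ≈ 206 GigaOps / Watt)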
24
Issues / challenges of using Tensor Cores
• Matrices for this problem are tall and skinny – axis order had to be reversed to give a shorter leading matrix dimension for better TC performance (about 2X faster) (thanks to Sean Treichler of NVIDIA for the suggestion)
• HGEMM performance as a function of matrix size is irregular and hard to precisely predict – performed extensive timing tests with the Baidu DeepBench benchmark to try to understand it – advisable to pad up to a multiple of a small power of 2 (e.g., 8, 16, 32) – however too much padding will be wasteful (see the helper sketch after this list)
• There are many tuning options for HGEMM (~16 choices for the algorithm setting)
– determined CUBLAS_GEMM_ALGO4_TENSOR_OP was the best – would prefer if
default setting would give this performance (hoping for improvements with CUDA
10)
• TC/HGEMM has surprising data-dependent performance: 125 TF theoretical
peak, 113 TF achievable on zero-filled matrices, 105 TF peak on random CCC
matrices, ~95 TF peak on matrices with fully random FP16 entries
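The padding heuristic mentioned above can be as simple as rounding each GEMM dimension up to a chosen multiple (a sketch; the default of 32 and the example sizes in the comment are illustrative choices, not CoMet's settings):

```cpp
#include <cstdint>

// Round a GEMM dimension up to a multiple of a small power of 2 before the
// HGEMM call; padded rows/columns are zero-filled so the result is unchanged.
inline int64_t pad_dim(int64_t n, int64_t multiple = 32) {
  return ((n + multiple - 1) / multiple) * multiple;
}

// Example: pad_dim(30011) == 30016, pad_dim(30011, 8) == 30016,
//          pad_dim(100, 32) == 128.
```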
25
Issues
• Measurements on 1 Summit GPU
using nvidia-smi
• Data-dependent performance of
Tensor Cores is due to 300W
power/frequency throttling of Voltas
on Summit
• Baidu DeepBench GEMM
benchmark has a bug (reported),
incorrectly fills FP16 matrices with
zeros instead of the intended
random values, thus miscomputes
GPU performance
26
Reduced precision: other possible opportunities
• We are starting to look at other opportunities for using reduced
precision for science calculations on Summit
• In the past scientists have had accuracy concerns and usually
required double precision
– E.g., S3D combustion code, 2009 paper found single precision not adequate
• New hardware (16X faster HGEMM than DGEMM) may call for a
second look
– ICL/Dongarra group already developing an iterative refinement dense solver using Tensor Cores (see talk @ SC18, Wed. 4PM, and the sketch after this list)
– Deep learning projects already seeing high rates, e.g., peak 1.13 ExaOps
– Previous work on reduced precision iterative solvers e.g., Turner/Walker 1992
paper on reduced precision GMRES sparse solver
– Need to carefully evaluate on a case-by-case basis
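A minimal sketch of the mixed-precision iterative refinement idea referenced above, with a hypothetical reduced-precision solver interface (lu_solve_reduced is assumed for illustration, not an actual ICL or vendor API):

```cpp
#include <cmath>
#include <vector>

// Assumed reduced-precision solver: solves A x = b for dense column-major A
// (n x n), e.g., via a Tensor Core GEMM-based LU. In practice the factorization
// would be computed once and reused for each correction solve.
std::vector<double> lu_solve_reduced(const std::vector<double>& A,
                                     const std::vector<double>& b, int n);

// Iterative refinement: solve in reduced precision, correct with
// double-precision residuals until converged.
std::vector<double> refine(const std::vector<double>& A,
                           const std::vector<double>& b, int n,
                           double tol = 1e-12, int max_iter = 10) {
  std::vector<double> x = lu_solve_reduced(A, b, n);      // initial low-precision solve
  for (int it = 0; it < max_iter; ++it) {
    // r = b - A x, accumulated in double precision.
    std::vector<double> r(b);
    for (int j = 0; j < n; ++j)
      for (int i = 0; i < n; ++i)
        r[i] -= A[i + (size_t)j * n] * x[j];
    double rnorm = 0.0, bnorm = 0.0;
    for (int i = 0; i < n; ++i) { rnorm += r[i] * r[i]; bnorm += b[i] * b[i]; }
    if (std::sqrt(rnorm) <= tol * std::sqrt(bnorm)) break; // converged
    // Low-precision correction solve; update the iterate in double precision.
    std::vector<double> d = lu_solve_reduced(A, r, n);
    for (int i = 0; i < n; ++i) x[i] += d[i];
  }
  return x;
}
```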
27
Summit: general comments on user experiences
• The most common execution configuration on Summit is 1 MPI rank owns 1 GPU
and some CPU cores (like Titan), though some codes are using other
configurations, and no doubt users will experiment with still others
• Have requested jsrun options that would allow arbitrary execution configuration
on nodes—some users absolutely need this flexibility, e.g., 2 apps need
nonuniform resource sets for master/slave execution
• Earlier saw long jsrun/MPI init times on Summit, especially at large node/rank
counts. This has improved considerably.
• Earlier Spectrum MPI beta versions we received had never been run at such high
node counts—various issues encountered and bugs filed—IBM has worked to
address
28
Summit: general comments
• We would prefer more vendor tuning of third party libraries, as we have had in
the past. IBM does give us some optimized build instructions for third party
libraries.
• A more general concern regarding the broader industry: every new HPC system
we get has more complex node hardware and software stack. We hope HPC
vendors very carefully manage this complexity. Users want and need advanced
hardware features but also need reliable, coherent software to access them.
• Similarly, users mix programming models, e.g., MPI, OpenMP, OpenACC,
Kokkos, etc., sometimes in complex ways. We need to have interoperability
and coherency between these (Example: can an OpenMP thread launch an OpenACC kernel?)
29
Summit: general comments
• GPU high-speed atomic update operations of Volta (and Pascal) have made a
huge impact on some applications
• Unified memory, automatic migration of data to GPU very helpful for some codes—
e.g., codes with deep data structures. However, some users will prefer manual
control of the exact timing of transfers for performance.
• Most codes that run at our center also run at other sites. Use of vendor-specific
libraries or features that give higher performance may be avoided by some users
to maintain portability. We prefer standards-based approaches when possible.
• MPS will be used by some, but can add complexity, e.g., need to manage CUDA
mem handles. Also, MPS adds to the myriad of complexities to manage (resource
sets, ranks per node, SMT mode, NUMA domains, thread binding, GPU binding,
etc.).
30
Summit: general comments
• We like having multiple compilers for risk mitigation, but there may not be any single
compiler satisfying all requirements for a project, e.g., OpenACC, fast CPU code
generation, etc. Also, Fortran is important, used by slightly under half of projects
(INCITE 2014).
• Features like RDMA and GPUDirect are important to users. RDMA is needed by at
least one library (ExaTensor) used by 3 CAAR projects.
• Because of Titan, we have already optimized many of our codes for CPU-GPU
interconnect bandwidth (overlapped transfers, data persistence on GPUs) and latency
(large transfers, longer-running kernels). However, some users still need to run many
kernels, e.g., QMCPACK, thus still rely on low-latency kernel launch.
• Inter-node messages of many possible sizes depending on the app, e.g., halo (e.g.,
S3D-Legion), large (~ 1 GB) messages (ScaLAPACK, CoMet, SLATE), small latency-limited messages (climate codes)—teams will work to optimize each of these cases.
31
Conclusions
• Summit has shown itself a very powerful system for multiple
applications so far
• We have worked with IBM and other partners to resolve issues
• We are looking forward to the new science that Summit will make
possible in the near future
32
This research used resources of the Oak Ridge Leadership
Computing Facility at the Oak Ridge National Laboratory, which
is supported by the Office of Science of the U.S. Department of
Energy under Contract No. DE-AC05-00OR22725.
Questions?
Wayne Joubert
joubert@ornl.gov