NVIDIA HPC ソフトウエア斜め読み (A Quick Look at NVIDIA HPC Software)
NARUHIKO TAN | HPC SOLUTION ARCHITECT
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
PLATFORM SPECIALIZATION
CUDA
ACCELERATION LIBRARIES
Core Communication
Math Data Analytics AI Quantum
std::transform(par, x, x+n, y, y,
[=](float x, float y){ return y +
a*x; }
);
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
import cunumeric as np
…
def saxpy(a, x, y):
y[:] += a*x
#pragma acc data copy(x,y) {
...
std::transform(par, x, x+n, y, y,
[=](float x, float y){
return y + a*x;
});
...
}
#pragma omp target data map(x,y) {
...
std::transform(par, x, x+n, y, y,
[=](float x, float y){
return y + a*x;
});
...
}
__global__
void saxpy(int n, float a,
float *x, float *y) {
int i = blockIdx.x*blockDim.x +
threadIdx.x;
if (i < n) y[i] += a*x[i];
}
int main(void) {
...
cudaMemcpy(d_x, x, ...);
cudaMemcpy(d_y, y, ...);
saxpy<<<(N+255)/256,256>>>(...);
cudaMemcpy(y, d_y, ...);
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP
PLATFORM SPECIALIZATION
CUDA
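As a complete, compilable companion to the ISO C++ snippet above, here is a minimal stdpar saxpy. This is a sketch: the vector size, initial values, and the printed check are illustrative assumptions, not taken from the slide.

// saxpy_stdpar.cpp: ISO C++17 parallel algorithms version of saxpy
#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
  const int n = 1 << 20;               // assumed problem size
  const float a = 2.0f;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);

  // y = a*x + y; offloaded to the GPU when built with nvc++ -stdpar=gpu,
  // run on CPU cores with -stdpar=multicore
  std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                 [=](float xi, float yi) { return yi + a * xi; });

  std::printf("y[0] = %f\n", y[0]);    // expect 4.0
}

Built, for example, with nvc++ -stdpar=gpu -o saxpy_stdpar saxpy_stdpar.cpp.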
NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud
Develop for the NVIDIA Platform: GPU, CPU and Interconnect
Libraries | Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
Compilers
nvcc nvc
nvc++
nvfortran
Programming
Models
Standard C++ & Fortran
OpenACC & OpenMP
CUDA
Core
Libraries
libcu++
Thrust
CUB
Math
Libraries
cuBLAS cuTENSOR
cuSPARSE cuSOLVER
cuFFT cuRAND
Communication
Libraries
HPC-X
NVSHMEM
NCCL
DEVELOPMENT
Profilers
Nsight
Systems
Compute
Debugger
cuda-gdb
Host
Device
ANALYSIS
SHARP HCOLL
UCX SHMEM
MPI
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ C++
PILLARS OF STANDARD LANGUAGE PARALLELISM
Copyright (C) 2021 Bryce Adelstein Lelbach
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
return s | bulk(N,
[] (auto data) {
// ...
}
) | bulk(N,
[] (auto data) {
// ...
}
);
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
return s | bulk(
[] (auto data) {
// ...
}
) | bulk(
[] (auto data) {
// ...
}
);
}
C++ with OpenMP
Ø Composable, compact and elegant
Ø Easy to read and maintain
Ø ISO Standard
Ø Portable – nvc++, g++, icpc, MSVC, …
Standard C++
#pragma omp parallel // OpenMP parallel region
{
#pragma omp for // OpenMP for loop
for (MInt i = 0; i < noCells; i++) { // Loop over all cells
if (timeStep % ipow2[maxLevel_ - clevel[i * distLevel]] == 0) { // Multi-grid loop
const MInt distStartId = i * nDist; // Local offsets for 1D accesses
const MInt distNeighStartId = i * distNeighbors;
const MFloat* const distributionsStart = &distributions[distStartId];
for (MInt j = 0; j < nDist - 1; j += 2) { // Unrolled loop over distributions (factor 2)
if (neighborId[i * distNeighbors + j] > -1) { // First unrolled iteration
const MInt n1StartId = neighborId[distNeighStartId + j] * nDist;
oldDistributions[n1StartId + j] = distributionsStart[j]; // 1D access, AoS format
}
if (neighborId[i * distNeighbors + j + 1] > -1) { // Second unrolled iteration
const MInt n2StartId = neighborId[distNeighStartId + j + 1] * nDist;
oldDistributions[n2StartId + j + 1] = distributionsStart[j + 1];
}
}
oldDistributions[distStartId + lastId] = distributionsStart[lastId]; // Zero-th distribution
}
}
}
std::for_each_n(par_unseq, start, noCells, [=](auto i) { // Parallel for
if (timeStep % IPOW2[maxLevel_ - a_level(i)] != 0) // Multi-level loop
return;
for (MInt j = 0; j < nDist; ++j) {
if (auto n = c_neighborId(i, j); n == -1) continue;
a_oldDistribution(n, j) = a_distribution(i, j); // SoA or AoS mem_fn
}
});
M-AIA WITH C++17 PARALLEL ALGORITHMS
Multi-physics simulation framework
from RWTH Aachen University
Ø Hierarchical grids, complex moving geometries
Ø Adaptive meshing, load balancing
Ø Numerical methods: FV, DG, LBM, FEM, Level-Set, ...
Ø Physics: aeroacoustics, combustion, biomedical, ...
Ø Developed by ~20 PhDs (Mech. Eng.), ~500k LOC++
Ø Programming model: MPI + ISO C++ parallelism
M-AIA
Multi-physics simulation framework developed at the Institute of Aerodynamics, RWTH Aachen University
Decaying isotropic turbulence
400k fully-resolved particles
Chart, relative speed-up: OpenMP (2x EPYC 7742) 1.0x, ISO C++ (2x EPYC 7742) 1.025x, ISO C++ (A100) 8.74x
PARALLELISM IN C++ ROADMAP
C++ 14 C++ 17 C++ 20 C++ PIPELINE
• Memory model
enhancements
• Lambdas
• Atomics
extensions
• Generic Lambda
Expressions
• Parallel algorithms
• Forward progress
guarantees
• Memory model
clarifications
• Scalable
synchronization
library
• Ranges
• Span
• Linear algebra
algorithms
• Asynchronous
parallel algorithms
• Senders-receivers
• Mdspan
• Range-based
parallel algorithms
• Extended floating-point types
General parallelism user facing feature
How users run C++
code on GPUs today
Co-designed with
V100 hardware
support
Custom algorithms
and async.
control flow
N-dimensional
loops and
usability
Extended C++ interface
to BLAS/Lapack
General usability
of performance
provided by executors
C++ 11
PILLARS OF STANDARD LANGUAGE PARALLELISM
Copyright (C) 2021 Bryce Adelstein Lelbach
With Senders & Receivers
Today
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
return s | bulk(N,
[] (auto data) {
// ...
}
) | bulk(N,
[] (auto data) {
// ...
}
);
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
return s | bulk(
[] (auto data) {
// ...
}
) | bulk(
[] (auto data) {
// ...
}
);
}
SENDERS & RECEIVERS
Maxwell’s equations
template <ComputeSchedulerT, WriteSchedulerT>
auto maxwell_eqs(ComputeSchedulerT &scheduler, WriteSchedulerT &writer)
{
return repeat_n(
n_outer_iterations,
repeat_n(
n_inner_iterations,
schedule(scheduler)
| bulk(grid.cells, update_h(accessor))
| bulk(grid.cells, update_e(time, dt, accessor)))
| transfer(writer)
| then(dump_results(report_step, accessor)))
| then([]{ printf("simulation complete\n"); })
);
}
Simplify Work Across CPUs and
Accelerators
• Uniform abstraction between code and
diverse resources
• ISO standard
• Write once, run everywhere
ELECTROMAGNETISM
Raw performance & % of peak
std::sync_wait(maxwell(inline_scheduler, inline_scheduler));
std::sync_wait(maxwell(openmp_scheduler, inline_scheduler));
std::sync_wait(maxwell(cuda, inline_scheduler));
§ CPUs: AMD EPYC 7742 CPUs, GPUs: NVIDIA A100-SXM4-80
§ Inline (1 CPU HW thread), OpenMP-128 (1x CPU), OpenMP-256 (2x CPUs), Graph (1x GPU), Multi-GPU (2x GPUs)
§ clang-12 with -O3 -DNDEBUG -mtune=native -fopenmp
Charts: speedup and efficiency vs. STREAM TRIAD for the OpenMP 128, OpenMP 256, CUDA (1x A100), and CUDA (2x A100) schedulers
STRONG SCALING USING ISO STANDARD C++
NVIDIA SUPERPOD
§ 140x NVIDIA DGX A100 640GB systems
§ 1120x NVIDIA A100-SXM4-80GB GPUs
Chart: Maxwell senders & receivers scaling vs. ideal scaling and parallel efficiency, speedup (up to ~40x) for up to 1200 GPUs
PARALLEL ALGORITHMS AND SENDERS & RECEIVERS
PALABOS CARBON SEQUESTRATION
Copyright (C) 2022 NVIDIA
Ø Palabos is a framework for fluid dynamics simulations using
Lattice-Boltzmann methods.
Ø Code for multi-component flow through a porous medium
ported to C++ Senders and Receivers.
Ø Application: simulating carbon sequestration in sandstone.
Christian Huber (Brown University), Jonas Latt (University of Geneva)
Georgy Evtushenko (NVIDIA), Gonzalo Brito (NVIDIA)
Chart: strong scaling on 32 to 512 A100 GPUs (speedup up to 16x)
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ FORTRAN
MODERN FORTRAN FEATURES FOR HPC
Standard Parallelism and Concurrency Features
Fortran 2018
§ DO CONCURRENT: data-parallel loop construct with locality specifiers. Supported in nvfortran.
§ Array Intrinsics: various math intrinsics that apply to entire arrays and map to accelerated libraries, supported in nvfortran.
§ Co-Arrays: Partitioned Global Address Space arrays, teams of processes (images), collectives & synchronization. Awaiting F18.
Fortran 202X (coming in 2023)
§ DO CONCURRENT Reductions: support for reduction operations on concurrent loops (à la OpenACC/OpenMP). Supported since nvfortran 21.11.
Fortran 202Y (in discussion)
§ Atomics: proposed support for atomic variable accesses.
§ Asynchronous Tasking: proposed support for asynchronous tasks.
MINIWEATHER
Standard Language Parallelism in Climate/Weather Applications
Mini-App written in C++ and Fortran that simulates
weather-like fluid flows using Finite Volume and
Runge-Kutta methods.
Existing parallelization in MPI, OpenMP, OpenACC, …
Included in the SPEChpc benchmark suite*
Open-source and commonly-used in training events.
https://github.com/mrnorman/miniWeather/
MiniWeather
Chart: relative speedup for the OpenMP (CPU), DO CONCURRENT (CPU), DO CONCURRENT (GPU), and OpenACC versions
do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx)
local(x,z,x0,z0,xrad,zrad,amp,dist,wpert)
if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then
x = (i_beg-1 + i-0.5_rp) * dx
z = (k_beg-1 + k-0.5_rp) * dz
x0 = xlen/8
z0 = 1000
xrad = 500
zrad = 500
amp = 0.01_rp
dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp
if (dist <= pi / 2._rp) then
wpert = amp * cos(dist)**2
else
wpert = 0._rp
endif
tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) &
+ wpert*hy_dens_cell(k)
endif
state_out(i,k,ll) = state_init(i,k,ll) &
+ dt * tend(i,k,ll)
enddo
Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5.
OpenACC version uses -gpu=managed option.
*SPEChpc is a trademark of The Standard Performance Evaluation Corporation
POT3D: DO CONCURRENT
POT3D is a Fortran application for approximating solar
coronal magnetic fields.
Included in the SPEChpc benchmark suite*
Existing parallelization in MPI & OpenACC
Optimized the DO CONCURRENT version by using
OpenACC solely for data motion and atomics
https://github.com/predsci/POT3D
POT3D
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
Data courtesy of Predictive Science Inc. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ PYTHON
PRODUCTIVITY
Sequential and Composable Code
§ Sequential semantics - no visible
parallelism or synchronization
§ Name-based global data – no partitioning
§ Composable – can combine with other
libraries and datatypes
def cg_solve(A, b, conv_iters):
    x = np.zeros_like(b)
    r = b - A.dot(x)
    p = r
    rsold = r.dot(r)
    converged = False
    max_iters = b.shape[0]
    for i in range(max_iters):
        Ap = A.dot(p)
        alpha = rsold / (p.dot(Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        if i % conv_iters == 0 and \
           np.sqrt(rsnew) < 1e-10:
            converged = i
            break
        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew
PERFORMANCE
§ Transparently run at any scale needed to address computational challenges at hand
§ Automatically leverage all the available hardware
Transparent Acceleration
Supercomputer
Multi-GPU
GPU
DPU
Grace
CPU
COMPUTATIONAL FLUID DYNAMICS
Chart: Distributed NumPy performance (weak scaling), time in seconds vs. number of GPUs / relative dataset size (1 to 1024), comparing cuPy and Legate
for _ in range(iter):
un = u.copy()
vn = v.copy()
b = build_up_b(rho, dt, dx, dy, u, v)
p = pressure_poisson_periodic(b, nit, p, dx, dy)
…
Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython
Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of
Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
• CFD codes like:
• Shallow-Water Equation Solver
• Oil Pipeline Risk Management: Geoclaw-
landspill simulations
• Python Libraries: Jupyter, NumPy, SciPy,
SymPy, Matplotlib
CFD Python on cuNumeric!
ACCELERATED STANDARD LANGUAGES
Parallel performance for wherever your code runs
std::transform(par, x, x+n, y,
y,[=](float x, float y){
return y + a*x;
}
);
import cunumeric as np
…
def saxpy(a, x, y):
y[:] += a*x
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
ISO C++ ISO Fortran Python
CPU GPU
nvc++ -stdpar=multicore
nvfortran -stdpar=multicore
legate --cpus 16 saxpy.py
nvc++ -stdpar=gpu
nvfortran -stdpar=gpu
legate --gpus 1 saxpy.py
LEARN MORE
GTC2022 sessions
§ No More Porting: Coding for GPUs with Standard C++, Fortran, and Python [S41496]
§ Shifting through the Gears of GPU Programming Understanding Performance and Portability Trade-offs [S41620]
§ C++ Standard Parallelism [S41960]
§ Future of Standard and CUDA C++ [S41961]
§ Connect with Experts: Standard and CUDA C++ User Forum [CWE41949]
§ From Directives to DO CONCURRENT: A Case Study in Standard Parallelism [S41318]
§ Evaluating Your Options for Accelerated Numerical Computing in Pure Python [S41645]
Blogs
§ Developing Accelerated Code with Standard Language Parallelism
§ Accelerating Standard C++ with GPUs Using stdpar
§ Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK
§ Bringing Tensor Cores to Standard Fortran
§ Accelerating Python on GPUs with nvc++ and Cython
LEARN MORE
Blogs
§ Multi-GPU Programming with Standard Parallel C++, Part 1
§ Multi-GPU Programming with Standard Parallel C++, Part 2
Open-source codes
§ LULESH: https://github.com/LLNL/LULESH
§ STLBM: https://gitlab.com/unigehpfs/stlbm
§ MiniWeather: https://github.com/mrnorman/miniWeather/
§ POT3D: https://github.com/predsci/POT3D
§ Legate: https://github.com/nv-legate
§ Jacobi example using C++ standard parallelism: https://gitlab.com/unigehpfs/paralg
NVIDIA HPC SDK Documentation
https://docs.nvidia.com/hpc-sdk/index.html
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
NVIDIA MATH LIBRARIES
Linear Algebra, FFT, RNG and Basic Math
CUDA Math API
cuFFT
cuSPARSE cuSOLVER
cuBLAS cuTENSOR
cuRAND CUTLASS
MATH LIBRARIES
§ MULTI-GPU MATH LIBRARIES
Chart: speedup (larger is better) vs. 3D FFT size, for 1, 2, 4, and 8 GPUs
cuFFTXt: MAXIMIZING SINGLE-NODE PERFORMANCE
Speedups for 3D C2C versus CTK 11.0
* A100 80GB Default clocks: CTK 11.0 vs. CTK 11.6
Recently Introduced
§ Up to 10x improvements for SNMG FFTs
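For context, a minimal sketch of the single-node multi-GPU cuFFTXt workflow that these speedups refer to; the 256^3 problem size, the two GPU IDs, and the omission of error checking are illustrative assumptions, not taken from the slide.

// cufftxt_3d.cpp: 3D C2C FFT spread across 2 GPUs on one node with cuFFTXt
#include <cufftXt.h>
#include <complex>
#include <vector>

int main() {
  const int nx = 256, ny = 256, nz = 256;   // assumed problem size
  std::vector<std::complex<float>> h_data(size_t(nx) * ny * nz, {1.0f, 0.0f});

  cufftHandle plan;
  cufftCreate(&plan);

  int gpus[2] = {0, 1};                     // assumed GPU IDs
  cufftXtSetGPUs(plan, 2, gpus);            // must precede plan creation

  size_t work_sizes[2];
  cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, work_sizes);

  cudaLibXtDesc *d_data;                    // descriptor spanning both GPUs
  cufftXtMalloc(plan, &d_data, CUFFT_XT_FORMAT_INPLACE);
  cufftXtMemcpy(plan, d_data, h_data.data(), CUFFT_COPY_HOST_TO_DEVICE);

  cufftXtExecDescriptorC2C(plan, d_data, d_data, CUFFT_FORWARD);

  cufftXtMemcpy(plan, h_data.data(), d_data, CUFFT_COPY_DEVICE_TO_HOST);
  cufftXtFree(d_data);
  cufftDestroy(plan);
}

Built by linking against cuFFT (e.g. -lcufft); the file name is hypothetical.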
cuTENSORMg: MULTI-GPU TENSOR CONTRACTIONS
Performance of FP32 Tensor Contractions on DGX A100
Data residing on Host (Dotted) or Device (Solid) Memory
* DGX A100 80GB
§ Introduced in cuTENSOR v1.4
§ Out-of-core released in v1.5
Chart: TFLOPS (larger is better) vs. tensor contraction sizes M = N = K (4096 to 196608), for 1, 2, 4, and 8 GPUs with data resident in device (solid) or host (dotted) memory
Releasing cuTENSOR v1.5
§ Added Out-of-core Functionality
§ Library wide optimizations
cuSOLVERMp: DENSE LINEAR ALGEBRA AT SCALE
LU Decomposition (GETRF+GETRS) w/ Pivoting on Summit Supercomputer
Chart: time in seconds (smaller is better) vs. number of GPUs (1 to 4096), comparing the state of the art with HPC SDK 21.11
* Summit: 6x V100 16GB per node
Released in HPC SDK 21.11
§ LU Decomposition
§ With & Without pivoting
§ Cholesky
cuFFTMp: FFTs AT SCALE - SLAB DECOMPOSITION
Distributed 3D FFT Performance: Comparison by Precision
Chart: TFLOPS (larger is better) vs. number of GPUs (8 to 4096) and problem size cubed (2048 to 16384), comparing C2C and Z2Z
* Selene: A100 80GB @ 1410 MHz
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab Decomposition
§ Pencil Decomposition (Preview)
§ Helper functions: Pencils <-> Slabs
Chart: TFLOPS (larger is better) vs. problem size cubed (1024 to 8192), for GPU counts from 32 to 2048
cuFFTMp: FFTs AT SCALE - PENCIL DECOMPOSITION
Distributed 3D FFT Performance: C2C Comparison by GPU Count
* Selene: A100 80GB @ 1410 MHz
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab Decomposition
§ Pencil Decomposition (Preview)
§ Helper functions: Pencils <-> Slabs
[S41494] A Deep Dive into the Latest HPC Software
MATH LIBRARIES
§ MATH LIBRARY DEVICE EXTENSIONS
MATH LIBRARIES DEVICE EXTENSIONS
cuFFTDx Performance: Comparison with cuFFT across various sizes
Chart: TFLOPS (larger is better) vs. 1D FFT sizes (2 to 32768), comparing cuFFTDx and cuFFT
* A100 80GB @ 1410 MHz
Released in MathDx 22.02
§ Available on DevZone
§ Supports Volta+ architectures
§ FFT 1D sizes up to 32k
Future Releases
§ cuBLASDx/cuSOLVERDx
§ 2D/3D FFTs
§ Windows Support
LEARN MORE
GTC2022 sessions
§ An Explanation of Slab and Pencil Decomposition Performance Across Supercomputing Clusters [S41153]
§ Recent Developments in NVIDIA Math Libraries [S41491]
§ Connect with Experts: NVIDIA Math Libraries [CWE41721]
§ Connect with Experts: Thrust, CUB, and libcu++ User Forum [CWE41948]
§ NVSHMEM: CUDA-Integrated Communication for NVIDIA GPUs (a Magnum IO session) [S41044]
Examples
§ CUDA Library Samples: https://github.com/NVIDIA/CUDALibrarySamples
MathDx 22.02
§ https://developer.nvidia.com/mathdx
LEARN MORE
Math Libraries Documentation
https://docs.nvidia.com/hpc-sdk/index.html#math-libraries
Blog
§ Multinode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale
§ Extending Block-Cyclic Tensors for Multi-GPU with NVIDIA cuTENSORMg
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
DEVELOPER TOOLS
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
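Since NVTX ties application phases to the profiler timelines used by these tools, here is a minimal sketch of annotating phases; the phase names and loop are made up for illustration:

// nvtx_phases.cpp: mark application phases so they appear on the Nsight timeline
#include <nvToolsExt.h>

static void timestep() { /* ... application work ... */ }

int main() {
  nvtxRangePushA("initialization");   // open a named range: setup phase
  /* ... allocate and initialize data ... */
  nvtxRangePop();                     // close the range

  nvtxRangePushA("solver");           // open a named range: main loop
  for (int step = 0; step < 100; ++step) {
    timestep();
  }
  nvtxRangePop();
}

Typically compiled with the NVTX headers shipped in the CUDA toolkit and linked with -lnvToolsExt (or built header-only with NVTX v3).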
DEVELOPER TOOLS
§ COMPUTE DEBUGGERS/IDE
CUDA-GDB
Command-Line and IDE Back-End Debugger
§ Unified CPU and CUDA debugging
§ CUDA-C/SASS support
§ Built on GDB and uses many of the same CLI commands
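A typical session against the saxpy kernel from the earlier slide might look roughly like this; the kernel and variable names come from that running example, and only a few CUDA-specific commands are shown:

cuda-gdb ./saxpy
(cuda-gdb) break saxpy          # break at the kernel, like any GDB breakpoint
(cuda-gdb) run
(cuda-gdb) info cuda kernels    # list kernels currently resident on the GPU
(cuda-gdb) cuda thread (1,0,0)  # switch focus to another CUDA thread
(cuda-gdb) print i              # inspect device variables in the current focus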
COMPUTE SANITIZER
Automatically Scan for Bugs and Memory Issues
Compute Sanitizer checks correctness issues via sub-tools:
§ Memcheck – memory access error and leak detection tool.
§ Racecheck – shared memory data access hazard detection tool.
§ Initcheck – uninitialized device global memory access detection tool.
§ Synccheck – thread synchronization hazard detection tool.
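Each sub-tool is selected on the command line; a minimal usage sketch (./myapp is a placeholder application):

compute-sanitizer --tool memcheck --leak-check full ./myapp
compute-sanitizer --tool racecheck ./myapp
compute-sanitizer --tool initcheck ./myapp
compute-sanitizer --tool synccheck ./myapp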
DEVELOPER TOOLS
§ COMPUTE DEBUGGERS/IDE NEW FEATURES
CORRECTNESS TOOLS FEATURES
§ OptiX support in Compute Sanitizer
§ Automatically find correctness issues in OptiX workloads
§ Core Dump support in Compute Sanitizer
§ Generate core dumps on detected issues
§ 5x performance increase in core dump generation
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 1 bytes
=========     at 0x4d70 in /home/cuda/optixBasic/draw_solid_color.cu:69:__raygen__draw_solid_color_0xebf766b2f0642d4e
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f878f900403 is out of bounds
=========     and is 262,132 bytes after the nearest allocation at 0x7f878f8c0400 of size 16 bytes
=========     Device Frame:NVIDIA internal [0x430]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x60fbaa]
=========                in /lib/x86_64-linux-gnu/libnvoptix.so.1
=========     Host Frame:optix_stubs.h:568:optixLaunch [0xe1ff]
=========                in /home/cuda/optixBasic/optixBasic
=========     Host Frame:/home/cuda/optixBasic/optixBasic.cpp:227:main [0xb735]
=========                in /home/cuda/optixBasic/optixBasic
=========     Host Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_start_call_main [0x2dfd0]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../csu/libc-start.c:379:__libc_start_main [0x2e07d]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x8dde]
=========                in /home/cuda/optixBasic/optixBasic
DEVELOPER TOOLS
§ NSIGHT SYSTEMS
NSIGHT SYSTEMS
System Profiler
Key Features:
§ System-wide application algorithm tuning
§ Multi-process tree support
§ Locate optimization opportunities
§ Visualize millions of events on a very fast GUI timeline, or gaps of unused CPU and GPU time
§ Balance your workload across multiple CPUs and GPUs
§ CPU algorithms, utilization, and thread state
§ GPU streams, kernels, memory transfers, etc.
§ Command line, Standalone, IDE integration
OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+
§ Docs/product: https://developer.nvidia.com/nsight-systems
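A minimal command-line sketch of collecting and summarizing a trace; ./myapp and the report name are placeholders:

nsys profile --trace=cuda,nvtx,osrt --output=myreport ./myapp   # collect a system-wide trace
nsys stats myreport.nsys-rep                                    # print summary statistics from the report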
DEVELOPER TOOLS
§ NSIGHT SYSTEMS NEW FEATURES
MULTI-REPORT TILING
Visualize More Parallel Activity
Open multiple reports
Loaded on same timeline based on wall-clock
EXPERT SYSTEMS & STATISTICS
Built-in Data Analytics with Advice
NVIDIA NETWORKING ADAPTER SAMPLING
§ Profile NVIDIA networking adapters
§ Sent / Received / Congestion
§ Correlate with expected network traffic and other system activities
GPUDIRECT STORAGE SUPPORT
GPU Metrics Sampling of PCIe BAR1 Requests & cuFile Trace
§ Direct communication to GPU memory
§ cuFile APIs used for GPUDirect Storage
DEVELOPER TOOLS
§ NSIGHT COMPUTE
NSIGHT COMPUTE
Kernel Profiling Tool
Key Features:
§ Interactive CUDA API debugging and kernel profiling
§ Built-in rules expertise
§ Fully customizable data collection and display
§ Command line, Standalone, IDE integration, Remote targets
OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, MacOSX
(host only)
GPUs: Volta, Turing, Ampere GPUs
Docs/product: https://developer.nvidia.com/nsight-compute
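A minimal command-line sketch of profiling kernels with the ncu CLI; ./myapp and the kernel name are placeholders:

ncu --set full -o myprofile ./myapp                 # collect the full section set into myprofile.ncu-rep
ncu --kernel-name saxpy --launch-count 1 ./myapp    # profile only the first launch of a specific kernel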
DEVELOPER TOOLS
§ NSIGHT COMPUTE NEW FEATURES
REGISTER DEPENDENCY VISUALIZATION
Visualize Register Usage and Dependency Chains
§ SASS view in the Source page
§ Tracking reads and writes for each register
§ Identify long dependency chains
§ Detect inefficient register usage
§ Columns show all dependencies for:
§ Registers
§ Predicates
§ Uniform Registers
§ Uniform Predicates
STANDALONE SOURCE VIEWER
§ View of side-by-side assembly and correlated source code for CUDA kernels
§ No profile required
§ Open .cubin files directly
§ Helps identify compiler optimizations and inefficiencies
OCCUPANCY CALCULATOR
Model Hardware Usage and Identify Limiters
§ Model theoretical hardware usage
§ Understand limitations from hardware vs. kernel parameters
§ Configure model to vary HW and kernel parameters
§ Opened from an existing report or as a new activity
HIERARCHICAL ROOFLINE
§ Visualize multiple levels of the memory hierarchy
§ Identify bottlenecks caused by memory limitations
§ Determine how modifying algorithms may (or may not) impact performance
LEARN MORE
GTC2022 sessions
§ Optimizing Communication with Nsight Systems Network Profiling [S41500]
§ Latest Updates to CUDA Developer Tools [D4121]
§ How to Understand and Optimize Shared Memory Accesses using Nsight Compute [S41723]
§ Connect with Experts: What’s in Your CUDA Toolbox? Profiling, Optimization, and Debugging Tools [CWE41541]
§ What, Where, and Why? Use CUDA Developer Tools to Detect, Locate, and Explain Bugs and Bottlenecks [S41493]
Nsight Systems Documentation
§ https://docs.nvidia.com/nsight-systems/
Nsight Compute Documentation
§ https://docs.nvidia.com/nsight-compute/
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute