6. GPU-enabled libraries (a selection)
NVIDIA cuBLAS
NVIDIA cuRAND
NVIDIA cuSPARSE
NVIDIA cuFFT
NVIDIA cuDNN
NVIDIA AmgX
Vector Signal Image Processing
GPU Accelerated Linear Algebra
Sparse Linear Algebra
IMSL Library
Matrix Algebra on GPU and Multicore
C++ STL Features for CUDA (Thrust)
15. cuBLAS usage example
for ( int j = 0; j < N; j++ ) {
  for ( int i = 0; i < M; i++ ) {
    C[ j*ldc + i ] = 0.0f;
    for ( int k = 0; k < K; k++ ) {
      C[ j*ldc + i ] += A[ k*lda + i ] * B[ j*ldb + k ];
    }
  }
}
Matrix multiplication: C = A × B
[Figure: A is M×K, B is K×N, C is M×N]
16. cuBLAS usage example
Matrix multiplication with BLAS: C = A × B
[Figure: A is M×K, B is K×N, C is M×N]
sgemm( 'n', 'n', M, N, K, 1.0, A, lda, B, ldb, 0.0, C, ldc );
17. cuBLAS usage example
const float alpha = 1.0f, beta = 0.0f;
cublasCreate( &handle );                                     // create the handle
cudaMalloc( &d_A, sizeof(float) * M * K );                   // allocate device memory
cudaMalloc( &d_B, sizeof(float) * K * N );
cudaMalloc( &d_C, sizeof(float) * M * N );
cublasSetMatrix( M, K, sizeof(float), A, lda, d_A, lda );    // transfer input data
cublasSetMatrix( K, N, sizeof(float), B, ldb, d_B, ldb );
cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,      // execute on the GPU
             &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc );
cublasGetMatrix( M, N, sizeof(float), d_C, ldc, C, ldc );    // transfer output data
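A minimal self-contained version of this sequence, assuming the cuBLAS v2 API (cublas_v2.h); error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <cublas_v2.h>

// C = A x B on the GPU; A is MxK, B is KxN, C is MxN (column-major)
void gemm_gpu( const float *A, const float *B, float *C, int M, int N, int K )
{
    const int lda = M, ldb = K, ldc = M;
    const float alpha = 1.0f, beta = 0.0f;
    cublasHandle_t handle;
    float *d_A, *d_B, *d_C;

    cublasCreate( &handle );                                   // create the handle
    cudaMalloc( (void**)&d_A, sizeof(float) * M * K );         // allocate device memory
    cudaMalloc( (void**)&d_B, sizeof(float) * K * N );
    cudaMalloc( (void**)&d_C, sizeof(float) * M * N );
    cublasSetMatrix( M, K, sizeof(float), A, lda, d_A, lda );  // transfer input data
    cublasSetMatrix( K, N, sizeof(float), B, ldb, d_B, ldb );
    cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,    // execute
                 &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc );
    cublasGetMatrix( M, N, sizeof(float), d_C, ldc, C, ldc );  // transfer output data
    cudaFree( d_A ); cudaFree( d_B ); cudaFree( d_C );         // clean up
    cublasDestroy( handle );
}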
18. cuBLAS usage example
Pre-processing + matrix multiplication + post-processing

for ( k = 0; k < K; k++ )
  for ( i = 0; i < M; i++ )
    A[ k*lda + i ] = … ;                 // pre-processing
sgemm( 'n', 'n', M, N, K, 1.0, A, lda, B, ldb, 0.0, C, ldc );
for ( j = 0; j < N; j++ )
  for ( i = 0; i < M; i++ )
    C[ j*ldc + i ] = … ;                 // post-processing

If only the sgemm call is offloaded, the surrounding loops stay on the CPU and force data transfers around every GPU call; the next two slides move the pre- and post-processing onto the GPU as well.
19. Combining with CUDA kernels

…
cublasSetMatrix( M, K, sizeof(float), A, lda, d_A, lda );
cublasSetMatrix( K, N, sizeof(float), B, ldb, d_B, ldb );
kernel_update_A<<< … >>>( d_A, lda, … );                     // CUDA kernel (pre-processing)
cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,      // library call
             &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc );  // alpha = 1.0f, beta = 0.0f
kernel_update_C<<< … >>>( d_C, ldc, … );                     // CUDA kernel (post-processing)
cublasGetMatrix( M, N, sizeof(float), d_C, ldc, C, ldc );
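kernel_update_A and kernel_update_C are placeholders on this slide; as an illustration only, a pre-processing kernel of that shape might look like this (the scaling operation is an assumed example, not the original code):

// Hypothetical pre-processing kernel: element-wise update of the
// M x K column-major matrix A (leading dimension lda) in device memory
__global__ void kernel_update_A( float *d_A, int lda, int M, int K, float scale )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index
    int k = blockIdx.y * blockDim.y + threadIdx.y;   // column index
    if ( i < M && k < K )
        d_A[ k*lda + i ] *= scale;                   // illustrative update
}

// Example launch: one thread per matrix element
// dim3 block( 16, 16 ), grid( (M+15)/16, (K+15)/16 );
// kernel_update_A<<< grid, block >>>( d_A, lda, M, K, 2.0f );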
20. Combining with OpenACC

#pragma acc data copyin(A, B) copyout(C)
{
  #pragma acc parallel loop                          // OpenACC (pre-processing)
  for ( k = 0; k < K; k++ )
    for ( i = 0; i < M; i++ )
      A[ k*lda + i ] = … ;
  #pragma acc host_data use_device(A, B, C)          // library call with device addresses
  { cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, lda, B, ldb, &beta, C, ldc ); }   // alpha = 1.0f, beta = 0.0f
  #pragma acc parallel loop                          // OpenACC (post-processing)
  for ( j = 0; j < N; j++ )
    for ( i = 0; i < M; i++ )
      C[ j*ldc + i ] = … ;
}
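Note that copyin(A, B) as written relies on the compiler knowing the array sizes; when A, B, and C are heap pointers, the clauses need explicit shapes. A minimal sketch under that assumption, with illustrative element-wise updates standing in for the elided loop bodies:

#include <cublas_v2.h>

void acc_pre_gemm_post( float *A, float *B, float *C, int M, int N, int K,
                        int lda, int ldb, int ldc, cublasHandle_t handle )
{
    const float alpha = 1.0f, beta = 0.0f;

    #pragma acc data copyin(A[0:lda*K], B[0:ldb*N]) copyout(C[0:ldc*N])
    {
        #pragma acc parallel loop collapse(2)            // pre-processing on the GPU
        for ( int k = 0; k < K; k++ )
            for ( int i = 0; i < M; i++ )
                A[ k*lda + i ] *= 2.0f;                  // illustrative update

        #pragma acc host_data use_device(A, B, C)        // pass device addresses to cuBLAS
        {
            cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                         &alpha, A, lda, B, ldb, &beta, C, ldc );
        }

        #pragma acc parallel loop collapse(2)            // post-processing on the GPU
        for ( int j = 0; j < N; j++ )
            for ( int i = 0; i < M; i++ )
                C[ j*ldc + i ] += 1.0f;                  // illustrative update
    }
}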
25. cuFFT: FFT library
XT interface support: cufftXt API
Up to 4 GPUs
Callback routines
Pre- and post-processing installed as callbacks
[Figure: cuFFT callback pipelines]
Without callbacks, 3 kernels: (1) read input → convert to 32-bit → write; (2) read → perform FFT → write; (3) read FFT output → convert to 8-bit → write 8-bit data.
With callbacks, 1 kernel: read input → convert to 32-bit → perform FFT → convert to 8-bit → write 8-bit data.
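As a sketch of the callback mechanism (assuming static linking against cufft_static with relocatable device code, which the cufftXt callback API requires), a load callback that converts signed 8-bit samples to float as cuFFT reads its input:

#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>

// Load callback: invoked by cuFFT for each input element it reads
__device__ cufftReal load_8bit( void *dataIn, size_t offset,
                                void *callerInfo, void *sharedPtr )
{
    return (cufftReal)((signed char *)dataIn)[ offset ];   // 8-bit -> 32-bit
}
__device__ cufftCallbackLoadR d_loadPtr = load_8bit;       // device-side function pointer

void attach_load_callback( cufftHandle plan )
{
    cufftCallbackLoadR h_loadPtr;
    cudaMemcpyFromSymbol( &h_loadPtr, d_loadPtr, sizeof(h_loadPtr) );
    cufftXtSetCallback( plan, (void **)&h_loadPtr, CUFFT_CB_LD_REAL, NULL );
}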
26. cuFFT: up to 700 GFLOPS
[Chart: cuFFT throughput, GFLOPS vs. transform size (1 to 1,000,000); single precision (32-bit) peaks near 700 GFLOPS, double precision (64-bit) near 350 GFLOPS; separate series for transform sizes that are powers of 2, 3, 5, and 7]
• cuFFT 7.0 on K40m, Base clocks, ECC ON
• Batched transforms on 28M-33M total elements, input and output data on device
• Excludes time to create cuFFT “plans”
1D complex batched FFTs
(signal processing; building blocks of 2D/3D FFTs)
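The setup being measured, a batched 1D complex-to-complex transform, looks roughly like this (plan creation, which the timings exclude, then execution on device data):

#include <cufft.h>

void fft_1d_batched( cufftComplex *d_data, int nx, int batch )
{
    cufftHandle plan;
    cufftPlan1d( &plan, nx, CUFFT_C2C, batch );            // plan: excluded from the timings above
    cufftExecC2C( plan, d_data, d_data, CUFFT_FORWARD );   // in-place forward transforms
    cufftDestroy( plan );
}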
27. cuFFT: performance improvements (CUDA 6.5 → 7.0)
[Chart: speedup of cuFFT 7.0 over 6.5 (1x to 5x) vs. transform size (0 to 140), 1D single-precision complex-to-complex batched FFTs; labeled points at sizes 23, 31, 66, 110, and 121]
• cuFFT 6.5 and 7.0 on K20m, ECC ON
• Batched transforms on 32M total elements, input and output data on device
• Excludes time to create cuFFT “plans”
30. cuBLAS: single precision >3 TFLOPS, double precision >1 TFLOPS
[Chart: throughput in GFLOPS (0 to 3,500) for cuBLAS level-3 routines GEMM, SYMM, TRSM, and SYRK in single (S), single complex (C), double (D), and double complex (Z) precision]
• cuBLAS 7.0 on K40m, Base clocks, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
31. cuBLAS-XT: >12 TFLOPS (3 GPUs in 1 node)
[Chart: throughput in GFLOPS (0 to 14,000) for cuBLAS-XT routines (GEMM, SYRK, TRSM in single, single complex, double, and double complex precision), comparing 1×K80 with 3×K80]
• cuBLAS 7.0 on K80, Base clocks, ECC ON
• input and output data on host, m=n=k=32768, transpose=no
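Unlike plain cuBLAS, cuBLAS-XT accepts host-side matrices and tiles them across the selected GPUs automatically. A minimal sketch (the three device IDs are illustrative):

#include <cublasXt.h>

void xt_gemm( const float *A, const float *B, float *C, int m, int n, int k )
{
    cublasXtHandle_t handle;
    int devices[3] = { 0, 1, 2 };                 // illustrative device IDs
    const float alpha = 1.0f, beta = 0.0f;

    cublasXtCreate( &handle );
    cublasXtDeviceSelect( handle, 3, devices );   // spread the GEMM over 3 GPUs
    cublasXtSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                   &alpha, A, m, B, k, &beta, C, m );   // host pointers
    cublasXtDestroy( handle );
}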
34. cuSPARSE: performance comparison
[Chart: speedup over MKL (0x to 5x) for sparse matrix × dense vector products (SpMV) across a set of test matrices]
• Average of S/C/D/Z routines
• cuSPARSE 7.0 on K40m, Base clocks, ECC ON, input and output data on device
• MKL 11.0.4 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
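The measured operation, y = α·A·x + β·y with A stored in CSR format, written against the CSR SpMV interface of this cuSPARSE generation (cusparseScsrmv, later superseded by the generic SpMV API):

#include <cusparse.h>

void spmv_csr( cusparseHandle_t handle, int m, int n, int nnz,
               const float *d_val, const int *d_rowPtr, const int *d_colInd,
               const float *d_x, float *d_y )
{
    cusparseMatDescr_t descr;
    const float alpha = 1.0f, beta = 0.0f;

    cusparseCreateMatDescr( &descr );    // defaults: general matrix, zero-based indexing
    cusparseScsrmv( handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                    m, n, nnz, &alpha, descr,
                    d_val, d_rowPtr, d_colInd,     // CSR storage of A
                    d_x, &beta, d_y );             // y = alpha*A*x + beta*y
    cusparseDestroyMatDescr( descr );
}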
48. cuRAND: high performance
[Chart: throughput in Gsamples/sec (0 to 18) for cuRAND generators, pseudo-random (XORWOW, Philox, MRG32k3a, MTGP32) and quasi-random (Sobol32 scrambled, Sobol64 scrambled), under uniform, normal, and log-normal distributions]
• cuRAND 7.0 on K40m, Base clocks, ECC ON, double-precision input and output data on device
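A minimal host-API sketch matching the benchmark setup (double-precision values generated straight into device memory; XORWOW is one of the measured generators):

#include <curand.h>

void fill_random( double *d_data, size_t n )   // n assumed even (normal variates come in pairs)
{
    curandGenerator_t gen;
    curandCreateGenerator( &gen, CURAND_RNG_PSEUDO_XORWOW );
    curandSetPseudoRandomGeneratorSeed( gen, 1234ULL );
    curandGenerateUniformDouble( gen, d_data, n );            // uniform distribution
    curandGenerateNormalDouble( gen, d_data, n, 0.0, 1.0 );   // normal (mean 0, stddev 1)
    curandDestroyGenerator( gen );
}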
49. cuRAND: more than 50x faster (vs. MKL)
[Chart: throughput in GSamples/sec (0 to 16), cuRAND vs. MKL, for the Sobol32 and MRG32k3a generators under uniform, normal, and log-normal distributions]
• cuRAND 7.0 on K40m, Base clocks, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
50. Thrust: CUDA C++ parallel template library
C++ STL-like template library ("C++ STL features for CUDA")
Rapid application development and prototyping
GPU-optimized parallel algorithms: sort, reduce, scan, and more (see the sketch below)
Portable: also runs on CPUs via the OpenMP and TBB backends
GitHub: thrust.github.com
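A minimal sketch of the STL-like style; device_vector puts the data on the GPU and the algorithms run there:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

void thrust_demo( const thrust::host_vector<float> &h )
{
    thrust::device_vector<float> d = h;                        // copy to the GPU
    thrust::sort( d.begin(), d.end() );                        // parallel sort
    float sum = thrust::reduce( d.begin(), d.end(), 0.0f );    // parallel reduction
    thrust::inclusive_scan( d.begin(), d.end(), d.begin() );   // prefix sum, in place
    (void)sum;
}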
51. Thrust: performance improvements (CUDA 6.5 → 7.0)
sort: 1.1–1.8x faster (up to 3x for user-defined types)
merge: 2x faster
scan: 1.15x faster
reduce_by_key: 1.25x faster
CUDA stream support (new in CUDA 7.0; a fuller sketch follows the chart below):
thrust::count_if( thrust::cuda::par.on(stream1), text, text+n, myFunc() );
[Chart: sort speedup (32M samples), CUDA 7.0 vs. 6.5, by element type: char 1.7x, short 1.8x, int 1.2x, long 1.1x, float 1.3x, double 1.1x]
• CUDA 7.0 and 6.5 on K40m, ECC ON, input and output data on device
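A self-contained sketch of the stream support shown above (text, n, and myFunc are elided on the slide; is_positive below is an illustrative stand-in for the predicate):

#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/count.h>
#include <thrust/system/cuda/execution_policy.h>

struct is_positive {   // illustrative predicate
    __host__ __device__ bool operator()( float x ) const { return x > 0.0f; }
};

void count_on_stream( const float *d_data, size_t n )
{
    cudaStream_t stream1;
    cudaStreamCreate( &stream1 );
    // par.on(stream1) launches the algorithm on stream1 instead of the default stream
    ptrdiff_t c = thrust::count_if( thrust::cuda::par.on(stream1),
                                    d_data, d_data + n, is_positive() );
    cudaStreamDestroy( stream1 );
    (void)c;
}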