A Methodology for Automatic GPU Kernel Optimization - NECSTTechTalk 4/06/2020

POLITECNICO DI MILANO
A Methodology for Automatic GPU Kernel
Optimization
Alberto Zeni

Context Definition
[Figure: "40 Years of Microprocessor Trend Data", x-axis from 1980 to 2020, logarithmic y-axis from 10^2 to 10^7; annotation: "1000x by 2025"]

Contributions
● We propose a methodology that guides the user in developing highly optimized GPU kernels
● We demonstrate the usefulness of our methodology by implementing it in a semi-automatic tool for kernel optimization
● We show the results of applying our methodology to two highly computationally intensive algorithms

Methodology application
[Figure: "Optimization Flow" diagram with the following boxes: Unoptimized source code, Source Code Parser, Roofline Generator, Roofline and Performance Analyzer, Compiler Optimizer, Optimized source code]

Roofline Model Adaptation
● Model built on the characteristics of the GPU and of the algorithm executed, independent of the algorithm implementation
Quantities used by the model (the symbols and the two derived expressions on this slide were not preserved):
● number of iterations
● GPU core frequency
● number of operations to be computed at iteration i
● number of blocks
● number of scheduled threads per block
● number of integer cores
● number of streaming multiprocessors
● maximum number of blocks per streaming multiprocessor
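The adapted expressions of this slide are not legible in the text, so the sketch below does not attempt to reproduce them. It only illustrates the classic Roofline bound that such a model builds on: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The function name and the device numbers are hypothetical.

```python
def roofline_bound(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Classic Roofline bound: min(compute roof, memory roof)."""
    memory_roof = bandwidth_bytes_per_s * ops_per_byte
    return min(peak_ops_per_s, memory_roof)


# Hypothetical device, for illustration only.
PEAK_OPS = 10e12      # 10 Tops/s compute peak (assumed)
BANDWIDTH = 500e9     # 500 GB/s memory bandwidth (assumed)

# Arithmetic intensity at which the two roofs meet.
ridge_point = PEAK_OPS / BANDWIDTH

for intensity in (2.0, ridge_point, 100.0):
    bound = roofline_bound(PEAK_OPS, BANDWIDTH, intensity)
    kind = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"{intensity:6.1f} ops/byte -> {bound:.2e} ops/s ({kind})")
```

A kernel whose measured point sits far below the roof at its arithmetic intensity still has room for optimization, which is how the unoptimized and optimized Roofline plots later in the deck are meant to be read.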

Source Code Parser
● Automatically unrolls loops if possible
● Automatically changes the memory hierarchy
● Automatically changes the number of scheduled threads
● Automatically changes the number of scheduled blocks
● Creates a report of the optimizations that can be applied manually

Algorithms Background
● Pairwise alignment is one of the most commonly used workhorses of sequence analysis
● Sequence Alignment is one of the most computationally expensive steps in genome analysis
Percentages shown on the slide for tools that rely on alignment:
● SSPACE [1]: 86%
● Bella [2]: >80%
● PairHMM GATK [3]: 70%
● BWA-MEM [4]: >85%
● minimap2 [5]: >85%
[1] Boetzer, Marten, et al. "Scaffolding pre-assembled contigs using SSPACE." Bioinformatics 27.4 (2011): 578-579.
[2] Guidi, Giulia, et al. "BELLA: Berkeley efficient long-read to long-read aligner and overlapper." bioRxiv (2018): 464420.
[3] Sampietro, Davide, et al. "Fpga-based pairhmm forward algorithm for DNA variant calling." 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2018.
[4] Li, Heng. "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM." arXiv preprint arXiv:1303.3997 (2013).
[5] Li, Heng. "Minimap2: pairwise alignment for nucleotide sequences." Bioinformatics 34.18 (2018): 3094-3100.

Smith-Waterman Algorithm
● Optimal algorithm for local sequence alignment
● Performs the alignment by computing a matrix called the Alignment Matrix
● The algorithm assigns a fixed score depending on whether two characters match or do not match
● The score of a cell is determined by its dependencies on the previously computed cells
Match = 2
Mismatch = -2
Partially filled Alignment Matrix:
       A   A   T   G
  A    2   2   0   0
  T    0   0   4
  T    0
  C

Smith-Waterman Algorithm
● Optimal algorithm for local sequence alignment
● Execution time scales with the length of the aligned sequences
Alignment Matrix:
          A   C   C   T   A   G   G   A
      0   0   0   0   0   0   0   0   0
  A   0   1   0   0   0   1   0   0   1
  G   0   0   0   0   0   0   2   1   0
  G   0   0   0   0   0   0   1   3   1
  G   0   0   0   0   0   0   1   2   2
  T   0   0   0   0   1   0   0   0   1
  C   0   0   1   1   0   0   0   0   0
  A   0   1   0   0   0   1   0   0   1
  A   0   1   0   0   0   1   0   0   1
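The recurrence described on the Smith-Waterman slides can be sketched as follows. Match = 2 and Mismatch = -2 come from the slide; the gap penalty is not stated there, so `gap=-2` is an assumption, as are the function and variable names. With these values the first cells agree with the partially filled matrix shown for ATTC against AATG.

```python
def smith_waterman(a, b, match=2, mismatch=-2, gap=-2):
    """Fill the local-alignment score matrix for sequences a (rows) and b (columns).

    Row 0 and column 0 are the all-zero boundary.
    Returns the matrix and the best local score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Each cell depends on its diagonal, upper and left neighbours.
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            # Local alignment never lets a score drop below zero.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return H, best


H, best = smith_waterman("ATTC", "AATG")
for label, row in zip(" ATTC", H):
    print(label, row)
print("best local score:", best)
```

The two nested loops visit every cell, which is why execution time grows with the product of the two sequence lengths.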

The Needleman-Wunsch Algorithm
● Optimal algorithm for global sequence alignment
● Performs the alignment by computing a matrix called the Alignment Matrix
● The algorithm assigns a fixed score depending on whether two characters match or do not match
● The score of a cell is determined by its dependencies on the previously computed cells
Match = 2
Mismatch = -2
Partially filled Alignment Matrix:
       A   C   G   G
  A    1  -1  -3  -4
  T   -1   0  -2
  T   -3
  C

The Needleman-Wunsch Algorithm
● Optimal algorithm for global sequence alignment
● Execution time correlated to the length of the aligned sequences
● Inefficient if the two sequences do not align
Alignment Matrix:
          A   C   G   G   G   G   A
      0  -1  -2  -3  -4  -5  -6  -7
  A  -1   1  -1  -3  -4  -5  -6  -5
  T  -2  -1   0  -2  -4  -5  -6  -7
  T  -3  -3  -2  -1  -3  -5  -6  -7
  C  -4  -4  -2  -3  -2  -4  -6  -7
  G  -5  -5  -4  -1  -2  -1  -3  -7
  G  -6  -6  -6  -3   0  -1   0  -4
  C  -7  -7  -5  -5  -2  -1  -2  -1
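A minimal sketch of the global-alignment recurrence described on the Needleman-Wunsch slides. The scoring values (+1 for a match, -1 for a mismatch or a gap) are an assumption borrowed from the X-Drop example later in the deck, and the names are hypothetical. Cell values depend on the scoring scheme, so the output is not expected to reproduce the matrix shown on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment score matrix for a (rows) and b (columns)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    # Boundary: aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        S[i][0] = i * gap
    for j in range(1, cols):
        S[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Unlike Smith-Waterman, scores are allowed to go negative.
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S


S = needleman_wunsch("ATTCGGC", "ACGGGGA")
print("global alignment score:", S[-1][-1])
```

Every cell of the matrix is computed even when the sequences are unrelated, which is the inefficiency the slide points out.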

X-Drop Algorithm
● X-Drop [Zhang et al.] termination offers a good trade-off between speed and accuracy
● Execution starts from a seed common to the two sequences
● Execution stops when the score drops more than X below the latest best score
          A   C   G   G   G   G   A
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0  -∞
  T  -∞  -∞  -∞
  C
  G
  G
  C

X-Drop Algorithm
● In X-Drop, each cell is computed in the same way as in NW
X = 2, MAX = 0, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1
  A  -1
  T
  T
  C
  G
  G
  C

X-Drop Algorithm
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2
  A  -1   1
  T  -2
  T
  C
  G
  G
  C

X-Drop Algorithm
● If the score of a cell is more than X below MAX, the cell is flagged with -inf
X = 2, MAX = 0, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2
  A  -1   1
  T  -2
  T
  C
  G
  G
  C

X-Drop Algorithm
● Once an antidiagonal has been computed, the global maximum is updated
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2
  A  -1   1
  T  -2
  T
  C
  G
  G
  C

X-Drop Algorithm
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2  -3
  A  -1   1  -1
  T  -2  -1
  T  -3
  C
  G
  G
  C

X-Drop Algorithm
● -3 is below MAX - X, so the cell is flagged with -inf
● MAX remained the same
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2  -∞
  A  -1   1  -1
  T  -2  -1
  T  -∞
  C
  G
  G
  C

X-Drop Algorithm
● Again we compute the cells normally
● Now -2 is below MAX - X, so the cell is flagged with -inf
● MAX remained the same
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2  -∞
  A  -1   1  -1  -2
  T  -2  -1   0
  T  -∞  -2
  C
  G
  G
  C

X-Drop Algorithm
● Again we compute the cells normally
● Now -2 is below MAX - X, so the cell is flagged with -inf
● MAX remained the same
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0
  T  -∞  -∞
  C
  G
  G
  C

X-Drop Algorithm
● Again we compute the cells normally
● Now all the cell scores are below MAX - X
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0  -2
  T  -∞  -∞  -2
  C
  G
  G
  C

X-Drop Algorithm
● Again we compute the cells normally
● Now all the cell scores are below MAX - X
● The execution ends
X = 2, MAX = 1, Alignment = +1, All penalties = -1
          A   C   G   G   G   G   A
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0  -∞
  T  -∞  -∞  -∞
  C
  G
  G
  C
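The walkthrough on the X-Drop slides can be sketched as a loop over antidiagonals: compute each cell as in NW, flag it with -inf when it falls more than X below MAX, update MAX once the antidiagonal is complete, and stop when an entire antidiagonal is -inf. The scoring values (+1 for a match, -1 for every penalty) follow the example on the slides; the function name and the dictionary-based score table are assumptions. For brevity this sketch still visits every cell of each antidiagonal, whereas an optimized implementation would also shrink the range of cells it visits.

```python
NEG_INF = float("-inf")


def xdrop_extend(a, b, x, match=1, mismatch=-1, gap=-1):
    """Best extension score from cell (0, 0) with X-drop termination."""
    m, n = len(a), len(b)
    S = {(0, 0): 0}   # sparse score table; a missing cell counts as -inf
    best = 0          # MAX on the slides
    for d in range(1, m + n + 1):                  # antidiagonal: i + j == d
        diag_best = NEG_INF
        for i in range(max(0, d - n), min(d, m) + 1):
            j = d - i
            score = NEG_INF
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                score = S.get((i - 1, j - 1), NEG_INF) + sub
            if i > 0:
                score = max(score, S.get((i - 1, j), NEG_INF) + gap)
            if j > 0:
                score = max(score, S.get((i, j - 1), NEG_INF) + gap)
            if score < best - x:                   # more than X below MAX
                score = NEG_INF
            S[(i, j)] = score
            diag_best = max(diag_best, score)
        if diag_best == NEG_INF:                   # whole antidiagonal pruned
            break
        best = max(best, diag_best)                # update MAX per antidiagonal
    return best


print(xdrop_extend("ACGT", "ACGT", x=2))           # sequences that align
print(xdrop_extend("A" * 50, "T" * 50, x=2))       # sequences that do not
```

For the two unrelated 50-character sequences the loop stops after a handful of antidiagonals instead of filling the whole matrix, which is the saving X-Drop is designed to deliver.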

X-Drop search space
[Figure: comparison between the search space of different algorithms: X-Drop, Banded, NW]
● X-Drop is a very efficient heuristic for aligning genomic sequences, especially if the two sequences do not align
● It offers a significant improvement over Banded and classic NW, as the computation can be cut short earlier if needed

GPU implemented optimizations
● The two algorithms follow the same computational pattern
● We started with a simple implementation of the algorithms using a single thread and a single block
● We followed our methodology with the help of our tool to optimize the algorithms at different levels and introduce inter-level and intra-level parallelism
[Figure: a partially filled alignment matrix and an empty one, both for A C G G (columns) against A T T C (rows)]

Intra Level Parallelism
● Parallel computation of the anti-diagonals
● Each GPU thread is assigned to compute a single cell, as our methodology suggested
● Anti-diagonals are split into different segments to align sequences of any length
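The reason anti-diagonals can be computed in parallel is that every cell on anti-diagonal d reads only cells from anti-diagonals d-1 and d-2. The sketch below reorders the NW computation accordingly, in plain Python rather than GPU code; the inner loop is the part that would map to one thread per cell. Names and scoring values (+1, -1, -1) are assumptions.

```python
def nw_by_antidiagonals(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the NW score matrix one anti-diagonal at a time."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for d in range(2, m + n + 1):          # anti-diagonal: i + j == d
        # No iteration of this inner loop depends on another one,
        # so on a GPU each value of i could be handled by its own thread.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S


S = nw_by_antidiagonals("ATTCGGC", "ACGGGGA")
print("score:", S[-1][-1])
```

The outer loop remains sequential, so the available parallelism at each step equals the length of the current anti-diagonal. When that length exceeds the number of threads in a block, the anti-diagonal has to be processed in segments, as the last bullet describes.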

Inter Level parallelism
● Parallel execution of the alignments with multiple blocks
● Each block has an alignment assigned

GPU memory optimizations
● To ensure coalesced memory access, one of the sequences is stored backwards on the GPU
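The slide does not spell out why reversing one sequence helps, so the following is one plausible reading. Along an anti-diagonal, consecutive threads move forward in one sequence and backward in the other; storing the second sequence reversed makes both index streams ascending and contiguous, which is the access pattern coalescing favours. The function below only computes the indices involved; its name and the choice of which sequence is reversed are assumptions.

```python
def antidiagonal_indices(d, m, n):
    """Character indices read by consecutive threads on anti-diagonal d.

    Cells are (i, j) with i + j == d, 0-based, for sequences of length m and n.
    Returns (query indices, target indices, indices into the reversed target).
    """
    query = list(range(max(0, d - (n - 1)), min(d, m - 1) + 1))
    target_forward = [d - i for i in query]
    # In a reversed copy, target[j] lives at position (n - 1) - j.
    target_reversed = [(n - 1) - j for j in target_forward]
    return query, target_forward, target_reversed


q, fwd, rev = antidiagonal_indices(d=3, m=4, n=4)
print("query indices:          ", q)
print("target, stored forward: ", fwd)
print("target, stored reversed:", rev)
```

With forward storage, neighbouring threads read the target at descending addresses; with the reversed copy, thread t and thread t+1 read adjacent ascending addresses in both sequences.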

Evaluation Settings
Benchmarked Applications:
● SeqAn: highly optimized version of X-Drop
● ksw2: CPU SIMD Z-drop
● Bowtie2: Smith-Waterman
● CUDASW++ 3.0: Smith-Waterman GPU + CPU SIMD

Evaluation Settings
Platforms:
● Intel Haswell Nodes
● Intel Skylake Nodes
● IBM Power 9 Nodes
● GPU

Smith-Waterman Unoptimized Roofline

Smith-Waterman Optimized Roofline

Smith-Waterman Comparison
[Figure: speed-up comparison chart; data labels: 34x, 3x, 1x, 11x, 1x]

X-drop Unoptimized Roofline

X-drop GPU and SeqAn Comparison
[Figure: speed-up comparison chart; data labels: 2x, 6x]

X-drop GPU and ksw2 Comparison
[Figure: speed-up comparison chart; data labels: 1.5x, 120x]

Conclusions
A methodology for automatic GPU kernel optimization and its implementation inside a tool for automatic kernel optimization
We applied our methodology to two highly computationally intensive algorithms
Optimized GPU X-drop implementation with:
● More than 6.6x speed-up with respect to SeqAn
● More than 120x speed-up with respect to ksw2
Optimized GPU Smith-Waterman implementation with:
● More than 34x speed-up with respect to Bowtie2
● More than 3x speed-up with respect to CUDASW++ 3.0

Thank you for your attention