POLITECNICO DI MILANO
A Methodology for Automatic GPU Kernel
Optimization
Alberto Zeni
Context Definition
[Figure: 40 Years of Microprocessor Trend Data, 1980 to 2020, logarithmic vertical axis from 10^2 to 10^7. Annotation: 1000x by 2025]
Contributions
● We propose a methodology that guides the user in developing highly optimized GPU kernels
● We demonstrate the usefulness of our methodology by implementing it in a semi-automatic tool for kernel optimization
● We show the results of applying our methodology to two highly computationally intensive algorithms
Methodology application
[Figure: Optimization Flow. Blocks shown: Unoptimized source code, Roofline Generator, Roofline and Performance Analyzer, Source Code Parser, Compiler Optimizer, Optimized source code]
Roofline Generator
Roofline Model Adaptation
● Model built on the characteristics of the GPU and of the executed algorithm, independent of the algorithm implementation
Quantities used by the model [the symbols and the two formulas on the slide are not recoverable]:
● number of iterations
● GPU core frequency
● number of operations to be computed at iteration i
● number of blocks
● number of scheduled threads per block
● number of integer cores
● number of streaming multiprocessors
● maximum number of blocks per streaming multiprocessor
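The adapted formulas of this slide are not available here, so the following is only a reminder of the classic roofline bound that such a model starts from: attainable performance is the minimum of the peak compute rate and the memory bandwidth multiplied by the operational intensity. It is a sketch of the textbook model, not of the adaptation presented in the talk; the function names and all numbers are hypothetical.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, intensity_ops_per_byte):
    """Classic roofline bound: min(compute roof, memory roof).

    This is the textbook model, not the adapted one from the slide.
    """
    memory_roof = bandwidth_bytes_per_s * intensity_ops_per_byte
    return min(peak_ops_per_s, memory_roof)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity at which a kernel stops being memory bound."""
    return peak_ops_per_s / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical device: 1000 ops/s peak, 100 bytes/s bandwidth.
    for intensity in (2.0, 10.0, 50.0):
        print(intensity, attainable_performance(1000.0, 100.0, intensity))
```

A kernel whose operational intensity lies to the left of the ridge point is limited by memory traffic, so memory-hierarchy changes are the natural first optimization; to the right of it, only more parallelism or fewer operations can help.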
Source Code Parser
● Automatically unrolls loops if possible
● Automatically changes the memory hierarchy
● Automatically changes the number of scheduled threads
● Automatically changes the number of scheduled blocks
● Creates a report of the optimizations that can be applied manually
Algorithms Background
● Pairwise alignment is one of the most commonly used workhorses of sequence analysis
● Sequence alignment is one of the most computationally expensive steps in genome analysis
Percentages shown on the slide for each tool:
● SSPACE [1]: 86%
● BELLA [2]: >80%
● PairHMM GATK [3]: 70%
● BWA-MEM [4]: >85%
● minimap2 [5]: >85%
[1] Boetzer, Marten, et al. "Scaffolding pre-assembled contigs using SSPACE." Bioinformatics 27.4 (2011): 578-579.
[2] Guidi, Giulia, et al. "BELLA: Berkeley efficient long-read to long-read aligner and overlapper." bioRxiv (2018): 464420.
[3] Sampietro, Davide, et al. "FPGA-based PairHMM forward algorithm for DNA variant calling." 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2018.
[4] Li, Heng. "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM." arXiv preprint arXiv:1303.3997 (2013).
[5] Li, Heng. "Minimap2: pairwise alignment for nucleotide sequences." Bioinformatics 34.18 (2018): 3094-3100.
Smith-Waterman Algorithm
● Optimal algorithm for local sequence alignment
● Performs the alignment by computing a matrix called the Alignment Matrix
● The algorithm assigns a fixed score depending on whether two characters match or not
● The score of each cell depends on previously computed cells
Match = 2
Mismatch = -2
Partially filled Alignment Matrix (sequences AATG, horizontal, and ATTC, vertical):
       A   A   T   G
  A    2   2   0   0
  T    0   0   4
  T    0
  C
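The cell dependencies described above can be sketched as a plain sequential fill of the matrix. This is a minimal illustration, not the GPU kernel from the talk: the slide gives only the match and mismatch scores, so the linear gap penalty of -2 is an assumption, and the function name is hypothetical.

```python
def smith_waterman(query, target, match=2, mismatch=-2, gap=-2):
    """Fill the local-alignment matrix and return (best score, matrix).

    gap=-2 is an assumed linear gap penalty; the slide does not state one.
    """
    rows, cols = len(query) + 1, len(target) + 1
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(
                0,                    # restart the alignment here
                H[i - 1][j - 1] + s,  # diagonal: match or mismatch
                H[i - 1][j] + gap,    # up: gap in the target
                H[i][j - 1] + gap,    # left: gap in the query
            )
            best = max(best, H[i][j])
    return best, H


if __name__ == "__main__":
    score, H = smith_waterman("ATTC", "AATG")
    print(score)
    for row in H[1:]:
        print(row[1:])
```

With these assumed parameters, the first rows computed for ATTC against AATG agree with the values visible on the slide (2 2 0 0 and 0 0 4).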
Smith-Waterman Algorithm
● Optimal algorithm for local sequence alignment
● Execution time scales with the length of the aligned sequences
Alignment Matrix (sequences ACCTAGGA, horizontal, and AGGGTCAA, vertical):
         A  C  C  T  A  G  G  A
      0  0  0  0  0  0  0  0  0
  A   0  1  0  0  0  1  0  0  1
  G   0  0  0  0  0  0  2  1  0
  G   0  0  0  0  0  0  1  3  1
  G   0  0  0  0  0  0  1  2  2
  T   0  0  0  0  1  0  0  0  1
  C   0  0  1  1  0  0  0  0  0
  A   0  1  0  0  0  1  0  0  1
  A   0  1  0  0  0  1  0  0  1
The Needleman-Wunsch Algorithm
● Optimal algorithm for global sequence alignment
● Performs the alignment by computing a matrix called the Alignment Matrix
● The algorithm assigns a fixed score depending on whether two characters match or not
● The score of each cell depends on previously computed cells
Match = 2
Mismatch = -2
Partially filled Alignment Matrix (sequences ACGG, horizontal, and ATTC, vertical):
       A   C   G   G
  A    1  -1  -3  -4
  T   -1   0  -2
  T   -3
  C
The Needleman-Wunsch Algorithm
● Optimal algorithm for global sequence alignment
● Execution time correlated to the length of the aligned sequences
● Inefficient if the two sequences do not align
Alignment Matrix (sequences ACGGGGA, horizontal, and ATTCGGC, vertical):
          A   C   G   G   G   G   A
      0  -1  -2  -3  -4  -5  -6  -7
  A  -1   1  -1  -3  -4  -5  -6  -5
  T  -2  -1   0  -2  -4  -5  -6  -7
  T  -3  -3  -2  -1  -3  -5  -6  -7
  C  -4  -4  -2  -3  -2  -4  -6  -7
  G  -5  -5  -4  -1  -2  -1  -3  -7
  G  -6  -6  -6  -3   0  -1   0  -4
  C  -7  -7  -5  -5  -2  -1  -2  -1
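For comparison with the local variant, global alignment can be sketched as the same recurrence without the floor at zero and with gap-initialized borders. This is a minimal illustration under assumed parameters (match 2, mismatch -2, and a linear gap penalty of -2 that the slides do not state); it is not meant to reproduce the illustrative matrix shown on the slide, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=2, mismatch=-2, gap=-2):
    """Fill the global-alignment matrix and return (final score, matrix).

    gap=-2 is an assumed linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    # Borders: aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        S[i][0] = i * gap
    for j in range(1, cols):
        S[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s,  # diagonal: match or mismatch
                S[i - 1][j] + gap,    # up: gap in b
                S[i][j - 1] + gap,    # left: gap in a
            )
    # The global score is always read from the bottom-right cell.
    return S[-1][-1], S


if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GATTACA")[0])
    print(needleman_wunsch("AC", "A")[0])
```

Because every cell of the full matrix is computed regardless of how poorly the sequences match, the cost is the same for unrelated sequences as for similar ones, which is the inefficiency the last bullet refers to.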
X-Drop Algorithm
● X-Drop [Zhang et al.] termination offers a good trade-off between speed and accuracy
● Execution starts from a seed common to the two sequences
● Execution stops when the score drops more than X below the best score found so far
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C   G
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0  -∞
  T  -∞  -∞  -∞
X-Drop Algorithm
● X-Drop computation of each cell is the same as in NW
X = 2, MAX = 0, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A
      0  -1
  A  -1
X-Drop Algorithm
● X-Drop computation of each cell is the same as in NW
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C
      0  -1  -2
  A  -1   1
  T  -2
X-Drop Algorithm
● If the score of a cell is more than X below MAX, the cell is flagged with -inf
X = 2, MAX = 0, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C
      0  -1  -2
  A  -1   1
  T  -2
X-Drop Algorithm
● Once an antidiagonal has been computed, the global maximum is updated
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C
      0  -1  -2
  A  -1   1
  T  -2
X-Drop Algorithm
● X-Drop computation of each cell is the same as in NW
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C   G
      0  -1  -2  -3
  A  -1   1  -1
  T  -2  -1
  T  -3
X-Drop Algorithm
● -3 is below MAX - X so the cell is flagged with -inf
● MAX remained the same
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C   G
      0  -1  -2  -∞
  A  -1   1  -1
  T  -2  -1
  T  -∞
X-Drop Algorithm
● Again we compute the cells normally
● Now -2 is below MAX - X so the cell is flagged with -inf
● MAX remained the same
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C   G
      0  -1  -2  -∞
  A  -1   1  -1  -2
  T  -2  -1   0
  T  -∞  -2
X-Drop Algorithm
● Again we compute the cells normally
● Now -2 is below MAX - X so the cell is flagged with -inf
● MAX remained the same
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C   G
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0
  T  -∞  -∞
X-Drop Algorithm
● Again we compute the cells normally
● Now all the cell scores are below MAX - X
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C   G
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0  -2
  T  -∞  -∞  -2
X-Drop Algorithm
● Again we compute the cells normally
● Now all the cell scores are below MAX - X
● The execution ends
X = 2, MAX = 1, Alignment = +1, All penalties = -1
Sequences: ACGGGGA (horizontal), ATTCGGC (vertical)
          A   C   G
      0  -1  -2  -∞
  A  -1   1  -1  -∞
  T  -2  -1   0  -∞
  T  -∞  -∞  -∞
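The walkthrough can be condensed into a short sequential sketch: cells are computed one antidiagonal at a time with the NW recurrence, a cell scoring below MAX - X is flagged with -inf, MAX is updated after each antidiagonal, and the loop stops once a whole antidiagonal is flagged. The code assumes the parameters shown on the slides (match +1, all penalties -1) and a full matrix for readability; the function name is hypothetical, and intermediate values need not coincide with every entry of the illustrative matrices above.

```python
NEG_INF = float("-inf")


def xdrop_extend(a, b, x, match=1, mismatch=-1, gap=-1):
    """Antidiagonal-wise X-drop extension; returns (best score, matrix)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[NEG_INF] * cols for _ in range(rows)]
    S[0][0] = 0
    best = 0
    for d in range(1, rows + cols - 1):  # antidiagonal index: i + j == d
        alive = False
        diag_best = NEG_INF
        for i in range(max(0, d - cols + 1), min(d, rows - 1) + 1):
            j = d - i
            score = NEG_INF
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                score = S[i - 1][j - 1] + s
            if i > 0:
                score = max(score, S[i - 1][j] + gap)
            if j > 0:
                score = max(score, S[i][j - 1] + gap)
            if score < best - x:
                score = NEG_INF  # more than X below MAX: flag the cell
            else:
                alive = True
            S[i][j] = score
            diag_best = max(diag_best, score)
        best = max(best, diag_best)  # MAX is updated once per antidiagonal
        if not alive:
            break  # every cell of this antidiagonal was flagged
    return best, S


if __name__ == "__main__":
    print(xdrop_extend("ACGT", "ACGT", x=2)[0])
    print(xdrop_extend("AAAA", "TTTT", x=2)[0])
```

For two unrelated sequences the loop ends after a few antidiagonals and most of the matrix is never touched, which is where the advantage over a full NW computation comes from.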
X-Drop search space
[Figure: Comparison between the search space of different algorithms. Panels: NW, Banded, X-Drop]
● X-Drop is a very efficient heuristic for aligning genomic sequences, especially when the two sequences do not align
● It offers a significant improvement over banded and classic NW, as the computation can be cut off early when needed
GPU implemented optimizations
● The two algorithms follow the same computational pattern
● We started with a simple implementation of the algorithms using a single thread and a single block
● We followed our methodology, with the help of our tool, to optimize the algorithms at different levels and to introduce inter- and intra-level parallelism
[Figure: partially filled alignment matrices for the sequences ACGG (horizontal) and ATTC (vertical)]
Intra Level Parallelism
● Parallel computation of the anti-diagonals
● Each GPU thread is assigned to compute a single cell, as our methodology suggested
● Anti-diagonals are split into segments so that sequences of any length can be aligned
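The reason anti-diagonals can be computed in parallel is that every cell (i, j) depends only on cells from the two previous anti-diagonals, so all cells with the same i + j are independent. The sketch below, written in Python rather than CUDA for brevity, enumerates the cells of one anti-diagonal and splits them into fixed-size segments, one cell per thread. The helper names and the segment size are hypothetical and only illustrate the idea.

```python
def antidiagonal_cells(rows, cols, d):
    """Cells (i, j) with i + j == d inside a rows x cols matrix."""
    first = max(0, d - cols + 1)
    last = min(d, rows - 1)
    return [(i, d - i) for i in range(first, last + 1)]


def split_into_segments(cells, threads):
    """Split one anti-diagonal into chunks of at most `threads` cells.

    Each chunk is what a block of `threads` threads would process in one pass.
    """
    return [cells[k:k + threads] for k in range(0, len(cells), threads)]


if __name__ == "__main__":
    cells = antidiagonal_cells(rows=6, cols=6, d=5)
    for segment in split_into_segments(cells, threads=4):
        print(segment)
```

Segmenting keeps the number of threads fixed while the anti-diagonal grows, so the length of the sequences is not bounded by the block size.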
Inter Level Parallelism
33
● Parallel execution of the alignments with multiple blocks
● Each block has an alignment assigned
GPU memory optimizations
● To ensure coalesced memory access, one of the sequences is stored in reverse order on the GPU
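One plausible reading of this optimization: along an anti-diagonal, consecutive threads read increasing positions of one sequence and decreasing positions of the other. If the second sequence is stored reversed, both index streams increase with the thread index, so neighbouring threads touch neighbouring addresses. The sketch below only prints the index patterns; the function name is hypothetical and the exact layout used in the talk is not shown on the slide.

```python
def access_pattern(d, len_a, len_b):
    """Character indices read by consecutive threads on anti-diagonal d.

    Returns (indices into a, indices into b, indices into reversed b).
    """
    first = max(0, d - (len_b - 1))
    last = min(d, len_a - 1)
    a_idx = list(range(first, last + 1))        # increases with the thread id
    b_idx = [d - i for i in a_idx]              # decreases with the thread id
    b_rev_idx = [len_b - 1 - j for j in b_idx]  # increases once b is reversed
    return a_idx, b_idx, b_rev_idx


if __name__ == "__main__":
    a_idx, b_idx, b_rev_idx = access_pattern(d=3, len_a=5, len_b=5)
    print(a_idx)
    print(b_idx)
    print(b_rev_idx)
```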
Evaluation Settings
Benchmarked Applications:
● SeqAn: highly optimized version of X-Drop
● ksw2: CPU SIMD Z-drop
● Bowtie2: Smith-Waterman
● CUDASW++ 3.0: Smith-Waterman GPU + CPU SIMD
Platforms:
● Intel Haswell Nodes
● Intel Skylake Nodes
● IBM Power 9 Nodes
● GPU
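Smith-Waterman Unoptimized Roofline
[Figure: roofline plot]
Smith-Waterman Optimized Roofline
[Figure: roofline plot]
Smith-Waterman Comparison
[Figure: speed-up comparison. Labels shown on the slide: 34x, 3x, 1x, 11x, 1x]
X-drop Unoptimized Roofline
[Figure: roofline plot]
X-drop Optimized Roofline
[Figure: roofline plot]
X-drop GPU and SeqAn Comparison
[Figure: speed-up comparison. Labels shown on the slide: 2x, 6x]
X-drop GPU and ksw2 Comparison
[Figure: speed-up comparison. Labels shown on the slide: 1.5x, 120x]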
Conclusions
We presented a methodology for automatic GPU kernel optimization and its implementation in a tool for automatic kernel optimization
We applied our methodology to two highly computationally intensive algorithms
Optimized GPU X-drop implementation with:
● More than 6.6x speed-up with respect to SeqAn
● More than 120x speed-up with respect to ksw2
Optimized GPU Smith-Waterman implementation with:
● More than 34x speed-up with respect to Bowtie2
● More than 3x speed-up with respect to CUDASW++ 3.0
Thank you for your attention
A Methodology for Automatic GPU Kernel Optimization - NECSTTechTalk 4/06/2020
