Accelerating
microbiome research
with OpenACC
Igor Sfiligoi – University of California San Diego
in collaboration with
Daniel McDonald and Rob Knight
OpenACC Summit 2020 – Sept 2020
We are what we eat
Studies have demonstrated a clear link between
• the gut microbiome
• general human health
https://www.biotechniques.com/multimedia/archive/00252/microbiome2_252150a.jpg
UniFrac distance
Need to understand
how similar pairs of
microbiome samples
are with respect to the
evolutionary histories
of the organisms.
UniFrac distance matrix
• Samples where the organisms are all very
similar from an evolutionary perspective
will have a small UniFrac distance.
• On the other hand, two samples
composed of very different organisms
will have a large UniFrac distance.
Lozupone and Knight, Applied and Environmental Microbiology, 2005
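For reference (not spelled out on the slide), the unweighted UniFrac distance introduced by Lozupone and Knight can be written as the fraction of the phylogenetic tree's branch length that is unique to one of the two samples:

  unweighted UniFrac(A, B) = (branch length unique to A or to B) / (total branch length covered by A or B)

so two identical communities score 0 and two communities sharing no branches score 1.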
Computing UniFrac
• Matrix can be computed
using a striped pattern
• Each stripe can be
computed independently
• Easy to distribute
over many compute units
(Figure: stripes of the distance matrix assigned to CPU1 and CPU2; a distribution sketch follows below.)
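As a rough illustration of the striping idea (not the actual UniFrac code; compute_stripe() is a hypothetical stand-in for the per-stripe work), independent stripes can simply be handed out as contiguous ranges to worker threads:

#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-stripe UniFrac work.
void compute_stripe(unsigned int stripe) { (void)stripe; /* ... */ }

void compute_all_stripes(unsigned int n_stripes, unsigned int n_workers) {
  std::vector<std::thread> workers;
  for (unsigned int w = 0; w < n_workers; w++) {
    // Each compute unit owns a contiguous [start, stop) range of stripes.
    unsigned int start = w * n_stripes / n_workers;
    unsigned int stop  = (w + 1) * n_stripes / n_workers;
    workers.emplace_back([start, stop]() {
      for (unsigned int s = start; s < stop; s++) compute_stripe(s);
    });
  }
  for (auto &t : workers) t.join();
}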
Computing UniFrac
• Most compute localized
in a tight loop
• Operating on
a stripe range
for (unsigned int stripe = start; stripe < stop; stripe++) {
  dm_stripe = dm_stripes[stripe];
  for (unsigned int j = 0; j < n_samples / 4; j++) {
    int k = j * 4;
    double u1 = emb[k];
    double u2 = emb[k+1];
    double v1 = emb[k + stripe + 1];
    double v2 = emb[k + stripe + 2];
    …
    dm_stripe[k]   += (u1-v1)*length;
    dm_stripe[k+1] += (u2-v2)*length;
  }
}
Invoked many times with distinct emb[:] buffers
Intel Xeon E5-2680 v4 CPU (using all 14 cores): 800 minutes (~13 hours)
Modest-size EMP dataset
Porting to GPU
• OpenACC makes it trivial to
port to GPU compute.
• Just decorate the loop with a pragma.
• But minor refactoring was needed
to use a unified buffer
(was an array of pointers; see the sketch after the code)
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf)
for (unsigned int stripe = start; stripe < stop; stripe++) {
  for (unsigned int j = 0; j < n_samples / 4; j++) {
    int idx = (stripe - start_idx) * n_samples;
    double *dm_stripe = dm_stripes_buf + idx;
    int k = j * 4;
    double u1 = emb[k];
    double u2 = emb[k+1];
    double v1 = emb[k + stripe + 1];
    double v2 = emb[k + stripe + 2];
    …
    dm_stripe[k]   += (u1-v1)*length;
    dm_stripe[k+1] += (u2-v2)*length;
  }
}
Invoked many times with distinct emb[:] buffers
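The "unified buffer" refactoring mentioned above can be sketched as follows (illustrative only; StripeBuffer and its members are not the actual UniFrac code). All stripes live in one contiguous allocation and each stripe start is recovered by index arithmetic, which is what lets a single present(dm_stripes_buf) clause cover everything:

#include <cstddef>
#include <vector>

// Illustrative flat stripe storage; replaces an array of per-stripe
// pointers with one contiguous buffer.
struct StripeBuffer {
  std::vector<double> buf;   // n_stripes * n_samples doubles, contiguous
  unsigned int n_samples;
  unsigned int start_idx;    // first stripe stored in this buffer

  StripeBuffer(unsigned int n_stripes, unsigned int n_samples_, unsigned int start_idx_)
    : buf(static_cast<std::size_t>(n_stripes) * n_samples_, 0.0),
      n_samples(n_samples_), start_idx(start_idx_) {}

  // Mirrors: dm_stripe = dm_stripes_buf + (stripe - start_idx) * n_samples;
  double *stripe(unsigned int s) {
    return buf.data() + static_cast<std::size_t>(s - start_idx) * n_samples;
  }
};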
Modest-size EMP dataset
NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (~1.5 hours), was 13h on the CPU
Optimization step 1
Modest-size EMP dataset: 92 min before, was 13h on CPU
• Cluster reads and minimize writes
• Fewer kernel invocations
• Memory writes are much more
expensive than memory reads
• Also undo the manual unrolls
• They were optimal for the CPU
• but bad for the GPU
• Properly align
memory buffers
• Up to 5x slowdown
when not aligned
(see the alignment sketch after this slide)
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf,length)
for (unsigned int stripe = start; stripe < stop; stripe++) {
  for (unsigned int k = 0; k < n_samples; k++) {
    …
    double my_stripe = dm_stripe[k];
    #pragma acc loop seq
    for (unsigned int e = 0; e < filled_embs; e++) {
      uint64_t offset = n_samples * e;
      double u = emb[offset + k];
      double v = emb[offset + k + stripe + 1];
      my_stripe += (u - v) * length[e];
    }
    …
    dm_stripe[k] = my_stripe;
  }
}
NVIDIA Tesla V100 (using all 84 SMs): 33 minutes
Invoked fewer times due to batched emb[:] buffers
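A minimal sketch of the buffer-alignment point above (the 64-byte alignment and the padding scheme are assumptions, not taken from the slide): allocate the stripe buffer with std::aligned_alloc and pad each row so every stripe start stays aligned:

#include <cstddef>
#include <cstdlib>

// Returns an aligned stripe buffer; free it with std::free().
// The caller must use the padded row length as the stripe stride.
double *alloc_aligned_stripes(std::size_t n_stripes, std::size_t n_samples,
                              std::size_t alignment = 64) {  // assumed cache-line width
  // Round each row up so every stripe starts on an aligned boundary.
  std::size_t row_bytes = (n_samples * sizeof(double) + alignment - 1)
                          / alignment * alignment;
  std::size_t total = n_stripes * row_bytes;  // a multiple of 'alignment'
  return static_cast<double *>(std::aligned_alloc(alignment, total));
}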
Optimization step 2
• Reorder loops to maximize
cache reuse (see the tiling sketch below)
#pragma acc parallel loop collapse(3) present(emb,dm_stripes_buf,length)
for (unsigned int sk = 0; sk < sample_steps; sk++) {
  for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int ik = 0; ik < step_size; ik++) {
      unsigned int k = sk*step_size + ik;
      …
      double my_stripe = dm_stripe[k];
      #pragma acc loop seq
      for (unsigned int e = 0; e < filled_embs; e++) {
        uint64_t offset = n_samples * e;
        double u = emb[offset + k];
        double v = emb[offset + k + stripe + 1];
        my_stripe += (u - v) * length[e];
      }
      …
      dm_stripe[k] = my_stripe;
    }
  }
}
NVIDIA Tesla V100 (using all 84 SMs): 12 minutes (was 33 min)
Modest-size EMP dataset
• The CPU code also benefitted from the same optimization:
Xeon E5-2680 v4 CPU (using all 14 cores): 193 minutes (~3 hours)
Originally 13h on the same CPU
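For clarity, the tiling bookkeeping implied by the collapse(3) loop above can be sketched like this (the tile width, the ceiling division, and the tail guard are illustrative assumptions, not taken from the slide):

#include <cstdio>

int main() {
  unsigned int n_samples    = 1000;  // example sample count
  unsigned int step_size    = 16;    // assumed tile width
  unsigned int sample_steps = (n_samples + step_size - 1) / step_size;  // ceiling division

  unsigned int covered = 0;
  for (unsigned int sk = 0; sk < sample_steps; sk++)
    for (unsigned int ik = 0; ik < step_size; ik++) {
      unsigned int k = sk * step_size + ik;  // same index rebuild as the collapse(3) loop
      if (k < n_samples) covered++;          // tail guard for the last, partial tile
    }
  std::printf("%u tiles of %u cover %u of %u samples\n",
              sample_steps, step_size, covered, n_samples);
  return 0;
}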
20x speedup on the modest EMP dataset

Runtime (in minutes):
         E5-2680 v4 CPU      GPU      GPU      GPU     GPU        GPU
         Original    New    V100   2080TI   1080TI    1080   Mobile 1050
fp64          800    193      12       59       77      99         213
fp32            -    190     9.5       19       31      36          64

Using fp32 adds an additional boost, especially on gaming and mobile GPUs.
20x for the V100 GPU vs the Xeon CPU, plus 4x from general optimization.
140x speedup on a cutting-edge 113k-sample dataset

Per chip (in minutes):
         128x CPU E5-2680 v4   128x GPU   4x GPU   16x GPU   16x GPU
         Original       New        V100     V100    2080TI    1080TI
fp64          415        97          14       29       184       252
fp32            -        91          12       20        32        82

Aggregated (in chip hours):
         128x CPU E5-2680 v4   128x GPU   4x GPU   16x GPU   16x GPU
         Original       New        V100     V100    2080TI    1080TI
fp64          890       207          30      1.9        49        67
fp32            -       194          26      1.3       8.5        22

(128x E5-2680 v4 CPUs is a largish CPU cluster; 4x V100 is a single GPU node.)
140x for V100 GPUs vs Xeon CPUs, plus 4.5x from general optimization.
22x speedup on consumer GPUs
(Same per-chip and aggregated tables as above.)
22x for 2080TI GPUs vs Xeon CPUs, plus 4.5x from general optimization.
Consumer GPUs are slower than server GPUs but still faster than CPUs (the workload is memory bound).
Desiderata
• Support for array of pointers
• Was able to work around it,
but annoying
• Better multi-GPU support
• Currently handled with multiple processes
plus a final merge (see the sketch after this list)
• (Better) AMD GPU support
• GCC theoretically has it,
but performance in tests was dismal
• Non-Linux support
• Was not able to find an OpenACC
compiler for macOS or Windows
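A minimal sketch of the multi-GPU workaround mentioned above (the RANK/NPROCS environment variables, the stripe count, and the file-based merge are illustrative assumptions, not the actual UniFrac code): each process takes one contiguous stripe range, drives its own GPU, and the per-process outputs are merged afterwards:

#include <cstdio>
#include <cstdlib>

// Process 'rank' of 'n_procs' owns the contiguous stripe range [start, stop).
void stripe_range_for_rank(unsigned int n_stripes, unsigned int n_procs,
                           unsigned int rank,
                           unsigned int *start, unsigned int *stop) {
  *start = rank * n_stripes / n_procs;
  *stop  = (rank + 1) * n_stripes / n_procs;
}

int main() {
  // Illustrative only: rank/size taken from environment variables a launcher would set.
  const char *r = std::getenv("RANK");
  const char *n = std::getenv("NPROCS");
  unsigned int rank    = r ? static_cast<unsigned int>(std::atoi(r)) : 0;
  unsigned int n_procs = n ? static_cast<unsigned int>(std::atoi(n)) : 1;

  unsigned int n_stripes = 1000;  // example stripe count
  unsigned int start, stop;
  stripe_range_for_rank(n_stripes, n_procs, rank, &start, &stop);
  // Each process would bind to GPU 'rank', compute its stripes, write them out,
  // and a final merge step would concatenate the per-process results.
  std::printf("rank %u/%u computes stripes [%u, %u)\n", rank, n_procs, start, stop);
  return 0;
}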
Conclusions
• OpenACC made porting UniFrac
to GPUs extremely easy
• With a single code base
• Some additional optimizations were
needed to get maximum benefit
• But most were needed for
the CPU-only code path, too
• Performance on NVIDIA GPUs is great
• But wondering what to do for AMD GPUs
and GPUs on non-Linux systems
Acknowledgments
This work was partially funded by US National Science
Foundation (NSF) grants OAC-1826967, OAC-1541349
and CNS-1730158, and by US National Institutes of
Health (NIH) grant DP1-AT010885.