Porting and optimizing UniFrac for GPUs
Reducing microbiome analysis runtimes by orders of magnitude
Igor Sfiligoi (isfiligoi@sdsc.edu), Daniel McDonald (danielmcdonald@ucsd.edu), Rob Knight (robknight@ucsd.edu)
University of California San Diego - La Jolla, CA, USA
This work was partially funded by US National Science Foundation (NSF) grants OAC-1826967, OAC-1541349 and CNS-1730158, and by US National Institutes of Health (NIH) grant DP1-AT010885.
Abstract
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another
(“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many
independent subproblems and exhibits near-linear scaling across multiple nodes, but it does not scale linearly with
the number of CPU cores on a single node, indicating that the compute is memory bound.
Massively parallel algorithms, especially memory-heavy ones, are natural candidates for GPU compute. In this
poster we describe the steps undertaken in porting and optimizing Striped UniFrac for GPUs. We chose to do the
porting using OpenACC, since it allows a single code base for both GPU and CPU compute. The choice proved
wise, as the optimizations aimed at speeding up execution on GPUs helped CPU execution, too.
GPUs have proven to be an excellent resource for this type of problem, reducing runtimes from days to hours.
// Original CPU implementation: a tight, manually unrolled loop.
for (unsigned int stripe = start; stripe < stop; stripe++) {
    dm_stripe = dm_stripes[stripe];
    for (unsigned int j = 0; j < n_samples / 4; j++) {
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1 - v1) * length;
        dm_stripe[k+1] += (u2 - v2) * length;
    }
}
// First GPU port: decorate the loop nest with an OpenACC pragma;
// a single unified buffer (dm_stripes_buf) replaces the per-stripe pointers.
#pragma acc parallel loop collapse(2) \
            present(emb, dm_stripes_buf)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int j = 0; j < n_samples / 4; j++) {
        int idx = (stripe - start_idx) * n_samples;
        double *dm_stripe = dm_stripes_buf + idx;
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1 - v1) * length;
        dm_stripe[k+1] += (u2 - v2) * length;
    }
}
// Cluster reads, minimize writes: accumulate over all embeddings in a
// register (my_stripe) and write dm_stripe[k] back exactly once.
#pragma acc parallel loop collapse(2) \
            present(emb, dm_stripes_buf, length)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int k = 0; k < n_samples; k++) {
        …
        double my_stripe = dm_stripe[k];
        #pragma acc loop seq
        for (unsigned int e = 0; e < filled_embs; e++) {
            uint64_t offset = (uint64_t)n_samples * e;  // 64-bit math avoids overflow
            double u = emb[offset + k];
            double v = emb[offset + k + stripe + 1];
            my_stripe += (u - v) * length[e];
        }
        …
        dm_stripe[k] = my_stripe;  // the single write back to memory
    }
}
// Reorder the loops so nearby threads touch nearby samples,
// maximizing cache reuse of the emb data.
#pragma acc parallel loop collapse(3) \
            present(emb, dm_stripes_buf, length)
for (unsigned int sk = 0; sk < sample_steps; sk++) {
    for (unsigned int stripe = start; stripe < stop; stripe++) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk * step_size + ik;
            …
            double my_stripe = dm_stripe[k];
            #pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = (uint64_t)n_samples * e;
                double u = emb[offset + k];
                double v = emb[offset + k + stripe + 1];
                my_stripe += (u - v) * length[e];
            }
            …
            dm_stripe[k] = my_stripe;  // the single write back to memory
        }
    }
}
Most of the time is spent in a small area of the code, implemented as a set of tight loops.
OpenACC makes it trivial to port to GPU compute: just decorate the loop nest with a pragma.
Minor refactoring was needed, however, to replace the per-stripe pointers with a single unified buffer; a data-movement sketch follows below.
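The present(emb, dm_stripes_buf) clause assumes both buffers already live on the GPU. Below is a minimal sketch of how the unified buffer and the data movement could be set up; the buffer sizes and the emb_size variable are assumptions for illustration, not the exact production code.

// Hypothetical setup; names mirror the snippets above, sizes are assumptions.
// One flat buffer holds all stripes, so a single pragma can manage it.
double *dm_stripes_buf =
    new double[(uint64_t)(stop - start) * n_samples]();

// Copy the inputs to the GPU once, before the hot loops run;
// emb_size would cover n_samples * filled_embs plus the stripe padding.
#pragma acc enter data copyin(emb[0:emb_size], \
    dm_stripes_buf[0:(stop-start)*n_samples])

// ... run the "#pragma acc parallel loop" kernels shown above ...

// Copy the accumulated distances back and free the device copies.
#pragma acc exit data copyout(dm_stripes_buf[0:(stop-start)*n_samples]) \
    delete(emb[0:emb_size])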
Memory writes are much more expensive than memory reads, so cluster the reads and minimize the writes.
We also undid the manual unrolls: they were optimal for the CPU, but bad for the GPU. The transformation is distilled in the sketch below.
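Distilled to its essence, and leaving the UniFrac specifics aside, the transformation looks like this (a simplified sketch, not the exact production code):

// Before: one global-memory read-modify-write per embedding step.
for (unsigned int e = 0; e < filled_embs; e++) {
    uint64_t offset = (uint64_t)n_samples * e;
    dm_stripe[k] += (emb[offset + k] - emb[offset + k + stripe + 1]) * length[e];
}

// After: the reads stay clustered, the running sum lives in a register,
// and global memory is written exactly once.
double my_stripe = dm_stripe[k];
for (unsigned int e = 0; e < filled_embs; e++) {
    uint64_t offset = (uint64_t)n_samples * e;
    my_stripe += (emb[offset + k] - emb[offset + k + stripe + 1]) * length[e];
}
dm_stripe[k] = my_stripe;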
Reorder loops to maximize
cache reuse.
Runtimes measured computing UniFrac on the EMP dataset, after each optimization step:
Intel Xeon E5-2680 v4 CPU (using all 14 cores): 800 minutes (~13 hours)
NVIDIA Tesla V100 (using all 84 SMs), initial port: 92 minutes (~1.5 hours)
NVIDIA Tesla V100 (using all 84 SMs), clustered reads: 33 minutes
NVIDIA Tesla V100 (using all 84 SMs), reordered loops: 12 minutes
UniFrac was originally designed, and has always been implemented, using fp64 math operations. The higher-precision floating point math was used to maximize the reliability of the results.
Can we use fp32? On mobile and gaming GPUs, fp64 compute is 32x slower than fp32 compute.
We can! We compared the results of the compute using the fp32-enabled and fp64-only code on the same
EMP input, and observed near-identical results (Mantel R² = 0.99999, p < 0.001, comparing
pairwise distances in the two matrices).
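A natural way to provide both precisions from a single code base is to make the floating-point type a template parameter. The sketch below assumes a templated kernel named unifrac_stripe; the actual unifrac code may organize this differently.

#include <cstdint>

// Hypothetical sketch: the same kernel body instantiated for fp64 and fp32.
template <typename TFloat>
void unifrac_stripe(TFloat *dm_stripe, const TFloat *emb, const TFloat *length,
                    unsigned int n_samples, unsigned int filled_embs,
                    unsigned int stripe) {
    #pragma acc parallel loop present(emb, dm_stripe, length)
    for (unsigned int k = 0; k < n_samples; k++) {
        TFloat my_stripe = dm_stripe[k];
        #pragma acc loop seq
        for (unsigned int e = 0; e < filled_embs; e++) {
            uint64_t offset = (uint64_t)n_samples * e;
            my_stripe += (emb[offset + k] - emb[offset + k + stripe + 1])
                         * length[e];
        }
        dm_stripe[k] = my_stripe;
    }
}

// fp64 for maximum precision; fp32 for consumer GPUs with slow fp64 units.
template void unifrac_stripe<double>(double *, const double *, const double *,
                                     unsigned int, unsigned int, unsigned int);
template void unifrac_stripe<float>(float *, const float *, const float *,
                                    unsigned int, unsigned int, unsigned int);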
                E5-2680 v4 CPU    GPU    GPU     GPU    GPU         GPU
                Original    New   V100  2080TI  1080TI  1080  Mobile 1050
fp64                 800    193     12      59      77    99          213
fp32                   -    190    9.5      19      31    36           64

Runtime computing UniFrac on the EMP dataset (single chip, in minutes)
Per chip         128x CPU E5-2680 v4   128x GPU  4x GPU  16x GPU  16x GPU
(in minutes)      Original        New      V100    V100   2080TI   1080TI
fp64                   415         97        14      29      184      252
fp32                     -         91        12      20       32       82

Aggregated       128x CPU E5-2680 v4   128x GPU  4x GPU  16x GPU  16x GPU
(in chip hours)   Original        New      V100    V100   2080TI   1080TI
fp64                   890        207        30     1.9       49       67
fp32                     -        194        26     1.3      8.5       22

Runtime computing UniFrac on a dataset containing 113,721 samples (using multiple chips)
Conclusions
Our work now allows many microbiome analyses which were
previously relegated to large compute clusters to be performed with
much lower resource requirements. Even the largest datasets currently
envisaged could be processed in reasonable time with just a handful of
server-class GPUs, while smaller but still sizable datasets like the
EMP can be processed even on GPU-enabled workstations.
We also explored the use of lower-precision floating point math to
effectively exploit consumer-grade GPUs, which are typical in
desktop and laptop setups. We conclude that fp32 math yields
nearly identical results to fp64 and should be adequate for the vast
majority of studies, making GPU-enabled personal devices, even
laptops, a sufficient resource for this otherwise rate-limiting step
for many researchers.
[Figure: an ordination plot of unweighted UniFrac distances over 113,721 samples]
The optimizations also helped the CPU: Xeon E5-2680 v4 (using all 14 cores), optimized code: 193 minutes (~3 hours)
Properly aligning the memory buffers and picking the right values for the grouping parameters can make a 5x difference in speed; a sketch of this kind of tuning follows below.
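As an illustration of what that tuning involves, here is a sketch of an aligned, padded allocation; the 64-byte alignment target and the step_size value are assumptions, since the best production values were found empirically.

#include <cstdint>
#include <cstdlib>

// Hypothetical tuning constants; the best values were found empirically.
constexpr unsigned int step_size = 16;  // assumed grouping parameter
constexpr size_t alignment = 64;        // cache-line friendly boundary

// Pad each stripe row so every row starts on an aligned boundary
// and divides evenly into step_size groups.
unsigned int padded_samples(unsigned int n_samples) {
    unsigned int per_row = (unsigned int)(alignment / sizeof(double)); // 8 doubles
    unsigned int mult = (step_size > per_row) ? step_size : per_row;
    return ((n_samples + mult - 1) / mult) * mult;
}

// Allocate the unified stripe buffer on an aligned boundary.
// Note: std::aligned_alloc requires the size to be a multiple of the
// alignment, which the padding above guarantees.
double *alloc_stripes(unsigned int n_stripes, unsigned int n_samples) {
    size_t bytes = (size_t)n_stripes * padded_samples(n_samples) * sizeof(double);
    return static_cast<double *>(std::aligned_alloc(alignment, bytes));
}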