A ScyllaDB Community
Aiding the CUDA Compiler for
Fun and Profit
Joe Rowell
Founding Engineer at poolside.ai
Joe Rowell (he/him)
Founding Engineer at poolside.ai
■ “Low level performance person”
■ Cryptography background
■ Love arcane hardware facts
“Joe, why don’t we just write
assembler?”
A brief bit of software
A CUDA GPU is inherently parallel from a software perspective.
■ The smallest logical unit is a thread.
■ Threads are grouped into blocks.
■ Blocks are grouped into grids.
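This hierarchy maps directly onto the kernel launch syntax. As a minimal sketch (the kernel name and sizes here are illustrative, not from the talk):

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; blocks of 256 threads tile the grid.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n) {
        data[i] += 1;
    }
}

int main() {
    const int n = 1 << 20;
    int* d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));
    // A grid of ceil(n / 256) blocks, each a block of 256 threads.
    add_one<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
}
```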
A brief bit of hardware
A CUDA GPU is inherently parallel from a hardware perspective.
■ Threads are scheduled in groups of 32 threads, called warps.
■ Warps execute on streaming multiprocessors, or SMs.
■ Multiple SMs make up your GPU.
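The warp structure is visible from device code with a little index arithmetic. A sketch (assuming a 1-D block; the names are illustrative):

```cuda
__device__ void warp_coords() {
    // warpSize is 32 on all current NVIDIA GPUs.
    unsigned lane = threadIdx.x % 32; // this thread's lane within its warp
    unsigned warp = threadIdx.x / 32; // which warp of the block it belongs to
    (void)lane;
    (void)warp;
}
```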
PTX assembler
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r3, %r4, %r5;
setp.ge.s32 %p1, %r1, %r2;
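This fragment is the compiler's rendering of the canonical global-index computation plus a bounds check. In CUDA source it plausibly looked like:

```cuda
__global__ void kernel(int* data, int n) {
    // mad.lo.s32: i = ntid.x * ctaid.x + tid.x
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // setp.ge.s32: the predicate guarding out-of-bounds threads
    if (i >= n) return;
    data[i] = i;
}
```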
SASS assembler
XMAD.MRG R3, R2.reuse, c[0x0][0x8].H1, RZ
XMAD R0, R2.reuse, c[0x0][0x8], R0
XMAD.PSL.CBCC R2, R2.H1, R3.H1, R0
Write code for warps, not for
threads
Warp uniform variables
template <typename T>
__device__ T warp_uniform(T value) {
struct {
union {
T value; // Assume sizeof(T) <= 8
struct {
uint32_t lowInt;
uint32_t highInt;
};
};
} p;
p.value = value;
p.lowInt = __shfl_sync(0xffffffff, (unsigned)p.lowInt, 0);
p.highInt = __shfl_sync(0xffffffff, (unsigned)p.highInt, 0);
return p.value;
}
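One way such a helper gets used (a sketch, not from the slides): broadcasting a pointer from lane 0 so that later addressing is provably identical across the warp, which lets the compiler emit uniform rather than per-thread address arithmetic.

```cuda
// Assumes the warp_uniform helper defined above is in scope.
__global__ void use_uniform(int* base) {
    // Broadcast the pointer from lane 0; every lane now holds a value the
    // compiler can treat as warp-uniform.
    int* p = warp_uniform(base);
    p[threadIdx.x] = threadIdx.x;
}
```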
PTX assembly for warp uniformity
.visible .func (.param .b64 func_retval0) int* warp_uniform<int*>(int*)(
.param .b64 int* warp_uniform<int*>(int*)_param_0
)
{
ld.param.v2.u32 {%r1, %r2}, [int* warp_uniform<int*>(int*)_param_0];
mov.u32 %r4, 0;
mov.u32 %r5, 31;
mov.u32 %r6, -1;
shfl.sync.idx.b32 %r8|%p1, %r1, %r4, %r5, %r6;
shfl.sync.idx.b32 %r9|%p2, %r2, %r4, %r5, %r6;
mov.b64 %rd1, {%r8, %r9};
st.param.b64 [func_retval0+0], %rd1;
ret;
}
SASS assembly for warp uniformity
Warp shuffle for computing the minimum
template<typename T>
__device__ T warp_min(T val) {
for (int offset = 16; offset > 0; offset /= 2) {
T tval = __shfl_down_sync(0xFFFFFFFF, val, offset);
if (tval < val) {
val = tval;
}
}
return val;
}
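Because `__shfl_down_sync` moves values toward lower lanes, lane 0 ends up holding the warp's minimum. A sketch of how this might be used at block level (illustrative; `out` is assumed pre-initialized to `INT_MAX`):

```cuda
// Assumes the warp_min helper defined above is in scope.
__global__ void block_min(const int* in, int* out) {
    int v = in[blockIdx.x * blockDim.x + threadIdx.x];
    v = warp_min(v); // after this, lane 0 of each warp holds the warp minimum
    if (threadIdx.x % 32 == 0) {
        atomicMin(out, v); // one atomic per warp instead of one per thread
    }
}
```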
Use registers where you can
Access costs
■ Registers are ~1 cycle
■ Shared memory or cache are ~5 cycles
■ Global memory is ~500 cycles
The compiler will never do this
for you.
Example of inefficient summation
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum(int* out, int* in) {
__shared__ int accum[BLOCK_SIZE];
accum[threadIdx.x] = 0;
__syncthreads();
for (int i = threadIdx.x; i < size; i += BLOCK_SIZE) {
accum[threadIdx.x] += in[i];
}
__syncthreads();
*out = *out + accum[threadIdx.x];
}
Example with registers
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum(int* out, int* in) {
int accum = 0;
for (int i = threadIdx.x; i < size; i += BLOCK_SIZE) {
accum += in[i];
}
*out = *out + accum;
}
Be aware of unfriendly access
patterns
CUDA memory accesses under the hood
■ Global memory accesses are issued as transactions of 32, 64, or 128 bytes.
■ For optimal performance, the addresses touched by a warp must be
contiguous, and ideally aligned to the transaction size.
Example of a poor access pattern
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum(int* out, int* in) {
int accum = 0;
int offset = BLOCK_SIZE * threadIdx.x;
for (int i = 0; i < BLOCK_SIZE; i++) {
accum += in[offset + i];
}
*out = *out + accum;
}
SASS for poor access pattern
LDG.E R28, [R2+-0x3c]
LDG.E R26, [R2+-0x38]
LDG.E R27, [R2+-0x34]
LDG.E R24, [R2+-0x30]
LDG.E R25, [R2+-0x2c]
LDG.E R20, [R2+-0x28]
LDG.E R21, [R2+-0x24]
LDG.E R4, [R2+-0x20]
LDG.E R5, [R2+-0x1c]
LDG.E R6, [R2+-0x18]
LDG.E R7, [R2+-0x14]
LDG.E R8, [R2+-0x10]
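The "good" SASS that follows presumably came from a coalesced rewrite along these lines (a sketch; note it is the same shape as the register-based sum shown earlier, with each thread striding by the block size so that the 32 threads of a warp touch 32 adjacent words on every iteration):

```cuda
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum_coalesced(int* out, int* in) {
    int accum = 0;
    // Adjacent threads read adjacent words: each warp's loads coalesce into
    // contiguous, aligned 128-byte transactions. Per thread, successive
    // loads are BLOCK_SIZE ints (0x1000 bytes) apart, matching the SASS.
    for (int i = threadIdx.x; i < size; i += BLOCK_SIZE) {
        accum += in[i];
    }
    *out = *out + accum;
}
```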
SASS for good access pattern
LDG.E R10, [R2+0x1000]
LDG.E R12, [R2+0x2000]
LDG.E R13, [R2+0x3000]
LDG.E R14, [R2+0x4000]
LDG.E R16, [R2+0x5000]
LDG.E R17, [R2+0x6000]
LDG.E R18, [R2+0x7000]
LDG.E R20, [R2+0x8000]
LDG.E R21, [R2+0x9000]
LDG.E R22, [R2+0xa000]
LDG.E R24, [R2+0xb000]
LDG.E R25, [R2+0xc000]
Thank you! Let’s connect.
Joe Rowell
joe@poolside.ai
