A ScyllaDB Community
Aiding the CUDA Compiler for
Fun and Profit
Joe Rowell
Founding Engineer at poolside.ai
Joe Rowell (he/him)
Founding Engineer at poolside.ai
■ “Low level performance person”
■ Cryptography background
■ Love arcane hardware facts
“Joe, why don’t we just write
assembler?”
A brief bit of software
A CUDA GPU is inherently parallel from a software perspective.
■ The smallest logical unit is a thread.
■ Threads are grouped into blocks.
■ Blocks are grouped into grids.
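This hierarchy maps directly onto the kernel launch syntax. As a minimal sketch (the kernel name and sizes here are illustrative, not from the talk):

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; blocks of 256 threads tile the grid.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n) {
        data[i] += 1;
    }
}

int main() {
    const int n = 1 << 20;
    int* d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));
    // A grid of ceil(n / 256) blocks, each a block of 256 threads.
    add_one<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
}
```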
A brief bit of hardware
A CUDA GPU is inherently parallel from a hardware perspective.
■ Threads are scheduled in groups of 32 threads, called warps.
■ Warps execute on streaming multiprocessors, or SMs.
■ Multiple SMs make up your GPU.
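The warp structure is visible from device code with a little index arithmetic. A sketch (assuming a 1-D block; the names are illustrative):

```cuda
__device__ void warp_coords() {
    // warpSize is 32 on all current NVIDIA GPUs.
    unsigned lane = threadIdx.x % 32; // this thread's lane within its warp
    unsigned warp = threadIdx.x / 32; // which warp of the block it belongs to
    (void)lane;
    (void)warp;
}
```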
PTX assembler
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r3, %r4, %r5;
setp.ge.s32 %p1, %r1, %r2;
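This fragment is the compiler's rendering of the canonical global-index computation plus a bounds check. In CUDA source it plausibly looked like:

```cuda
__global__ void kernel(int* data, int n) {
    // mad.lo.s32: i = ntid.x * ctaid.x + tid.x
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // setp.ge.s32: the predicate guarding out-of-bounds threads
    if (i >= n) return;
    data[i] = i;
}
```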
SASS assembler
XMAD.MRG R3, R2.reuse, c[0x0][0x8].H1, RZ
XMAD R0, R2.reuse, c[0x0][0x8], R0
XMAD.PSL.CBCC R2, R2.H1, R3.H1, R0
Write code for warps, not for
threads
Warp uniform variables
template <typename T>
__device__ T warp_uniform(T value) {
struct {
union {
T value; // Assume sizeof(T) <= 8
struct {
uint32_t lowInt;
uint32_t highInt;
};
};
} p;
p.value = value;
p.lowInt = __shfl_sync(0xffffffff, (unsigned)p.lowInt, 0);
p.highInt = __shfl_sync(0xffffffff, (unsigned)p.highInt, 0);
return p.value;
}
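One way such a helper gets used (a sketch, not from the slides): broadcasting a pointer from lane 0 so that later addressing is provably identical across the warp, which lets the compiler emit uniform rather than per-thread address arithmetic.

```cuda
// Assumes the warp_uniform helper defined above is in scope.
__global__ void use_uniform(int* base) {
    // Broadcast the pointer from lane 0; every lane now holds a value the
    // compiler can treat as warp-uniform.
    int* p = warp_uniform(base);
    p[threadIdx.x] = threadIdx.x;
}
```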
PTX assembly for warp uniformity
.visible .func (.param .b64 func_retval0) int* warp_uniform<int*>(int*)(
.param .b64 int* warp_uniform<int*>(int*)_param_0
)
{
ld.param.v2.u32 {%r1, %r2}, [int* warp_uniform<int*>(int*)_param_0];
mov.u32 %r4, 0;
mov.u32 %r5, 31;
mov.u32 %r6, -1;
shfl.sync.idx.b32 %r8|%p1, %r1, %r4, %r5, %r6;
shfl.sync.idx.b32 %r9|%p2, %r2, %r4, %r5, %r6;
mov.b64 %rd1, {%r8, %r9};
st.param.b64 [func_retval0+0], %rd1;
ret;
}
SASS assembly for warp uniformity
Warp shuffle for computing the minimum
template<typename T>
__device__ T warp_min(T val) {
for (int offset = 16; offset > 0; offset /= 2) {
T tval = __shfl_down_sync(0xFFFFFFFF, val, offset);
if (tval < val) {
val = tval;
}
}
return val;
}
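Because `__shfl_down_sync` moves values toward lower lanes, lane 0 ends up holding the warp's minimum. A sketch of how this might be used at block level (illustrative; `out` is assumed pre-initialized to `INT_MAX`):

```cuda
// Assumes the warp_min helper defined above is in scope.
__global__ void block_min(const int* in, int* out) {
    int v = in[blockIdx.x * blockDim.x + threadIdx.x];
    v = warp_min(v); // after this, lane 0 of each warp holds the warp minimum
    if (threadIdx.x % 32 == 0) {
        atomicMin(out, v); // one atomic per warp instead of one per thread
    }
}
```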
Use registers where you can
Access costs
■ Registers are ~1 cycle
■ Shared memory or cache are ~5 cycles
■ Global memory is ~500 cycles
The compiler will never do this
for you.
Example of inefficient summation
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum(int* out, int* in) {
__shared__ int accum[BLOCK_SIZE];
accum[threadIdx.x] = 0;
__syncthreads();
for (int i = threadIdx.x; i < size; i += BLOCK_SIZE) {
accum[threadIdx.x] += in[i];
}
__syncthreads();
*out = *out + accum[threadIdx.x];
}
Example with registers
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum(int* out, int* in) {
int accum = 0;
for (int i = threadIdx.x; i < size; i += BLOCK_SIZE) {
accum += in[i];
}
*out = *out + accum;
}
Be aware of unfriendly access
patterns
CUDA memory accesses under the hood
■ Global memory accesses are issued as transactions of 32, 64, or 128 bytes.
■ For optimal performance, the addresses touched by a warp must be
contiguous, and ideally aligned to the transaction size.
Example of a poor access pattern
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum(int* out, int* in) {
int accum = 0;
int offset = BLOCK_SIZE * threadIdx.x;
for (int i = 0; i < BLOCK_SIZE; i++) {
accum += in[offset + i];
}
*out = *out + accum;
}
SASS for poor access pattern
LDG.E R28, [R2+-0x3c]
LDG.E R26, [R2+-0x38]
LDG.E R27, [R2+-0x34]
LDG.E R24, [R2+-0x30]
LDG.E R25, [R2+-0x2c]
LDG.E R20, [R2+-0x28]
LDG.E R21, [R2+-0x24]
LDG.E R4, [R2+-0x20]
LDG.E R5, [R2+-0x1c]
LDG.E R6, [R2+-0x18]
LDG.E R7, [R2+-0x14]
LDG.E R8, [R2+-0x10]
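The "good" SASS that follows presumably came from a coalesced rewrite along these lines (a sketch; note it is the same shape as the register-based sum shown earlier, with each thread striding by the block size so that the 32 threads of a warp touch 32 adjacent words on every iteration):

```cuda
#define BLOCK_SIZE 1024
template<int size>
__device__ void sum_coalesced(int* out, int* in) {
    int accum = 0;
    // Adjacent threads read adjacent words: each warp's loads coalesce into
    // contiguous, aligned 128-byte transactions. Per thread, successive
    // loads are BLOCK_SIZE ints (0x1000 bytes) apart, matching the SASS.
    for (int i = threadIdx.x; i < size; i += BLOCK_SIZE) {
        accum += in[i];
    }
    *out = *out + accum;
}
```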
SASS for good access pattern
LDG.E R10, [R2+0x1000]
LDG.E R12, [R2+0x2000]
LDG.E R13, [R2+0x3000]
LDG.E R14, [R2+0x4000]
LDG.E R16, [R2+0x5000]
LDG.E R17, [R2+0x6000]
LDG.E R18, [R2+0x7000]
LDG.E R20, [R2+0x8000]
LDG.E R21, [R2+0x9000]
LDG.E R22, [R2+0xa000]
LDG.E R24, [R2+0xb000]
LDG.E R25, [R2+0xc000]
Thank you! Let’s connect.
Joe Rowell
joe@poolside.ai
