General Purpose Computing using Graphics Hardware
Hanspeter Pfister, Harvard University
Acknowledgements
Won-Ki Jeong, Harvard University
Kayvon Fatahalian, Stanford University
GPU (Graphics Processing Unit)
PC hardware dedicated to 3D graphics
Massively parallel SIMD processor
Performance pushed by the game industry
(Photo: NVIDIA SLI system)
GPGPU
General-purpose computation on the GPU
Started in the computer graphics research community
Maps computational problems onto the graphics rendering pipeline
(Images courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong)
Why GPU for computing?
The GPU is fast and massively parallel: CPU ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core) vs. GPU ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200)
High memory bandwidth
Programmable: NVIDIA CUDA, DirectX Compute Shader, OpenCL
High-precision floating point support: 64-bit floating point (IEEE 754)
Inexpensive desktop supercomputer: NVIDIA Tesla C1060, ~1 TFLOPS @ $1000
FLOPS (chart; image courtesy NVIDIA)
Memory Bandwidth (chart; image courtesy NVIDIA)
GPGPU Biomedical Examples
Level-set segmentation (Lefohn et al.)
CT/MRI reconstruction (Sumanaweera et al.)
Image registration (Strzodka et al.)
EM image processing (Jeong et al.)
Overview
1. GPU Architecture Overview
2. GPU Programming Overview: programming model, NVIDIA CUDA, OpenCL
3. Application Example: CUDA ITK
1. GPU Architecture Overview
Kayvon Fatahalian, Stanford University
What's in a GPU?
A heterogeneous chip multi-processor, highly tuned for graphics.
(Block diagram: input assembly, rasterizer, output blend, video decode, work distributor, texture units, and many compute cores; whether work distribution belongs in HW or SW is an open question.)
CPU-"style" cores
Fetch/decode, ALU (execute), and execution context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.
Slimming down
Idea #1: Remove the components that help a single instruction stream run fast, leaving just fetch/decode, an ALU, and an execution context.
Two cores (two threads in parallel): each core has its own fetch/decode, ALU, and execution context and runs its own copy of the diffuse shader.
Four cores (four threads in parallel).
Sixteen cores (sixteen threads in parallel): 16 cores = 16 simultaneous instruction streams.
Instruction stream sharing
But... many threads should be able to share an instruction stream, since they all run the same diffuse shader.
Recall: simple processing core
Fetch/decode, ALU (execute), execution context.
Add ALUs
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing with eight ALUs, eight contexts, and shared context data per core.
Modifying the code
Original compiled shader: processes one thread using scalar ops on scalar registers.

<diffuseShader>:
sample r0, v4, t0, s0
mul   r3, v0, cb0[0]
madd  r3, v1, cb0[1], r3
madd  r3, v2, cb0[2], r3
clmp  r3, r3, l(0.0), l(1.0)
mul   o0, r0, r3
mul   o1, r1, r3
mul   o2, r2, r3
mov   o3, l(1.0)
Modifying the code
New compiled shader: processes 8 threads using vector ops on vector registers.

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul    vec_r3, vec_v0, cb0[0]
VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul    vec_o0, vec_r0, vec_r3
VEC8_mul    vec_o1, vec_r1, vec_r3
VEC8_mul    vec_o2, vec_r2, vec_r3
VEC8_mov    vec_o3, l(1.0)
Modifying the code
The eight fragments are mapped onto the eight ALUs, and part of the context (e.g., the registers) is replicated per fragment.
128 threads in parallel
16 cores = 128 ALUs = 16 simultaneous instruction streams.
But what about branches?
When the eight threads of a group reach a data-dependent branch such as

if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}

some lanes take the true path (T) and others the false path (F). The hardware executes both paths over time, masking off the lanes that did not take the current path, then resumes the unconditional shader code. Not all ALUs do useful work; worst case: 1/8 performance.
Clarification
SIMD processing does not imply SIMD instructions.
Option 1: explicit vector instructions (Intel/AMD x86 SSE, Intel Larrabee).
Option 2: scalar instructions with implicit HW vectorization; the hardware determines how the instruction stream is shared across ALUs, and the amount of sharing is hidden from software (NVIDIA GeForce "SIMT" warps, ATI Radeon architectures).
In practice: 16 to 64 threads share an instruction stream.
Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
Texture access latency is hundreds to thousands of cycles.
We have removed the fancy caches and logic that help avoid stalls.
But we have LOTS of independent threads.
Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.
Hiding stalls
One group of threads (1-8) runs until it stalls, e.g. on a texture fetch. With four groups of contexts on the core (threads 1-8, 9-16, 17-24, 25-32), the core switches to another runnable group whenever the current one stalls; by the time it cycles back, the stalled group's data has arrived and it is runnable again.
Throughput! We increase the run time of any one group in order to maximize the throughput of many groups.
Storing contexts
A pool of on-chip context storage (e.g., 32 KB) is partitioned among the thread groups.
Twenty small contexts: maximal latency-hiding ability.
Twelve medium contexts.
Four large contexts: low latency-hiding ability.
GPU block diagram key
= single "physical" instruction stream fetch/decode (functional unit control)
= SIMD programmable functional unit (FU), control shared with other functional units; may contain multiple 32-bit "ALUs"
= 32-bit mul-add unit
= 32-bit multiply unit
= execution context storage
= fixed-function unit
Example: NVIDIA GeForce GTX 280
NVIDIA-speak: 240 stream processors, "SIMT execution" (automatic HW-managed sharing of the instruction stream).
Generic speak: 30 processing cores, 8 SIMD functional units per core, 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock). Best case: 240 mul-adds + 240 muls per clock at 1.3 GHz, i.e. 30 * 8 * (2 + 1) * 1.3 GHz ≈ 933 GFLOPS.
Mapping data parallelism to the chip: the instruction stream is shared across 32 threads; 8 threads run on the 8 SIMD functional units in one clock.
GTX 280 core (block diagram: texture units, Zcull/clip/rast, output blend, work distributor, and the array of cores)
Example: ATI Radeon 4870
AMD/ATI-speak: 800 stream processors, automatic HW-managed sharing of a scalar instruction stream (like "SIMT").
Generic speak: 10 processing cores, 16 SIMD functional units per core, 5 mul-adds per functional unit (5 * 2 = 10 flops/clock). Best case: 800 mul-adds per clock at 750 MHz, i.e. 10 * 16 * 5 * 2 * 0.75 GHz = 1.2 TFLOPS.
Mapping data parallelism to the chip: the instruction stream is shared across 64 threads; 16 threads run on the 16 SIMD functional units in one clock.
ATI Radeon 4870 core (block diagram: texture units, Zcull/clip/rast, output blend, work distributor, and the array of cores)
Summary: three key ideas
1. Use many "slimmed down" cores running in parallel.
2. Pack cores full of ALUs by sharing an instruction stream across groups of threads. Option 1: explicit SIMD vector instructions. Option 2: implicit sharing managed by hardware.
3. Avoid latency stalls by interleaving the execution of many groups of threads; when one group stalls, work on another group.
2. GPU Programming Models
Programming model, NVIDIA CUDA, OpenCL
Task parallelism
Distribute tasks across processors based on their dependencies; coarse-grain parallelism.
(Diagram: a task dependency graph of tasks 1-9 and their assignment across three processors P1-P3 over time.)
Data parallelism
Run a single kernel over many elements; each element is updated independently and the same operation is applied to each element. Fine-grain parallelism: many lightweight threads with cheap context switching. Maps well to an ALU-heavy architecture: the GPU.
(Diagram: one kernel applied to data elements across processors P1 ... Pn.)
GPU-friendly Problems
Data-parallel processing.
High arithmetic intensity: keep the GPU busy all the time; computation offsets memory latency.
Coherent data access: access large chunks of contiguous memory and exploit fast on-chip shared memory.
The Algorithm Matters
Jacobi: parallelizable, since each new value depends only on the previous iteration:

for(int i=0; i<num; i++)
{
    v_new[i] = (v_old[i-1] + v_old[i+1])/2.0;
}

Gauss-Seidel: difficult to parallelize, since each update depends on values already updated in the same sweep:

for(int i=0; i<num; i++)
{
    v[i] = (v[i-1] + v[i+1])/2.0;
}
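To make the "parallelizable" claim concrete, here is a minimal CUDA sketch of one Jacobi sweep (names such as jacobiStep, v_old, and v_new are illustrative, not from the slides): every interior element of v_new depends only on the previous iterate v_old, so each thread can update one element independently.

__global__ void jacobiStep(const float* v_old, float* v_new, int num)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < num - 1)                          // skip the boundary elements
        v_new[i] = (v_old[i - 1] + v_old[i + 1]) * 0.5f;
}

The host would launch this kernel once per iteration and swap the v_old / v_new pointers between launches. Gauss-Seidel has no such formulation, because v[i] depends on v[i-1] from the same sweep.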
Example: Reduction
Serial version, O(N):

for(int i=1; i<N; i++)
{
    v[0] += v[i];
}

Parallel version, O(log N):

width = N/2;
while(width >= 1)
{
    for(int i=0; i<width; i++)
    {
        v[i] += v[i+width]; // computed in parallel
    }
    width /= 2;
}
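A minimal CUDA sketch of the parallel version above, assuming a single block, blockDim.x a power of two, and N no larger than the block size (kernel and variable names are illustrative): each step halves the active width, exactly as in the pseudocode.

__global__ void reduceSum(const float* in, float* out, int N)
{
    extern __shared__ float sdata[];                 // one float per thread
    unsigned int tid = threadIdx.x;
    sdata[tid] = (tid < N) ? in[tid] : 0.0f;
    __syncthreads();

    // halve the active width each step
    for (unsigned int width = blockDim.x / 2; width > 0; width >>= 1)
    {
        if (tid < width)
            sdata[tid] += sdata[tid + width];
        __syncthreads();
    }
    if (tid == 0)
        out[0] = sdata[0];                           // final sum
}

Launched as reduceSum<<<1, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, N); larger arrays would use one block per chunk and a second pass over the per-block partial sums.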
GPU programming languages
Using graphics APIs: GLSL, Cg, HLSL.
Computing-specific APIs: DX11 Compute Shaders, NVIDIA CUDA, OpenCL.
NVIDIA CUDA
C-extension programming language, no graphics API, supports debugging tools.
Extensions / API:
Function type qualifiers: __global__, __device__, __host__
Variable type qualifiers: __shared__, __constant__
Low-level functions: cudaMalloc(), cudaFree(), cudaMemcpy(), ..., __syncthreads(), atomicAdd(), ...
Program types: a device program (kernel) runs on the GPU; a host program runs on the CPU and launches device programs.
A minimal sketch of these qualifiers follows.
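A hedged sketch of how the listed qualifiers fit together; the kernel, function, and variable names are illustrative, not CUDA API names.

__constant__ float scale;                       // constant memory, written from the host

__device__ float clampUnit(float x)             // callable from device code only
{
    return fminf(fmaxf(x, 0.0f), 1.0f);
}

__global__ void scaleKernel(float* data, int n) // kernel, launched from the host
{
    __shared__ float tile[256];                 // shared by the threads of one block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        tile[threadIdx.x] = data[idx] * scale;  // stage the value in shared memory
        data[idx] = clampUnit(tile[threadIdx.x]);
    }
}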
CUDA Programming Model
Kernel: a GPU program that runs on a grid of threads.
Thread hierarchy: a grid is a set of blocks; a block is a set of threads; grid size * block size = total number of threads.
(Diagram: kernel code mapped onto Block 1 ... Block n of a grid, each containing many threads.)
A sketch of a matching launch configuration follows.
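A hedged sketch of the grid/block arithmetic for a 1D problem; myKernel, d_data, and N are placeholders.

int threadsPerBlock = 256;
int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceil(N / 256)
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, N);
// total threads launched = blocksPerGrid * threadsPerBlock >= N,
// so the kernel should guard its work with: if (idx < N) ...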
CUDA Memory Structure
Memory hierarchy: PC memory (DRAM) is off-card; GPU global memory (DRAM) is off-chip but on-card; shared memory, registers, and caches are on-chip.
The host can read/write global memory; threads communicate through shared memory.
(Diagram: relative communication costs between PC memory, GPU global memory, on-chip shared memory, and the ALUs.)
Synchronization
Threads in the same block can communicate through shared memory.
There is no HW global synchronization function yet.
__syncthreads(): barrier for threads within the current block only.
__threadfence(): flushes global memory writes to make them visible to all threads.
A minimal sketch of block-local communication follows.
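This sketch assumes a fixed block size of 256 and illustrative names: each thread publishes one value to shared memory, the barrier makes all of the block's writes visible, and each thread then reads its neighbor's value.

__global__ void shiftWithinBlock(const float* in, float* out)
{
    __shared__ float buf[256];                        // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[idx];
    __syncthreads();                                  // all writes in the block are now visible
    int next = (threadIdx.x + 1) % blockDim.x;
    out[idx] = buf[next];                             // safe: the barrier ordered the writes
}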
Example: CPU Vector Addition

// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for(int i=0; i<num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}
Example: CUDA Vector Addition

// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
Example: CUDA Host Code

float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// ... initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
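The slide stops at the kernel launch; a hedged continuation of the same pattern copies the result back and releases the buffers (h_C is an extra host buffer introduced here for illustration, and error checking is omitted):

float* h_C = (float*) malloc(N * sizeof(float));

// copy the result back to the host (in the default stream this also waits for the kernel to finish)
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// release device and host memory
cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);
free(h_A);      free(h_B);      free(h_C);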
OpenCL (Open Computing Language)
First industry standard for a computing language; based on the C language; platform independent (NVIDIA, ATI, Intel, ...).
Data- and task-parallel compute model that uses all computational resources in the system (CPU, GPU, ...).
Work-item: same concept as a thread / fragment / etc.
Work-group: a group of work-items; work-items in the same work-group can communicate; multiple work-groups execute in parallel.
OpenCL program structure
Host program (CPU):
Platform layer: query compute devices, create a context.
Runtime: create memory objects, compile and create kernel program objects, issue commands (e.g., kernel launches) to a command queue, synchronize commands, clean up OpenCL resources.
Kernel (CPU, GPU): C-like code with some extensions; runs on the compute device.
CUDA vs. OpenCL comparison
Conceptually almost identical: work-item == thread, work-group == block.
Similar memory model: global, local, shared memory; kernel and host program.
CUDA is highly optimized only for NVIDIA GPUs; OpenCL can be used widely across GPUs and CPUs.
Implementation status of OpenCL
Specification 1.0 released by Khronos.
NVIDIA released a Beta 1.2 driver and SDK, available to registered GPU computing developers.
Apple will include OpenCL in Mac OS X Snow Leopard (Q3 2009): NVIDIA and ATI GPUs, Intel CPUs for Mac.
More companies will join.
GPU optimization tips: configuration
Identify the bottleneck: compute- or bandwidth-bound (use the profiler); focus on the most expensive but parallelizable parts (Amdahl's law).
Maximize parallel execution: use large inputs (many threads); avoid divergent execution.
Use limited resources efficiently: minimize shared memory and register use.
GPU optimization tips: memory
Memory access is the most important optimization.
Minimize device-to-host memory overhead; overlap kernels with memory copies (asynchronous copy).
Avoid shared memory bank conflicts.
Use coalesced global memory access.
Texture or constant memory can help (cached).
(Diagram: relative communication costs between PC memory, GPU global memory, on-chip shared memory, and the ALUs.)
A minimal sketch of a coalesced access pattern follows.
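A minimal sketch of the coalescing point above (kernel name is illustrative): consecutive threads touch consecutive addresses, so a warp's loads combine into a few wide memory transactions.

__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // thread k reads element k
    if (idx < n)
        out[idx] = in[idx];                           // coalesced read and write
}

By contrast, indexing as in[idx * stride] with a large stride scatters each warp's accesses across memory and forces many separate transactions.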
GPU optimization tips: instructions
Use less expensive operators (division: 32 cycles, multiplication: 4 cycles; multiply by 0.5 instead of dividing by 2.0).
Atomic operators are expensive and can introduce race conditions.
Double precision is much slower than float.
Use less accurate floating point instructions when possible: __sinf(), __expf(), __powf().
Avoid unnecessary instructions, e.g. through loop unrolling.
A short sketch of these tips follows.
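A short, hedged sketch of the instruction-level tips above (function names are illustrative; #pragma unroll asks nvcc to unroll a fixed-trip-count loop):

__global__ void scaleByHalf(float* v, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        v[idx] *= 0.5f;          // multiply by 0.5f instead of dividing by 2.0f
}

__device__ float sum4(const float* a)
{
    float s = 0.0f;
#pragma unroll                   // unrolled: no loop counter or branch at run time
    for (int i = 0; i < 4; ++i)
        s += a[i];
    return s;
}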
3. Application Example
CUDA ITK
ITK image filters implemented using CUDA
Convolution filters: mean filter, Gaussian filter, derivative filter, Hessian of Gaussian filter.
Statistical filter: median filter.
PDE-based filter: anisotropic diffusion filter.
CUDA ITK
CUDA code is integrated into ITK and is transparent to ITK users: no need to modify existing code that uses the ITK library.
The environment variable ITK_CUDA is checked at the entry point, GenerateData() or ThreadedGenerateData().
If ITK_CUDA == 0, execute the original ITK code; if ITK_CUDA == 1, execute the CUDA code. A sketch of this dispatch is shown below.
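A hedged sketch of that dispatch, not the actual CUDA ITK source; the helper name and the exact fallback are assumptions made for illustration.

#include <cstdlib>

bool UseCudaPath()
{
  const char* flag = std::getenv("ITK_CUDA");
  return (flag != 0 && std::atoi(flag) == 1);
}

// Inside a filter's GenerateData() / ThreadedGenerateData():
//   if (UseCudaPath())  run the CUDA implementation of the filter;
//   else                run the original ITK CPU code path.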
Convolution filters
Weighted sum of neighbors: for a size-n filter, each pixel is reused n times.
Non-separable filter (anisotropic): reuse data through shared memory.
Separable filter (Gaussian): an N-dimensional convolution becomes N 1D convolutions.
Naïve C/CUDA implementation: read from the input image whenever needed.

int xdim, ydim;  // size of input image
float *in, *out; // input/output image of size xdim*ydim
float w[][];     // convolution kernel of size n*m

for(x=0; x<xdim; x++)
{
   for(y=0; y<ydim; y++)
   {
      // compute convolution
      for(sx=x-n/2; sx<=x+n/2; sx++)
      {
         for(sy=y-m/2; sy<=y+m/2; sy++)
         {
            wx = sx - x + n/2;
            wy = sy - y + m/2;
            out[x][y] += w[wx][wy]*in[sx][sy];
         }
      }
   }
}

For xdim*ydim pixels and an n*m kernel, this loads from global memory n*m times per pixel.
Improved CUDA convolution filter: for a size n*m filter each pixel is reused n*m times, so save n*m-1 global memory loads by staging the tile in shared memory.

__global__ void cudaConvolutionFilter2DKernel(in, out, w)
{
  // copy global to shared memory (slow load from global memory, only once)
  sharedmem[] = in[][];
  __syncthreads();

  // sum neighbor pixel values (fast loads from shared memory, n*m times)
  float _sum = 0;
  for(uint j=threadIdx.y; j<=threadIdx.y + m; j++)
  {
    for(uint i=threadIdx.x; i<=threadIdx.x + n; i++)
    {
      wx = i - threadIdx.x;
      wy = j - threadIdx.y;
      _sum += w[wx][wy]*sharedmem[j*sharedmemdim.x + i];
    }
  }
}
CUDA Gaussian filter
Apply a 1D convolution filter along each axis, using temporary buffers (ping-pong rendering).

// temp[0], temp[1] : temporary buffers to store intermediate results
void cudaDiscreteGaussianImageFilter(in, out, stddev)
{
   // create Gaussian weight
   w = ComputeGaussKernel(stddev);
   temp[0] = in;

   // call the 1D convolution CUDA kernel with the Gaussian weight along each axis
   dim3 G, B;
   for(i=0; i<dimension; i++)
   {
      cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
   }
   out = temp[i%2];
}
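The slide calls cudaConvolutionFilter1DKernel without showing it; a hypothetical sketch of such a kernel for one row of length len follows (the signature, the radius parameter, and the clamped border handling are assumptions, not the actual CUDA ITK code).

__global__ void convolutionFilter1DKernel(const float* in, float* out,
                                          const float* w, int len, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= len) return;

    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k)
    {
        int s = min(max(x + k, 0), len - 1);   // clamp at the image borders
        sum += w[k + radius] * in[s];
    }
    out[x] = sum;
}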
Median filter (Viola et al. [VIS 03])
Find the median by bisection of the histogram bins: log(# bins) iterations; for 8-bit pixels, log(256) = 8 iterations.
(Diagram: successive bisection steps on an 8-bin histogram of intensities 0-7.)
1. Copy the current block from global to shared memory.
2. Bisect the intensity range, counting how many kernel pixels lie above the pivot:

min = 0; max = 255;
pivot = (min+max)/2.0f;
for(i=0; i<8; i++)
{
   count = 0;
   for(j=0; j<kernelsize; j++)
   {
      if(kernel[j] > pivot) count++;
   }
   if(count < kernelsize/2) max = floor(pivot);
   else min = ceil(pivot);
   pivot = (min + max)/2.0f;
}
return floor(pivot);
Perona & Malik anisotropic diffusion
Nonlinear diffusion: adaptive smoothing based on the magnitude of the gradient, preserving edges (high gradient).
Numerical solution: Euler explicit integration (iterative method), finite differences for the derivative computation.
(Images: input image, linear diffusion, P & M diffusion.)
Performance (approximate speedups)
Convolution filters: mean filter ~140x, Gaussian filter ~60x; derivative filter; Hessian of Gaussian filter.
Statistical filter: median filter ~25x.
PDE-based filter: anisotropic diffusion filter ~70x.
CUDA ITK
Source code available at http://sourceforge.net/projects/cudaitk/

CUDA ITK Future Work
ITK GPU image class: reduce CPU-to-GPU memory I/O, pipelining support.
Native interface for GPU code: similar to ThreadedGenerateData(), but for GPU threads.
Numerical library (vnl).
Out-of-GPU-core / GPU-cluster: processing large images (10-100 terabytes).
GPU platform-independent implementation: OpenCL could be a solution.
Conclusions
GPU computing delivers high performance; many scientific computing problems are parallelizable.
More consistency/stability in HW/SW: the main GPU architecture is mature, an industry-wide programming standard now exists (OpenCL), and better support/tools are available (C-based language, compiler, and debugger).
Issues: not every problem is suitable for GPUs; re-engineering of algorithms/software is required; future performance growth of GPU hardware is unclear (Intel's Larrabee).
thrust
thrust: a library of data-parallel algorithms and data structures with an interface similar to the C++ Standard Template Library, for CUDA.
C++ template metaprogramming automatically chooses the fastest code path at compile time.
thrust::sort

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
  // generate random data on the host
  thrust::host_vector<int> h_vec(1000000);
  thrust::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer to device and sort
  thrust::device_vector<int> d_vec = h_vec;

  // sort 140M 32b keys/sec on GT200
  thrust::sort(d_vec.begin(), d_vec.end());
  return 0;
}

http://thrust.googlecode.com

Editor's Notes

  • #5: Fluid flow, level set segmentation, DTI image
  • #12: One of the major debates you'll see in graphics in the coming years is whether the scheduling and work distribution logic should be provided as highly optimized hardware or be implemented as a software program on the programmable cores.
  • #20: Pack the core full of ALUs. We are not going to increase our core's ability to decode instructions; we will decode 1 instruction and execute it on all 8 ALUs.
  • #21: How can we make use of all these ALUs?
  • #22: Just have the shader program work on 8 fragments at a time. Replace the scalar operations with 8-wide vector ones.
  • #23: So the program processes 8 fragments at a time, and all the work for each fragment is carried out by 1 of the 8 ALUs. Notice that I've also replicated part of the context to store execution state for the 8 fragments. For example, I'd replicate the registers.
  • #36: We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core's ALUs never go idle.
  • #38: Described adding contexts. In reality there's a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using on-chip storage as a traditional data cache, GPUs choose to use this store to hold contexts.
  • #39: Shading performance relies on large-scale interleaving. Number of interleaved groups per core: ~20-30. Could be separate hardware-managed contexts or software-managed using techniques.
  • #40: Fewer contexts fit on chip, so the chip can hide less latency and stalls are more likely.
  • #41: Lose performance when shaders use a lot of registers.
  • #43: 128 simultaneous threads on each core.
  • #47: Drive these ALUs using explicit SIMD instructions or implicitly via HW-determined sharing.
  • #57: Numbers are relative cost of communication
  • #74: Runs on each thread – is parallel
  • #75: G = grid size, B = block size