General Purpose Computing using Graphics Hardware
Hanspeter Pfister, Harvard University
Acknowledgements
Won-Ki Jeong, Harvard University
Kayvon Fatahalian, Stanford University
GPU (Graphics Processing Unit)
PC hardware dedicated to 3D graphics
Massively parallel SIMD processor
Performance pushed by the game industry
(Photo: NVIDIA SLI system)
GPGPU
General-purpose computation on the GPU
Started in the computer graphics research community
Maps computational problems onto the graphics rendering pipeline
(Images courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong)
Why GPU for computing?
The GPU is fast and massively parallel: CPU ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core) vs. GPU ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200)
High memory bandwidth
Programmable: NVIDIA CUDA, DirectX Compute Shader, OpenCL
High-precision floating point support: 64-bit floating point (IEEE 754)
Inexpensive desktop supercomputer: NVIDIA Tesla C1060, ~1 TFLOPS @ $1000
FLOPS (chart; image courtesy NVIDIA)
Memory Bandwidth (chart; image courtesy NVIDIA)
GPGPU Biomedical Examples
Level-set segmentation (Lefohn et al.)
CT/MRI reconstruction (Sumanaweera et al.)
Image registration (Strzodka et al.)
EM image processing (Jeong et al.)
Overview
1. GPU Architecture Overview
2. GPU Programming Overview: programming model, NVIDIA CUDA, OpenCL
3. Application Example: CUDA ITK
1. GPU Architecture Overview
Kayvon Fatahalian, Stanford University
What's in a GPU?
A heterogeneous chip multi-processor, highly tuned for graphics.
(Block diagram: input assembly, rasterizer, output blend, video decode, work distributor, texture units, and many compute cores; whether work distribution belongs in HW or SW is an open question.)
CPU-"style" cores
Fetch/decode, ALU (execute), and execution context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.
Slimming down
Idea #1: Remove the components that help a single instruction stream run fast, leaving just fetch/decode, an ALU, and an execution context.
Two cores (two threads in parallel): each core has its own fetch/decode, ALU, and execution context and runs its own copy of the diffuse shader.
Four cores (four threads in parallel).
Sixteen cores (sixteen threads in parallel): 16 cores = 16 simultaneous instruction streams.
Instruction stream sharing
But... many threads should be able to share an instruction stream, since they all run the same diffuse shader.
Recall: simple processing core
Fetch/decode, ALU (execute), execution context.
Add ALUs
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing with eight ALUs, eight contexts, and shared context data per core.
Modifying the code
Original compiled shader: processes one thread using scalar ops on scalar registers.

<diffuseShader>:
sample r0, v4, t0, s0
mul   r3, v0, cb0[0]
madd  r3, v1, cb0[1], r3
madd  r3, v2, cb0[2], r3
clmp  r3, r3, l(0.0), l(1.0)
mul   o0, r0, r3
mul   o1, r1, r3
mul   o2, r2, r3
mov   o3, l(1.0)
Modifying the code
New compiled shader: processes 8 threads using vector ops on vector registers.

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul    vec_r3, vec_v0, cb0[0]
VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul    vec_o0, vec_r0, vec_r3
VEC8_mul    vec_o1, vec_r1, vec_r3
VEC8_mul    vec_o2, vec_r2, vec_r3
VEC8_mov    vec_o3, l(1.0)
Modifying the code
The eight fragments are mapped onto the eight ALUs, and part of the context (e.g., the registers) is replicated per fragment.
128 threads in parallel
16 cores = 128 ALUs = 16 simultaneous instruction streams.
But what about branches?
When the eight threads of a group reach a data-dependent branch such as

if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}

some lanes take the true path (T) and others the false path (F). The hardware executes both paths over time, masking off the lanes that did not take the current path, then resumes the unconditional shader code. Not all ALUs do useful work; worst case: 1/8 performance.
Clarification
SIMD processing does not imply SIMD instructions.
Option 1: explicit vector instructions (Intel/AMD x86 SSE, Intel Larrabee).
Option 2: scalar instructions with implicit HW vectorization; the hardware determines how the instruction stream is shared across ALUs, and the amount of sharing is hidden from software (NVIDIA GeForce "SIMT" warps, ATI Radeon architectures).
In practice: 16 to 64 threads share an instruction stream.
Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
Texture access latency is hundreds to thousands of cycles.
We have removed the fancy caches and logic that help avoid stalls.
But we have LOTS of independent threads.
Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.
Hiding stalls
One group of threads (1-8) runs until it stalls, e.g. on a texture fetch. With four groups of contexts on the core (threads 1-8, 9-16, 17-24, 25-32), the core switches to another runnable group whenever the current one stalls; by the time it cycles back, the stalled group's data has arrived and it is runnable again.
Throughput! We increase the run time of any one group in order to maximize the throughput of many groups.
Storing contexts
A pool of on-chip context storage (e.g., 32 KB) is partitioned among the thread groups.
Twenty small contexts: maximal latency-hiding ability.
Twelve medium contexts.
Four large contexts: low latency-hiding ability.
GPU block diagram key
= single "physical" instruction stream fetch/decode (functional unit control)
= SIMD programmable functional unit (FU), control shared with other functional units; may contain multiple 32-bit "ALUs"
= 32-bit mul-add unit
= 32-bit multiply unit
= execution context storage
= fixed-function unit
Example: NVIDIA GeForce GTX 280
NVIDIA-speak: 240 stream processors, "SIMT execution" (automatic HW-managed sharing of the instruction stream).
Generic speak: 30 processing cores, 8 SIMD functional units per core, 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock). Best case: 240 mul-adds + 240 muls per clock at 1.3 GHz, i.e. 30 * 8 * (2 + 1) * 1.3 GHz ≈ 933 GFLOPS.
Mapping data parallelism to the chip: the instruction stream is shared across 32 threads; 8 threads run on the 8 SIMD functional units in one clock.
GTX 280 core (block diagram: texture units, Zcull/clip/rast, output blend, work distributor, and the array of cores)
Example: ATI Radeon 4870
AMD/ATI-speak: 800 stream processors, automatic HW-managed sharing of a scalar instruction stream (like "SIMT").
Generic speak: 10 processing cores, 16 SIMD functional units per core, 5 mul-adds per functional unit (5 * 2 = 10 flops/clock). Best case: 800 mul-adds per clock at 750 MHz, i.e. 10 * 16 * 5 * 2 * 0.75 GHz = 1.2 TFLOPS.
Mapping data parallelism to the chip: the instruction stream is shared across 64 threads; 16 threads run on the 16 SIMD functional units in one clock.
ATI Radeon 4870 core (block diagram: texture units, Zcull/clip/rast, output blend, work distributor, and the array of cores)
Summary: three key ideas
1. Use many "slimmed down" cores running in parallel.
2. Pack cores full of ALUs by sharing an instruction stream across groups of threads. Option 1: explicit SIMD vector instructions. Option 2: implicit sharing managed by hardware.
3. Avoid latency stalls by interleaving the execution of many groups of threads; when one group stalls, work on another group.
2. GPU Programming Models
Programming model, NVIDIA CUDA, OpenCL
Task parallelism
Distribute tasks across processors based on their dependencies; coarse-grain parallelism.
(Diagram: a task dependency graph of tasks 1-9 and their assignment across three processors P1-P3 over time.)
Data parallelism
Run a single kernel over many elements; each element is updated independently and the same operation is applied to each element. Fine-grain parallelism: many lightweight threads with cheap context switching. Maps well to an ALU-heavy architecture: the GPU.
(Diagram: one kernel applied to data elements across processors P1 ... Pn.)
GPU-friendly Problems
Data-parallel processing.
High arithmetic intensity: keep the GPU busy all the time; computation offsets memory latency.
Coherent data access: access large chunks of contiguous memory and exploit fast on-chip shared memory.
The Algorithm Matters
Jacobi: parallelizable, since each new value depends only on the previous iteration:

for(int i=0; i<num; i++)
{
    v_new[i] = (v_old[i-1] + v_old[i+1])/2.0;
}

Gauss-Seidel: difficult to parallelize, since each update depends on values already updated in the same sweep:

for(int i=0; i<num; i++)
{
    v[i] = (v[i-1] + v[i+1])/2.0;
}
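To make the "parallelizable" claim concrete, here is a minimal CUDA sketch of one Jacobi sweep (names such as jacobiStep, v_old, and v_new are illustrative, not from the slides): every interior element of v_new depends only on the previous iterate v_old, so each thread can update one element independently.

__global__ void jacobiStep(const float* v_old, float* v_new, int num)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < num - 1)                          // skip the boundary elements
        v_new[i] = (v_old[i - 1] + v_old[i + 1]) * 0.5f;
}

The host would launch this kernel once per iteration and swap the v_old / v_new pointers between launches. Gauss-Seidel has no such formulation, because v[i] depends on v[i-1] from the same sweep.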
Example: Reduction
Serial version, O(N):

for(int i=1; i<N; i++)
{
    v[0] += v[i];
}

Parallel version, O(log N):

width = N/2;
while(width >= 1)
{
    for(int i=0; i<width; i++)
    {
        v[i] += v[i+width]; // computed in parallel
    }
    width /= 2;
}
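A minimal CUDA sketch of the parallel version above, assuming a single block, blockDim.x a power of two, and N no larger than the block size (kernel and variable names are illustrative): each step halves the active width, exactly as in the pseudocode.

__global__ void reduceSum(const float* in, float* out, int N)
{
    extern __shared__ float sdata[];                 // one float per thread
    unsigned int tid = threadIdx.x;
    sdata[tid] = (tid < N) ? in[tid] : 0.0f;
    __syncthreads();

    // halve the active width each step
    for (unsigned int width = blockDim.x / 2; width > 0; width >>= 1)
    {
        if (tid < width)
            sdata[tid] += sdata[tid + width];
        __syncthreads();
    }
    if (tid == 0)
        out[0] = sdata[0];                           // final sum
}

Launched as reduceSum<<<1, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, N); larger arrays would use one block per chunk and a second pass over the per-block partial sums.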
GPU programming languages
Using graphics APIs: GLSL, Cg, HLSL.
Computing-specific APIs: DX11 Compute Shaders, NVIDIA CUDA, OpenCL.
NVIDIA CUDA
C-extension programming language, no graphics API, supports debugging tools.
Extensions / API:
Function type qualifiers: __global__, __device__, __host__
Variable type qualifiers: __shared__, __constant__
Low-level functions: cudaMalloc(), cudaFree(), cudaMemcpy(), ..., __syncthreads(), atomicAdd(), ...
Program types: a device program (kernel) runs on the GPU; a host program runs on the CPU and launches device programs.
A minimal sketch of these qualifiers follows.
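A hedged sketch of how the listed qualifiers fit together; the kernel, function, and variable names are illustrative, not CUDA API names.

__constant__ float scale;                       // constant memory, written from the host

__device__ float clampUnit(float x)             // callable from device code only
{
    return fminf(fmaxf(x, 0.0f), 1.0f);
}

__global__ void scaleKernel(float* data, int n) // kernel, launched from the host
{
    __shared__ float tile[256];                 // shared by the threads of one block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        tile[threadIdx.x] = data[idx] * scale;  // stage the value in shared memory
        data[idx] = clampUnit(tile[threadIdx.x]);
    }
}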
CUDA Programming Model
Kernel: a GPU program that runs on a grid of threads.
Thread hierarchy: a grid is a set of blocks; a block is a set of threads; grid size * block size = total number of threads.
(Diagram: kernel code mapped onto Block 1 ... Block n of a grid, each containing many threads.)
A sketch of a matching launch configuration follows.
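A hedged sketch of the grid/block arithmetic for a 1D problem; myKernel, d_data, and N are placeholders.

int threadsPerBlock = 256;
int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceil(N / 256)
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, N);
// total threads launched = blocksPerGrid * threadsPerBlock >= N,
// so the kernel should guard its work with: if (idx < N) ...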
CUDA Memory Structure
Memory hierarchy: PC memory (DRAM) is off-card; GPU global memory (DRAM) is off-chip but on-card; shared memory, registers, and caches are on-chip.
The host can read/write global memory; threads communicate through shared memory.
(Diagram: relative communication costs between PC memory, GPU global memory, on-chip shared memory, and the ALUs.)
Synchronization
Threads in the same block can communicate through shared memory.
There is no HW global synchronization function yet.
__syncthreads(): barrier for threads within the current block only.
__threadfence(): flushes global memory writes to make them visible to all threads.
A minimal sketch of block-local communication follows.
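This sketch assumes a fixed block size of 256 and illustrative names: each thread publishes one value to shared memory, the barrier makes all of the block's writes visible, and each thread then reads its neighbor's value.

__global__ void shiftWithinBlock(const float* in, float* out)
{
    __shared__ float buf[256];                        // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[idx];
    __syncthreads();                                  // all writes in the block are now visible
    int next = (threadIdx.x + 1) % blockDim.x;
    out[idx] = buf[next];                             // safe: the barrier ordered the writes
}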
Example: CPU Vector Addition

// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for(int i=0; i<num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}
Example: CUDA Vector Addition

// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
Example: CUDA Host Code

float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// ... initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
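The slide stops at the kernel launch; a hedged continuation of the same pattern copies the result back and releases the buffers (h_C is an extra host buffer introduced here for illustration, and error checking is omitted):

float* h_C = (float*) malloc(N * sizeof(float));

// copy the result back to the host (in the default stream this also waits for the kernel to finish)
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// release device and host memory
cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);
free(h_A);      free(h_B);      free(h_C);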
OpenCL (Open Computing Language)
First industry standard for a computing language; based on the C language; platform independent (NVIDIA, ATI, Intel, ...).
Data- and task-parallel compute model that uses all computational resources in the system (CPU, GPU, ...).
Work-item: same concept as a thread / fragment / etc.
Work-group: a group of work-items; work-items in the same work-group can communicate; multiple work-groups execute in parallel.
OpenCL program structure
Host program (CPU):
Platform layer: query compute devices, create a context.
Runtime: create memory objects, compile and create kernel program objects, issue commands (e.g., kernel launches) to a command queue, synchronize commands, clean up OpenCL resources.
Kernel (CPU, GPU): C-like code with some extensions; runs on the compute device.
CUDA vs. OpenCL comparison
Conceptually almost identical: work-item == thread, work-group == block.
Similar memory model: global, local, shared memory; kernel and host program.
CUDA is highly optimized only for NVIDIA GPUs; OpenCL can be used widely across GPUs and CPUs.
Implementation status of OpenCL
Specification 1.0 released by Khronos.
NVIDIA released a Beta 1.2 driver and SDK, available to registered GPU computing developers.
Apple will include OpenCL in Mac OS X Snow Leopard (Q3 2009): NVIDIA and ATI GPUs, Intel CPUs for Mac.
More companies will join.
GPU optimization tips: configuration
Identify the bottleneck: compute- or bandwidth-bound (use the profiler); focus on the most expensive but parallelizable parts (Amdahl's law).
Maximize parallel execution: use large inputs (many threads); avoid divergent execution.
Use limited resources efficiently: minimize shared memory and register use.
GPU optimization tips: memory
Memory access is the most important optimization.
Minimize device-to-host memory overhead; overlap kernels with memory copies (asynchronous copy).
Avoid shared memory bank conflicts.
Use coalesced global memory access.
Texture or constant memory can help (cached).
(Diagram: relative communication costs between PC memory, GPU global memory, on-chip shared memory, and the ALUs.)
A minimal sketch of a coalesced access pattern follows.
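A minimal sketch of the coalescing point above (kernel name is illustrative): consecutive threads touch consecutive addresses, so a warp's loads combine into a few wide memory transactions.

__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // thread k reads element k
    if (idx < n)
        out[idx] = in[idx];                           // coalesced read and write
}

By contrast, indexing as in[idx * stride] with a large stride scatters each warp's accesses across memory and forces many separate transactions.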
GPU optimization tips: instructions
Use less expensive operators (division: 32 cycles, multiplication: 4 cycles; multiply by 0.5 instead of dividing by 2.0).
Atomic operators are expensive and can introduce race conditions.
Double precision is much slower than float.
Use less accurate floating point instructions when possible: __sinf(), __expf(), __powf().
Avoid unnecessary instructions, e.g. through loop unrolling.
A short sketch of these tips follows.
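A short, hedged sketch of the instruction-level tips above (function names are illustrative; #pragma unroll asks nvcc to unroll a fixed-trip-count loop):

__global__ void scaleByHalf(float* v, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        v[idx] *= 0.5f;          // multiply by 0.5f instead of dividing by 2.0f
}

__device__ float sum4(const float* a)
{
    float s = 0.0f;
#pragma unroll                   // unrolled: no loop counter or branch at run time
    for (int i = 0; i < 4; ++i)
        s += a[i];
    return s;
}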
3. Application Example
CUDA ITK
ITK image filters implemented using CUDA
Convolution filters: mean filter, Gaussian filter, derivative filter, Hessian of Gaussian filter.
Statistical filter: median filter.
PDE-based filter: anisotropic diffusion filter.
CUDA ITK
CUDA code is integrated into ITK and is transparent to ITK users: no need to modify existing code that uses the ITK library.
The environment variable ITK_CUDA is checked at the entry point, GenerateData() or ThreadedGenerateData().
If ITK_CUDA == 0, execute the original ITK code; if ITK_CUDA == 1, execute the CUDA code. A sketch of this dispatch is shown below.
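A hedged sketch of that dispatch, not the actual CUDA ITK source; the helper name and the exact fallback are assumptions made for illustration.

#include <cstdlib>

bool UseCudaPath()
{
  const char* flag = std::getenv("ITK_CUDA");
  return (flag != 0 && std::atoi(flag) == 1);
}

// Inside a filter's GenerateData() / ThreadedGenerateData():
//   if (UseCudaPath())  run the CUDA implementation of the filter;
//   else                run the original ITK CPU code path.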
Convolution filters
Weighted sum of neighbors: for a size-n filter, each pixel is reused n times.
Non-separable filter (anisotropic): reuse data through shared memory.
Separable filter (Gaussian): an N-dimensional convolution becomes N 1D convolutions.
Naïve C/CUDA implementation: read from the input image whenever needed.

int xdim, ydim;  // size of input image
float *in, *out; // input/output image of size xdim*ydim
float w[][];     // convolution kernel of size n*m

for(x=0; x<xdim; x++)
{
   for(y=0; y<ydim; y++)
   {
      // compute convolution
      for(sx=x-n/2; sx<=x+n/2; sx++)
      {
         for(sy=y-m/2; sy<=y+m/2; sy++)
         {
            wx = sx - x + n/2;
            wy = sy - y + m/2;
            out[x][y] += w[wx][wy]*in[sx][sy];
         }
      }
   }
}

For xdim*ydim pixels and an n*m kernel, this loads from global memory n*m times per pixel.
Improved CUDA convolution filter: for a size n*m filter each pixel is reused n*m times, so save n*m-1 global memory loads by staging the tile in shared memory.

__global__ void cudaConvolutionFilter2DKernel(in, out, w)
{
  // copy global to shared memory (slow load from global memory, only once)
  sharedmem[] = in[][];
  __syncthreads();

  // sum neighbor pixel values (fast loads from shared memory, n*m times)
  float _sum = 0;
  for(uint j=threadIdx.y; j<=threadIdx.y + m; j++)
  {
    for(uint i=threadIdx.x; i<=threadIdx.x + n; i++)
    {
      wx = i - threadIdx.x;
      wy = j - threadIdx.y;
      _sum += w[wx][wy]*sharedmem[j*sharedmemdim.x + i];
    }
  }
}
CUDA Gaussian filter
Apply a 1D convolution filter along each axis, using temporary buffers (ping-pong rendering).

// temp[0], temp[1] : temporary buffers to store intermediate results
void cudaDiscreteGaussianImageFilter(in, out, stddev)
{
   // create Gaussian weight
   w = ComputeGaussKernel(stddev);
   temp[0] = in;

   // call the 1D convolution CUDA kernel with the Gaussian weight along each axis
   dim3 G, B;
   for(i=0; i<dimension; i++)
   {
      cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
   }
   out = temp[i%2];
}
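The slide calls cudaConvolutionFilter1DKernel without showing it; a hypothetical sketch of such a kernel for one row of length len follows (the signature, the radius parameter, and the clamped border handling are assumptions, not the actual CUDA ITK code).

__global__ void convolutionFilter1DKernel(const float* in, float* out,
                                          const float* w, int len, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= len) return;

    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k)
    {
        int s = min(max(x + k, 0), len - 1);   // clamp at the image borders
        sum += w[k + radius] * in[s];
    }
    out[x] = sum;
}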
Median filter (Viola et al. [VIS 03])
Find the median by bisection of the histogram bins: log(# bins) iterations; for 8-bit pixels, log(256) = 8 iterations.
(Diagram: successive bisection steps on an 8-bin histogram of intensities 0-7.)
1. Copy the current block from global to shared memory.
2. Bisect the intensity range, counting how many kernel pixels lie above the pivot:

min = 0; max = 255;
pivot = (min+max)/2.0f;
for(i=0; i<8; i++)
{
   count = 0;
   for(j=0; j<kernelsize; j++)
   {
      if(kernel[j] > pivot) count++;
   }
   if(count < kernelsize/2) max = floor(pivot);
   else min = ceil(pivot);
   pivot = (min + max)/2.0f;
}
return floor(pivot);
Perona & Malik anisotropic diffusion
Nonlinear diffusion: adaptive smoothing based on the magnitude of the gradient, preserving edges (high gradient).
Numerical solution: Euler explicit integration (iterative method), finite differences for the derivative computation.
(Images: input image, linear diffusion, P & M diffusion.)
Performance (approximate speedups)
Convolution filters: mean filter ~140x, Gaussian filter ~60x; derivative filter; Hessian of Gaussian filter.
Statistical filter: median filter ~25x.
PDE-based filter: anisotropic diffusion filter ~70x.
CUDA ITK
Source code available at http://sourceforge.net/projects/cudaitk/

CUDA ITK Future Work
ITK GPU image class: reduce CPU-to-GPU memory I/O, pipelining support.
Native interface for GPU code: similar to ThreadedGenerateData(), but for GPU threads.
Numerical library (vnl).
Out-of-GPU-core / GPU-cluster: processing large images (10-100 terabytes).
GPU platform-independent implementation: OpenCL could be a solution.
Conclusions
GPU computing delivers high performance; many scientific computing problems are parallelizable.
More consistency/stability in HW/SW: the main GPU architecture is mature, an industry-wide programming standard now exists (OpenCL), and better support/tools are available (C-based language, compiler, and debugger).
Issues: not every problem is suitable for GPUs; re-engineering of algorithms/software is required; future performance growth of GPU hardware is unclear (Intel's Larrabee).
thrust
thrust: a library of data-parallel algorithms and data structures with an interface similar to the C++ Standard Template Library, for CUDA.
C++ template metaprogramming automatically chooses the fastest code path at compile time.
thrust::sort

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
  // generate random data on the host
  thrust::host_vector<int> h_vec(1000000);
  thrust::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer to device and sort
  thrust::device_vector<int> d_vec = h_vec;

  // sort 140M 32b keys/sec on GT200
  thrust::sort(d_vec.begin(), d_vec.end());
  return 0;
}

http://thrust.googlecode.com

Editor's Notes

  • #5: Fluid flow, level set segmentation, DTI image
  • #12: One of the major debates you'll see in graphics in the coming years is whether the scheduling and work distribution logic should be provided as highly optimized hardware or be implemented as a software program on the programmable cores.
  • #20: Pack the core full of ALUs. We are not going to increase our core's ability to decode instructions; we will decode 1 instruction and execute it on all 8 ALUs.
  • #21: How can we make use of all these ALUs?
  • #22: Just have the shader program work on 8 fragments at a time. Replace the scalar operations with 8-wide vector ones.
  • #23: So the program processes 8 fragments at a time, and all the work for each fragment is carried out by 1 of the 8 ALUs. Notice that I've also replicated part of the context to store execution state for the 8 fragments. For example, I'd replicate the registers.
  • #36: We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core's ALUs never go idle.
  • #38: Described adding contexts. In reality there's a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using on-chip storage as a traditional data cache, GPUs choose to use this store to hold contexts.
  • #39: Shading performance relies on large-scale interleaving. Number of interleaved groups per core: ~20-30. Could be separate hardware-managed contexts or software-managed using techniques.
  • #40: Fewer contexts fit on chip, so the chip can hide less latency and stalls are more likely.
  • #41: Lose performance when shaders use a lot of registers.
  • #43: 128 simultaneous threads on each core.
  • #47: Drive these ALUs using explicit SIMD instructions or implicitly via HW-determined sharing.
  • #57: Numbers are relative cost of communication
  • #74: Runs on each thread – is parallel
  • #75: G = grid size, B = block size