Intro to GPGPU Programming with CUDA
Rob Gillen
CodeStock is proudly partnered with RecruitWise and Staff with Excellence - www.recruitwise.jobs
Send instant feedback on this session via Twitter:
  • Send a direct message with the room number to @CodeStock
  • Example: d codestock 411 This guy is Amazing!
For more information on sending feedback using Twitter while at CodeStock, please see the “CodeStock README” in your CodeStock guide.
Welcome!
Goals:
  • Overview of GPGPU with CUDA
  • “Vision Casting” for how you can use GPUs to improve your application
Outline:
  • Why GPGPUs?
  • Applications
  • Tooling
  • Hands-On: Matrix Multiplication
Rating: http://spkr8.com/t/7714
CPU vs. GPU
  • GPU devotes more transistors to data processing
NVIDIA Fermi
  • ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
  • 230 GB/s DRAM bandwidth
Motivation
  • Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU
Example: Sparse Matrix-Vector
  • CPU results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, Williams et al., Supercomputing 2007
Rayleigh-Bénard Results
  • Double precision
  • 384 x 384 x 192 grid (max that fits in 4 GB)
  • Vertical slice of temperature at y=0
  • Transition from stratified (left) to turbulent (right)
  • Regime depends on Rayleigh number: Ra = gαΔT/κν
  • 8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon
G80 Characteristics
  • 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
  • 265 GFLOPS sustained for apps such as VMD
  • Massively parallel: 128 cores, 90 W
  • Massively threaded: sustains 1000s of threads per app
  • 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
Supercomputer Comparison
Applications
  • Exciting applications in the future mass computing market have traditionally been considered “supercomputing applications”
      • Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
      • These “super-apps” represent and model the physical, concurrent world
  • Various granularities of parallelism exist, but…
      • the programming model must not hinder parallel implementation
      • data delivery needs careful management
*Not* for all applications
  • SPMD (Single Program, Multiple Data) workloads are best (data parallel)
  • Operations need to be of sufficient size to overcome overhead
  • Think millions of operations
Raytracing
NVIRT: CUDA Ray Tracing API
Tooling
  • VS 2010 C++ (Express is OK… sort of.)
  • NVIDIA CUDA-Capable GPU
  • NVIDIA CUDA Toolkit (v4+)
  • NVIDIA CUDA Tools (v4+)
  • GPU Computing SDK
  • NVIDIA Parallel Nsight
Parallel Debugging
Parallel Analysis
VS Project Templates
VS Project Templates
Before we get too excited…
  • Host vs Device
  • Kernels
      • __global__   __device__   __host__
  • Thread/Block Control
      • <<<x, y>>>
      • Multi-dimensioned coordinate objects
  • Memory Management/Movement
  • Thread Management – think 1000’s or 1,000,000’s
Block IDs and Threads
  • Each thread uses IDs to decide what data to work on
      • Block ID: 1D or 2D
      • Thread ID: 1D, 2D, or 3D
  • Simplifies memory addressing when processing multidimensional data
      • Image processing
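To make the ID-to-data mapping concrete, here is a minimal host-side sketch in plain C++ (not CUDA) of how a thread might turn its block ID and thread ID into the pixel it owns. The `Index2D` struct is a hypothetical stand-in for CUDA's built-in `blockIdx`, `blockDim`, and `threadIdx` values, and `pixelOffset` assumes the image is stored row by row; none of these helper names come from the slides.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's built-in 2D index values.
struct Index2D {
    unsigned x;
    unsigned y;
};

// Column of the element this thread would own: blocks to the left
// contribute blockDim.x columns each, then the thread's own x offset.
unsigned globalCol(Index2D blockIdx, Index2D blockDim, Index2D threadIdx) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// Row of the element this thread would own, computed the same way in y.
unsigned globalRow(Index2D blockIdx, Index2D blockDim, Index2D threadIdx) {
    return blockIdx.y * blockDim.y + threadIdx.y;
}

// Offset of pixel (row, col) in an image stored row by row (assumption).
std::size_t pixelOffset(unsigned row, unsigned col, unsigned imageWidth) {
    return static_cast<std::size_t>(row) * imageWidth + col;
}
```

With 16 x 16 blocks, thread (3, 5) of block (1, 2) would land on column 19 and row 37, which in a 64-pixel-wide image is offset 2387.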
CUDA Thread Block
  • All threads in a block execute the same kernel program (SPMD)
  • Programmer declares block:
      • Block size: 1 to 512 concurrent threads
      • Block shape: 1D, 2D, or 3D
      • Block dimensions in threads
  • Threads have thread ID numbers within the block
      • Thread program uses thread ID to select work and address shared data
  • Threads in the same block share data and synchronize while doing their share of the work
  • Threads in different blocks cannot cooperate
      • Each block can execute in any order relative to other blocks!
[Figure: a CUDA thread block; threads with IDs 0, 1, 2, 3, …, m each run the thread program]
Transparent Scalability
  • Hardware is free to assign blocks to any processor at any time
      • A kernel scales across any number of parallel processors
  • Each block can execute in any order relative to other blocks
[Figure: a kernel grid of Block 0 through Block 7 scheduled over time onto two devices]
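The order-independence claim can be illustrated with a small host-side emulation in plain C++ (not CUDA). The sketch assumes a toy workload in which each block doubles only its own slice of an array. The names `runBlock`, `runGrid`, and `ascendingOrder` are hypothetical helpers for this illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Work done by one "block": double the values in its own slice.
// A block never reads or writes another block's slice.
void runBlock(std::vector<float>& data, int blockId, int blockSize) {
    int begin = blockId * blockSize;
    int end = std::min<int>(begin + blockSize, static_cast<int>(data.size()));
    for (int i = begin; i < end; ++i) {
        data[i] *= 2.0f;
    }
}

// Run every block once, in the order listed in `order`, on a copy of the data.
std::vector<float> runGrid(std::vector<float> data, int blockSize,
                           const std::vector<int>& order) {
    for (int blockId : order) {
        runBlock(data, blockId, blockSize);
    }
    return data;
}

// Block IDs 0 .. numBlocks-1 in ascending order.
std::vector<int> ascendingOrder(int numBlocks) {
    std::vector<int> order(numBlocks);
    std::iota(order.begin(), order.end(), 0);
    return order;
}
```

Because the blocks are independent, running them in ascending, reversed, or any other order gives the same output. That independence is what lets the hardware schedule blocks however it likes.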
A Simple Running Example: Matrix Multiplication
  • A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
      • Leave shared memory usage until later
      • Local, register usage
      • Thread ID usage
      • Memory data transfer API between host and device
      • Assume square matrix for simplicity
Programming Model: Square Matrix Multiplication Example
  • P = M * N of size WIDTH x WIDTH
  • Without tiling:
      • One thread calculates one element of P
      • M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH]
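The "loaded WIDTH times" remark follows from simple counting, sketched below in plain C++. It assumes the untiled scheme described above, where each thread reads one full row of M and one full column of N. The function names are hypothetical.

```cpp
#include <cstdint>

// Global-memory reads issued by one thread: WIDTH elements of M plus
// WIDTH elements of N.
std::uint64_t loadsPerThread(std::uint64_t width) {
    return 2 * width;
}

// Reads issued by all WIDTH * WIDTH threads together.
std::uint64_t totalLoads(std::uint64_t width) {
    return width * width * loadsPerThread(width);
}

// How many times each distinct input element is read: total reads divided
// by the 2 * WIDTH * WIDTH distinct elements in M and N.
std::uint64_t loadsPerElement(std::uint64_t width) {
    std::uint64_t distinctElements = 2 * width * width;
    return totalLoads(width) / distinctElements;
}
```

For WIDTH = 4 this gives 8 reads per thread and 128 reads in total, so each of the 32 input elements is fetched 4 times. This redundancy is what tiling with shared memory later aims to reduce.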
Memory Layout of Matrix in C
[Figure: the 4 x 4 matrix M (elements M0,0 through M3,3) drawn as a 2D grid and, below it, as the one-dimensional sequence in which C stores it in memory]
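C stores a two-dimensional array row by row, so a matrix held in a flat buffer is indexed as `row * width + col`. The following plain C++ sketch shows that mapping. `flatIndex` and `makeLabelledMatrix` are hypothetical helper names, and the label scheme (10 * row + col) is only there to make positions easy to check.

```cpp
#include <cstddef>
#include <vector>

// Position of element (row, col) in a row-major flat buffer.
std::size_t flatIndex(std::size_t row, std::size_t col, std::size_t width) {
    return row * width + col;
}

// Build a width x width matrix in a flat buffer where element (row, col)
// holds the label 10 * row + col.
std::vector<int> makeLabelledMatrix(std::size_t width) {
    std::vector<int> m(width * width);
    for (std::size_t row = 0; row < width; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            m[flatIndex(row, col, width)] = static_cast<int>(10 * row + col);
        }
    }
    return m;
}
```

In a 4 x 4 matrix, element (2, 3) sits at flat position 11, and stepping to the next row moves the position forward by the full width.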
Simple Matrix Multiplication (CPU)
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i) {
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];  // walk along row i of M
                    float b = N[k * Width + j];  // walk down column j of N
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
        }
    }
[Figure: row i of M and column j of N combine to produce element (i, j) of P]
Simple Matrix Multiplication (GPU)
    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;  // device pointers
        …
        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc(&Pd, size);
Simple Matrix Multiplication (GPU)
        // 2. Kernel invocation code – to be shown later
        …
        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }
Kernel Function
    // Matrix multiplication kernel – per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
Kernel Function (contd.)
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }
[Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to compute element (ty, tx) of Pd]
Kernel Function (full)
    // Matrix multiplication kernel – per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }
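For readers without a GPU at hand, the kernel's logic can be checked with a host-side emulation in plain C++ (not CUDA). In this sketch the hypothetical `ThreadIdx` struct stands in for CUDA's built-in `threadIdx`, and `emulateSingleBlock` visits every thread ID of one Width x Width block in a loop, where a GPU would run them concurrently. The per-thread body mirrors the kernel above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in threadIdx (2D only).
struct ThreadIdx {
    int x;
    int y;
};

// Same arithmetic as the kernel body, for one thread.
void matrixMulThread(const std::vector<float>& Md, const std::vector<float>& Nd,
                     std::vector<float>& Pd, int Width, ThreadIdx threadIdx) {
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

// Emulate one Width x Width thread block by calling the thread body once
// per thread ID. The call order does not matter: each thread writes a
// different element of Pd.
std::vector<float> emulateSingleBlock(const std::vector<float>& Md,
                                      const std::vector<float>& Nd, int Width) {
    std::vector<float> Pd(static_cast<std::size_t>(Width) * Width, 0.0f);
    for (int ty = 0; ty < Width; ++ty) {
        for (int tx = 0; tx < Width; ++tx) {
            matrixMulThread(Md, Nd, Pd, Width, ThreadIdx{tx, ty});
        }
    }
    return Pd;
}
```

Multiplying the 2 x 2 matrices {1, 2, 3, 4} and {5, 6, 7, 8} this way yields {19, 22, 43, 50}, the same result the CPU triple loop produces.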
Kernel Invocation (Host Side)
    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Only One Thread Block Used
  • One block of threads computes matrix Pd
      • Each thread computes one element of Pd
  • Each thread
      • Loads a row of matrix Md
      • Loads a column of matrix Nd
      • Performs one multiply and one addition for each pair of Md and Nd elements
      • Compute to off-chip memory access ratio close to 1:1 (not very high)
  • Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 containing Block 1; Thread (2, 2) computes one element of Pd from Md and Nd]
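Both limits on this slide reduce to small calculations, sketched here in plain C++. The sketch assumes the 512-threads-per-block ceiling quoted on the CUDA Thread Block slide (other hardware may allow a different number) and counts two floating-point operations and two global loads per loop iteration. The function names are hypothetical.

```cpp
// Largest Width for which a Width x Width block stays within the
// per-block thread limit.
int maxSingleBlockWidth(int maxThreadsPerBlock) {
    int width = 0;
    while ((width + 1) * (width + 1) <= maxThreadsPerBlock) {
        ++width;
    }
    return width;
}

// Floating-point operations per global-memory load for one thread:
// each of the Width iterations performs one multiply and one add (2 ops)
// and reads one element of Md and one of Nd (2 loads).
double computeToLoadRatio(int width) {
    double flops = 2.0 * width;
    double loads = 2.0 * width;
    return flops / loads;
}
```

With a 512-thread limit the single-block version tops out at a 22 x 22 matrix (484 threads, since 23 x 23 = 529 is too many). The ratio is 1.0 whatever the width, which is why the slide calls it "not very high".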
Handling Arbitrary Sized Square Matrices
  • Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
      • Each has (TILE_WIDTH)^2 threads
  • Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
  • You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)!
[Figure: Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) and thread (tx, ty) together select one element]
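The multi-block indexing can be sketched on the host in plain C++ (not CUDA). The `Coord` struct is a hypothetical stand-in for `blockIdx` and `threadIdx`. As an assumption beyond the slide, which takes WIDTH to be a multiple of TILE_WIDTH, the sketch rounds the grid size up and adds a bounds check so partial tiles at the edges are handled too.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's blockIdx / threadIdx (2D only).
struct Coord {
    int x;
    int y;
};

// Blocks needed along one grid dimension, rounding up for partial tiles.
int blocksPerDim(int width, int tileWidth) {
    return (width + tileWidth - 1) / tileWidth;
}

// Per-thread body: the block ID picks the tile, the thread ID picks the
// element inside the tile.
void tiledThread(const std::vector<float>& Md, const std::vector<float>& Nd,
                 std::vector<float>& Pd, int Width, int tileWidth,
                 Coord blockIdx, Coord threadIdx) {
    int row = blockIdx.y * tileWidth + threadIdx.y;
    int col = blockIdx.x * tileWidth + threadIdx.x;
    if (row >= Width || col >= Width) {
        return;  // thread falls outside the matrix in a partial tile
    }
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    Pd[row * Width + col] = Pvalue;
}

// Emulate the whole grid by visiting every block and every thread in it.
std::vector<float> emulateTiledGrid(const std::vector<float>& Md,
                                    const std::vector<float>& Nd,
                                    int Width, int tileWidth) {
    std::vector<float> Pd(static_cast<std::size_t>(Width) * Width, 0.0f);
    int grid = blocksPerDim(Width, tileWidth);
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < tileWidth; ++ty)
                for (int tx = 0; tx < tileWidth; ++tx)
                    tiledThread(Md, Nd, Pd, Width, tileWidth,
                                Coord{bx, by}, Coord{tx, ty});
    return Pd;
}
```

A 4 x 4 matrix with TILE_WIDTH = 2 needs a 2 x 2 grid, matching the small example on the next slide. A 3 x 3 matrix also gets a 2 x 2 grid, with the bounds check skipping the threads that fall outside the matrix.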
Small Example
[Figure: a 4 x 4 example with TILE_WIDTH = 2; Pd (elements Pd0,0 through Pd3,3) is split among Block(0,0), Block(1,0), Block(0,1), and Block(1,1), each computing a 2 x 2 tile from Md and Nd]
Cleanup Topics
  • Memory Management
  • Pinned Memory (Zero-Transfer)
  • Portable Pinned Memory
  • Multi-GPU
  • Wrappers (Python, Java, .NET)
  • Kernels
  • Atomics
  • Thread Synchronization (staged reductions)
  • NVCC
Questions?
  • rob@gillenfamily.net
  • @argodev
  • http://rob.gillenfamily.net
  • Rate: http://spkr8.com/t/7714

Editor's Notes

  • #10: Sparse linear algebra is interesting both because many science and engineering codes rely on it, and also because it was traditionally assumed to be something that GPUs would not be good at (because of irregular data access patterns). We have shown that in fact GPUs are extremely good at sparse matrix-vector multiply (SpMV), which is the basic building block of sparse linear algebra. The code and an accompanying white paper are available on the CUDA forums and also posted on research.nvidia.com. This is compared to an extremely well-studied, well-optimized SpMV implementation from a widely respected paper in Supercomputing 2007. That paper only reported double-precision results for CPUs; our single-precision results are even more impressive in comparison.
  • #11: Compared to highly optimized Fortran code from an oceanography researcher at UCLA.
  • #16: Current implementation uses short-stack approach. Top elements of the stack are cached in registers.
  • #17: RTAPI enables implementation of many different raytracing flavors. Left to right, top to bottom: procedural materials, ambient occlusion, Whitted raytracer (thin shell glass and metallic spheres), path tracer (Cornell box), refractions, Cook-style distribution raytracing. Could also do non-rendering stuff, e.g. GIS (line of sight, say), physics (collision/proximity detection).