HPC GPU Programming with CUDA

An Overview of CUDA for High Performance Computing

By Kato Mivule
Computer Science Department
Bowie State University
COSC887 Fall 2013

Bowie State University Department of Computer Science

Agenda
• CUDA Introduction.
• CUDA Process flow.
• CUDA Hello world program.
• CUDA – Compiling and running a program.
• CUDA Basic structure.
• CUDA – Example program on vector addition.
• CUDA – The conclusion.
• CUDA – References and sources.


CUDA – Introduction

• CUDA – Compute Unified Device Architecture.
• Developed by NVIDIA.
• A parallel computing platform and programming model.
• Implemented on NVIDIA graphics processing units (GPUs).


CUDA – Introduction
• Grants direct access to the virtual instruction set and memory of GPUs.
• Allows for general-purpose processing on GPUs (GPGPU) beyond graphics.
• Allows for increased computing performance using GPUs.

Plymouth Cuda – Image Source: betterparts.org


CUDA – Process flow in three steps
1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute it.
3. Copy results from GPU memory to CPU memory.

Image Source: http://en.wikipedia.org/wiki/CUDA
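
The three steps map onto a handful of CUDA runtime calls. The following is a minimal sketch, assuming a CUDA-capable device is available; the kernel doubleValues and the helper roundTrip are hypothetical names chosen for illustration, and return codes are ignored here to keep the flow visible.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void doubleValues(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Runs the three-step flow on a host array of n ints.
void roundTrip(int *h_data, int n) {
    int *d_data = NULL;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&d_data, bytes);
    // Step 1: copy input data from CPU memory to GPU memory.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    // Step 2: load the GPU program (the kernel) and execute it.
    doubleValues<<<(n + 255) / 256, 256>>>(d_data, n);
    // Step 3: copy results from GPU memory back to CPU memory.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}

int main(void) {
    int h_data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    roundTrip(h_data, 8);
    for (int i = 0; i < 8; i++) printf("%d ", h_data[i]);
    printf("\n");
    return 0;
}
```

The blocking cudaMemcpy in step 3 also waits for the kernel to finish, so no explicit synchronization is needed in this sketch.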


CUDA – Hello world program
#include <stdio.h>

// __global__ denotes device (GPU) code:
// the function runs on the device and gets called from host code.
__global__ void mykernel(void) {
}

// Host (CPU) code: runs on the host.
int main(void) {
    printf("Hello, world!\n");
    // <<< >>> denotes a call from host code to device code.
    mykernel<<<1,1>>>();
    return 0;
}
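
The kernel above has an empty body, so only the host prints anything. The sketch below is a variation, not part of the original slides, in which the device prints as well. It assumes a GPU that supports device-side printf; helloKernel and launchHello are illustrative names.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread prints its own block and thread index.
__global__ void helloKernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Launches the kernel and waits for it; returns the CUDA status code.
cudaError_t launchHello(int blocks, int threads) {
    helloKernel<<<blocks, threads>>>();
    cudaError_t err = cudaGetLastError();   // reports launch errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // waits for the kernel and flushes device output
}

int main(void) {
    printf("Hello, world from the host!\n");
    cudaError_t err = launchHello(2, 4);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Kernel launches are asynchronous, which is why the sketch synchronizes before main returns; without that call the program may exit before the device output appears.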

CUDA – Compiling and Running a Program on GWU’s Cray
1. Log into the Cray: ssh cray
2. Change to the ‘work’ directory: cd work
3. Create your program with the file extension .cu: vim hello1.cu
4. Load the CUDA module: module load cudatoolkit
5. Compile using NVCC: nvcc hello1.cu -o hello1
6. Execute the program: ./hello1


CUDA – Basic structure
• The kernel is the GPU program.
• The kernel is executed on a grid.
• The grid is a group of thread blocks.
• The thread block is a group of threads.
• A thread block is executed on a single multiprocessor.
• Threads within a block can communicate and synchronize.
• Threads are grouped into blocks, and blocks into a grid.
Image Source: CUDA Overview Tutorial, Cliff Woolley, NVIDIA
http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf


CUDA – Basic structure
Declaring functions and variables
• __global__ – denotes a kernel function, called from the host and executed on the device.
• __device__ – denotes a device function, called from and executed on the device.
• __host__ – denotes a host function, called from and executed on the host.
• __constant__ – denotes a constant device variable available to all threads.
• __shared__ – denotes a shared device variable available to all threads in a block.
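
The five qualifiers can appear together in one small program. The sketch below is illustrative only: the names scale, square, scaleSquares, runScaleSquares and the THREADS constant are assumptions made for this example, and the shared array is used purely to show where the qualifier goes.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define THREADS 128  // threads per block (illustrative choice)

// __constant__: read-only on the device, visible to all threads.
__constant__ float scale;

// __device__: called from device code, runs on the device.
__device__ float square(float x) { return x * x; }

// __global__: a kernel, called from the host, runs on the device.
__global__ void scaleSquares(const float *in, float *out, int n) {
    // __shared__: one copy per block, visible to that block's threads.
    __shared__ float tile[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // staged only to show the qualifier
    __syncthreads();
    if (i < n) out[i] = scale * square(tile[threadIdx.x]);
}

// __host__: called from the host, runs on the host (the default for plain functions).
__host__ float runScaleSquares(float x, float s) {
    float *d_in, *d_out, result = 0.0f;
    cudaMalloc(&d_in, sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, &x, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &s, sizeof(float));  // set the constant from the host
    scaleSquares<<<1, THREADS>>>(d_in, d_out, 1);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main(void) {
    printf("%f\n", runScaleSquares(3.0f, 2.0f));  // computes 2 * 3^2 on the device
    return 0;
}
```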


CUDA – Basic structure
Some of the supported data types
• char and uchar
• short and ushort
• int and uint
• long and ulong
• float and double
• longlong and ulonglong
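
In CUDA C, names such as uchar and uint appear as the bases of the built-in vector types (uchar4, uint3, float4 and so on), each with a matching make_ constructor. A small sketch of how such types are used; addFloat4 is a hypothetical helper written for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Component-wise sum of two float4 values; compiled for both host and device.
__host__ __device__ float4 addFloat4(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

int main(void) {
    // Four unsigned chars packed in one value, e.g. an RGBA pixel.
    uchar4 pixel = make_uchar4(255, 128, 0, 255);
    float4 s = addFloat4(make_float4(1, 2, 3, 4), make_float4(10, 20, 30, 40));
    printf("pixel.y = %d\n", pixel.y);
    printf("s = (%f, %f, %f, %f)\n", s.x, s.y, s.z, s.w);
    return 0;
}
```

Components are accessed through the fields x, y, z and w, the same fields used by threadIdx and blockIdx.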


CUDA – Basic structure
• Accessing components – the kernel launch specifies the number of blocks and threads.
• dim3 gridDim – denotes the dimensions of the grid in blocks.
    • Example: dim3 DimGrid(8, 4) – 32 thread blocks
• dim3 blockDim – denotes the dimensions of a block in threads.
    • Example: dim3 DimBlock(2, 2, 2) – 8 threads per block
• uint3 blockIdx – denotes a block index within the grid.
• uint3 threadIdx – denotes a thread index within the block.
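
The two example configurations can be combined in a single launch. The sketch below, with the hypothetical names recordIds and countCorrectIds, has every thread flatten its block and thread indices into one global index, so that 32 blocks of 8 threads fill an array of 256 entries.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread writes its flattened global index into out[].
__global__ void recordIds(int *out) {
    int block = blockIdx.x + blockIdx.y * gridDim.x;             // 2D grid
    int thread = threadIdx.x + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;          // 3D block
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int id = block * threadsPerBlock + thread;
    out[id] = id;
}

// Launches the kernel and returns how many slots hold their own index.
int countCorrectIds(void) {
    dim3 DimGrid(8, 4);       // 8 * 4 = 32 thread blocks (z defaults to 1)
    dim3 DimBlock(2, 2, 2);   // 2 * 2 * 2 = 8 threads per block
    int total = DimGrid.x * DimGrid.y * DimBlock.x * DimBlock.y * DimBlock.z;
    int *d_out = NULL;
    int *h_out = (int *)malloc(total * sizeof(int));
    for (int i = 0; i < total; i++) h_out[i] = -1;   // sentinel values
    cudaMalloc(&d_out, total * sizeof(int));
    recordIds<<<DimGrid, DimBlock>>>(d_out);
    cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    int correct = 0;
    for (int i = 0; i < total; i++) correct += (h_out[i] == i);
    cudaFree(d_out);
    free(h_out);
    return correct;
}

int main(void) {
    printf("threads with correct ids: %d\n", countCorrectIds());
    return 0;
}
```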


CUDA – Basic structure
Thread management
• __threadfence_block() – makes the calling thread's earlier memory writes visible to all threads in its block before any of its later writes.
• __threadfence() – gives the same ordering guarantee for all threads on the device.
• __threadfence_system() – gives the same ordering guarantee for all threads on the device and for the host.
• __syncthreads() – waits until all threads in the block have reached this point.
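
A common use of __syncthreads() is a sum computed by one block in shared memory. The following is a sketch under stated assumptions: a single block of THREADS threads, with THREADS a power of two; blockSum and sumOnDevice are hypothetical names.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; a power of two for this halving scheme

// Sums THREADS values within one block using shared memory.
__global__ void blockSum(const int *in, int *out) {
    __shared__ int partial[THREADS];
    int t = threadIdx.x;
    partial[t] = in[t];
    __syncthreads();   // all loads are finished before any thread reads partial[]
    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();   // finish this round before the next one starts
    }
    if (t == 0) *out = partial[0];
}

// Sums an array of ones on the device and returns the result.
int sumOnDevice(void) {
    int h_in[THREADS], h_out = -1;
    for (int i = 0; i < THREADS; i++) h_in[i] = 1;
    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    blockSum<<<1, THREADS>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main(void) {
    printf("sum = %d\n", sumOnDevice());
    return 0;
}
```

Both __syncthreads() calls sit outside the if statement, so every thread in the block reaches them; calling the barrier from only some threads of a block is an error.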


CUDA – Basic structure
Memory management
• cudaMalloc( ) – allocates device (GPU) memory.
• cudaFree( ) – frees allocated device memory.
• cudaMemcpy( ) – copies memory between host and device; with cudaMemcpyDeviceToHost it copies device (GPU) results back to host (CPU) memory.
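
Each of these calls returns a cudaError_t that can be checked. The sketch below allocates a buffer, clears it, copies it back and frees it, stopping at the first failure. The helper zeroAndFetch is a hypothetical name, and cudaMemset is a runtime call that the slides do not list.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Allocates n ints on the device, zeroes them, copies them back, and frees.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t zeroAndFetch(int *h_out, int n) {
    int *d_buf = NULL;
    size_t bytes = n * sizeof(int);
    cudaError_t err = cudaMalloc(&d_buf, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemset(d_buf, 0, bytes);   // not on the slide: fills device memory
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaError_t freeErr = cudaFree(d_buf);
    return (err != cudaSuccess) ? err : freeErr;
}

int main(void) {
    int h_out[4] = {7, 7, 7, 7};
    cudaError_t err = zeroAndFetch(h_out, 4);
    printf("status: %s, first value: %d\n", cudaGetErrorString(err), h_out[0]);
    return 0;
}
```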


CUDA – Basic structure
Atomic functions – read-modify-write operations executed without interference from other threads
• atomicAdd( )
• atomicSub( )
• atomicExch( )
• atomicMin( )
• atomicMax( )
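
A shared counter shows why atomics matter: if many threads performed an ordinary increment on the same address, updates could be lost. In the sketch below every thread calls atomicAdd on one integer; countThreads and countWithAtomics are names invented for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Every thread adds 1 to the same counter; atomicAdd prevents lost updates.
__global__ void countThreads(int *counter) {
    atomicAdd(counter, 1);
}

// Launches blocks * threads threads and returns the final counter value.
int countWithAtomics(int blocks, int threads) {
    int h_counter = 0;
    int *d_counter = NULL;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);
    countThreads<<<blocks, threads>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main(void) {
    printf("counter = %d\n", countWithAtomics(4, 64));
    return 0;
}
```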

CUDA – Example code for vector addition
//=============================================================
// Vector addition
// Oak Ridge National Laboratory example
// https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
//=============================================================
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c.
// Runs on the device (GPU) and gets called by the host (CPU).
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
    int blockSize, gridSize;
    // Number of threads in each thread block
    blockSize = 1024;
    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n; this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);
    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}
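
The launch configuration in the example rounds n/blockSize up through a float cast and ceil(). That works for n = 100000, but a float holds only about 24 bits of integer precision, so an integer ceiling division may be a safer habit for large n. The sketch below also caps the block size at the device limit, since some devices allow fewer than 1024 threads per block. The helpers gridSizeFor and clampBlockSize are hypothetical, and device 0 is assumed.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Integer ceiling division: the smallest number of blocks that covers n threads.
int gridSizeFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Returns the requested block size, capped at the limit of device 0.
// Falls back to the request unchanged if the device query fails.
int clampBlockSize(int requested) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return requested;
    return requested < prop.maxThreadsPerBlock ? requested : prop.maxThreadsPerBlock;
}

int main(void) {
    int n = 100000;
    int blockSize = clampBlockSize(1024);
    int gridSize = gridSizeFor(n, blockSize);
    printf("blockSize = %d, gridSize = %d\n", blockSize, gridSize);
    return 0;
}
```

With n = 100000 and a block size of 1024 this gives 98 blocks, the same value the ceil() form produces; the last block has spare threads, which is why the kernel checks id < n.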


CUDA – Example code for vector addition
Sometimes CUDA code that looks correct will output wrong results.
• Check the machine for errors – access to the device (GPU) might not be granted.
• The computation might then only produce correct results on the host (CPU).
//============================
// ERROR CHECKING
//============================
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
// Place in the memory allocation section
cudaCheckErrors("cudamalloc fail");
// Place in the memory copy section
cudaCheckErrors("cuda memcpy fail");
cudaCheckErrors("cudamemcpy or cuda kernel fail");
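
A kernel launch returns no status of its own, so launch problems have to be fetched with cudaGetLastError(), and errors raised while the kernel runs show up at the next synchronizing call. The sketch below, with the hypothetical names noop and tryLaunch, provokes a launch failure by asking for far more threads per block than any device allows.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Empty kernel used only to exercise the launch path.
__global__ void noop(void) {}

// Launches noop with the given block size and reports the resulting status.
cudaError_t tryLaunch(int threadsPerBlock) {
    noop<<<1, threadsPerBlock>>>();
    cudaError_t err = cudaGetLastError();   // launch-time errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // errors raised while the kernel ran
}

int main(void) {
    int sizes[2] = {256, 1 << 20};   // the second is far above any block-size limit
    for (int i = 0; i < 2; i++) {
        cudaError_t err = tryLaunch(sizes[i]);
        printf("%d threads per block: %s\n", sizes[i], cudaGetErrorString(err));
    }
    return 0;
}
```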

Conclusion
• CUDA’s access to GPU computational power is outstanding.
• CUDA is easy to learn.
• CUDA programs can be written in C.
• However, translating code from host to device and from device to host is a challenge.


References and Sources
[1] CUDA Programming Blog Tutorial
http://cuda-programming.blogspot.com/2013/03/cuda-complete-complete-reference-on-cuda.html
[2] Dr. Kenrick Mock CUDA Tutorial
http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-firstprograms.pdf
[3] Parallel Programming Lecture Notes, Spring 2008, Johns Hopkins University
http://hssl.cs.jhu.edu/wiki/lib/exe/fetch.php?media=randal:teach:cs420:cudatools.pdf
[4] CUDA Super Computing Blog Tutorials
http://supercomputingblog.com/cuda-tutorials/
[5] Introduction to CUDA C Tutorial, Jason Sanders
http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf
[6] CUDA Overview Tutorial, Cliff Woolley, NVIDIA
http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
[7] Oak Ridge National Laboratory CUDA Vector Addition Example
https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
[8] CUDA – Wikipedia
http://en.wikipedia.org/wiki/CUDA
