CUDA LAB
LSALAB
OVERVIEW
Programming Environment
Compile & Run CUDA Program
CUDA Tools
Lab Tasks
CUDA Programming Tips
References
GPU SERVER
Intel E5-2670 V2 10-core CPU x 2
NVIDIA K20X GPGPU card x 2
Command to get your GPGPU HW spec:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version:         5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 5760 MBytes (6039339008 bytes)
  (14) Multiprocessors, (192) CUDA Cores/MP:     2688 CUDA Cores
  GPU Clock rate:                                732 MHz (0.73 GHz)
  Memory Clock rate:                             2600 MHz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
Theoretical memory bandwidth: $2600 \times 10^{6} \times (384/8) \times 2 \div 1024^{3} = 243$ GB/s
Official HW spec details:
http://www.nvidia.com/object/tesla-servers.html
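The same kind of estimate can be computed at runtime from the device properties instead of reading deviceQuery output by hand. The following is a sketch (not part of the original deck); it uses only documented `cudaDeviceProp` fields:

```cuda
// Sketch: derive the theoretical memory bandwidth from cudaDeviceProp.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is reported in kHz; memoryBusWidth in bits.
    // DDR memory transfers twice per clock, hence the factor 2.
    double bw = prop.memoryClockRate * 1e3
              * (prop.memoryBusWidth / 8.0) * 2.0
              / (1024.0 * 1024.0 * 1024.0);
    printf("theoretical memory bandwidth: %.1f GB/s\n", bw);
    return 0;
}
```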
COMPILE & RUN CUDA
Directly compile to executable code.
GPU and CPU code are compiled and linked separately.

# compile the source code to executable file
$ nvcc a.cu -o a.out
COMPILE & RUN CUDA
The nvcc compiler translates CUDA source code into Parallel Thread Execution (PTX) language in an intermediate phase.

# keep all intermediate phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu -save-temps

$ nvcc a.cu -keep
$ ls
a.cpp1.ii  a.cpp4.ii    a.cudafe1.c    a.cudafe1.stub.c  a.cudafe2.stub.c  a.hash       a
a.cpp2.i   a.cu         a.cudafe1.cpp  a.cudafe2.c       a.fatbin          a.module_id  a
a.cpp3.i   a.cu.cpp.ii  a.cudafe1.gpu  a.cudafe2.gpu     a.fatbin.c        a.o          a

# clean all intermediate phase files
$ nvcc a.cu -keep -clean
USEFUL NVCC USAGE
Print code generation statistics:

$ nvcc -Xptxas -v reduce.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]

-Xptxas
--ptxas-options
Specify options directly to the PTX optimizing assembler.

The register count should be less than the number of available registers; otherwise the remaining registers will be spilled into local memory (off-chip).
smem stands for shared memory.
cmem stands for constant memory. The bank-#1 constant memory stores 4 bytes of constant variables.
CUDA TOOLS
cuda-memcheck: functional correctness checking suite.
nvidia-smi: NVIDIA System Management Interface.
CUDA-MEMCHECK
This tool checks the following memory errors in your program, and it also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can cause unexpected program behavior and memory misuse.

Table. Memcheck reported error types

Name                     | Description                                                                | Location | Precision
Memory access error      | Errors due to out-of-bounds or misaligned accesses to memory by a global,  | Device   | Precise
                         | local, shared, or global atomic access.                                    |          |
Hardware exception       | Errors that are reported by the hardware error reporting mechanism.        | Device   | Imprecise
Malloc/free errors       | Errors that occur due to incorrect use of malloc()/free() in CUDA kernels. | Device   | Precise
CUDA API errors          | Reported when a CUDA API call in the application returns a failure.        | Host     | Precise
cudaMalloc memory leaks  | Allocations of device memory using cudaMalloc() that have not been freed   | Host     | Precise
                         | by the application.                                                        |          |
Device heap memory leaks | Allocations of device memory using malloc() in device code that have not   | Device   | Imprecise
                         | been freed by the application.                                             |          |
CUDA-MEMCHECK
EXAMPLE
Program with a double-free fault:

int main(int argc, char *argv[])
{
    const int elemNum = 1024;
    int h_data[elemNum];
    int *d_data;
    initArray(h_data);
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **)&d_data, arraySize);
    incrOneForAll<<<1, 1024>>>(d_data);
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data); // fault: double free
    printArray(h_data);
    return 0;
}
CUDA-MEMCHECK
EXAMPLE
$ nvcc -g -G example.cu
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so [0x26d660]
=========     Host Frame:./a.out [0x42af6]
=========     Host Frame:./a.out [0x2a29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./a.out [0x2769]
=========

No error is shown if the program is run directly, but cuda-memcheck can detect the error.
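A common complement to cuda-memcheck is to check API return codes on the host. The macro below is a sketch (not part of the lab code); the name CUDA_CHECK is an illustrative choice:

```cuda
// Sketch: wrap CUDA API calls so failures such as the double cudaFree()
// above surface immediately, even without running under cuda-memcheck.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaFree(d_data));
// A second cudaFree() on the same pointer would abort here with an
// error string instead of failing silently.
```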
NVIDIA SYSTEM MANAGEMENT INTERFACE
(NVIDIA-SMI)
Purpose: query and modify GPU devices' state.
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm          On  | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    60W / 235W |     84MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm          On  | 0000:85:00.0     Off |                    0 |
| N/A   39C    P0    60W / 235W |     14MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     33736  ./RS                                             69MB       |
+-----------------------------------------------------------------------------+
NVIDIA-SMI
You can query more specific information on temperature, memory, power, etc.:

$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
For example:
$ nvidia-smi -q -d POWER
============== NVSMI LOG ==============
Timestamp                 :
Driver Version            : 319.37
Attached GPUs             : 2
GPU 0000:0B:00.0
    Power Readings
        Power Management      : Supported
        Power Draw            : 60.71 W
        Power Limit           : 235.00 W
        Default Power Limit   : 235.00 W
        Enforced Power Limit  : 235.00 W
        Min Power Limit       : 150.00 W
        Max Power Limit       : 235.00 W
GPU 0000:85:00.0
    Power Readings
        Power Management      : Supported
        Power Draw            : 31.38 W
        Power Limit           : 235.00 W
        Default Power Limit   : 235.00 W
LAB ASSIGNMENTS
1. Program #1: increase each element in an array by one.
   (You are required to rewrite a CPU program into a CUDA one.)
2. Program #2: use parallel reduction to calculate the sum of all the elements in an array.
   (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
   1. SUM CUDA programming with "multi-kernel and shared memory"
   2. SUM CUDA programming with "interleaved addressing"
   3. SUM CUDA programming with "sequential addressing"
   4. SUM CUDA programming with "first add during load"
0.2 points per task.
LAB ASSIGNMENT #1
Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:

// increase one for all the elements
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i)
    {
        array[i]++;
    }
}
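One possible shape of the CUDA version, as a sketch (the launch configuration and bounds check are assumptions, not part of the assignment template):

```cuda
// Sketch: each thread increments one element; the bounds check covers
// the case where elemNum is not a multiple of the block size.
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum)
        array[i]++;
}

// Launch with enough blocks to cover the whole array:
// incrOneForAll<<<(elemNum + 1023) / 1024, 1024>>>(d_array, elemNum);
```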
LAB ASSIGNMENT #2
Fill in the CUDA kernel function. Part of the main function is given; you are required to fill in the blanks according to the comments:

__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // TODO: load the content of global memory to shared memory
    // NOTE: synchronize all the threads after this step
    // TODO: sum calculation
    // NOTE: synchronize all the threads after each iteration
    // TODO: write back the result into the corresponding entry of global memory
    // NOTE: only one thread is enough to do the job
}

// parameters for the first kernel
// TODO: set grid and block size
// threadNum = ?
// blockNum = ?
int sMemSize = 1024 * sizeof(int);
reduce<<<blockNum, threadNum, sMemSize>>>(d_idata, d_odata); // grid size first, then block size
Hint: for the "first add during global load" optimization (Assignment #2-4), the third kernel is unnecessary.
LAB ASSIGNMENT #2
Given $2^{22}$ ints, and each block has the maximum block size of $2^{10}$ threads:
How to use 3 kernels to synchronize between iterations?
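A host-side sketch of the three-kernel approach with those sizes (buffer names d_in, d_partial, d_partial2, d_out are placeholders): each kernel launch boundary acts as a grid-wide synchronization point, which a single kernel cannot provide across blocks.

```cuda
// Sketch: each launch reduces blockSize elements per block to one
// partial sum; the next launch starts only after the previous finishes.
const int blockSize = 1 << 10;              // 1024 threads per block
int n = 1 << 22;                            // 2^22 elements to reduce
int smem = blockSize * sizeof(int);

reduce<<<n / blockSize, blockSize, smem>>>(d_in, d_partial);       // 2^22 -> 2^12
n /= blockSize;                                                    // n = 2^12
reduce<<<n / blockSize, blockSize, smem>>>(d_partial, d_partial2); // 2^12 -> 4
n /= blockSize;                                                    // n = 4
reduce<<<1, n, smem>>>(d_partial2, d_out);                         // 4 -> 1
```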
LAB ASSIGNMENT #2-1
Implement the naive data parallelism assignment as follows:
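The original slide illustrates this step with a figure; as a sketch, the naive in-kernel loop uses interleaved addressing with a modulo test (sdata is the shared array from the template; this assumes blockDim.x is a power of two):

```cuda
// Sketch: thread 0 adds sdata[1], thread 2 adds sdata[3], and so on,
// doubling the stride each iteration. The modulo test causes heavy
// warp divergence, which later assignments remove.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (threadIdx.x % (2 * s) == 0)
        sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
}
```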
LAB ASSIGNMENT #2-2
Reduce the number of active warps in your program:
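The slide's figure is not in this transcript; as a sketch, one way to reduce active warps is to map consecutive thread IDs onto the strided indices, so whole warps retire early instead of idling in divergent branches (sdata and power-of-two blockDim.x assumed as before):

```cuda
// Sketch: compact the active threads. Thread t handles index 2*s*t,
// so the working threads are contiguous and inactive warps can exit.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    unsigned int index = 2 * s * threadIdx.x;
    if (index < blockDim.x)
        sdata[index] += sdata[index + s];
    __syncthreads();
}
```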
LAB ASSIGNMENT #2-3
Prevent shared memory access bank conflicts:
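Again the figure is missing here; the usual fix, sketched below under the same assumptions, is sequential addressing: the stride starts at half the block size and halves each step, so active threads touch consecutive shared memory words and do not contend for banks.

```cuda
// Sketch: sequential addressing. Threads 0..s-1 read sdata[tid + s],
// a conflict-free contiguous range, instead of strided locations.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
}
```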
LAB ASSIGNMENT #2-4
Reduce the number of blocks in each kernel:
Notice:
Only 2 kernels are needed in this case because each kernel can now process twice the amount of data as before.
Global memory should be accessed in a sequential addressing way.
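A sketch of the "first add during load" idea (g_idata and sdata as in the template; variable names are illustrative): each block loads two elements per thread and adds them on the way into shared memory, which is why half the blocks, and one fewer kernel, suffice.

```cuda
// Sketch: each thread performs its first addition while loading from
// global memory, so every block covers 2 * blockDim.x input elements.
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[threadIdx.x] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();
```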
CUDA PROGRAMMING TIPS
KERNEL LAUNCH
mykernel<<<gridSize, blockSize, sMemSize, streamID>>>(args);

gridSize: number of blocks per grid
blockSize: number of threads per block
sMemSize [optional]: shared memory size (in bytes)
streamID [optional]: stream ID; the default is 0
BUILT-IN VARIABLES FOR INDEXING IN A
KERNEL FUNCTION
blockIdx.x, blockIdx.y, blockIdx.z: block index
threadIdx.x, threadIdx.y, threadIdx.z: thread index
gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
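For example, the common 1-D global element index combines three of these variables:

```cuda
// One element per thread: a unique global index across the whole grid.
int i = blockIdx.x * blockDim.x + threadIdx.x;

// Total number of threads in the grid, useful as a grid-stride step
// when the array is larger than the grid.
int stride = gridDim.x * blockDim.x;
```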
CUDAMEMCPY
cudaError_t cudaMemcpy(void *dst,
                       const void *src,
                       size_t count,
                       enum cudaMemcpyKind kind)

Enumerator:
cudaMemcpyHostToHost: Host -> Host
cudaMemcpyHostToDevice: Host -> Device
cudaMemcpyDeviceToHost: Device -> Host
cudaMemcpyDeviceToDevice: Device -> Device
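A typical round trip using the two most common kinds (a sketch; sizes and names are illustrative):

```cuda
// Sketch: copy input to the device, run kernels, copy results back.
int h_data[1024];
int *d_data;
cudaMalloc((void **)&d_data, sizeof(h_data));
cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
// ... kernel launches on d_data ...
cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
cudaFree(d_data);
```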
SYNCHRONIZATION
__syncthreads(): synchronizes all threads in a block (used inside the kernel function).
cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).

kernel1<<<gridSize, blockSize>>>(args);
cudaDeviceSynchronize();
kernel2<<<gridSize, blockSize>>>(args);
HOW TO MEASURE KERNEL EXECUTION TIME
USING CUDA GPU TIMERS
Methods:
cudaEventCreate(): initialize a timer
cudaEventDestroy(): destroy a timer
cudaEventRecord(): set a timer
cudaEventSynchronize(): synchronize the timer after each kernel call
cudaEventElapsedTime(): returns the elapsed time in milliseconds
Example:
HOW TO MEASURE KERNEL EXECUTION TIME
USING CUDA GPU TIMERS
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, threads>>>(d_idata, d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
REFERENCES
1. NVIDIA CUDA Runtime API
2. Programming Guide :: CUDA Toolkit Documentation
3. Best Practices Guide :: CUDA Toolkit Documentation
4. NVCC :: CUDA Toolkit Documentation
5. CUDA-MEMCHECK :: CUDA Toolkit Documentation
6. nvidia-smi documentation
7. CUDA error types
THE END
ENJOY CUDA & HAPPY NEW YEAR!
