CUDA LAB
LSALAB
OVERVIEW
Programming Environment
Compile & Run CUDA Program
CUDA Tools
Lab Tasks
CUDA Programming Tips
References
GPU SERVER
Intel E5-2670 V2 10-core CPU x 2
NVIDIA K20X GPGPU card x 2
Command to get your GPGPU HW spec:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version:         5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 5760 MBytes (6039339008 bytes)
  (14) Multiprocessors, (192) CUDA Cores/MP:     2688 CUDA Cores
  GPU Clock rate:                                732 MHz (0.73 GHz)
  Memory Clock rate:                             2600 MHz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
Theoretical memory bandwidth: $2600 \times 10^{6} \times (384/8) \times 2 \div 1024^{3} = 243$ GB/s
Official HW spec details:
http://www.nvidia.com/object/tesla-servers.html
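The same kind of estimate can be computed at runtime from the device properties instead of reading deviceQuery output by hand. The following is a sketch (not part of the original deck); it uses only documented `cudaDeviceProp` fields:

```cuda
// Sketch: derive the theoretical memory bandwidth from cudaDeviceProp.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is reported in kHz; memoryBusWidth in bits.
    // DDR memory transfers twice per clock, hence the factor 2.
    double bw = prop.memoryClockRate * 1e3
              * (prop.memoryBusWidth / 8.0) * 2.0
              / (1024.0 * 1024.0 * 1024.0);
    printf("theoretical memory bandwidth: %.1f GB/s\n", bw);
    return 0;
}
```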
COMPILE & RUN CUDA
Directly compile to executable code.
GPU and CPU code are compiled and linked separately.

# compile the source code to executable file
$ nvcc a.cu -o a.out
COMPILE & RUN CUDA
The nvcc compiler translates CUDA source code into Parallel Thread Execution (PTX) language in an intermediate phase.

# keep all intermediate phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu -save-temps

$ nvcc a.cu -keep
$ ls
a.cpp1.ii  a.cpp4.ii    a.cudafe1.c    a.cudafe1.stub.c  a.cudafe2.stub.c  a.hash       a
a.cpp2.i   a.cu         a.cudafe1.cpp  a.cudafe2.c       a.fatbin          a.module_id  a
a.cpp3.i   a.cu.cpp.ii  a.cudafe1.gpu  a.cudafe2.gpu     a.fatbin.c        a.o          a

# clean all intermediate phase files
$ nvcc a.cu -keep -clean
USEFUL NVCC USAGE
Print code generation statistics:

$ nvcc -Xptxas -v reduce.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]

-Xptxas
--ptxas-options
Specify options directly to the PTX optimizing assembler.

The register count should be less than the number of available registers; otherwise the remaining registers will be spilled into local memory (off-chip).
smem stands for shared memory.
cmem stands for constant memory. The bank-#1 constant memory stores 4 bytes of constant variables.
CUDA TOOLS
cuda-memcheck: functional correctness checking suite.
nvidia-smi: NVIDIA System Management Interface.
CUDA-MEMCHECK
This tool checks the following memory errors in your program, and it also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can cause unexpected program behavior and memory misuse.

Table. Memcheck reported error types

Name                     | Description                                                                | Location | Precision
Memory access error      | Errors due to out-of-bounds or misaligned accesses to memory by a global,  | Device   | Precise
                         | local, shared, or global atomic access.                                    |          |
Hardware exception       | Errors that are reported by the hardware error reporting mechanism.        | Device   | Imprecise
Malloc/free errors       | Errors that occur due to incorrect use of malloc()/free() in CUDA kernels. | Device   | Precise
CUDA API errors          | Reported when a CUDA API call in the application returns a failure.        | Host     | Precise
cudaMalloc memory leaks  | Allocations of device memory using cudaMalloc() that have not been freed   | Host     | Precise
                         | by the application.                                                        |          |
Device heap memory leaks | Allocations of device memory using malloc() in device code that have not   | Device   | Imprecise
                         | been freed by the application.                                             |          |
CUDA-MEMCHECK
EXAMPLE
Program with a double-free fault:

int main(int argc, char *argv[])
{
    const int elemNum = 1024;
    int h_data[elemNum];
    int *d_data;
    initArray(h_data);
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **)&d_data, arraySize);
    incrOneForAll<<<1, 1024>>>(d_data);
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data); // fault: double free
    printArray(h_data);
    return 0;
}
CUDA-MEMCHECK
EXAMPLE
$ nvcc -g -G example.cu
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so [0x26d660]
=========     Host Frame:./a.out [0x42af6]
=========     Host Frame:./a.out [0x2a29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./a.out [0x2769]
=========

No error is shown if the program is run directly, but cuda-memcheck can detect the error.
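A common complement to cuda-memcheck is to check API return codes on the host. The macro below is a sketch (not part of the lab code); the name CUDA_CHECK is an illustrative choice:

```cuda
// Sketch: wrap CUDA API calls so failures such as the double cudaFree()
// above surface immediately, even without running under cuda-memcheck.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaFree(d_data));
// A second cudaFree() on the same pointer would abort here with an
// error string instead of failing silently.
```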
NVIDIA SYSTEM MANAGEMENT INTERFACE
(NVIDIA-SMI)
Purpose: query and modify GPU devices' state.
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm          On  | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    60W / 235W |     84MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm          On  | 0000:85:00.0     Off |                    0 |
| N/A   39C    P0    60W / 235W |     14MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     33736  ./RS                                             69MB       |
+-----------------------------------------------------------------------------+
NVIDIA-SMI
You can query more specific information on temperature, memory, power, etc.:

$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
For example:
$ nvidia-smi -q -d POWER
============== NVSMI LOG ==============
Timestamp                 :
Driver Version            : 319.37
Attached GPUs             : 2
GPU 0000:0B:00.0
    Power Readings
        Power Management      : Supported
        Power Draw            : 60.71 W
        Power Limit           : 235.00 W
        Default Power Limit   : 235.00 W
        Enforced Power Limit  : 235.00 W
        Min Power Limit       : 150.00 W
        Max Power Limit       : 235.00 W
GPU 0000:85:00.0
    Power Readings
        Power Management      : Supported
        Power Draw            : 31.38 W
        Power Limit           : 235.00 W
        Default Power Limit   : 235.00 W
LAB ASSIGNMENTS
1. Program #1: increase each element in an array by one.
   (You are required to rewrite a CPU program into a CUDA one.)
2. Program #2: use parallel reduction to calculate the sum of all the elements in an array.
   (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
   1. SUM CUDA programming with "multi-kernel and shared memory"
   2. SUM CUDA programming with "interleaved addressing"
   3. SUM CUDA programming with "sequential addressing"
   4. SUM CUDA programming with "first add during load"
0.2 points per task.
LAB ASSIGNMENT #1
Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:

// increase one for all the elements
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i)
    {
        array[i]++;
    }
}
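One possible shape of the CUDA version, as a sketch (the launch configuration and bounds check are assumptions, not part of the assignment template):

```cuda
// Sketch: each thread increments one element; the bounds check covers
// the case where elemNum is not a multiple of the block size.
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum)
        array[i]++;
}

// Launch with enough blocks to cover the whole array:
// incrOneForAll<<<(elemNum + 1023) / 1024, 1024>>>(d_array, elemNum);
```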
LAB ASSIGNMENT #2
Fill in the CUDA kernel function. Part of the main function is given; you are required to fill in the blanks according to the comments:

__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // TODO: load the content of global memory to shared memory
    // NOTE: synchronize all the threads after this step
    // TODO: sum calculation
    // NOTE: synchronize all the threads after each iteration
    // TODO: write back the result into the corresponding entry of global memory
    // NOTE: only one thread is enough to do the job
}

// parameters for the first kernel
// TODO: set grid and block size
// threadNum = ?
// blockNum = ?
int sMemSize = 1024 * sizeof(int);
reduce<<<blockNum, threadNum, sMemSize>>>(d_idata, d_odata); // grid size first, then block size
Hint: for the "first add during global load" optimization (Assignment #2-4), the third kernel is unnecessary.
LAB ASSIGNMENT #2
Given $2^{22}$ ints, and each block has the maximum block size of $2^{10}$ threads:
How to use 3 kernels to synchronize between iterations?
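A host-side sketch of the three-kernel approach with those sizes (buffer names d_in, d_partial, d_partial2, d_out are placeholders): each kernel launch boundary acts as a grid-wide synchronization point, which a single kernel cannot provide across blocks.

```cuda
// Sketch: each launch reduces blockSize elements per block to one
// partial sum; the next launch starts only after the previous finishes.
const int blockSize = 1 << 10;              // 1024 threads per block
int n = 1 << 22;                            // 2^22 elements to reduce
int smem = blockSize * sizeof(int);

reduce<<<n / blockSize, blockSize, smem>>>(d_in, d_partial);       // 2^22 -> 2^12
n /= blockSize;                                                    // n = 2^12
reduce<<<n / blockSize, blockSize, smem>>>(d_partial, d_partial2); // 2^12 -> 4
n /= blockSize;                                                    // n = 4
reduce<<<1, n, smem>>>(d_partial2, d_out);                         // 4 -> 1
```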
LAB ASSIGNMENT #2-1
Implement the naive data parallelism assignment as follows:
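The original slide illustrates this step with a figure; as a sketch, the naive in-kernel loop uses interleaved addressing with a modulo test (sdata is the shared array from the template; this assumes blockDim.x is a power of two):

```cuda
// Sketch: thread 0 adds sdata[1], thread 2 adds sdata[3], and so on,
// doubling the stride each iteration. The modulo test causes heavy
// warp divergence, which later assignments remove.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (threadIdx.x % (2 * s) == 0)
        sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
}
```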
LAB ASSIGNMENT #2-2
Reduce the number of active warps in your program:
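The slide's figure is not in this transcript; as a sketch, one way to reduce active warps is to map consecutive thread IDs onto the strided indices, so whole warps retire early instead of idling in divergent branches (sdata and power-of-two blockDim.x assumed as before):

```cuda
// Sketch: compact the active threads. Thread t handles index 2*s*t,
// so the working threads are contiguous and inactive warps can exit.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    unsigned int index = 2 * s * threadIdx.x;
    if (index < blockDim.x)
        sdata[index] += sdata[index + s];
    __syncthreads();
}
```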
LAB ASSIGNMENT #2-3
Prevent shared memory access bank conflicts:
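Again the figure is missing here; the usual fix, sketched below under the same assumptions, is sequential addressing: the stride starts at half the block size and halves each step, so active threads touch consecutive shared memory words and do not contend for banks.

```cuda
// Sketch: sequential addressing. Threads 0..s-1 read sdata[tid + s],
// a conflict-free contiguous range, instead of strided locations.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
}
```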
LAB ASSIGNMENT #2-4
Reduce the number of blocks in each kernel:
Notice:
Only 2 kernels are needed in this case because each kernel can now process twice the amount of data as before.
Global memory should be accessed in a sequential addressing way.
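A sketch of the "first add during load" idea (g_idata and sdata as in the template; variable names are illustrative): each block loads two elements per thread and adds them on the way into shared memory, which is why half the blocks, and one fewer kernel, suffice.

```cuda
// Sketch: each thread performs its first addition while loading from
// global memory, so every block covers 2 * blockDim.x input elements.
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[threadIdx.x] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();
```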
CUDA PROGRAMMING TIPS
KERNEL LAUNCH
mykernel<<<gridSize, blockSize, sMemSize, streamID>>>(args);

gridSize: number of blocks per grid
blockSize: number of threads per block
sMemSize [optional]: shared memory size (in bytes)
streamID [optional]: stream ID; the default is 0
BUILT-IN VARIABLES FOR INDEXING IN A
KERNEL FUNCTION
blockIdx.x, blockIdx.y, blockIdx.z: block index
threadIdx.x, threadIdx.y, threadIdx.z: thread index
gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
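For example, the common 1-D global element index combines three of these variables:

```cuda
// One element per thread: a unique global index across the whole grid.
int i = blockIdx.x * blockDim.x + threadIdx.x;

// Total number of threads in the grid, useful as a grid-stride step
// when the array is larger than the grid.
int stride = gridDim.x * blockDim.x;
```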
CUDAMEMCPY
cudaError_t cudaMemcpy(void *dst,
                       const void *src,
                       size_t count,
                       enum cudaMemcpyKind kind)

Enumerator:
cudaMemcpyHostToHost: Host -> Host
cudaMemcpyHostToDevice: Host -> Device
cudaMemcpyDeviceToHost: Device -> Host
cudaMemcpyDeviceToDevice: Device -> Device
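A typical round trip using the two most common kinds (a sketch; sizes and names are illustrative):

```cuda
// Sketch: copy input to the device, run kernels, copy results back.
int h_data[1024];
int *d_data;
cudaMalloc((void **)&d_data, sizeof(h_data));
cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
// ... kernel launches on d_data ...
cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
cudaFree(d_data);
```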
SYNCHRONIZATION
__syncthreads(): synchronizes all threads in a block (used inside the kernel function).
cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).

kernel1<<<gridSize, blockSize>>>(args);
cudaDeviceSynchronize();
kernel2<<<gridSize, blockSize>>>(args);
HOW TO MEASURE KERNEL EXECUTION TIME
USING CUDA GPU TIMERS
Methods:
cudaEventCreate(): initialize a timer
cudaEventDestroy(): destroy a timer
cudaEventRecord(): set a timer
cudaEventSynchronize(): synchronize the timer after each kernel call
cudaEventElapsedTime(): returns the elapsed time in milliseconds
Example:
HOW TO MEASURE KERNEL EXECUTION TIME
USING CUDA GPU TIMERS
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, threads>>>(d_idata, d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
REFERENCES
1. NVIDIA CUDA Runtime API
2. Programming Guide :: CUDA Toolkit Documentation
3. Best Practices Guide :: CUDA Toolkit Documentation
4. NVCC :: CUDA Toolkit Documentation
5. CUDA-MEMCHECK :: CUDA Toolkit Documentation
6. nvidia-smi documentation
7. CUDA error types
THE END
ENJOY CUDA & HAPPY NEW YEAR!
