Heterogeneous Parallel Programming
Class of 2014

Week 1 Summary

Update 1

CUDA

Pipat Methavanitpong
Heterogeneous Computing
● Diversity of Computing Units
  ○ CPU, GPU, DSP, Configurable Cores, Cloud Computing
● Right Man, Right Job
  ○ Each application requires a different orientation to perform best
● Application Examples
  ○ Financial Analysis, Scientific Simulation, Digital Audio Processing,
    Computer Vision, Numerical Methods, Interactive Physics

Latency and Throughput Orientation
● Latency Orientation
  ○ Min Time: finish a single task as fast as possible
  ○ Smart / Weak
  ○ Best Path
● Throughput Orientation
  ○ Max Throughput: finish as many tasks as possible per unit time
  ○ Stupid / Strong
  ○ Brute Force

Latency and Throughput Orientation
● CPU: Best for Sequential (Latency Oriented)
  ○ Powerful ALUs
    ■ Few
    ■ Low Latency
    ■ Lightly Pipelined
  ○ Large Cache
    ■ Lower latency than RAM
  ○ Sophisticated Control
    ■ Smart branch prediction (which INSN* to take)
    ■ Smart hazard handling
● GPU: Best for Parallel (Throughput Oriented)
  ○ Weak ALUs
    ■ Many
    ■ High Latency
    ■ Heavily Pipelined
  ○ Small Cache
    ■ But boosts memory throughput
  ○ Simple Control
    ■ No branch prediction
    ■ No data forwarding
*INSN = instruction
Latency and Throughput Orientation
[Figure: die diagrams. CPU: one large Control unit, a few powerful ALUs, a large Cache, DRAM. GPU: many small ALUs, little cache, DRAM.]
System Cost
● System Cost = Hardware Cost + Software Cost
● Software cost dominates after 2010
● Reduce Software Cost = One on Many (write the software once, run it on many machines)
  ○ Scalability
    ■ Same Arch / New Hardware offers: # of cores, pipeline depth, vector length
  ○ Portability
    ■ Different Arch: x86, ARM
    ■ Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem
Data Parallelism
Manipulation of data in parallel, e.g. Vector Addition:

A[0]  A[1]  A[2]  A[3]
  +     +     +     +
B[0]  B[1]  B[2]  B[3]
  =     =     =     =
C[0]  C[1]  C[2]  C[3]

Each element-wise addition is independent of the others, so all four can execute at the same time (see the sequential form sketched below).
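
To make the pattern concrete, here is the sequential C form of the same computation, a minimal sketch (function and variable names are illustrative). Since no iteration depends on another, each iteration can become its own thread, which is exactly what the CUDA sample at the end of this summary does.

/* Sequential vector addition: every iteration is independent,
 * which is what makes the work data-parallel. */
void vecAddSequential(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* iteration i touches only index i */
}
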
Introduction to CUDA
➔ CUDA = Compute Unified Device Architecture
➔ Introduced by NVIDIA
➔ Distributes workload from a Host to CUDA-capable Devices
➔ NVIDIA GPU = Throughput Oriented = Best for Parallel
➔ Using a GPU for general computation like a CPU = GPGPU
➔ GPGPU = General-Purpose computing on GPUs
➔ Extends C / C++ / Fortran
CUDA Thread Organization
● Grid = [Vector ~ 3D Matrix] of Blocks
  ○ Block = [Vector ~ 3D Matrix] of Threads
    ■ Thread = the unit that computes

[Figure: a Grid made of Blocks; each Block made of Threads]
CUDA Thread Organization
Grid Dimension declaration:  dim3 DimGrid(x,y,z);   *variable name can be anything
Block Dimension declaration: dim3 DimBlock(x,y,z);  *variable name can be anything

Example: dim3 DimGrid(2,1,1); dim3 DimBlock(256,1,1);
→ a grid of two blocks (Block 0 and Block 1), each holding 256 threads (t0, t1, t2, ..., t255)
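
Inside a kernel, each thread can locate itself with the built-in variables blockIdx, blockDim, and threadIdx. A minimal sketch (kernel name is illustrative) of the global-index idiom used throughout this summary:

__global__ void whoAmI(int *d_out) {
    // With DimGrid(2,1,1) and DimBlock(256,1,1):
    // block 1, thread 3 -> 1 * 256 + 3 = global index 259
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[globalIdx] = globalIdx;   // each thread writes its own slot
}
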
CUDA Memory Organization
● A Thread has its own private Registers
● Threads in a Block have common Shared Memory
● Blocks in the same Grid have common Global and Constant Memory
● But the Host can only access Global and Constant Memory

[Figure: a Grid of Blocks; each Block holds per-Thread Registers and one Shared memory; Global and Constant memory sit below the Grid and are also reachable from the HOST]
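
A minimal kernel sketch touching all three memory spaces (the per-block summation logic is illustrative, not from the slides; it assumes blocks of at most 256 threads):

__global__ void sumPerBlock(const int *g_in, int *g_out) {  // g_* live in global memory
    __shared__ int s_buf[256];   // shared memory: one copy per block
    int tid = threadIdx.x;       // local variables live in per-thread registers
    s_buf[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();             // wait until every thread in the block has written
    if (tid == 0) {              // one thread per block reduces its block's values
        int total = 0;
        for (int i = 0; i < blockDim.x; i++)
            total += s_buf[i];
        g_out[blockIdx.x] = total;   // back to global memory, visible to the host
    }
}
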
Memory Management Commands
Prototypes:

// Allocate Memory on Device
cudaError_t cudaMalloc(void** devPtr, size_t size)

// Copy Data
cudaError_t cudaMemcpy(void* dst, const void* src,
                       size_t size, enum cudaMemcpyKind kind)

// Free Memory on Device
cudaError_t cudaFree(void* devPtr)

size - size in bytes

typedef enum cudaError cudaError_t
enum cudaError (first values)
0. cudaSuccess
1. cudaErrorMissingConfiguration
2. cudaErrorMemoryAllocation
3. cudaErrorInitializationError
4. cudaErrorLaunchFailure
5. cudaErrorPriorLaunchFailure
6. cudaErrorLaunchTimeout
7. cudaErrorLaunchOutOfResources
8. cudaErrorInvalidDeviceFunction
9. cudaErrorInvalidConfiguration
10. cudaErrorInvalidDevice
…

enum cudaMemcpyKind
0. cudaMemcpyHostToHost
1. cudaMemcpyHostToDevice
2. cudaMemcpyDeviceToHost
3. cudaMemcpyDeviceToDevice
4. cudaMemcpyDefault

For more information:
http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY.html
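
Put together, a minimal host-side round trip looks like the sketch below (buffer names and size are illustrative; the vector-addition sample later in this summary expands on this):

int n = 1024;
size_t size = n * sizeof(float);
float *h_buf = (float *) malloc(size);   // host buffer
float *d_buf;                            // device buffer

cudaMalloc((void **) &d_buf, size);                       // allocate on device
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // host -> device
/* ... launch kernels that use d_buf ... */
cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_buf);                                          // release device memory
free(h_buf);
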
Kernel
The term for a function that runs on the Device and is called by the Host
Declared by adding an attribute to the function

Attribute     Return Type   Function Type    Executed on   Only Callable from
__device__    any           DeviceFunc()     device        device
__global__    void          KernelFunc()     device        host
__host__*     any           HostFunc()       host          host
*this attribute is optional

Start a kernel function by giving it a Grid & Block structure and parameters:
KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …);

Wait for all launched tasks to complete before moving on:
cudaDeviceSynchronize();
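
A minimal sketch showing all three attributes together (function names are illustrative):

__device__ float square(float x) {   // runs on device, callable only from device code
    return x * x;
}

__global__ void squareAll(float *d_v, int n) {   // kernel: runs on device, launched by host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_v[i] = square(d_v[i]);
}

__host__ void launchSquareAll(float *d_v, int n) {   // __host__ may be omitted
    squareAll<<<(n - 1) / 256 + 1, 256>>>(d_v, n);   // enough 256-thread blocks to cover n
    cudaDeviceSynchronize();                         // wait for the kernel to finish
}
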
Row-Major Layout
A way of addressing an element in an array:
a multi-dimensional array can be addressed as a 1D array.
C / C++ use Row-Major Layout; Fortran uses Column-Major Layout.

A 3×4 matrix stored row by row:

2D: A0,0 A0,1 A0,2 A0,3 | A1,0 A1,1 A1,2 A1,3 | A2,0 A2,1 A2,2 A2,3
1D: A0   A1   A2   A3   | A4   A5   A6   A7   | A8   A9   A10  A11
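
So element (row, col) of a matrix with width columns lives at 1D offset row * width + col; a minimal sketch:

// Row-major indexing: flatten a 2D coordinate into a 1D offset.
// For the 3x4 matrix above, element (2,3) maps to 2 * 4 + 3 = A11.
int rowMajorIndex(int row, int col, int width) {
    return row * width + col;
}
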
Sample Code: Vector Addition
__global__ void vecAdd(int *d_vIn1, int *d_vIn2, int *d_vOut, int n) {
    int pos = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (pos < n)  // guard: the last block may have threads past the end of the vectors
        d_vOut[pos] = d_vIn1[pos] + d_vIn2[pos];
}
…
int main() {
    int vecLength = …;
    int* h_input1 = {…}; int* h_input2 = {…};  // host input vectors (elided)
    int* h_output = (int *) malloc(vecLength * sizeof(int));
    int *d_input1, *d_input2, *d_output;  // each declarator needs its own *
    cudaMalloc((void **) &d_input1, vecLength * sizeof(int));
    cudaMalloc((void **) &d_input2, vecLength * sizeof(int));
    cudaMalloc((void **) &d_output, vecLength * sizeof(int));
    cudaMemcpy(d_input1, h_input1, vecLength*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_input2, h_input2, vecLength*sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimGrid((vecLength-1)/256+1, 1, 1);  // enough blocks to cover every element
    dim3 dimBlock(256, 1, 1);                 // 256 threads per block
    vecAdd<<<dimGrid,dimBlock>>>(d_input1, d_input2, d_output, vecLength);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, vecLength*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_input1); cudaFree(d_input2); cudaFree(d_output);
    free(h_output);
    return 0;
}
Error Checking Pattern
cudaError_t err = cudaMalloc((void **) &d_input1, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
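
Repeating this check after every CUDA call gets verbose; a common convention (not from the original slides) wraps the pattern in a macro:

// Hypothetical convenience macro wrapping the error-checking pattern above.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("%s in %s at line %d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **) &d_input1, size));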
