Heterogeneous Parallel Programming
Class of 2014

Week 1 Summary

Update 1

CUDA

Pipat Methavanitpong
Heterogeneous Computing
● Diversity of Computing Units
  ○ CPU, GPU, DSP, Configurable Cores, Cloud Computing
● Right Man, Right Job
  ○ Each application requires a different orientation to perform best
● Application Examples
  ○ Financial Analysis, Scientific Simulation, Digital Audio Processing,
    Computer Vision, Numerical Methods, Interactive Physics

Latency and Throughput Orientation
● Latency Orientation
  ○ Min Time: finish a single task as fast as possible
  ○ Smart / Weak
  ○ Best Path
● Throughput Orientation
  ○ Max Throughput: finish as many tasks as possible per unit time
  ○ Stupid / Strong
  ○ Brute Force

Latency and Throughput Orientation
● CPU: Best for Sequential (Latency Oriented)
  ○ Powerful ALUs
    ■ Few
    ■ Low Latency
    ■ Lightly Pipelined
  ○ Large Cache
    ■ Lower latency than RAM
  ○ Sophisticated Control
    ■ Smart branch prediction (which INSN* to take)
    ■ Smart hazard handling
● GPU: Best for Parallel (Throughput Oriented)
  ○ Weak ALUs
    ■ Many
    ■ High Latency
    ■ Heavily Pipelined
  ○ Small Cache
    ■ But boosts memory throughput
  ○ Simple Control
    ■ No branch prediction
    ■ No data forwarding
*INSN = instruction
Latency and Throughput Orientation
[Figure: die diagrams. CPU: one large Control unit, a few powerful ALUs, a large Cache, DRAM. GPU: many small ALUs, little cache, DRAM.]
System Cost
● System Cost = Hardware Cost + Software Cost
● Software cost dominates after 2010
● Reduce Software Cost = One on Many (write the software once, run it on many machines)
  ○ Scalability
    ■ Same Arch / New Hardware offers: # of cores, pipeline depth, vector length
  ○ Portability
    ■ Different Arch: x86, ARM
    ■ Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem
Data Parallelism
Manipulation of data in parallel, e.g. Vector Addition:

A[0]  A[1]  A[2]  A[3]
  +     +     +     +
B[0]  B[1]  B[2]  B[3]
  =     =     =     =
C[0]  C[1]  C[2]  C[3]

Each element-wise addition is independent of the others, so all four can execute at the same time (see the sequential form sketched below).
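
To make the pattern concrete, here is the sequential C form of the same computation, a minimal sketch (function and variable names are illustrative). Since no iteration depends on another, each iteration can become its own thread, which is exactly what the CUDA sample at the end of this summary does.

/* Sequential vector addition: every iteration is independent,
 * which is what makes the work data-parallel. */
void vecAddSequential(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* iteration i touches only index i */
}
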
Introduction to CUDA
➔ CUDA = Compute Unified Device Architecture
➔ Introduced by NVIDIA
➔ Distributes workload from a Host to CUDA-capable Devices
➔ NVIDIA GPU = Throughput Oriented = Best for Parallel
➔ Using a GPU for general computation like a CPU = GPGPU
➔ GPGPU = General-Purpose computing on GPUs
➔ Extends C / C++ / Fortran
CUDA Thread Organization
● Grid = [Vector ~ 3D Matrix] of Blocks
  ○ Block = [Vector ~ 3D Matrix] of Threads
    ■ Thread = the unit that computes

[Figure: a Grid made of Blocks; each Block made of Threads]
CUDA Thread Organization
Grid Dimension declaration:  dim3 DimGrid(x,y,z);   *variable name can be anything
Block Dimension declaration: dim3 DimBlock(x,y,z);  *variable name can be anything

Example: dim3 DimGrid(2,1,1); dim3 DimBlock(256,1,1);
→ a grid of two blocks (Block 0 and Block 1), each holding 256 threads (t0, t1, t2, ..., t255)
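
Inside a kernel, each thread can locate itself with the built-in variables blockIdx, blockDim, and threadIdx. A minimal sketch (kernel name is illustrative) of the global-index idiom used throughout this summary:

__global__ void whoAmI(int *d_out) {
    // With DimGrid(2,1,1) and DimBlock(256,1,1):
    // block 1, thread 3 -> 1 * 256 + 3 = global index 259
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[globalIdx] = globalIdx;   // each thread writes its own slot
}
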
CUDA Memory Organization
● A Thread has its own private Registers
● Threads in a Block have common Shared Memory
● Blocks in the same Grid have common Global and Constant Memory
● But the Host can only access Global and Constant Memory

[Figure: a Grid of Blocks; each Block holds per-Thread Registers and one Shared memory; Global and Constant memory sit below the Grid and are also reachable from the HOST]
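
A minimal kernel sketch touching all three memory spaces (the per-block summation logic is illustrative, not from the slides; it assumes blocks of at most 256 threads):

__global__ void sumPerBlock(const int *g_in, int *g_out) {  // g_* live in global memory
    __shared__ int s_buf[256];   // shared memory: one copy per block
    int tid = threadIdx.x;       // local variables live in per-thread registers
    s_buf[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();             // wait until every thread in the block has written
    if (tid == 0) {              // one thread per block reduces its block's values
        int total = 0;
        for (int i = 0; i < blockDim.x; i++)
            total += s_buf[i];
        g_out[blockIdx.x] = total;   // back to global memory, visible to the host
    }
}
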
Memory Management Commands
Prototypes:

// Allocate Memory on Device
cudaError_t cudaMalloc(void** devPtr, size_t size)

// Copy Data
cudaError_t cudaMemcpy(void* dst, const void* src,
                       size_t size, enum cudaMemcpyKind kind)

// Free Memory on Device
cudaError_t cudaFree(void* devPtr)

size - size in bytes

typedef enum cudaError cudaError_t
enum cudaError (first values)
0. cudaSuccess
1. cudaErrorMissingConfiguration
2. cudaErrorMemoryAllocation
3. cudaErrorInitializationError
4. cudaErrorLaunchFailure
5. cudaErrorPriorLaunchFailure
6. cudaErrorLaunchTimeout
7. cudaErrorLaunchOutOfResources
8. cudaErrorInvalidDeviceFunction
9. cudaErrorInvalidConfiguration
10. cudaErrorInvalidDevice
…

enum cudaMemcpyKind
0. cudaMemcpyHostToHost
1. cudaMemcpyHostToDevice
2. cudaMemcpyDeviceToHost
3. cudaMemcpyDeviceToDevice
4. cudaMemcpyDefault

For more information:
http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY.html
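
Put together, a minimal host-side round trip looks like the sketch below (buffer names and size are illustrative; the vector-addition sample later in this summary expands on this):

int n = 1024;
size_t size = n * sizeof(float);
float *h_buf = (float *) malloc(size);   // host buffer
float *d_buf;                            // device buffer

cudaMalloc((void **) &d_buf, size);                       // allocate on device
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // host -> device
/* ... launch kernels that use d_buf ... */
cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_buf);                                          // release device memory
free(h_buf);
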
Kernel
The term for a function that runs on the Device and is called by the Host
Declared by adding an attribute to the function

Attribute     Return Type   Function Type    Executed on   Only Callable from
__device__    any           DeviceFunc()     device        device
__global__    void          KernelFunc()     device        host
__host__*     any           HostFunc()       host          host
*this attribute is optional

Start a kernel function by giving it a Grid & Block structure and parameters:
KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …);

Wait for all launched tasks to complete before moving on:
cudaDeviceSynchronize();
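
A minimal sketch showing all three attributes together (function names are illustrative):

__device__ float square(float x) {   // runs on device, callable only from device code
    return x * x;
}

__global__ void squareAll(float *d_v, int n) {   // kernel: runs on device, launched by host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_v[i] = square(d_v[i]);
}

__host__ void launchSquareAll(float *d_v, int n) {   // __host__ may be omitted
    squareAll<<<(n - 1) / 256 + 1, 256>>>(d_v, n);   // enough 256-thread blocks to cover n
    cudaDeviceSynchronize();                         // wait for the kernel to finish
}
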
Row-Major Layout
A way of addressing an element in an array:
a multi-dimensional array can be addressed as a 1D array.
C / C++ use Row-Major Layout; Fortran uses Column-Major Layout.

A 3×4 matrix stored row by row:

2D: A0,0 A0,1 A0,2 A0,3 | A1,0 A1,1 A1,2 A1,3 | A2,0 A2,1 A2,2 A2,3
1D: A0   A1   A2   A3   | A4   A5   A6   A7   | A8   A9   A10  A11
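
So element (row, col) of a matrix with width columns lives at 1D offset row * width + col; a minimal sketch:

// Row-major indexing: flatten a 2D coordinate into a 1D offset.
// For the 3x4 matrix above, element (2,3) maps to 2 * 4 + 3 = A11.
int rowMajorIndex(int row, int col, int width) {
    return row * width + col;
}
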
Sample Code: Vector Addition
__global__ void vecAdd(int *d_vIn1, int *d_vIn2, int *d_vOut, int n) {
    int pos = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (pos < n)  // guard: the last block may have threads past the end of the vectors
        d_vOut[pos] = d_vIn1[pos] + d_vIn2[pos];
}
…
int main() {
    int vecLength = …;
    int* h_input1 = {…}; int* h_input2 = {…};  // host input vectors (elided)
    int* h_output = (int *) malloc(vecLength * sizeof(int));
    int *d_input1, *d_input2, *d_output;  // each declarator needs its own *
    cudaMalloc((void **) &d_input1, vecLength * sizeof(int));
    cudaMalloc((void **) &d_input2, vecLength * sizeof(int));
    cudaMalloc((void **) &d_output, vecLength * sizeof(int));
    cudaMemcpy(d_input1, h_input1, vecLength*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_input2, h_input2, vecLength*sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimGrid((vecLength-1)/256+1, 1, 1);  // enough blocks to cover every element
    dim3 dimBlock(256, 1, 1);                 // 256 threads per block
    vecAdd<<<dimGrid,dimBlock>>>(d_input1, d_input2, d_output, vecLength);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, vecLength*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_input1); cudaFree(d_input2); cudaFree(d_output);
    free(h_output);
    return 0;
}
Error Checking Pattern
cudaError_t err = cudaMalloc((void **) &d_input1, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
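
Repeating this check after every CUDA call gets verbose; a common convention (not from the original slides) wraps the pattern in a macro:

// Hypothetical convenience macro wrapping the error-checking pattern above.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("%s in %s at line %d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **) &d_input1, size));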
