Vahid Amiri
National Workshop of Cloud Computing
Cloud Computing Lab – Amirkabir University
Vahidamiry.ir
Nov 2012
Bio-Informatics and Life Sciences
Computational Electromagnetics and
Electrodynamics
Computational Finance
Weather, Atmospheric, Ocean Modeling
and Space Sciences
Computational Fluid Dynamics
Data Mining, Analytics, and Databases
Molecular Dynamics
Numerical Analytics
 Cluster
 Grid
 Cloud
 Parallel and distributed processing system
 Consists of a collection of interconnected stand-alone computers
 Appears as a single system to users and
applications
 Distributed, heterogeneous resources for large
experiments
 Compute and storage resources
 Network of Machines
 A larger number of resources
 Spread all over the world
 Different administrative domains
 Investment in infrastructure
 Power and Cooling Management
 Management
 Maintenance
 Complexity
 Cost
 Computing as a utility
 Easy to access
▪ Easy Configuration
 Pay-as-you-go
 Flexibility
 Scalability
 No need for infrastructure management
 IaaS
 Cloud-Based Cluster
▪ Amazon EC2
▪ GoGrid
▪ IBM
▪ Rackspace
 PaaS
 Amazon Elastic MapReduce
 Google App Engine – MapReduce Service
 SaaS
 Companies developing software solutions for applications in the cloud:
 CloudBroker
 Cyclone
 Plura Processing
 Penguin on Demand
 Supporting technical domains including:
 Computational fluid dynamics (CFD)
 Finite element analysis
 Computational chemistry and materials
 Computational biology
 Performance penalties
 Users voluntarily lose almost all control over the execution environment
 Virtualization Technology
▪ Performance loss introduced by the virtualization mechanism
 Cloud Environment
▪ Overheads and the sharing of computing and communication resources
 IaaS HPC
 MPI Cluster
 MapReduce Cluster
 GPU Cluster!!!
 …..
 General-Purpose computation using GPUs (GPGPU)
 Data-parallel algorithms leverage GPU attributes
 Using graphics hardware for non-graphics computations
 Can improve performance by orders of magnitude in certain types of applications
 GPUs contain a much larger number of dedicated ALUs than CPUs.
 GPUs also contain extensive support for the stream-processing paradigm, which is related to SIMD (Single Instruction, Multiple Data) processing.
 Each processing unit on the GPU contains local memory that improves data manipulation and reduces fetch time.
 Multiprocessor (MP) = thread processor = ALU
 The GPU is viewed as a compute device that:
 Is a coprocessor to the CPU or host
 Has its own DRAM (device memory)
 Runs many threads in parallel
 Data-parallel portions of an application are
executed on the device as kernels which run in
parallel on many threads
 Differences between GPU and CPU threads
 GPU threads are extremely lightweight
▪ Very little creation overhead
 GPU needs 1000s of threads for full efficiency
▪ Multi-core CPU needs only a few
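As a minimal sketch of this model (the kernel name and data pointer are hypothetical), a kernel is an ordinary C function marked __global__ that every GPU thread executes:

__global__ void doubleElements(float *d_data)
{
    d_data[threadIdx.x] *= 2.0f;   // each thread handles one element
}

// launched from the host with one block of 256 threads:
// doubleElements<<<1, 256>>>(d_data);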
 Host: the CPU and its memory (host memory)
 Device: the GPU and its memory (device memory)
 CUDA is a set of development tools for creating applications that execute on the GPU (Graphics Processing Unit).
 The API is an extension to the ANSI C programming language
 Low learning curve
 CUDA was developed by NVIDIA and as such can only run on NVIDIA GPUs of the G8x series and up.
 CUDA was released on February 15, 2007 for the PC, with a beta for Mac OS X following on August 19, 2008.
 A kernel is executed as a grid of thread
blocks
 A thread block is a batch of threads that
can cooperate with each other by:
 Synchronizing their execution
 Efficiently sharing data through a low-latency shared memory
[Figure: the host launches Kernel 1 and then Kernel 2; each kernel executes on the device as a grid of thread blocks (Grid 1: blocks (0,0)–(2,1)), and each block is a 2-D array of threads (Block (1,1): threads (0,0)–(4,2)).]
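A launch configuration matching the figure might be sketched as follows (the kernel name and its argument are placeholders):

dim3 grid(3, 2);                  // Grid 1: 3 x 2 blocks
dim3 block(5, 3);                 // each block: 5 x 3 threads
kernel<<<grid, block>>>(d_data);  // 6 blocks x 15 threads = 90 threads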
 Threads and blocks have IDs
 So each thread can decide
what data to work on
 Simplifies memory
addressing when processing
multidimensional data
[Figure: the same grid/block/thread hierarchy, with each block and each thread labeled by its 2-D ID.]
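For example, a one-dimensional kernel (a hypothetical sketch) combines its block ID and thread ID into a unique global index to pick its element:

__global__ void scale(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)                                      // guard the array bound
        d_data[i] *= 2.0f;
}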
 Parallel computations are arranged as grids
 One grid executes after another
 Blocks are assigned to SMs: a block runs on a single SM, and multiple blocks can be assigned to the same SM
 A block consists of elements (threads)
 Demo!
 CUDA device driver
 CUDA Software Development Kit
 CUDA Toolkit
 You (probably) need experience with C or C++
 Thread block – an array of concurrent threads
that execute the same program and can
cooperate to compute the result
 A thread ID has corresponding 1-, 2-, or 3-D indices
 Threads of a thread block share memory
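A minimal sketch of such cooperation, assuming a single block of exactly 256 threads: the threads exchange values through shared memory, synchronizing in between:

__global__ void reverseBlock(float *d_data)
{
    __shared__ float buf[256];   // shared by all threads of the block
    int t = threadIdx.x;
    buf[t] = d_data[t];
    __syncthreads();             // wait until every thread has written
    d_data[t] = buf[255 - t];    // read a value another thread wrote
}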
 Each thread can:
 R/W per-thread registers
 R/W per-thread local memory
 R/W per-block shared memory
 R/W per-grid global memory
 Read only per-grid constant memory
 Read only per-grid texture memory
 The host can R/W global,
constant, and texture
memories
[Figure: the CUDA memory model – per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory; the host can read/write global, constant, and texture memory.]
 cudaMalloc()
 Allocates object in the device Global Memory
 Requires two parameters
▪ Address of a pointer to the allocated object
▪ Size of the allocated object
 cudaFree()
 Frees object from device Global Memory

const int BLOCK_SIZE = 64;
float *d_f;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
cudaMalloc((void**)&d_f, size);
cudaFree(d_f);
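CUDA runtime calls such as cudaMalloc() return a cudaError_t status, so a cautious program checks it; for example:

cudaError_t err = cudaMalloc((void**)&d_f, size);
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));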
 cudaMemcpy()
 Memory data transfer
 Requires four parameters
▪ Pointer to source
▪ Pointer to destination
▪ Number of bytes copied
▪ Type of transfer
▪ Host to Host
▪ Host to Device
▪ Device to Host
▪ Device to Device
cudaMemcpy(d_f, f, size, cudaMemcpyHostToDevice);
cudaMemcpy(f, d_f, size, cudaMemcpyDeviceToHost);
 __global__ defines a kernel function
 Must return void
                                Executed on the:  Only callable from the:
__device__ float DeviceFunc()   device            device
__global__ void KernelFunc()    device            host
__host__ float HostFunc()       host              host
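A small illustrative sketch (both function names are hypothetical): a __device__ helper is callable only from device code, here from a __global__ kernel that the host launches:

__device__ float square(float x)   // device code, called from the device
{
    return x * x;
}

__global__ void squareAll(float *d_data, int n)   // device code, called from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] = square(d_data[i]);
}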
 Allocate the memory on the GPU
 Copy the arrays ‘a’ and ‘b’ to the GPU
 Call the kernel function
 Copy the array ‘c’ back from the GPU to the CPU
 Free the memory allocated on the GPU
 Step 1: Allocate the memory on the GPU
int a[N], b[N], c[N];
int *d_a, *d_b, *d_c;
cudaMalloc( (void**)&d_a, N * sizeof(int) );
cudaMalloc( (void**)&d_b, N * sizeof(int) );
cudaMalloc( (void**)&d_c, N * sizeof(int) );
 Step 2: Copy the arrays ‘a’ and ‘b’ to the GPU
cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
 Step 3: Call the kernel function
add<<<N,1>>>(d_a, d_b, d_c);
 Step 4: Copy the array ‘c’ back from the GPU to the
CPU
cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
 Step 5: Free the memory allocated on the GPU
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
 Kernel function
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x;
if (tid < N)
c[tid] = a[tid] + b[tid];
}
 We’ve seen parallel vector addition using:
 Several blocks with one thread each
 One block with several threads
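The two approaches combine naturally; a common sketch (the 128-thread block size is an arbitrary choice) uses several blocks of several threads and derives the element index from both IDs:

__global__ void add(int *a, int *b, int *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)                  // N as defined in the steps above
        c[tid] = a[tid] + b[tid];
}

// enough 128-thread blocks to cover all N elements:
// add<<<(N + 127) / 128, 128>>>(d_a, d_b, d_c);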
 P = M * N, each of size WIDTH x WIDTH
 One thread handles one element of P
 M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH.]
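A straightforward kernel for this scheme might look as follows; this is a sketch assuming row-major WIDTH x WIDTH matrices, where every thread re-reads a full row of M and a full column of N from global memory:

__global__ void matMul(float *M, float *N, float *P, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;   // one thread per element of P
    }
}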
 Memory latency can be hidden by keeping a
large number of threads busy
 Keep number of threads per block (block size)
and number of blocks per grid (grid size) as
large as possible
 Constant memory can be used for constant
data (variables that do not change).
 Constant memory is cached.
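A minimal sketch of declaring and filling constant memory (the symbol name coeff is hypothetical):

__constant__ float coeff[16];    // read-only on the device, cached

// host side: copy the table once, before launching kernels
float h_coeff[16] = {0.0f};      // placeholder values
cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));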
 Each thread within the
block computes one
element of Csub
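A common shared-memory formulation of this idea, sketched under the assumption that WIDTH is a multiple of the tile size: each block loads square tiles of M and N into shared memory, so each global-memory element is read only WIDTH/TILE times instead of WIDTH times:

#define TILE 16

__global__ void matMulShared(float *M, float *N, float *P, int width)
{
    __shared__ float Ms[TILE][TILE];   // tile of M
    __shared__ float Ns[TILE][TILE];   // tile of N
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < width / TILE; ++t) {
        // each thread loads one element of each tile
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();               // tiles are fully loaded
        for (int k = 0; k < TILE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();               // done reading these tiles
    }
    P[row * width + col] = sum;        // one element of Csub per thread
}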
 Recall that the “stream processors” of the
GPU are organized as MPs (multi-processors)
and every MP has its own set of resources:
 Registers
 Local memory
 The block size needs to be chosen such that
there are enough resources in an MP to
execute a block at a time.
 Critical for performance
 Recommended value is 192 or 256
 Maximum value is 512
 Limited by the number of registers on the MP
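These per-device limits can be queried at run time through the CUDA runtime API; a brief sketch:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("registers per block:   %d\n", prop.regsPerBlock);
printf("shared mem per block:  %u bytes\n", (unsigned)prop.sharedMemPerBlock);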
 Run with different block sizes!
[Chart "Data - Test 1": matrix-multiplication results for input sizes S 128, S 512, S 1024, S 3079, and S 4096, comparing block-16 against the shared-memory version; y-axis 0–3000.]
[Chart: the same test with additional series – block-16, Shared, block-32, block-64, block-128, block-512 – for sizes S 128 through S 4096; y-axis 0–3500.]