Optimizing Performance
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes
This lecture discusses three important optimizations:
- The performance impact of mapping threads to data on the GPU is subtle but extremely important. Examples are shown (including a detailed matrix transpose) along with actual empirical results.
- The number of threads that are active on the GPU can also play a large part in achieving good performance, so the subtleties of GPU occupancy are discussed.
- Vectorization is particularly important for AMD GPUs and is briefly discussed as well.
Topics
- Thread mapping
  - Choosing a proper mapping
  - Optimizing with local memory
- Device occupancy
- Vectorization
Thread Mapping
Thread mapping determines which threads will access which data:
- Proper mappings can align with hardware and provide large performance benefits.
- Improper mappings can be disastrous to performance.
The paper "Static Memory Access Pattern Analysis on a Massively Parallel GPU" by Jang et al. focuses on the task of effectively mapping threads to the data access patterns of an algorithm.
Thread Mapping
By using different mappings, the same thread can be assigned to access different data elements. The examples below show three different possible mappings of threads to data (assuming the thread ID is used to access an element):

int group_size = get_local_size(0) * get_local_size(1);
int tid = get_group_id(1) * get_num_groups(0) * group_size +
          get_group_id(0) * group_size +
          get_local_id(1) * get_local_size(0) +
          get_local_id(0);

int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);

int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);

[Figure: three 4x4 grids showing the thread IDs each mapping assigns to the data elements; the group-based mapping assumes 2x2 groups.]
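To make the mappings concrete, here is a minimal kernel sketch (ours, not from the slides; the kernel and buffer names are illustrative assumptions) that records each thread's column-major ID at its row-major position, so dumping the buffer row by row visualizes the remapping:

__kernel void show_mapping(__global int* out)
{
    // Canonical row-major position of this thread
    int pos = get_global_id(1) * get_global_size(0) + get_global_id(0);
    // The same thread's ID under the column-major mapping
    int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);
    out[pos] = tid;   // printing out[] row by row shows the mapping
}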
Thread Mapping
Consider a serial matrix multiplication algorithm:
- This algorithm is suited for output data decomposition.
- We will create N x M threads, effectively removing the outer two loops.
- Each thread will perform P calculations; the inner loop will remain as part of the kernel.
Should the index space be MxN or NxM?
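For reference, a sketch of the serial algorithm under this decomposition (our reconstruction; the dimensions M x P for A, P x N for B, M x N for C, row-major storage, and the loop names i1, i2, i3 are assumptions consistent with the later slides):

void matmul_serial(const float *A, const float *B, float *C,
                   int M, int N, int P)
{
    for (int i1 = 0; i1 < M; i1++) {         /* outer loop: rows of C */
        for (int i2 = 0; i2 < N; i2++) {     /* middle loop: columns of C */
            float sum = 0.0f;
            for (int i3 = 0; i3 < P; i3++)   /* inner loop: stays in the kernel */
                sum += A[i1 * P + i3] * B[i3 * N + i2];
            C[i1 * N + i2] = sum;
        }
    }
}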
Thread Mapping
Thread mapping 1: with an MxN index space, the kernel would be as in the first sketch below. Thread mapping 2: with an NxM index space, the kernel would be as in the second sketch. Both mappings produce functionally equivalent versions of the program.
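The slide's kernel listings are images; the following is a hedged reconstruction consistent with the coalescing discussion on the next slides (identifier names are ours):

// Mapping 1 (MxN index space): get_global_id(0) selects the row of C,
// so consecutive threads touch different rows (uncoalesced in B and C)
__kernel void mmul_mapping1(__global const float* A, __global const float* B,
                            __global float* C, int M, int N, int P)
{
    int i1 = get_global_id(0);   // row of C
    int i2 = get_global_id(1);   // column of C
    float sum = 0.0f;
    for (int i3 = 0; i3 < P; i3++)
        sum += A[i1 * P + i3] * B[i3 * N + i2];
    C[i1 * N + i2] = sum;
}

// Mapping 2 (NxM index space): get_global_id(0) selects the column of C,
// so consecutive threads read/write consecutive (coalesced) elements
__kernel void mmul_mapping2(__global const float* A, __global const float* B,
                            __global float* C, int M, int N, int P)
{
    int i2 = get_global_id(0);   // column of C
    int i1 = get_global_id(1);   // row of C
    float sum = 0.0f;
    for (int i3 = 0; i3 < P; i3++)
        sum += A[i1 * P + i3] * B[i3 * N + i2];
    C[i1 * N + i2] = sum;
}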
Thread Mapping
This figure shows the execution of the two thread mappings on NVIDIA GeForce 285 and 8800 GPUs. Notice that mapping 2 is far superior in performance for both GPUs.
Thread Mapping
The discrepancy in execution times between the mappings is due to data accesses on the global memory bus:
- Assuming row-major data, the elements of a row (i.e., elements in adjacent columns) are stored sequentially in memory.
- To ensure coalesced accesses, consecutive threads in the same wavefront should be mapped to columns (the second dimension) of the matrices. This gives coalesced accesses in Matrices B and C.
- For Matrix A, the iterator i3 determines the access pattern for row-major data, so thread mapping does not affect it.
Thread Mapping
In mapping 1, consecutive threads (tx) are mapped to different rows of Matrix C, and non-consecutive threads (ty) are mapped to columns of Matrix B. This mapping causes inefficient memory accesses.
Thread Mapping
In mapping 2, consecutive threads (tx) are mapped to consecutive elements in Matrices B and C. Accesses to both of these matrices will be coalesced; the degree of coalescence depends on the workgroup and data sizes.
Thread Mapping
In general, threads can be created and mapped to any data element by manipulating the values returned by the thread identifier functions. The following matrix transpose example shows how thread IDs can be modified to achieve efficient memory accesses.
Matrix Transpose
A matrix transpose is a straightforward operation: Out(x,y) = In(y,x). No matter which thread mapping is chosen, one operation (read/write) will produce coalesced accesses while the other (write/read) produces uncoalesced accesses. Note that data must be read to a temporary location (such as a register) before being written to a new location.
[Figure: two thread mappings over In and Out; in one, the read is coalesced and the write uncoalesced, and in the other the reverse. A sketch of the naive kernel follows.]
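A sketch of the straightforward kernel (ours; In is H rows by W columns, row-major):

__kernel void transpose_naive(__global float* Out, __global const float* In,
                              int W, int H)
{
    int x = get_global_id(0);    // column of In
    int y = get_global_id(1);    // row of In
    float val = In[y * W + x];   // coalesced: consecutive x -> consecutive addresses
    Out[x * H + y] = val;        // uncoalesced: consecutive x -> stride of H
}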
Matrix Transpose
If local memory is used to buffer the data between reading and writing, we can rearrange the thread mapping to provide coalesced accesses in both directions. Note that the work group must be square.
[Figure: threads read a tile of In with coalesced accesses into local memory, then write it to Out through remapped global and local memory indices so the write is also coalesced. A sketch of this kernel follows.]
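A hedged sketch of the local-memory version (ours; TILE is the square work-group edge, and padding to avoid local-memory bank conflicts is omitted for brevity):

#define TILE 16   // work group is TILE x TILE (must be square)

__kernel void transpose_local(__global float* Out, __global const float* In,
                              int W, int H)
{
    __local float tile[TILE * TILE];
    int lx = get_local_id(0), ly = get_local_id(1);

    // Coalesced read: consecutive lx reads consecutive elements of In
    int gx = get_group_id(0) * TILE + lx;
    int gy = get_group_id(1) * TILE + ly;
    tile[ly * TILE + lx] = In[gy * W + gx];

    barrier(CLK_LOCAL_MEM_FENCE);   // whole tile must be loaded before writing

    // Remapped indices: swap the group coordinates so that consecutive lx
    // writes consecutive elements of Out (coalesced in both directions)
    int ox = get_group_id(1) * TILE + lx;
    int oy = get_group_id(0) * TILE + ly;
    Out[oy * H + ox] = tile[lx * TILE + ly];
}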
Matrix Transpose
The following figure shows a performance comparison of the two transpose kernels for matrices of size NxM on an AMD 5870 GPU. "Optimized" uses local memory and thread remapping.
Occupancy
On current GPUs, work groups get mapped to compute units:
- When a work group is mapped to a compute unit, it cannot be swapped off until all of its threads complete their execution.
- If there are enough resources available, multiple work groups can be mapped to the same compute unit at the same time, and wavefronts from another work group can be swapped in to hide latency.
Resources are fixed per compute unit (number of registers, local memory size, maximum number of threads), and any one of these resource constraints may limit the number of work groups on a compute unit. The term occupancy describes how well the resources of the compute unit are being utilized.
Occupancy – Registers
The availability of registers is one of the major limiting factors for larger kernels: the maximum number of registers required by a kernel must be available for all threads of a workgroup.
Example: consider a GPU with 16384 registers per compute unit running a kernel that requires 35 registers per thread. Each compute unit can execute at most 468 threads (16384 / 35). This affects the choice of workgroup size:
- A workgroup of 512 threads is not possible.
- Only 1 workgroup of 256 threads is allowed at a time, even though 212 more threads could be running.
- 3 workgroups of 128 threads are allowed, providing 384 threads to be scheduled, etc.
Occupancy – Registers
Consider another example:
- A GPU has 16384 registers per compute unit.
- The work group size of a kernel is fixed at 256 threads.
- The kernel currently requires 17 registers per thread.
Given this information, each work group requires 4352 registers. This allows for 3 active work groups if registers are the only limiting factor. If the code can be restructured to use only 16 registers, then 4 active work groups would be possible.
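Standard OpenCL does not report per-thread register counts (vendor profilers do), but related per-kernel resource figures can be queried; a host-side sketch under that caveat (variable names are ours):

size_t max_wg_size;        // largest work group this kernel can launch with
cl_ulong local_bytes;      // statically allocated local memory per work group
cl_ulong private_bytes;    // private (register-backed) memory per work item
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_bytes), &local_bytes, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_bytes), &private_bytes, NULL);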
Occupancy – Local Memory
GPUs have a limited amount of local memory on each compute unit:
- 32KB of local memory on AMD GPUs
- 32-48KB of local memory on NVIDIA GPUs
Local memory limits the number of active work groups per compute unit. Depending on the kernel, the data per workgroup may be fixed regardless of the number of threads (e.g., histograms), or may vary based on the number of threads (e.g., matrix multiplication, convolution).
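If local memory were the only constraint, the active work-group count per compute unit is a simple division; a sketch with illustrative numbers (ours):

cl_ulong local_mem_size;   // per-compute-unit local memory, e.g. 32768 bytes
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);
cl_ulong local_per_wg = 9216;   // assume each work group allocates 9 KB
int active_wgs = (int)(local_mem_size / local_per_wg);   // 32768 / 9216 = 3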
Occupancy – Threads
GPUs have hardware limitations on the maximum number of threads per work group:
- 256 threads per work group on AMD GPUs
- 512 threads per work group on NVIDIA GPUs
NVIDIA GPUs have per-compute-unit limits on the number of active threads and work groups (depending on the GPU model):
- 768 or 1024 threads per compute unit
- 8 or 16 warps per compute unit
AMD GPUs have GPU-wide limits on the number of wavefronts:
- 496 wavefronts on the 5870 GPU (~25 wavefronts or ~1600 threads per compute unit)
Occupancy – Limiting Factors
The minimum of these three factors limits the active number of threads (i.e., the occupancy) of a compute unit. The interactions between the factors are complex:
- The limiting factor may have either thread or wavefront granularity.
- Changing work group size may affect register or shared memory usage.
- Reducing any factor slightly (such as register usage) may allow another work group to be active.
The CUDA occupancy calculator from NVIDIA plots these factors, allowing the tradeoffs to be visualized.
CUDA Occupancy Calculator
1. Enter hardware model and kernel requirements.
2. Resource usage and limiting factors are displayed.
3. Graphs are shown to visualize limiting factors.
Vectorization
On AMD GPUs, each processing element executes a 5-way VLIW instruction:
- 5 scalar operations, or
- 4 scalar operations + 1 transcendental operation
[Figure: a compute unit containing processing elements PE0 through PEn-1; each PE holds registers, ALUs plus an ALU/T-unit for transcendentals, and a branch unit fed by the incoming instruction.]
Vectorization
Vectorization allows a single thread to perform multiple operations at once. Explicit vectorization is achieved by using vector datatypes (such as float4) in the source program:
- When a number is appended to a datatype, the datatype becomes a vector of that length.
- Operations can be performed on vector datatypes just like regular datatypes.
- Each ALU will operate on a different element of the float4 data.
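A minimal explicit-vectorization sketch (ours; the kernel name and launch shape are assumptions). The host launches one work item per float4, i.e., a global size of n/4:

__kernel void scale4(__global float4* data, float alpha)
{
    int i = get_global_id(0);   // one work item per float4 element
    data[i] = data[i] * alpha;  // all four lanes computed as one vector op
}

On the VLIW hardware described above, the four lanes map naturally onto the scalar ALUs of a processing element.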
Vectorization
Vectorization improves memory performance on AMD GPUs. The AMD Accelerated Parallel Processing OpenCL Programming Guide compares float to float4 memory bandwidth.
Summary
Although writing a simple OpenCL program is relatively easy, optimizing code can be very difficult:
- Improperly mapping loop iterations to OpenCL threads can significantly degrade performance.
- When creating work groups, hardware limitations (number of registers, size of local memory, etc.) need to be considered. Work groups must be sized appropriately to maximize the number of active threads and properly hide latencies.
- Vectorization is an important optimization for AMD GPU hardware. Though not covered here, vectorization may also help performance when targeting CPUs.

Editor's Notes

  • #7: Assuming that our thread structure will match the dimensions of our output matrix, the question here is whether the outer loop of the algorithm should be mapped to the X or the Y dimension of the thread structure. Same for the middle loop. The next slide will help clarify.
  • #9: These are results from Jang et al.
  • #14: Either approach causes horrible performance on the GPU.
  • #15: After reading the data to local memory, new indexes are computed that allow consecutive threads to write to consecutive memory locations.
  • #18: Because of the hardware schedulable unit of threads (i.e. wavefronts or warps), we usually try to create workgroups that are a power of two in size, or at least a multiple of the schedulable unit
  • #19: Since workgroups are units that must be scheduled all at once, reducing register pressure by a small amount could allow many more threads to be active.
  • #22: The impact of occupancy is latency hiding, so the real benefit of having as many active threads as possible is completely dependent on the degree of memory latency present in the program (compute-bound kernels will see minimal benefit from more active threads).
  • #24: NVIDIA GPUs do not have vector-based PEs
  • #25: The compiler will attempt to pack the VLIW instruction if vector data types are not used.