GPU
JUNE 6TH, 2019
7nm
251 mm²
10.3 Billion Transistors
x16 PCIe® Gen 4.0
8 GB GDDR6 256b @ 14 Gbps
448 GB/s*
2560 Stream Processors
Up To 9.75 TFLOPs
*256 pins GDDR6 × 14 Gbps × 1 B / 8 b = 448 GB/s
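The footnote arithmetic can be checked directly. The sketch below reproduces the 448 GB/s figure from the pin count and per-pin data rate, and backs out the clock implied by the 9.75 TFLOPs figure. It assumes each stream processor retires one fused multiply-add (2 FLOPs) per clock. That assumption and the function names are illustrative and do not come from the slides.

```python
def memory_bandwidth_gbytes(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: pins * Gb/s per pin / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


def implied_clock_ghz(tflops: float, stream_processors: int,
                      flops_per_clock: int = 2) -> float:
    """Clock in GHz needed to reach `tflops`.

    Assumes `flops_per_clock` FLOPs per stream processor per clock
    (2 = one fused multiply-add; an assumption, not a slide value).
    """
    return tflops * 1e12 / (stream_processors * flops_per_clock) / 1e9


if __name__ == "__main__":
    print(memory_bandwidth_gbytes(256, 14))           # GB/s
    print(round(implied_clock_ghz(9.75, 2560), 3))    # GHz
```

Under the FMA assumption, the "Up To" compute figure corresponds to a clock of roughly 1.9 GHz.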
Radeon Display Engine
New High Resolution HDR Displays
New Levels of Compression
Radeon Multi-Media Engine
Seamless Streaming
Improved Encoding
New Graphics RDNA Architecture
New Compute Units
Multilevel Cache
Streamlined Engine
[Block diagram: Infinity Fabric; PCIe Gen 4; Display Engine; Multimedia Engine; Geometry Processor; Graphics Command Processor; 4 ACE; HWS; DMA; 4 × 64-bit Memory Controller; 2 Shader Engines containing 20 Dual Compute Units, 4 L1, 4 Prim Units, 4 Rasterizers, and 16 RB; 16 L2]
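The unit counts in the block diagram line up with the headline specifications. A minimal sketch, assuming two compute units per Dual Compute Unit, two SIMD32 per compute unit (the "Dual SIMD32" item in the compute unit summary), and one stream processor per SIMD lane. The constant and function names are illustrative.

```python
DUAL_COMPUTE_UNITS = 20      # counted from the block diagram
CUS_PER_DUAL_CU = 2          # assumption: a "dual" unit pairs two compute units
SIMD32_PER_CU = 2            # "Dual SIMD32" per compute unit
LANES_PER_SIMD32 = 32        # assumption: one stream processor per lane
MEMORY_CONTROLLERS = 4       # counted from the block diagram
BITS_PER_CONTROLLER = 64


def stream_processors() -> int:
    """Total SIMD lanes across the shader array."""
    return DUAL_COMPUTE_UNITS * CUS_PER_DUAL_CU * SIMD32_PER_CU * LANES_PER_SIMD32


def memory_bus_width_bits() -> int:
    """Aggregate memory interface width."""
    return MEMORY_CONTROLLERS * BITS_PER_CONTROLLER


if __name__ == "__main__":
    print(stream_processors())       # matches the 2560 Stream Processors figure
    print(memory_bus_width_bits())   # matches the 256b GDDR6 interface
```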
High Fidelity Internal Color Depth
30 bpp Color
Optimized for High Resolution HDR Displays
4K 240 Hz | Single Cable | 8K 60 Hz
Optimized for Head Mounted Displays
Single IO Connectivity
Better Design For Power Efficiency
Multi-Plane Overlay Protocol With Low Voltage Mode
HDMI® 2.0 & DisplayPort 1.4 HDR
Display Stream Compression 1.2a
Direct Read of DCC Compressed Surfaces
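A rough calculation suggests why the single-cable modes are listed alongside Display Stream Compression. The sketch below compares the uncompressed pixel data rate at 30 bpp with the payload rate of a DisplayPort 1.4 link. The link figures (4 lanes at 8.1 Gbps with 8b/10b coding) are not from the slides, blanking intervals are ignored, and the function names are illustrative.

```python
def uncompressed_gbps(width: int, height: int, refresh_hz: int,
                      bits_per_pixel: int) -> float:
    """Active-pixel data rate in Gb/s, ignoring blanking intervals."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


# Assumption: DisplayPort 1.4 HBR3, 4 lanes x 8.1 Gb/s, 8b/10b line coding.
DP14_PAYLOAD_GBPS = 4 * 8.1 * 8 / 10


def min_compression_ratio(required_gbps: float,
                          link_gbps: float = DP14_PAYLOAD_GBPS) -> float:
    """Smallest compression ratio that fits the stream on the link."""
    return required_gbps / link_gbps


if __name__ == "__main__":
    for name, mode in {"4K 240 Hz": (3840, 2160, 240),
                       "8K 60 Hz": (7680, 4320, 60)}.items():
        rate = uncompressed_gbps(*mode, bits_per_pixel=30)
        print(name, round(rate, 2), "Gb/s, ratio >=",
              round(min_compression_ratio(rate), 2))
```

Both modes carry the same pixel rate, so under these assumptions they need the same compression ratio of a little over 2:1.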
[Table: decode/encode rates by codec. Labels: H.264 MPEG-4, Next Gen. Values in reading order: 1080p600, 4K150, 1080p360, 4K90, 1080p360, 4K60, 1080p360, 4K90, 8K24, (Next Gen) 4K90, 8K24]
IMPROVED ENCODING
New HDR/WCG Encode (HEVC)
8K Decode (HEVC & VP9)
40% Encoder Speedups
[Figure labels: YouTube, Twitch; 8b/10b, 8b, 8b/10b]
Motivation
Radeon Architecture
New Compute Unit
Multi-Level Cache
Streamlined Engine
Results & Example
GPU Architecture Designed For
Gaming Performance & Efficiency
THE MOTIVATION BEHIND NAVI ARCHITECTURE
“NAVI”
2X INSTRUCTION RATE
Dual Schedulers
Dual Scalar Units
Dual SIMD32
SINGLE CYCLE ISSUE
Wave32 on SIMD32
ALU & LD/ST Unit
SFU Co-Execution
BYTES PER FLOP
128B Load/Store
64B Filter Rate
EXECUTION FLEXIBILITY
Wave64 Dual Issue
Cooperating CU Pair
[Diagram labels: LDS RTN; IDX DIRECT; VECTOR MEM RTN; V INIT DATA]
GCN Instruction Execution
Wave64 on SIMD16 (4clk)
[Timeline: Wave0 Instruction N, Wave0 Instruction N+1, Wave2 Instruction M; each instruction issues over 4 clocks as T 0-15, T 16-31, T 32-47, T 48-63]
“GCN”: Fixed Interleave Of 4 Sets Of Threads Preventing Fine Grain Dynamic Compiler Scheduling
Interleave low priority waves on long latency stall
RDNA Instruction Execution
Wave32 on SIMD32 (1clk)
[Timeline: Wave0 Instruction N, Wave0 Instruction N+1, Wave2 Instruction N+1; each instruction issues in 1 clock as T 0-31; horizontal axis: Time]
“RDNA”: Single Cycle Instruction Issue Enabling Fine Grain Compiler Driven Scheduling To Optimize For Prioritized Single Threaded Performance
Interleave lower priority waves on long latency stall
Both Designs Utilize Multithreading of different waves to achieve throughput and engine utilization
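The 4clk and 1clk figures follow from dividing the wave size by the SIMD width. A minimal sketch of that relationship; the function name is illustrative.

```python
import math


def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Clocks needed to issue one instruction for every thread of a wave."""
    return math.ceil(wave_size / simd_width)


if __name__ == "__main__":
    print("GCN  Wave64 on SIMD16:", issue_cycles(64, 16))
    print("RDNA Wave32 on SIMD32:", issue_cycles(32, 32))
    print("RDNA Wave64 on SIMD32:", issue_cycles(64, 32))
```

The third case corresponds to a Wave64 issuing as two halves on a SIMD32.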
WORK LOAD EXAMPLE: 64 WORK-ITEMS ALU INTENSIVE CODE
GCN
1 Wave64 ➔ SIMD16
Instruction Issue ➔ 4 clock
CU ALU ➔ 25% Utilized
[Diagram: SIMD 0 to SIMD 3, lanes 0-15 each]
RDNA
2 Wave32 ➔ 2 SIMD32
Instruction Issue ➔ 1 clock
CU ALU ➔ 100% Utilized
[Diagram: SIMD 0 and SIMD 1, lanes 0-31 each]
Effective Throughput
ILP unlocks up to 4x faster focused execution
Engage Machine Quickly By Uniformly Distributing Work To All ALUs
Optimize Efficiency And Latency By Preferring Highest Priority/Oldest Work
Extract Program ILP And Scheduling To Benefit From Data Locality
Utilize Multi-Threading Of Waves To Hide Remaining Latencies For Throughput
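The utilization percentages for the 64 work-item example can be reproduced with a small model. This sketch assumes each wave occupies exactly one SIMD and that a SIMD with a resident wave counts as fully utilized. The function name and that simplification are illustrative, not taken from the slides.

```python
import math


def alu_utilization(work_items: int, wave_size: int, simd_count: int) -> float:
    """Fraction of a compute unit's SIMDs that have a wave to execute."""
    waves = math.ceil(work_items / wave_size)
    busy_simds = min(waves, simd_count)   # assumption: one wave per SIMD
    return busy_simds / simd_count


if __name__ == "__main__":
    gcn = alu_utilization(64, wave_size=64, simd_count=4)    # 1 Wave64, 4 x SIMD16
    rdna = alu_utilization(64, wave_size=32, simd_count=2)   # 2 Wave32, 2 x SIMD32
    print(f"GCN {gcn:.0%}, RDNA {rdna:.0%}, ratio {rdna / gcn:.0f}x")
```

In this model the ratio between the two utilization figures is 4, which is consistent with the "up to 4x" claim for this workload.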
Instruction sequence:
s_add_i32 s0, s1, s2
v_mul_f32 v0, v1, s0
v_add_f32 v5, v4, v3
v_sub_f32 v6, v7, v0
Issue timeline by cycle:
Cycle  Wave64 on SIMD16 (4clk)  SIMD32 Wave32                    SIMD32 Wave64
0      s_add_i32 s0, s1, s2     s_add_i32 s0, s1, s2             s_add_i32 s0, s1, s2
1      …                        … (salu dependency stall on S0)  … (salu dependency stall on S0)
2      …                        v_mul_f32 v0, v1, s0             v_mul_f32 v0, v1, s0 (lo)
3      …                        v_add_f32 v5, v4, v3             v_mul_f32 v0, v1, s0 (hi)
4      v_mul_f32 v0, v1, s0     … (valu dependency stall on V0)  v_add_f32 v5, v4, v3 (lo)
5      … (simd busy 4 cycles)   …                                v_add_f32 v5, v4, v3 (hi)
6      …                        …                                … (valu dependency stall on V0 lo)
7      …                        v_sub_f32 v6, v7, v0             v_sub_f32 v6, v7, v0 (lo)
8      v_add_f32 v5, v4, v3                                      v_sub_f32 v6, v7, v0 (hi)
9      …
10     …
11     …
12     v_sub_f32 v6, v7, v0
13     …
14     …
15     …
SHORTEST WAVE ISSUE LATENCY
44% REDUCTION IN ISSUE CYCLES
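The 44% figure appears to compare the 16-cycle timeline (one instruction every 4 clocks) with the 9-cycle Wave64 timeline on SIMD32. That pairing is an inference from the cycle counts, not something the slide states. The sketch below transcribes both timelines and computes the reduction; all names are illustrative.

```python
STALL = "..."  # a cycle in which no new instruction issues

PROGRAM = ["s_add_i32", "v_mul_f32", "v_add_f32", "v_sub_f32"]

# 4-clock SIMD: every instruction holds the issue slot for 4 cycles.
four_clock_timeline = [slot for ins in PROGRAM
                       for slot in (ins, STALL, STALL, STALL)]

# Wave64 on SIMD32, as listed: lo/hi halves issue back to back,
# with one scalar and one vector dependency stall.
simd32_wave64_timeline = [
    "s_add_i32", STALL,
    "v_mul_f32 lo", "v_mul_f32 hi",
    "v_add_f32 lo", "v_add_f32 hi",
    STALL,
    "v_sub_f32 lo", "v_sub_f32 hi",
]


def issue_cycle_reduction(baseline: list, improved: list) -> float:
    """Fractional reduction in issue cycles from `baseline` to `improved`."""
    return (len(baseline) - len(improved)) / len(baseline)


if __name__ == "__main__":
    r = issue_cycle_reduction(four_clock_timeline, simd32_wave64_timeline)
    print(len(four_clock_timeline), len(simd32_wave64_timeline), f"{r:.2%}")
```

Going from 16 to 9 cycles is a 43.75% reduction, which rounds to the quoted 44%.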
Access to 2X Registers
Up to 2X ALUs
Access to 4X Cache Bandwidth
[Diagram: Compute Unit 0, Compute Unit 1]
New L1 Level Cache
Improved Bandwidth Amplification
Reduced Latency and Power
Reduced Congestion at L2 Level
[Block diagram: the GPU overview layout (Infinity Fabric, PCIe Gen 4, Display Engine, Multimedia Engine, Geometry Processor, Graphics Command Processor, 4 ACE, HWS, DMA, 4 × 64-bit Memory Controller, 2 Shader Engines with 20 Dual Compute Units, 4 L1, 4 Prim Units, 4 Rasterizers, 16 RB, 16 L2), with 20 L0 caches shown]
Unified LLC for GFX/ACE Pipes
Instruction Range Based Actions
OOO between R/W, L0, L1, L2, Mem
Reduced Latency and Power
Reduced Data Movement
[Diagram: 20 WGP]
[Diagram: 4 SGPR; 4 Wave Buffers; 2 SIMD 0 VGPR; 2 SIMD 1 VGPR; LDS]
[Diagram labels: Shader Complex; PCIe® 4.0; Geometry; Async Compute; Command Interfaces; Compressed Data; SOC Fabric; Rasterizer & L1; Texture; Display Engine]
7nm "Navi" GPU - A GPU Built For Performance
[Chart: vertical axis 0% to 100%]
See Endnotes RX-325 and RX-362. Data based on AMD internal testing 6/1/2019.
[Chart: 14nm “Vega64” vs. 7nm “Navi”]
See Endnotes RX-325, RX-358, and RX-365. Slide data based on AMD internal testing 6/1/2019.
[Trace legend: Vector Instruction Issue; WaveId; Waiting to be executed; Store Results; SFU Instruction; Scalar Mem Instruction]
One SIMD instruction trace of oldest wave (12), next to oldest wave (13), etc.