Yang greenstein part_2

How Shaders are Created

Application

API

GPU Driver
Video BIOS
May 30, 2008 AMD: DV Club - Westford MA

Images


Display Processing

Advanced Gamma and Color Correction

Image comparison: no correction vs. Avivo Display Engine 10-bit gamma and color correction

“Call of Juarez” using DirectX 9


“Call of Juarez” using DirectX 10


GPU Verification


Graphics Verification Challenges

Large complex ASICs:
Approaching 1B transistors; >50 different clocks; >600 MHz; >100 top-level tiles
Parallel SIMDs, Multiple pipelines; hundreds of threads in flight; >300 ALUs
High BW memory/cache interface; PCI Express; Display Ports

3rd party compliance: DirectX and OpenGL Graphic APIs and Apps

Firmware critical to ASIC function
ASIC validation utilizes firmware release as part of tape out
Firmware debug requires significant amounts of time

Full-frame processing requires days/weeks of RTL simulation

Market window small – consumer market is harsh!
Schedule is KING
Need incremental development; hierarchy and reuse prior
Respins are costly; time to market is critical
Christmas, Dads/Grads, or bust!


GPU Architecture


Radeon 2900 Top Level

Block diagram legend: Red – Compute; Yellow – Cache

Blocks shown in the diagram:
Command Processor
Vertex Index Fetch
Setup
Vertex Assembler
Geometry Assembler
Tessellator
Rasterizer Unit
Interpolators
Hierarchical Z
Ultra-Threaded Dispatch Processor
Shader Caches (Instruction & Constant)
Unified shader
Stream Out
Shader R/W
Memory Read/Write Cache
Texture Units (four shown)
L1 Texture Cache
L2 Texture Cache
Unified texture cache
Shader Export
Render Back-Ends
Z/Stencil Cache
Color Cache
Compression

4 SIMD processors
16 pipelines/SIMD
5 stream processors (32-bit FP) per pipeline
320 ALU ops in parallel
Over 700M transistors


Technical Solutions

Layered CODE Methodology
Multiple Layers of Testbenches
Maximize Controllability, Observability, and Debug Efficiency
Reference Model

Tools
Coverage and assertions
Visualization

HW Emulation


Layered CODE Verification
Testbench Capability - Maximize Controllability, Observability, and Debug Efficiency

Controllability: I/O; pipeline timing; sequencing; internal state; error injection
Observability: checking results in I/O; internal states

Level           Controllability                                 Observability  Debug/Fix Efficiency  Expected Bugs Found (for efficiency)
Silicon in Lab  Low                                             Low            days - weeks          ZERO
Chip/System     Med: chip I/O                                   Med            hours - days          Few
Block           High: block I/O                                 High           minutes - hours       Many
Sub Block       Max: closest to design; internal corner states  Max            minutes               Most


Reference Model Methodology

C++ reference model of the DUT
One “block” = one C++ object
Non-synthesizable => easier to write than RTL
Very fast
– Several orders of magnitude faster than the design
– Used by driver, performance teams

Transaction-level accuracy
Block-block interfaces modeled (see SystemVerilog definition)
Matches design exactly (almost)
Sub-transaction debug taps for added accuracy


Testbenches

Sub-block testbenches: designer boot-strap

Block-level testbenches: constrained-random
Tests written in C++
Test library in C++
– SCV, other randomization
Threaded transport layer
– Based on SystemC
– C++ to C++
– C++ to Verilog
Two-pass approach
– Ref model, then RTL, then compare

SystemVerilog testbenches also used

Diagram (top to bottom): Test; Test library; Transport; Block reference model OR RTL


Testbenches

Chip/system testbenches
Tests written in C++
Tests debugged on chip reference model
– Collection of block ref models; see prev slide
Test library in C++
– Mimics OpenGL, an industry standard

Test portability
Write once, run everywhere
– Reference model
– Design
– H/W emulation
– Lab/diags
– Production drivers
Overall TTM improved
– Driver schedule is nontrivial

Diagram (top to bottom): Test; OpenGL-like test library OR production driver; Transport; Chip reference model OR RTL OR emulation OR real H/W


HW Emulation

Usage:

In-circuit emulation of the full chip design, running chip DV and the SW stack
Simulates up to 1000X faster than SW (RTL) simulation
Capable of rendering full image frames in minutes/hrs vs days/weeks
Capture/playback scenes of benchmarks and games

Pre Silicon
Verifying chip/system level functionalities and performance, block
interactions, stress
Allows for longer runs of random tests to look for hangs
Prototype and test SW drivers and Diag
Develop Boot Up settings

Post Silicon: Bring-up to Production
Debug platform for silicon
Validate ECOs


Coverage and Assertions

Assertions are a Good Thing
White-box testing
Designer impact on DV
Etc.

Functional coverage is a Good Thing
Deep corner cases
API spec does not show all implementation details
Etc.
Bug rates/DV closure improved greatly when functional coverage was adopted


Visualization

It is graphics, after all
Nice to see pretty pictures
for what you are drawing

Two overlapping textured
triangles, with depth


Visualization

Corruptions become easier to see; recognize patterns

Color
corruption


Summary

AMD + ATI = positioned for success

Graphics business/technology has many challenges

Market window is everything

Techniques mostly leverage standard industry practice, with some twists
Reference-model-based flow
High quality is required
– Rely on coverage, constrained-random, etc.
H/W and S/W are both key to product success
– Seamless integration required

We are growing
Always looking for good people!

Shaw.Yang@amd.com
Gary.Greenstein@amd.com


Backup Slides

