CS 354
Performance Analysis

Mark Kilgard
University of Texas
April 26, 2012

Today's material
- In-class quiz
  - On acceleration structures lecture
- Lecture topic
  - Graphics Performance Analysis

My Office Hours
- Tuesday, before class
  - Painter (PAI) 5.35
  - 8:45 a.m. to 9:15 a.m.
- Thursday, after class
  - ACE 6.302
  - 11:00 a.m. to 12:00
- Randy's office hours
  - Monday & Wednesday
  - 11:00 a.m. to 12:00
  - Painter (PAI) 5.33

Last time, this time
- Last lecture, we discussed
  - Acceleration structures
- This lecture
  - Graphics performance analysis
- Projects
  - Project 4 on ray tracing is on Piazza
    - Due May 2, 2012
    - Get started!

Daily Quiz

On a sheet of paper:
- Write your EID, name, and date
- Write #1, #2, #3 followed by its answer

1. Multiple choice: Which is NOT a bounding volume representation?
   a) sphere
   b) axis-aligned bounding box
   c) object-aligned bounding box
   d) bounding graph point
   e) convex polyhedron
2. True or False: Volume rendering can be accelerated by the GPU by drawing blended slices of the volume.
3. True or False: Placing objects within a uniform grid is easier than placing objects within a KD tree.

Graphics Performance Analysis
- Generating synthetic images by computer is computationally and bandwidth intensive
  - Achieving interactive rates is key
    - 60 frames/second ≈ real-time interactivity
  - Worth optimizing
    - Entertainment and intuition are tied to interactivity
- How do we think about graphics performance analysis?

Framing Amdahl's Law
- Assume a workload with two parts
  - First part is A%
  - Second part is B%
  - Such that A% + B% = 100%
- If we have a technique to speed up the second part by N times
  - But have no speedup for the first part
  - What overall speedup can we expect?

Amdahl's Equation
- Assume A% + B% = 100%
- If the un-optimized effort is 100%, the optimized effort should be smaller:

    OptimizedEffort = A% + B%/N

- Speedup is the ratio of UnoptimizedEffort to OptimizedEffort:

    Speedup = 100% / (A% + B%/N) = 1 / ((1 − B) + B/N)

  (writing A and B as fractions, with A + B = 1)

Who was Amdahl?
- Gene Amdahl
  - CPU architect for IBM in the 1960s
    - Helped design IBM's System/360 mainframe architecture
  - Left IBM to found Amdahl Corporation
    - Building IBM-compatible mainframes
- Why the law?
  - He was evaluating whether to invest in parallel processing or not

Parallelization
- Broadly speaking, computer tasks can be broken into two portions
  - Sequential sub-tasks
    - Naturally require steps to be done in a particular order
    - Examples: text layout, entropy decoding
  - Parallel sub-tasks
    - The problem splits into lots of independent chunks of work
    - Chunks of work can be done by separate processing units simultaneously: parallelization
    - Examples: tracing rays, shading pixels, transforming vertices

Serial Work Sandwiching Parallel Work

  (figure: serial stages of work bracketing a stretch of parallel work)

Example of Amdahl's Law
- Say a task is 50% serial and 50% parallel
- Consider using 4 parallel processors on the parallel portion
  - Speedup: 1.6x
- Consider using 40 parallel processors on the parallel portion
  - Speedup: 1.951x
- Consider the limit:

    lim (n→∞) 1 / (0.5 + 0.5/n) = 2
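
To make the arithmetic concrete, here is a minimal C sketch (not from the slides) that evaluates Amdahl's formula for the 50% serial case above:

    #include <stdio.h>

    /* Amdahl's Law: with serial fraction s and n-way parallelism on the
     * rest, speedup = 1 / (s + (1 - s)/n). */
    static double amdahl_speedup(double s, double n)
    {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void)
    {
        printf("n = 4:   %.3fx\n", amdahl_speedup(0.5, 4.0));  /* 1.600x */
        printf("n = 40:  %.3fx\n", amdahl_speedup(0.5, 40.0)); /* 1.951x */
        printf("n = 1e9: %.3fx\n", amdahl_speedup(0.5, 1e9));  /* ~2x limit */
        return 0;
    }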

Graph of Amdahl's Law

  (figure: the speedup curve of Amdahl's Law)

Pessimism about Parallelism?
- Amdahl's Law can instill pessimism about parallel processing
- If the serial work percentage is high, adding parallel units has low benefit
  - Assumes a fixed "problem" size
  - So the workload stays the same size even as parallel execution resources are added
- So why do GPUs offer hundreds of cores then?

Gustafson's Law
- Observation
  - By John Gustafson
  - With N parallel units, bigger problems can be attacked
- Great example
  - Increasing GPU rendering resolution
  - Was 640x480 pixels, now 1920x1200
  - More parallel units means more pixels can be processed simultaneously
    - Supporting rendering resolutions previously unattainable
- Problem size improvement (with A the serial fraction):

    problemScale = N − A(N − 1)
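
As a companion sketch (again not from the slides), the scaled problem size is just as easy to evaluate; the numbers match the example on the next slide:

    #include <stdio.h>

    /* Gustafson's Law: with n parallel units and serial fraction a, the
     * problem size solvable in the same wall-clock time scales as
     * n - a*(n - 1). */
    static double gustafson_scale(double a, double n)
    {
        return n - a * (n - 1.0);
    }

    int main(void)
    {
        printf("n = 4:   %.1fx larger\n", gustafson_scale(0.5, 4.0));   /* 2.5x  */
        printf("n = 100: %.1fx larger\n", gustafson_scale(0.5, 100.0)); /* 50.5x */
        return 0;
    }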

Example
- Say a task is 50% serial and 50% parallel
- Consider using 4 parallel processors on the parallel portion
  - Problem scales up: 2.5x
- Consider 100 parallel processors
  - Problem scales up: 50.5x
- Also consider the heterogeneous nature of graphics processing units

Coherent Work vs. Incoherent Work
- Not all parallel work is created equal
- Coherent work = "adjacent" chunks of work performing similar operations and memory accesses
  - Examples: camera rays, pixel shading
  - Allows sharing control of instruction execution
  - Good for caches
- Incoherent work = "adjacent" chunks of work performing dissimilar operations and memory accesses
  - Examples: reflection, shadow, and refraction rays
  - Bad for caches

Coherent vs. Incoherent Rays

  (figure: coherent = camera rays; coherent = light rays; incoherent = reflected rays)

Keeping Work Coherent?
- How do we keep work coherent?
- Pipelines
  - Be careful: they can introduce latency
- Data structures
- SPMD (or SIMD) execution
  - Single Program, Multiple Data
  - To exploit Single Instruction, Multiple Data (SIMD) units
  - Bundling "adjacent" work elements helps cache and memory access efficiency

Pipeline Processing
- Parallel and naturally coherent

A Simplified Graphics Pipeline

  Application
  → Vertex batching & assembly   (application-OpenGL API boundary above)
  → Triangle assembly
  → Triangle clipping
  → NDC to window space
  → Triangle rasterization
  → Fragment shading
  → Depth testing   (reads/writes the depth buffer)
  → Color update    (writes the framebuffer)

Another View of the Graphics Pipeline (OpenGL 3.3)

  (block diagram)
  3D application or game
  OpenGL API                              ← CPU-GPU boundary
  GPU front end → vertex assembly → primitive assembly
    → clipping, setup, and rasterization → raster operations
  Programmable stages: vertex shader, geometry program, fragment shader
  Memory paths: attribute fetch, parameter buffer read, texture fetch,
    framebuffer access; all go through the memory interface
  Legend: programmable vs. fixed-function units

Modeling Pipeline Efficiency
- Rate of processing for sequential tasks
  - Assume three tasks
  - Run time is the sum of each operation's time
    - A + B + C
- Rate of processing in a pipeline
  - Assume three tasks, treated as stages
  - Performance is gated by the slowest operation
    - Three operations in the pipeline: A, B, C
    - Run time = max(A, B, C)
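
A tiny sketch of the two cost models, using hypothetical stage times (the specific numbers are made up for illustration):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical per-item stage times, in milliseconds. */
        double a = 2.0, b = 5.0, c = 3.0;

        /* Sequential: each item pays the sum of all stage times. */
        double sequential = a + b + c;                  /* 10.0 ms */

        /* Pipelined: once the pipeline is full, one item completes per
         * slowest-stage interval. */
        double slowest = a;
        if (b > slowest) slowest = b;
        if (c > slowest) slowest = c;                   /*  5.0 ms */

        printf("sequential: %.1f ms/item, pipelined: %.1f ms/item\n",
               sequential, slowest);
        return 0;
    }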

Hardware Clocks
- The heartbeat of hardware
  - Measured as a frequency
    - Hertz (Hz) = cycles per second
    - Megahertz, gigahertz = million, billion Hz
- Faster clocks = faster computation and data transfer
- So why not simply raise clocks?
  - Higher clocks consume more power
  - Circuits are only rated to a maximum clock speed before becoming unreliable

Clock Domains
- A given chip may have multiple clocks running
- Three key domains (GPU-centric)
  - Graphics clock: for fixed-function units
    - Example uses: rasterization, texture filtering, blending
    - Optimized for throughput, not latency
      - Can often instance more units instead of raising clocks
  - Processor clock: for programmable shader units
    - Example: shader instruction execution
    - Generally higher than the graphics clock
      - Because optimized for latency rather than throughput
  - Memory clock: for talking to external memory
    - Depends on the speed rating of the external memory
- Other domains too
  - Display clock, PCI-Express bus clock
  - Generally not crucial to rendering performance

3D Pipeline Programmable Domains Run on Unified Hardware
- Unified Streaming Processor Array (SPA) architecture means the same capabilities for all domains
  - Plus tessellation + compute (not shown below)

  (block diagram: the same pipeline as before, but the vertex, primitive, and
  fragment programs can all execute on unified hardware; attribute fetch,
  parameter buffer read, texture fetch, and framebuffer access all go through
  the memory interface)

Memory Bandwidth
- Raw memory bandwidth
  - Physical clock rate
    - Example: 3 GHz
  - Memory bus width
    - 64-bit, 128-bit, 192-bit, 256-bit, 384-bit
    - Wider buses are faster, but routing all those wires is more expensive
  - Signaling rate
    - Double data rate (DDR) means signals are sent on both the rising and falling clock edges
    - Often the logical memory clock rate includes the signaling rate
- Computing raw memory bandwidth:

    bandwidth = physicalClock × signalsPerClock × busWidth
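
A minimal sketch of the raw-bandwidth formula; the inputs assume a 3 GHz physical clock with DDR signaling and a 256-bit bus, consistent with the GTX 680 figures on the next slides:

    #include <stdio.h>

    int main(void)
    {
        double physicalClockHz = 3e9;   /* 3 GHz physical memory clock        */
        double signalsPerClock = 2.0;   /* DDR: rising and falling clock edge */
        double busWidthBits    = 256.0; /* 256-bit memory interface           */

        /* bandwidth = physicalClock x signalsPerClock x busWidth, in bytes */
        double bytesPerSecond =
            physicalClockHz * signalsPerClock * busWidthBits / 8.0;
        printf("raw bandwidth: %.0f GB/s\n", bytesPerSecond / 1e9); /* 192 */
        return 0;
    }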

Latency vs. Throughput
- Raw bandwidth is reduced by imperfect memory utilization
  - Unrealistic to expect 100% utilization
  - GPUs generally achieve much better utilization than CPUs
- Trade-off
  - Maximizing throughput (utilization) increases latency
  - Minimizing latency reduces utilization

Computing Bandwidth
- Example: GeForce GTX 680
  - Latest NVIDIA generation
  - 3.54 billion transistors in a 28 nm process
- Memory characteristics
  - 6 GHz memory clock (includes signaling rate)
  - 256-bit memory interface
  - = 192 gigabytes/second
    - 6 billion × 256 bits/clock × 1 byte/8 bits

  (photos: GeForce GTX 680 board, GK104 die)

GeForce Peak Memory Bandwidth Trends

  (chart: gigabytes per second for GeForce2 GTS, GeForce3, GeForce4 Ti 4600,
  GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX; plots raw bandwidth
  and effective raw bandwidth with compression, each with an exponential
  trend line, and marks the shift from 128-bit to 256-bit memory interfaces)

Effective GPU Memory Bandwidth
- Compression schemes
  - Lossless depth and color compression (when multisampling)
  - Lossy texture compression (S3TC / DXTC)
    - Typically assumes 4:1 compression
- Avoidance of useless work
  - Early killing of fragments (Z cull)
  - Avoiding useless blending and texture fetches
- Very clever memory controller designs
  - Combining memory accesses for improved coherency
  - Caches for texture fetches

Other Metrics
- Host bandwidth
- Vertex pulling
- Vertex transformation
- Triangle rasterization and setup
- Fragment shading rate
- Shader instruction rate
- Raster (blending) operation rate
- Early Z reject rate

Kepler GeForce GTX 680 High-level Block Diagram
- 8 streaming multiprocessors (SMX)
- 1536 CUDA cores
- 8 geometry units
- 4 raster units
- 128 texture units
- 32 raster operations
- 256-bit GDDR5 memory

Kepler Streaming Multiprocessor

  (figure: one SMX block diagram, labeled "8 more copies of this")

Prior Generation Streaming Multiprocessor (SM)
- Multi-processor execution unit (Fermi)
  - 32 scalar processor cores
  - A warp is a unit of thread execution of up to 32 threads
- Two workloads
  - Graphics
    - Vertex shader
    - Tessellation
    - Geometry shader
    - Fragment shader
  - Compute

Power Gating
- Computer architecture has hit the "power wall"
- Low-power operation is at a premium
  - Battery-powered devices
  - Thermal constraints
  - Economic constraints
- Power Management (PM) works to reduce power by
  - Lowering clocks when performance isn't required
  - Disabling hardware units
    - Avoids leakage

Scene Graph Labor
- High-level division of scene graph labor
- Four pipeline stages
  - App (application)
    - Code that manipulates/modifies the scene graph in response to user input or other events
  - Isect (intersection)
    - Geometric queries such as collision detection or picking
  - Cull
    - Traverse the scene graph to find the nodes to be rendered
      - Best example: eliminate objects out of view
    - Optimize the ordering of nodes
      - Sort objects to minimize graphics hardware state changes
  - Draw
    - Communicate drawing commands to the hardware
    - Generally through a graphics API (OpenGL or Direct3D)
- Can map well to multi-processor CPU systems

App-cull-draw Threading
- App-cull-draw processing on one CPU core (figure: stages run back-to-back each frame)
- App-cull-draw processing on multiple CPUs (figure: stages pipelined across cores; see the sketch below)
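
A minimal C sketch of the single-core version; Scene, DrawList, and the stage functions are hypothetical stubs, not a real scene-graph API:

    typedef struct Scene    Scene;     /* hypothetical scene-graph type */
    typedef struct DrawList DrawList;  /* hypothetical render-list type */

    void app(Scene *s);                /* react to input, update the scene graph */
    void cull(Scene *s, DrawList *dl); /* find visible nodes, sort by GPU state  */
    void draw(DrawList *dl);           /* issue OpenGL drawing commands          */
    void swapBuffers(void);

    void frameLoop(Scene *scene, DrawList *drawList)
    {
        for (;;) {
            app(scene);
            cull(scene, drawList);
            draw(drawList);
            swapBuffers();
        }
        /* On multiple CPUs, the same stages can instead be pipelined
         * across frames: draw frame N while culling frame N+1 and
         * running app logic for frame N+2, at the cost of added latency. */
    }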

Scene Graph Profiling
- A scene graph should help provide insight into performance
- Process statistics
  - What's going on?
  - Time stamps
- Database statistics
  - How complex is the scene in any frame?

Example: Depth Complexity Visualization
- How many pixels are being rendered?
  - Pixels can be rasterized by multiple objects
  - Depth complexity is the average number of times a pixel or color sample is updated per frame

  (figure: yellow and black indicate higher depth complexity)
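
One common way to produce such a visualization is additive blending, a sketch of which follows (assumes a current compatibility-profile OpenGL context; drawScene() is a hypothetical routine, and this is not necessarily how the pictured image was made):

    /* Render the scene with additive blending so each pixel accumulates a
     * small increment per fragment; brighter = higher depth complexity. */
    glDisable(GL_DEPTH_TEST);             /* count every fragment, even occluded ones */
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);          /* additive blending: dst += src            */
    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glColor3f(0.0625f, 0.0625f, 0.0625f); /* each covering surface adds 1/16 of white */
    drawScene();                          /* hypothetical scene-drawing routine       */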

Example: Heads-up Display of Statistics
- Process statistics
  - How long is everything taking?
- Database statistics
  - What is being rendered?
- Overlaying statistics on the active scene is often valuable
  - Dynamic update

Benchmarking
- Synthetic benchmarks focus on rendering particular operations in isolation
  - Example: what is the blended-pixel performance? (see the sketch below)
- Application benchmarks
  - Try to reflect what a real application would do
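
As a rough illustration, a synthetic blended-pixel benchmark might time a batch of full-screen blended quads. This is a sketch: drawFullscreenBlendedQuad() is a hypothetical helper, and glFinish-based wall-clock timing is just one simple approach:

    #include <GL/gl.h>
    #include <time.h>

    extern void drawFullscreenBlendedQuad(void);  /* hypothetical helper */

    /* Estimate blended-pixel fill rate by timing N full-screen blended quads. */
    double blendedFillRate(int quads, double width, double height)
    {
        struct timespec t0, t1;

        glFinish();                        /* drain any prior GPU work   */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < quads; i++)
            drawFullscreenBlendedQuad();
        glFinish();                        /* wait until the GPU is done */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double seconds = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return quads * width * height / seconds;  /* blended pixels/sec  */
    }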

Tips for Interactive Performance Analysis
- Vary things you can control
  - Change the window resolution
    - Make it smaller and see whether performance improves
- Null driver analysis
  - Skip the actual rendering calls
  - What if the driver were "infinitely" fast?
- Use occlusion queries to monitor how many samples (pixels) actually need to be rendered (see the sketch below)
- Keep data on the GPU
  - Let the GPU do Direct Memory Access (DMA)
  - Keep from swapping textures and buffers
    - Easier now that multi-gigabyte graphics cards are available
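
A minimal occlusion-query sketch (assumes an OpenGL 1.5+ context; drawObject() is a hypothetical drawing routine):

    /* Count how many samples an object actually contributes to the
     * framebuffer. */
    GLuint query, samplesPassed;
    glGenQueries(1, &query);
    glBeginQuery(GL_SAMPLES_PASSED, query);
    drawObject();                           /* hypothetical drawing routine */
    glEndQuery(GL_SAMPLES_PASSED);
    /* This read blocks until the GPU finishes; in production, poll
     * GL_QUERY_RESULT_AVAILABLE first to avoid stalling the pipeline. */
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samplesPassed);
    glDeleteQueries(1, &query);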

Next Class
- Next lecture
  - Surfaces
  - Programmable tessellation
- Reading
  - None
- Project 4
  - Project 4 is a simple ray tracer
  - Due Wednesday, May 2, 2012