NVIDIA Graphics, Cg, and Transparency
Mark Kilgard, Graphics Software Engineer, NVIDIA Corporation
GPU Shading and Rendering, Course 3, July 30, 2006

Outline
- NVIDIA graphics hardware: seven years of GeForce, plus the future
- Cg, "C for Graphics": the cross-platform GPU programming language
- Depth peeling: out-of-order transparency, now practical
Seven Years of GeForce

Year | Product | Direct3D Version | OpenGL Version | New Features
2000 | GeForce 256 | DX7 | 1.3 | Hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering
2001 | GeForce3 | DX8 | 1.4 | Programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries
2002 | GeForce4 Ti 4600 | DX8.1 | 1.4 | Early Z culling, dual-monitor
2003 | GeForce FX | DX9 | 1.5 | Vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color & depth compression
2004 | GeForce 6800 Ultra | DX9c | 2.0 | Vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending, dual-GPU
2005 | GeForce 7800 GTX | DX9c | 2.0 | Transparency antialiasing, quad-GPU
2006 | GeForce 7900 GTX | DX9c | 2.1 | Single-board dual-GPU, process efficiency
2006: the GeForce 7900 GTX board
- 512 MB / 256-bit GDDR3, 1,600 MHz effective, 8 pieces of 8Mx32
- 16x PCI-Express
- DVI x 2
- sVideo TV Out
- SLI Connector

2006: the GeForce 7900 GTX GPU
- 278 million transistors
- 650 MHz core clock
- 1,600 MHz GDDR3 effective memory clock
- 256-bit memory interface
- Notable functionality
  - Non-power-of-two textures with mipmaps
  - Floating-point (fp16) blending and filtering
  - sRGB color space texture filtering and frame buffer blending
  - Vertex textures
  - 16x anisotropic texture filtering
  - Dynamic vertex and fragment branching
  - Double-rate depth/stencil-only rendering
  - Early depth/stencil culling
  - Transparency antialiasing

2006: GeForce 7950 GX2, SLI-on-a-card
- Sandwich of two printed circuit boards
- Two GeForce 7 Series GPUs, 500 MHz core
- 1 GB video memory: 512 MB per GPU, 1,200 MHz effective
- Effective 512-bit memory interface!
- 16x PCI-Express
- DVI x 2
- sVideo TV Out
GeForce Peak Vertex Processing Trends
[Chart: millions of vertices per second, by GeForce generation]
- Vertex units per generation: 1, 1, 2, 3, 6, 8, 8, 2×8
- Rate is for a trivial 4x4 vertex transform; it exceeds peak setup rates, which allows excess vertex processing
- Assumes Alternate Frame Rendering (AFR) SLI mode

GeForce Peak Triangle Setup Trends
[Chart: millions of triangles per second, by GeForce generation]
- Assumes 50% face culling
- Assumes Alternate Frame Rendering (AFR) SLI mode

GeForce Peak Memory Bandwidth Trends
[Chart: gigabytes per second, by GeForce generation]
- Chart annotations: 128-bit interface, 256-bit interface, two physical 256-bit memory interfaces
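The peak memory bandwidth behind a chart like this can be estimated from figures quoted elsewhere in the deck: bus width and effective memory clock. The sketch below is only an illustration under the usual assumption that each interface moves its full bus width once per effective clock; the function name and the use of 1 GB = 10^9 bytes are choices made here, not something the slides state.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, effective_clock_mhz, interfaces=1):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for identical memory interfaces.

    Assumes one full-width transfer per effective clock on each interface.
    """
    bytes_per_transfer = bus_width_bits / 8            # bits -> bytes per transfer
    transfers_per_second = effective_clock_mhz * 1e6   # effective (data-rate) clock
    return interfaces * bytes_per_transfer * transfers_per_second / 1e9


# GeForce 7900 GTX: one 256-bit interface at 1,600 MHz effective.
geforce_7900_gtx = peak_bandwidth_gb_per_s(256, 1600)
# GeForce 7950 GX2: two physical 256-bit interfaces at 1,200 MHz effective.
geforce_7950_gx2 = peak_bandwidth_gb_per_s(256, 1200, interfaces=2)

print(f"GeForce 7900 GTX: {geforce_7900_gtx:.1f} GB/s")  # 51.2 GB/s
print(f"GeForce 7950 GX2: {geforce_7950_gx2:.1f} GB/s")  # 76.8 GB/s
```

These are theoretical peaks; sustained bandwidth depends on access patterns and on the compression and caching techniques the deck discusses.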
Effective GPU Memory Bandwidth
- Compression schemes
  - Lossless depth and color (when multisampling) compression
  - Lossy texture compression (S3TC / DXTC); typically assumes 4:1 compression
- Avoid useless work
  - Early killing of fragments (Z cull)
  - Avoid useless blending and texture fetches
- Very clever memory controller designs
  - Combining memory accesses for improved coherency
  - Caches for texture fetches

NVIDIA Graphics Core and Memory Clock Rates
[Chart: megahertz (MHz), by GPU generation]
- DDR memory transition: memory rates double the physical clock rate

GeForce Peak Texture Fetch Trends
[Chart: millions of texture fetches per second, by GeForce generation]
- Texture units per generation: 2×4, 2×4, 2×4, 2×4, 16, 24, 24, 2×24
- Assuming no texture cache misses

GeForce Peak Depth/Stencil-only Fill
[Chart: millions of depth/stencil pixel updates per second, by GeForce generation]
- Assuming no read-modify-write
- Double-speed depth/stencil-only rendering

GeForce Transistor Count and Semiconductor Process
[Chart: millions of transistors, by GeForce generation]
- Process (nm) per generation: 180, 180, 150, 130, 130, 110, 90, 90
- More performance with fewer transistors: architectural & process efficiency!
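A peak texture fetch rate of this kind is usually the product of texture unit count and core clock. The sketch below works that arithmetic for the unit counts and clocks quoted in the deck (24 units at 650 MHz for the GeForce 7900 GTX, 2×24 units at 500 MHz for the GeForce 7950 GX2). It assumes one fetch per texture unit per clock and no cache misses; the function name is hypothetical, and the results are not read off the original chart.

```python
def peak_texture_fetches_millions(texture_units, core_clock_mhz, gpus=1):
    """Peak fetch rate in millions of fetches per second.

    Assumes every texture unit completes one fetch per core clock.
    """
    return gpus * texture_units * core_clock_mhz


geforce_7900_gtx = peak_texture_fetches_millions(24, 650)
geforce_7950_gx2 = peak_texture_fetches_millions(24, 500, gpus=2)

print(geforce_7900_gtx)  # 15600
print(geforce_7950_gx2)  # 24000
```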
GeForce 7900 GTX Parallelism
[Block diagram]
- 8 Vertex Engines
- Triangle Setup/Raster, Z-Cull
- Shader Instruction Dispatch
- 24 Fragment Shaders
- Fragment Crossbar
- 16 Raster Operation Pipelines
- 4 Memory Partitions

Hardware Unit | GeForce FX 5900 | GeForce 6800 Ultra | GeForce 7900 GTX
Vertex | 3 | 6 | 8
Fragment + 2nd Texture Fetch | 4+4 | 16 | 24
Raster Color + Raster Depth | 4+4 | 16+16 | 16+16
2005: Comparison to CPU
Pentium Extreme Edition 840
- 3.2 GHz dual core
- 230M transistors
- 90 nm process
- 206 mm^2
- 2 x 1 MB cache
- 25.6 GFlops
GeForce 7800 GTX
- 430 MHz
- 302M transistors
- 110 nm process
- 326 mm^2
- 313 GFlops (shader)
- 1.3 TFlops (total)

2006: Comparison to CPU
Intel Core 2 Extreme X6800
- 2.93 GHz dual core
- 291M transistors
- 65 nm process
- 143 mm^2
- 4 MB cache
- 23.2 GFlops
GeForce 7900 GTX
- 650 MHz
- 278M transistors
- 90 nm process
- 196 mm^2
- 477 GFlops (shader)
- 2.1 TFlops (total)

GigaFlops Imbalance
[Chart: theoretical programmable IEEE 754 single-precision GigaFlops]
Future NVIDIA GPU directions
- DirectX 10 feature set
  - Massive graphics functionality upgrade
- Language and tool support
  - Performance tuning and content development
- Improved GPGPU
  - Harness the bandwidth & GFlops for non-graphics
- Multi-GPU systems innovation
  - Next-generation SLI

DirectX 10-class GPU functionality
- Generalized programmability, including
  - Integer instructions
  - Efficient branching
  - Texture size queries, unfiltered texel fetches, & offset fetches
  - Shadow cube maps for omni-directional shadowing
  - Sourcing constants from bind-able buffer objects
- Per-primitive programmable processing
  - Emits zero or more strips of triangles/points/lines
  - New line and triangle adjacency primitives
  - Output to multiple viewports and buffers

Per-primitive processing example: automatic silhouette edge rendering
- Emit the edge shared by adjacent triangles that face opposite directions
- New triangle adjacency primitive = 3 conventional vertices + 3 vertices for adjacent triangles
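The silhouette test described on this slide can be sketched on the CPU. The code below is an illustrative Python emulation, not geometry-program code: the vertex layout (`adjacent[i]` is the far vertex of the neighbor across edge `i`), the counter-clockwise winding convention, and all function names are assumptions made for this example. An edge is emitted only from the viewer-facing triangle, so a shared edge is not emitted twice.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def faces_viewer(p0, p1, p2, eye):
    """True if the counter-clockwise triangle (p0, p1, p2) faces the eye point."""
    normal = cross(sub(p1, p0), sub(p2, p0))
    return dot(normal, sub(eye, p0)) > 0.0


def silhouette_edges(triangle, adjacent, eye):
    """Emit zero or more edges for one triangle-with-adjacency primitive.

    triangle: three vertices in counter-clockwise order.
    adjacent: adjacent[i] is the third vertex of the neighbor sharing edge
              (triangle[i], triangle[(i + 1) % 3]), or None if there is none.
    """
    edges = []
    if not faces_viewer(triangle[0], triangle[1], triangle[2], eye):
        return edges                      # only front-facing triangles emit
    for i in range(3):
        a, b = triangle[i], triangle[(i + 1) % 3]
        far = adjacent[i]
        if far is None:
            continue                      # boundary edges are skipped here
        # The neighbor traverses the shared edge in the opposite direction.
        if not faces_viewer(b, a, far, eye):
            edges.append((a, b))
    return edges


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
adj = ((0.5, 0.5, -1.0),   # neighbor folds away from the viewer
       (1.0, 1.0, 0.0),    # neighbor is coplanar and faces the viewer
       None)               # no neighbor on this edge
print(silhouette_edges(tri, adj, eye=(0.2, 0.2, 5.0)))
```

Only the first edge is reported here, because its neighbor faces away from the eye while the triangle itself faces toward it.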
More DirectX 10-class GPU functionality
- Better blending
  - Improved blending control for multiple draw buffers
  - sRGB and 32-bit floating-point framebuffer blending
- Streamed output of vertex processing to buffers
  - Render to vertex array
- Texture improvements
  - Indexing into an “array” of 2D textures
  - Improved render-to-texture
  - Luminance-alpha compressed formats
  - Compact High Dynamic Range texture formats
  - Integer texture formats
  - 32-bit floating-point texture filtering

Uses of DirectX 10 functionality
- GPU Marching Cubes
- GPU Cloth
- Styled Line Drawing
- Deformable Collisions
- Sparkling Sprites
- Table-free Noise
- Deep Waves
- GPU Fluid Simulation

DirectX 10-class functionality parity
- Feature parity
  - DirectX 10-class features available via OpenGL
  - Cross-API portability of programmable shading content through Cg
- Performance parity
  - 3D API-agnostic performance parity on all Windows operating systems
- System support parity
  - Linux, Mac, FreeBSD, Solaris
  - Shared code base for drivers
 
Multi-GPU Support
- Original SLI was just the beginning
- Quad-SLI
- SLI support infuses all NVIDIA product design and development
- New SLI APIs for application control of multiple GPUs
- SLI for notebooks
- Better thermals and power

Hardware Unit | GeForce 7900 GTX | GeForce 7900 GTX Quad SLI
Vertex Cores | 8 | 32
Fragment Cores | 24 | 96
Raster Color Cores + Raster Depth Cores | 16+16 | 64+64
Cg: C for Graphics
Cg: C for Graphics
- Cg as it exists today
  - High-level, inspired mostly by C
  - Graphics focused
  - API-independent: GLSL is tied to OpenGL; HLSL is tied to Direct3D; Cg works for both
  - Platform-independent: Cg works on PlayStation 3, ATI, NVIDIA, Linux, Solaris, Mac OS X, Windows, etc.
- Production language and system
  - Cg 1.5 is part of 3D content creation tool chains
  - Portability of Cg shaders is important

Evolution of Cg
- General-purpose languages: C (AT&T, 1970’s), C++ (AT&T, 1983), Java (Sun, 1994)
- Graphics Application Program Interfaces: IRIS GL (SGI, 1982), OpenGL (ARB, 1992), Reality Lab (RenderMorphics, 1994), Direct3D (Microsoft, 1995)
- Shading languages: RenderMan (Pixar, 1988), PixelFlow Shading Language (UNC, 1998), Real-Time Shading Language (Stanford, 2001), Cg / HLSL (NVIDIA/Microsoft, 2002)

Cg 1.5
- Current release of Cg
  - Supports Windows, Linux, Mac (including x86 Macs), and now Solaris
  - Shader Model 3.0 profiles for Direct3D 9.0c
  - Matches Sony’s PlayStation 3 Cg support
  - Tool chain support: FX Composer 2.0
- New functionality
  - Procedural effects generation
  - Combined programs for multiple domains
  - New GLSL profiles to compile Cg to GLSL
  - Improved compiler optimization

FX Composer for Cg shader authoring
- Shaders are assets
- Portability matters
- So express shaders in a multi-platform, multi-API language
- That’s Cg
Future: Modernizing Cg
- Opportunity to re-think the Cg language
  - Experience-driven
  - Shader writing was programming-in-the-small, but not anymore!
  - Provide better abstraction mechanisms
  - Must be backward compatible
- Challenge: instead of inventing yet another shading-language-specific keyword, think about how a C++ programmer would express the feature
  - Think templates and classes

Cg Directions
- DirectX 10-class feature support
  - Primitive (geometry) programs
  - Constant buffers
  - Interpolation modes
  - Read-write index-able temporaries
  - New texture targets: texture arrays, shadow cube maps
- Incorporate established C++ features, for example
  - Classes
  - Templates
  - Operator overloading
  - But not runtime features like new/delete, RTTI, or exceptions

Why C++?
- Already the inspiration for much of Cg
  - Think of Cg’s first-class vectors simply as classes
- Functionality in C++ is well-understood and popular
- C++ is biased towards compile-time abstraction
  - Rather than the more run-time focus of Java and C#
- Compile-time abstraction is good, since GPUs lack the run-time support for heaps, garbage collection, exceptions, and run-time polymorphism
Logical Programmable Graphics Pipeline
[Pipeline diagram; stages in order]
1. 3D Application or Game
2. 3D API: OpenGL or Direct3D
3. Driver
   (CPU – GPU boundary)
4. GPU Front End
5. Programmable Vertex Processor
6. Primitive Assembly
7. Rasterization & Interpolation
8. Programmable Fragment Processor
9. Raster Operations
10. Framebuffer
Data labels in the diagram: 3D API Commands; GPU Command & Data Stream; Vertex Index Stream; Pre-transformed Vertices; Transformed Vertices; Assembled Polygons, Lines, and Points; Pixel Location Stream; Rasterized Pre-transformed Fragments; Transformed Fragments; Pixel Updates
Programmable domains: vertex and fragment

Future Logical Programmable Graphics Pipeline
[Pipeline diagram; stages in order]
1. 3D Application or Game
2. 3D API: OpenGL or Direct3D
3. Driver
   (CPU – GPU boundary)
4. GPU Front End
5. Programmable Vertex Processor
6. Primitive Assembly
7. Programmable Primitive Processor (input: assembled polygons, lines, and points; output: assembled polygons, lines, and points)
8. Rasterization & Interpolation
9. Programmable Fragment Processor
10. Raster Operations
11. Framebuffer
Data labels in the diagram: 3D API Commands; GPU Command & Data Stream; Vertex Index Stream; Pre-transformed Vertices; Transformed Vertices; Pixel Location Stream; Rasterized Pre-transformed Fragments; Transformed Fragments; Pixel Updates
New per-primitive “geometry” programmable domain
Pass Through Geometry Program Example

    BufferInit<float4,6> flatColor;

    TRIANGLE void passthru(AttribArray<float4> position : POSITION,
                           AttribArray<float4> texCoord : TEXCOORD0)
    {
      flatAttrib(flatColor:COLOR);
      for (int i=0; i<position.length; i++) {
        emitVertex(position[i], texCoord[i]);
      }
    }

- AttribArray: the primitive’s attributes arrive as “templated” attribute arrays
- position.length: the length of the attribute arrays depends on the input primitive mode, 3 for TRIANGLE
- emitVertex: bundles a vertex based on parameter values and semantics
- flatAttrib: makes sure flat attributes are associated with the proper provoking vertex convention
- BufferInit<float4,6>: flatColor is initialized from constant buffer 6

Depth peeling
- Brute-force order-independent transparency algorithm [Everitt 2001]
- Approach
  - Render transparent objects repeatedly
  - Each pass peels a successive color layer using dual depth buffers
  - Composite peeled layers in order
- Caveats
  - Typically makes a “thin film” assumption
  - No refraction or scattering
Transparency: The Good vs. the Ugly
Peeled layer visualization composite layers Fragment count per layer
Another example: The Good vs. the Ugly
Another peeled layer visualization composite layers Fragment count per layer
Real-time transparency demo
Depth peeling: How it works
- Conventional depth buffer after rendering
  - Color buffer has the color of the closest fragment
  - Depth buffer has the depth of the closest fragment
- Re-use the depth buffer!
  - Make the depth buffer into a shadow map
  - Clear a 2nd depth buffer
  - Discard fragments if the fragment depth is closer than the corresponding pixel’s depth in the shadow map
  - Save the color buffer for compositing
- Repeat this with the current depth buffer to peel another layer
  - The prior depth buffer works as a “back stop” for the next pass
  - Discard fragments closer than or as close as the last pass, for every pixel
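The per-pixel logic of these passes can be emulated on the CPU. The following is a minimal Python sketch for a single pixel, not GPU code: the fragment tuple layout, the function name, and the use of a plain list in place of rendering passes are assumptions made for illustration. Each loop iteration stands in for one rendering pass, with `back_stop` playing the role of the prior pass's depth buffer.

```python
import math


def peel_layers(fragments, max_layers):
    """Peel one pixel's fragments into front-to-back layers.

    fragments: list of (depth, color) tuples in arbitrary submission order.
    Returns at most max_layers fragments, nearest first.
    """
    layers = []
    back_stop = -math.inf                 # nothing peeled yet
    for _ in range(max_layers):           # one iteration = one rendering pass
        nearest = None
        for fragment in fragments:
            depth = fragment[0]
            if depth <= back_stop:
                continue                  # closer than or as close as last pass
            if nearest is None or depth < nearest[0]:
                nearest = fragment        # ordinary "closest wins" depth test
        if nearest is None:
            break                         # no fragments left to peel
        layers.append(nearest)
        back_stop = nearest[0]            # this pass becomes the next back stop
    return layers


pixel = [(0.7, "blue"), (0.2, "red"), (0.5, "green")]
print(peel_layers(pixel, max_layers=4))
```

Note that fragments at exactly the same depth collapse into one layer in this sketch, because the back-stop test discards everything at or in front of the previous layer.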
Optimizations for real-time depth peeling
- Optimizations
  - Render-to-texture to ping-pong between 2 back stop depth buffers (no depth buffer copies)
  - Shadow mapping for the 2nd read-only “back stop” depth buffer
  - Asynchronous occlusion queries to determine the fragments still being peeled
  - Threshold to stop peeling
  - Smart front-to-back (“under”) compositing
- Result: 120+ fps for peeling and compositing up to 14 layers as needed
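Front-to-back ("under") compositing accumulates color and opacity as each nearer-first layer arrives, which is why it pairs well with peeling. The Python sketch below illustrates the standard under operator and checks it against back-to-front "over" blending on a black background. The early-out on accumulated opacity is an assumption added for illustration; it is not necessarily the stopping threshold the slide has in mind, which may instead be based on occlusion-query fragment counts.

```python
def composite_under(layers_front_to_back, opacity_threshold=1.0):
    """Accumulate (rgb, alpha) layers front to back with the 'under' operator.

    Returns (rgb, alpha, layers_used). Stops early once the accumulated
    opacity reaches opacity_threshold (an illustrative early-out).
    """
    color = (0.0, 0.0, 0.0)
    alpha = 0.0
    used = 0
    for rgb, layer_alpha in layers_front_to_back:
        if alpha >= opacity_threshold:
            break                                   # nothing behind is visible
        weight = (1.0 - alpha) * layer_alpha        # what still shows through
        color = tuple(c + weight * s for c, s in zip(color, rgb))
        alpha += weight
        used += 1
    return color, alpha, used


def composite_over(layers_back_to_front):
    """Reference result: conventional 'over' blending onto a black background."""
    color = (0.0, 0.0, 0.0)
    for rgb, layer_alpha in layers_back_to_front:
        color = tuple(layer_alpha * s + (1.0 - layer_alpha) * c
                      for c, s in zip(color, rgb))
    return color


layers = [((1.0, 0.0, 0.0), 0.5),   # nearest layer
          ((0.0, 1.0, 0.0), 0.5),
          ((0.0, 0.0, 1.0), 1.0)]   # farthest layer, opaque
under_color, under_alpha, used = composite_under(layers)
print(under_color, under_alpha, used)
print(composite_over(list(reversed(layers))))
```

Both orders give the same color for this input, but the under form can stop as soon as the pixel is effectively opaque, without ever needing the layers behind it.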
So is transparency a solved problem?
- Bounding the error
  - Assume a lower bound on the opacity of objects, and an upper bound on the layers peeled
  - worstCaseError = (1 - minOpacity)^maxLayers
  - Example: 20% minimum opacity with 15 peeled layers
    - Remaining potential transparency could be off by just 3.5% if looking through 15 layers of 20% opacity (worst possible case)
  - Typical cases are much, much better than that
    - An occlusion query can provide a count of mis-ordered pixels
- Arguably it could be, for a certain class of transparency
  - Mostly opaque scenes with thin film transparency, like windows
  - CAD models made of virtual Jell-O®

Conclusions
- NVIDIA GPUs
  - Expect more compute and bandwidth increases >> CPUs
  - DirectX 10 = large functionality upgrade for graphics
- Cg, the only cross-API, multi-platform language for programmable shading
  - Think of shaders as content, not GPU programs trapped inside applications
- Depth peeling
  - Harnessing the GPU’s brute force for transparency
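The bound on this slide is easy to check numerically. The short Python sketch below evaluates worstCaseError = (1 - minOpacity)^maxLayers for the slide's example and also inverts the formula to ask how many layers a given error target needs. The function names and the inverse calculation are additions made here for illustration.

```python
import math


def worst_case_error(min_opacity, max_layers):
    """Fraction of light that can still pass after max_layers peeled layers."""
    return (1.0 - min_opacity) ** max_layers


def layers_for_error(min_opacity, target_error):
    """Smallest layer count whose worst-case error is at or below target_error."""
    return math.ceil(math.log(target_error) / math.log(1.0 - min_opacity))


# Slide example: 20% minimum opacity, 15 peeled layers.
print(f"{worst_case_error(0.20, 15):.1%}")   # 3.5%
# Layers needed to bound the error at 5% with the same minimum opacity.
print(layers_for_error(0.20, 0.05))          # 14
```

The bound shrinks geometrically with the layer count, so raising the minimum opacity even slightly reduces the number of passes needed for a given error target.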

NVIDIA Graphics, Cg, and Transparency

  • 1.  
  • 2. NVIDIA Graphics, Cg, and Transparency Mark Kilgard Graphics Software Engineer NVIDIA Corporation GPU Shading and Rendering Course 3 July 30, 2006
  • 3. Outline NVIDIA graphics hardware seven years for GeForce + the future Cg—C for Graphics the cross-platform GPU programming language Depth peeling out-of-order transparency now practical
  • 4. Seven Years of GeForce DX9c 2.0 Transparency antialiasing, quad-GPU GeForce 7800 GTX 2005 DX9c 2.0 Vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending, dual-GPU GeForce 6800 Ultra 2004 DX9c 2.1 Single-board dual-GPU, process efficiency GeForce 7900 GTX 2006 DX9 1.5 Vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color & depth compression GeForce FX 2003 DX8.1 1.4 Early Z culling, dual-monitor GeForce4 Ti 4600 2002 DX8 1.4 Programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries GeForce3 2001 DX7 1.3 Hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering GeForce 256 2000 Direct3D Version OpenGL Version New Features Product
  • 5. 2006: the GeForce 7900 GTX board 512MB/256-bit GDDR3 1600 MHz effective 8 pieces of 8Mx32 16x PCI-Express DVI x 2 sVideo TV Out SLI Connector
  • 6. 2006: the GeForce 7900 GTX GPU 278 million transistors 650 MHz core clock 1,600 MHz GDDR3 effective memory clock 256-bit memory interface Notable Functionality Non-power-of-two textures with mipmaps Floating-point (fp16) blending and filtering sRGB color space texture filtering and frame buffer blending Vertex textures 16x anisotropic texture filtering Dynamic vertex and fragment branching Double-rate depth/stencil-only rendering Early depth/stencil culling Transparency antialiasing
  • 7. 2006: GeForce 7950 GX2, SLI-on-a-card DVI x 2 sVideo TV Out 16x PCI-Express Sandwich of two printed circuit boards Two GeForce 7 Series GPUs 500 Mhz core 1 GB video memory 512 MB per GPU 1,200 Mhz effective Effective 512-bit memory interface!
  • 8. GeForce Peak Vertex Processing Trends Millions of vertices per second Vertex units 1 1 2 3 6 8 8 2 ×8 rate for trivial 4x4 vertex transform exceeds peak setup rates—allows excess vertex processing Assumes Alternate Frame Rendering (AFR) SLI Mode
  • 9. GeForce Peak Triangle Setup Trends Millions of triangles per second assumes 50% face culling Assumes Alternate Frame Rendering (AFR) SLI Mode
  • 10. GeForce Peak Memory Bandwidth Trends Gigabytes per second Two physical 256-bit memory interfaces 128-bit interface 256-bit interface
  • 11. Effective GPU Memory Bandwidth Compression schemes Lossless depth and color (when multisampling) compression Lossy texture compression (S3TC / DXTC) Typically assumes 4:1 compression Avoid useless work Early killing of fragments (Z cull) Avoid useless blending and texture fetches Very clever memory controller designs Combining memory accesses for improved coherency Caches for texture fetches
  • 12. NVIDIA Graphics Core and Memory Clock Rates Megahertz (Mhz) DDR memory transition—memory rates double physical clock rate
  • 13. GeForce Peak Texture Fetch Trends Millions of texture fetches per second Texture units 2 ×4 2×4 2×4 2×4 16 24 24 2 ×24 assuming no texture cache misses
  • 14. GeForce Peak Depth/Stencil-only Fill assuming no read-modify-write Millions of depth/stencil pixel updates per second double speed depth-stencil only
  • 15. GeForce Transistor Count and Semiconductor Process Millions of transistors Process ( nm ) 180 180 150 130 130 110 90 90 More performance with fewer transistors: Architectural & process efficiency!
  • 16. GeForce 7900 GTX Parallelism Triangle Setup/Raster Shader Instruction Dispatch Fragment Crossbar Memory Partition Memory Partition Memory Partition Memory Partition Z-Cull 8 Vertex Engines 24 Fragment Shaders 16 Raster Operation Pipelines
  • 17. 16 GeForce FX 5900 GeForce 6800 Ultra Vertex Fragment 2 nd Texture Fetch 3 6 4+4 Raster Color Raster Depth 4+4 16+16 GeForce 7900 GTX Hardware Unit 8 24 16+16
  • 18. 2005: Comparison to CPU Pentium Extreme Edition 840 3.2 GHz Dual Core 230M Transistors 90nm process 206 mm^2 2 x 1MB Cache 25.6 GFlops GeForce 7800 GTX 430 MHz 302M Transistors 110nm process 326 mm^2 313 GFlops (shader) 1.3 TFlops (total)
  • 19. 2006: Comparison to CPU Intel Core 2 Extreme X6800 2.93 GHz Dual Core 291M Transistors 65nm process 143 mm^2 4MB Cache 23.2 GFlops GeForce 7900 GTX 650 MHz 278M Transistors 90nm process 196 mm^2 477 GFlops (shader) 2.1 TFlops (total)
  • 20. Giga Flops Imbalance Theoretical programmable IEEE 754 single-precision Giga Flops
  • 21. Future NVIDIA GPU directions DirectX 10 feature set Massive graphics functionality upgrade Language and tool support Performance tuning and content development Improved GPGPU Harness the bandwidth & Gflops for non-graphics Multi-GPU systems innovation Next-generation SLI
  • 22. DirectX 10-class GPU functionality Generalized programmability, including Integer instructions Efficient branching Texture size queries, unfiltered texel fetches, & offset fetches Shadow cube maps for omni-directional shadowing Sourcing constants from bind-able buffer objects Per-primitive programmable processing Emits zero or more strips of triangles/points/lines New line and triangle adjacency primitives Output to multiple viewports and buffers
  • 23. Per-primitive processing example: Automatic silhouette edge rendering emit edge of adjacent triangles that face opposite directions New triangle adjacency primitive = 3 conventional vertices + 3 vertices for adjacent triangles
  • 24. More DirectX 10-class GPU functionality Better blending Improved blending control for multiple draw buffers sRGB and 32-bit floating-point framebuffer blending Streamed output of vertex processing to buffers Render to vertex array Texture improvements Indexing into an “array” of 2D textures Improved render-to-texture Luminance-alpha compressed formats Compact High Dynamic Range texture formats Integer texture formats 32-bit floating-point texture filtering
  • 25. Uses of DirectX 10 functionality GPU Marching Cubes GPU Cloth Styled Line Drawing Deformable Collisions Sparkling Sprites Table-free Noise Deep Waves GPU Fluid Simulation
  • 26. DirectX 10-class functionality parity Feature parity DirectX 10-class features available via OpenGL Cross API portability of programmable shading content through Cg Performance parity 3D API agnostic performance parity on all Windows operating systems System support parity Linux, Mac, FreeBSD, Solaris Shared code base for drivers
  • 27.  
  • 28. Multi-GPU Support Original SLI was just the beginning Quad-SLI SLI support infuses all NVIDIA product design and development New SLI APIs for application-control of multiple GPUs SLI for notebooks Better thermals and power
  • 29. Vertex Cores Fragment Cores Raster Color Cores Raster Depth Cores GeForce 7900 GTX Hardware Unit 8 24 16+16 GeForce 7900 GTX Quad SLI 32 96 64+64
  • 30. Cg: C for Graphics
  • 31. Cg: C for Graphics Cg as it exists today High-level, inspired mostly by C Graphics focused API-independent GLSL tied to OpenGL; HLSL tied to Direct3D; Cg works for both Platform-independent Cg works on PlayStation 3, ATI, NVIDIA, Linux, Solaris, Mac OS X, Windows, etc. Production language and system Cg 1.5 is part of 3D content creation tool chains Portability of Cg shaders is important
  • 32. Evolution of Cg C (AT&T, 1970’s) C++ (AT&T, 1983) Java (Sun, 1994) RenderMan (Pixar, 1988) PixelFlow Shading Language (UNC, 1998) Real-Time Shading Language (Stanford, 2001) Cg / HLSL (NVIDIA/Microsoft, 2002) IRIS GL (SGI, 1982) OpenGL (ARB, 1992) Direct3D (Microsoft, 1995) Reality Lab (RenderMorphics, 1994) General-purpose languages Graphics Application Program Interfaces Shading Languages
  • 33. Cg 1.5 Current release of Cg Supports Windows, Linux, Mac (including x86 Macs) + now Solaris Shader Model 3.0 profiles for Direct3D 9.0c Matches Sony’s PlayStation 3 Cg support Tool chain support: FX Composer 2.0 New functionality Procedural effects generation Combined programs for multiple domains New GLSL profiles to compile Cg to GLSL Improved compiler optimization
  • 34. FX Composer for Cg shader authoring Shaders are assets Portability matters So express shaders in a multi-platform, multi-API language That’s Cg
  • 35. Future: Modernizing Cg Opportunity to re-think the Cg language Experience-driven Shader writing was programming-in-the-small But not anymore! Provide better abstraction mechanisms Must be backward compatible Challenge : Instead of inventing yet-another shading language-specific keyword, think how a C++ programmer express the feature Think templates and classes
  • 36. Cg Directions DirectX 10-class feature support Primitive (geometry) programs Constant buffers Interpolation modes Read-write index-able temporaries New texture targets: texture arrays, shadow cube maps Incorporate established C++ features, examples: Classes Templates Operator overloading But not runtime features like new/delete, RTTI, or exceptions
  • 37. Why C++? Already inspiration for much of Cg Think of Cg’s first-class vectors simply as classes Functionality in C++ is well-understood and popular C++ is biased towards compile-time abstraction Rather than more run-time focus of Java and C# Compile-time abstraction is good since GPUs lack the run-time support for heaps, garbage collection, exceptions, and run-time polymorphism
  • 38. Logical Programmable Graphics Pipeline [diagram] Stages: 3D Application or Game; 3D API: OpenGL or Direct3D Driver; CPU – GPU Boundary; GPU Front End; Programmable Vertex Processor; Primitive Assembly; Rasterization & Interpolation; Programmable Fragment Processor; Raster Operations; Framebuffer. Data flow labels: 3D API Commands; GPU Command & Data Stream; Vertex Index Stream; Pre-transformed Vertices; Transformed Vertices; Assembled Polygons, Lines, and Points; Pixel Location Stream; Rasterized Pre-transformed Fragments; Transformed Fragments; Pixel Updates. Caption: Program vertex and fragment domains
  • 39. Future Logical Programmable Graphics Pipeline [diagram] Stages: 3D Application or Game; 3D API: OpenGL or Direct3D Driver; CPU – GPU Boundary; GPU Front End; Programmable Vertex Processor; Primitive Assembly; Programmable Primitive Processor; Rasterization & Interpolation; Programmable Fragment Processor; Raster Operations; Framebuffer. Data flow labels: 3D API Commands; GPU Command & Data Stream; Vertex Index Stream; Pre-transformed Vertices; Transformed Vertices; Input assembled Polygons, Lines, and Points; Output assembled Polygons, Lines, and Points; Pixel Location Stream; Rasterized Pre-transformed Fragments; Transformed Fragments; Pixel Updates. Caption: New per-primitive “geometry” programmable domain
  • 40. Pass Through Geometry Program Example
      BufferInit<float4,6> flatColor;
      TRIANGLE void passthru(AttribArray<float4> position : POSITION,
                             AttribArray<float4> texCoord : TEXCOORD0)
      {
        flatAttrib(flatColor:COLOR);
        for (int i=0; i<position.length; i++) {
          emitVertex(position[i], texCoord[i]);
        }
      }
    Slide annotations: Primitive’s attributes arrive as “templated” attribute arrays. Length of attribute arrays depends on the input primitive mode, 3 for TRIANGLE. emitVertex bundles a vertex based on parameter values and semantics. flatAttrib makes sure flat attributes are associated with the proper provoking vertex convention. flatColor initialized from constant buffer 6.
  • 41. Depth peeling Brute force order-independent transparency algorithm [Everitt 2001] Approach Render transparent objects repeatedly Each pass peels successive color layer using dual-depth buffers Composite peeled layers in order Caveats Typically makes “thin film” assumption No refraction or scattering
  • 42. Transparency: The Good vs. the Ugly
  • 43. Peeled layer visualization composite layers Fragment count per layer
  • 44. Another example: The Good vs. the Ugly
  • 45. Another peeled layer visualization composite layers Fragment count per layer
  • 47. Depth peeling: How it works Conventional depth buffer after rendering Color buffer has color of closest fragment Depth buffer has depth of closest fragment Re-use the depth buffer! Make depth buffer into a shadow map Clear a 2nd depth buffer Discard fragments if fragment depth is closer than corresponding pixel’s depth in shadow map Save color buffer for compositing Repeat this with current depth buffer to peel another layer Prior depth buffer works as “back stop” for next pass Discard fragments closer or as close as last pass for every pixel
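The peeling loop on this slide can be illustrated with a small CPU simulation of a single pixel. This is only a sketch under stated assumptions: the function name `peel_layers`, the `(depth, color)` tuple layout, and the use of a Python list in place of GPU depth buffers are all hypothetical and not taken from the talk. On a GPU, each pass is a full render of the transparent geometry with the prior pass's depth buffer bound as a shadow map.

```python
# CPU sketch of depth peeling for one pixel (illustrative only; all names
# here are hypothetical and not part of any NVIDIA API).

def peel_layers(fragments, max_layers):
    """Return this pixel's fragments nearest-first, one per peeling pass.

    fragments: list of (depth, color) tuples in arbitrary submission order.
    max_layers: upper bound on the number of peeling passes.
    """
    layers = []
    back_stop = float("-inf")  # depth peeled by the previous pass
    for _ in range(max_layers):
        # "Shadow map" test: discard fragments closer than or as close as
        # the depth recorded by the last pass.
        survivors = [f for f in fragments if f[0] > back_stop]
        if not survivors:
            break  # nothing left to peel at this pixel
        # The conventional depth test keeps the nearest surviving fragment.
        nearest = min(survivors, key=lambda f: f[0])
        layers.append(nearest)
        back_stop = nearest[0]  # becomes the back stop for the next pass
    return layers


fragments = [(0.7, "blue"), (0.2, "red"), (0.5, "green")]
print(peel_layers(fragments, max_layers=4))
```

Each pass recovers the next-nearest layer regardless of the order in which fragments were submitted, which is what makes the result order-independent.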
  • 48. Optimizations for real-time depth peeling Optimizations Render-to-texture to ping-pong between 2 back stop depth buffers (no depth buffer copies) Shadow mapping for 2nd read-only “back stop” depth buffer Asynchronous occlusion queries to determine fragments still being peeled Threshold to stop peeling Smart front-to-back (“under”) compositing Result: 120+ fps depth peeling for peeling and compositing up to 14 layers as needed
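Front-to-back ("under") compositing lets each peeled layer be blended as soon as it is produced, nearest layer first. The sketch below assumes the standard under operator with non-premultiplied colors; the slide does not spell out the blend equation, and the name `composite_under` and the `(rgb, alpha)` layout are hypothetical.

```python
# Front-to-back "under" compositing for one pixel (a sketch assuming the
# standard under operator; names are hypothetical).

def composite_under(layers, background):
    """Blend nearest-first (rgb, alpha) layers over an opaque background.

    rgb values are non-premultiplied tuples with components in [0, 1].
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer layers
    for rgb, alpha in layers:
        weight = transmittance * alpha  # this layer's visible contribution
        color = [c + weight * s for c, s in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    # Whatever light remains shows the background.
    return [c + transmittance * b for c, b in zip(color, background)]


# Half-transparent red in front of half-transparent green, over blue.
layers = [((1.0, 0.0, 0.0), 0.5), ((0.0, 1.0, 0.0), 0.5)]
print(composite_under(layers, background=(0.0, 0.0, 1.0)))  # [0.5, 0.25, 0.25]
```

The running `transmittance` value is also a natural quantity to inspect when deciding that further layers would contribute too little to matter, though the talk's own stopping threshold may be defined differently.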
  • 49. So is transparency a solved problem? Bounding the error Assume a lower bound on opacity of objects … and an upper bound on layers peeled worstCaseError = (1 - minOpacity)^maxLayers Example: 20% min. opacity with 15 peeled layers Remaining potential transparency could be off by just 3.5% if looking through 15 layers of 20% opacity (worst possible case) Typical cases are much, much better than that As occlusion query can provide a count of mis-ordered pixels Arguably could be for a certain class of transparency Mostly opaque scenes with thin film transparency like windows CAD models made of virtual Jell-O ®
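The 3.5% figure follows from reading the bound as (1 - minOpacity) raised to the power maxLayers: 0.8 to the 15th power is about 0.035. The short check below reproduces that arithmetic. The helper `layers_needed`, which inverts the bound for a chosen tolerance, is a hypothetical addition and not something the slide presents.

```python
import math


def worst_case_error(min_opacity, max_layers):
    """Light that could still pass after max_layers layers of min_opacity."""
    return (1.0 - min_opacity) ** max_layers


def layers_needed(min_opacity, tolerance):
    """Smallest layer count whose worst-case error is within tolerance.

    Hypothetical helper: solves (1 - min_opacity)**n <= tolerance for n.
    """
    return math.ceil(math.log(tolerance) / math.log(1.0 - min_opacity))


print(f"{worst_case_error(0.20, 15):.1%}")  # 3.5%
print(layers_needed(0.20, 0.05))            # 14
```

Because the bound shrinks geometrically, raising the minimum opacity cuts the number of layers needed quickly.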
  • 50. Conclusions NVIDIA GPUs Expect more compute and bandwidth increases >> CPUs DirectX 10 = large functionality upgrade for graphics Cg, the only cross-API, multi-platform language for programmable shading Think of shaders as content, not GPU programs trapped inside applications Depth peeling Harnessing the GPU’s brute force for transparency

Editor's Notes