Parallel Graphics in Frostbite – Current & Future
Johan Andersson
DICE

Menu
  • Game engine CPU & GPU parallelism
  • Rendering techniques & systems – old & new
  • Mixed in with some future predictions & wishes

Quick background
  • Frostbite 1.x [1][2][3]
      • Xbox 360, PS3, DX10
      • Battlefield: Bad Company (shipped)
      • Battlefield 1943 (shipped)
      • Battlefield: Bad Company 2
  • Frostbite 2 [4][5]
      • In development
      • Xbox 360, PS3
      • DX11 (10.0, 10.1, 11)
  • Disclaimer: Unless specified, pictures are from engine tests, not actual games

Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Job-based parallelism
  • Must utilize all cores in the engine
      • Xbox 360: 6 HW threads
      • PS3: 2 HW threads + 6 great SPUs
      • PC: 2-8 HW threads (and many more coming)
  • Divide up systems into jobs
      • Async function calls with explicit inputs & outputs
      • Typically fully independent stateless functions
      • Makes it easier on PS3 SPU & in general
  • Job dependencies create job graph
  • All cores consume jobs
  • CELL processor – We like

Frostbite CPU job graph
  • Figure: Frame job graph from Frostbite 1 (PS3)
  • Build big job graphs
      • Batch, batch, batch
      • Mix CPU- & SPU-jobs
      • Future: Mix in low-latency GPU-jobs
  • Job dependencies determine:
      • Execution order
      • Sync points
      • Load balancing
      • I.e. the effective parallelism
  • Braided Parallelism* [6]
      • Intermixed task- & data-parallelism
  * Still only 10 hits on Google (yet!), but I like Aaron’s term

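How job dependencies determine execution order can be sketched with a tiny single-threaded scheduler. This is only an illustration under assumptions: `Job`, `dependsOn` and `runJobGraph` are hypothetical names, not Frostbite's API, and a real engine would hand the ready queue to worker threads and SPUs instead of draining it on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical job: a function plus the indices of the jobs it depends on.
struct Job {
    std::string name;
    std::function<void()> run;
    std::vector<std::size_t> dependsOn;
};

// Runs every job after all of its dependencies (Kahn's topological order)
// and returns the job names in the order they were executed.
std::vector<std::string> runJobGraph(const std::vector<Job>& jobs) {
    std::vector<std::size_t> pending(jobs.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(jobs.size());
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        pending[i] = jobs[i].dependsOn.size();
        for (std::size_t d : jobs[i].dependsOn) {
            assert(d < jobs.size());
            dependents[d].push_back(i);
        }
    }

    // Jobs without unfinished dependencies are ready to run.
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        if (pending[i] == 0) ready.push(i);
    }

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::size_t j = ready.front();
        ready.pop();
        if (jobs[j].run) jobs[j].run();
        order.push_back(jobs[j].name);
        // Finishing a job may unblock the jobs that depend on it.
        for (std::size_t k : dependents[j]) {
            if (--pending[k] == 0) ready.push(k);
        }
    }
    return order;
}
```

Jobs that become ready at the same time have no ordering constraint between them, which is where the parallelism in a real scheduler comes from.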
Rendering jobs
  • Rendering systems are heavily divided up into jobs
  • Jobs:
      • Terrain geometry processing
      • Undergrowth generation [2]
      • Decal projection [3]
      • Particle simulation
      • Frustum culling
      • Occlusion culling
      • Occlusion rasterization
      • Command buffer generation
      • PS3: Triangle culling
  • Most will move to GPU
      • Eventually.. A few have already!
      • Mostly one-way data flow
  • I will talk about a couple of these..

Parallel command buffer recording
  • Dispatch draw calls and state to multiple command buffers in parallel
      • Scales linearly with # cores
      • 1500-4000 draw calls per frame
  • Reduces latency & improves performance
  • Important for all platforms, used on:
      • Xbox 360
      • PS3 (SPU-based)
      • PC DX11
  • Previously not possible on PC, but now in DX11...

DX11 parallel dispatch
  • First class citizen in DX11
      • Killer feature for reducing CPU overhead & latency
      • ~90% of our rendering dispatch job time is in D3D/driver
  • DX11 deferred device context per core
      • Together with dynamic resources (cbuffer/vbuffer) for usage on that deferred context
  • Renderer has list of all draw calls we want to do for each rendering "layer" of the frame
  • Split draw calls for each layer into chunks of ~256 and dispatch in parallel to the deferred contexts
      • Each chunk generates a command list
  • Render to immediate context & execute command lists
  • Profit!
      • Goal: close to linear scaling up to octa-core when we get full DX11 driver support (up to the IHVs now)
  • Future note: This is "just" a stopgap measure until we evolve the GPU to be able to fully feed itself (hi LRB)

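The chunking step can be sketched on its own, without any D3D calls. This is a minimal sketch under assumptions: `ChunkRange` and `splitIntoChunks` are hypothetical names, and each returned range stands for the draw calls one job would record into its own command list on a deferred context.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of draw call indices recorded by one job.
struct ChunkRange {
    std::size_t begin;
    std::size_t end;
};

// Splits drawCount draw calls into consecutive chunks of at most chunkSize.
// Chunk order matches draw order, so executing the resulting command lists
// in chunk order preserves the original submission order.
std::vector<ChunkRange> splitIntoChunks(std::size_t drawCount, std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<ChunkRange> chunks;
    for (std::size_t begin = 0; begin < drawCount; begin += chunkSize) {
        chunks.push_back({begin, std::min(begin + chunkSize, drawCount)});
    }
    return chunks;
}
```

With the numbers from the slides, a layer of 1500 draw calls at 256 per chunk gives six chunks, the last one holding the remaining 220 draw calls.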
Occlusion culling
  • Problem: Buildings & env occlude large amounts of objects
  • Invisible objects still have to:
      • Update logic & animations
      • Generate command buffer
      • Processed on CPU & GPU
  • Difficult to implement full culling
      • Destructible buildings
      • Dynamic occludees
      • Difficult to precompute
      • GPU occlusion queries can be heavy to render
  • Figure: From Battlefield: Bad Company PS3

Our solution: Software occlusion rasterization

Software occlusion culling
  • Rasterize coarse zbuffer on SPU/CPU
      • 256x114 float
          • Good fit in SPU LS, but could be 16-bit
      • Low-poly occluder meshes
          • Manually conservative
          • 100 m view distance
          • Max 10000 vertices/frame
      • Parallel SPU vertex & raster jobs
      • Cost: a few milliseconds
  • Then cull all objects against zbuffer
      • Before passed to all other systems = big savings

Screen-space bounding-box test
  • Pictures & numbers from Battlefield: Bad Company PS3

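A screen-space bounding-box test against the coarse zbuffer might look like the sketch below. The buffer size comes from the slides; everything else is an assumption for illustration: the names are hypothetical, smaller depth means nearer, the buffer is cleared to 1.0, and `drawOccluderRect` is a stand-in for real occluder triangle rasterization.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kZBufferWidth = 256;
constexpr int kZBufferHeight = 114;

// Coarse depth buffer; smaller values are nearer, 1.0 means "nothing drawn".
struct CoarseZBuffer {
    std::vector<float> depth =
        std::vector<float>(kZBufferWidth * kZBufferHeight, 1.0f);
    float& at(int x, int y) { return depth[y * kZBufferWidth + x]; }
    float at(int x, int y) const { return depth[y * kZBufferWidth + x]; }
};

// Inclusive pixel rectangle plus the nearest depth of the object's bounds.
struct ScreenRect {
    int x0, y0, x1, y1;
    float nearestZ;
};

// Stand-in for occluder rasterization: writes a flat occluder rectangle.
void drawOccluderRect(CoarseZBuffer& zb, int x0, int y0, int x1, int y1, float z) {
    assert(x0 >= 0 && y0 >= 0 && x1 < kZBufferWidth && y1 < kZBufferHeight);
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            zb.at(x, y) = std::min(zb.at(x, y), z);
        }
    }
}

// An object is occluded only if every covered pixel holds an occluder that
// is nearer than the nearest point of the object's bounding box.
bool isOccluded(const CoarseZBuffer& zb, ScreenRect r) {
    r.x0 = std::max(r.x0, 0);
    r.y0 = std::max(r.y0, 0);
    r.x1 = std::min(r.x1, kZBufferWidth - 1);
    r.y1 = std::min(r.y1, kZBufferHeight - 1);
    if (r.x0 > r.x1 || r.y0 > r.y1) return false;  // off-screen: not our job
    for (int y = r.y0; y <= r.y1; ++y) {
        for (int x = r.x0; x <= r.x1; ++x) {
            if (zb.at(x, y) >= r.nearestZ) return false;  // visible here
        }
    }
    return true;
}
```

Testing the whole rectangle with the nearest depth keeps the test conservative: an object can be reported visible when it is actually hidden, but never the other way around.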
GPU occlusion culling
  • Ideally want GPU rasterization & testing, but:
      • Occlusion queries introduce overhead & latency
          • Can be manageable, but far from ideal
      • Conditional rendering only helps GPU
          • Not CPU, frame memory or draw calls
  • Future 1: Low-latency extra GPU exec. context
      • Rasterization and testing done on GPU where it belongs
      • Lockstep with CPU, need to read back data within a few ms
      • Should be possible on LRB (latency?), want on all HW
  • Future 2: Move entire cull & rendering to "GPU"
      • World rep., cull, systems, dispatch. End goal.

PS3 geometry processing
  • Problem: Slow GPU triangle & vertex setup on PS3
      • Combined with unique situation with powerful & initially not fully utilized "free" SPUs!
  • Solution: SPU triangle culling
      • Trade SPU time for GPU time
      • Cull all back faces, micro-triangles, out of frustum
      • Based on Sony’s PS3 EDGE library [7]
      • Also see Jon Olick’s talk from the course last year
  • 5 SPU jobs process frame geometry in parallel
  • Output is new index buffer for each draw call

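The "output is a new index buffer" idea can be sketched for the back-face case. This is not the EDGE implementation; it is a minimal sketch assuming already-projected 2D positions, counter-clockwise front faces with y pointing up, and hypothetical names. Micro-triangle and frustum culling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 {
    float x, y;
};

// Twice the signed area; positive for counter-clockwise winding (y up).
float signedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Returns a new index buffer that keeps only front-facing triangles.
// Back-facing and zero-area triangles are dropped, so the GPU never has
// to set them up.
std::vector<std::uint16_t> cullTriangles(const std::vector<Vec2>& positions,
                                         const std::vector<std::uint16_t>& indices) {
    assert(indices.size() % 3 == 0);
    std::vector<std::uint16_t> kept;
    kept.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const Vec2 a = positions[indices[i]];
        const Vec2 b = positions[indices[i + 1]];
        const Vec2 c = positions[indices[i + 2]];
        if (signedArea2(a, b, c) > 0.0f) {
            kept.push_back(indices[i]);
            kept.push_back(indices[i + 1]);
            kept.push_back(indices[i + 2]);
        }
    }
    return kept;
}
```

The vertex buffer is untouched; only the indices change, which is why the result can be fed straight back into the original draw call.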
Custom geometry processing
  • Software control opens up great flexibility and programmability!
  • Simple custom culling/processing that we’ve added:
      • Partition bounding box culling
      • Mesh part culling
      • Clip plane triangle trivial accept & reject
      • Triangle cull volumes (inverse clip planes)
  • Others are doing: Full skinning, morph targets, CLOD, cloth
  • Future wish: No forced/fixed vertex & geometry shaders
      • DIY compute shaders with fixed-func stages (tessellation and rasterization)
      • Software-controlled queuing of data between stages
          • To avoid always spilling out to memory

Decal projection
  • Traditionally a CPU process
      • Relying on identical visual & physics representation
      • Or duplicated mesh data in CPU memory (on PC)
  • Consoles read visual mesh data directly
      • UMA!
      • Project in SPU-jobs
      • Output VB/IB to GPU

Decals through GS & StreamOut
  • Keep the computation & data on the GPU (DX10)
      • See GDC’09 "Shadows & Decals – D3D10 techniques in Frostbite", slides with complete source code online [4]
  • Process all mesh triangles with Geometry Shader
      • Test decal projection against the triangles
      • Setup per-triangle clip planes for intersecting tris
      • Output intersecting triangles using StreamOut
  • Issues:
      • StreamOut management
      • Drivers (not your standard GS usage)
  • Benefits:
      • CPU & GPU worlds separate
      • No CPU memory or upload
      • Huge decals + huge meshes

Deferred lighting/shading
  • Traditional deferred shading:
      • Graphics pipeline rasterizes gbuffer for opaque surfaces
          • Normal, albedos, roughness
      • Light sources are rendered & accumulate lighting to a texture
          • Light volume or screen-space tile rendering
      • Combine shading & lighting for final output
  • Also see Wolfgang’s talk "Light Pre-Pass Renderer Mark III" from Monday for a wider description [8]

Screen-space tile classification
  • Divide screen up into tiles and determine how many & which light sources intersect each tile
  • Only apply the visible light sources on pixels in each tile
      • Reduced BW & setup cost with multiple lights in single shader
  • Used in Naughty Dog’s Uncharted [9] and SCEE PhyreEngine [10]
  • Hmm, isn’t light classification per screen-space tile sort of similar to how a compute shader can work with 2D thread groups?
      • Answer: YES, except a CS can do everything in a single pass!
  • Figure: From "The Technology of Uncharted", GDC’08 [9]

CS-based deferred shading
  • Deferred shading using DX11 CS
      • Experimental implementation in Frostbite 2
      • Not production tested or optimized
      • Compute Shader 5.0
      • Assumption: No shadows (for now)
  • New hybrid Graphics/Compute shading pipeline:
      • Graphics pipeline rasterizes gbuffers for opaque surfaces
      • Compute pipeline uses gbuffers, culls light sources, computes lighting & combines with shading
      • (multiple other variants also possible)

CS requirements & setup
  • Input data is gbuffers, depth buffer & light constants
  • Output is fully composited & lit HDR texture
  • 1 thread per pixel, 16x16 thread groups (aka tile)
  • Gbuffer pictures: Normal, Roughness, Diffuse Albedo, Specular Albedo

    Texture2D<float4> gbufferTexture1 : register(t0);
    Texture2D<float4> gbufferTexture2 : register(t1);
    Texture2D<float4> gbufferTexture3 : register(t2);
    Texture2D<float4> depthTexture : register(t3);

    RWTexture2D<float4> outputTexture : register(u0);

    #define BLOCK_SIZE 16

    [numthreads(BLOCK_SIZE,BLOCK_SIZE,1)]
    void csMain(
        uint3 groupId : SV_GroupID,
        uint3 groupThreadId : SV_GroupThreadID,
        uint groupIndex : SV_GroupIndex,
        uint3 dispatchThreadId : SV_DispatchThreadID)
    {
        ...
    }

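With one thread per pixel and 16x16 groups, the CPU side has to dispatch enough groups to cover the whole render target. A small sketch of that arithmetic, where `GroupCount` and `threadGroupsFor` are hypothetical names; the resulting counts would be what is passed to the D3D11 `Dispatch` call.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kBlockSize = 16;  // matches BLOCK_SIZE in the shader

struct GroupCount {
    std::uint32_t x;
    std::uint32_t y;
};

// Rounds up so partially covered tiles at the right and bottom edges
// still get a thread group.
GroupCount threadGroupsFor(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    return {(width + kBlockSize - 1) / kBlockSize,
            (height + kBlockSize - 1) / kBlockSize};
}
```

At 1280x720 this gives 80x45 groups with no waste. At 1920x1080 the height is not a multiple of 16, so the last row of groups has threads outside the image, and a shader would need to skip or clamp those pixels.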
CS steps 1-2
  1. Load gbuffers & depth
  2. Calculate min & max z in threadgroup / tile
      • Using InterlockedMin/Max on groupshared variable
      • Atomics only work on ints
      • But casting works (z is always +)
  • Optimization note: Separate pass using parallel reduction with Gather to a small texture could be faster
  • Note to the future: GPU already has similar values in HiZ/ZCull!
      • Can skip step 2 if we could resolve out min & max z to a texture directly
  • Figure: Min z looks just like the occlusion software rendering output

    groupshared uint minDepthInt;
    groupshared uint maxDepthInt;

    // --- globals above, function below -------

    float depth =
        depthTexture.Load(uint3(texCoord, 0)).r;
    uint depthInt = asuint(depth);

    minDepthInt = 0xFFFFFFFF;
    maxDepthInt = 0;
    GroupMemoryBarrierWithGroupSync();

    InterlockedMin(minDepthInt, depthInt);
    InterlockedMax(maxDepthInt, depthInt);
    GroupMemoryBarrierWithGroupSync();

    float minGroupDepth = asfloat(minDepthInt);
    float maxGroupDepth = asfloat(maxDepthInt);

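Why the cast works: for non-negative IEEE 754 floats, the bit patterns sort in the same order as the values, so an integer min/max over the bits gives the float min/max. A CPU-side sketch of the same reduction follows; the names are hypothetical and `memcpy` plays the role of `asuint`/`asfloat`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Reinterpret the bits of a float as an unsigned integer and back.
std::uint32_t asUint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

float asFloat(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

struct DepthBounds {
    float minZ;
    float maxZ;
};

// Integer min/max over the bit patterns of one tile's depth values.
// Only valid for non-negative depths; an empty input yields NaN for minZ.
DepthBounds tileDepthBounds(const std::vector<float>& depths) {
    std::uint32_t minBits = 0xFFFFFFFFu;
    std::uint32_t maxBits = 0u;
    for (float z : depths) {
        assert(!std::signbit(z));  // "z is always +"
        minBits = std::min(minBits, asUint(z));
        maxBits = std::max(maxBits, asUint(z));
    }
    return {asFloat(minBits), asFloat(maxBits)};
}
```

Negative values would break the ordering because the sign bit makes their bit patterns compare larger than any positive float.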
CS step 3 – Cull idea
  3. Determine visible light sources for each tile
      • Cull all light sources against tile "frustum"
      • Light sources can either naively be all light sources in the scene, or CPU frustum culled potentially visible light sources
  • Output for each tile is:
      • # of visible light sources
      • Index list of visible light sources
  • This is the key part of the algorithm and compute shader, so must try to be rather clever with the implementation!
  • Figure: Example numbers from test scene. Per-tile visible light count (black = 0 lights, white = 40)

CS step 3 – Cull implementation
  • Each thread switches to process light sources instead of a pixel*
      • Wow, parallelism switcheroo!
      • 256 light sources in parallel per tile
      • Multiple iterations for >256 lights
  • Intersect light source & tile
      • Many variants dep. on accuracy requirements & performance
      • Tile min & max z is used as a shader "depth bounds" test
  • For visible lights, append light index to index list
      • Atomic add to threadgroup shared memory. "inlined stream compaction"
      • Prefix sum + stream compaction should be faster than atomics, but more limiting
  • Synchronize group & switch back to processing pixels
      • We now know which light sources affect the tile
  * Your grandfather’s pixel shader can’t do that!

    struct Light
    {
        float3 pos;
        float sqrRadius;
        float3 color;
        float invSqrRadius;
    };
    int lightCount;
    StructuredBuffer<Light> lights;

    groupshared uint visibleLightCount = 0;
    groupshared uint visibleLightIndices[1024];

    // ----- globals above, cont. function below -----------

    uint threadCount = BLOCK_SIZE*BLOCK_SIZE;
    uint passCount = (lightCount+threadCount-1) / threadCount;

    for (uint passIt = 0; passIt < passCount; ++passIt)
    {
        uint lightIndex = passIt*threadCount + groupIndex;

        // prevent overrun by clamping to a last "null" light
        lightIndex = min(lightIndex, lightCount);

        if (intersects(lights[lightIndex], tile))
        {
            uint offset;
            InterlockedAdd(visibleLightCount, 1, offset);
            visibleLightIndices[offset] = lightIndex;
        }
    }

    GroupMemoryBarrierWithGroupSync();

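The pass structure of the culling loop can be followed on the CPU by walking the 256 "threads" of a group one after another. This is a reference sketch under assumptions, not the shader: the `Light` here is a simplified hypothetical struct, the intersection test is only the depth-bounds variant, and out-of-range indices are skipped instead of being clamped to a "null" light.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kThreadCount = 16 * 16;  // one thread per tile pixel

// Hypothetical light: view-space depth of its center and its radius.
struct Light {
    float viewZ;
    float radius;
};

// Min & max depth of the tile, as computed in step 2.
struct TileDepthRange {
    float minZ;
    float maxZ;
};

// Simplest variant: the light's depth extent must overlap the tile's.
bool intersects(const Light& light, const TileDepthRange& tile) {
    return light.viewZ + light.radius >= tile.minZ &&
           light.viewZ - light.radius <= tile.maxZ;
}

// Emulates one thread group: each pass tests up to 256 lights, one per
// thread, and appends the index of every intersecting light.
std::vector<std::size_t> cullLightsForTile(const std::vector<Light>& lights,
                                           const TileDepthRange& tile) {
    assert(tile.minZ <= tile.maxZ);
    std::vector<std::size_t> visible;
    const std::size_t passCount = (lights.size() + kThreadCount - 1) / kThreadCount;
    for (std::size_t pass = 0; pass < passCount; ++pass) {
        for (std::size_t groupIndex = 0; groupIndex < kThreadCount; ++groupIndex) {
            const std::size_t lightIndex = pass * kThreadCount + groupIndex;
            if (lightIndex >= lights.size()) continue;  // past the last light
            if (intersects(lights[lightIndex], tile)) {
                // In the shader this is InterlockedAdd plus a groupshared store.
                visible.push_back(lightIndex);
            }
        }
    }
    return visible;
}
```

On the GPU the threads of a pass run concurrently, so the order of indices in the list is not deterministic there; the sequential version here always produces ascending indices.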
CS deferred shading final steps
  4. For each pixel, accumulate lighting from visible lights
      • Read from tile visible light index list in threadgroup shared memory
  5. Combine lighting & shading albedos / parameters
      • Output is non-MSAA HDR texture
      • Render transparent surfaces on top
  • Figures: Computed lighting; Combined final output (not the best example)

    float3 diffuseLight = 0;
    float3 specularLight = 0;

    for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt)
    {
        uint lightIndex = visibleLightIndices[lightIt];
        Light light = lights[lightIndex];

        evaluateAndAccumulateLight(
            light,
            gbufferParameters,
            diffuseLight,
            specularLight);
    }

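The slides do not show what `evaluateAndAccumulateLight` does, so the following is only a guess at the shape of the per-pixel loop, reduced to diffuse lighting. The falloff `1 - d² * invSqrRadius` and the Lambert term are assumptions chosen to match the fields of the `Light` struct; `Vec3` and `accumulateDiffuse` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Same fields as the Light struct in the shader.
struct Light {
    Vec3 pos;
    float sqrRadius;
    Vec3 color;
    float invSqrRadius;
};

// Sums diffuse lighting for one pixel over the tile's visible light list.
// Falloff (assumed): 1 - d^2 / r^2, reaching zero at the light radius.
Vec3 accumulateDiffuse(Vec3 surfacePos, Vec3 normal,
                       const std::vector<Light>& lights,
                       const std::vector<std::size_t>& visibleLightIndices) {
    Vec3 diffuse{0.0f, 0.0f, 0.0f};
    for (std::size_t lightIndex : visibleLightIndices) {
        assert(lightIndex < lights.size());
        const Light& light = lights[lightIndex];
        const Vec3 toLight = light.pos - surfacePos;
        const float sqrDist = dot(toLight, toLight);
        if (sqrDist >= light.sqrRadius || sqrDist == 0.0f) continue;
        const float attenuation = 1.0f - sqrDist * light.invSqrRadius;
        const float nDotL =
            std::max(0.0f, dot(normal, toLight) / std::sqrt(sqrDist));
        diffuse = diffuse + light.color * (attenuation * nDotL);
    }
    return diffuse;
}
```

Only the lights in the index list are touched, which is the point of the culling step: the per-pixel cost depends on the lights that reach the tile, not on the total number of lights in the scene.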
Example results
Example: 25+ analytical specular highlights per pixel
Compute Shader-based Deferred Shading demo
CS-based deferred shading
  • The Good:
      • Constant & absolute minimal bandwidth
          • Read gbuffers & depth once!
      • Doesn’t need intermediate light buffers
          • Can take a lot of memory with HDR, MSAA & color specular
      • Scales up to huge amount of big overlapping light sources!
          • Fine-grained culling (16x16)
          • Only ALU cost, good future scaling
          • Could be useful for accumulating VPLs
  • The Bad:
      • Requires DX11 HW (duh)
          • CS 4.0/4.1 difficult due to atomics & scattered groupshared writes
      • Culling overhead for small light sources
          • Can accumulate them using standard light volume rendering
          • Or separate CS for tile-classific.
      • Potentially performance
          • MSAA texture loads / UAV writing might be slower than standard PS
  • The Ugly:
      • Can’t output to MSAA texture
          • DX11 CS UAV limitation.

Future programming model
  • Queues as compute shader streaming in/outs
      • In addition to buffers/textures/UAVs
      • Simple & expressive model supporting irregular workloads
      • Keeps data on chip, supports variable sized caches & cores
  • Build your pipeline of stages with queues between
      • Shader & fixed function stages (sampler, rasterizer, tessellator, Zcull)
      • Developers can make the GPU feed itself!
  • GRAMPS model example [8]

What else do we want to do?
  • WARNING: Overly enthusiastic and non all-knowing game developer ranting
  • Mixed resolution MSAA particle rendering
      • Depth test per sample, shade per quarter pixel, and depth-aware upsample directly in shader
  • Demand-paged procedural texturing / compositing
      • Zero latency "texture shaders"
  • Pre-tessellation coarse rasterization for z-culling of patches
      • Potential optimization in scenes of massive geometric overdraw
      • Can be coupled with recursive schemes
  • Deferred shading w/ many & arbitrary BRDFs/materials
      • Queue up pixels of multiple materials for coherent processing in own shader
      • Instead of incoherent screen-space dynamic flow control
  • Latency-free lens flares
      • Finally! No false/late occlusion
      • Occlusion query results written to CB and used in shader to cull & scale
  • And much much more...

Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)

  • 1. Parallel Graphics in Frostbite – Current & FutureJohan AnderssonDICE
  • 2. MenuGame engine CPU & GPU parallelismRendering techniques & systems – old & newMixed in with some future predictions & wishes
  • 3. Quick backgroundFrostbite 1.x [1][2][3]Xbox 360, PS3, DX10Battlefield: Bad Company (shipped)Battlefield 1943 (shipped)Battlefield: Bad Company 2Frostbite 2 [4][5]In developmentXbox 360, PS3DX11 (10.0, 10.1, 11)Disclaimer: Unless specified, pictures are from engine tests, not actual games
  • 6. Job-based parallelismMust utilize all cores in the engineXbox 360: 6 HW threadsPS3: 2 HW threads + 6 great SPUsPC: 2-8 HW threads And many more comingDivide up systems into JobsAsync function calls with explicit inputs & outputsTypically fully independent stateless functionsMakes it easier on PS3 SPU & in generalJob dependencies create job graphAll cores consume jobsCELL processor – We like
  • 7. Frostbite CPU job graphFrame job graph from Frostbite 1 (PS3)Build big job graphsBatch, batch, batchMix CPU- & SPU-jobs Future: Mix in low-latency GPU-jobsJob dependencies determine:Execution order Sync pointsLoad balancingI.e. the effective parallelismBraided Parallelism* [6]Intermixed task- & data-parallelism* Still only 10 hits on google (yet!), but I like Aaron’s term
  • 8. Rendering jobsRendering systems are heavily divided up into jobsJobs:Terrain geometry processingUndergrowth generation [2]Decal projection [3]Particle simulationFrustum cullingOcclusion cullingOcclusion rasterizationCommand buffer generationPS3: Triangle cullingMost will move to GPUEventually.. A few have already!Mostly one-way data flowI will talk about a couple of these..
  • 9. Parallel command buffer recording Dispatch draw calls and state to multiple command buffers in parallelScales linearly with # cores1500-4000 draw calls per frameReduces latency & improves performanceImportant for all platforms, used on:Xbox 360PS3 (SPU-based)PC DX11Previously not possible on PC, but now in DX11...
  • 10. DX11 parallel dispatchFirst class citizen in DX11Killer feature for reducing CPU overhead & latency~90% of our rendering dispatch job time is in D3D/driverDX11 deferred device context per core Together with dynamic resources (cbuffer/vbuffer) for usage on that deferred contextRenderer has list of all draw calls we want to do for each rendering “layer” of the frameSplit draw calls for each layer into chunks of ~256 and dispatch in parallel to the deferred contextsEach chunk generates a command listRender to immediate context & execute command listsProfit! Goal: close to linear scaling up to octa-core when we get full DX11 driver support (up to the IHVs now)Future note: This is ”just” a stopgap measure until we evolve the GPU to be able to fully feed itself (hi LRB)
  • 11. Occlusion cullingProblem: Buildings & env occlude large amounts of objectsInvisible objects still have to:Update logic & animationsGenerate command bufferProcessed on CPU & GPUDifficult to implement full cullingDestructible buildingsDynamic occludeesDifficult to precompute GPU occlusion queries can be heavy to renderFrom Battlefield: Bad Company PS3
  • 13. Software occlusion cullingRasterize coarse zbuffer on SPU/CPU256x114 floatGood fit in SPU LS, but could be 16-bitLow-poly occluder meshesManually conservative100 m view distanceMax 10000 vertices/frameParallel SPU vertex & raster jobsCost: a few millisecondsThen cull all objects against zbufferBefore passed to all other systems = big savings
  • 14. Screen-space bounding-box testPictures & numbers from Battlefield: Bad Company PS3
  • 15. GPU occlusion cullingIdeally want GPU rasterization & testing, but:Occlusion queries introduces overhead & latencyCan be manageable, but far from idealConditional rendering only helps GPUNot CPU, frame memory or draw callsFuture 1: Low-latency extra GPU exec. contextRasterization and testing done on GPU where it belongsLockstep with CPU, need to read back data within a few msShould be possible on LRB (latency?), want on all HWFuture 2: Move entire cull & rendering to ”GPU”World rep., cull, systems, dispatch. End goal.
  • 16. PS3 geometry processingProblem: Slow GPU triangle & vertex setup on PS3Combined with unique situation with powerful & initially not fully utilized ”free” SPUs!Solution: SPU triangle cullingTrade SPU time for GPU timeCull all back faces, micro-triangles, out of frustumBased on Sony’s PS3 EDGE library [7]Also see Jon Olick’s talk from the course last year5 SPU jobs processes frame geometry in parallelOutput is new index buffer for each draw call
  • 17. Custom geometry processingSoftware control opens up great flexibility and programmability! Simple custom culling/processing that we’ve added:Partition bounding box cullingMesh part cullingClip plane triangle trivial accept & rejectTriangle cull volumes (inverse clip planes)Others are doing: Full skinning, morph targets, CLOD, clothFuture wish: No forced/fixed vertex & geometry shadersDIY compute shaders with fixed-func stages (tesselation and rasterization)Software-controlled queuing of data between stagesTo avoid always spilling out to memory
  • 18. Decal projectionTraditionally a CPU processRelying on identical visual & physics representation Or duplicated mesh data in CPU memory (on PC) Consoles read visual mesh data directly
  • 21. Output VB/IB to GPUDecals through GS & StreamOutKeep the computation & data on the GPU (DX10)See GDC’09 ”Shadows & Decals – D3D10 techniques in Frostbite”, slides with complete source code online [4]Process all mesh triangles with Geometry ShaderTest decal projection against the trianglesSetup per-triangle clip planes for intersecting trisOutput intersecting triangles using StreamOutIssues:StreamOut managementDrivers (not your standard GS usage)Benefits:CPU & GPU worlds separate
  • 22. No CPU memory or upload
  • 23. Huge decals + huge meshesDeferred lighting/shadingTraditional deferred shading:Graphics pipeline rasterizes gbuffer for opaque surfacesNormal, albedos, roughnessLight sources are rendered & accumulate lighting to a textureLight volume or screen-space tile renderingCombine shading & lighting for final outputAlso see Wolfgang’s talk “Light Pre-Pass Renderer Mark III”from Monday for a wider description [8]
  • 24. Screen-space tile classificationDivide screen up into tiles and determine how many & which light sources intersect each tileOnly apply the visible light sources on pixels in each tileReduced BW & setup cost with multiple lights in single shaderUsed in Naughty Dog’s Uncharted [9] and SCEE PhyreEngine [10]Hmm, isn’t light classification per screen-space tile sort of similar of how a compute shader can work with 2D thread groups?Answer: YES, except a CS can do everything in a single pass!From ”The Technology of Uncharted". GDC’08 [9]
  • 25. CS-based deferred shading Deferred shading using DX11 CSExperimental implementation in Frostbite 2
  • 26. Not production tested or optimized
  • 28. Assumption: No shadows (for now)New hybrid Graphics/Compute shading pipeline:Graphics pipeline rasterizes gbuffers for opaque surfacesCompute pipeline uses gbuffers, culls light sources, computes lighting & combines with shading(multiple other variants also possible)
  • 29. CS requirements & setupInput data is gbuffers, depth buffer & light constantsOutput is fully composited & lit HDR texture1 thread per pixel, 16x16 thread groups (aka tile)NormalRoughnessTexture2D<float4> gbufferTexture1 : register(t0);Texture2D<float4> gbufferTexture2 : register(t1);Texture2D<float4> gbufferTexture3 : register(t2);Texture2D<float4> depthTexture : register(t3);RWTexture2D<float4> outputTexture : register(u0);#define BLOCK_SIZE 16[numthreads(BLOCK_SIZE,BLOCK_SIZE,1)]void csMain( uint3 groupId : SV_GroupID, uint3 groupThreadId : SV_GroupThreadID, uint groupIndex: SV_GroupIndex, uint3 dispatchThreadId : SV_DispatchThreadID){ ...}Diffuse AlbedoSpecular Albedo
  • 30. CS steps 1-2groupshared uint minDepthInt;groupshared uint maxDepthInt;// --- globals above, function below -------float depth = depthTexture.Load(uint3(texCoord, 0)).r;uint depthInt = asuint(depth);minDepthInt = 0xFFFFFFFF;maxDepthInt = 0;GroupMemoryBarrierWithGroupSync();InterlockedMin(minDepthInt, depthInt);InterlockedMax(maxDepthInt, depthInt);GroupMemoryBarrierWithGroupSync();float minGroupDepth = asfloat(minDepthInt);float maxGroupDepth = asfloat(maxDepthInt);Load gbuffers & depthCalculate min & max z in threadgroup / tileUsing InterlockedMin/Max on groupshared variableAtomics only work on ints But casting works (z is always +)Optimization note: Separate pass using parallel reduction with Gather to a small texture could be fasterNote to the future:GPU already has similar values in HiZ/ZCull! Can skip step 2 if we could resolve out min & max z to a texture directlyMin z looks just like the occlusion software rendering output
  • 31. CS step 3 – Cull ideaDetermine visible light sources for each tileCull all light sources against tile ”frustum”Light sources can either naively be all light sources in the scene, or CPU frustum culled potentially visible light sourcesOutput for each tile is:# of visible light sourcesIndex list of visible light sourcesExample numbers from test sceneThis is the key part of the algorithm and compute shader, so must try to be rather clever with the implementation!Per-tile visible light count(black = 0 lights, white = 40)
  • 32. CS step 3 – Cull implementation
      Each thread switches to process light sources instead of a pixel*
        Wow, parallelism switcheroo!
        256 light sources in parallel per tile
        Multiple iterations for >256 lights
      Intersect light source & tile
        Many variants depending on accuracy requirements & performance
        Tile min & max z is used as a shader ”depth bounds” test
      For visible lights, append light index to index list
        Atomic add to threadgroup shared memory: ”inlined stream compaction”
        Prefix sum + stream compaction should be faster than atomics, but more limiting
      Synchronize group & switch back to processing pixels
        We now know which light sources affect the tile
      * Your grandfather’s pixel shader can’t do that!

      struct Light
      {
          float3 pos;
          float sqrRadius;
          float3 color;
          float invSqrRadius;
      };
      int lightCount;
      StructuredBuffer<Light> lights;

      groupshared uint visibleLightCount = 0;
      groupshared uint visibleLightIndices[1024];

      // ----- globals above, cont. function below -----------

      uint threadCount = BLOCK_SIZE*BLOCK_SIZE;
      uint passCount = (lightCount+threadCount-1) / threadCount;

      for (uint passIt = 0; passIt < passCount; ++passIt)
      {
          uint lightIndex = passIt*threadCount + groupIndex;

          // prevent overrun by clamping to a last ”null” light
          lightIndex = min(lightIndex, lightCount);

          if (intersects(lights[lightIndex], tile))
          {
              uint offset;
              InterlockedAdd(visibleLightCount, 1, offset);
              visibleLightIndices[offset] = lightIndex;
          }
      }

      GroupMemoryBarrierWithGroupSync();
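The pass structure and the atomic append can be emulated on the CPU to check the indexing. This is a minimal sketch under simplifying assumptions, not the presented shader: the `Light` and `Tile` types are reduced to a view-space depth range, `intersects` only performs the tile min/max z ”depth bounds” part of the test, and the overflow guard on the 1024-entry list is an addition that the slide code does not show. As in the slide, the light buffer is assumed to hold `lightCount + 1` entries, with a trailing ”null” light that never intersects.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified, hypothetical types: only the depth extent of a light is kept.
struct Light { float viewZ; float radius; };
struct Tile  { float minZ; float maxZ; };

// Depth-bounds part of the light/tile test. A zero-radius "null" light never hits.
inline bool intersects(const Light& light, const Tile& tile)
{
    return light.radius > 0.0f
        && light.viewZ + light.radius >= tile.minZ
        && light.viewZ - light.radius <= tile.maxZ;
}

// Emulates one thread group: every (passIt, groupIndex) pair is one thread
// testing one light. Returns the tile's visible light index list.
inline std::vector<uint32_t> cullLightsForTile(const std::vector<Light>& lights,
                                               uint32_t lightCount,
                                               const Tile& tile)
{
    assert(lights.size() == static_cast<size_t>(lightCount) + 1); // trailing null light
    constexpr uint32_t threadCount = 16 * 16;
    const uint32_t passCount = (lightCount + threadCount - 1) / threadCount;

    std::atomic<uint32_t> visibleLightCount{0};
    std::vector<uint32_t> visibleLightIndices(1024);

    for (uint32_t passIt = 0; passIt < passCount; ++passIt) {
        for (uint32_t groupIndex = 0; groupIndex < threadCount; ++groupIndex) {
            // Clamp to the null light to prevent reading past the buffer.
            const uint32_t lightIndex =
                std::min(passIt * threadCount + groupIndex, lightCount);
            if (intersects(lights[lightIndex], tile)) {
                // fetch_add returns the previous count, like InterlockedAdd's output.
                const uint32_t offset = visibleLightCount.fetch_add(1);
                if (offset < visibleLightIndices.size())   // guard added in this sketch
                    visibleLightIndices[offset] = lightIndex;
            }
        }
    }

    const uint32_t count = std::min<uint32_t>(visibleLightCount.load(), 1024u);
    visibleLightIndices.resize(count);
    return visibleLightIndices;
}
```

With 300 lights the emulation runs two passes of 256 threads each. In the second pass, thread indices past the last real light all clamp to the null light and append nothing. On the GPU the order of the appended indices is not deterministic, which is fine because the list is only used for accumulation.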
  • 33. CS deferred shading final steps
      For each pixel, accumulate lighting from visible lights
        Read from tile visible light index list in threadgroup shared memory
      Combine lighting & shading albedos / parameters
      Output is non-MSAA HDR texture
      Render transparent surfaces on top
      Images shown: computed lighting; combined final output (not the best example)

      float3 diffuseLight = 0;
      float3 specularLight = 0;

      for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt)
      {
          uint lightIndex = visibleLightIndices[lightIt];
          Light light = lights[lightIndex];

          evaluateAndAccumulateLight(
              light,
              gbufferParameters,
              diffuseLight,
              specularLight);
      }
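The slides do not show the body of `evaluateAndAccumulateLight`, so the sketch below only illustrates the accumulation loop over the tile's index list. The lighting model is an assumption made for this example: a Lambert diffuse term with a `1 - d² · invSqrRadius` falloff, chosen because it uses the `sqrRadius` and `invSqrRadius` fields of the `Light` struct. `Float3` and `accumulateDiffuse` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

inline Float3 sub(const Float3& a, const Float3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline float  dot(const Float3& a, const Float3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Same field layout as the Light struct on the culling slide.
struct Light { Float3 pos; float sqrRadius; Float3 color; float invSqrRadius; };

// Accumulates diffuse lighting for one pixel from the tile's visible lights only.
// The falloff and Lambert term are assumptions, not the shader from the talk.
inline Float3 accumulateDiffuse(const Float3& pixelPos,
                                const Float3& normal,
                                const std::vector<Light>& lights,
                                const std::vector<uint32_t>& visibleLightIndices)
{
    Float3 diffuseLight{0.0f, 0.0f, 0.0f};
    for (uint32_t lightIndex : visibleLightIndices) {
        assert(lightIndex < lights.size());
        const Light& light = lights[lightIndex];

        const Float3 toLight = sub(light.pos, pixelPos);
        const float sqrDist = dot(toLight, toLight);
        if (sqrDist <= 0.0f || sqrDist >= light.sqrRadius)
            continue; // pixel is outside the light's radius (or exactly at its center)

        const float attenuation = 1.0f - sqrDist * light.invSqrRadius;
        const float nDotL = std::max(0.0f, dot(normal, toLight) / std::sqrt(sqrDist));
        const float weight = attenuation * nDotL;

        diffuseLight.x += light.color.x * weight;
        diffuseLight.y += light.color.y * weight;
        diffuseLight.z += light.color.z * weight;
    }
    return diffuseLight;
}
```

Lights that are absent from the index list are never read, which is where the bandwidth and ALU savings of the tile culling come from.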
  • 35. Example: 25+ analytical specular highlights per pixel
  • 38. CS-based deferred shading
      The Good:
        Constant & absolute minimal bandwidth
          Read gbuffers & depth once!
        Doesn’t need intermediate light buffers
          Can take a lot of memory with HDR, MSAA & color specular
        Scales up to huge amount of big overlapping light sources!
          Fine-grained culling (16x16)
          Only ALU cost, good future scaling
          Could be useful for accumulating VPLs
      The Bad:
        Requires DX11 HW (duh)
          CS 4.0/4.1 difficult due to atomics & scattered groupshared writes
        Culling overhead for small light sources
          Can accumulate them using standard light volume rendering
          Or separate CS for tile classification
        Potentially performance
          MSAA texture loads / UAV writing might be slower than standard PS
      The Ugly:
        Can’t output to MSAA texture
  • 39. DX11 CS UAV limitation. Future programming model
      Queues as compute shader streaming in/outs
        In addition to buffers/textures/UAVs
        Simple & expressive model supporting irregular workloads
        Keeps data on chip, supports variable sized caches & cores
      Build your pipeline of stages with queues between
        Shader & fixed function stages (sampler, rasterizer, tessellator, Zcull)
        Developers can make the GPU feed itself!
      GRAMPS model example [11]
  • 40. What else do we want to do?
      WARNING: Overly enthusiastic and not all-knowing game developer ranting
      Mixed resolution MSAA particle rendering
        Depth test per sample, shade per quarter pixel, and depth-aware upsample directly in shader
      Demand-paged procedural texturing / compositing
        Zero latency “texture shaders”
      Pre-tessellation coarse rasterization for z-culling of patches
        Potential optimization in scenes of massive geometric overdraw
        Can be coupled with recursive schemes
      Deferred shading w/ many & arbitrary BRDFs/materials
        Queue up pixels of multiple materials for coherent processing in own shader
        Instead of incoherent screen-space dynamic flow control
      Latency-free lens flares
        Finally! No false/late occlusion
        Occlusion query results written to CB and used in shader to cull & scale
      And much much more...
  • 41. Conclusions
      A good parallelization model is key for good game engine performance (duh)
        Job graphs of mixed task- & data-parallel CPU & SPU jobs work well for us
        SPU-jobs do the heavy lifting
      Hybrid compute/graphics pipelines look promising
        Efficient interoperability is super important (DX11 is great)
        Deferred lighting & shading in CS is just the start
      Want a user-defined streaming pipeline model
        Expressive & extensible hybrid pipelines with queues
        Focus on the data flow & patterns instead of doing sequential memory passes
  • 42. Acknowledgements
      DICE & Frostbite team
      Nicolas Thibieroz, Mark Leather
      Miguel Sainz, Yury Uralsky
      Kayvon Fatahalian
      Matt Swoboda, Pål-Kristian Engstad
      Timothy Farrar, Jake Cannell
  • 43. References
      [1] Johan Andersson. "Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques". GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html
      [2] Natasha Tatarchuk & Johan Andersson. "Rendering Architecture and Real-time Procedural Shading & Texturing Techniques". GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf
      [3] Johan Andersson. "Terrain Rendering in Frostbite using Procedural Shader Splatting". Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf
      [4] Daniel Johansson & Johan Andersson. "Shadows & Decals – D3D10 techniques from Frostbite". GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html
      [5] Bill Bilodeau & Johan Andersson. "Your Game Needs Direct3D 11, So Get Started Now!". GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html
      [6] Aaron Lefohn. "Programming Larrabee: Beyond Data Parallelism" – "Beyond Programmable Shading" course. Siggraph 2008. http://s08.idav.ucdavis.edu/lefohn-programming-larrabee.pdf
      [7] Mark Cerny, Jon Olick, Vince Diesi. "PLAYSTATION Edge". GDC 2007.
      [8] Wolfgang Engel. "Light Pre-Pass Renderer Mark III" – "Advances in Real-Time Rendering in 3D Graphics and Games" course notes. Siggraph 2009.
      [9] Pål-Kristian Engstad. "The Technology of Uncharted: Drake’s Fortune". GDC 2008. http://www.naughtydog.com/corporate/press/GDC%202008/UnchartedTechGDC2008.pdf
      [10] Matt Swoboda. "Deferred Lighting and Post Processing on PLAYSTATION®3". GDC 2009. http://www.technology.scee.net/files/presentations/gdc2009/DeferredLightingandPostProcessingonPS3.ppt
      [11] Kayvon Fatahalian et al. "GRAMPS: A Programming Model for Graphics Pipelines". ACM Transactions on Graphics, January 2009. http://graphics.stanford.edu/papers/gramps-tog/
      [12] Jared Hoberock et al. "Stream Compaction for Deferred Shading". http://graphics.cs.uiuc.edu/~jch/papers/shadersorting.pdf
  • 44. We are hiring senior developers
  • 45. Questions? Please fill in the course evaluation at: http://www.siggraph.org/courses_evaluation
  • 46. You could win a Siggraph’09 mug (yey!)
  • 47. One winner per course, notified by email in the evening
      Email: johan.andersson@dice.se
      Blog: http://repi.se
      Twitter: http://twitter.com/repi
      igetyourfail.com
  • 49. Timing view
      Real-time in-game overlay
        See CPU, SPU & GPU timing events & effective parallelism
        What we use to reduce sync-points & optimize load balancing between all processors
      GPU timing through event queries
        AFR-handling rather shaky, but works!*
      Example: PC, 4 CPU cores, 2 GPUs in AFR
      * At least on AMD 4870x2 after some alt-tab action

Editor's Notes

  • #3: Concrete effects on the graphics pipeline. Include a few wishes & predictions of how we would like the GPU programming model to evolve.
  • #4: FB1 started out 5 years ago, targeting the ”next-generation” consoles. While we developed the engine we also worked on BFBC1, which was the pilot project. After that shipped, we have been spending quite a bit of effort on a new version of the engine, of which one of the major new things is full PC & DX11 support. So I think we have some interesting experiences on both the consoles and modern PCs. And no, I won’t talk about BF3.
  • #6: These large-scale environments require heavy processing. Let’s go ahead and talk about jobs..
  • #7: Better code structure! Gustafson’s Law. Fixed 33 ms/f.
  • #8: 90% of this is a single job graph for the rendering of a frame. Rely on dynamic load-balancing. Task-parallel examples: terrain culling, particles, building entities. Data-parallel examples: particles. Have example of low-latency GPU job later on. Braided: Aaron introduced the term at this course last year.
  • #9: We have a lot of jobs across the engine, too many to go through so I chose to focus more on some of the rendering.Our intention is to move all of these to the GPU
  • #10: One of the first optimizations for multi-core that we did was to move all rendering dispatch to a separate thread. This includes all the draw calls and states that we set to D3D. This helps a lot, but it doesn’t scale that well as we only utilize a single extra core. Gather
  • #13: Software rasterization of occluders
  • #14: Conservative
  • #15: Want to rasterize on GPU, not CPU. CPU ⇔ GPU job dependencies.
  • #16: Only visible triangles. Jon Olick also talked about Edge at the course last year, so I won’t go into that much detail about this. This is a huge optimization both for us and many other developers.
  • #17: Initially very skeptical. Intrinsics problematic.
  • #19: 280 instruction GS. StreamOut buffer/query management is difficult & buggy. Want to use CS.
  • #24: Esp. heavy with MSAA. Parallel reduction probably faster than atomics.
  • #32: MSAA texture read cost, UAV cost. Good: depth bounds on all HW.
  • #33: Paper presented at Siggraph yesterday. Dynamic irregular workloads. Elegant model, holds a lot of promise.
  • #35: OpenCL. RapidMind, GRAMPS.