Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)

Parallel Graphics in Frostbite – Current & Future
Johan Andersson
DICE

Menu
- Game engine CPU & GPU parallelism
- Rendering techniques & systems – old & new
- Mixed in with some future predictions & wishes

Quick background
- Frostbite 1.x [1][2][3]
  - Xbox 360, PS3, DX10
  - Battlefield: Bad Company (shipped)
  - Battlefield 1943 (shipped)
  - Battlefield: Bad Company 2
- Frostbite 2 [4][5]
  - In development
  - Xbox 360, PS3
  - DX11 (10.0, 10.1, 11)
- Disclaimer: Unless specified, pictures are from engine tests, not actual games

Job-based parallelism
- Must utilize all cores in the engine
  - Xbox 360: 6 HW threads
  - PS3: 2 HW threads + 6 great SPUs
  - PC: 2-8 HW threads
    - And many more coming
- Divide up systems into jobs
  - Async function calls with explicit inputs & outputs
  - Typically fully independent stateless functions
    - Makes it easier on PS3 SPU & in general
  - Job dependencies create job graph
  - All cores consume jobs
- CELL processor – We like
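The idea of jobs with explicit dependencies forming a graph can be sketched in a few lines of C++. This is a minimal illustration, not Frostbite's job system: the `Job` struct and `runJobGraph` are hypothetical names, and the jobs run inline on one thread in dependency order, where a real scheduler would hand each ready job to whichever core or SPU is idle.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical job record: a stateless function plus the indices of the
// jobs whose outputs it consumes.
struct Job {
    std::function<void()> run;
    std::vector<std::size_t> dependsOn;
};

// Runs every job only after all of its dependencies have finished
// (Kahn's topological ordering) and returns the execution order.
std::vector<std::size_t> runJobGraph(const std::vector<Job>& jobs) {
    std::vector<std::size_t> pending(jobs.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(jobs.size());
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        pending[i] = jobs[i].dependsOn.size();
        for (std::size_t d : jobs[i].dependsOn) {
            assert(d < jobs.size());
            dependents[d].push_back(i);
        }
    }

    std::queue<std::size_t> ready;  // jobs with no unfinished dependencies
    for (std::size_t i = 0; i < jobs.size(); ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t j = ready.front();
        ready.pop();
        jobs[j].run();
        order.push_back(j);
        // Finishing a job may unblock the jobs that depend on it.
        for (std::size_t n : dependents[j])
            if (--pending[n] == 0) ready.push(n);
    }
    return order;
}
```

Every job that sits in the `ready` queue at the same time is independent of the others, so the queue length at any moment is the parallelism the graph exposes.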

Frostbite CPU job graph
Frame job graph from Frostbite 1 (PS3)
- Build big job graphs
  - Batch, batch, batch
  - Mix CPU- & SPU-jobs
  - Future: Mix in low-latency GPU-jobs
- Job dependencies determine:
  - Execution order
  - Sync points
  - Load balancing
  - I.e. the effective parallelism
- Braided Parallelism* [6]
  - Intermixed task- & data-parallelism
* Still only 10 hits on google (yet!), but I like Aaron’s term

Rendering jobs
- Rendering systems are heavily divided up into jobs
- Jobs:
  - Terrain geometry processing
  - Undergrowth generation [2]
  - Decal projection [3]
  - Particle simulation
  - Frustum culling
  - Occlusion culling
  - Occlusion rasterization
  - Command buffer generation
  - PS3: Triangle culling
- Most will move to GPU
  - Eventually.. A few have already!
  - Mostly one-way data flow
- I will talk about a couple of these..

Parallel command buffer recording
- Dispatch draw calls and state to multiple command buffers in parallel
  - Scales linearly with # cores
  - 1500-4000 draw calls per frame
- Reduces latency & improves performance
- Important for all platforms, used on:
  - Xbox 360
  - PS3 (SPU-based)
  - PC DX11
- Previously not possible on PC, but now in DX11...

DX11 parallel dispatch
- First class citizen in DX11
  - Killer feature for reducing CPU overhead & latency
  - ~90% of our rendering dispatch job time is in D3D/driver
- DX11 deferred device context per core
  - Together with dynamic resources (cbuffer/vbuffer) for usage on that deferred context
- Renderer has list of all draw calls we want to do for each rendering "layer" of the frame
- Split draw calls for each layer into chunks of ~256 and dispatch in parallel to the deferred contexts
  - Each chunk generates a command list
- Render to immediate context & execute command lists
- Profit!
  - Goal: close to linear scaling up to octa-core when we get full DX11 driver support (up to the IHVs now)
- Future note: This is "just" a stopgap measure until we evolve the GPU to be able to fully feed itself (hi LRB)
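The chunk-and-record scheme can be illustrated without any D3D calls. In the sketch below, `DrawCall`, `CommandList`, `recordLayer` and `executeCommandLists` are hypothetical stand-ins: a "command list" is just a vector of draw ids, and each worker thread plays the role of one deferred context. It only shows how chunks of ~256 draws can be recorded in parallel while the final submission order stays deterministic.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct DrawCall { int id; };           // hypothetical draw-call record
using CommandList = std::vector<int>;  // stand-in for a recorded command list

// Splits one layer's draw calls into chunks and records each chunk into its
// own command list. Workers grab the next free chunk from an atomic counter.
std::vector<CommandList> recordLayer(const std::vector<DrawCall>& draws,
                                     std::size_t chunkSize = 256,
                                     unsigned workerCount = 4) {
    assert(chunkSize > 0 && workerCount > 0);
    const std::size_t chunkCount = (draws.size() + chunkSize - 1) / chunkSize;
    std::vector<CommandList> lists(chunkCount);
    std::atomic<std::size_t> nextChunk{0};

    auto worker = [&]() {
        for (;;) {
            const std::size_t c = nextChunk.fetch_add(1);
            if (c >= chunkCount) return;
            const std::size_t begin = c * chunkSize;
            const std::size_t end = std::min(begin + chunkSize, draws.size());
            // Each chunk writes only to its own list, so no locking is needed.
            for (std::size_t i = begin; i < end; ++i)
                lists[c].push_back(draws[i].id);
        }
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) workers.emplace_back(worker);
    for (std::thread& t : workers) t.join();
    return lists;
}

// Replays the lists in chunk order, mirroring execution of the recorded
// command lists on the immediate context.
std::vector<int> executeCommandLists(const std::vector<CommandList>& lists) {
    std::vector<int> submitted;
    for (const CommandList& list : lists)
        submitted.insert(submitted.end(), list.begin(), list.end());
    return submitted;
}
```

Because chunks are replayed by index rather than by completion time, the draw order of the layer is preserved no matter which worker finished first.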

Occlusion culling
- Problem: Buildings & env occlude large amounts of objects
- Invisible objects still have to:
  - Update logic & animations
  - Generate command buffer
  - Processed on CPU & GPU
- Difficult to implement full culling
  - Destructible buildings
  - Dynamic occludees
  - Difficult to precompute
- GPU occlusion queries can be heavy to render
From Battlefield: Bad Company PS3

Our solution: Software occlusion rasterization

Software occlusion culling
- Rasterize coarse zbuffer on SPU/CPU
  - 256x114 float
    - Good fit in SPU LS, but could be 16-bit
  - Low-poly occluder meshes
    - Manually conservative
    - 100 m view distance
    - Max 10000 vertices/frame
  - Parallel SPU vertex & raster jobs
  - Cost: a few milliseconds
- Then cull all objects against zbuffer
  - Before passed to all other systems = big savings

Screen-space bounding-box test
Pictures & numbers from Battlefield: Bad Company PS3
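A screen-space bounding-box test against the coarse z-buffer could look roughly like this. The sketch makes several assumptions not stated on the slides: depth grows with distance (0 = near, 1 = far), the buffer is cleared to the far value, and an object is described by its already projected texel rectangle plus its nearest depth. `CoarseZBuffer`, `ScreenBox` and `isPotentiallyVisible` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Coarse occlusion buffer, e.g. 256x114. Assumption: larger depth = farther.
struct CoarseZBuffer {
    int width;
    int height;
    std::vector<float> depth;  // row-major, cleared to the far plane (1.0)
    CoarseZBuffer(int w, int h)
        : width(w), height(h),
          depth(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), 1.0f) {
        assert(w > 0 && h > 0);
    }
};

// Projected bounds of an object: inclusive texel rectangle + nearest depth.
struct ScreenBox {
    int minX, minY, maxX, maxY;
    float nearestDepth;
};

// Conservative test: the object is culled only if every texel it covers
// holds an occluder that is closer than the object's nearest point.
bool isPotentiallyVisible(const CoarseZBuffer& zb, ScreenBox box) {
    box.minX = std::max(box.minX, 0);
    box.minY = std::max(box.minY, 0);
    box.maxX = std::min(box.maxX, zb.width - 1);
    box.maxY = std::min(box.maxY, zb.height - 1);
    if (box.minX > box.maxX || box.minY > box.maxY)
        return false;  // entirely off-screen
    for (int y = box.minY; y <= box.maxY; ++y)
        for (int x = box.minX; x <= box.maxX; ++x)
            if (box.nearestDepth <= zb.depth[static_cast<std::size_t>(y) * zb.width + x])
                return true;  // at least one texel is not occluded
    return false;
}
```

Using the box's nearest depth for every texel errs on the side of keeping objects, which matches the "manually conservative" occluder setup: a false "visible" costs some work, a false "hidden" would make objects pop.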

GPU occlusion culling
- Ideally want GPU rasterization & testing, but:
  - Occlusion queries introduce overhead & latency
    - Can be manageable, but far from ideal
  - Conditional rendering only helps GPU
    - Not CPU, frame memory or draw calls
- Future 1: Low-latency extra GPU exec. context
  - Rasterization and testing done on GPU where it belongs
  - Lockstep with CPU, need to read back data within a few ms
  - Should be possible on LRB (latency?), want on all HW
- Future 2: Move entire cull & rendering to "GPU"
  - World rep., cull, systems, dispatch. End goal.

PS3 geometry processing
- Problem: Slow GPU triangle & vertex setup on PS3
  - Combined with unique situation with powerful & initially not fully utilized "free" SPUs!
- Solution: SPU triangle culling
  - Trade SPU time for GPU time
  - Cull all back faces, micro-triangles, out of frustum
  - Based on Sony’s PS3 EDGE library [7]
    - Also see Jon Olick’s talk from the course last year
- 5 SPU jobs process frame geometry in parallel
  - Output is new index buffer for each draw call
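The culling step that turns one index buffer into a smaller one can be sketched on the CPU as follows. This is not the EDGE implementation; it is a simplified illustration that assumes vertices are already projected to pixel coordinates, that positive signed area means front-facing, and that pixel centers sit at integer + 0.5. `Vec2` and `cullTriangles` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };  // vertex position in pixel coordinates

// True if the interval [lo, hi] contains no pixel center (centers at k + 0.5).
static bool coversNoPixelCenter(float lo, float hi) {
    return std::ceil(lo - 0.5f) > std::floor(hi - 0.5f);
}

// Returns a new index buffer containing only triangles that can produce pixels.
std::vector<std::uint32_t> cullTriangles(const std::vector<Vec2>& positions,
                                         const std::vector<std::uint32_t>& indices,
                                         float viewportWidth, float viewportHeight) {
    assert(indices.size() % 3 == 0);
    std::vector<std::uint32_t> kept;
    for (std::size_t i = 0; i < indices.size(); i += 3) {
        const Vec2 a = positions[indices[i]];
        const Vec2 b = positions[indices[i + 1]];
        const Vec2 c = positions[indices[i + 2]];

        // Back-face / degenerate test via twice the signed area.
        const float twiceArea = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        if (twiceArea <= 0.0f) continue;

        const float minX = std::min({a.x, b.x, c.x});
        const float maxX = std::max({a.x, b.x, c.x});
        const float minY = std::min({a.y, b.y, c.y});
        const float maxY = std::max({a.y, b.y, c.y});

        // Out-of-frustum test, reduced here to the 2D viewport rectangle.
        if (maxX < 0.0f || minX > viewportWidth || maxY < 0.0f || minY > viewportHeight)
            continue;

        // Micro-triangle test: bounds that miss every pixel center cannot rasterize.
        if (coversNoPixelCenter(minX, maxX) || coversNoPixelCenter(minY, maxY))
            continue;

        kept.insert(kept.end(), {indices[i], indices[i + 1], indices[i + 2]});
    }
    return kept;
}
```

The vertex buffer is untouched; only the index buffer shrinks, which is why the output per draw call is a new index buffer.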

Custom geometry processing
- Software control opens up great flexibility and programmability!
- Simple custom culling/processing that we’ve added:
  - Partition bounding box culling
  - Mesh part culling
  - Clip plane triangle trivial accept & reject
  - Triangle cull volumes (inverse clip planes)
- Others are doing: Full skinning, morph targets, CLOD, cloth
- Future wish: No forced/fixed vertex & geometry shaders
  - DIY compute shaders with fixed-func stages (tessellation and rasterization)
  - Software-controlled queuing of data between stages
    - To avoid always spilling out to memory

Decal projection
- Traditionally a CPU process
  - Relying on identical visual & physics representation
  - Or duplicated mesh data in CPU memory (on PC)
    - Consoles read visual mesh data directly
  - Output VB/IB to GPU

Decals through GS & StreamOut
- Keep the computation & data on the GPU (DX10)
  - See GDC’09 "Shadows & Decals – D3D10 techniques in Frostbite", slides with complete source code online [4]
- Process all mesh triangles with Geometry Shader
  - Test decal projection against the triangles
  - Setup per-triangle clip planes for intersecting tris
  - Output intersecting triangles using StreamOut
- Issues:
  - StreamOut management
  - Drivers (not your standard GS usage)
- Benefits:
  - CPU & GPU worlds separate
  - Huge decals + huge meshes

Deferred lighting/shading
- Traditional deferred shading:
  - Graphics pipeline rasterizes gbuffer for opaque surfaces
    - Normal, albedos, roughness
  - Light sources are rendered & accumulate lighting to a texture
    - Light volume or screen-space tile rendering
  - Combine shading & lighting for final output
- Also see Wolfgang’s talk "Light Pre-Pass Renderer Mark III" from Monday for a wider description [8]

Screen-space tile classification
- Divide screen up into tiles and determine how many & which light sources intersect each tile
- Only apply the visible light sources on pixels in each tile
  - Reduced BW & setup cost with multiple lights in single shader
- Used in Naughty Dog’s Uncharted [9] and SCEE PhyreEngine [10]
- Hmm, isn’t light classification per screen-space tile sort of similar to how a compute shader can work with 2D thread groups?
- Answer: YES, except a CS can do everything in a single pass!
From "The Technology of Uncharted", GDC’08 [9]

CS-based deferred shading
- Deferred shading using DX11 CS
  - Experimental implementation in Frostbite 2
  - Not production tested or optimized
  - Assumption: No shadows (for now)
- New hybrid Graphics/Compute shading pipeline:
  - Graphics pipeline rasterizes gbuffers for opaque surfaces
  - Compute pipeline uses gbuffers, culls light sources, computes lighting & combines with shading
  - (multiple other variants also possible)

CS requirements & setup
- Input data is gbuffers, depth buffer & light constants
- Output is fully composited & lit HDR texture
- 1 thread per pixel, 16x16 thread groups (aka tile)
Gbuffer figure labels: Normal, Roughness, Diffuse Albedo, Specular Albedo

    Texture2D<float4> gbufferTexture1 : register(t0);
    Texture2D<float4> gbufferTexture2 : register(t1);
    Texture2D<float4> gbufferTexture3 : register(t2);
    Texture2D<float4> depthTexture : register(t3);

    RWTexture2D<float4> outputTexture : register(u0);

    #define BLOCK_SIZE 16

    [numthreads(BLOCK_SIZE,BLOCK_SIZE,1)]
    void csMain(
        uint3 groupId : SV_GroupID,
        uint3 groupThreadId : SV_GroupThreadID,
        uint groupIndex : SV_GroupIndex,
        uint3 dispatchThreadId : SV_DispatchThreadID)
    {
        ...
    }
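With one thread per pixel and 16x16 threads per group, the host side needs one thread group per tile. The arithmetic is a round-up division, sketched here with a hypothetical `tileGroupCount` helper; the resolutions in the usage note are examples, not figures from the talk.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kBlockSize = 16;  // matches BLOCK_SIZE in the shader

struct GroupCount {
    std::uint32_t x;
    std::uint32_t y;
};

// Rounds up so that partially covered tiles at the right and bottom edges
// still get a thread group.
GroupCount tileGroupCount(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    return {(width + kBlockSize - 1) / kBlockSize,
            (height + kBlockSize - 1) / kBlockSize};
}
```

A 1280x720 target gives 80x45 groups; 1920x1080 gives 120x68, where the last row of tiles is only half covered, so threads in edge tiles have to check their pixel coordinate against the target size. The two counts are what would be passed as the first two arguments of `ID3D11DeviceContext::Dispatch`, with 1 as the third.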

CS steps 1-2
1. Load gbuffers & depth
2. Calculate min & max z in threadgroup / tile
   - Using InterlockedMin/Max on groupshared variable
   - Atomics only work on ints
   - But casting works (z is always +)
- Optimization note: Separate pass using parallel reduction with Gather to a small texture could be faster
- Note to the future:
  - GPU already has similar values in HiZ/ZCull!
  - Can skip step 2 if we could resolve out min & max z to a texture directly
Min z looks just like the occlusion software rendering output

    groupshared uint minDepthInt;
    groupshared uint maxDepthInt;

    // --- globals above, function below -------

    float depth = depthTexture.Load(uint3(texCoord, 0)).r;
    uint depthInt = asuint(depth);

    minDepthInt = 0xFFFFFFFF;
    maxDepthInt = 0;
    GroupMemoryBarrierWithGroupSync();

    InterlockedMin(minDepthInt, depthInt);
    InterlockedMax(maxDepthInt, depthInt);
    GroupMemoryBarrierWithGroupSync();

    float minGroupDepth = asfloat(minDepthInt);
    float maxGroupDepth = asfloat(maxDepthInt);
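The "casting works" remark relies on a property of IEEE 754 floats: for non-negative values, the raw bit patterns order the same way as the numbers themselves, so an integer min/max on the bits gives the float min/max. A small CPU-side check of that property, with hypothetical helper names (`asUint`, `asFloat`, `tileDepthRange`) chosen to mirror the HLSL intrinsics:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Bit-level reinterpretation, the CPU counterpart of HLSL asuint/asfloat.
static std::uint32_t asUint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

static float asFloat(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

struct DepthRange {
    float minDepth;
    float maxDepth;
};

// Integer min/max over the bit patterns, as the shader does with
// InterlockedMin/InterlockedMax. Only valid for non-negative depths.
DepthRange tileDepthRange(const std::vector<float>& depths) {
    assert(!depths.empty());
    std::uint32_t minBits = 0xFFFFFFFFu;
    std::uint32_t maxBits = 0u;
    for (float d : depths) {
        assert(!std::signbit(d));  // the sign bit would break the ordering
        const std::uint32_t bits = asUint(d);
        minBits = std::min(minBits, bits);
        maxBits = std::max(maxBits, bits);
    }
    return {asFloat(minBits), asFloat(maxBits)};
}
```

For negative values the sign bit makes the unsigned pattern large, so the trick would fail; that is why the slide notes that z is always positive.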

CS step 3 – Cull idea
- Determine visible light sources for each tile
  - Cull all light sources against tile "frustum"
  - Light sources can either naively be all light sources in the scene, or CPU frustum culled potentially visible light sources
- Output for each tile is:
  - # of visible light sources
  - Index list of visible light sources
- This is the key part of the algorithm and compute shader, so must try to be rather clever with the implementation!
Example numbers from test scene: per-tile visible light count (black = 0 lights, white = 40)

CS step 3 – Cull implementation
- Each thread switches to process light sources instead of a pixel*
  - Wow, parallelism switcheroo!
  - 256 light sources in parallel per tile
  - Multiple iterations for >256 lights
- Intersect light source & tile
  - Many variants dep. on accuracy requirements & performance
  - Tile min & max z is used as a shader "depth bounds" test
- For visible lights, append light index to index list
  - Atomic add to threadgroup shared memory. "Inlined stream compaction"
  - Prefix sum + stream compaction should be faster than atomics, but more limiting
- Synchronize group & switch back to processing pixels
  - We now know which light sources affect the tile
* Your grandfather’s pixel shader can’t do that!

    struct Light
    {
        float3 pos;
        float sqrRadius;
        float3 color;
        float invSqrRadius;
    };

    int lightCount;
    StructuredBuffer<Light> lights;

    // an initializer on a groupshared declaration is not a reliable way to
    // zero it, so the counter is cleared explicitly in the function below
    groupshared uint visibleLightCount;
    groupshared uint visibleLightIndices[1024];

    // ----- globals above, cont. function below -----------

    // one thread clears the counter before any thread appends to it
    if (groupIndex == 0)
        visibleLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    uint threadCount = BLOCK_SIZE*BLOCK_SIZE;
    uint passCount = (lightCount+threadCount-1) / threadCount;

    for (uint passIt = 0; passIt < passCount; ++passIt)
    {
        uint lightIndex = passIt*threadCount + groupIndex;

        // prevent overrun by clamping to a last "null" light
        lightIndex = min(lightIndex, lightCount);

        if (intersects(lights[lightIndex], tile))
        {
            uint offset;
            InterlockedAdd(visibleLightCount, 1, offset);
            visibleLightIndices[offset] = lightIndex;
        }
    }

    GroupMemoryBarrierWithGroupSync();
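The slides leave `intersects` and `tile` undefined. One possible variant, sketched on the CPU under loud assumptions: the tile is approximated by a view-space axis-aligned box (its z extent coming from the tile's min & max depth) and the light is a sphere, so the test is a sphere vs. box distance check. `TileBounds`, `TileLightList` and `cullLightsForTile` are hypothetical, the 256 "threads" are simulated by a loop, and a bounds check replaces the slide's null-light clamp.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Light {
    float pos[3];     // view-space position
    float sqrRadius;  // squared radius of influence
};

// Assumption: tile volume approximated by a view-space axis-aligned box.
struct TileBounds {
    float min[3];
    float max[3];
};

constexpr std::size_t kMaxVisibleLights = 1024;

struct TileLightList {
    std::atomic<std::uint32_t> count{0};  // mirrors visibleLightCount
    std::array<std::uint32_t, kMaxVisibleLights> indices{};
};

// Sphere vs. box: squared distance from the center to the nearest box point.
bool intersects(const Light& light, const TileBounds& tile) {
    float sqrDist = 0.0f;
    for (int axis = 0; axis < 3; ++axis) {
        const float p = light.pos[axis];
        const float nearest = std::max(tile.min[axis], std::min(p, tile.max[axis]));
        sqrDist += (p - nearest) * (p - nearest);
    }
    return sqrDist <= light.sqrRadius;
}

// Each simulated thread tests one light per pass and appends hits with an
// atomic add, the CPU analogue of InterlockedAdd on groupshared memory.
void cullLightsForTile(const std::vector<Light>& lights, const TileBounds& tile,
                       TileLightList& out) {
    const std::size_t threadCount = 16 * 16;
    const std::size_t passCount = (lights.size() + threadCount - 1) / threadCount;
    for (std::size_t pass = 0; pass < passCount; ++pass) {
        for (std::size_t groupIndex = 0; groupIndex < threadCount; ++groupIndex) {
            const std::size_t lightIndex = pass * threadCount + groupIndex;
            if (lightIndex >= lights.size()) continue;
            if (intersects(lights[lightIndex], tile)) {
                const std::uint32_t offset = out.count.fetch_add(1);
                if (offset < kMaxVisibleLights)  // never write past the list
                    out.indices[offset] = static_cast<std::uint32_t>(lightIndex);
            }
        }
    }
}
```

If more lights hit the tile than the list can hold, `count` still reports the full number, so a consumer would clamp it to `kMaxVisibleLights` before looping. The HLSL on the slide has no such guard on `visibleLightIndices[offset]`.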

CS deferred shading final steps
- For each pixel, accumulate lighting from visible lights
  - Read from tile visible light index list in threadgroup shared memory
- Combine lighting & shading albedos / parameters
  - Output is non-MSAA HDR texture
  - Render transparent surfaces on top
Figures: Computed lighting; Combined final output (not the best example)

    float3 diffuseLight = 0;
    float3 specularLight = 0;

    for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt)
    {
        uint lightIndex = visibleLightIndices[lightIt];
        Light light = lights[lightIndex];

        evaluateAndAccumulateLight(
            light,
            gbufferParameters,
            diffuseLight,
            specularLight);
    }

Example: 25+ analytical specular highlights per pixel

Compute Shader-based Deferred Shading demo

CS-based deferred shading
- The Good:
  - Constant & absolute minimal bandwidth
    - Read gbuffers & depth once!
  - Doesn’t need intermediate light buffers
    - Can take a lot of memory with HDR, MSAA & color specular
  - Scales up to huge amount of big overlapping light sources!
    - Fine-grained culling (16x16)
    - Only ALU cost, good future scaling
    - Could be useful for accumulating VPLs
- The Bad:
  - Requires DX11 HW (duh)
    - CS 4.0/4.1 difficult due to atomics & scattered groupshared writes
  - Culling overhead for small light sources
    - Can accumulate them using standard light volume rendering
    - Or separate CS for tile-classific.
  - Potentially performance
    - MSAA texture loads / UAV writing might be slower than standard PS
- The Ugly:
  - Can’t output to MSAA texture
    - DX11 CS UAV limitation.

Future programming model
- Queues as compute shader streaming in/outs
  - In addition to buffers/textures/UAVs
  - Simple & expressive model supporting irregular workloads
  - Keeps data on chip, supports variable sized caches & cores
- Build your pipeline of stages with queues between
  - Shader & fixed function stages (sampler, rasterizer, tessellator, Zcull)
  - Developers can make the GPU feed itself!
GRAMPS model example [8]

What else do we want to do?
WARNING: Overly enthusiastic and non all-knowing game developer ranting
- Mixed resolution MSAA particle rendering
  - Depth test per sample, shade per quarter pixel, and depth-aware upsample directly in shader
- Demand-paged procedural texturing / compositing
  - Zero latency "texture shaders"
- Pre-tessellation coarse rasterization for z-culling of patches
  - Potential optimization in scenes of massive geometric overdraw
  - Can be coupled with recursive schemes
- Deferred shading w/ many & arbitrary BRDFs/materials
  - Queue up pixels of multiple materials for coherent processing in own shader
  - Instead of incoherent screen-space dynamic flow control
- Latency-free lens flares
  - Finally! No false/late occlusion
  - Occlusion query results written to CB and used in shader to cull & scale
- And much much more...
