Parallel Futures of a Game Engine

Public version 10
Johan Andersson
Rendering Architect, DICE

Background
- DICE
  - Stockholm, Sweden
  - ~250 employees
  - Part of Electronic Arts
  - Battlefield & Mirror’s Edge game series
- Frostbite
  - Proprietary game engine used at DICE & EA
  - Developed by DICE over the last 5 years

http://badcompany2.ea.com/

Outline
- Game engine 101
- Current parallelism
- Futures
- Q&A

Game development
- 2 year development cycle
  - New IP often takes much longer, 3-5 years
  - Engine is continuously in development & used
- AAA teams of 70-90 people
  - 50% artists
  - 30% designers
  - 20% programmers
  - 10% audio
- Budgets $20-40 million
- Cross-platform development is market reality
  - Xbox 360 and PlayStation 3
  - PC DX10 and DX11 (and sometimes Mac)
  - Current consoles will stay with us for many more years

Game engine requirements (1/2)
- Stable real-time performance
  - Frame-driven updates, 30 fps
  - Few threads, instead per-frame jobs/tasks for everything
- Predictable memory usage
  - Fixed budgets for systems & content, fail if over
  - Avoid runtime allocations
  - Love unified memory!
- Cross-platform
  - The consoles determine our base tech level & focus
  - PS3 is design target, most difficult and good potential
  - Scale up for PC, dual core is min spec (slow!)

Game engine requirements (2/2)
- Full system profiling/debugging
  - Engine is a vertical solution, touches everywhere
  - PIX, xbtracedump, SN Tuner, ETW, GPUView
- Quick iterations
  - Essential in order to be creative
  - Fast building & fast loading, hot-swapping resources
  - Affects both the tools and the game
- Middleware
  - Use when it makes sense, cross-platform & optimized
  - Parallelism has to go through our systems

Levels of code in Frostbite
- Editor (C#)
- Pipeline (C++)
- Game code (C++)
- System CPU-jobs (C++)
- System SPU-jobs (C++/asm)
- Generated shaders (HLSL)
- Compute kernels (HLSL)
(Diagram groups these levels by Offline vs. Runtime and by CPU vs. GPU)

Editor & Pipeline
- Editor ("FrostEd 2")
  - WYSIWYG editor for content
  - C#, Windows only
  - Basic threading / tasks
- Pipeline
  - Offline/background data-processing & conversion
  - C++, some MC++, Windows only
  - Typically IO-bound
  - A few compute-heavy steps use CPU-jobs
  - Texture compression uses CUDA, would prefer OpenCL or CS
  - Lighting pre-calculation using IncrediBuild over 100+ machines
- CPU parallelism models are generally not a problem here

General "game code" (1/2)
- This is the majority of our 1.5 million lines of C++
  - Runs on Win32, Win64, Xbox 360 and PS3
  - Similar to general application code
- Huge amount of code & logic to maintain + continue to develop
  - Low compute density
  - "Glue code"
  - Scattered in memory (pointer chasing)
  - Difficult to efficiently parallelize
  - Out-of-order execution is a big help, but consoles are in-order
- Key to be able to quickly iterate & change
  - This is the actual game logic & glue that builds the game
  - C++ not ideal, but has the invested infrastructure

General "game code" (2/2)
- PS3 is one of the main challenges
  - Standard CPU parallelization doesn’t help
  - CELL only has 2 HW threads on the PPU
- Split the code in 2: game code & system code
  - Game logic, policy and glue code only on CPU
  - "If it runs well on the PS3 PPU, it runs well everywhere"
  - Lower-level systems on PS3 SPUs
- Main goals going forward:
  - Simplify & structure code base
  - Reduce coupling with lower-level systems
  - Increase in task parallelism for PC
(Figure: CELL processor)

Job-based parallelism
- Essential to utilize the cores on our target platforms
  - Xbox 360: 6 HW threads
  - PlayStation 3: 2 HW threads + 6 powerful SPUs
  - PC: 2-16 HW threads (Nehalem HT is great!)
- Divide up system work into Jobs (a.k.a. Tasks)
  - 15-200k C++ code each. 25k is common
  - Can depend on each other (if needed)
  - Dependencies create job graph
  - All HW threads consume jobs, ~200-300 / frame

What is a Job for us?
- An asynchronous function call
  - Function ptr + 4 uintptr_t parameters
  - Cross-platform scheduler: EA JobManager
  - Often uses work stealing
- 2 types of Jobs in Frostbite:
  - CPU job (good)
    - General code moved into job instead of threads
  - SPU job (great!)
    - Stateless pure functions, no side effects
    - Data-oriented, explicit memory DMA to local store
    - Designed to run on the PS3 SPUs = also very fast on in-order CPU
    - Can hot-swap for quick iterations
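
The "function ptr + 4 uintptr_t parameters" shape can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `Job`, `JobQueue`, `SumJobData` and `sumJob` are hypothetical names, the queue is single-threaded, and none of this is the EA JobManager API.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical job record: a function pointer plus four pointer-sized parameters.
using JobFunc = void (*)(std::uintptr_t, std::uintptr_t, std::uintptr_t, std::uintptr_t);

struct Job {
    JobFunc func;
    std::uintptr_t params[4];
};

// Toy single-threaded queue standing in for a real scheduler (no work stealing).
class JobQueue {
public:
    void push(JobFunc func, std::uintptr_t p0 = 0, std::uintptr_t p1 = 0,
              std::uintptr_t p2 = 0, std::uintptr_t p3 = 0) {
        assert(func != nullptr);
        jobs_.push_back(Job{func, {p0, p1, p2, p3}});
    }

    // Runs all queued jobs in FIFO order and returns how many were executed.
    int runAll() {
        int executed = 0;
        while (!jobs_.empty()) {
            const Job job = jobs_.front();
            jobs_.pop_front();
            job.func(job.params[0], job.params[1], job.params[2], job.params[3]);
            ++executed;
        }
        return executed;
    }

private:
    std::deque<Job> jobs_;
};

// Hypothetical job data: all inputs and the output live in one struct.
struct SumJobData {
    const int* values;
    int count;
    int result;
};

// The job only touches memory reachable from its first parameter.
void sumJob(std::uintptr_t dataEa, std::uintptr_t, std::uintptr_t, std::uintptr_t) {
    SumJobData* data = reinterpret_cast<SumJobData*>(dataEa);
    int total = 0;
    for (int i = 0; i < data->count; ++i) total += data->values[i];
    data->result = total;
}

int runSumExample() {
    static const int values[] = {1, 2, 3, 4};
    SumJobData data{values, 4, 0};
    JobQueue queue;
    queue.push(sumJob, reinterpret_cast<std::uintptr_t>(&data));
    queue.runAll();
    return data.result;
}
```

Passing the job data by address rather than by captured state is what lets the same entry point be reused on a processor that first has to copy the struct into its own local memory.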

EntityRenderCull job example

    struct FB_ALIGN(16) EntityRenderCullJobData
    {
        enum
        {
            MaxSphereTreeCount = 2,
            MaxStaticCullTreeCount = 2
        };

        uint sphereTreeCount;
        const SphereNode* sphereTrees[MaxSphereTreeCount];
        u8 viewCount;
        u8 frustumCount;
        u8 viewIntersectFlags[32];
        Frustum frustums[32];

        .... (cut out 2/3 of struct for display size)

        u32 maxOutEntityCount;

        // Output data, pre-allocated by caller
        u32 outEntityCount;
        EntityRenderCullInfo* outEntities;
    };

    void entityRenderCullJob(EntityRenderCullJobData* data);
    void validate(const EntityRenderCullJobData& data);

- Frustum culling of dynamic entities in sphere tree
- struct contains all input data needed
- Max output data pre-allocated by caller
- Single job function
  - Compile both as CPU & SPU job
- Optional struct validation func
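
As a rough illustration of the same pattern (one struct holding all inputs, an output buffer with a fixed maximum, a single job function and an optional validation function), here is a much smaller culling job. It is a sketch under assumptions: every type and name below is hypothetical, and it tests a flat array of spheres against six planes instead of walking a sphere tree as the Frostbite job does.

```cpp
#include <cassert>
#include <cstdint>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 normal; float d; };          // inside when dot(normal, p) + d >= 0
struct Sphere { Vec3 center; float radius; };
struct SimpleFrustum { Plane planes[6]; };

// Hypothetical job data: inputs first, then the pre-allocated output.
struct SphereCullJobData {
    const Sphere* spheres;
    std::uint32_t sphereCount;
    SimpleFrustum frustum;
    std::uint32_t maxOutCount;   // capacity of outIndices
    std::uint32_t outCount;      // written by the job
    std::uint32_t* outIndices;   // buffer owned by whoever submits the job
};

// Optional validation: catches malformed job data before the job runs.
void validate(const SphereCullJobData& data) {
    assert(data.spheres != nullptr || data.sphereCount == 0);
    assert(data.outIndices != nullptr || data.maxOutCount == 0);
}

// Writes the indices of all spheres that are not fully outside any plane.
void sphereCullJob(SphereCullJobData* data) {
    data->outCount = 0;
    for (std::uint32_t i = 0; i < data->sphereCount; ++i) {
        const Sphere& s = data->spheres[i];
        bool visible = true;
        for (const Plane& p : data->frustum.planes) {
            const float distance = p.normal.x * s.center.x + p.normal.y * s.center.y +
                                   p.normal.z * s.center.z + p.d;
            if (distance < -s.radius) { visible = false; break; }
        }
        // Never write past the pre-allocated output capacity.
        if (visible && data->outCount < data->maxOutCount)
            data->outIndices[data->outCount++] = i;
    }
}

// Axis-aligned box [-h, h]^3 used as a stand-in for a real view frustum.
SimpleFrustum makeBoxFrustum(float h) {
    SimpleFrustum f = {};
    f.planes[0] = Plane{Vec3{ 1.0f,  0.0f,  0.0f}, h};
    f.planes[1] = Plane{Vec3{-1.0f,  0.0f,  0.0f}, h};
    f.planes[2] = Plane{Vec3{ 0.0f,  1.0f,  0.0f}, h};
    f.planes[3] = Plane{Vec3{ 0.0f, -1.0f,  0.0f}, h};
    f.planes[4] = Plane{Vec3{ 0.0f,  0.0f,  1.0f}, h};
    f.planes[5] = Plane{Vec3{ 0.0f,  0.0f, -1.0f}, h};
    return f;
}

std::uint32_t runCullExample() {
    const Sphere spheres[3] = {
        {{0.0f, 0.0f, 0.0f}, 1.0f},    // fully inside
        {{20.0f, 0.0f, 0.0f}, 1.0f},   // fully outside
        {{10.5f, 0.0f, 0.0f}, 1.0f}};  // straddles the boundary, so it is kept
    std::uint32_t visible[3] = {};
    SphereCullJobData data = {spheres, 3, makeBoxFrustum(10.0f), 3, 0, visible};
    validate(data);
    sphereCullJob(&data);
    return data.outCount;
}
```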

EntityRenderCull SPU setup

    // local store variables
    EntityRenderCullJobData g_jobData;
    float g_zBuffer[256*114];
    u16 g_terrainHeightData[64*64];

    int main(uintptr_t dataEa, uintptr_t, uintptr_t, uintptr_t)
    {
        dmaBlockGet("jobData", &g_jobData, dataEa, sizeof(g_jobData));
        validate(g_jobData);

        if (g_jobData.zBufferTestEnable)
        {
            dmaAsyncGet("zBuffer", g_zBuffer, g_jobData.zBuffer,
                        g_jobData.zBufferResX*g_jobData.zBufferResY*4);
            g_jobData.zBuffer = g_zBuffer;

            if (g_jobData.zBufferShadowTestEnable && g_jobData.terrainHeightData)
            {
                dmaAsyncGet("terrainHeight", g_terrainHeightData,
                            g_jobData.terrainHeightData,
                            g_jobData.terrainHeightDataSize);
                g_jobData.terrainHeightData = g_terrainHeightData;
            }

            dmaWaitAll(); // block on both DMAs
        }

        // run the actual job, will internally do streaming DMAs to the output entity list
        entityRenderCullJob(&g_jobData);

        // put back the data because we changed outEntityCount
        dmaBlockPut(dataEa, &g_jobData, sizeof(g_jobData));
        return 0;
    }

Frostbite CPU job graph
- Build big job graphs:
  - Batch, batch, batch
  - Future: Mix in low-latency GPU-jobs
- Job dependencies determine:
  - Execution order
  - i.e. the effective parallelism
- Intermixed task- & data-parallelism
  - aka Braided Parallelism
  - Task-parallel algorithms & coordination
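
One common way to let dependencies determine execution order is to give each job a count of unfinished prerequisites and release it when the count reaches zero. The sketch below assumes that approach; `JobGraph` and its methods are hypothetical names, it runs on a single thread, and it is not the Frostbite or EA JobManager implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

class JobGraph {
private:
    struct Node {
        std::function<void()> work;
        std::vector<std::size_t> dependents;   // jobs waiting on this one
        std::size_t prerequisiteCount;         // jobs this one waits on
    };
    std::vector<Node> nodes_;

public:
    // Adds a job and returns its handle.
    std::size_t add(std::function<void()> work) {
        nodes_.push_back(Node{std::move(work), {}, 0});
        return nodes_.size() - 1;
    }

    // Declares that `job` must not start before `prerequisite` has finished.
    void addDependency(std::size_t job, std::size_t prerequisite) {
        assert(job < nodes_.size() && prerequisite < nodes_.size());
        nodes_[prerequisite].dependents.push_back(job);
        ++nodes_[job].prerequisiteCount;
    }

    // Executes jobs as their prerequisites complete; returns the number executed.
    // A real scheduler would hand every entry of `ready` to any idle HW thread.
    std::size_t run() {
        std::vector<std::size_t> pending(nodes_.size());
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < nodes_.size(); ++i) {
            pending[i] = nodes_[i].prerequisiteCount;
            if (pending[i] == 0) ready.push(i);
        }
        std::size_t executed = 0;
        while (!ready.empty()) {
            const std::size_t current = ready.front();
            ready.pop();
            nodes_[current].work();
            ++executed;
            for (std::size_t dependent : nodes_[current].dependents)
                if (--pending[dependent] == 0) ready.push(dependent);
        }
        return executed;
    }
};

std::string runGraphExample() {
    std::string order;
    JobGraph graph;
    // Command buffer generation is added first but depends on both other jobs.
    const std::size_t commands = graph.add([&order] { order += 'B'; });
    const std::size_t cull = graph.add([&order] { order += 'C'; });
    const std::size_t particles = graph.add([&order] { order += 'P'; });
    graph.addDependency(commands, cull);
    graph.addDependency(commands, particles);
    graph.run();
    return order;
}
```

In this example the two independent jobs are both ready at the start, which is the "effective parallelism" a scheduler could exploit; the dependent job is the only sync point.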

Timing view
- Example: PC, 4 CPU cores, 2 GPUs in AFR (AMD Radeon 4870x2)
- Real-time in-game overlay
  - See timing events & effective parallelism
  - On CPU, SPU & GPU – for all platforms
  - Use to reduce sync-points & optimize load balancing
- GPU timing through DX event queries
- Our main performance tool!

Rendering jobs
- Rendering systems are heavily divided up into CPU- & SPU-jobs
- Jobs:
  - Terrain geometry [3]
  - Undergrowth generation [2]
  - Decal projection [4]
  - Particle simulation
  - Frustum culling
  - Occlusion culling
  - Occlusion rasterization
  - Command buffer generation [6]
  - PS3: Triangle culling [6]
- Most will move to GPU
  - Eventually.. A few have already!
  - Latency wall, more power and GPU memory access
  - Mostly one-way data flow

Occlusion culling job example
- Problem: Buildings & environment occlude large amounts of objects
- Obscured objects still have to:
  - Update logic & animations
  - Generate command buffer
  - Processed on CPU & GPU
  - = expensive & wasteful
- Difficult to implement full culling:
  - Destructible buildings
  - Dynamic occludees
  - Difficult to precompute
(Figure: from Battlefield: Bad Company PS3)

Solution: Software occlusion culling
- Rasterize coarse zbuffer on SPU/CPU
  - 256x114 float
  - Low-poly occluder meshes
  - 100 m view distance
  - Max 10000 vertices/frame
  - Parallel vertex & raster SPU-jobs
  - Cost: a few milliseconds
- Cull all objects against zbuffer
  - Screen-space bounding-box test
  - Before passed to all other systems
  - Big performance savings!
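
The screen-space bounding-box test can be sketched as follows. This is a minimal illustration under assumptions, not the Frostbite code: depth is taken to grow with distance (1.0 = far plane), `fillRect` stands in for real occluder rasterization, and `CoarseZBuffer`, `ScreenBox` and `isOccluded` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kZBufferWidth = 256;
constexpr int kZBufferHeight = 114;

struct CoarseZBuffer {
    // Cleared to the far plane; occluders can only bring depth values closer.
    std::vector<float> depth = std::vector<float>(kZBufferWidth * kZBufferHeight, 1.0f);

    // Stand-in for occluder rasterization: writes depth z into an inclusive rectangle.
    void fillRect(int x0, int y0, int x1, int y1, float z) {
        for (int y = std::max(y0, 0); y <= std::min(y1, kZBufferHeight - 1); ++y) {
            for (int x = std::max(x0, 0); x <= std::min(x1, kZBufferWidth - 1); ++x) {
                float& d = depth[y * kZBufferWidth + x];
                d = std::min(d, z);
            }
        }
    }
};

// Screen-space bounds of an object (inclusive pixels) and its nearest depth.
struct ScreenBox {
    int minX, minY, maxX, maxY;
    float nearestZ;
};

// Occluded only if every covered pixel holds an occluder nearer than the object.
bool isOccluded(const CoarseZBuffer& zb, const ScreenBox& box) {
    const int x0 = std::max(box.minX, 0);
    const int x1 = std::min(box.maxX, kZBufferWidth - 1);
    const int y0 = std::max(box.minY, 0);
    const int y1 = std::min(box.maxY, kZBufferHeight - 1);
    if (x0 > x1 || y0 > y1) return false;  // off-screen: leave it to frustum culling
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            if (box.nearestZ <= zb.depth[y * kZBufferWidth + x]) return false;
        }
    }
    return true;
}
```

Using the object's nearest depth and requiring every covered pixel to be blocked keeps the test conservative: an object is only dropped when nothing of it could be seen at this resolution. A production version would also need the occluder rasterization itself to be conservative.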

GPU occlusion culling
- Ideally want to use the GPU, but current APIs are limited:
  - Occlusion queries introduce overhead & latency
  - Conditional rendering only helps GPU
  - Compute Shader impl. possible, but same latency wall
- Future 1: Low-latency GPU execution context
  - Rasterization and testing done on GPU where it belongs
  - Lockstep with CPU, need to read back within a few ms
  - Possible on Larrabee, want standard on PC
  - Potential WDDM issue
- Future 2: Move entire cull & rendering to "GPU"
  - World, cull, systems, dispatch. End goal

Shader types
- Generated shaders [1]
  - Graph-based surface shaders
  - Treated as content, not code
  - Artist created
  - Generates HLSL code
  - Used by all meshes and 3d surfaces
- Graphics / Compute kernels
  - Hand-coded & optimized HLSL
  - Statically linked in with C++
  - Pixel- & compute-shaders
  - Lighting, post-processing & special effects
(Figure: graph-based surface shader in FrostEd 2)

Challenges
- 3 major challenges/goals going forward:
  1. How do we make it easier to develop, maintain & parallelize general game code?
  2. What do we need to continue to innovate & scale up real-time computational graphics?
  3. How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors?
- Most likely the same solution(s)!

Challenge 1
“How do we make it easier to develop, maintain & parallelize general game code?”
- Shared State Concurrency is a killer
  - Not a big believer in Software Transactional Memory either
  - Because of performance and too "optimistic" flow
- A more strict & adapted C++ model
  - Support for true immutable & r/w-only memory access
  - Per-thread/task memory access opt-in
    - To reduce the possibility for side effects in parallel code
  - As much compile-time validation as possible
- Micro-threads / coroutines as first class citizens
- More? (we are used to not having much, for us, practical innovation here)
- Other languages?

Challenge 1 - Task parallelism
- Multiple task libraries
  - EA JobManager
    - Current solution, designed primarily within SPU-job limitations
  - MS ConcRT, Apple GCD, Intel TBB
    - All have some good parts!
    - None works on all of our platforms, a key requirement
  - OpenMP
    - We don’t use it. Tiny band aid, doesn’t satisfy our control needs
- Need C++ enhancements to simplify usage
  - C++0x lambdas / GCD blocks
  - Glacial C++ development & deployment
    - Want on all platforms, so lost on this console generation
- Moving away from semi-static job graphs
  - Instead more dynamic on-demand job graphs
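
To illustrate why lambdas simplify task creation, the sketch below builds one task per chunk of a range, with each lambda capturing its own bounds instead of requiring a hand-written job-data struct. It uses standard C++11 only; `makeChunkTasks` is a hypothetical helper, and the tasks are run sequentially as a stand-in for handing them to a scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: splits [0, count) into chunks and creates one task per chunk.
std::vector<std::function<void()>> makeChunkTasks(
    std::size_t count, std::size_t chunkSize,
    const std::function<void(std::size_t, std::size_t)>& body) {
    assert(chunkSize > 0);
    std::vector<std::function<void()>> tasks;
    for (std::size_t begin = 0; begin < count; begin += chunkSize) {
        const std::size_t end = (begin + chunkSize < count) ? begin + chunkSize : count;
        // Each lambda captures its own range by value; no job-data struct is needed.
        tasks.push_back([body, begin, end] { body(begin, end); });
    }
    return tasks;
}

int runLambdaExample() {
    std::vector<int> values = {1, 2, 3, 4, 5, 6, 7};
    // Every task touches a disjoint range, so a real scheduler could run them in parallel.
    auto tasks = makeChunkTasks(values.size(), 3,
        [&values](std::size_t begin, std::size_t end) {
            for (std::size_t i = begin; i < end; ++i) values[i] *= 2;
        });
    for (auto& task : tasks) task();  // stand-in for submitting to a scheduler
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}
```

Because tasks are created at the call site from whatever data is at hand, this style also fits the dynamic, on-demand job graphs mentioned above better than a fixed table of pre-declared jobs.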

Challenge 2 - Definition
- Goal: "Real-time interactive graphics & simulation at a Pixar level of quality"
- Needed visual features:
  - Global indirect lighting & reflections
  - Complete anti-aliasing (frame buffers & shader)
  - Sub-pixel geometry
  - OIT (order-independent transparency)
  - Huge improvements in character animation
- These require massively more compute, bandwidth and an improved model!
(animation can’t be solved with just more/better compute, so pretend it doesn’t exist for now)

Challenge 2 - Problems
- Problems & limitations with current model:
  - MSAA sample storage doesn’t scale to 16x+
    - Esp. with HDR & deferred shading
  - GPU is handicapped by being spoon-fed by CPU
  - Irregular workloads are difficult / inefficient
  - Current HLSL is a limited language & model

Challenge 2 - Solutions
- Sounds like a job for a high-throughput oriented massive data-parallel processor
  - With a highly flexible programming model
  - The CPU, as we know it, and its APIs are only in the way
- Pure software solution not practical as next step after DX11 PC 1)
  - Multi-vendor & multi-architecture marketplace
  - Skeptical we will reach a multi-vendor standard ISA within 3+ years
  - Future consoles on the other hand, this would be preferred
  - And would love to be proven wrong by the IHVs!
- Want a rich high-level compute model as next step
  - Efficiently target both SW- & HW-pipeline architectures
  - Even if we had 100% SW solution, to simplify development
1) Depending on the time frame

"Pipelined Compute Shaders"
- Queues as streaming I/O between compute kernels
  - Simple & expressive model supporting irregular workloads
  - Keeps data on chip, supports variable sized caches & cores
  - Can target multiple types of HW & architectures
- Hybrid graphics/compute user-defined pipelines
  - Language/API defining fixed stages inputs & outputs
  - Pipelines can feed other pipelines (similar to DrawIndirect)
(Figure: Reyes-style Rendering with Ray Tracing; stage labels: Shade, Split, Raster, Sub-D, Prims, Tess, Frame Buffer, Trace)
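
The idea of kernels connected by queues can be mimicked on the CPU. The sketch below is only an analogy under assumptions: `Patch`, `splitKernel` and `shadeKernel` are hypothetical, the stages run one after another on a single thread, and nothing here corresponds to an existing DirectX, OpenCL or OpenGL feature. It shows the irregular part of the workload: the split stage emits a variable number of outputs per input and feeds its own input queue.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

struct Patch { float size; };

// Split kernel: patches larger than maxSize are halved and re-queued on the
// kernel's own input; small enough patches are streamed to the next stage.
void splitKernel(std::queue<Patch>& input, std::queue<Patch>& output, float maxSize) {
    assert(maxSize > 0.0f);
    while (!input.empty()) {
        const Patch patch = input.front();
        input.pop();
        if (patch.size > maxSize) {
            input.push(Patch{patch.size * 0.5f});
            input.push(Patch{patch.size * 0.5f});
        } else {
            output.push(patch);
        }
    }
}

// Shade kernel: consumes patches from its input queue and writes one result each.
void shadeKernel(std::queue<Patch>& input, std::vector<float>& frameBuffer) {
    while (!input.empty()) {
        frameBuffer.push_back(input.front().size);
        input.pop();
    }
}

std::size_t runPipelineExample() {
    std::queue<Patch> prims;   // queue feeding the split stage
    std::queue<Patch> diced;   // queue between split and shade
    std::vector<float> frameBuffer;
    prims.push(Patch{4.0f});   // splits twice, into four patches of size 1
    prims.push(Patch{1.0f});   // already small enough, passes straight through
    splitKernel(prims, diced, 1.0f);
    shadeKernel(diced, frameBuffer);
    return frameBuffer.size();
}
```

Neither kernel needs to know up front how many items it will produce or consume, which is the property that fixed-size buffers and separate "memory passes" make awkward.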

"Pipelined Compute Shaders"
- Wanted for next DirectX and OpenCL/OpenGL
  - As a standard, as soon as possible
  - My main request/wish!
  - Run on all: GPU, manycore and CPU
  - IHV-specific solutions can be good start for R&D
- Model is also a good fit for many of our CPU/SPU jobs
  - Parts of job graph can be seen as queues between stages
  - Easier to write kernels/jobs with streaming I/O
    - Instead of explicit fixed-buffers and "memory passes"
    - Or dynamic memory allocation

Language?
- Language for this model is a big question
  - But the concepts & infrastructure are what is important!
- Could be an extended HLSL or "data-parallel C++"
  - Data-oriented imperative language (i.e. not standard C++)
  - Think HLSL would probably be easier & the most explicit
  - Amount of code is small and written from scratch
- SIMT-style implicit vectorization is preferred over explicit vectorization
  - Easier to target multiple evolving architectures implicitly
  - Our CPU code is still stuck at SSE2

Language (cont.)
- Requirements:
  - Full rich debugging, ideally in Visual Studio
  - Asserts
  - Internal kernel profiling
  - Hot-swapping / edit-and-continue of kernels
- Opportunity for IHVs and platform providers to innovate here!
  - Try to aim for an eventual cross-vendor standard
  - Think of the co-development of Nvidia Cg and HLSL
