Parallel Futures of a Game Engine
Johan Andersson
Rendering Architect, DICE
Public version 1.0

Background
- DICE
  - Stockholm, Sweden
  - ~250 employees
  - Part of Electronic Arts
  - Battlefield & Mirror's Edge game series
- Frostbite
  - Proprietary game engine used at DICE & EA
  - Developed by DICE over the last 5 years

http://badcompany2.ea.com/

Outline
- Game engine 101
- Current parallelism
- Futures
- Q&A

Game engine 101

Game development
- 2 year development cycle
  - New IP often takes much longer, 3-5 years
  - Engine is continuously in development & used
- AAA teams of 70-90 people
  - 50% artists
  - 30% designers
  - 20% programmers
  - 10% audio
- Budgets $20-40 million
- Cross-platform development is market reality
  - Xbox 360 and PlayStation 3
  - PC DX10 and DX11 (and sometimes Mac)
  - Current consoles will stay with us for many more years

Game engine requirements (1/2)
- Stable real-time performance
  - Frame-driven updates, 30 fps
  - Few threads, instead per-frame jobs/tasks for everything
- Predictable memory usage
  - Fixed budgets for systems & content, fail if over
  - Avoid runtime allocations
  - Love unified memory!
- Cross-platform
  - The consoles determine our base tech level & focus
  - PS3 is design target, most difficult and good potential
  - Scale up for PC, dual core is min spec (slow!)

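One way to picture "fixed budgets, fail if over" is a linear allocator that owns a fixed block and refuses any request that would exceed it. The sketch below is only an illustration: the class name `BudgetArena` and its interface are assumptions, not Frostbite code, and returning `nullptr` stands in for whatever failure policy the engine actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-budget linear allocator (illustrative, not Frostbite code).
// The budget is reserved once up front; nothing is allocated from the heap afterwards.
class BudgetArena {
public:
    explicit BudgetArena(std::size_t budgetBytes)
        : storage_(budgetBytes), used_(0) {}

    // Returns nullptr when the request would exceed the budget.
    // `alignment` must be a power of two.
    void* allocate(std::size_t size, std::size_t alignment = 16) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(storage_.data());
        const std::uintptr_t current = base + used_;
        const std::uintptr_t aligned =
            (current + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
        const std::size_t offset = static_cast<std::size_t>(aligned - base);
        if (offset > storage_.size() || size > storage_.size() - offset) {
            return nullptr;  // over budget: the caller must treat this as a failure
        }
        used_ = offset + size;
        return storage_.data() + offset;
    }

    // Releases everything at once, for example at the end of a frame.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }
    std::size_t budget() const { return storage_.size(); }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

A system would create one arena per budget at load time and call `reset()` at a known point each frame, so the hot path never touches the general-purpose heap.
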
Game engine requirements (2/2)
- Full system profiling/debugging
  - Engine is a vertical solution, touches everything
  - PIX, xbtracedump, SN Tuner, ETW, GPUView
- Quick iterations
  - Essential in order to be creative
  - Fast building & fast loading, hot-swapping resources
  - Affects both the tools and the game
- Middleware
  - Use when it makes sense, cross-platform & optimized
  - Parallelism has to go through our systems

Current parallelism

Levels of code in Frostbite
- Editor (C#)
- Pipeline (C++)
- Game code (C++)
- System CPU-jobs (C++)
- System SPU-jobs (C++/asm)
- Generated shaders (HLSL)
- Compute kernels (HLSL)
(Diagram labels: Offline, CPU, Runtime, GPU)

Editor & Pipeline
- Editor ("FrostEd 2")
  - WYSIWYG editor for content
  - C#, Windows only
  - Basic threading / tasks
- Pipeline
  - Offline/background data-processing & conversion
  - C++, some MC++, Windows only
  - Typically IO-bound
  - A few compute-heavy steps use CPU-jobs
  - Texture compression uses CUDA, would prefer OpenCL or CS
  - Lighting pre-calculation using IncrediBuild over 100+ machines
- CPU parallelism models are generally not a problem here

General "game code" (1/2)
- This is the majority of our 1.5 million lines of C++
  - Runs on Win32, Win64, Xbox 360 and PS3
  - Similar to general application code
- Huge amount of code & logic to maintain + continue to develop
  - Low compute density
  - "Glue code"
  - Scattered in memory (pointer chasing)
  - Difficult to efficiently parallelize
  - Out-of-order execution is a big help, but consoles are in-order
- Key to be able to quickly iterate & change
  - This is the actual game logic & glue that builds the game
  - C++ not ideal, but has the invested infrastructure

General "game code" (2/2)
- PS3 is one of the main challenges
  - Standard CPU parallelization doesn't help
  - CELL only has 2 HW threads on the PPU
- Split the code in 2: game code & system code
  - Game logic, policy and glue code only on CPU
  - "If it runs well on the PS3 PPU, it runs well everywhere"
  - Lower-level systems on PS3 SPUs
- Main goals going forward:
  - Simplify & structure code base
  - Reduce coupling with lower-level systems
  - Increase in task parallelism for PC
(Figure: CELL processor)

Job-based parallelism
- Essential to utilize the cores on our target platforms
  - Xbox 360: 6 HW threads
  - PlayStation 3: 2 HW threads + 6 powerful SPUs
  - PC: 2-16 HW threads (Nehalem HT is great!)
- Divide up system work into Jobs (a.k.a. Tasks)
  - 15-200k C++ code each. 25k is common
  - Can depend on each other (if needed)
  - Dependencies create job graph
- All HW threads consume jobs
  - ~200-300 / frame

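The idea that dependencies form a graph which all hardware threads consume can be sketched in a few lines. This is a minimal illustration under stated assumptions: `JobGraph`, `add`, `depend` and `run` are hypothetical names, the scheduler is a plain mutex-protected ready queue rather than EA JobManager, and the graph is assumed to be acyclic.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical job graph (illustrative only, not the EA JobManager API).
class JobGraph {
public:
    using JobId = std::size_t;

    JobId add(std::function<void()> work) {
        nodes_.push_back(Node{std::move(work), {}, 0});
        return nodes_.size() - 1;
    }

    // `after` will not start until `before` has finished.
    void depend(JobId after, JobId before) {
        nodes_[before].dependents.push_back(after);
        ++nodes_[after].unmetDependencies;
    }

    // Runs every job exactly once on `threadCount` worker threads and returns
    // when all jobs are done. Assumes an acyclic graph and threadCount > 0.
    void run(unsigned threadCount) {
        std::mutex mutex;
        std::condition_variable wake;
        std::queue<JobId> ready;
        std::size_t remaining = nodes_.size();
        std::vector<std::size_t> unmet(nodes_.size());
        for (JobId i = 0; i < nodes_.size(); ++i) {
            unmet[i] = nodes_[i].unmetDependencies;
            if (unmet[i] == 0) ready.push(i);
        }

        auto worker = [&] {
            std::unique_lock<std::mutex> lock(mutex);
            for (;;) {
                wake.wait(lock, [&] { return !ready.empty() || remaining == 0; });
                if (ready.empty()) return;  // nothing left to run
                const JobId id = ready.front();
                ready.pop();
                lock.unlock();
                nodes_[id].work();          // run the job outside the lock
                lock.lock();
                --remaining;
                for (JobId d : nodes_[id].dependents) {
                    if (--unmet[d] == 0) ready.push(d);  // dependency satisfied
                }
                wake.notify_all();
            }
        };

        std::vector<std::thread> threads;
        for (unsigned i = 0; i < threadCount; ++i) threads.emplace_back(worker);
        for (auto& t : threads) t.join();
    }

private:
    struct Node {
        std::function<void()> work;
        std::vector<JobId> dependents;
        std::size_t unmetDependencies;
    };
    std::vector<Node> nodes_;
};
```

A production scheduler would replace the single locked queue with per-thread queues and work stealing; the dependency counting is the part that carries over.
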
What is a Job for us?
- An asynchronous function call
  - Function ptr + 4 uintptr_t parameters
  - Cross-platform scheduler: EA JobManager
    - Often uses work stealing
- 2 types of Jobs in Frostbite:
  - CPU job (good)
    - General code moved into job instead of threads
  - SPU job (great!)
    - Stateless pure functions, no side effects
    - Data-oriented, explicit memory DMA to local store
    - Designed to run on the PS3 SPUs = also very fast on in-order CPU
    - Can be hot-swapped for quick iterations

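A job described as "function ptr + 4 uintptr_t parameters" could look roughly like the record below. The names `Job`, `JobFunc`, `runJob`, `SumJobData` and `sumJob` are made up for this sketch, and `runJob` calls the function synchronously, whereas a real scheduler would queue the record and run it later on another core.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical job record modelled on "function ptr + 4 uintptr_t parameters".
using JobFunc = void (*)(std::uintptr_t, std::uintptr_t,
                         std::uintptr_t, std::uintptr_t);

struct Job {
    JobFunc func;
    std::uintptr_t params[4];
};

// Stand-in for the scheduler: runs the job immediately on the calling thread.
inline void runJob(const Job& job) {
    job.func(job.params[0], job.params[1], job.params[2], job.params[3]);
}

// Example job data: everything the job reads and writes lives in one struct.
struct SumJobData {
    const int* input;
    std::uintptr_t count;
    long long result;  // output, written by the job
};

// Example job: the first parameter carries the address of the data struct.
inline void sumJob(std::uintptr_t dataAddr, std::uintptr_t,
                   std::uintptr_t, std::uintptr_t) {
    SumJobData* data = reinterpret_cast<SumJobData*>(dataAddr);
    data->result = 0;
    for (std::uintptr_t i = 0; i < data->count; ++i) {
        data->result += data->input[i];
    }
}

inline Job makeSumJob(SumJobData* data) {
    return Job{&sumJob, {reinterpret_cast<std::uintptr_t>(data), 0, 0, 0}};
}
```

Passing an address as an integer rather than a typed pointer is what lets the same record be handed to a processor with its own address space, such as an SPU.
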
EntityRenderCull job example
Frustum culling of dynamic entities in sphere tree

    struct FB_ALIGN(16) EntityRenderCullJobData
    {
        enum
        {
            MaxSphereTreeCount = 2,
            MaxStaticCullTreeCount = 2
        };

        uint sphereTreeCount;
        const SphereNode* sphereTrees[MaxSphereTreeCount];
        u8 viewCount;
        u8 frustumCount;
        u8 viewIntersectFlags[32];
        Frustum frustums[32];

        .... (cut out 2/3 of struct for display size)

        u32 maxOutEntityCount;

        // Output data, pre-allocated by caller
        u32 outEntityCount;
        EntityRenderCullInfo* outEntities;
    };

    void entityRenderCullJob(EntityRenderCullJobData* data);
    void validate(const EntityRenderCullJobData& data);

- struct contains all input data needed
- Max output data pre-allocated by caller
- Single job function
  - Compile both as CPU & SPU job
- Optional struct validation func

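The same pattern (one struct with all inputs, a caller-allocated output array with a hard cap, a single job function and an optional validation function) can be shown end to end on a much smaller problem. Everything below is a hypothetical stand-in: `Plane`, `Sphere`, `SphereCullJobData` and `sphereCullJob` are invented for the sketch, and it culls a flat list of spheres rather than a sphere tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in types, not Frostbite's.
struct Plane  { float nx, ny, nz, d; };      // inside when n.p + d >= 0
struct Sphere { float x, y, z, radius; };

struct alignas(16) SphereCullJobData {
    // Input data
    const Sphere* spheres;
    std::uint32_t sphereCount;
    Plane planes[6];
    std::uint32_t planeCount;

    // Output data, pre-allocated by the caller
    std::uint32_t maxOutCount;
    std::uint32_t outCount;
    std::uint32_t* outVisible;  // indices of spheres that survived culling
};

// Optional validation: returns false if the struct is obviously malformed.
inline bool validate(const SphereCullJobData& data) {
    return data.planeCount <= 6 &&
           (data.sphereCount == 0 || data.spheres != nullptr) &&
           (data.maxOutCount == 0 || data.outVisible != nullptr);
}

// The job itself: reads only from `data`, writes only to the output fields.
inline void sphereCullJob(SphereCullJobData* data) {
    data->outCount = 0;
    for (std::uint32_t i = 0; i < data->sphereCount; ++i) {
        const Sphere& s = data->spheres[i];
        bool visible = true;
        for (std::uint32_t p = 0; p < data->planeCount && visible; ++p) {
            const Plane& pl = data->planes[p];
            const float dist = pl.nx * s.x + pl.ny * s.y + pl.nz * s.z + pl.d;
            visible = dist >= -s.radius;  // fully behind a plane means culled
        }
        // Never write past the capacity the caller provided.
        if (visible && data->outCount < data->maxOutCount) {
            data->outVisible[data->outCount++] = i;
        }
    }
}
```

Because the job touches nothing outside its struct, the same function can be compiled for a CPU worker or wrapped in an SPU entry point that first copies the struct into local store.
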
EntityRenderCull SPU setup

    // local store variables
    EntityRenderCullJobData g_jobData;
    float g_zBuffer[256*114];
    u16 g_terrainHeightData[64*64];

    int main(uintptr_t dataEa, uintptr_t, uintptr_t, uintptr_t)
    {
        dmaBlockGet("jobData", &g_jobData, dataEa, sizeof(g_jobData));
        validate(g_jobData);

        if (g_jobData.zBufferTestEnable)
        {
            dmaAsyncGet("zBuffer", g_zBuffer, g_jobData.zBuffer, g_jobData.zBufferResX*g_jobData.zBufferResY*4);
            g_jobData.zBuffer = g_zBuffer;

            if (g_jobData.zBufferShadowTestEnable && g_jobData.terrainHeightData)
            {
                dmaAsyncGet("terrainHeight", g_terrainHeightData, g_jobData.terrainHeightData, g_jobData.terrainHeightDataSize);
                g_jobData.terrainHeightData = g_terrainHeightData;
            }

            dmaWaitAll(); // block on both DMAs
        }

        // run the actual job, will internally do streaming DMAs to the output entity list
        entityRenderCullJob(&g_jobData);

        // put back the data because we changed outEntityCount
        dmaBlockPut(dataEa, &g_jobData, sizeof(g_jobData));
        return 0;
    }

Frostbite CPU job graph
- Build big job graphs:
  - Batch, batch, batch
  - Mix CPU- & SPU-jobs
  - Future: Mix in low-latency GPU-jobs
- Job dependencies determine:
  - Execution order
  - Sync points
  - Load balancing
  - i.e. the effective parallelism
- Intermixed task- & data-parallelism
  - aka Braided Parallelism
  - aka Nested Data-Parallelism
  - aka Tasks and Kernels

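"Intermixed task- & data-parallelism" can be illustrated with two independent tasks that each run a data-parallel loop over their own array. This is a rough sketch, not the engine's scheduler: `parallelFor` and `updateFrame` are hypothetical names, `std::async` stands in for task jobs, and one `std::thread` per chunk stands in for data-parallel jobs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Data-parallel part: applies kernel(i) to every index in [0, count),
// one chunk per thread. In a job system each chunk would be a job instead.
template <typename Kernel>
void parallelFor(std::size_t count, std::size_t chunkSize, Kernel kernel) {
    assert(chunkSize > 0);
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < count; begin += chunkSize) {
        const std::size_t end = std::min(count, begin + chunkSize);
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) kernel(i);
        });
    }
    for (auto& w : workers) w.join();
}

// Task-parallel part: two unrelated tasks run concurrently, and each one
// fans out into its own data-parallel loop (the "braid").
inline void updateFrame(std::vector<float>& positions,
                        std::vector<float>& particles) {
    auto animate = std::async(std::launch::async, [&positions] {
        parallelFor(positions.size(), 64,
                    [&positions](std::size_t i) { positions[i] += 1.0f; });
    });
    auto simulate = std::async(std::launch::async, [&particles] {
        parallelFor(particles.size(), 64,
                    [&particles](std::size_t i) { particles[i] *= 0.5f; });
    });
    animate.get();   // sync point: both tasks must finish before the frame ends
    simulate.get();
}
```

Each element is written by exactly one chunk, so the kernels need no locking; the only synchronization is the join at the end of each loop and the two `get()` calls.
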
(Figure: Data-parallel jobs)

(Figure: Task-parallel algorithms & coordination)

Timing view
- Example: PC, 4 CPU cores, 2 GPUs in AFR (AMD Radeon 4870x2)
- Real-time in-game overlay
  - See timing events & effective parallelism
  - On CPU, SPU & GPU, for all platforms
  - Use to reduce sync-points & optimize load balancing
- GPU timing through DX event queries
- Our main performance tool!

Rendering jobs
- Rendering systems are heavily divided up into CPU- & SPU-jobs
- Jobs:
  - Terrain geometry [3]
  - Undergrowth generation [2]
  - Decal projection [4]
  - Particle simulation
  - Frustum culling
  - Occlusion culling
  - Occlusion rasterization
  - Command buffer generation [6]
  - PS3: Triangle culling [6]
- Most will move to GPU
  - Eventually.. A few have already!
  - Latency wall, more power and GPU memory access
  - Mostly one-way data flow

Occlusion culling job example
- Problem: Buildings & env occlude large amounts of objects
- Obscured objects still have to:
  - Update logic & animations
  - Generate command buffer
  - Processed on CPU & GPU
  - = expensive & wasteful
- Difficult to implement full culling:
  - Destructible buildings
  - Dynamic occludees
  - Difficult to precompute
(Image: From Battlefield: Bad Company PS3)

Solution: Software occlusion culling
- Rasterize coarse zbuffer on SPU/CPU
  - 256x114 float
  - Low-poly occluder meshes
  - 100 m view distance
  - Max 10000 vertices/frame
  - Parallel vertex & raster SPU-jobs
  - Cost: a few milliseconds
- Cull all objects against zbuffer
  - Screen-space bounding-box test
  - Before passed to all other systems
  - Big performance savings!

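The screen-space bounding-box test against a coarse z-buffer can be sketched as follows. The types `CoarseZBuffer` and `ScreenBox` and the function `isPotentiallyVisible` are assumptions made for illustration; the sketch assumes larger depth means farther away, fills rectangles instead of rasterizing occluder meshes, and treats boxes that fall entirely outside the buffer as not visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical coarse depth buffer; larger depth value = farther from the camera.
struct CoarseZBuffer {
    int width;
    int height;
    std::vector<float> depth;  // row-major, width * height entries

    CoarseZBuffer(int w, int h, float farDepth)
        : width(w), height(h),
          depth(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), farDepth) {}

    float at(int x, int y) const {
        return depth[static_cast<std::size_t>(y) * width + x];
    }

    // Stand-in for occluder rasterization: keeps the nearest depth per pixel.
    void fillRect(int x0, int y0, int x1, int y1, float z) {
        for (int y = std::max(y0, 0); y <= std::min(y1, height - 1); ++y) {
            for (int x = std::max(x0, 0); x <= std::min(x1, width - 1); ++x) {
                float& d = depth[static_cast<std::size_t>(y) * width + x];
                d = std::min(d, z);
            }
        }
    }
};

// Screen-space bounds of an object plus the depth of its nearest point.
struct ScreenBox {
    int minX, minY, maxX, maxY;
    float nearestDepth;
};

// Conservative test: the object is kept if its nearest point is in front of
// (or level with) the occluder depth at any pixel its bounding box covers.
inline bool isPotentiallyVisible(const CoarseZBuffer& zb, ScreenBox box) {
    box.minX = std::max(box.minX, 0);
    box.minY = std::max(box.minY, 0);
    box.maxX = std::min(box.maxX, zb.width - 1);
    box.maxY = std::min(box.maxY, zb.height - 1);
    if (box.minX > box.maxX || box.minY > box.maxY) return false;  // off-screen
    for (int y = box.minY; y <= box.maxY; ++y) {
        for (int x = box.minX; x <= box.maxX; ++x) {
            if (box.nearestDepth <= zb.at(x, y)) return true;
        }
    }
    return false;  // every covered pixel has a nearer occluder
}
```

The test is conservative in one direction only: it may keep an object that is in fact hidden, but with accurate occluder depths it does not discard one that is visible.
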
GPU occlusion culling
- Ideally want to use the GPU, but current APIs are limited:
  - Occlusion queries introduce overhead & latency
  - Conditional rendering only helps GPU
  - Compute Shader impl. possible, but same latency wall
- Future 1: Low-latency GPU execution context
  - Rasterization and testing done on GPU where it belongs
  - Lockstep with CPU, need to read back within a few ms
  - Possible on Larrabee, want standard on PC
    - Potential WDDM issue
- Future 2: Move entire cull & rendering to "GPU"
  - World, cull, systems, dispatch. End goal

Shader types
- Generated shaders [1]
  - Graph-based surface shaders
    - Treated as content, not code
    - Artist created
    - Generates HLSL code
  - Used by all meshes and 3d surfaces
- Graphics / Compute kernels
  - Hand-coded & optimized HLSL
  - Statically linked in with C++
  - Pixel- & compute-shaders
  - Lighting, post-processing & special effects
(Figure: Graph-based surface shader in FrostEd 2)

Futures

Challenges
3 major challenges/goals going forward:
1. How do we make it easier to develop, maintain & parallelize general game code?
2. What do we need to continue to innovate & scale up real-time computational graphics?
3. How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors?
Most likely the same solution(s)!

Challenge 1
"How do we make it easier to develop, maintain & parallelize general game code?"
- Shared State Concurrency is a killer
  - Not a big believer in Software Transactional Memory either
    - Because of performance and too "optimistic" flow
- A more strict & adapted C++ model
  - Support for true immutable & r/w-only memory access
  - Per-thread/task memory access opt-in
    - To reduce the possibility for side effects in parallel code
  - As much compile-time validation as possible
- Micro-threads / coroutines as first class citizens
- More? (we are used to not having much, for us, practical innovation here)
- Other languages?

Challenge 1 - Task parallelism
- Multiple task libraries
  - EA JobManager
    - Current solution, designed primarily within SPU-job limitations
  - MS ConcRT, Apple GCD, Intel TBB
    - All have some good parts!
    - None works on all of our platforms, key requirement
  - OpenMP
    - We don't use it. Tiny band aid, doesn't satisfy our control needs
- Need C++ enhancements to simplify usage
  - C++0x lambdas / GCD blocks
  - Glacial C++ development & deployment
    - Want on all platforms, so lost on this console generation
- Moving away from semi-static job graphs
  - Instead more dynamic on-demand job graphs

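To show why lambdas simplify job usage: with them, a task can capture its inputs directly instead of going through a hand-written parameter struct and a free function. The sketch below uses standard C++11 only; `spawnTask` and `sumInParallel` are hypothetical helpers, and `std::async` is just a stand-in for a real job scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Stand-in for "submit a job": std::async is used purely for illustration.
template <typename Work>
auto spawnTask(Work work) -> std::future<decltype(work())> {
    return std::async(std::launch::async, std::move(work));
}

// Sums a vector in two tasks. Each lambda captures exactly what it needs,
// so there is no separate job-data struct to declare, fill in and keep in sync.
inline long long sumInParallel(const std::vector<int>& values) {
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(values.size() / 2);
    auto low = spawnTask([&values, half] {
        return std::accumulate(values.begin(), values.begin() + half, 0LL);
    });
    auto high = spawnTask([&values, half] {
        return std::accumulate(values.begin() + half, values.end(), 0LL);
    });
    return low.get() + high.get();
}
```

The trade-off is that captures by reference hide the data dependencies that an explicit struct makes visible, which matters for code that must also run on an SPU with its own local store.
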
Challenge 2 - Definition
- Goal: "Real-time interactive graphics & simulation at a Pixar level of quality"
- Needed visual features:
  - Global indirect lighting & reflections
  - Complete anti-aliasing (frame buffers & shader)
  - Sub-pixel geometry
  - OIT
  - Huge improvements in character animation
- These require massively more compute, BW and an improved model!
(animation can't be solved with just more/better compute, so pretend it doesn't exist for now)

Challenge 2 - Problems
- Problems & limitations with current model:
  - MSAA sample storage doesn't scale to 16x+
    - Esp. with HDR & deferred shading
  - GPU is handicapped by being spoon-fed by CPU
  - Irregular workloads are difficult / inefficient
  - Current HLSL is a limited language & model

Challenge 2 - Solutions
- Sounds like a job for a high-throughput oriented massive data-parallel processor
  - With a highly flexible programming model
  - The CPU, as we know it, and its APIs are only in the way
- Pure software solution not practical as next step after DX11 PC 1)
  - Multi-vendor & multi-architecture marketplace
  - Skeptical we will reach a multi-vendor standard ISA within 3+ years
  - Future consoles on the other hand, this would be preferred
  - And would love to be proven wrong by the IHVs!
- Want a rich high-level compute model as next step
  - Efficiently target both SW- & HW-pipeline architectures
  - Even if we had 100% SW solution, to simplify development
1) Depending on the time frame

"Pipelined Compute Shaders"
- Queues as streaming I/O between compute kernels
  - Simple & expressive model supporting irregular workloads
  - Keeps data on chip, supports variable sized caches & cores
  - Can target multiple types of HW & architectures
- Hybrid graphics/compute user-defined pipelines
  - Language/API defining fixed stages inputs & outputs
  - Pipelines can feed other pipelines (similar to DrawIndirect)
(Figure: Reyes-style Rendering with Ray Tracing; stage labels: Shade, Split, Raster, Sub-D Prims, Tess, Frame Buffer, Trace)

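The queue model handles irregular workloads because a kernel may emit zero, one or many items per input. A single-threaded C++ sketch of two stages joined by queues makes this concrete; `Patch`, `splitKernel`, `shadeKernel` and `runPipeline` are hypothetical names loosely inspired by the Reyes split and shade stages, not a proposed API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Patch { float size; };

// Stage 1 kernel: consumes one patch and emits a variable amount of work.
// Large patches are split in two and fed back into the stage's own input
// queue; small enough patches are forwarded to the next stage's queue.
inline void splitKernel(const Patch& in, float maxSize,
                        std::deque<Patch>& splitQueue,
                        std::deque<Patch>& shadeQueue) {
    if (in.size > maxSize) {
        splitQueue.push_back(Patch{in.size * 0.5f});
        splitQueue.push_back(Patch{in.size * 0.5f});
    } else {
        shadeQueue.push_back(in);
    }
}

// Stage 2 kernel: "shades" a patch. Here it only reports the patch size.
inline float shadeKernel(const Patch& in) { return in.size; }

// Minimal scheduler: drains the downstream queue first so that intermediate
// data stays small, which is the role on-chip queues play in hardware.
inline std::vector<float> runPipeline(const std::vector<Patch>& input,
                                      float maxSize) {
    assert(maxSize > 0.0f);
    std::deque<Patch> splitQueue(input.begin(), input.end());
    std::deque<Patch> shadeQueue;
    std::vector<float> frameBuffer;
    while (!splitQueue.empty() || !shadeQueue.empty()) {
        if (!shadeQueue.empty()) {
            frameBuffer.push_back(shadeKernel(shadeQueue.front()));
            shadeQueue.pop_front();
        } else {
            const Patch p = splitQueue.front();
            splitQueue.pop_front();
            splitKernel(p, maxSize, splitQueue, shadeQueue);
        }
    }
    return frameBuffer;
}
```

Neither kernel needs to know how much output the whole pass will produce, so no worst-case buffer has to be sized in advance.
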
"Pipelined Compute Shaders" (cont.)
- Wanted for next DirectX and OpenCL/OpenGL
  - As a standard, as soon as possible
  - My main request/wish!
  - Run on all: GPU, manycore and CPU
  - IHV-specific solutions can be good start for R&D
- Model is also a good fit for many of our CPU/SPU jobs
  - Parts of job graph can be seen as queues between stages
  - Easier to write kernels/jobs with streaming I/O
    - Instead of explicit fixed-buffers and "memory passes"
    - Or dynamic memory allocation

Language?
- Language for this model is a big question
  - But the concepts & infrastructure are what is important!
- Could be an extended HLSL or "data-parallel C++"
  - Data-oriented imperative language (i.e. not standard C++)
  - Think HLSL would probably be easier & the most explicit
  - Amount of code is small and written from scratch
- SIMT-style implicit vectorization is preferred over explicit vectorization
  - Easier to target multiple evolving architectures implicitly
  - Our CPU code is still stuck at SSE2

Language (cont.)
- Requirements:
  - Full rich debugging, ideally in Visual Studio
  - Asserts
  - Internal kernel profiling
  - Hot-swapping / edit-and-continue of kernels
- Opportunity for IHVs and platform providers to innovate here!
  - Try to aim for an eventual cross-vendor standard
  - Think of the co-development of Nvidia Cg and HLSL

Parallel Futures of a Game Engine

  • 1. Public version 10Parallel Futures of a Game EngineJohan AnderssonRendering Architect, DICE
  • 2. BackgroundDICEStockholm, Sweden~250 employeesPart of Electronic ArtsBattlefield & Mirror’s Edge game seriesFrostbiteProprietary game engine used at DICE & EADeveloped by DICE over the last 5 years2
  • 5. OutlineGame engine 101Current parallelismFuturesQ&A5
  • 7. Game development2 year development cycleNew IP often takes much longer, 3-5 yearsEngine is continuously in development & usedAAA teams of 70-90 people50% artists30% designers20% programmers10% audioBudgets $20-40 millionCross-platform development is market realityXbox 360 and PlayStation 3PC DX10 and DX11 (and sometimes Mac)Current consoles will stay with us for many more years7
  • 8. Game engine requirements (1/2)Stable real-time performanceFrame-driven updates, 30 fps Few threads, instead per-frame jobs/tasks for everything Predictable memory usageFixed budgets for systems & content, fail if overAvoid runtime allocationsLove unified memory! Cross-platformThe consoles determines our base tech level & focusPS3 is design target, most difficult and good potentialScale up for PC, dual core is min spec (slow!)8
  • 9. Game engine requirements (2/2)Full system profiling/debuggingEngine is a vertical solution, touches everywherePIX, xbtracedump, SN Tuner, ETW, GPUViewQuick iterationsEssential in order to be creativeFast building & fast loading, hot-swapping resourcesAffects both the tools and the gameMiddlewareUse when it make senses, cross-platform & optimizedParallelism have to go through our systems9
  • 11. Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU11
  • 12. Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU12
  • 13. Editor & PipelineEditor (”FrostEd 2”)WYSIWYG editor for contentC#, Windows onlyBasic threading / tasksPipelineOffline/background data-processing & conversionC++, some MC++, Windows onlyTypically IO-boundA few compute-heavy steps use CPU-jobsTexture compression uses CUDA, would prefer OpenCL or CSLighting pre-calculation using IncrediBuild over 100+ machinesCPU parallelism models are generally not a problem here13
  • 14. Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU14
  • 15. General ”game code” (1/2)This is the majority of our 1.5 million lines of C++Runs on Win32, Win64, Xbox 360 and PS3Similar to general application codeHuge amount of code & logic to maintain + continue to developLow compute density”Glue code”Scattered in memory (pointer chasing)Difficult to efficiently parallelizeOut-of-order execution is a big help, but consoles are in-order Key to be able to quickly iterate & changeThis is the actual game logic & glue that builds the gameC++ not ideal, but has the invested infrastructure15
  • 16. General ”game code” (2/2)PS3 is one of the main challengesStandard CPU parallelization doesn’t helpCELL only has 2 HW threads on the PPUSplit the code in 2: game code & system codeGame logic, policy and glue code only on CPU”If it runs well on the PS3 PPU, it runs well everywhere”Lower-level systems on PS3 SPUsMain goals going forward:Simplify & structure code baseReduce coupling with lower-level systemsIncrease in task parallelism for PCCELL processor16
  • 17. Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU17
  • 18. Job-based parallelismEssential to utilize the cores on our target platformsXbox 360: 6 HW threadsPlayStation 3: 2 HW threads + 6 powerful SPUsPC: 2-16 HW threads (Nehalem HT is great!) Divide up system work into Jobs (a.k.a. Tasks)15-200k C++ code each. 25k is commonCan depend on each other (if needed)Dependencies create job graphAll HW threads consume jobs ~200-300 / frame18
  • 19. What is a Job for us?An asynchronous function callFunction ptr + 4 uintptr_t parametersCross-platform scheduler: EA JobManagerOften uses work stealing2 types of Jobs in Frostbite:CPU job (good)General code moved into job instead of threadsSPU job (great!)Stateless pure functions, no side effectsData-oriented, explicit memory DMA to local storeDesigned to run on the PS3 SPUs = also very fast on in-order CPUCan hot-swap quick iterations 19
  • 24. EntityRenderCull job example
      Frustum culling of dynamic entities in sphere tree

        struct FB_ALIGN(16) EntityRenderCullJobData
        {
            enum
            {
                MaxSphereTreeCount = 2,
                MaxStaticCullTreeCount = 2
            };

            uint sphereTreeCount;
            const SphereNode* sphereTrees[MaxSphereTreeCount];
            u8 viewCount;
            u8 frustumCount;
            u8 viewIntersectFlags[32];
            Frustum frustums[32];

            // .... (cut out 2/3 of struct for display size)

            u32 maxOutEntityCount;

            // Output data, pre-allocated by caller
            u32 outEntityCount;
            EntityRenderCullInfo* outEntities;
        };

        void entityRenderCullJob(EntityRenderCullJobData* data);
        void validate(const EntityRenderCullJobData& data);

      - struct contains all input data needed
      - Max output data pre-allocated by caller
      - Single job function
          - Compile both as CPU & SPU job
      - Optional struct validation func
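The same pattern (one struct with all inputs, caller-allocated output with a maximum count, a job function and a validation function) might look like this in a much reduced form. Everything here is a hypothetical stand-in for illustration: `SphereCullJobData`, `sphereCullJob` and the plane convention are assumptions, not Frostbite code.

```cpp
#include <cassert>
#include <cstdint>

struct Plane  { float nx, ny, nz, d; };    // assumed convention: n.p + d >= 0 is inside
struct Sphere { float x, y, z, radius; };

struct SphereCullJobData
{
    // Input data
    const Sphere* spheres;
    std::uint32_t sphereCount;
    Plane planes[6];
    std::uint32_t planeCount;

    // Output data, pre-allocated by the caller
    std::uint32_t maxOutCount;
    std::uint32_t outCount;
    std::uint32_t* outIndices;
};

// Optional validation: catches malformed job data before the job runs.
void validate(const SphereCullJobData& data)
{
    assert(data.spheres != nullptr || data.sphereCount == 0);
    assert(data.planeCount <= 6);
    assert(data.outIndices != nullptr || data.maxOutCount == 0);
}

// The job itself: reads only from the struct, writes only to the output fields.
void sphereCullJob(SphereCullJobData* data)
{
    data->outCount = 0;
    for (std::uint32_t i = 0; i < data->sphereCount; ++i)
    {
        const Sphere& s = data->spheres[i];
        bool visible = true;
        for (std::uint32_t p = 0; p < data->planeCount && visible; ++p)
        {
            const Plane& pl = data->planes[p];
            const float dist = pl.nx * s.x + pl.ny * s.y + pl.nz * s.z + pl.d;
            visible = dist >= -s.radius; // sphere is at least partly on the inside
        }
        // Never write past the capacity the caller provided.
        if (visible && data->outCount < data->maxOutCount)
            data->outIndices[data->outCount++] = i;
    }
}
```

Because the job never allocates and touches nothing outside the struct, the same function could in principle be compiled for a CPU worker or wrapped in an SPU entry point that DMAs the struct in and out.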
  • 25. EntityRenderCull SPU setup

        // local store variables
        EntityRenderCullJobData g_jobData;
        float g_zBuffer[256*114];
        u16 g_terrainHeightData[64*64];

        int main(uintptr_t dataEa, uintptr_t, uintptr_t, uintptr_t)
        {
            dmaBlockGet("jobData", &g_jobData, dataEa, sizeof(g_jobData));
            validate(g_jobData);

            if (g_jobData.zBufferTestEnable)
            {
                dmaAsyncGet("zBuffer", g_zBuffer, g_jobData.zBuffer, g_jobData.zBufferResX*g_jobData.zBufferResY*4);
                g_jobData.zBuffer = g_zBuffer;

                if (g_jobData.zBufferShadowTestEnable && g_jobData.terrainHeightData)
                {
                    dmaAsyncGet("terrainHeight", g_terrainHeightData, g_jobData.terrainHeightData, g_jobData.terrainHeightDataSize);
                    g_jobData.terrainHeightData = g_terrainHeightData;
                }

                dmaWaitAll(); // block on both DMAs
            }

            // run the actual job, will internally do streaming DMAs to the output entity list
            entityRenderCullJob(&g_jobData);

            // put back the data because we changed outEntityCount
            dmaBlockPut(dataEa, &g_jobData, sizeof(g_jobData));
            return 0;
        }
  • 26. Frostbite CPU job graph
      - Build big job graphs:
          - Batch, batch, batch
          - Mix CPU- & SPU-jobs
          - Future: Mix in low-latency GPU-jobs
      - Job dependencies determine:
          - Execution order
          - i.e. the effective parallelism
      - Intermixed task- & data-parallelism
          - aka Braided Parallelism
          - aka Tasks and Kernels
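How job dependencies determine execution order and effective parallelism can be illustrated with a small sketch. `JobGraph` and `buildWaves` are hypothetical names for this example; a real scheduler would dispatch jobs as soon as their dependencies finish rather than in strict waves.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical job graph: dependsOn[i] lists the jobs that must finish before job i may start.
struct JobGraph
{
    std::vector<std::vector<std::size_t>> dependsOn;

    std::size_t addJob(std::vector<std::size_t> deps = {})
    {
        dependsOn.push_back(std::move(deps));
        return dependsOn.size() - 1;
    }
};

// Groups jobs into "waves": each job in a wave has all of its dependencies in earlier waves,
// so the jobs within one wave could run in parallel. Stops early if the graph has a cycle.
std::vector<std::vector<std::size_t>> buildWaves(const JobGraph& graph)
{
    const std::size_t n = graph.dependsOn.size();
    std::vector<bool> done(n, false);
    std::vector<std::vector<std::size_t>> waves;
    std::size_t finished = 0;

    while (finished < n)
    {
        std::vector<std::size_t> wave;
        for (std::size_t i = 0; i < n; ++i)
        {
            if (done[i])
                continue;
            bool ready = true;
            for (std::size_t dep : graph.dependsOn[i])
                ready = ready && done[dep];
            if (ready)
                wave.push_back(i);
        }
        if (wave.empty())
            break; // cycle: the remaining jobs can never become ready

        for (std::size_t i : wave)
            done[i] = true;
        finished += wave.size();
        waves.push_back(wave);
    }
    return waves;
}
```

The width of each wave is a rough measure of the parallelism available at that point, and each wave boundary behaves like a sync point, which is why fewer and wider waves are preferable.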
  • 35. Task-parallel algorithms & coordination
  • 36. Timing view
      Example: PC, 4 CPU cores, 2 GPUs in AFR (AMD Radeon 4870x2)
      - Real-time in-game overlay
          - See timing events & effective parallelism
          - On CPU, SPU & GPU – for all platforms
          - Use to reduce sync-points & optimize load balancing
      - GPU timing through DX event queries
      - Our main performance tool!
  • 37. Rendering jobs
      - Rendering systems are heavily divided up into CPU- & SPU-jobs
      - Jobs:
          - Terrain geometry [3]
          - Undergrowth generation [2]
          - Decal projection [4]
          - Particle simulation
          - Frustum culling
          - Occlusion culling
          - Occlusion rasterization
          - Command buffer generation [6]
          - PS3: Triangle culling [6]
      - Most will move to GPU
          - Eventually.. A few have already!
          - Latency wall, more power and GPU memory access
          - Mostly one-way data flow
  • 38. Occlusion culling job example
      - Problem: Buildings & env occlude large amounts of objects
      - Obscured objects still have to:
          - Update logic & animations
          - Generate command buffer
          - Processed on CPU & GPU
          = expensive & wasteful
      - Difficult to implement full culling:
          - Destructible buildings
          - Dynamic occludees
          - Difficult to precompute
      (Screenshot: from Battlefield: Bad Company, PS3)
  • 39. Solution: Software occlusion culling
      - Rasterize coarse zbuffer on SPU/CPU
          - 256x114 float
          - Low-poly occluder meshes
          - 100 m view distance
          - Max 10000 vertices/frame
          - Parallel vertex & raster SPU-jobs
          - Cost: a few milliseconds
      - Cull all objects against zbuffer
          - Screen-space bounding-box test
          - Before passed to all other systems
      - Big performance savings!
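The screen-space bounding-box test against a coarse z-buffer could look something like the sketch below. The types and the depth convention (smaller = closer, cleared to 1.0) are assumptions for illustration, and `fillRect` is only a stand-in for real occluder triangle rasterization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Coarse software depth buffer (the slide uses 256x114 floats).
// Assumed convention: smaller depth = closer to the camera, cleared to the far value 1.0.
struct CoarseZBuffer
{
    int width;
    int height;
    std::vector<float> depth;

    CoarseZBuffer(int w, int h)
        : width(w), height(h), depth(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), 1.0f) {}

    float& at(int x, int y) { return depth[static_cast<std::size_t>(y) * width + x]; }
    float at(int x, int y) const { return depth[static_cast<std::size_t>(y) * width + x]; }

    // Stand-in for occluder rasterization: fills a rectangle with a constant depth.
    void fillRect(int x0, int y0, int x1, int y1, float z)
    {
        for (int y = std::max(y0, 0); y <= std::min(y1, height - 1); ++y)
            for (int x = std::max(x0, 0); x <= std::min(x1, width - 1); ++x)
                at(x, y) = std::min(at(x, y), z);
    }
};

// Screen-space bounding box of an object plus the depth of its nearest point.
struct ScreenBox
{
    int x0, y0, x1, y1; // inclusive texel bounds
    float minZ;
};

// Conservative test: the object counts as occluded only if every covered texel
// holds an occluder that is closer than the object's nearest point.
bool isOccluded(const CoarseZBuffer& zb, const ScreenBox& box)
{
    const int x0 = std::max(box.x0, 0);
    const int y0 = std::max(box.y0, 0);
    const int x1 = std::min(box.x1, zb.width - 1);
    const int y1 = std::min(box.y1, zb.height - 1);
    if (x0 > x1 || y0 > y1)
        return false; // entirely off-screen: left to frustum culling

    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (box.minZ <= zb.at(x, y))
                return false; // object may be visible through this texel
    return true;
}
```

Testing the nearest depth of the box against every covered texel errs on the side of keeping objects visible, which is the safe direction for a culling step that runs before all other systems.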
  • 40. GPU occlusion culling
      - Ideally want to use the GPU, but current APIs are limited:
          - Occlusion queries introduce overhead & latency
          - Conditional rendering only helps GPU
          - Compute Shader impl. possible, but same latency wall
      - Future 1: Low-latency GPU execution context
          - Rasterization and testing done on GPU where it belongs
          - Lockstep with CPU, need to read back within a few ms
          - Possible on Larrabee, want standard on PC
          - Potential WDDM issue
      - Future 2: Move entire cull & rendering to ”GPU”
          - World, cull, systems, dispatch. End goal
  • 41. Levels of code in Frostbite
      - Editor (C#)
      - Pipeline (C++)
      - Game code (C++)
      - System CPU-jobs (C++)
      - System SPU-jobs (C++/asm)
      - Generated shaders (HLSL)
      - Compute kernels (HLSL)
      (Diagram labels: Offline, Runtime, CPU, GPU)
  • 42. Shader types
      - Generated shaders [1]
          - Graph-based surface shaders
          - Treated as content, not code
          - Artist created
          - Generates HLSL code
          - Used by all meshes and 3d surfaces
      - Graphics / Compute kernels
          - Hand-coded & optimized HLSL
          - Statically linked in with C++
          - Pixel- & compute-shaders
          - Lighting, post-processing & special effects
      (Image: Graph-based surface shader in FrostEd 2)
  • 45. Challenges
      3 major challenges/goals going forward:
      1. How do we make it easier to develop, maintain & parallelize general game code?
      2. What do we need to continue to innovate & scale up real-time computational graphics?
      3. How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors?
      Most likely the same solution(s)!
  • 46. Challenge 1
      “How do we make it easier to develop, maintain & parallelize general game code?”
      - Shared State Concurrency is a killer
          - Not a big believer in Software Transactional Memory either
          - Because of performance and too ”optimistic” flow
      - A more strict & adapted C++ model
          - Support for true immutable & r/w-only memory access
          - Per-thread/task memory access opt-in
          - To reduce the possibility for side effects in parallel code
          - As much compile-time validation as possible
      - Micro-threads / coroutines as first class citizens
      - More? (we are used to not having much, for us, practical innovation here)
      - Other languages?
  • 47. Challenge 1 - Task parallelism
      - Multiple task libraries
          - EA JobManager
              - Current solution, designed primarily within SPU-job limitations
          - MS ConcRT, Apple GCD, Intel TBB
              - All have some good parts!
              - None works on all of our platforms, a key requirement
          - OpenMP
              - We don’t use it. Tiny band aid, doesn’t satisfy our control needs
      - Need C++ enhancements to simplify usage
          - C++0x lambdas / GCD blocks
          - Glacial C++ development & deployment
          - Want on all platforms, so lost on this console generation
      - Moving away from semi-static job graphs
          - Instead more dynamic on-demand job graphs
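To illustrate why lambdas simplify job submission compared with a hand-written data struct and function per job, here is a rough sketch. `ToyTaskList` and `scaleInChunks` are hypothetical names and the list runs serially; it is not the API of ConcRT, GCD, TBB or EA JobManager.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical lambda-based task list; runs tasks serially for illustration only.
class ToyTaskList
{
public:
    void add(std::function<void()> task) { m_tasks.push_back(std::move(task)); }

    std::size_t size() const { return m_tasks.size(); }

    // A real task library would distribute these over worker threads.
    void runAll()
    {
        for (auto& task : m_tasks)
            task();
        m_tasks.clear();
    }

private:
    std::vector<std::function<void()>> m_tasks;
};

// Splits the array into chunks and queues one task per chunk.
// The lambda captures its inputs directly, so no separate job-data struct is needed.
// The caller must keep 'values' alive until the tasks have run.
void scaleInChunks(ToyTaskList& tasks, std::vector<float>& values, float scale, std::size_t chunkSize)
{
    assert(chunkSize > 0);
    for (std::size_t begin = 0; begin < values.size(); begin += chunkSize)
    {
        const std::size_t end = std::min(begin + chunkSize, values.size());
        tasks.add([&values, scale, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                values[i] *= scale;
        });
    }
}
```

The capture list plays the role of the explicit parameter struct, and because each chunk writes a disjoint range the tasks have no dependencies on each other.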
  • 48. Challenge 2 - Definition
      Goal: ”Real-time interactive graphics & simulation at a Pixar level of quality”
      - Needed visual features:
          - Global indirect lighting & reflections
          - Complete anti-aliasing (frame buffers & shader)
          - Sub-pixel geometry
          - OIT (order-independent transparency)
          - Huge improvements in character animation
      - These require massively more compute, BW and an improved model!
      (animation can’t be solved with just more/better compute, so pretend it doesn’t exist for now)
  • 49. Challenge 2 - Problems
      Problems & limitations with current model:
      - MSAA sample storage doesn’t scale to 16x+
          - Esp. with HDR & deferred shading
      - GPU is handicapped by being spoon-fed by CPU
      - Irregular workloads are difficult / inefficient
      - Current HLSL is a limited language & model
  • 50. Challenge 2 - Solutions
      - Sounds like a job for a high-throughput oriented massive data-parallel processor
          - With a highly flexible programming model
          - The CPU, as we know it, and its APIs are only in the way
      - Pure software solution not practical as next step after DX11 PC 1)
          - Multi-vendor & multi-architecture marketplace
          - Skeptical we will reach a multi-vendor standard ISA within 3+ years
          - Future consoles on the other hand, this would be preferred
          - And would love to be proven wrong by the IHVs!
      - Want a rich high-level compute model as next step
          - Efficiently target both SW- & HW-pipeline architectures
          - Even if we had 100% SW solution, to simplify development
      1) Depending on the time frame
  • 51. ”Pipelined Compute Shaders”
      - Queues as streaming I/O between compute kernels
          - Simple & expressive model supporting irregular workloads
          - Keeps data on chip, supports variable sized caches & cores
          - Can target multiple types of HW & architectures
      - Hybrid graphics/compute user-defined pipelines
          - Language/API defining fixed stages inputs & outputs
          - Pipelines can feed other pipelines (similar to DrawIndirect)
      (Diagram: Reyes-style Rendering with Ray Tracing; stage labels: Shade, Split, Raster, Sub-D Prims, Tess, Frame Buffer, Trace)
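The idea of queues as streaming I/O between kernels can be sketched on the CPU as follows. This is only a serial analogy under stated assumptions: `Patch`, `splitKernel` and `shadeKernel` are hypothetical, and `std::queue` stands in for the on-chip queues the slide asks for.

```cpp
#include <cassert>
#include <queue>

struct Patch { float size; };

// Split kernel: a patch larger than maxSize is halved and fed back into the stage's own
// input queue; a small enough patch is streamed to the next stage. The amount of output
// per input varies, which is the "irregular workload" case.
void splitKernel(std::queue<Patch>& in, std::queue<Patch>& out, float maxSize)
{
    while (!in.empty())
    {
        const Patch p = in.front();
        in.pop();
        if (p.size > maxSize)
        {
            in.push(Patch{p.size * 0.5f});
            in.push(Patch{p.size * 0.5f});
        }
        else
        {
            out.push(p);
        }
    }
}

struct ShadeResult
{
    int patchCount = 0;
    float totalSize = 0.0f;
};

// Downstream kernel: drains its input queue; here it only counts and sums the patches.
ShadeResult shadeKernel(std::queue<Patch>& in)
{
    ShadeResult result;
    while (!in.empty())
    {
        result.patchCount += 1;
        result.totalSize += in.front().size;
        in.pop();
    }
    return result;
}
```

Neither kernel needs to know how many items the other produces or consumes, so no fixed-size intermediate buffer or extra memory pass is required between the stages.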
  • 52. ”Pipelined Compute Shaders”
      - Wanted for next DirectX and OpenCL/OpenGL
          - As a standard, as soon as possible
          - My main request/wish!
          - Run on all: GPU, manycore and CPU
          - IHV-specific solutions can be good start for R&D
      - Model is also a good fit for many of our CPU/SPU jobs
          - Parts of job graph can be seen as queues between stages
          - Easier to write kernels/jobs with streaming I/O
              - Instead of explicit fixed-buffers and ”memory passes”
              - Or dynamic memory allocation
  • 53. Language?
      - Language for this model is a big question
          - But the concepts & infrastructure are what is important!
      - Could be an extended HLSL or ”data-parallel C++”
          - Data-oriented imperative language (i.e. not standard C++)
          - Think HLSL would probably be easier & the most explicit
          - Amount of code is small and written from scratch
      - SIMT-style implicit vectorization is preferred over explicit vectorization
          - Easier to target multiple evolving architectures implicitly
          - Our CPU code is still stuck at SSE2
  • 54. Language (cont.)
      - Requirements:
          - Full rich debugging, ideally in Visual Studio
          - Asserts
          - Internal kernel profiling
          - Hot-swapping / edit-and-continue of kernels
      - Opportunity for IHVs and platform providers to innovate here!
          - Try to aim for an eventual cross-vendor standard
          - Think of the co-development of Nvidia Cg and HLSL
  • 55. Unified development environment
      - Want to debug/profile task- & data-parallel code seamlessly
          - On all processors! CPU, GPU & manycore
          - From any vendor = requires standard APIs or ISAs
      - Visual Studio 2010 looks promising for task-parallel PC code
          - Usable by our offline tools & hopefully PC runtime
          - Want to integrate our own JobManager
      - Nvidia Nexus looks great for data-parallel GPU code
          - Eventual must have for all HW, how?
          - Huge step forward!
      (Image: VS2010 Parallel Tasks)
  • 56. Future hardware (1/2)
      - 2015 = 50 TFLOPS, we would spend it on:
          - 80% graphics
          - 15% simulation
          - 4% misc
          - 1% game (wouldn’t use all 500 GFLOPS for game logic & glue!)
      - OOE CPUs more efficient for the majority of our game code
          - But for the vast majority of our FLOPS these are fully irrelevant
          - Can evolve to a small dot on a sea of DP cores
          - Or run on scalar ISA wasting vector instructions on a few cores
      - In other words: no need for separate CPU and GPU!
  • 57. Future hardware (2/2)
      - Single main memory & address space
          - Critical to share resources between graphics, simulation and game in immersive dynamic worlds
      - Configurable kernel local stores / cache
          - Similar to Nvidia Fermi & Intel Larrabee
          - Local stores = reliability & good for regular loads
          - Caches = essential for irregular data structures
      - Cache coherency?
          - Not always important for kernels
          - But essential for general code, can partition?
  • 58. Conclusions
      - Developer productivity can’t be limited by model
          - It should enhance productivity & perf on all levels
          - Tools & language constructs play a critical role
      - Lots of opportunity for innovation and standardization!
      - We are willing to go to great lengths to utilize any HW
          - If that platform is part of our core business target and can make a difference
      - We for one welcome our parallel future!
  • 59. Thanks to
      - DICE, EA and the Frostbite team
      - The graphics/gamedev community on Twitter
      - Steve McCalla, Mike Burrows
      - Chas Boyd
      - Nicolas Thibieroz, Mark Leather
      - Dan Wexler, Yury Uralsky
      - Kayvon Fatahalian
  • 60. References
      Previous Frostbite-related talks:
      [1] Johan Andersson. ”Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html
      [2] Natasha Tatarchuk & Johan Andersson. ”Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf
      [3] Johan Andersson. ”Terrain Rendering in Frostbite using Procedural Shader Splatting”. Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf
      [4] Daniel Johansson & Johan Andersson. ”Shadows & Decals – D3D10 techniques from Frostbite”. GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html
      [5] Bill Bilodeau & Johan Andersson. ”Your Game Needs Direct3D 11, So Get Started Now!”. GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html
      [6] Johan Andersson. ”Parallel Graphics in Frostbite”. Siggraph 2009, Beyond Programmable Shading course. http://repi.blogspot.com/2009/08/siggraph09-parallel-graphics-in.html
  • 61. Questions?
      Email: johan.andersson@dice.se
      Blog: http://repi.se
      Twitter: @repi
      Contact me. I do not bite, much..

Editor's Notes

  • #5: Xbox 360 version + offline super-high AA and resolution
  • #9: Isolate systems/areas. We do not use exceptions, nor much error handling
  • #10: Havok & Kynapse
  • #16: ”Glue code”
  • #19: Better code structure! Gustafson’s Law. Fixed 33 ms/f
  • #20: Data-layout
  • #31: We have a lot of jobs across the engine, too many to go through, so I chose to focus more on some of the rendering. Our intention is to move all of these to the GPU
  • #33: Conservative
  • #34: Want to rasterize on GPU, not CPU. CPU ⇔ GPU job dependencies
  • #40: Simple keywords as override & sealed have been a help in other areas
  • #44: Would require multi-vendor open ISA standard
  • #48: Bindings to C very interesting!
  • #52: OpenCL