World of Tanks* 1.0+:
Enriching gamers experience with multicore
optimized physics and graphics
31.01.2018
Philipp Gerasimov, Mike Voss, Intel
Bronislav Sviglo, Wargaming.net
GDC 2018
2
• Philipp Gerasimov, Intel
Senior Game / Graphics Application Engineer, DRD / SSG,
Munich.
World of Tanks
Introduction
• Mike Voss, Intel
Principal Engineer, Threading Runtimes Team,
DPD / SSG, Austin, Texas.
• Bronislav Sviglo, Wargaming.net
Rendering Team Lead, World of Tanks Team
Minsk.
Agenda
3
• Making your game ready for Modern CPUs
• World of Tanks 1.0
• Going Beyond 1.0
• Threading Building Blocks (TBB)
• Destructions
• Tank Treads
• Concurrent Rendering
• Summary and Q&A
4
Making your game ready for
modern CPUs
5
• Multi Core
• All modern platforms are multi-core: mobile phones, consoles, desktop and mobile
PCs.
• Adding more cores is the most efficient way to make faster CPUs, so it is critical
for game developers to use them.
• Vector Instructions
• Another powerful way to improve code performance.
• SSE/AVX/AVX2 supported across different CPU vendors on PC and consoles.
• Most middleware developers have support for it.
Parallel Computing
Two vectors of parallelism
6
Parallel Computing
Wide HW user base
Year   CPU        Cores  Threads  Base Frequency  Max Frequency  TDP / SKU         Instruction Set Extensions
2018   i7-8700K   6      12       3.70 GHz        4.70 GHz       95W Desktop SKU   Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2
2018   i7-8650U   4      8        1.90 GHz        4.20 GHz       15W Mobile SKU    Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2
2013   i7-4790K   4      8        3.60 GHz        4.00 GHz       84W Desktop SKU   Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2
2013   i7-4650U   2      4        1.70 GHz        3.30 GHz       15W Mobile SKU    Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2
• Enhance the enthusiast's experience
• Maintain strong backward compatibility
7
• Optimizing for Modern CPUs does not mean just fps
improvements!
• Performance Focused Features:
• Critical for low TDP platforms and consoles (Low Frequency Cores)
• Not being limited by main / client thread performance
• E-Sport class performance for High-End PCs (140+ fps)
• Enhanced Gaming Experience Features:
• Better visual effects (Better Particles, Smarter occlusion culling)
• Better physics (Destructions, collisions, water / wind simulation )
• Better AI (Smarter UI, more characters)
• Better sound (High quality 3D sound)
Parallel Computing
Enriching User Experience
8
World of Tanks 1.0
9
Intro video
10
• Supports a huge range of hardware, 2004 – 2018+
• Lots of CPUs with 2 physical cores (~60%)
• Has two pipelines: simple & advanced
• Still has two graphics APIs: D3D9 on WinXP & D3D11
• Had several graphics evolutions in the past: 0.8.x, 0.9.x
World of Tanks
Graphics overview
11
Old vs. New
12
World of Tanks 1.0+
13
• Quick overview of destructions
• Improved simulation of tank treads
• Concurrent rendering
World of Tanks 1.0+
Keep enriching gamers experience
14
• Dedicated threads for each major engine sub-system
(audio, render, physics, simulation, etc.)
• Task based (pool of threads, work subdivided into tasks)
• Mix of those approaches
World of Tanks 1.0+
How to make the engine concurrent?
15
• Easy to use
• Two types of parallelism: functional/task and data
• Feature rich and robust
• Good support
• Threading Building Blocks
World of Tanks 1.0+
How to select a good job system?
16
Threading Building Blocks (TBB)
17
• What
• Parallel algorithms and data structures
• Threads and synchronization primitives
• Scalable memory allocation and task scheduling
• General Benefits
• Is a library-only solution that does not depend on special compiler support
• Is both a commercial product and an open-source project
• Supports C++, Windows*, Linux*, OS X*, Android* and other OSes
• Commercial support for Intel® Atom™, Core™, Xeon® processors and for
Intel® Xeon Phi™ coprocessors
Threading Building Blocks (TBB)
A widely used C++ template library for parallel programming
(since 2006)
18
• Expresses parallelism at a high-level to get efficient
performance on different platforms
• TBB-based parallelism scales as more cores become available
• Gets great performance on newest multicore platforms
• But works well on older machines too
• Express the parallelism and let TBB map it to the platform
• Supports multilevel and nested parallelism well
• Functional and data parallelism implemented using the same TBB tasks
• Underlying tasks are executed by same set of TBB worker threads
• Leads to composable parallelism – TBB schedules all of it
• A single thread pool avoids oversubscription problems
Threading Building Blocks (TBB)
Why it is useful for World of Tanks
19
Threading Building Blocks (TBB)
opentbb.org
[Diagram: TBB components, grouped into "Parallel Execution Interfaces" and "Interfaces Independent of Execution Model"]
20
buffer
get_next_image
preprocess
detect_with_A
detect_with_B
make_decision
Can express pipelining, task parallelism and data parallelism
Threading Building Blocks (TBB)
Example Feature Detection Algorithm
tbb::parallel_for( … );
tbb::flow_graph
21
To Learn More about TBB:
See Intel’s The Parallel Universe Magazine
https://software.intel.com/en-us/intel-parallel-universe-magazine
http://threadingbuildingblocks.org http://software.intel.com/intel-tbb
22
Destructions
23
Destructions
Overview
• Powered by Havok© destructions module
• Collision and physics in one system
• Out of the box multithreading
24
Collision scene Destruction scene
Destructions
Destruction & collision scenes
Destructions
Architecture
25
[Diagram: Simulation; Main thread (collision queries); IO thread (collision queries); Internal Havok threads (~number of processor cores)]
26
ToDo:
Havok destruction video
27
Destructions
Summary & future work
• Summary
• High quality simulation of destruction process
• Multithreaded execution significantly increases performance
• Future work
• Use TBB job system to execute Havok tasks
28
Tank Treads
29
• Skinned mesh
• Visually static
• The tread moves by scrolling its texture
• Spline tread
• The general shape of the tread is represented by a spline
• Each segment is rendered as a separate model
• Segments positions are determined by the spline
• The spline shape is animated by moving its control points
Tank Treads
Previous implementations
30
• Spline treads are visually superior to skinned meshes, but
still have flaws:
• No proper collisions with the environment
• Curve shapes are too smooth
• A very complex tuning process
Tank Treads
Things to improve
31
• Spring chain simulation
• Procedural animation
• Collisions with the environment
Tank Treads
Designing the new treads
32
• Spring chain controls tread shape
• Tread is divided into 4 parts: front, top, back and bottom
• Ray cast the area underneath the tank and form a height field
• Collide each spring joint with it
Tank Treads
General solution & collisions
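The idea on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the game's solver: it works in 2D, replaces the springs with position-based distance constraints (a common stand-in for stiff springs), and treats the height field as a row of samples at unit spacing that would come from the ray casts. The names Joint, sample_height and step_chain are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Joint { float x, y; };

// Linearly interpolated height at x; heights[i] is the ground height at x == i.
float sample_height(const std::vector<float>& heights, float x) {
    if (heights.empty()) return 0.0f;
    float clamped = std::min(std::max(x, 0.0f), static_cast<float>(heights.size() - 1));
    std::size_t i = static_cast<std::size_t>(clamped);
    std::size_t j = std::min(i + 1, heights.size() - 1);
    float t = clamped - static_cast<float>(i);
    return heights[i] * (1.0f - t) + heights[j] * t;
}

// One simulation step: gravity, then a few rounds of
// "keep neighbours at rest_length" followed by "stay above the height field".
// iterations is assumed to be at least 1.
void step_chain(std::vector<Joint>& joints, const std::vector<float>& heights,
                float rest_length, float gravity_step, int iterations) {
    for (Joint& j : joints) j.y -= gravity_step;

    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i + 1 < joints.size(); ++i) {
            Joint& a = joints[i];
            Joint& b = joints[i + 1];
            float dx = b.x - a.x, dy = b.y - a.y;
            float len = std::sqrt(dx * dx + dy * dy);
            if (len < 1e-6f) continue;                  // avoid division by zero
            float diff = 0.5f * (len - rest_length) / len;
            a.x += dx * diff;  a.y += dy * diff;        // move both ends halfway
            b.x -= dx * diff;  b.y -= dy * diff;
        }
        // Collide each joint with the height field.
        for (Joint& j : joints) j.y = std::max(j.y, sample_height(heights, j.x));
    }
}
```

Running the collision pass last in every iteration guarantees that no joint ends a step below the ground, at the cost of slightly stretched segments near sharp obstacles.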
33
Tank Treads
Performance
* RAD Game Tools Telemetry
34
ToDo:
New tank treads
35
Tank Treads
Summary & future work
• Summary
• High quality simulation
• Multicore support
• Future work
• Treads simulation for exotic tank types
• Single threaded optimizations
36
Concurrent Rendering
37
• Engine evolution
• Abstract Rendering Interface (ARI)
• Architecture
• Usage of task and data parallelism
Concurrent Rendering
Agenda
38
• Initial release based on BigWorld engine (2010)
Concurrent Rendering
Engine Evolution
[Diagram: single thread running Tick, Render environment, Render tanks, Render effects, Present (wait) on top of the Direct3D 9 API]
39
• World of Tanks came out as a Direct3D 9 game
• Powered by BigWorld Engine
• Didn’t have any layer between engine renderers and the Direct3D API
• Highly tied to the Direct3D Effect Framework
• Never meant to run efficiently on multi-core
• No job system
• Frame tick and render ran synchronously on the main thread
• GPU workload depended on the subsystems’ render order
Concurrent Rendering
Engine Evolution
40
• Patch 0.9.15 and Core 3.0 Engine (Aug 2016)
Concurrent Rendering
Engine Evolution
[Diagram: Main thread (Tick, Render environment, Render tanks, Render effects, Resolve queries (wait)) -> ARI Command list -> Render thread -> Direct3D 9/11 API]
41
• Patch 0.9.15 and Core 3.0 Engine (Aug 2016)
• Abstract Rendering Interface (ARI)
• Unified way to handle Direct3D 9 and Direct3D 11 hardware
• Performance gain up to 15-30%
• Renderers still working in main thread
• Separate thread for ARI-Direct3D per-frame interactions
• WGFX intermediate compiler
• Independence from the Direct3D Effect Framework
• Faster effect support
Concurrent Rendering
Engine Evolution
42
• Command list – explicitly describes what should be
done by rendering thread
• Device interface – our “driver” between ARI frontend and
graphics API
• Resource – any resource like buffer, texture, query,
graphics pipeline state, etc.
Concurrent Rendering
Abstract Rendering Interface (ARI)
stores
43
• Commands
• Simple structures storing function arguments for the operation.
• Each command stores all of the state it relies upon.
• Concurrency support
• Command list writing is done in parallel
• Submitted commands don’t have interdependencies
• Examples
• Draw (instancing, indexed, etc)
• Clear (RTV, DSV, UAV)
• Query (begin/end)
Concurrent Rendering
ARI: Command List
CommandList
Stores commands as
plain data structs
Command
Draw
Dispatch
CopyResource
…
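A command list that "stores commands as plain data structs" might look like the sketch below. The struct layouts, field names and the execute function are assumptions made for illustration and are not the actual ARI types; std::variant (C++17) stands in for whatever tagging scheme the engine uses.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical commands: each one carries all of the state it relies upon.
struct DrawCommand  { std::uint32_t pipeline_state; std::uint32_t vertex_buffer; std::uint32_t vertex_count; };
struct ClearCommand { std::uint32_t render_target; float color[4]; };
struct QueryCommand { std::uint32_t query; bool begin; };

using Command = std::variant<DrawCommand, ClearCommand, QueryCommand>;

// Recording only appends data; no graphics API is touched here, so each
// subsystem can fill its own list on any thread.
class CommandList {
public:
    template <typename T>
    void record(const T& cmd) { commands_.push_back(cmd); }
    const std::vector<Command>& commands() const { return commands_; }
private:
    std::vector<Command> commands_;
};

struct ExecutionStats {
    int draws = 0;
    int clears = 0;
    int queries = 0;
    std::uint64_t vertices = 0;
};

// Stand-in for the render thread: walks the list in order. A real backend
// would translate each command into Direct3D calls instead of counting.
ExecutionStats execute(const CommandList& list) {
    ExecutionStats stats;
    for (const Command& c : list.commands()) {
        if (const DrawCommand* d = std::get_if<DrawCommand>(&c)) {
            ++stats.draws;
            stats.vertices += d->vertex_count;
        } else if (std::holds_alternative<ClearCommand>(c)) {
            ++stats.clears;
        } else {
            ++stats.queries;
        }
    }
    return stats;
}
```

Because a command never refers to state set by an earlier command, lists recorded in parallel can be executed in any agreed order without fix-ups.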
44
• Free-threaded: e.g. resource creation/removal, adapter
state
• Single-threaded: e.g. command list submit and compile,
read backs, fencing
• Creation thread only: e.g. special operations such as
device reset.
Concurrent Rendering
ARI: Device Interface
Device::Interface
• Creates Resources
• Present, Adapter Info, Etc
D3D9
D3D11
D3D12
• Frame render graph based on TBB flow graph
• Separate command lists for each rendering subsystem
• TBB primitives for inner subsystems’ parallelism
• Separate thread for per-frame graphics API calls
45
Concurrent Rendering
Approach overview
[Diagram: rendering work distributed across Core 0, Core 1, Core 2 and Core 3]
46
• High-level frame render graph
• Nodes
• Big chunks of work in one of the renderer subsystems
• Intermediate render context flushes
• Edges
• Dependencies between renderer subsystems
• Inner subsystems dependencies
• Indication that context is ready for flush
Concurrent Rendering
Frame render graph
Nodes
• Occlusion Resolve
• Dynamic models
• Static Models
• Transparent models
• Tanks
• Vegetation
• Terrain
• Shadows
• Lighting
• Atmosphere
• Water
• Post-Processing
• GUI
47
• High-level frame render graph
Concurrent Rendering
Frame render graph
[Diagram: frame render graph. Nodes: VT prepare, Shadows, Vegetation, Atmosphere, Particles, Visibility resolve, VFX prepare, Decal culling, Particle shadows, Environment models, Tanks, Intermediate submit and sync, Water culling, Terrain, Lighting, Intermediate submit, Main submit, Post processing, Transparent models, VT tiles, UI, Water draw, Issue occlusion queries]
• Submitting GPU workload
• Main per-frame contexts flush
• Every path comes here
• Uploads all gathered contexts to GPU submission thread
• Intermediate flush
• Synchronization point for tasks that are order-dependent on the GPU side
• Prevents starvation of the GPU submission thread
48
Concurrent rendering
Frame render graph
[Diagram: frame timeline with the labels Tick, Render environment, Render water, Render tanks, Render effects, Waits for sync, Render thread submits, Render thread waits, GPU draws, GPU waits]
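The hand-off between recording work and the GPU submission thread can be sketched with a plain producer/consumer queue. Everything here is an assumption for illustration: the SubmissionThread class, its flush and stop methods, and the use of strings in place of recorded command lists. An intermediate flush is simply an early call to flush, which gives the submission thread something to do before the main flush arrives.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Batch = std::vector<std::string>;  // stand-in for gathered command lists

class SubmissionThread {
public:
    SubmissionThread() : worker_([this] { run(); }) {}
    ~SubmissionThread() { stop(); }

    // Called by the frame graph at an intermediate or main flush point.
    void flush(Batch batch) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(std::move(batch));
        }
        ready_.notify_one();
    }

    // Drains the queue, joins the thread and returns commands in submission order.
    std::vector<std::string> stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        if (worker_.joinable()) worker_.join();
        return submitted_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return done_ || !pending_.empty(); });
            if (pending_.empty()) return;  // done_ is set and nothing is left
            Batch batch = std::move(pending_.front());
            pending_.pop();
            lock.unlock();
            // A real engine would issue graphics API calls for the batch here.
            submitted_.insert(submitted_.end(), batch.begin(), batch.end());
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Batch> pending_;
    bool done_ = false;
    std::vector<std::string> submitted_;  // touched only by the worker until join
    std::thread worker_;                  // declared last: starts after the other members exist
};
```

Batches are consumed in the order they were flushed, so anything that is order-dependent on the GPU side has to be flushed in that order.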
49
• Render frame ~17ms
• Parallel execution off
Concurrent rendering
Frame render graph
50
• Render frame ~14ms
• Parallel execution on
• Resolve visibility
• Static models
Concurrent rendering
Frame render graph
51
• Render frame ~12ms
• Parallel execution on
• Resolve visibility
• Static models
• Tanks
• Lighting
• Water
• Vegetation
• Post-processing
Concurrent rendering
Frame render graph
52
• Render frame ~8ms
• Speedup ~2x
• Parallel execution on
• Resolve visibility
• Static models
• Tanks
• Lighting
• Water
• Vegetation
• Post-processing
• Shadows
• ...
Concurrent rendering
Frame render graph
53
• Functional parallelism
• Pros
• Easy to implement
• Easy to read and maintain
• Easy to reason about
Concurrent rendering
Frame render graph
54
• Functional parallelism
• Pros
• Easy to implement
• Easy to read and maintain
• Easy to reason about
• Cons
• Too high level
• Some paths can’t be shortened
• Critical execution path
Concurrent rendering
Frame render graph
56
• Functional parallelism
• Pros
• Easy to implement
• Easy to read and maintain
• Easy to reason about
• Cons
• Too high level
• Some paths can’t be shortened
• Critical execution path
• Data parallelism to rescue!
Concurrent rendering
Frame render graph
57
Concurrent rendering
Frame render graph
• Render frame ~12ms
• Parallel execution off
58
Concurrent rendering
Frame render graph
• Render frame ~4ms
• Parallel execution on
• Speedup ~3x
60
Concurrent rendering
Frame render graph
• Render frame ~4ms
• Parallel execution on
• Speedup ~3x
• Data parallelism on
61
• Summary
• Significant speedup in commands gathering
• Code is still simple and easy to modify
• Room for more demanding graphics
• Future work
• Further decomposition
• Parallel algorithms
• Better submission pattern
• Consume spare performance
Concurrent rendering
Summary & future work
62
• Multi-core CPU is the future of gaming
• World of Tanks is now finding ways to use it
effectively
World of Tanks 1.0+
Summary
Acknowledgments
63
Big thank you to:
• Render, R&D artist, Tools, Engine team
• … and entire WoT dev team
64
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH
PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS,
INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER
INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life-saving, life-sustaining, critical control or safety systems, or in nuclear facility applications.
Intel products may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are
available on request.
Intel may make changes to dates, specifications, product descriptions, and plans referenced in this document at any time, without notice.
This document may contain information on products in the design phase of development. The information herein is subject to change without notice. Do not finalize a design with
this information.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have
no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
Intel Corporation or its subsidiaries in the United States and other countries may have patents or pending patent applications, trademarks, copyrights, or other intellectual property
rights that relate to the presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or
otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.
Wireless connectivity and some features may require you to purchase additional software, services or external hardware.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those
tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the
performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel
Performance Benchmark Limitations
Intel, the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Other names and brands may be claimed as the property of others.
Copyright © 2014 Intel Corporation. All rights reserved.
65
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY
THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY
OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY
RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark,
are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should
consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with
other products.
Copyright© 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Atom, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other
countries.
66
Optimization Notice
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel
microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including
some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and
specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel®
compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer
optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel
microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not
unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental
Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you
evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or
library; please let us know if you find we do not.
67
Bonus slides
68
• Our physical simulation should consider the following
aspects
• Each tread consists of 60 to 250 segments
• Hundreds of different tanks and dozens of suspensions to support
• Strict requirements for the tread’s stability
• The server doesn’t consider the tread when it simulates the tank’s
movement
• The tread’s setup process should be easy for the artist
Tank Treads
Challenges
69
• Also the tread is able to sag if
there is an empty space beneath it
• For an average PC we always snap the
bottom part of the tread to wheels
• For High-End PC
• Calculate the bottom snapping it to wheels
• Simulate the bottom without snapping
• Blend each simulated joint’s final position
between its snapped and unsnapped
positions, based on how far the joint is
from an obstacle
Tank Treads
Sagging
70
• Parallel execution guidelines
• Measure
• Start from parallel algorithms with a coarse grain
• Measure!
• Make global decisions in advance of heavy computations
• Reconsider your data structures
• Measure!!
• Fine-tune algorithm grain sizes
• Bring in vectorization as a last resort
Concurrent rendering
Data parallelism
71
• Data layout guidelines
• Make processed objects self-sufficient as much as is
reasonable
• Group up objects with similar data together and process in
blocks
• Prefer structures of arrays
• Know your data critical paths
• Don't read from and write to the same buffer
Concurrent rendering
Data parallelism
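Two of the layout guidelines, "prefer structures of arrays" and keeping reads and writes in separate buffers, can be shown together in a short sketch. The ParticlesSoA type and the integrate function are made up for illustration and use only the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure of arrays: each field is stored contiguously, so a loop that
// needs only x and vx does not drag y and vy through the cache.
struct ParticlesSoA {
    std::vector<float> x, y, vx, vy;
};

// Reads only from `in` and writes only to `out`. Because the two buffers are
// distinct, any index range can be processed by any thread without ordering issues.
void integrate(const ParticlesSoA& in, ParticlesSoA& out, float dt) {
    const std::size_t n = in.x.size();
    out.x.resize(n);
    out.y.resize(n);
    out.vx.resize(n);
    out.vy.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.x[i]  = in.x[i] + in.vx[i] * dt;
        out.y[i]  = in.y[i] + in.vy[i] * dt;
        out.vx[i] = in.vx[i];
        out.vy[i] = in.vy[i];
    }
}
```

The loop body is independent per index, which is what makes it a candidate for a parallel algorithm first and for vectorization later.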

More Related Content

PPTX
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
PPTX
Parallelizing Conqueror's Blade
PPTX
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...
PPTX
Forts and Fights Scaling Performance on Unreal Engine*
PDF
Scalability for All: Unreal Engine* 4 with Intel
PPTX
Getting Space Pirate Trainer* to Perform on Intel® Graphics
PDF
Create a Scalable and Destructible World in HITMAN 2*
PPTX
Masked Occlusion Culling
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Parallelizing Conqueror's Blade
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...
Forts and Fights Scaling Performance on Unreal Engine*
Scalability for All: Unreal Engine* 4 with Intel
Getting Space Pirate Trainer* to Perform on Intel® Graphics
Create a Scalable and Destructible World in HITMAN 2*
Masked Occlusion Culling

What's hot (20)

PDF
clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration
PDF
The Architecture of 11th Generation Intel® Processor Graphics
PDF
Accelerate Large-Scale Inverse Kinematics with the Intel® Distribution of Ope...
PDF
Streamed Cloud Gaming Solutions for Android* and PC Games
PDF
It Doesn't Have to Be Hard: How to Fix Your Performance Woes
PDF
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
PPTX
Tuning For Deep Learning Inference with Intel® Processor Graphics | SIGGRAPH ...
PPTX
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...
PPTX
Unleashing Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Inside the ...
PDF
Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De...
PDF
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
PPT
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
PDF
Intel 8th Core G Series with Radeon Vega M
PDF
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
PDF
Ultra HD Video Scaling: Low-Power HW FF vs. CNN-based Super-Resolution
PDF
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
PPTX
Optimizing Total War*: WARHAMMER II
PPTX
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...
PDF
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
PDF
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration
The Architecture of 11th Generation Intel® Processor Graphics
Accelerate Large-Scale Inverse Kinematics with the Intel® Distribution of Ope...
Streamed Cloud Gaming Solutions for Android* and PC Games
It Doesn't Have to Be Hard: How to Fix Your Performance Woes
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
Tuning For Deep Learning Inference with Intel® Processor Graphics | SIGGRAPH ...
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...
Unleashing Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Inside the ...
Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De...
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Intel 8th Core G Series with Radeon Vega M
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Ultra HD Video Scaling: Low-Power HW FF vs. CNN-based Super-Resolution
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Optimizing Total War*: WARHAMMER II
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Ad

Similar to World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Physics and Graphics (20)

PDF
Ethos_cdnliveIsrael
PPTX
TIVA_Workshop_Session I.pptx Embedded system design using TIVA
PPT
computer architecture.
PPTX
The next generation of GPU APIs for Game Engines
PPTX
Graphics Processing unit ppt
PDF
支援DSL的嵌入式圖形操作環境
PDF
lec01.pdf
PPSX
Summary Of Course Projects
PDF
5035-Pipeline-Optimization-Techniques.pdf
PPTX
Simulating Networks Using Cisco Modeling Labs (TechWiseTV Workshop)
PPTX
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
PDF
3 boyd direct3_d12 (1)
PPTX
DVLSI_project_presentation_template.pptx
PDF
OOW 2013: Where did my CPU go
PPTX
GPU Renderfarm with Integrated Asset Management & Production System (AMPS)
PPTX
GamingAnywhere: An Open Cloud Gaming System
PDF
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
PDF
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
PDF
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
PDF
Memory, Big Data, NoSQL and Virtualization
Ethos_cdnliveIsrael
TIVA_Workshop_Session I.pptx Embedded system design using TIVA
computer architecture.
The next generation of GPU APIs for Game Engines
Graphics Processing unit ppt
支援DSL的嵌入式圖形操作環境
lec01.pdf
Summary Of Course Projects
5035-Pipeline-Optimization-Techniques.pdf
Simulating Networks Using Cisco Modeling Labs (TechWiseTV Workshop)
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
3 boyd direct3_d12 (1)
DVLSI_project_presentation_template.pptx
OOW 2013: Where did my CPU go
GPU Renderfarm with Integrated Asset Management & Production System (AMPS)
GamingAnywhere: An Open Cloud Gaming System
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
Memory, Big Data, NoSQL and Virtualization
Ad

More from Intel® Software (20)

PPTX
AI for All: Biology is eating the world & AI is eating Biology
PPTX
Python Data Science and Machine Learning at Scale with Intel and Anaconda
PDF
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
PDF
AI for good: Scaling AI in science, healthcare, and more.
PDF
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
PPTX
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
PPTX
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
PPTX
AWS & Intel Webinar Series - Accelerating AI Research
PPTX
Intel Developer Program
PDF
Intel AIDC Houston Summit - Overview Slides
PDF
AIDC NY: BODO AI Presentation - 09.19.2019
PDF
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
PDF
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
PDF
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
PDF
AIDC India - AI on IA
PDF
AIDC India - Intel Movidius / Open Vino Slides
PDF
AIDC India - AI Vision Slides
PDF
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
PDF
Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...
PDF
Bring the Future of Entertainment to Your Living Room: MPEG-I Immersive Video...
AI for All: Biology is eating the world & AI is eating Biology
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
AI for good: Scaling AI in science, healthcare, and more.
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
AWS & Intel Webinar Series - Accelerating AI Research
Intel Developer Program
Intel AIDC Houston Summit - Overview Slides
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
AIDC India - AI on IA
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - AI Vision Slides
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time G...

World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Physics and Graphics

  • 1. World of Tanks* 1.0+: Enriching gamers experience with multicore optimized physics and graphics 31.01.2018 Philipp Gerasimov, Mike Voss, Intel Bronislav Sviglo, Wargaming.net GDC 2018
  • 2. 2 • Philipp Gerasimov, Intel Senior Game / Graphics Application Engineer, DRD / SSG, Munich. World of Tanks Introduction • Mike Voss, Intel Principal Engineer, Threading Runtimes Team, DPD / SSG, Austin, Texas. • Bronislav Sviglo, Wargaming.net Rendering Team Lead, World of Tanks Team Minsk.
  • 3. Agenda 3 • Making your game ready for Modern CPUs • World of Tanks 1.0 • Going Beyond 1.0 • Threading Building Blocks (TBB) • Destructions • Tanks Treads • Concurrent Rendering • Summary and Q&A
  • 4. 4 Making your game ready for modern CPUs
  • 5. 5 • Multi Core • All modern platforms are MC: mobile phones, consoles, desktop and mobile PCs. • Adding more cores is the most efficient way to make faster CPUs, so critical for game developers to use it. • Vector Instructions • Another powerful way to improve code performance. • SSE/AVX/AVX2 supported across different CPU vendors on PC and consoles. • Most middleware developers have support for it. Parallel Computing Two vectors of parallelism
  • 6. Parallel Computing: Wide HW user base
      2018
      CPU | i7-8700K | i7-8650U
      # Cores | 6 | 4
      # Threads | 12 | 8
      Base Frequency | 3.70 GHz | 1.90 GHz
      Max Frequency | 4.70 GHz | 4.20 GHz
      Instruction Set Extensions | Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2 | Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2
      SKU | 95W Desktop SKU | 15W Mobile SKU
      2013
      CPU | i7-4790K | i7-4650U
      # Cores | 4 | 2
      # Threads | 8 | 4
      Base Frequency | 3.60 GHz | 1.70 GHz
      Max Frequency | 4.00 GHz | 3.30 GHz
      Instruction Set Extensions | Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2 | Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2
      SKU | 84W Desktop SKU | 15W Mobile SKU
      • Enhance the enthusiasts experience • Maintain strong backward compatibility
  • 7. 7 • Optimizing for Modern CPUs does not mean just fps improvements! • Performance Focused Features: • Critical for low TDP platforms and consoles (Low Frequency Cores) • Not being limited by main / client thread performance • E-Sport class performance for High-End PCs (140+ fps) • Enhanced Gaming Experience Features: • Better visual effects (Better Particles, Smarter occlusion culling) • Better physics (Destructions, collisions, water / wind simulation ) • Better AI (Smarter UI, more characters) • Better sound (High quality 3D sound) Parallel Computing Enriching User Experience
  • 10. 10 • Supports huge range of hardware 2004 – 2018+ • Lots of CPUs with 2 physical cores (~60%) • Has two pipelines simple & advanced • Still has two Graphics API D3D9 on WinXP & D3D11 • Had several graphics evolutions in the past 0.8.x, 0.9.x World of Tanks Graphics overview
  • 13. 13 • Quick overview of destructions • Improved simulation of tank treads • Concurrent rendering World of Tanks 1.0+ Keep enriching gamers experience
  • 14. 14 • Dedicated threads for each major engine sub-system (audio, render, physics, simulation, etc.) • Task based (pool of threads, work subdivided into tasks) • Mix of those approaches World of Tanks 1.0+ How to make the engine concurrent?
  • 15. 15 • Easy to use • Two types of parallelism: functional/task and data • Feature rich and robust • Good support • Threading Building Blocks World of Tanks 1.0+ How to select a good job system?
  • 17. • What • Parallel algorithms and data structures • Threads and synchronization primitives • Scalable memory allocation and task scheduling • General Benefits • Is a library-only solution that does not depend on special compiler support • Is both a commercial product and an open-source project • Supports C++, Windows*, Linux*, OS X*, Android* and other OSes • Commercial support for Intel® Atom™, Core™, Xeon® processors and for Intel® Xeon Phi™ coprocessors Threading Building Blocks (TBB) A widely used C++ template library for parallel programming (since 2006)
  • 18. 18 • Expresses parallelism at a high-level to get efficient performance on different platforms • TBB-based parallelism scales as more cores become available • Gets great performance on newest multicore platforms • But works well on older machines too • Express the parallelism and let TBB map it to the platform • Supports multilevel and nested parallelism well • Functional and data parallelism implemented using the same TBB tasks • Underlying tasks are executed by same set of TBB worker threads • Leads to composable parallelism – TBB schedules all of it • A single thread pool avoids oversubscription problems Threading Building Blocks (TBB) Why it is useful for World of Tanks
  • 19. 19 Threading Building Blocks (TBB) opentbb.org Parallel Execution Interfaces Interfaces Independent of Execution Model
  • 20. 20 buffer get_next_image preprocess detect_with_A detect_with_B make_decision Can express pipelining, task parallelism and data parallelism Threading Building Blocks (TBB) Example Feature Detection Algorithm tbb::parallel_for( … ); tbb::flow_graph
  • 21. To Learn More about TBB: See Intel’s The Parallel Universe Magazine https://software.intel.com/en-us/intel-parallel-universe-magazine http://threadingbuildingblocks.org http://software.intel.com/intel-tbb
  • 23. 23 Destructions Overview • Powered by Havok© destructions module • Collision and physics in one system • Out of the box multithreading
  • 24. 24 Collision scene Destruction scene Destructions Destruction & collision scenes
  • 25. Destructions Architecture 25 Simulation Main thread Collision queries Collision queries Internal Havok threads Internal Havok threads Internal Havok threads ~Number of processor cores IO thread
  • 27. Destructions Summary & future work • Summary • High quality simulation of the destruction process • Multithreaded execution significantly increases performance • Future work • Use the TBB job system to execute Havok tasks
  • 29. 29 • Skinned mesh • Visually static • The tread moves by scrolling its texture • Spline tread • The general shape of the tread is represented by a spline • Each segment is rendered as a separate model • Segments positions are determined by the spline • The spline shape is animated by moving its control points Tank Treads Previous implementations
  • 30. 30 • Spline tracks visually superior to skinned meshes, but still have flaws: • No proper collisions with the environment • Too smooth shapes of curves • A very complex tuning process Tank Treads Things to improve
  • 31. 31 • Spring chain simulation • Procedural animation • Collisions with the environment Tank Treads Designing the new treads
  • 32. 32 • Spring chain controls tread shape • Tread is divided into 4 parts: front, top, back and bottom • Ray cast the area underneath the tank and from height field • Collide each spring joint with it Tank Treads General solution & collisions
  • 33. 33 Tank Treads Performance * RAD Game Tools Telemetry
  • 35. Tank Treads Summary & future work • Summary • High quality simulation • Multicore support • Future work • Treads simulation for exotic tank types • Single threaded optimizations
  • 37. 37 • Engine evolution • Abstract Rendering Interface (ARI) • Architecture • Usage of task and data parallelism Concurrent Rendering Agenda
  • 38. 38 • Initial release based on BigWorld engine (2010) Concurrent Rendering Engine Evolution Tick Render environment Present (wait) Direct3D 9 API Render tanks Render effects
  • 39. 39 • World of Tanks came out as Direct3D 9 game • Powered by BigWorld Engine • Didn’t have any layer between engine renderers and Direct3D API • Highly tied to Direct3D Effect Framework • Never meant to run efficiently on multi-core • No job system • Frame tick and render ran on main thread in the synchronous manner • GPU workload depended on the subsystems’ render order Concurrent Rendering Engine Evolution Tick Render environment Present (wait) Direct3D 9 API Render tanks Render effects
  • 40. 40 • Patch 0.9.15 and Core 3.0 Engine (Aug 2016) Concurrent Rendering Engine Evolution Tick Render environment Resolve queries (wait) Direct3D 9/11 API Render tanks Render effects ARI Command list Main thread Render thread
  • 41. 41 • Patch 0.9.15 and Core 3.0 Engine (Aug 2016) • Abstract Rendering Interface (ARI) • Unified way to handle Direct3D 9 and Direct 3D 11 hardware • Performance gain up to 15-30% • Renderers still working in main thread • Separate thread for ARI-Direct3D per-frame interactions • WGFX intermediate compiler • Independence from Direct 3D effect framework • Faster effect support Concurrent Rendering Engine Evolution Tick Render environment Resolve queries (wait) Direct3D 9/11 API Render tanks Render effects ARI Command list Main thread Render thread
  • 42. 42 • Command list – explicitly describes what should be done by rendering thread • Device interface – our “driver” between ARI frontend and graphics API • Resource – any resource like buffer, texture, query, graphics pipeline state, etc. Concurrent Rendering Abstract Rendering Interface (ARI)
  • 43. • Commands • Simple structures storing function arguments for the operation. • Each command stores all of the state it relies upon. • Concurrency support • Command list writing is done in parallel • Submitted commands don’t have interdependencies • Examples • Draw (instancing, indexed, etc) • Clear (RTV, DSV, UAV) • Query (begin/end) Concurrent Rendering ARI: Command List Diagram: CommandList stores commands as plain data structs (Command: Draw, Dispatch, CopyResource, …)
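
The command-list idea on this slide can be sketched in standard C++ as follows. This is a minimal illustration, not the ARI itself: `DrawCommand`, `ClearCommand`, `CommandList`, `record_subsystem`, `submit` and `render_frame` are hypothetical names, and `std::thread` stands in for whatever job system records the lists. Each command is a plain struct that carries all the state it needs, each subsystem writes into its own list without locks, and one thread replays the lists in a fixed order.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <thread>
#include <variant>
#include <vector>

// Hypothetical commands: plain data, each one carries all the state it relies on.
struct DrawCommand  { std::uint32_t pipeline_state; std::uint32_t vertex_buffer; std::uint32_t index_count; };
struct ClearCommand { std::uint32_t render_target; float rgba[4]; };

using Command = std::variant<DrawCommand, ClearCommand>;

// One list per rendering subsystem, so recording needs no locks.
struct CommandList {
    std::vector<Command> commands;
    void draw(std::uint32_t pso, std::uint32_t vb, std::uint32_t count) {
        commands.push_back(DrawCommand{pso, vb, count});
    }
    void clear(std::uint32_t rt, float r, float g, float b, float a) {
        commands.push_back(ClearCommand{rt, {r, g, b, a}});
    }
};

// Stand-in for a renderer subsystem: records `draws` draw calls into its own list.
void record_subsystem(CommandList& list, std::uint32_t pso, std::uint32_t draws) {
    for (std::uint32_t i = 0; i < draws; ++i) list.draw(pso, /*vb=*/i, /*count=*/36);
}

// Stand-in for the submission thread: replays the lists in a fixed order and
// returns a textual trace instead of calling a real graphics API.
std::vector<std::string> submit(const std::vector<CommandList>& lists) {
    std::vector<std::string> trace;
    for (const CommandList& list : lists)
        for (const Command& c : list.commands)
            trace.push_back(std::holds_alternative<DrawCommand>(c) ? "draw" : "clear");
    return trace;
}

// Records three subsystems in parallel, then submits from the calling thread.
std::vector<std::string> render_frame() {
    std::vector<CommandList> lists(3);
    lists[0].clear(0, 0.f, 0.f, 0.f, 1.f);  // recorded before any worker starts
    std::vector<std::thread> workers;
    for (std::uint32_t s = 0; s < lists.size(); ++s)
        workers.emplace_back(record_subsystem, std::ref(lists[s]), s, s + 1);
    for (std::thread& w : workers) w.join();
    return submit(lists);
}
```

Because the lists are only combined at submission time, the replay order depends on the list order rather than on which worker finished first.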
  • 44. • Free-threaded: e.g. resource creation/removal, adapter state • Single-threaded: e.g. command list submit and compile, read backs, fencing • Creation thread only: e.g. special operations such as device reset. Concurrent Rendering ARI: Device Interface Device::Interface • Creates Resources • Present, Adapter Info, Etc D3D9 D3D11 D3D12
  • 45. • Frame render graph based on TBB flow graph • Separate command lists for each rendering subsystem • TBB primitives for inner subsystems’ parallelism • Separate thread for per-frame graphics API calls 45 Concurrent Rendering Approach overview Core 0 Core 1 Core 2 Core 3Core 0 Core 1 Core 2 Core 3
  • 46. 46 • High-level frame render graph • Nodes • Big chunks of work in one of the renderer subsystems • Intermediate render context flushes • Edges • Dependencies between renderer subsystems • Inner subsystems dependencies • Indication that context is ready for flush Concurrent Rendering Frame render graph Nodes • Occlusion Resolve • Dynamic models • Static Models • Transparent models • Tanks • Vegetation • Terrain • Shadows • Lighting • Atmosphere • Water • Post-Processing • GUI
  • 47. 47 • High-level frame render graph Concurrent Rendering Frame render graph VT prepare Shadows VegetationAtmosphere Particles Visibility resolve VFX prepare Decal culling Particle shadows Environment models Tanks Intermediate submit and sync Water culling Terrain Lighting Intermediate submit Main submit Post processing Transparent models VT tiles UI Water draw Issue occlusion queries
  • 48. • Submitting GPU workload • Main per-frame contexts flush • Every path comes here • Uploads all gathered contexts to GPU submission thread • Intermediate flush • Synchronization point for tasks that are order-dependent on the GPU side • Prevents starvation of the GPU submission thread Concurrent rendering Frame render graph Waits for sync Render thread submits Tick Render environment Render water Render thread submits Render tanks Render effects GPU draws Render thread waits GPU waits
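
A frame render graph of this shape can be sketched as follows. This is a minimal illustration under stated assumptions: the three subsystem nodes and the function `run_frame_graph` are hypothetical, and `std::async` with `std::shared_future` stands in for the TBB flow graph used in the talk (in TBB the nodes would typically be `tbb::flow::continue_node` objects connected with `tbb::flow::make_edge`). Edges are expressed as waits on a predecessor, and the final submit is the point where every path joins.

```cpp
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Runs a tiny, hypothetical frame graph and returns the order in which nodes finished.
std::vector<std::string> run_frame_graph() {
    std::mutex order_mutex;
    std::vector<std::string> order;
    auto record = [&](const char* node) {
        std::lock_guard<std::mutex> lock(order_mutex);
        order.emplace_back(node);
    };

    // Root node: everything below depends on visibility being resolved.
    std::shared_future<void> visibility =
        std::async(std::launch::async, [&] { record("visibility resolve"); }).share();

    // Independent subsystems: each waits only on its own incoming edge.
    auto tanks = std::async(std::launch::async, [&, visibility] {
        visibility.wait();
        record("tanks");
    });
    auto terrain = std::async(std::launch::async, [&, visibility] {
        visibility.wait();
        record("terrain");
    });

    // Main submit: every path ends here, so it waits for all predecessors.
    tanks.get();
    terrain.get();
    record("main submit");
    return order;
}
```

The two middle nodes may finish in either order; only the first and last entries are fixed by the dependencies.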
  • 49. 49 • Render frame ~17ms • Parallel execution off Concurrent rendering Frame render graph
  • 50. 50 • Render frame ~14ms • Parallel execution on • Resolve visibility • Static models Concurrent rendering Frame render graph
  • 51. 51 • Render frame ~12ms • Parallel execution on • Resolve visibility • Static models • Tanks • Lighting • Water • Vegetation • Post-processing Concurrent rendering Frame render graph
  • 52. 52 • Render frame ~8ms • Speedup ~2x • Parallel execution on • Resolve visibility • Static models • Tanks • Lighting • Water • Vegetation • Post-processing • Shadows • ... Concurrent rendering Frame render graph
  • 53. 53 • Functional parallelism • Pros • Easy to implement • Easy to read and maintain • Easy to reason about Concurrent rendering Frame render graph
  • 54. 54 • Functional parallelism • Pros • Easy to implement • Easy to read and maintain • Easy to reason about • Cons • Too high level • Some paths can’t be shortened • Critical execution path Concurrent rendering Frame render graph
  • 56. 56 • Functional parallelism • Pros • Easy to implement • Easy to read and maintain • Easy to reason about • Cons • Too high level • Some paths can’t be shortened • Critical execution path • Data parallelism to rescue! Concurrent rendering Frame render graph
  • 57. 57 Concurrent rendering Frame render graph • Render frame ~12ms • Parallel execution off
  • 58. 58 Concurrent rendering Frame render graph • Render frame ~4ms • Parallel execution on • Speedup ~3x
  • 60. 60 Concurrent rendering Frame render graph • Render frame ~4ms • Parallel execution on • Speedup ~3x • Data parallelism on
  • 61. 61 • Summary • Significant speedup in commands gathering • Code is still simple and easy to modify • Room for more demanding graphics • Future work • Future decomposition • Parallel algorithms • Better submission pattern • Consume spare performance Concurrent rendering Summary & future work
  • 62. 62 • Multi-core CPU is the future of gaming • World of Tanks right now is finding the way to use it effectively World of Tanks 1.0+ Summary
  • 63. Acknowledgments 63 Big thank you to: • Render, R&D artist, Tools, Engine team • … and entire WoT dev team
  • 68. 68 • Our physical simulation should consider the following aspects • Each tread consists of 60 to 250 segments • Hundreds of different tanks and dozens of suspensions to support • Strict requirements for the tread’s stability • The server doesn’t consider the tread when it simulates the tank’s movement • The tread’s setup process should be easy for the artist Tank Treads Challenges
  • 69. • The tread is also able to sag if there is empty space beneath it • For an average PC we always snap the bottom part of the tread to the wheels • For a high-end PC • Calculate the bottom part snapped to the wheels • Simulate the bottom part without snapping • Blend each simulated joint’s final position between the snapped and unsnapped positions, based on how far the joint is from an obstacle Tank Treads Sagging
  • 70. • Parallel execution guidelines • Measure • Start from parallel algorithms with a coarse grain • Measure! • Make global decisions in advance of heavy computations • Reconsider your data structures • Measure!! • Fine-tune algorithm grain sizes • Bring in vectorization as a last resort Concurrent rendering Data parallelism
  • 71. 71 • Data layout guidelines • Make processed objects self-sufficient as much as is reasonable • Group up objects with similar data together and process in blocks • Prefer structures of arrays • Know your data critical paths • Don't read and write to the same buffer Concurrent rendering Data parallelism
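
The data layout guidelines above can be illustrated with a short C++ sketch. Everything here is an assumption made for illustration (`Vec3Array`, `integrate_block`, `integrate_all` and the position/velocity example are not from the talk): fields are stored as a structure of arrays, work is processed in blocks, and the pass reads from one set of buffers and writes to another.

```cpp
#include <cstddef>
#include <vector>

// Structure of arrays: one contiguous buffer per field.
struct Vec3Array {
    std::vector<float> x, y, z;
    explicit Vec3Array(std::size_t n) : x(n), y(n), z(n) {}
    std::size_t size() const { return x.size(); }
};

// Integrates one block [begin, end). Inputs and output are separate buffers,
// so no buffer is both read and written in the same pass.
// Assumes all three arrays have the same size.
void integrate_block(const Vec3Array& position, const Vec3Array& velocity,
                     Vec3Array& next_position,
                     std::size_t begin, std::size_t end, float dt) {
    for (std::size_t i = begin; i < end; ++i) {
        next_position.x[i] = position.x[i] + velocity.x[i] * dt;
        next_position.y[i] = position.y[i] + velocity.y[i] * dt;
        next_position.z[i] = position.z[i] + velocity.z[i] * dt;
    }
}

// Walks the whole array in fixed-size blocks. Blocks are independent, so a
// job system could hand each one to a different worker thread.
void integrate_all(const Vec3Array& position, const Vec3Array& velocity,
                   Vec3Array& next_position, std::size_t block_size, float dt) {
    const std::size_t n = position.size();
    if (block_size == 0) block_size = n;  // treat 0 as "one block"
    for (std::size_t begin = 0; begin < n; begin += block_size) {
        const std::size_t end = begin + block_size < n ? begin + block_size : n;
        integrate_block(position, velocity, next_position, begin, end, dt);
    }
}
```

The block size plays the role of the grain size mentioned in the parallel execution guidelines: it is a tuning parameter to be chosen by measurement.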

Editor's Notes

  • #3: Let me introduce our presenters today. My name is Philipp Gerasimov. I work in the GAME team at Intel’s Developer Relations Division, helping game developers worldwide to enable Intel platforms and optimize their games. Here is my colleague Alexei Fedorov; he works on the Intel® TBB team within the Developer Products Division and is one of the key Intel TBB developers. And last but not least, our special guest from Wargaming.net, Bronislav Sviglo, who is the Rendering Team Lead for World of Tanks.
  • #4: Here is the agenda for today’s presentation.
  • #5: We’re going to start by looking at modern CPUs.
  • #9: Hi! My name is Bronislav. I am a lead graphics engineer at Wargaming Minsk Studio and I’ve been working there for about 10 years now. Before we start talking about the near and distant future of multi-core computing in World of Tanks, I’m glad to present our biggest ever World of Tanks update in terms of graphics and technology. It’s version 1.0 and it’s based on our own in-house engine, Core Engine ©. Let’s take a look at the launch trailer of this huge update (World of Tanks 1.0).
  • #11: In terms of graphics, our game supports a wide range of hardware. I’d like to remind you that 2004 is the year when such significant games for the whole industry as DooM 3, HL2 and FarCry1 were released. So we’re still able to run on such outdated hardware, which by itself is, in my opinion, very impressive. We have two graphics pipelines to support this range of hardware and, unfortunately, we have to support D3D9 and WinXP (the China market is still huge). Despite these facts, our game has gone through several important graphics evolution points. Version 0.8.x brought an additional advanced render pipeline with deferred shading, decals, HDR rendering, better post-processing and much more. Version 0.9.x is very important for us because it brought the PB rendering approach, better temporal anti-aliasing, various important improvements to the post-processing pipeline and more. In the 1.0 update we’ve made all of the content from scratch, so players will see a completely new game in terms of visuals and technology.
  • #12: Before we move forward, I’d like to show you some pretty pictures where you can see the difference between the old map and the new one. I’d also like to emphasize that in terms of gameplay there is no difference. The map is basically the same.
  • #13: The fact that, right now, 60% of our audience has only 2 physical cores requires us to be very careful when adding new computationally heavy features. We also have to care a great deal about having very well optimized single-threaded code, including vector CPU instructions like SSE, AVX, and so on. Nowadays our engine is heavily optimized for 2- to 4-core machines, and logical cores are also very helpful for ongoing activities like IO, streaming, the render thread and so on. But our game is a long-running product and it should be prepared for a future in which multi-core machines, I mean 4+ core machines, are widespread among our players. This is the reason why we’re working on it right now…
  • #14: Being able to use all available CPU cores frees up a lot of performance, which does not always need to lead to just a better frame rate. Good, stable performance is fine, but we can often use this additional budget to improve the overall picture quality or run more advanced simulations. Today we’re going to talk about R&D features that rely heavily on multi-core hardware. They are all in progress and may be part of upcoming releases. In the first topic I’m going to briefly describe how we’re doing destructions in our game. The second topic is how players with multi-core machines may get a more robust and accurate tank tread simulation. And, of course, the third is how we make our graphics engine perform concurrently.
  • #15: There are lots of ways to make your engine concurrent. The old approach uses a set of dedicated threads, one for each major engine sub-system (audio, render, physics, simulation, etc.). The task-based approach has no main execution thread; all work is subdivided into small tasks that are executed on a pool of threads. A mix of those approaches has, on the one hand, a job system for task execution and, on the other, long-running dedicated threads for ongoing activities like streaming, IO, the render thread, etc. We’ve selected the third one. The next step is to choose a good threading API and job system…
  • #16: Of course there are lots of job systems and task schedulers in the public domain, but they all have some limitations: either a bad API or a lack of some advanced features. For our project we’d like an easy-to-use, feature-rich job system with the two main types of parallelism (concurrency): functional (task) and data (I’ll speak about this later). And, of course, the ability to get good support is essential. Fortunately, we’ve found such a library, and it’s Intel’s Threading Building Blocks. Mike is going to give you a quick overview of this library…
  • #17: And now we’ll take a look at a technical demo.
  • #23: We’re going to start with a quick explanation of how we’re doing destructions in our game.
  • #24: Our destructions solution is based on the third-party Havok library or, to be precise, on its destruction module. This physics engine serves as both the physics simulation and the collision detection system. It has out-of-the-box multithreading support, which allows us to significantly speed up the physics simulation.
  • #25: To achieve good performance without sacrificing gameplay, our engine uses two different scenes inside the Havok library: one for accurate collision queries and one for destruction interaction. The collision scene uses compressed triangle meshes that are almost identical to the actual model’s geometry. It gives very accurate collision results, but it’s too slow for the physics simulation. The destruction scene uses simplified, convex-decomposed models. This approach is very fast for physical interaction, and the decomposed collision models are rather lightweight because they are less accurate. This separation gave much more freedom to our artists and lowered the complexity of content production.
  • #26: This is a high-level look at our destruction system architecture, based on the Havok library. The main thread invokes two threads: the input/output system and the simulation. The simulation is divided into a number of tasks, which are executed on a number of internal Havok threads. Collision queries may be performed at any time and are completely thread-safe.
  • #27: In this video you can see lots of complex destructions. Some of them are already part of the 1.0 update, but we’re working on even more advanced destructions.
  • #28: To sum up, I’d say that we’re really happy with the quality of our simulation, and this feature is a good example of how to unlock the potential of multi-core machines. In the future we plan to replace the internal Havok scheduler with our own, based on TBB. This should help minimize the problem of thread oversubscription.
  • #29: The next topic is especially popular for our game about heavy machinery. High quality simulation of tank treads is very important when you’re driving a tank.
  • #30: Currently we have two major types of treads in the 1.0 release. On the minimal settings we just render a skinned mesh of the tread and scroll the texture along it when the tank moves. Nothing special. The “spline tread” approach uses artist-defined control points to construct a spline and then places tread segments along it. Thus we achieve the next level of geometric quality for the tread. Tread animation is also improved, because we can vibrate the spline’s control points between the wheels to simulate changes in the tread’s tension.
  • #31: Spline tracks look very good compared to a plain skinned mesh but, unfortunately, they have several flaws we want to address. The shape of a spline is hard to control; because of that, when we vibrate its control points, some of the spline’s curves can penetrate the terrain and other obstacles. A spline also doesn’t allow us to create hard edges to properly place the tread on blocky objects. We vibrate the spline control points using invisible springs, and the way the springs vibrate is quite unpredictable when you try to tune it. This results in a very difficult tread setup process for the artist.
  • #32: These problems motivated us to create a tread that is physically simulated and produces better visual quality. We did several experiments with various physical models, considering quality and performance cost. The choice was to use a spring chain. Springs were less prone to jitter in our collision environment. Of course they may stretch, which is not physically correct, but that stretch is gradual along the tread and thus not visually noticeable. A nice bonus of the springs is that they naturally produce animation when a tank swings. This gives treads a much more natural look than we could achieve with the spline. The artists are also freed from tuning the tread animation, because it’s fully procedural. And last but not least, during the simulation we can collide the joints of the springs with the collision scene, producing much more detailed collision effects.
  • #33: The eventual solution is the following. The spring chain doesn’t rotate around the tank. We snap it at two points, at the front and back wheels, so it only swings with the tank and collides with obstacles, producing the shape along which we scroll the visible tread segments. The different parts of the track have different tension depending on whether the tank moves forward or backward. To simulate this effect we divide the tread into 4 parts and change the length of the springs for each part separately, to control the tension along the whole tread. At the collision phase we calculate a 1D height field underneath each tread by doing several collision tests with the environment, then we test each spring joint against it. Collisions with the wheels are just a point against a disc. The result is stable enough.
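
One step of such a spring chain can be sketched as below. This is a minimal 2D illustration, not the game’s implementation: the names (`Joint`, `TreadParams`, `sample_height`, `step_tread`), the constants, the unit joint mass and the simple explicit integrator are all assumptions. The first and last joints stay pinned, mirroring the snapping at the front and back wheels, and every other joint is pushed back above a 1D height field after it moves.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Joint { float x, y, vx, vy; };

// Hypothetical parameters; the talk does not give actual values.
struct TreadParams {
    float rest_length = 0.1f;    // spring rest length between neighbouring joints
    float stiffness   = 400.0f;  // spring constant
    float damping     = 4.0f;    // velocity damping
    float gravity     = -9.8f;   // acceleration along y
};

// Samples a 1D height field (uniform spacing, starting at x = 0) with linear interpolation.
float sample_height(const std::vector<float>& heights, float spacing, float x) {
    assert(!heights.empty() && spacing > 0.0f);
    float t = std::fmax(0.0f, x / spacing);
    std::size_t i = static_cast<std::size_t>(t);
    if (i + 1 >= heights.size()) return heights.back();
    float f = t - static_cast<float>(i);
    return heights[i] * (1.0f - f) + heights[i + 1] * f;
}

// One integration step with unit mass per joint. The end joints are never moved.
void step_tread(std::vector<Joint>& joints, const std::vector<float>& heights,
                float spacing, const TreadParams& p, float dt) {
    const std::size_t n = joints.size();
    std::vector<float> fx(n, 0.0f), fy(n, p.gravity);
    // Accumulate spring forces between neighbouring joints.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        float dx = joints[i + 1].x - joints[i].x;
        float dy = joints[i + 1].y - joints[i].y;
        float len = std::sqrt(dx * dx + dy * dy);
        if (len < 1e-6f) continue;
        float force = p.stiffness * (len - p.rest_length);
        fx[i] += force * dx / len;      fy[i] += force * dy / len;
        fx[i + 1] -= force * dx / len;  fy[i + 1] -= force * dy / len;
    }
    // Integrate interior joints, then collide each one with the height field.
    for (std::size_t i = 1; i + 1 < n; ++i) {
        Joint& j = joints[i];
        j.vx += (fx[i] - p.damping * j.vx) * dt;
        j.vy += (fy[i] - p.damping * j.vy) * dt;
        j.x += j.vx * dt;
        j.y += j.vy * dt;
        float ground = sample_height(heights, spacing, j.x);
        if (j.y < ground) {
            j.y = ground;
            if (j.vy < 0.0f) j.vy = 0.0f;
        }
    }
}
```

Changing `rest_length` per section of the chain would be one way to model the four tension zones described above; that part is left out to keep the sketch short.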
  • #34: Each tread is simulated as a job for TBB. To increase concurrency we also exploit data parallelism by using the parallel_for and parallel_reduce functionality. So, thanks to heavy use of multi-threading, the tank tread simulation may not affect the resulting performance at all.
  • #35: In the video you can see all the features of the new tank treads: the sagging effect, where the tread may sag if there is empty space beneath it; the tension effect, where different parts of the tread have different tension depending on whether the tank moves forward or backward; and of course much more accurate collisions with the environment.
  • #36: In summary I’d like to say that we achieved a high-quality tank tread simulation with such cool features as the ability to sag below the tank, better interaction with the environment and so on. In the future we plan to concentrate on tread simulation for some exotic tank types, and of course on finer-grained tuning of concurrent execution.
  • #37: The last but not least topic is concurrent rendering. Rendering is often one of the most difficult things to make concurrent. It becomes even more challenging if you have to support a wide range of hardware and various APIs, especially old ones like DX9, which wasn’t created with multi-threading in mind at all, or DX11, which was, but unsuccessfully: Deferred Context is a very poor concept and it just didn’t work. Vulkan, Metal and DX12 are much better, but as I previously mentioned, our audience is skewed towards old low-end hardware and we have to deal with it. Here we’re going to talk about how to enable concurrent rendering while still being able to run on outdated hardware.
  • #38: And this is our agenda for concurrent rendering. First of all we’re going to talk about our past, even before the 1.0 release. Then I’ll give you an overview of what the Abstract Rendering Interface (ARI from here on) is and how it helps us with concurrency, how we do concurrency in rendering, and how different types of parallelism help us achieve good performance and scalability.
  • #39: World of Tanks came out as a DirectX 9 game back in 2010, powered by the BigWorld engine. There was a strict separation between tick and draw methods, and almost all of the subsystems called the DirectX 9 API directly.
  • #40: Optimal usage of the GPU was guaranteed only by a correct order of execution of the render sub-systems. The material system was based on the D3D Effect Framework, which didn’t support concurrency in any way but was very user friendly (we even decided to use a similar approach in our own effect framework called wgfx). And adding a new graphics API was possible only by overcomplicating the existing code base.
  • #41: It took us a long time to overcome those limitations, and it was challenging because World of Tanks is an online game and the project kept evolving in the meantime. Patch 9.15 was released with lots of under-the-hood optimizations, and we moved away from direct calls to the graphics API towards the Abstract Rendering Interface. ARI allowed us to use different graphics APIs uniformly without losing performance. While creating ARI we were inspired by the low-level graphics APIs that were upcoming at that moment, like Mantle and DX12, and tried our best to make the architecture of ARI very similar to them. (I’ll come back to this later.) So, basically, with ARI we were able to gather the render commands into software command buffers. Those gathered command buffers were then submitted to the render thread, which in turn was in charge of processing them and making direct calls to the appropriate graphics API (at that moment either DX9 or DX11).
  • #42: The wgfx effect framework was introduced as a faster, multicore-aware alternative to the D3D effect framework, with lots of under-the-hood optimizations and a customized API. Despite the fact that we had a separate render thread for processing software command buffers, we still gathered render commands only in the main thread. But even that gave us up to a 30% boost in performance back then. Thus, the next step is to gather render commands concurrently from different threads.
  • #43: As I said previously, when we were creating the ARI framework we were inspired by the low-level graphics APIs of that moment, like Mantle and DX12. So, ARI is based on the following cornerstones. A command list contains a list of render commands like draw, clear, update, copy-resource, etc., which is supposed to be executed later on the render thread. The device interface is basically our “driver” between our user code and a specific graphics API (like DX9, DX11 or DX12). A resource describes any kind of graphics resource like a buffer, texture, query, graphics pipeline state, and so on.
  • #44: Concurrency is achieved by submitting lists of commands to a separate render thread. To be able to record a list of commands for later execution on the render thread, each command should be a simple POD (plain old data) object which contains all the information required for later use and has no dependency on any other command. Command lists may be prepared concurrently from different threads because the command lists don’t depend on each other. The correct result is guaranteed by submitting the command lists to the render thread in the correct order. Examples of commands are draw, clear, query, and so on. I’d like to point out that some kinds of commands, like update or copy-resource, require manual memory management and lifetime control of the content you provide as part of the command. So, when you upload a texture to the GPU you usually don’t want to copy the whole texture content right into the command; instead you just provide a pointer or link to the memory to copy from, and guarantee that this memory remains valid till the end of the frame.
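To make the recording/submission split concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not ARI itself: `Command`, `CommandList`, `render_thread` and `run_frame` are hypothetical names, commands are immutable value objects standing in for POD structs, and "executing" a command just appends it to a log instead of calling a graphics API.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass(frozen=True)
class Command:
    op: str            # e.g. "clear", "draw"
    args: tuple = ()   # plain values only, no references to other commands

class CommandList:
    def __init__(self, name):
        self.name = name
        self.commands = []

    def record(self, op, *args):
        self.commands.append(Command(op, args))

def render_thread(submitted, executed):
    # Consume command lists in submission order; None marks the end of the frame.
    while True:
        cl = submitted.get()
        if cl is None:
            break
        for cmd in cl.commands:
            executed.append((cl.name, cmd.op))  # stand-in for a native API call

def run_frame(recorders):
    """recorders: (name, fill_fn) pairs listed in the required submission order."""
    lists = [CommandList(name) for name, _ in recorders]
    # Each list is filled on its own thread; lists are independent of each other.
    workers = [threading.Thread(target=fn, args=(cl,))
               for (_, fn), cl in zip(recorders, lists)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    submitted, executed = queue.Queue(), []
    rt = threading.Thread(target=render_thread, args=(submitted, executed))
    rt.start()
    for cl in lists:   # correctness comes from the submission order, not recording order
        submitted.put(cl)
    submitted.put(None)
    rt.join()
    return executed
```

Recording can finish in any order across threads; the executed sequence depends only on the order in which the lists are put on the queue.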
  • #45: The ARI interface for device access is divided into three categories based on the access pattern. The free-threaded (or thread-independent) interface is used for operations which may be executed at any time from any thread; with this interface we can create and delete resources, query adapter state and so on. The single-threaded interface is mostly related to command list operations like filling a list with commands, compiling it to native commands in the case of DX12, and so on. The creation-thread-only interface is a legacy thing: it reflects the limitations of old graphics APIs like DX9 and DX11, which must perform special operations like present or reset on the thread that created the render device.
  • #46: Concurrent rendering in our game is built from three parts: the TBB flow graph, which manages dependencies between render tasks and defines the actual order in which gathered command lists are submitted to the render thread; render tasks, which represent independent command lists for each render sub-system of the engine; and a separate render thread, which consumes software command buffers and makes native calls to the graphics APIs. Notice how significantly concurrent rendering increases the usage of all available CPU cores.
  • #47: At the high level, the render frame is managed by the TBB flow graph, which acts as a render task scheduler. The flow graph consists of a number of nodes and edges. Each node of the graph is a big chunk of render work within one sub-system. Edges form dependencies between different sub-systems or key renderer stages. Here you can see various examples of render nodes, so you can get an idea of what they might be.
  • #48: And on this diagram you can see an example of what a high-level frame render graph may look like. Having such a high-level architecture of the frame helps everyone in the team quickly understand the actual flow or order of execution and which sub-systems depend on each other. Moreover, this code is really easy to modify and optimize because all the high-level information about coarse-grained tasks is located in one place. Task-based or functional concurrency works well here because all you need to do to enable concurrency is wrap up each task and add its dependencies, and that’s all.
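The "wrap each task and add its dependencies" idea can be sketched in Python with the standard library's `graphlib.TopologicalSorter` and a thread pool. This is only an analogy for a dependency-driven task graph, not the TBB flow graph API; `run_graph` and the task names used with it are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_graph(tasks, deps, workers=4):
    """Run every task once, starting it only after its dependencies finished.

    tasks: name -> callable (one coarse-grained render task per sub-system).
    deps:  name -> set of task names that must complete first.
    Returns task names in the order their completion was observed.
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()
    finished = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        running = {}
        while sorter.is_active():
            # Launch everything whose dependencies are already satisfied.
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                fut.result()                 # re-raise any task exception
                name = running.pop(fut)
                finished.append(name)
                sorter.done(name)            # may unlock dependent tasks
    return finished
```

All the ordering information lives in the single `deps` mapping, which mirrors the point above that the high-level structure of the frame sits in one place.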
  • #49: As I previously mentioned, our engine uses the TBB flow graph to gather command lists for various sub-systems in parallel and then flush them in a predefined order to the render thread. Unfortunately, if we do only one flush to the render thread per frame, the render thread will stay idle. So, we have to perform intermediate flushes to give the render thread some work as early as possible.
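One way to picture intermediate flushes is sketched below in Python. It assumes a policy that the source does not spell out: whenever a command list finishes, flush the longest run of finished lists at the front of the predefined order. The function name and this exact policy are illustrative assumptions.

```python
def flush_in_order(order, completions):
    """Return the batch flushed to the render thread after each completion.

    order:       command list names in the required submission order.
    completions: the same names, in the order recording actually finished.
    """
    ready, next_idx, batches = set(), 0, []
    for name in completions:
        ready.add(name)
        batch = []
        # Flush as far as the predefined order allows, and no further.
        while next_idx < len(order) and order[next_idx] in ready:
            batch.append(order[next_idx])
            next_idx += 1
        batches.append(batch)
    return batches
```

With this policy the render thread receives work as soon as the head of the order is ready, instead of waiting for the whole frame to be recorded.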
  • #50: And now let’s take a look at an average render frame on an Intel i5 CPU with 4 cores. Here we can see the main thread, which gathers all the commands into a number of independent command lists one by one. And it takes a lot of time to do it (about 17 ms). And this is the render thread, which is actually processing the previous render frame. You may also notice that the render thread is idle for quite a large amount of time. This is because our main thread can’t generate render commands fast enough.
  • #51: And now we enable concurrent command generation, but at first run only two tasks in parallel. As you can see, render tasks are executed in only two threads. This is because of dependencies between tasks. You might also notice that the overall frame time is reduced and, moreover, the idle time in the render thread is lower.
  • #52: If we add five more tasks, the TBB flow graph starts executing them concurrently on different threads, respecting their dependencies. It’s also interesting that, from the point of view of the TBB flow graph, there is no main thread: it acts like an ordinary worker thread.
  • #53: When we put all the render tasks into the job system, the overall frame rate is doubled. But there is one curious moment: this bubble essentially shows that we have to be careful with two things, the size of each individual render task and the set of dependencies between this task and others, or it might lead to big inefficiencies like this one.
  • #54: So, as you can see, functional or task parallelism (concurrency) is pretty easy to implement. It requires minimal effort from us to write and support such code, and because we can take a high-level view of our rendering pipeline, we can easily make modifications to increase parallelism in our engine.
  • #55: Unfortunately, a high-level frame render graph requires us to reason about big chunks of work, which is not always optimal: not all tasks have the same size or execution time, and not all tasks can be executed in parallel because some of them depend on each other.
  • #56: So this approach quickly runs into the critical execution path. You may think about minimizing the number of dependencies between tasks (which is good, by the way) or subdividing tasks into smaller ones, which may significantly reduce the readability of the code. OK, but how could we improve performance while still keeping our code readable?
  • #57: So far we were using only functional (or task) parallelism, but there is also data parallelism. So, what we can do is exploit data parallelism inside each sub-system independently and rely on TBB to do the rest. Thus, we still have an easy-to-read and easy-to-maintain high-level frame render graph, and each sub-system may produce additional tasks to better utilize all available CPU cores.
  • #58: Let’s take a look at a high-end CPU like an i7 with 6 physical cores. We have the ongoing render thread and a set of render tasks executed in the main thread. The render frame takes about 12 ms.
  • #59: And now the same frame but with concurrent rendering enabled. As you can see, the render frame time is reduced significantly and, moreover, all 6 cores are busy. But to reduce the critical path, big tasks should be subdivided into smaller ones (this is future work).
  • #60: And we still have bubbles where CPU cores are idle but the overall CPU utilization is much higher.
  • #61: To reduce the number of bubbles we’re using data parallelism. Here you can see how the VFX prepare task schedules additional sub-tasks to be executed in parallel on other threads. So, bubbles are eliminated or at least their number is reduced.
  • #62: In summary I’d like to say that concurrent rendering allows us to significantly reduce the time spent on command generation. Despite the fact that we’ve added parallel command generation, the code is still simple and easy to modify. And the spare performance may be used for other awesome features like tank treads, more objects in the scene or even more destruction. There is still lots of work to be done: the renderer may be subdivided into even smaller tasks, and each task in turn may be decomposed into a number of sub-tasks by using parallel algorithms. We also have to be very careful with the command submission pattern on outdated graphics APIs and always have enough work for the render thread, so it won’t stay idle. DX12 and Vulkan allow us to process commands right inside the generation thread, so we plan to exploit that to reduce pressure on the render thread in the future.
  • #63: To sum it up I’d like to say that the multi-core age is coming, and right now even not-so-powerful hardware often has 2 or more cores. And today I’ve shown you how we at the World of Tanks development team are trying to leverage all the available power of multi-core machines.
  • #64: I’d like to give a big thanks to Render, R&D Art, Tools Teams… and of course to the entire World Of Tanks development team.
  • #68: And now we’ll take a look at a technical demo.
  • #69: And of course we faced a bunch of problems to solve. A tank could have up to 250 segments in each tread. That’s quite a lot considering that we have up to 30 tanks simultaneously and the springs should collide with both the wheels and environment obstacles. Also, we have hundreds of different tanks with different suspensions, so the final algorithm should be general enough to support such variety. When a tank destroys a building, the debris can move freely and independently, even jitter to some extent. In the case of the tread we can’t afford that: the springs are always in a highly constrained environment (between the wheels and the terrain) and jitter is really annoying. In real life the treads influence where the tank can move and how, but the server doesn’t consider this effect for performance reasons. This results in frequent situations when there is no way to place the tread between the wheels and the terrain. Also, the setup process of the new tread should be easy enough for the artists.
  • #70: And the final feature of the tread is its ability to sag below the tank. But before we continue, please remember that the server doesn’t consider the tread. Because of that we frequently can’t properly place the tread between the bottom wheels and the terrain, so the natural ability of the spring chain to sag doesn’t work out of the box. For an average PC we disabled the effect, snapping the bottom part of the tread to the nearby wheels. But on high-end machines we can afford to run an additional algorithm to resolve this problem. In the first pass we calculate the bottom of the track, snapping it to the nearby wheels as for an average PC. In the second pass we resimulate the bottom without snap constraints. Now each bottom spring joint has two possible positions, one from the snapped pass and one from the unsnapped pass. For each joint we calculate how close it is to the terrain and obstacles, and based on that we blend the final position between the two. If the joint is far away from any obstacle, we favor the position from the second pass; otherwise, the one from the first. This allows the tread to sag, and we have no jitter when there is no way to place it between the bottom wheels and the terrain.
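The per-joint blend between the two passes could look like the Python sketch below. The slide only says the blend depends on how close a joint is to terrain and obstacles; the linear ramp, the `full_sag_at` threshold and the function name are assumptions made for illustration.

```python
def blend_bottom(snapped, unsnapped, clearance, full_sag_at=0.2):
    """Blend bottom-joint positions from the snapped and unsnapped passes.

    snapped, unsnapped: per-joint (x, y) positions from pass one and pass two.
    clearance:          per-joint distance to the nearest terrain or obstacle.
    full_sag_at:        assumed clearance at which the unsnapped result is used fully.
    """
    out = []
    for s, u, c in zip(snapped, unsnapped, clearance):
        # t = 0 keeps the snapped position, t = 1 takes the freely sagging one.
        t = min(max(c / full_sag_at, 0.0), 1.0)
        out.append(tuple(a + (b - a) * t for a, b in zip(s, u)))
    return out
```

Joints close to an obstacle stay on the stable snapped solution, while joints with enough free space follow the sagging simulation, with a smooth transition in between.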