1
Talk about Insomniac's approach to entering VR
Performance profiling - a critical part of developing VR games
Engine changes to improve the experience in VR
Raja - how Intel helped with performance optimizations and creating a more engaging experience on higher-end hardware
2
3
Insomniac has been developing games for over 20 years
Mostly console games using an in-house engine
The engine runs on PC, though it was designed for console
Some of its characteristics might be uncommon on PC
4
Want to use our engine because it has been working well and we are comfortable with it
Don't want to diverge much from console
Reduce maintenance
Engine changes should ideally benefit both platforms
Knew working in VR would be challenging
Performance requirements meant we would have to scale back content from
what we'd have on a console
Also, game design would be different vs console
5
Edge of Nowhere was our first VR title
Edge explored how we could present an interesting environment with
shapes rather than relying on the same level of environment detail we would
have on console
Edge also let us examine how a 3rd person game could have engaging
gameplay while being comfortable in VR
6
Feral Rites used a 3rd person camera to provide a clear view of enemies
surrounding the player
We used camera teleportation and player head movement to allow
comfortable exploration and backtracking
7
Later, in developing The Unspoken, we knew that, unlike on a console, we could make gameplay that required looking up and down.
On a console that can be frustrating, but in VR it feels natural.
This, along with the use of motion controllers, allowed us to make a game where the player is engaged in casting spells and doesn't feel like they are pressing buttons on a controller.
On the technical side, a challenge in The Unspoken has been maintaining a playable frame rate while giving players a strong sense of immersion as they summon creatures and split open the earth.
8
VR requires a high-end system
Limited hardware variations, and we can take advantage of features in newer drivers
Target a min spec: 4-core Intel i5, NVIDIA GTX 970 GPU
Focused on making a comfortable experience on that platform
Avoiding frame rate drops is very important
9
Large amount of GPU driver and VR runtime overhead
Shows up on threads and in applications that our game didn't directly create
Harder to measure
Can cause our game threads to be interrupted
Different from console, where there is not a lot of preemption from external processes – we know how long they will take
Given everything running outside our game, despite the 4-core machine, assume we have about 3 processor threads to work with on min spec
10
That gives you some history on the VR games we developed and how we approached VR.
Before talking about profiling, it's helpful to have some background on how threading in our engine works.
The main thread does the gameplay sim and kicks jobs.
Job threads handle things like physics sim, scene queries and skinning to allow the main thread to keep moving forward.
At the end of a frame, we perform culling and build work for the render thread.
The next frame, the render thread submits the previously simulated frame to the GPU.
The render thread incurs the overhead of talking to the driver.
This works fairly well in VR, but requires some planning to balance workload and avoid stalls. That's where profiling comes in.
11
12
Artists generally use the in-game profiling tools we have built
These largely rely on CPU or GPU timestamp queries
Useful to quickly isolate the cost of a particular asset or a group of assets
For instance, might need to know how much time skinning is taking and then
drill down to individual models
13
Might want to see the costs related to environment art
Something that we can show them more easily than a general-purpose tool
Profiling tools in general can give misleading results when over budget
May look like the CPU is taking a long time when really it is waiting for the GPU
May look like a function is slow when really the processor was switched over to another task
Can cause results to vary wildly from frame to frame
Particularly a problem when there is a heavy CPU load
Suggest identifying issues and verifying fixes on target hardware, and doing profiling on a machine that is in budget
14
Programmers tend to use Telemetry, which is a product from RAD Game Tools.
It can quickly show the correlation between game state and performance.
Something like Event Tracing for Windows is useful too, but not the first choice.
It's harder to reason about what is going on in the game from what that profiler is showing.
Using Telemetry, we can quickly identify frames that are over budget and see what high-level functions were slow on those frames.
<CLICK>
In addition to CPU execution, we added GPU timestamps to the data Telemetry collects.
Synchronization of GPU timestamps to CPU time is a little tricky, but it gets you in the ballpark.
Helpful to see how the GPU and CPU affect each other.
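As a rough illustration (not our actual instrumentation), GPU timestamps on PC can be pulled from D3D11 queries like this; the disjoint query wraps the frame and the two timestamp queries bracket the span of interest:

```cpp
#include <cstdio>
#include <d3d11.h>

// Minimal sketch of converting D3D11 timestamp queries into milliseconds.
// Assumes 'disjoint' wrapped the frame (Begin/End) and 'begin'/'end' were
// marked with ctx->End() around the span of interest on an earlier frame.
void ReadGpuSpanMs(ID3D11DeviceContext* ctx,
                   ID3D11Query* disjoint,  // D3D11_QUERY_TIMESTAMP_DISJOINT
                   ID3D11Query* begin,     // D3D11_QUERY_TIMESTAMP
                   ID3D11Query* end)       // D3D11_QUERY_TIMESTAMP
{
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT freq;
    UINT64 t0 = 0, t1 = 0;

    // Poll without flushing; if the data isn't ready yet, skip and try again next frame.
    if (ctx->GetData(disjoint, &freq, sizeof(freq), D3D11_ASYNC_GETDATA_DONOTFLUSH) != S_OK ||
        freq.Disjoint)
        return; // GPU clock changed mid-frame; discard this sample
    if (ctx->GetData(begin, &t0, sizeof(t0), D3D11_ASYNC_GETDATA_DONOTFLUSH) != S_OK ||
        ctx->GetData(end,   &t1, sizeof(t1), D3D11_ASYNC_GETDATA_DONOTFLUSH) != S_OK)
        return;

    double ms = double(t1 - t0) * 1000.0 / double(freq.Frequency);
    printf("GPU span: %.3f ms\n", ms); // a real engine would hand this to the profiler,
                                       // correlating it to the CPU clock (the tricky part)
}
```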
15
In addition to individual frames, it's useful to see what is going on over a
span of time
For instance, maybe there is an explosion that is causing frame rate
problems
A way to approach this is to plot graphs of the different rendering layers
and see what is slow
<CLICK>
Here we can see the maximum amount of time that alpha took as this effect
played
Maybe that needs to come down. Or maybe it's not alpha but instead a spike
in skinning that is causing draw submission to stall
That's where seeing how the GPU and CPU interact is really useful
16
Telemetry is helpful, but it can't answer all the questions.
Sometimes we can only form hypotheses and need to look outside of our game to find the true cause of an issue.
GPUView is helpful for this. GPUView is a development tool from Microsoft that can show the CPU and GPU state of the machine.
This allows us to see when non-game threads are running on the processor and when work from the game is being executed on the GPU.
The difficulty is that it takes some work to correlate what the game is doing with what is being shown.
Stack sampling works OK, but doesn't give the detail we would see with instrumentation in Telemetry.
Here is a capture of a part of the game that was struggling to stay in frame.
17
18
Not a concern on consoles because we can compile shaders offline since we know the target hardware.
On PC, we load DirectX bytecode which gets compiled by the driver at runtime.
Not so much of an issue if there is a lot of free CPU time, but it caused huge stalls on min spec in Edge of Nowhere.
That game was particularly affected because it streams new segments of the game, requiring new shaders, while the player is walking through tunnels.
This means any frame hitch is noticeable.
19
The workaround was to create all the shaders the game would need when it started.
The compilation gets handled by a background driver thread and doesn't cause a stall unless a shader is needed before its compilation has finished.
We only wait for the needed shader to compile, not all of them.
The start of the game fades in some splash screens and then the menu.
Since the screen is essentially black when we start drawing those and block on shader compilation, the player doesn't perceive a frame hitch.
Compilation continues in the background while the player navigates the menu and is probably mostly done by the time actual gameplay gets going.
On subsequent runs, shaders are cached by the driver so we don't need to wait for compilation.
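A minimal sketch of that warm-up in D3D11 might look like the following; ShaderBlob and the surrounding plumbing are placeholders rather than engine API:

```cpp
#include <d3d11.h>
#include <vector>
#include <wrl/client.h>

struct ShaderBlob { const void* bytecode; size_t size; }; // placeholder for loaded DXBC

// Create every pixel shader up front so the driver can start compiling right away.
// Whether compilation actually runs on background driver threads is driver dependent.
void PrewarmPixelShaders(ID3D11Device* device,
                         const std::vector<ShaderBlob>& blobs,
                         std::vector<Microsoft::WRL::ComPtr<ID3D11PixelShader>>& out)
{
    out.reserve(blobs.size());
    for (const ShaderBlob& blob : blobs)
    {
        Microsoft::WRL::ComPtr<ID3D11PixelShader> ps;
        // Creation usually returns quickly; the game only stalls later if it
        // binds a shader whose compilation hasn't finished yet.
        if (SUCCEEDED(device->CreatePixelShader(blob.bytecode, blob.size, nullptr, &ps)))
            out.push_back(ps);
    }
}
```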
This is dicey: it's driver dependent, and background compilation can be disabled in some drivers.
It worked for us.
20
We use visibility queries to determine how much of an object is visible on a given frame.
Used for things like lens flares: test the number of pixels that are visible for a shape and use that to blend in the lens flare effect.
At 30 Hz on console, we expect to be able to submit a query after we lay down the G-buffer and read it back later, after lighting, to set an intensity value before GPU submission.
Remember that I said we have tight GPU synchronization.
On console we have probably 10 ms between issuing a query and reading it back. In VR, we're looking at 1 ms.
Stalling the render thread isn't really an option because it can create periods where the GPU has no work.
The solution was to submit queries on one frame and read them back on the next frame.
That means we're one frame behind on visibility, but at 90 Hz you don't notice it.
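A minimal D3D11 sketch of that submit-now, read-next-frame pattern (the flare proxy draw is a placeholder, not engine code):

```cpp
#include <d3d11.h>

// Submit an occlusion query on frame N, read it on frame N+1 without stalling.
struct FlareQuery
{
    ID3D11Query* query;            // created with D3D11_QUERY_OCCLUSION
    UINT64       visiblePixels = 0;
};

void SubmitFlareQuery(ID3D11DeviceContext* ctx, FlareQuery& fq)
{
    ctx->Begin(fq.query);
    // DrawFlareProxy(ctx);  // draw the occlusion test shape here
    ctx->End(fq.query);
}

void ReadLastFramesQuery(ID3D11DeviceContext* ctx, FlareQuery& fq)
{
    UINT64 pixels = 0;
    // DONOTFLUSH keeps this from forcing a submit; if the result isn't ready
    // we simply keep last frame's value. One frame of latency is invisible at 90 Hz.
    if (ctx->GetData(fq.query, &pixels, sizeof(pixels),
                     D3D11_ASYNC_GETDATA_DONOTFLUSH) == S_OK)
        fq.visiblePixels = pixels;
}
```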
21
This next optimization is interesting because it doesn't work on consoles.
When applying lighting, the light data is stored in a constant buffer on PC.
That's because it is about 16% faster on certain NVIDIA hardware than using a structured buffer.
On AMD hardware there doesn't seem to be much difference.
However, we store it in a structured buffer on PS4 because using a constant buffer there turned out to be significantly slower.
This is surprising given the AMD PC results.
So it may be faster to use one type of buffer over another, but to really know you need to profile.
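For reference, here is roughly what the two paths look like on the D3D11 side, with an illustrative light layout (the HLSL side would be a cbuffer array versus a StructuredBuffer of the same struct); this is a sketch, not Insomniac's code:

```cpp
#include <d3d11.h>

struct GpuLight { float posRadius[4]; float colorIntensity[4]; }; // illustrative layout

// Path A: light array in a constant buffer (the PC/NVIDIA choice described above).
// Note D3D11 constant buffers are capped at 64 KB, which bounds maxLights.
HRESULT CreateLightConstantBuffer(ID3D11Device* dev, UINT maxLights, ID3D11Buffer** out)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = maxLights * sizeof(GpuLight); // multiple of 16 bytes
    desc.Usage          = D3D11_USAGE_DYNAMIC;
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    return dev->CreateBuffer(&desc, nullptr, out);
}

// Path B: the same data as a structured buffer + SRV (analogous to the PS4 path above).
HRESULT CreateLightStructuredBuffer(ID3D11Device* dev, UINT maxLights,
                                    ID3D11Buffer** buf, ID3D11ShaderResourceView** srv)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth           = maxLights * sizeof(GpuLight);
    desc.Usage               = D3D11_USAGE_DYNAMIC;
    desc.BindFlags           = D3D11_BIND_SHADER_RESOURCE;
    desc.CPUAccessFlags      = D3D11_CPU_ACCESS_WRITE;
    desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(GpuLight);
    HRESULT hr = dev->CreateBuffer(&desc, nullptr, buf);
    if (FAILED(hr)) return hr;

    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format             = DXGI_FORMAT_UNKNOWN;            // structured buffers use UNKNOWN
    srvDesc.ViewDimension      = D3D11_SRV_DIMENSION_BUFFER;
    srvDesc.Buffer.NumElements = maxLights;
    return dev->CreateShaderResourceView(*buf, &srvDesc, srv);
}
```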
22
Generally, moving to VR meant rendering everything a second time.
It also meant turning off a lot of graphical features.
Ambient occlusion, screen space reflection and baked GI were all axed because they were too expensive.
It had to be done: our console baked lighting took 4 ms in VR. There was no way that could work.
Our baked lighting does a lot.
It stores bounced lighting and accounts for sky visibility and specular occlusion.
It allows us to bake in lights and get them essentially for free.
And it applies to static and dynamic objects equally to create a well-grounded look.
Instead, we had a single environment probe.
23
Edge of Nowhere involves going in and out of caves in Antarctica. As you
might expect, the env probe used for the surface of a glacier doesn't look
great inside a cave.
It took a lot of effort from the lighters to carefully construct transitions from
outside to inside
Given these challenges, and with two other VR titles on the horizon, it made sense to see if we could adopt our console solution somehow
24
On console, we store samples in sparse 16-meter cubes called light grids
Samples store irradiance information in 6 directions plus an occlusion plane
The occlusion plane is automatically calculated and helps avoid light leaking
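A hypothetical layout for one sample, just to make the data concrete (the actual format is Insomniac's own):

```cpp
// Hypothetical light-grid sample as described above: irradiance along six
// axis directions plus an automatically fitted occlusion plane.
struct LightGridSample
{
    float irradianceRGB[6][3]; // +X, -X, +Y, -Y, +Z, -Z bounced-light colors
    float occlusionPlane[4];   // plane (nx, ny, nz, d) used to reject leaking contributions
};
```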
25
To calculate lighting for a point, we find the 8 nearest samples, use the occlusion planes to determine the contribution of each sample, and manually interpolate the sample values.
In VR, occlusion and interpolation each cost around 1 ms. That means they are both too expensive without significant reductions somewhere else.
In addition to light grids, the console implementation is able to use multiple environment probes to add specular light.
Since we were mainly concerned with diffuse light and specular occlusion, this part of the console approach was ignored and we stuck with a single probe for specular lighting.
Even ignoring specular, the entire approach was very expensive and it became clear we needed to do something very different.
26
I found that we could take two samples of a volume texture in less than 1/10th of a millisecond.
That gives us free interpolation of sample data.
However, it meant we could only have 8 channels of lighting data and couldn't use the occlusion planes.
27
Working without occlusion planes was the first challenge.
The already thick walls in Edge of Nowhere helped avoid light leaking, which solved part of the problem.
Still, without occlusion planes, if we were to take samples on a uniform grid, the result would be discolored patches from samples that are embedded in geometry.
The solution was to move samples outside of geometry and slightly away from surfaces.
The red spheres in this picture represent sample points on the uniform grid that were too close to geometry.
If you look closely, there is a white line connecting each red sphere to a different sphere which represents the new sample location and values.
It does help to keep samples that are deeply embedded dark, though. This provides good darkening in cases like shrubs and rock piles.
28
With a method in place to deal with embedded samples, the next issue was compressing the data to fit into a limited number of textures.
In Edge of Nowhere, indirect light tended to vary in direction, but not so much in color.
We took advantage of this by converting the RGB irradiance samples into a luma-chroma format.
We then store 6 directions of luminance and an average color in every sample.
That means the data can be fetched and interpolated using two texture reads.
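A sketch of what that packing could look like, assuming Rec. 601-style luma weights and a chroma-difference encoding; both are assumptions, not the shipped format:

```cpp
// Illustrative packing of one sample into 8 channels (two RGBA volume texels):
// six directional luminance values plus one shared average chroma pair.
struct PackedSample { float texel0[4]; float texel1[4]; };

static float Luma(const float rgb[3])
{
    return 0.299f * rgb[0] + 0.587f * rgb[1] + 0.114f * rgb[2]; // assumed weights
}

PackedSample PackIrradiance(const float dirRGB[6][3])
{
    PackedSample out = {};
    float avg[3] = {0.0f, 0.0f, 0.0f};
    for (int d = 0; d < 6; ++d)
        for (int c = 0; c < 3; ++c)
            avg[c] += dirRGB[d][c] / 6.0f;

    float* ch[8] = { &out.texel0[0], &out.texel0[1], &out.texel0[2], &out.texel0[3],
                     &out.texel1[0], &out.texel1[1], &out.texel1[2], &out.texel1[3] };

    // Channels 0-5: luminance along each of the six directions.
    for (int d = 0; d < 6; ++d)
        *ch[d] = Luma(dirRGB[d]);

    // Channels 6-7: chroma of the average color, shared by all directions.
    float avgLuma = Luma(avg);
    *ch[6] = avg[2] - avgLuma; // blue-difference
    *ch[7] = avg[0] - avgLuma; // red-difference
    return out;
}
```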
29
Two large volume textures store lighting data within about 100 meters of the camera and are updated as the player moves around.
A default lighting value is applied to areas outside the volume covered by the textures.
Sampling, unpacking and fading to the default value takes 1/10 of a millisecond on the min spec hardware.
30
Here is the difference between console and VR.
Side by side the difference is noticeable, though it's not bad.
Of course, this is game development: although this baked lighting was developed for Edge of Nowhere, the game didn't actually ship with it because the tech was a side project and came online too late in production for people to feel comfortable switching.
Feral Rites and The Unspoken have used it and it's worked out well for them.
31
Thanks Bob!
Let’s dig deeper into how we approached performance analysis on The
Unspoken.
Before we look at the execution timeline flow in VR, let's quickly go over platform differences in the way render presentation and thread interaction are handled.
32
As Bob mentioned, regardless of the platform, the main thread (MT) is ahead of the render submission thread (RS) by a frame.
The low-level rendering APIs on consoles allow for very tight control between RS and the GPU, and consequently MT.
(adv)
After RS finishes submitting all the GPU commands, it doesn't wake up MT immediately.
It waits for the GPU to let it know that it's reached the post-processing stage.
Once that happens, RS wakes MT up and work for the next frame begins just-in-time.
This minimizes input latency and does come at the risk of starving the GPU, but since the hardware is fixed, it can be tweaked almost to perfection.
(adv)
Present simply inserts a flag and doesn't block.
33
On PC, the display flip functionality is handled by DXGI and the display driver, and we have a layer of abstraction to interact with it.
(adv)
Depending on the arguments to Present and the maximum frame latency, it can block to prevent the CPU from getting too far ahead of the GPU (and keep input latency in check).
There isn't a non-intrusive way to kick off work depending on where the GPU is. What this leads to is less control over when exactly MT needs to start working.
(adv)
PC VR adds another dependency.
The Oculus/SteamVR runtime throttles the app to the HMD refresh rate by blocking on frame submission. Engine devs need to be aware of this! There isn't a polling option.
Buffering extra frames is a no-no unless we throw frames away.
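To make the blocking concrete, here is a minimal sketch of an end-of-frame submission against the 1.x LibOVR API linked below (layer setup and error handling omitted; this is not the engine's actual code):

```cpp
#include <OVR_CAPI.h>
#include <dxgi.h>

// Sketch only: the eye layer is assumed to have been filled in elsewhere.
void EndOfFrameSubmit(ovrSession session, long long frameIndex,
                      const ovrLayerEyeFov& eyeLayer, IDXGISwapChain* mirrorSwapChain)
{
    // Non-blocking mirror present for the desktop window (sync interval 0).
    mirrorSwapChain->Present(0, 0);

    // This is the call that throttles the app to the HMD refresh rate: it can
    // block for several milliseconds, so nothing latency-critical should sit
    // behind it on the same thread.
    ovrLayerHeader const* layers[] = { &eyeLayer.Header };
    ovr_SubmitFrame(session, frameIndex, nullptr, layers, 1);
}
```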
Further reading:
DXGI flip vs bitblt models: https://msdn.microsoft.com/en-
us/library/windows/desktop/hh706346%28v=vs.85%29.aspx
Ovr_submitFrame: https://developer3.oculus.com/doc/1.3.0-
libovr/_o_v_r___c_a_p_i_8h.html#aa84f958b8d78f2594adf6fb0ad372558
34
Here's a simplified timeline view of how the engine execution flows in VR.
(talk about the layout)
We'll focus on the MT and RS threads on the CPU and see how one frame's worth of work flows.
(adv)
MT starts things off with sim and render prep work for frame N, during which RS is submitting draws for frame N-1.
RS does the mirror present (immediate mode) and is blocked by ovr_SubmitFrame.
(adv)
When it regains execution, it wakes up MT, updates gfx resources, submits non-skinned objects and waits on the MT/job threads to finish generating the skinned vertex buffer data.
(adv)
The GPU is mostly idle until the skinned objects are submitted. And we want to fix this!
(adv)
Once all the render work has been submitted, RS does the mirror present (non-blocking) and calls ovr_SubmitFrame, which blocks.
(adv)
The Oculus runtime wakes up a few ms before HMD vsync and submits the post-process (timewarp, barrel distortion, chromatic aberration) work on the compute queue.
(adv)
35
Over the next vblank interval, this data is streamed to the display and finally
shows up on the next vsync.
35
To summarize the dependencies that we need to address:
1) The Main thread needn't wait for the Render Submission thread to be unblocked before starting work on the next frame
- This also stops the Render Submission thread from waiting on the Main thread to update skinned vertices
- It adds to the input/motion-to-photon latency, but timewarp does a good enough job to make it unnoticeable.
2) The diagram before didn't show this, but we'll talk about it in a bit.
36
(walk through animations)
Notice that we didn't make any improvements to the rendering itself. Fixing system-level bottlenecks is generally the biggest bang for the buck.
37
Insomniac Games' engine uses an in-house CPU occlusion culling system based on a downsampled depth buffer.
On console, with the ~30 ms frame budget, we could read back the downsampled depth buffer of the current GPU frame after the G-buffer pass (while the GPU is doing other work) for MT to use for occlusion culling objects in the next frame.
In VR, this isn't possible at all without causing the GPU to starve or MT to wait too long.
So, the deferred occlusion system ended up using the two-frame-old depth buffer for culling.
Worth noting that since this is VR and each eye is treated as a separate camera, this is done once per eye.
38
The main thread reprojects the downsampled depth buffer, fills holes,
creates a mipchain and uses the latter for occlusion queries.
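The kind of test that mip chain enables looks roughly like this (an illustrative hierarchical-depth check, not the engine's implementation); each mip stores the farthest depth per texel so a conservative visibility decision only touches a few texels:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One mip of the downsampled depth pyramid, storing the FARTHEST depth per texel.
struct DepthMip { int width, height; std::vector<float> maxDepth; };

// Returns true if an object whose screen-space rect is [x0,x1]x[y0,y1] (0..1 UVs)
// with nearest depth 'objDepth' (0 = near, 1 = far) could be visible.
bool CouldBeVisible(const std::vector<DepthMip>& mips,
                    float x0, float y0, float x1, float y1, float objDepth)
{
    // Pick a mip where the rect covers only a handful of texels.
    float rectTexels = std::max((x1 - x0) * mips[0].width, (y1 - y0) * mips[0].height);
    int mip = std::min((int)mips.size() - 1,
                       std::max(0, (int)std::ceil(std::log2(std::max(rectTexels, 1.0f)))));

    const DepthMip& m = mips[mip];
    int tx0 = std::max(0, (int)(x0 * m.width)),  tx1 = std::min(m.width  - 1, (int)(x1 * m.width));
    int ty0 = std::max(0, (int)(y0 * m.height)), ty1 = std::min(m.height - 1, (int)(y1 * m.height));

    for (int y = ty0; y <= ty1; ++y)
        for (int x = tx0; x <= tx1; ++x)
            if (objDepth <= m.maxDepth[y * m.width + x])
                return true; // nearer than the farthest occluder somewhere: keep it
    return false;            // every covered texel's occluders are closer: cull it
}
```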
39
40
(use animations)
Note: No need to over-engineer a solution to use one eye’s depth buffer to
do occlusion for both eyes either. MT isn’t the bottleneck here!
41
We found this issue with GPUView, and ended up looking at the event callstack on the render thread to debug it.
There are still some bubbles at the start of the GPU queue. This is why ETW-based tracing that shows the DMA packets is so useful.
Using CPU/GPU timers in Telemetry doesn't accurately portray the reality.
The remaining bubbles are because RS is updating various GPU resources at the start and not feeding the GPU.
We are going to try out deferred contexts for resource updates (UpdateSubresource style) while RS is blocked, to save the driver copy overhead and reduce GPU starvation.
42
Let's briefly talk about ASW and what the task timeline looks like when it is enabled by the Oculus runtime.
Asynchronous timewarp does a rotational reprojection of the eye buffer based on the HMD's orientation at the point the Oculus post-processing is done.
It compensates for HMD rotational movement from the time the render commands were dispatched.
Asynchronous spacewarp accounts for translation changes based on a per-pixel velocity buffer and is a mechanism that allows low-end VR rigs to render/simulate at 45 Hz while still refreshing the HMD at 90 Hz.
What happens as a result? ovr_SubmitFrame blocks for a much longer time (~15 ms vs ~5 ms without it).
Worth pointing out that the input/motion-to-photon latency is worsened because of the engine refactoring to unblock MT.
43
GPUView lets us understand the behind-the-scenes part pretty well.
It's interesting to see how much time is simply wasted because we couldn't meet the <10 ms GPU frame budget.
An argument worth making here is to provide the best possible frame quality and fall back to ASW if we simply cannot meet the 90 Hz frame update.
44
So far, we’ve talked about removing GPU bubbles; let’s look at the CPU side
of things for a bit.
45
With VR dev and its ~10 ms frame budget, there's all the more reason to use the CPU effectively.
Over the last decade, there's been a steady move to push work onto the GPU.
VR brings the onus back on the CPU to simulate, cull, etc.
(adv)
Here's a GPUView snip from an early build of the game running on an i7-6700K/GTX 980. Even with the GPU almost pegged, the net CPU utilization is < 40%.
IG's engine already used a CPU path for physics, particles and occlusion culling, yet there was so much idle CPU time up for grabs.
46
Enough with all these timeline views, don't you think?
Here's some eye candy from the game, showing the comparison between the default settings on a low-end and a high-end VR rig that allow the game to run at ~90 fps.
The quality preset is automatically set based on the underlying system config; the player is free to override it.
We'll look at the features in some more detail on the next slide.
47
With the ultra setting, you’ll see the following additions during the course of
a match:
(narrate from slides)
48
Here's a video from the level editor showing what we were going for, and what we ended up with.
You really do need a LOT of particles to show fluid-sim-like motion.
Unfortunately, the cost of the updates did not permit us to use these assets.
49
You can't just use 3D Perlin noise to move particles; you'll find that they clump after a point and don't keep flowing.
That's because the resulting potential field is smooth, but not divergence-free.
Luckily, the vector calculus operator 'curl' does the black magic for us (see the sketch after the links below).
Here are links to resources on the topic:
https://www.cs.ubc.ca/~rbridson/docs/bridson-siggraph2007-curlnoise.pdf
http://prideout.net/blog/?p=63
catlikecoding.com/unity/tutorials/noise-derivatives/
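Here is a small sketch of the idea; the potential field is stubbed with sines so the snippet stands alone, whereas a real implementation would use 3D Perlin or simplex noise (ideally with analytic derivatives instead of finite differences):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Stand-in vector potential; a real implementation would sample three
// independent channels of 3D Perlin/simplex noise here.
static Vec3 Potential(float x, float y, float z)
{
    return { std::sin(y * 1.7f + z), std::sin(z * 2.3f + x), std::sin(x * 1.3f + y) };
}

// Curl of the potential field via central differences. The resulting velocity
// field is divergence-free, which is what keeps particles from clumping.
Vec3 CurlNoise(float x, float y, float z)
{
    const float e = 1e-2f;
    Vec3 dx0 = Potential(x - e, y, z), dx1 = Potential(x + e, y, z);
    Vec3 dy0 = Potential(x, y - e, z), dy1 = Potential(x, y + e, z);
    Vec3 dz0 = Potential(x, y, z - e), dz1 = Potential(x, y, z + e);
    float inv2e = 1.0f / (2.0f * e);

    return {
        ((dy1.z - dy0.z) - (dz1.y - dz0.y)) * inv2e,  // dPz/dy - dPy/dz
        ((dz1.x - dz0.x) - (dx1.z - dx0.z)) * inv2e,  // dPx/dz - dPz/dx
        ((dx1.y - dx0.y) - (dy1.x - dy0.x)) * inv2e   // dPy/dx - dPx/dy
    };
}
```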
50
So, how complex was it to add curl into the engine and VFX toolchain?
… existing engine concepts …
In terms of authoring, it's super easy to make crazy cool-looking stuff, since it's just a few sliders being pushed around.
However, it's a very unintuitive process right now. There are no flow fields in the editor to show what an effect may look like.
51
So, what does it perform like? ~1 ms for 1000 particles.
Adds 4-5 ms of total work per frame, 2-3 ms in net cost due to tasking.
Not performant enough for a massive fluid-like particle sim.
The implementation is currently:
FP ops bound
Limited by the particle update system
..
52
53
Before we look at performance stats on different settings/hardware, let's revisit the importance of measurement-based development.
Having multiple lenses to look at your game's performance is key to making informed decisions to improve the game.
While most VR titles use Unity and Unreal, having a native in-house engine allows for immense fine-tuning.
Insomniac's in-house engine is a great example of what a performance-driven dev approach can yield.
Let's revisit the profiling options available.
54
Here's comparison data from our lab using the VR stat logging system in the engine on the Lockport bridge arena.
The table shows data for an i5-4590 + GTX 970 / 980 and an i7-6700K + GTX 980 at various settings.
Two main observations:
- The game drops fewer frames on a better CPU even when it's predominantly GPU limited
- ASW disabled yields far fewer dropped frames than ASW auto. The reason for this isn't clear yet. It is possible that the ovrPerfStats field was interpreted incorrectly when ASW is on.
55
In conclusion,
1) Engine dependencies need to be looked into for VR, and hopefully decoupled more. It can help the non-VR path as well in the process!
2) Profiling, instrumentation and knowledge of different tools help detect and reason about bottlenecks in the execution.
3) Don't discount the CPU! It is tremendously powerful, and can shave off quite a bit of GPU time and help improve immersion.
And lest you forget: unused silicon is sad silicon!
56
Shoutout to Shaun, Yancy and Abdul [IG], and Cristiano, ChrisK, Dave and Brian [Intel].
Thank you for listening; we hope you found this useful.
Please spare a few minutes to rate us.
If you have any questions, please step up to the mic while we show you some wonderful legal disclaimers.
57
58
59
60
61