1
Talk about Insomniac's approach to entering VR
Performance profiling - a critical part of developing VR games
Engine changes to improve the experience in VR
Raja - how Intel helped with performance optimizations and creating a more engaging experience on higher-end hardware
2
3
Insomniac has been developing games for over 20 years
Mostly console games using an in-house engine
The engine runs on PC, though it was designed for console
Some of its characteristics might be uncommon on PC
4
Want to use our engine because it has been working well and we are comfortable with it
Don't want to diverge much from console
Reduce maintenance
Engine changes should ideally benefit both platforms
Knew working in VR would be challenging
Performance requirements meant we would have to scale back content from
what we'd have on a console
Also, game design would be different vs console
5
Edge of Nowhere was our first VR title
Edge explored how we could present an interesting environment with
shapes rather than relying on the same level of environment detail we would
have on console
Edge also let us examine how a 3rd person game could have engaging
gameplay while being comfortable in VR
6
Feral Rites used a 3rd person camera to provide a clear view of enemies
surrounding the player
We used camera teleportation and player head movement to allow
comfortable exploration and backtracking
7
Later, in developing The Unspoken, we knew that, unlike on a console, we could make gameplay that required looking up and down.
On a console that can be frustrating, but in VR it feels natural.
This, along with the use of motion controllers, allowed us to make a game where the player is engaged in casting spells and doesn't feel like they are pressing buttons on a controller.
On the technical side, a challenge in The Unspoken has been maintaining a playable frame rate while giving players a strong sense of immersion as they summon creatures and split open the earth.
8
VR requires a high-end system
Limited hardware variations, and we can take advantage of features in newer drivers
Target a min spec: 4-core Intel i5, NVIDIA GTX 970 GPU
Focused on making a comfortable experience on that platform
Avoiding frame rate drops is very important
9
Large amount of GPU driver and VR runtime overhead
Shows up on threads and in applications that our game didn't directly create
Harder to measure
Can cause our game threads to be interrupted
Different from console, where there is not a lot of preemption from external processes – we know how long they will take
Given everything running outside our game, despite the 4-core machine, assume we have about 3 processor threads to work with on min spec
10
That gives you some history on the VR games we developed and how we approached VR.
Before talking about profiling, it's helpful to have some background on how threading in our engine works.
The main thread does the gameplay sim and kicks jobs.
Job threads handle things like physics sim, scene queries and skinning to allow the main thread to keep moving forward.
At the end of a frame, we perform culling and build work for the render thread.
The next frame, the render thread submits the previously simulated frame to the GPU.
The render thread incurs the overhead of talking to the driver.
This works fairly well in VR, but requires some planning to balance workload and avoid stalls. That's where profiling comes in.
11
12
Artists generally use the in-game profiling tools we have built
These largely rely on CPU or GPU timestamp queries
Useful to quickly isolate the cost of a particular asset or a group of assets
For instance, might need to know how much time skinning is taking and then
drill down to individual models
13
Might want to see the costs related to environment art
Something that we can show them more easily than a general-purpose tool
Profiling tools in general can give misleading results when over budget
May look like the CPU is taking a long time when really it is waiting for the GPU
May look like a function is slow when really the processor was switched over to another task
Can cause results to vary wildly from frame to frame
Particularly a problem when there is a heavy CPU load
Suggest identifying issues and verifying fixes on target hardware, and doing profiling on a machine that is in budget
14
Programmers tend to use Telemetry, which is a product from RAD Game Tools.
It can quickly show the correlation between game state and performance.
Something like Event Tracing for Windows is useful too, but not the first choice.
It's harder to reason about what is going on in the game from what that profiler is showing.
Using Telemetry, we can quickly identify frames that are over budget and see what high-level functions were slow on those frames.
<CLICK>
In addition to CPU execution, we added GPU timestamps to the data Telemetry collects.
Synchronization of GPU timestamps to CPU time is a little tricky, but it gets you in the ballpark.
Helpful to see how the GPU and CPU affect each other.
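As a rough illustration (not our actual instrumentation), GPU timestamps on PC can be pulled from D3D11 queries like this; the disjoint query wraps the frame and the two timestamp queries bracket the span of interest:

```cpp
#include <cstdio>
#include <d3d11.h>

// Minimal sketch of converting D3D11 timestamp queries into milliseconds.
// Assumes 'disjoint' wrapped the frame (Begin/End) and 'begin'/'end' were
// marked with ctx->End() around the span of interest on an earlier frame.
void ReadGpuSpanMs(ID3D11DeviceContext* ctx,
                   ID3D11Query* disjoint,  // D3D11_QUERY_TIMESTAMP_DISJOINT
                   ID3D11Query* begin,     // D3D11_QUERY_TIMESTAMP
                   ID3D11Query* end)       // D3D11_QUERY_TIMESTAMP
{
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT freq;
    UINT64 t0 = 0, t1 = 0;

    // Poll without flushing; if the data isn't ready yet, skip and try again next frame.
    if (ctx->GetData(disjoint, &freq, sizeof(freq), D3D11_ASYNC_GETDATA_DONOTFLUSH) != S_OK ||
        freq.Disjoint)
        return; // GPU clock changed mid-frame; discard this sample
    if (ctx->GetData(begin, &t0, sizeof(t0), D3D11_ASYNC_GETDATA_DONOTFLUSH) != S_OK ||
        ctx->GetData(end,   &t1, sizeof(t1), D3D11_ASYNC_GETDATA_DONOTFLUSH) != S_OK)
        return;

    double ms = double(t1 - t0) * 1000.0 / double(freq.Frequency);
    printf("GPU span: %.3f ms\n", ms); // a real engine would hand this to the profiler,
                                       // correlating it to the CPU clock (the tricky part)
}
```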
15
In addition to individual frames, it's useful to see what is going on over a
span of time
For instance, maybe there is an explosion that is causing frame rate
problems
A way to approach this is to plot graphs of the different rendering layers
and see what is slow
<CLICK>
Here we can see the maximum amount of time that alpha took as this effect
played
Maybe that needs to come down. Or maybe it's not alpha but instead a spike
in skinning that is causing draw submission to stall
That's where seeing how the GPU and CPU interact is really useful
16
Telemetry is helpful, but it can't answer all the questions.
Sometimes we can only form hypotheses and need to look outside of our game to find the true cause of an issue.
GPUView is helpful for this. GPUView is a development tool from Microsoft that can show the CPU and GPU state of the machine.
This allows us to see when non-game threads are running on the processor and when work from the game is being executed on the GPU.
The difficulty is that it takes some work to correlate what the game is doing with what is being shown.
Stack sampling works OK, but doesn't give the detail we would see with instrumentation in Telemetry.
Here is a capture of a part of the game that was struggling to stay in frame.
17
18
Not a concern on consoles because we can compile shaders offline since we know the target hardware.
On PC, we load DirectX bytecode which gets compiled by the driver at runtime.
Not so much of an issue if there is a lot of free CPU time, but it caused huge stalls on min spec in Edge of Nowhere.
That game was particularly affected because it streams new segments of the game, requiring new shaders, while the player is walking through tunnels.
This means any frame hitch is noticeable.
19
The workaround was to create all the shaders the game would need when it started.
The compilation gets handled by a background driver thread and doesn't cause a stall unless a shader is needed before its compilation has finished.
We only wait for the needed shader to compile, not all of them.
The start of the game fades in some splash screens and then the menu.
Since the screen is essentially black when we start drawing those and block on shader compilation, the player doesn't perceive a frame hitch.
Compilation continues in the background while the player navigates the menu and is probably mostly done by the time actual gameplay gets going.
On subsequent runs, shaders are cached by the driver so we don't need to wait for compilation.
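A minimal sketch of that warm-up in D3D11 might look like the following; ShaderBlob and the surrounding plumbing are placeholders rather than engine API:

```cpp
#include <d3d11.h>
#include <vector>
#include <wrl/client.h>

struct ShaderBlob { const void* bytecode; size_t size; }; // placeholder for loaded DXBC

// Create every pixel shader up front so the driver can start compiling right away.
// Whether compilation actually runs on background driver threads is driver dependent.
void PrewarmPixelShaders(ID3D11Device* device,
                         const std::vector<ShaderBlob>& blobs,
                         std::vector<Microsoft::WRL::ComPtr<ID3D11PixelShader>>& out)
{
    out.reserve(blobs.size());
    for (const ShaderBlob& blob : blobs)
    {
        Microsoft::WRL::ComPtr<ID3D11PixelShader> ps;
        // Creation usually returns quickly; the game only stalls later if it
        // binds a shader whose compilation hasn't finished yet.
        if (SUCCEEDED(device->CreatePixelShader(blob.bytecode, blob.size, nullptr, &ps)))
            out.push_back(ps);
    }
}
```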
This is dicey: it's driver dependent, and background compilation can be disabled in some drivers.
It worked for us.
20
We use visibility queries to determine how much of an object is visible on a given frame.
Used for things like lens flares: test the number of pixels that are visible for a shape and use that to blend in the lens flare effect.
At 30 Hz on console, we expect to be able to submit a query after we lay down the G-buffer and read it back later, after lighting, to set an intensity value before GPU submission.
Remember that I said we have tight GPU synchronization.
On console we have probably 10 ms between issuing a query and reading it back. In VR, we're looking at 1 ms.
Stalling the render thread isn't really an option because it can create periods where the GPU has no work.
The solution was to submit queries on one frame and read them back on the next frame.
That means we're one frame behind on visibility, but at 90 Hz you don't notice it.
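A minimal D3D11 sketch of that submit-now, read-next-frame pattern (the flare proxy draw is a placeholder, not engine code):

```cpp
#include <d3d11.h>

// Submit an occlusion query on frame N, read it on frame N+1 without stalling.
struct FlareQuery
{
    ID3D11Query* query;            // created with D3D11_QUERY_OCCLUSION
    UINT64       visiblePixels = 0;
};

void SubmitFlareQuery(ID3D11DeviceContext* ctx, FlareQuery& fq)
{
    ctx->Begin(fq.query);
    // DrawFlareProxy(ctx);  // draw the occlusion test shape here
    ctx->End(fq.query);
}

void ReadLastFramesQuery(ID3D11DeviceContext* ctx, FlareQuery& fq)
{
    UINT64 pixels = 0;
    // DONOTFLUSH keeps this from forcing a submit; if the result isn't ready
    // we simply keep last frame's value. One frame of latency is invisible at 90 Hz.
    if (ctx->GetData(fq.query, &pixels, sizeof(pixels),
                     D3D11_ASYNC_GETDATA_DONOTFLUSH) == S_OK)
        fq.visiblePixels = pixels;
}
```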
21
This next optimization is interesting because it doesn't work on consoles.
When applying lighting, the light data is stored in a constant buffer on PC.
That's because it is about 16% faster on certain NVIDIA hardware than using a structured buffer.
On AMD hardware there doesn't seem to be much difference.
However, we store it in a structured buffer on PS4 because using a constant buffer there turned out to be significantly slower.
This is surprising given the AMD PC results.
So it may be faster to use one type of buffer over another, but to really know you need to profile.
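For reference, here is roughly what the two paths look like on the D3D11 side, with an illustrative light layout (the HLSL side would be a cbuffer array versus a StructuredBuffer of the same struct); this is a sketch, not Insomniac's code:

```cpp
#include <d3d11.h>

struct GpuLight { float posRadius[4]; float colorIntensity[4]; }; // illustrative layout

// Path A: light array in a constant buffer (the PC/NVIDIA choice described above).
// Note D3D11 constant buffers are capped at 64 KB, which bounds maxLights.
HRESULT CreateLightConstantBuffer(ID3D11Device* dev, UINT maxLights, ID3D11Buffer** out)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = maxLights * sizeof(GpuLight); // multiple of 16 bytes
    desc.Usage          = D3D11_USAGE_DYNAMIC;
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    return dev->CreateBuffer(&desc, nullptr, out);
}

// Path B: the same data as a structured buffer + SRV (analogous to the PS4 path above).
HRESULT CreateLightStructuredBuffer(ID3D11Device* dev, UINT maxLights,
                                    ID3D11Buffer** buf, ID3D11ShaderResourceView** srv)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth           = maxLights * sizeof(GpuLight);
    desc.Usage               = D3D11_USAGE_DYNAMIC;
    desc.BindFlags           = D3D11_BIND_SHADER_RESOURCE;
    desc.CPUAccessFlags      = D3D11_CPU_ACCESS_WRITE;
    desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(GpuLight);
    HRESULT hr = dev->CreateBuffer(&desc, nullptr, buf);
    if (FAILED(hr)) return hr;

    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format             = DXGI_FORMAT_UNKNOWN;            // structured buffers use UNKNOWN
    srvDesc.ViewDimension      = D3D11_SRV_DIMENSION_BUFFER;
    srvDesc.Buffer.NumElements = maxLights;
    return dev->CreateShaderResourceView(*buf, &srvDesc, srv);
}
```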
22
Generally, moving to VR meant rendering everything a second time.
It also meant turning off a lot of graphical features.
Ambient occlusion, screen space reflection and baked GI were all axed because they were too expensive.
It had to be done: our console baked lighting took 4 ms in VR. There was no way that could work.
Our baked lighting does a lot.
It stores bounced lighting and accounts for sky visibility and specular occlusion.
It allows us to bake in lights and get them essentially for free.
And it applies to static and dynamic objects equally to create a well-grounded look.
Instead, we had a single environment probe.
23
Edge of Nowhere involves going in and out of caves in Antarctica. As you
might expect, the env probe used for the surface of a glacier doesn't look
great inside a cave.
It took a lot of effort from the lighters to carefully construct transitions from
outside to inside
Given these challenges, and with two other VR titles on the horizon, it made sense to see if we could adopt our console solution somehow
24
On console, we store samples in sparse 16-meter cubes called light grids
Samples store irradiance information in 6 directions plus an occlusion plane
The occlusion plane is automatically calculated and helps avoid light leaking
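A hypothetical layout for one sample, just to make the data concrete (the actual format is Insomniac's own):

```cpp
// Hypothetical light-grid sample as described above: irradiance along six
// axis directions plus an automatically fitted occlusion plane.
struct LightGridSample
{
    float irradianceRGB[6][3]; // +X, -X, +Y, -Y, +Z, -Z bounced-light colors
    float occlusionPlane[4];   // plane (nx, ny, nz, d) used to reject leaking contributions
};
```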
25
To calculate lighting for a point, we find the 8 nearest samples, use the occlusion planes to determine the contribution of each sample, and manually interpolate the sample values.
In VR, occlusion and interpolation each cost around 1 ms. That means they are both too expensive without significant reductions somewhere else.
In addition to light grids, the console implementation is able to use multiple environment probes to add specular light.
Since we were mainly concerned with diffuse light and specular occlusion, this part of the console approach was ignored and we stuck with a single probe for specular lighting.
Even ignoring specular, the entire approach was very expensive and it became clear we needed to do something very different.
26
I found that we could take two samples of a volume texture in less than 1/10th of a millisecond.
That gives us free interpolation of sample data.
However, it meant we could only have 8 channels of lighting data and couldn't use the occlusion planes.
27
Working without occlusion planes was the first challenge.
The already thick walls in Edge of Nowhere helped avoid light leaking, which solved part of the problem.
Still, without occlusion planes, if we were to take samples on a uniform grid, the result would be discolored patches from samples that are embedded in geometry.
The solution was to move samples outside of geometry and slightly away from surfaces.
The red spheres in this picture represent sample points on the uniform grid that were too close to geometry.
If you look closely, there is a white line connecting each red sphere to a different sphere which represents the new sample location and values.
It does help to keep samples that are deeply embedded dark, though. This provides good darkening in cases like shrubs and rock piles.
28
With a method in place to deal with embedded samples, the next issue was compressing the data to fit into a limited number of textures.
In Edge of Nowhere, indirect light tended to vary in direction, but not so much in color.
We took advantage of this by converting the RGB irradiance samples into a luma-chroma format.
We then store 6 directions of luminance and an average color in every sample.
That means the data can be fetched and interpolated using two texture reads.
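A sketch of what that packing could look like, assuming Rec. 601-style luma weights and a chroma-difference encoding; both are assumptions, not the shipped format:

```cpp
// Illustrative packing of one sample into 8 channels (two RGBA volume texels):
// six directional luminance values plus one shared average chroma pair.
struct PackedSample { float texel0[4]; float texel1[4]; };

static float Luma(const float rgb[3])
{
    return 0.299f * rgb[0] + 0.587f * rgb[1] + 0.114f * rgb[2]; // assumed weights
}

PackedSample PackIrradiance(const float dirRGB[6][3])
{
    PackedSample out = {};
    float avg[3] = {0.0f, 0.0f, 0.0f};
    for (int d = 0; d < 6; ++d)
        for (int c = 0; c < 3; ++c)
            avg[c] += dirRGB[d][c] / 6.0f;

    float* ch[8] = { &out.texel0[0], &out.texel0[1], &out.texel0[2], &out.texel0[3],
                     &out.texel1[0], &out.texel1[1], &out.texel1[2], &out.texel1[3] };

    // Channels 0-5: luminance along each of the six directions.
    for (int d = 0; d < 6; ++d)
        *ch[d] = Luma(dirRGB[d]);

    // Channels 6-7: chroma of the average color, shared by all directions.
    float avgLuma = Luma(avg);
    *ch[6] = avg[2] - avgLuma; // blue-difference
    *ch[7] = avg[0] - avgLuma; // red-difference
    return out;
}
```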
29
Two large volume textures store lighting data within about 100 meters of the camera and are updated as the player moves around.
A default lighting value is applied to areas outside the volume covered by the textures.
Sampling, unpacking and fading to the default value takes 1/10 of a millisecond on the min spec hardware.
30
Here is the difference between console and VR.
Side by side the difference is noticeable, though it's not bad.
Of course, this is game development: although this baked lighting was developed for Edge of Nowhere, the game didn't actually ship with it because the tech was a side project and came online too late in production for people to feel comfortable switching.
Feral Rites and The Unspoken have used it and it's worked out well for them.
31
Thanks Bob!
Let’s dig deeper into how we approached performance analysis on The
Unspoken.
Before we look at the execution timeline flow in VR, let's quickly go over platform differences in the way render presentation and thread interaction are handled.
32
As Bob mentioned, regardless of the platform, the main thread (MT) is ahead of the render submission thread (RS) by a frame.
The low-level rendering APIs on consoles allow for very tight control between RS and the GPU, and consequently MT.
(adv)
After RS finishes submitting all the GPU commands, it doesn't wake up MT immediately.
It waits for the GPU to let it know that it's reached the post-processing stage.
Once that happens, RS wakes MT up and work for the next frame begins just-in-time.
This minimizes input latency and does come at the risk of starving the GPU, but since the hardware is fixed, it can be tweaked almost to perfection.
(adv)
Present simply inserts a flag and doesn't block.
33
On PC, the display flip functionality is handled by DXGI and the display driver, and we have a layer of abstraction to interact with it.
(adv)
Depending on the arguments to Present and the maximum frame latency, it can block to prevent the CPU from getting too far ahead of the GPU (and keep input latency in check).
There isn't a non-intrusive way to kick off work depending on where the GPU is. What this leads to is less control over when exactly MT needs to start working.
(adv)
PC VR adds another dependency.
The Oculus/SteamVR runtime throttles the app to the HMD refresh rate by blocking on frame submission. Engine devs need to be aware of this! There isn't a polling option.
Buffering extra frames is a no-no unless we throw frames away.
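To make the blocking concrete, here is a minimal sketch of an end-of-frame submission against the 1.x LibOVR API linked below (layer setup and error handling omitted; this is not the engine's actual code):

```cpp
#include <OVR_CAPI.h>
#include <dxgi.h>

// Sketch only: the eye layer is assumed to have been filled in elsewhere.
void EndOfFrameSubmit(ovrSession session, long long frameIndex,
                      const ovrLayerEyeFov& eyeLayer, IDXGISwapChain* mirrorSwapChain)
{
    // Non-blocking mirror present for the desktop window (sync interval 0).
    mirrorSwapChain->Present(0, 0);

    // This is the call that throttles the app to the HMD refresh rate: it can
    // block for several milliseconds, so nothing latency-critical should sit
    // behind it on the same thread.
    ovrLayerHeader const* layers[] = { &eyeLayer.Header };
    ovr_SubmitFrame(session, frameIndex, nullptr, layers, 1);
}
```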
Further reading:
DXGI flip vs bitblt models: https://msdn.microsoft.com/en-
us/library/windows/desktop/hh706346%28v=vs.85%29.aspx
Ovr_submitFrame: https://developer3.oculus.com/doc/1.3.0-
libovr/_o_v_r___c_a_p_i_8h.html#aa84f958b8d78f2594adf6fb0ad372558
34
Here's a simplified timeline view of how the engine execution flows in VR.
(talk about the layout)
We'll focus on the MT and RS threads on the CPU and see how one frame's worth of work flows.
(adv)
MT starts things off with sim and render prep work for frame N, during which RS is submitting draws for frame N-1.
RS does the mirror present (immediate mode) and is blocked by ovr_SubmitFrame.
(adv)
When it regains execution, it wakes up MT, updates gfx resources, submits non-skinned objects and waits on the MT/job threads to finish generating the skinned vertex buffer data.
(adv)
The GPU is mostly idle until the skinned objects are submitted. And we want to fix this!
(adv)
Once all the render work has been submitted, RS does the mirror present (non-blocking) and calls ovr_SubmitFrame, which blocks.
(adv)
The Oculus runtime wakes up a few ms before HMD vsync and submits the post-process (timewarp, barrel distortion, chromatic aberration) work on the compute queue.
(adv)
35
Over the next vblank interval, this data is streamed to the display and finally
shows up on the next vsync.
35
To summarize the dependencies that we need to address:
1) The Main thread needn't wait for the Render Submission thread to be unblocked before starting work on the next frame
- This also stops the Render Submission thread from waiting on the Main thread to update skinned vertices
- It adds to the input/motion-to-photon latency, but timewarp does a good enough job to make it unnoticeable.
2) The diagram before didn't show this, but we'll talk about it in a bit.
36
(walk through animations)
Notice that we didn't make any improvements to the rendering itself. Fixing system-level bottlenecks is generally the biggest bang for the buck.
37
Insomniac Games' engine uses an in-house CPU occlusion culling system based on a downsampled depth buffer.
On console, with the ~30 ms frame budget, we could read back the downsampled depth buffer of the current GPU frame after the G-buffer pass (while the GPU is doing other work) for MT to use for occlusion culling objects in the next frame.
In VR, this isn't possible at all without causing the GPU to starve or MT to wait too long.
So, the deferred occlusion system ended up using the two-frame-old depth buffer for culling.
Worth noting that since this is VR and each eye is treated as a separate camera, this is done once per eye.
38
The main thread reprojects the downsampled depth buffer, fills holes,
creates a mipchain and uses the latter for occlusion queries.
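The kind of test that mip chain enables looks roughly like this (an illustrative hierarchical-depth check, not the engine's implementation); each mip stores the farthest depth per texel so a conservative visibility decision only touches a few texels:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One mip of the downsampled depth pyramid, storing the FARTHEST depth per texel.
struct DepthMip { int width, height; std::vector<float> maxDepth; };

// Returns true if an object whose screen-space rect is [x0,x1]x[y0,y1] (0..1 UVs)
// with nearest depth 'objDepth' (0 = near, 1 = far) could be visible.
bool CouldBeVisible(const std::vector<DepthMip>& mips,
                    float x0, float y0, float x1, float y1, float objDepth)
{
    // Pick a mip where the rect covers only a handful of texels.
    float rectTexels = std::max((x1 - x0) * mips[0].width, (y1 - y0) * mips[0].height);
    int mip = std::min((int)mips.size() - 1,
                       std::max(0, (int)std::ceil(std::log2(std::max(rectTexels, 1.0f)))));

    const DepthMip& m = mips[mip];
    int tx0 = std::max(0, (int)(x0 * m.width)),  tx1 = std::min(m.width  - 1, (int)(x1 * m.width));
    int ty0 = std::max(0, (int)(y0 * m.height)), ty1 = std::min(m.height - 1, (int)(y1 * m.height));

    for (int y = ty0; y <= ty1; ++y)
        for (int x = tx0; x <= tx1; ++x)
            if (objDepth <= m.maxDepth[y * m.width + x])
                return true; // nearer than the farthest occluder somewhere: keep it
    return false;            // every covered texel's occluders are closer: cull it
}
```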
39
40
(use animations)
Note: No need to over-engineer a solution to use one eye’s depth buffer to
do occlusion for both eyes either. MT isn’t the bottleneck here!
41
We found this issue with GPUView, and ended up looking at the event callstack on the render thread to debug it.
There are still some bubbles at the start of the GPU queue. This is why ETW-based tracing that shows the DMA packets is so useful.
Using CPU/GPU timers in Telemetry doesn't accurately portray the reality.
The remaining bubbles are because RS is updating various GPU resources at the start and not feeding the GPU.
We are going to try out deferred contexts for resource updates (UpdateSubresource style) while RS is blocked, to save the driver copy overhead and reduce GPU starvation.
42
Let's briefly talk about ASW and what the task timeline looks like when it is enabled by the Oculus runtime.
Asynchronous timewarp does a rotational reprojection of the eye buffer based on the HMD's orientation at the point the Oculus post-processing is done.
It compensates for HMD rotational movement from the time the render commands were dispatched.
Asynchronous spacewarp accounts for translation changes based on a per-pixel velocity buffer and is a mechanism that allows low-end VR rigs to render/simulate at 45 Hz while still refreshing the HMD at 90 Hz.
What happens as a result? ovr_SubmitFrame blocks for a much longer time (~15 ms vs ~5 ms without it).
Worth pointing out that the input/motion-to-photon latency is worsened because of the engine refactoring to unblock MT.
43
GPUView lets us understand the behind-the-scenes part pretty well.
It's interesting to see how much time is simply wasted because we couldn't meet the <10 ms GPU frame budget.
An argument worth making here is to provide the best possible frame quality and fall back to ASW if we simply cannot meet the 90 Hz frame update.
44
So far, we’ve talked about removing GPU bubbles; let’s look at the CPU side
of things for a bit.
45
With VR dev and its ~10 ms frame budget, there's all the more reason to use the CPU effectively.
Over the last decade, there's been a steady move to push work onto the GPU.
VR brings the onus back on the CPU to simulate, cull, etc.
(adv)
Here's a GPUView snip from an early build of the game running on an i7-6700K/GTX 980. Even with the GPU almost pegged, the net CPU utilization is < 40%.
IG's engine already used a CPU path for physics, particles and occlusion culling, yet there was so much idle CPU time up for grabs.
46
Enough with all these timeline views, don't you think?
Here's some eye candy from the game, showing the comparison between the default settings on a low-end and a high-end VR rig that allow the game to run at ~90 fps.
The quality preset is automatically set based on the underlying system config; the player is free to override it.
We'll look at the features in some more detail on the next slide.
47
With the ultra setting, you’ll see the following additions during the course of
a match:
(narrate from slides)
48
Here's a video from the level editor showing what we were going for, and what we ended up with.
You really do need a LOT of particles to show fluid-sim-like motion.
Unfortunately, the cost of the updates did not permit us to use these assets.
49
You can't just use 3D Perlin noise to move particles; you'll find that they clump after a point and don't keep flowing.
That's because the resulting potential field is smooth, but not divergence-free.
Luckily, the vector calculus operator 'curl' does the black magic for us (see the sketch after the links below).
Here are links to resources on the topic:
https://www.cs.ubc.ca/~rbridson/docs/bridson-siggraph2007-curlnoise.pdf
http://prideout.net/blog/?p=63
catlikecoding.com/unity/tutorials/noise-derivatives/
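Here is a small sketch of the idea; the potential field is stubbed with sines so the snippet stands alone, whereas a real implementation would use 3D Perlin or simplex noise (ideally with analytic derivatives instead of finite differences):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Stand-in vector potential; a real implementation would sample three
// independent channels of 3D Perlin/simplex noise here.
static Vec3 Potential(float x, float y, float z)
{
    return { std::sin(y * 1.7f + z), std::sin(z * 2.3f + x), std::sin(x * 1.3f + y) };
}

// Curl of the potential field via central differences. The resulting velocity
// field is divergence-free, which is what keeps particles from clumping.
Vec3 CurlNoise(float x, float y, float z)
{
    const float e = 1e-2f;
    Vec3 dx0 = Potential(x - e, y, z), dx1 = Potential(x + e, y, z);
    Vec3 dy0 = Potential(x, y - e, z), dy1 = Potential(x, y + e, z);
    Vec3 dz0 = Potential(x, y, z - e), dz1 = Potential(x, y, z + e);
    float inv2e = 1.0f / (2.0f * e);

    return {
        ((dy1.z - dy0.z) - (dz1.y - dz0.y)) * inv2e,  // dPz/dy - dPy/dz
        ((dz1.x - dz0.x) - (dx1.z - dx0.z)) * inv2e,  // dPx/dz - dPz/dx
        ((dx1.y - dx0.y) - (dy1.x - dy0.x)) * inv2e   // dPy/dx - dPx/dy
    };
}
```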
50
So, how complex was it to add curl into the engine and VFX toolchain?
… existing engine concepts …
In terms of authoring, it's super easy to make crazy cool-looking stuff, since it's just a few sliders being pushed around.
However, it's a very unintuitive process right now. There are no flow fields in the editor to show what an effect may look like.
51
So, what does it perform like? ~1 ms for 1000 particles.
Adds 4-5 ms of total work per frame, 2-3 ms in net cost due to tasking.
Not performant enough for a massive fluid-like particle sim.
The implementation is currently:
FP ops bound
Limited by the particle update system
..
52
53
Before we look at performance stats on different settings/hardware, let's revisit the importance of measurement-based development.
Having multiple lenses to look at your game's performance is key to making informed decisions to improve the game.
While most VR titles use Unity and Unreal, having a native in-house engine allows for immense fine-tuning.
Insomniac's in-house engine is a great example of what a performance-driven dev approach can yield.
Let's revisit the profiling options available.
54
Here's comparison data from our lab using the VR stat logging system in the engine on the Lockport bridge arena.
The table shows data for an i5-4590 + GTX 970 / 980 and an i7-6700K + GTX 980 at various settings.
Two main observations:
- The game drops fewer frames on a better CPU even when it's predominantly GPU limited
- ASW disabled yields far fewer dropped frames than ASW auto. The reason for this isn't clear yet. It is possible that the ovrPerfStats field was interpreted incorrectly when ASW is on.
55
In conclusion,
1) Engine dependencies need to be looked into for VR, and hopefully decoupled more. It can help the non-VR path as well in the process!
2) Profiling, instrumentation and knowledge of different tools help detect and reason about bottlenecks in the execution.
3) Don't discount the CPU! It is tremendously powerful, and can shave off quite a bit of GPU time and help improve immersion.
And lest you forget: unused silicon is sad silicon!
56
Shoutout to Shaun, Yancy and Abdul [IG], and Cristiano, ChrisK, Dave and Brian [Intel].
Thank you for listening; we hope you found this useful.
Please spare a few minutes to rate us.
If you have any questions, please step up to the mic while we show you some wonderful legal disclaimers.
57
58
59
60
61