The filtered and culled
Visibility Buffer
Wolfgang Engel
Confetti
October 13th, 2016
Demo
Table of contents
• The Visibility Buffer
• Cluster Culling / Triangle Filtering
• Re-using triangle filtered results for multiple rendering passes
The Visibility Buffer
Motivation
• Forward rendering shades all fragments in triangle-submission order
• Wastes rendering power on pixels that don’t contribute to the final image
• Deferred shading solves this problem in 2 steps:
• First, surface attributes are stored in screen buffers -> G-Buffer
• Second, shading is computed for visible fragments only
• However, deferred shading increases memory bandwidth
consumption:
• Screen buffers for: normal, depth, albedo, material ID,…
• G-Buffer size becomes challenging at high resolutions
High-Level View 2009
G-Buffer (taken from [Engel2009] Killzone 2 layout)
• Render targets: Specular / Motion Vec (8:8:8:8), Normals (8:8:8:8), Albedo / Shadow (8:8:8:8), Depth Buffer (32-bit)
• Render opaque objects -> Deferred Lighting
• Transparent objects -> Sort Back-To-Front, switch off depth write -> Forward Rendering
High-Level View 2014
G-Buffer – Frostbite Engine [Lagarde]
10:10:10:2
8:8:8:8
8:8:8:8
11:11:10
Depth - R32G8X24_TYPELESS
High-Level View 2014
G-Buffer – Frostbite Engine [Lagarde]
• Render targets: Normals etc. (10:10:10:2), BaseColor etc. (8:8:8:8), MetalMask etc. (8:8:8:8), IBL / GI (11:11:10), Depth Buffer (32-bit)
• Render opaque objects -> Deferred Lighting
• Transparent objects -> Sort Back-To-Front, switch off depth write -> Forward Rendering
High-Level View
Visibility Buffer (similar to [Burns][Schied])
• Render targets: Visibility (8:8:8:8), Depth Buffer
• Visibility encodes
- 1-bit alpha mask
- 8-bit drawID
- 23-bit triangleID / primID
• Vertex Buffer holds opaque and transparent objects
• Opaque objects -> Lighting
• Transparent objects -> Sort Back-To-Front -> Forward Rendering
Visibility Buffer PC 1080p – Memory
Memory | Description | NoMSAA | 2xMSAA | 4xMSAA
Visibility Buffer | keeps address of each triangle in 32 bits per pixel, 4 bytes * 1920 * 1080 | 7.9 MB | 15.8 MB | 31.6* MB
Depth Buffer | 4 bytes * 1920 * 1080 | 7.9 MB | 15.8 MB | 31.6* MB
Hierarchical Z | 4 bytes * 1920 * 1080 * 1/64 | 0.12 MB | - | -
Vertex Buffer | 28 bytes per vertex + padding (layout below) | Does not increase with resolution.
Textures | | 21* MB | - | -
Draw arguments, Uniform, Descriptors etc. | | 2* MB | - | -
Overall | | 38.80* MB | 54.60* MB | 86.20* MB
* rough estimate; driver might increase it
Vertex Buffer layout:
float3 position;
uint normal;
uint tangent;
uint texCoord;
uint materialID;
uint pad; // to align to 128 bits – 32 byte for NVIDIA
G-Buffer PC 1080p – Memory
Memory | Description | NoMSAA | 2xMSAA | 4xMSAA
Normals | 10:10:10:2 - 4 bytes * 1920 * 1080 | 7.9 MB | 15.8* MB | 31.6* MB
PBR | 8:8:8:8 - 4 bytes * 1920 * 1080 | 7.9 MB | 15.8* MB | 31.6* MB
Albedo | 8:8:8:8 - 4 bytes * 1920 * 1080 | 7.9 MB | 15.8* MB | 31.6* MB
GI / IBL | 11:11:10 - 4 bytes * 1920 * 1080 | 7.9 MB | 15.8* MB | 31.6* MB
Depth | 8:8:8:8 - 4 bytes * 1920 * 1080 | 7.9 MB | 15.8* MB | 31.6* MB
Hierarchical Z | 4 bytes * 1920 * 1080 * 1/64 | 0.12 MB | - | -
Draw arguments etc. | | 2* MB | - | -
Overall | | 41.62* MB | 81.12* MB | 160.12* MB
* rough estimate; driver might increase it
Visibility Buffer PC 4k – Memory
Memory | Description | NoMSAA | 2xMSAA | 4xMSAA
Visibility Buffer | keeps address of each triangle in 32 bits per pixel, 4 bytes * 3840 * 2160 | 31.64 MB | 63.28* MB | 126.56* MB
Depth Buffer | 4 bytes * 3840 * 2160 | 31.64 MB | 63.28* MB | 126.56* MB
Hierarchical Z | 4 bytes * 3840 * 2160 * 1/64 | 0.49 MB | - | -
Vertex Buffer | 28 bytes per vertex + padding (layout below) | Does not increase with resolution.
Textures | | 21* MB | - | -
Draw arguments, Uniform, Descriptors etc. | | 2* MB | - | -
Overall | | 86.77 MB | 150.05 MB | 276.61 MB
* rough estimate; driver might increase it
Vertex Buffer layout:
float3 position;
uint normal;
uint tangent;
uint texCoord;
uint materialID;
uint pad; // to align to 128 bits – 32 byte for NVIDIA
G-Buffer PC 4k – Memory
Memory | Description | NoMSAA | 2xMSAA | 4xMSAA
Normals | 10:10:10:2 - 4 bytes * 3840 * 2160 | 31.64 MB | 63.28* MB | 126.56* MB
PBR | 8:8:8:8 - 4 bytes * 3840 * 2160 | 31.64 MB | 63.28* MB | 126.56* MB
Albedo | 8:8:8:8 - 4 bytes * 3840 * 2160 | 31.64 MB | 63.28* MB | 126.56* MB
GI / IBL | 11:11:10 - 4 bytes * 3840 * 2160 | 31.64 MB | 63.28* MB | 126.56* MB
Depth | 8:8:8:8 - 4 bytes * 3840 * 2160 | 31.64 MB | 63.28* MB | 126.56* MB
Hierarchical Z | 4 bytes * 3840 * 2160 * 1/64 | 0.49 MB | - | -
Draw arguments etc. | | 2* MB | - | -
Overall | | 160.69* MB | 318.89* MB | 635.29* MB
* rough estimate; driver might increase it
Visibility Buffer XOne 1080p – Memory
ESRAM | Description | NoMSAA | 2xMSAA
Visibility Buffer | keeps address of each triangle in 32 bits per pixel, 4 bytes * 1920 * 1080 | 7.9 MB | 15.8 MB
Depth Buffer | 4 bytes * 1920 * 1080 | 7.9 MB | 15.8 MB
Hierarchical Z | 4 bytes * 1920 * 1080 * 1/64 | 0.12 MB | -
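The per-target figures in the memory tables follow from bytes per pixel * width * height * MSAA samples. The sketch below reproduces that arithmetic; the function names are hypothetical and the "MB" values are assumed to mean 1024 * 1024 bytes, which is what matches the 7.9 and 31.64 figures. Driver padding and Hierarchical Z are not modelled.

```cpp
#include <cstdint>

// Size in MiB (1024 * 1024 bytes) of one screen-sized render target.
double renderTargetMiB(std::uint32_t width, std::uint32_t height,
                       std::uint32_t bytesPerPixel, std::uint32_t sampleCount) {
    const double bytes =
        static_cast<double>(width) * height * bytesPerPixel * sampleCount;
    return bytes / (1024.0 * 1024.0);
}

// Visibility Buffer path: one visibility target plus one depth target,
// 4 bytes per pixel each.
double visibilityTargetsMiB(std::uint32_t width, std::uint32_t height,
                            std::uint32_t sampleCount) {
    return 2.0 * renderTargetMiB(width, height, 4, sampleCount);
}

// G-Buffer path as listed in the tables: normals, PBR, albedo, GI / IBL
// and depth, 4 bytes per pixel each.
double gBufferTargetsMiB(std::uint32_t width, std::uint32_t height,
                         std::uint32_t sampleCount) {
    return 5.0 * renderTargetMiB(width, height, 4, sampleCount);
}
```

For 3840 x 2160 with 4x MSAA this gives roughly 253 MiB of screen-sized targets for the Visibility Buffer against roughly 633 MiB for the G-Buffer, in line with the tables.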
Visibility Buffer – Filling pseudocode
• Visibility Buffer generation step
• For each pixel on screen:
• Pack (alpha masked bit, drawID, primitiveID) into one 32-bit UINT
• Write that into a screen-sized buffer
• The tuple (alpha masked bit, drawID, primitiveID) will allow a shader to access the
triangle data in the shading step
Visibility Buffer – Shader code
// in visibilityPass.shd
// Packs the alpha-mask bit (bit 31), drawID (bits 23-30) and primitiveID (bits 0-22) into one uint
uint calculateOutputVBID(bool opaque, uint drawID, uint primitiveID){
    uint drawID_primID = ((drawID << 23) & 0x7F800000) | (primitiveID & 0x007FFFFF);
    if (opaque)
        return drawID_primID;
    else
        return (1u << 31) | drawID_primID;
}
// Splits the packed uint into four 8-bit channels so it can be written to an 8:8:8:8 render target
float4 unpackUnorm4x8(uint p){
    return float4(float(p & 0x000000FF) / 255.0, float((p & 0x0000FF00) >> 8) / 255.0,
                  float((p & 0x00FF0000) >> 16) / 255.0, float((p & 0xFF000000) >> 24) / 255.0);
}
cbuffer RootConstants : register(b0, space0){
    uint DrawID;
};
float4 main(uint primitiveID : SV_PrimitiveID) : SV_Target{
    return unpackUnorm4x8(calculateOutputVBID(true, DrawID, primitiveID));
}
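The shading pass has to reverse this packing to get back the tuple (alpha masked bit, drawID, primitiveID). A minimal CPU-side sketch of the round trip, assuming the same bit layout as calculateOutputVBID; the names packVisibilityID, unpackVisibilityID and VisibilityID are hypothetical and not taken from the demo source:

```cpp
#include <cstdint>

struct VisibilityID {
    bool alphaMasked;          // bit 31
    std::uint32_t drawID;      // bits 23..30 (8 bits)
    std::uint32_t primitiveID; // bits 0..22 (23 bits)
};

// Same layout as calculateOutputVBID in the shader above.
std::uint32_t packVisibilityID(bool opaque, std::uint32_t drawID,
                               std::uint32_t primitiveID) {
    const std::uint32_t packed =
        ((drawID << 23) & 0x7F800000u) | (primitiveID & 0x007FFFFFu);
    return opaque ? packed : (0x80000000u | packed);
}

// Inverse operation, as the shading pass would need it.
VisibilityID unpackVisibilityID(std::uint32_t packed) {
    return { (packed >> 31) != 0u,
             (packed >> 23) & 0xFFu,
             packed & 0x007FFFFFu };
}
```

With 8 bits for the drawID and 23 bits for the primitiveID, this layout can address 256 draws of up to 8,388,608 triangles each.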
Visibility Buffer – Shading pseudocode
• For each pixel in screen-space we do:
1. Get drawID/triangleID at pixel pos
2. Load data for the 3 vertices from the VB
3. Compute the partial derivatives of the barycentric coordinates – triangle gradients
i. Compute x and y value of each vertex of a triangle in projected screen-space
ii. Compute Barycentric Coordinates
iii. Compute partial derivatives of the barycentric coordinates
4. Interpolate vertex attributes at pixelpos using gradients (could do triangle / object-space lighting)
a) Attribs use w from position to compute perspective correct interpolation
b) MVP matrix is applied to position
5. We have all data ready: shade and calculate final color
Visibility Buffer – Shading pseudocode
Barycentric coordinates λ1, λ2, λ3 on an equilateral triangle
and on a right triangle
Visibility Buffer – Shading pseudocode
Following [Schied2015] Appendix A Equation (4):
/* taken from computeDerivaties.shd
void computeBarycentricGradients(float2 v[3], out float3 db_dx, out float3 db_dy){
    float d = 1.0 / determinant(float2x2(v[2] - v[1], v[0] - v[1]));
    db_dx = float3(v[1].y - v[2].y, v[2].y - v[0].y, v[0].y - v[1].y) * d;
    db_dy = float3(v[2].x - v[1].x, v[0].x - v[2].x, v[1].x - v[0].x) * d;
}*/
// in shadeSun.shd
float3 one_over_w = 1.0 / float3(position0.w, position1.w, position2.w);
// projected vertices
float2 pos_scr[3] = {position0.xy * one_over_w.x, position1.xy * one_over_w.y, position2.xy * one_over_w.z};
float3 db_dx, db_dy;
// gradient barycentric coords x/y
computeBarycentricGradients(pos_scr, db_dx, db_dy);
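To check the gradient formula on the CPU, the function can be ported almost line by line. The sketch below is such a port with small hypothetical Float2 / Float3 helper types; barycentricsAt is an added illustration (not in the demo) that rebuilds the barycentric coordinates at a point by starting from (1, 0, 0) at vertex 0 and stepping along the gradients.

```cpp
struct Float2 { float x, y; };
struct Float3 { float x, y, z; };

// Port of computeBarycentricGradients: screen-space derivatives of the
// three barycentric coordinates of the projected triangle v.
void computeBarycentricGradients(const Float2 v[3], Float3& db_dx, Float3& db_dy) {
    // determinant of the 2x2 matrix with rows (v2 - v1) and (v0 - v1)
    const float det = (v[2].x - v[1].x) * (v[0].y - v[1].y)
                    - (v[2].y - v[1].y) * (v[0].x - v[1].x);
    const float d = 1.0f / det;
    db_dx = { (v[1].y - v[2].y) * d, (v[2].y - v[0].y) * d, (v[0].y - v[1].y) * d };
    db_dy = { (v[2].x - v[1].x) * d, (v[0].x - v[2].x) * d, (v[1].x - v[0].x) * d };
}

// Barycentric coordinates at point p: (1, 0, 0) at v[0] plus the offset
// from v[0] multiplied by the gradients.
Float3 barycentricsAt(const Float2 v[3], Float2 p) {
    Float3 db_dx, db_dy;
    computeBarycentricGradients(v, db_dx, db_dy);
    const float dx = p.x - v[0].x;
    const float dy = p.y - v[0].y;
    return { 1.0f + dx * db_dx.x + dy * db_dy.x,
                    dx * db_dx.y + dy * db_dy.y,
                    dx * db_dx.z + dy * db_dy.z };
}
```

The three gradient components sum to zero in each direction, so the coordinates keep summing to one wherever they are evaluated.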
Visibility Buffer – Shading pseudocode
• For each pixel in screen-space we do:
1. Get drawID/triangleID at pixel pos
2. Load data for the 3 vertices from the VB
3. Compute the partial derivatives of the barycentric coordinates – triangle gradients
4. Interpolate vertex attributes at pixelpos using gradients (could do triangle /
object-space lighting)
a) Attribs use w from position to compute perspective correct interpolation
b) MVP matrix is applied to position
5. We have all data ready: shade and calculate final color
Visibility Buffer – Shading pseudocode
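// in shadeSun.shd
// Evaluates a per-vertex attribute at offset d from vertex 0 using the barycentric gradients
float3 interpolateAttribute(float3x3 attributes, float3 db_dx, float3 db_dy, float2 d){
    float3 attribute_x = mul(db_dx, attributes);
    float3 attribute_y = mul(db_dy, attributes);
    float3 attribute_s = attributes[0];
    return (attribute_s + d.x * attribute_x + d.y * attribute_y);
}
// offset of the pixel from the first projected vertex
float2 d = In.screenPos - pos_scr[0];
// interpolate 1/w linearly in screen space, then invert
// (one_over_w is a float3, so this call relies on a scalar overload of interpolateAttribute that is not shown here)
float w = 1.0 / interpolateAttribute(one_over_w, db_dx, db_dy, d);
// reconstruct clip-space z from w, then transform back with the inverse view-projection matrix
float z = w * projection[2][2] + projection[2][3];
float3 position = mul(invVP, float4(In.screenPos * w, z, w)).xyz;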
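Step 4a, perspective-correct interpolation, can be sketched as follows: 1/w and attribute/w are both linear in screen space, so each is interpolated with the gradients and the result is multiplied by the recovered w. This is a minimal scalar version under that assumption; interpolateScalar and interpolatePerspective are hypothetical names, and the demo's own attribute code is not shown on the slides.

```cpp
struct Float2 { float x, y; };
struct Float3 { float x, y, z; };

static float dot3(Float3 a, Float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Linear screen-space interpolation of one scalar per vertex (a.x, a.y, a.z),
// evaluated at offset d from the first projected vertex.
float interpolateScalar(Float3 a, Float3 db_dx, Float3 db_dy, Float2 d) {
    return a.x + d.x * dot3(a, db_dx) + d.y * dot3(a, db_dy);
}

// Perspective-correct value of a per-vertex scalar attribute.
// oneOverW holds 1/w of the three vertices.
float interpolatePerspective(Float3 attribute, Float3 oneOverW,
                             Float3 db_dx, Float3 db_dy, Float2 d) {
    const float w = 1.0f / interpolateScalar(oneOverW, db_dx, db_dy, d);
    const Float3 attributeOverW = { attribute.x * oneOverW.x,
                                    attribute.y * oneOverW.y,
                                    attribute.z * oneOverW.z };
    return w * interpolateScalar(attributeOverW, db_dx, db_dy, d);
}
```

When all three vertices share the same w this reduces to plain linear interpolation; when they differ, the result is pulled towards the vertex that is closer to the viewer.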
Visibility Buffer – Shading pseudocode
• For each pixel in screen-space we do:
1. Get drawID/triangleID at pixel pos
2. Load data for the 3 vertices from the VB
3. Compute the partial derivatives of the barycentric coordinates – triangle gradients
4. Interpolate vertex attributes at pixelpos using gradients (could do triangle /
object-space lighting)
a) Attribs use w from position to compute perspective correct interpolation
b) MVP matrix is applied to position
5. We have all data ready: shade and calculate final color
-> we skip the shading source code here … just Blinn-Phong
Visibility Buffer - Benefits
• Better decouples visibility from shading
• Calculating derivatives can be done separately from the shading phase (at the moment we include it in the shading phase)
• We can then shade at a different frequency or quality
• Improves memory efficiency
• Improves cache utilization
• Memory accesses are highly coherent
-> high cache hit rates
• A G-Buffer needs to store data per screen-space pixel. Compared to a vertex / index buffer some
of this data is redundant
-> we can see 99% L2 cache hits for the Visibility Buffer for textures, vertex and index buffers
Visibility Buffer - Benefits
• Stores less data for complex lighting models like e.g. PBR compared to a
G-Buffer
• PBR data for Visibility Buffer is a struct in constant memory indexed by material id in
vertex structure
• This struct holds indices into a texture array for various PBR textures
• It also holds per-material descriptions of what is necessary to drive BRDF
• Any data that changes per-pixel is stored in textures that are referenced by the struct
• Decouples G-buffer footprint from screen resolution
• Improves performance at high resolutions: 2K, 4K, MSAA …
• Improves performance on bandwidth-limited platforms
Visibility Buffer – Triangle Counts
Triangles rendered after culling
• 1.87 Million triangles main view
• 2.40 Million triangles shadow map
San Miguel Scene
• 8 Million Triangles
• 5 Million Vertices
Modern Game in 2016
1.8 Million triangles combined in Ultra
How do you do lighting?
• We calculate the lighting for each pixel from the triangle
data; the pixel shader therefore re-implements the whole
rendering pipeline, incl. the vertex shader, primitive assembly
and rasterizer
• Because all the triangle data is available, Object-Space lighting
should be possible
-> the idea is that the whole triangle is lit in object-space and
the result is cached
What about Tessellation?
• Following [Wihlidal][Brainerd] this is a pre-step before
or in parallel to Triangle filtering
Why didn’t we implement this earlier?
• Two recent developments made the Visibility Buffer
more attractive compared to a G-Buffer
• DirectX 12 / Vulkan with ExecuteIndirect or multi draw indirect
• Triangle culling / filtering [EDGE][Chajdas] [Wihlidal]
… see the next slides …
Cluster Culling / Triangle filtering
Motivation
• Polygonal complexity of games increases every year
• Efficient triangle removal is an important aspect
• 2 culling stages
• Cluster culling : cull groups of triangles before sending them to the GPU
(following [Chajdas] on the CPU; [Wihlidal] on GPU)
• Triangle Filtering: cull individual triangles after being sent to the GPU
Cluster culling
• Groups triangles in small chunks of
256 triangles with similar
orientations
• Chunks have a model matrix
associated (they can be moved
around)
• Each chunk must pass a quick
visibility test before being sent to
the GPU:
• Cone test
Cone test for fast cluster culling
(figure sequence: a triangle cluster, its normals and the resulting exclusion volume)
• If the eye is in the safe area (the exclusion volume) then we can NOT see any triangle of the cluster because they are all back-facing
• Building the exclusion volume:
1. First, locate the center of the cluster
2. Then, negatively accumulate the first normal starting at the cluster center
3. Negatively accumulate the second normal after the first one
4. Accumulate the next one
5. After accumulating the last one we have the starting point of the exclusion volume and the direction
6. The most restrictive triangle planes are used to calculate the cone open angle
• This is the calculated exclusion volume: if the eye is in this area then the eye can NOT see any triangle
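Once the exclusion volume is known, the per-cluster test is a point-in-cone check. The sketch below assumes the cone is stored as an apex, a unit-length axis and the cosine of its half angle; the ExclusionCone type and eyeInsideExclusionCone are hypothetical names, and how the demo stores its cones in TriangleFiltering.h is not shown on the slides.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

struct ExclusionCone {
    Vec3 apex;          // starting point of the exclusion volume
    Vec3 axis;          // unit-length direction of the exclusion volume
    float cosHalfAngle; // cosine of the cone's half opening angle
    bool valid;         // false if no exclusion volume could be calculated
};

// Returns true if the eye lies inside the exclusion cone, i.e. every
// triangle of the cluster is back-facing and the cluster can be skipped.
bool eyeInsideExclusionCone(const ExclusionCone& cone, Vec3 eye) {
    if (!cone.valid) {
        return false; // invalid cluster: never cull, just pass it on
    }
    const Vec3 toEye = { eye.x - cone.apex.x, eye.y - cone.apex.y, eye.z - cone.apex.z };
    const float len = std::sqrt(toEye.x * toEye.x + toEye.y * toEye.y + toEye.z * toEye.z);
    if (len == 0.0f) {
        return false; // eye exactly at the apex: be conservative
    }
    const float cosToAxis =
        (toEye.x * cone.axis.x + toEye.y * cone.axis.y + toEye.z * cone.axis.z) / len;
    return cosToAxis >= cone.cosHalfAngle;
}
```

Returning false for invalid clusters keeps the test conservative: a cluster is only skipped when it is provably back-facing.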
Cluster Culling
• Effectiveness depends on the orientation of the faces
in the cluster
• The more similar the orientations, the bigger the exclusion /
culling volume
• For some clusters the exclusion volume cannot be
calculated
• Invalid cluster for cluster culling -> just pass it!
(figure: invalid cluster -> cluster culling not possible)
Cluster Culling
• Code can be found in TriangleFiltering.h
• CreateClusters() // creates triangle clusters
• AddRequest() // tests the triangle clusters
Compute-based triangle filtering
• Motivation:
• Filter triangles before they go into the graphics pipeline
• Use the unused compute units during graphics pipeline execution with async
compute
• Compute-based filter -> one triangle per thread
• Degenerate triangle culling
• Back-face culling
• Frustum culling
• Small primitives culling
• Depth culling (requires coarse depth buffer)(not in this demo)
• Triangle indices that pass these tests are appended to index buffer
Compute-based triangle filtering
• Degenerate triangle culling
• Culls invisible zero-area triangles
• Cost: quick test (discard if at least two triangle indices are equal)
• Effectiveness: low
// in filterTriangles.shd
#if ENABLE_CULL_INDEX
    if ( indices[0] == indices[1]
        || indices[1] == indices[2]
        || indices[0] == indices[2])
    {
        cull = true;
    }
#endif
Compute-based triangle filtering
• Back-face culling
• Culls triangles that face away from the viewer
• If tessellation is used, the max patch height must be taken into account
• Cost: calculate the determinant of a 3x3 matrix [Olano] (homogeneous 2D
coordinates)
• Effectiveness: high (potentially cull 50% of the geometry)
Compute-based triangle filtering
// in filterTriangles.shd
#if ENABLE_CULL_BACKFACE
    // Culling in homogeneous coordinates
    // Read: "Triangle Scan Conversion using 2D Homogeneous Coordinates"
    // by Marc Olano, Trey Greer
    // http://www.cs.unc.edu/~olano/papers/2dh-tri/2dh-tri.pdf
    float3x3 m = float3x3(vertices[0].xyw, vertices[1].xyw, vertices[2].xyw);
    if (cullBackFace)
        cull = cull || (determinant(m) > 0);
#endif
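The same determinant test can be tried on the CPU. This sketch expands the 3x3 determinant of the (x, y, w) rows by hand and keeps the shader's sign convention (determinant > 0 means culled); which winding that corresponds to depends on the projection and winding conventions of the application, so treat the sign as an assumption. The helper names are hypothetical.

```cpp
struct Float4 { float x, y, z, w; };

// Determinant of the 3x3 matrix whose rows are the (x, y, w) components
// of the three clip-space vertices.
float determinantXYW(const Float4 v[3]) {
    return v[0].x * (v[1].y * v[2].w - v[1].w * v[2].y)
         - v[0].y * (v[1].x * v[2].w - v[1].w * v[2].x)
         + v[0].w * (v[1].x * v[2].y - v[1].y * v[2].x);
}

// Same sign convention as the shader: a positive determinant is culled.
bool isBackFacing(const Float4 v[3]) {
    return determinantXYW(v) > 0.0f;
}
```

Because the test works on homogeneous coordinates, no perspective divide is needed and triangles with vertices behind the camera are still classified consistently.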
Compute-based triangle filtering
• Near Plane clipping -> triangles behind the near clipping
plane of the frustum need to be clipped
// in filterTriangles.shd
#if ENABLE_CULL_FRUSTUM || ENABLE_CULL_SMALL_PRIMITIVES
    // note: despite its name, this counts the vertices with w < 0
    int verticesInFrontOfNearPlane = 0;
    // Transform vertices[i].xy into normalized 0..1 screen space
    for (uint i = 0; i < 3; ++i) {
        if (vertices[i].w < 0) { // check for behind near clipping plane
            ++verticesInFrontOfNearPlane;
            // flip the w so that any triangle that straddles the plane won't be projected onto
            // two sides of the frustum
            vertices[i].w *= (-1.f);
        }
        vertices[i].xy /= vertices[i].w * 2;
        vertices[i].xy += float2(0.5, 0.5);
    }
#endif
Compute-based triangle filtering
• Part 2 source code – triangle behind near clipping plane
// in filterTriangles.shd
if (verticesInFrontOfNearPlane == 3)
{
    cull = true;
}
Compute-based triangle filtering
• Frustum culling
• Culls triangles that are projected
outside the clipping cube
• Takes into account near and far planes
• Cost: check if all vertices lie in the negative
side of the clip-space cube
• Effectiveness: medium-high (depends on
the size of the scene and eye pos)
Compute-based triangle filtering
// in filterTriangles.shd
#if ENABLE_CULL_FRUSTUM
{
    if (verticesInFrontOfNearPlane == 3)
    {
        cull = true;
    }
    float minx = min (min (vertices[0].x, vertices[1].x), vertices[2].x);
    float miny = min (min (vertices[0].y, vertices[1].y), vertices[2].y);
    float maxx = max (max (vertices[0].x, vertices[1].x), vertices[2].x);
    float maxy = max (max (vertices[0].y, vertices[1].y), vertices[2].y);
    cull = cull || (maxx < 0) || (maxy < 0) || (minx > 1) || (miny > 1);
}
#endif
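The near-plane and frustum snippets above can be combined into one CPU reference function for experimentation. This is a sketch that follows the shader logic as shown (flip negative w, map xy to 0..1, compare the bounding box against the unit square); cullNearAndFrustum is a hypothetical name, and vertices with w equal to zero are not handled specially.

```cpp
#include <algorithm>
#include <array>

struct Float4 { float x, y, z, w; };

// Returns true when the triangle can be discarded by the near-plane or
// frustum test. Takes the clip-space vertices by value because, like the
// shader, it overwrites xy with normalized 0..1 screen coordinates.
bool cullNearAndFrustum(std::array<Float4, 3> v) {
    int verticesBehind = 0; // vertices with w < 0
    for (Float4& p : v) {
        if (p.w < 0.0f) {
            ++verticesBehind;
            p.w = -p.w; // keep straddling triangles on one side
        }
        p.x = p.x / (p.w * 2.0f) + 0.5f;
        p.y = p.y / (p.w * 2.0f) + 0.5f;
    }
    if (verticesBehind == 3) {
        return true; // whole triangle is behind the near plane
    }
    const float minx = std::min({v[0].x, v[1].x, v[2].x});
    const float miny = std::min({v[0].y, v[1].y, v[2].y});
    const float maxx = std::max({v[0].x, v[1].x, v[2].x});
    const float maxy = std::max({v[0].y, v[1].y, v[2].y});
    return (maxx < 0.0f) || (maxy < 0.0f) || (minx > 1.0f) || (miny > 1.0f);
}
```

The test is conservative: a triangle is only rejected when its whole screen-space bounding box lies outside the viewport.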
Compute-based triangle filtering
• Small-primitives culling
• Culls triangles that are too small to be seen
• Triangles that do not touch any sample point after projection
• Long and thin triangles that do not touch any sample are culled as well
• More efficient use of hardware resources
• Cost: triangle touches any subpixel samples
• Effectiveness: medium (depends on the size of the triangles and
screen res)
Primitive-rate bound
• only one primitive per cycle per tile can be
scanned (see [Wihlidal])
• Very inefficient use of rasterization units
Compute-based triangle filtering
• Implementation is a two-step approach
• Test if one of the vertices is outside the Guardband
-> can’t be a small triangle
• If all vertices of a triangle are inside the Guardband
-> do the actual small size test
Compute-based triangle filtering
• Guard band clipping
Usually part of rasterization: there is a guard band around the
viewport
• Medium gray triangles -> outside viewport -> rejected
• Light gray triangles -> completely in the guard band -> rasterized and the pixels outside of the
viewport are ignored by rasterizer
• Dark triangle needs to be clipped
Compute-based triangle filtering
• Test if vertex outside the Guardband
// in filterTriangles.shd
int2 minBB = int2(1 << 30, 1 << 30);
int2 maxBB = int2(-(1 << 30), -(1 << 30));
bool insideGuardBand = true;
for (uint i = 0; i < 3; ++i) {
    float2 screenSpacePositionFP = vertices[i].xy * windowSize;
    // Check if we would overflow after conversion
    if (screenSpacePositionFP.x < -(1 << 23)
        || screenSpacePositionFP.x > (1 << 23)
        || screenSpacePositionFP.y < -(1 << 23)
        || screenSpacePositionFP.y > (1 << 23))
    {
        insideGuardBand = false;
    }
    // scale based on the distance from the center to the MSAA sample point
    int2 screenSpacePosition = int2(screenSpacePositionFP * (SUBPIXEL_SAMPLES * samples));
    minBB = min (screenSpacePosition, minBB);
    maxBB = max (screenSpacePosition, maxBB);
}
Compute-based triangle filtering
• Is the triangle of “small” size?
// in filterTriangles.shd
if (insideGuardBand)
{
    const uint SUBPIXEL_SAMPLE_CENTER = SUBPIXEL_SAMPLES / 2;
    const uint SUBPIXEL_SAMPLE_SIZE = SUBPIXEL_SAMPLES - 1;
    /*
    Test is:
    Is the minimum of the bounding box right or above the sample
    point and is the width less than the pixel width in samples in
    one direction.
    This will also cull very long triangles which fall between
    multiple samples.
    */
    cull = cull || any( ((minBB & SUBPIXEL_MASK) > SUBPIXEL_SAMPLE_CENTER)
        && ((maxBB - ((minBB & ~SUBPIXEL_MASK) + SUBPIXEL_SAMPLE_CENTER))
            < (SUBPIXEL_SAMPLE_SIZE)));
}
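The bounding-box test above is easier to follow with concrete numbers. The sketch below ports it to the CPU for the non-MSAA case. The slides do not define SUBPIXEL_SAMPLES or SUBPIXEL_MASK, so the values used here (8 subpixel bits, i.e. 256 subpixel positions per pixel, mask 255) are assumptions, and the function names are hypothetical. Coordinates are fixed-point and assumed non-negative.

```cpp
// Assumed fixed-point setup: 8 subpixel bits per pixel (not defined on the slides).
constexpr int SUBPIXEL_BITS = 8;
constexpr int SUBPIXEL_SAMPLES = 1 << SUBPIXEL_BITS; // 256
constexpr int SUBPIXEL_MASK = SUBPIXEL_SAMPLES - 1;  // 255

struct Int2 { int x, y; };

// True if, along one axis, the interval [minC, maxC] starts after the
// sample center of its pixel and ends before the next sample center.
bool fallsBetweenSamples(int minC, int maxC) {
    const int SUBPIXEL_SAMPLE_CENTER = SUBPIXEL_SAMPLES / 2;
    const int SUBPIXEL_SAMPLE_SIZE = SUBPIXEL_SAMPLES - 1;
    const bool startsAfterCenter = (minC & SUBPIXEL_MASK) > SUBPIXEL_SAMPLE_CENTER;
    const bool endsBeforeNextCenter =
        (maxC - ((minC & ~SUBPIXEL_MASK) + SUBPIXEL_SAMPLE_CENTER)) < SUBPIXEL_SAMPLE_SIZE;
    return startsAfterCenter && endsBeforeNextCenter;
}

// Equivalent of the any(...) in the shader: missing every sample on one
// axis is enough to cull the triangle.
bool isSmallPrimitive(Int2 minBB, Int2 maxBB) {
    return fallsBetweenSamples(minBB.x, maxBB.x) || fallsBetweenSamples(minBB.y, maxBB.y);
}
```

For example, a box spanning x = 2760..2860 lies entirely between the sample centers at 2688 and 2944, so it is culled whatever its height; this is also why long, thin triangles between sample points are removed.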
Compute-based triangle filtering
• Depth culling (not used in this
demo)
• Culls triangles that are occluded
by the scene
• This test requires a coarse depth buffer
• Cost: load depth values from map and
check triangle/BB intersection
• Effectiveness: medium-high (depends
on scene complexity and the size of the
triangles)
The coarse depth buffer can be generated by
• Downsampling previous z-buffer and
reprojecting depths
• Rendering selected LOD geometry at
low res
Compute-based triangle filtering
(figure: Culling tests [Degenerate triangles, Backfaces, Frustum, Small primitives, Depth] -> Culled results -> Compaction -> draw this using multi draw indirect / ExecuteIndirect)
• Triangle filtering is executed on groups of 256 triangles (one batch) ->
this can leave empty draws
• Draw batch compaction to the rescue
• Can be run in parallel in a compute shader
• Eliminates empty draws from the multi indirect draw buffer
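Draw batch compaction is a stream compaction over the indirect argument buffer: every batch whose filtered index count is zero is dropped. The serial CPU sketch below only illustrates the result, not the parallel compute-shader implementation. DrawIndexedArgs and compactDraws are hypothetical names; the field order mirrors the indexed indirect draw arguments used by DirectX 12 and Vulkan.

```cpp
#include <cstdint>
#include <vector>

// One indexed indirect draw (field order as in the DirectX 12 / Vulkan
// indexed indirect argument structures).
struct DrawIndexedArgs {
    std::uint32_t indexCount;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t firstInstance;
};

// Keeps only the draws that still reference at least one triangle after
// filtering, preserving their order.
std::vector<DrawIndexedArgs> compactDraws(const std::vector<DrawIndexedArgs>& draws) {
    std::vector<DrawIndexedArgs> compacted;
    compacted.reserve(draws.size());
    for (const DrawIndexedArgs& draw : draws) {
        if (draw.indexCount > 0) {
            compacted.push_back(draw);
        }
    }
    return compacted;
}
```

On the GPU the same result is typically obtained with a prefix sum or an atomic counter, so that each surviving batch knows its output slot.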
Compute-based triangle filtering - Frame
pseudocode
1. [CPU] Early discard invisible geometry using triangle cluster culling
2. [CS] Generate unculled indices and multi draw indirect buffers using
triangle filtering (one triangle per thread)
3. Like before
Triangle cluster culling / filtering – Draw Calls
• For this static scene one large vertex buffer and an index buffer
generated by triangle culling and filtering is used
• Draw batches each hold a block of geometry for one material
-> Only two “materials”: opaque and alpha masked; transparent objects
and other materials would go into the same buffer
• For dynamic objects we would use a dedicated VB/IB pair for each; this
is optional
Triangle cluster culling / filtering – Draw Calls
• ExecuteIndirect makes it possible to completely defer the workload to
the GPU, so that the actual work items can be batched in a more
coherent way
• Grouping multiple draw calls into a single one
• reduces CPU overhead -> only a single Graphics API call
• reduces GPU overhead (alleviates pressure on VS, CP and triangle assembly) by
letting the user provide an optimal workload through
• culling indices == invisible triangles through async compute
• culling draw calls -> the resulting indirect argument buffer only includes valid draw calls
Triangle cluster culling / filtering – Draw Calls
San Miguel Scene Number of Draw calls
(ExecuteIndirect)
• Shadow opaque 214
• Shadow alpha masked 59
• Main view opaque 200
• Main view alpha masked 60
• Dispatch calls for filtering 81
Triangle cluster culling / filtering – Benefits
• Culls triangles before sending them to the graphics
pipeline
• Avoid overwhelming parts of the graphics pipeline (rasterizer)
• Graphics pipeline is better utilized with the visible triangles
(rasterizer efficiency, command processor,…)
• Can make use of async compute to potentially overlap with
the graphics pipeline
Re-using triangle filtered results
for multiple Views / Rendering
Passes
Motivation
• Compute-based triangle filtering comes at a cost
• For every triangle: load indices and vertices, transform vertices,
append (lock) triangle data to index buffer
• We came up with the idea to generate filtered data for several rendering
passes like main view, shadows etc.
• Use filtered data to cull the same triangle set from different views
• Load indices / vertices only once for all views
Motivation
• Reduces the effectiveness of cluster culling / triangle filtering
• harder to cull a cluster for N views
• however, it was worth it:
Use filtered data to cull triangles from different views
• The algorithm is generalized to test against N different views
• Load indices / vertices once, transform vertices for every view
Algorithm
[CPU] unculledClusters <- ClusterCulling(sceneObjs, views)
[GPU Compute] filteredIndicesArray <- TriangleCulling(unculledClusters, views)
[GPU Graphics] ShadowPass( filteredIndices[shadowView] )
[GPU Graphics] MainPass1( filteredIndices[mainView] )
[GPU Graphics] MainPass2( filteredIndices[mainView] )
Adding re-usage of triangle filtered data - Frame pseudocode
1. [CPU] Early discard geometry not visible from any view using cluster
culling
2. [CS] Generate N index and N multi draw indirect / ExecuteIndirect
buffers using triangle filtering testing against the N views (one
triangle per thread)
3. For each view i (using the ith index buffer and ith ExecuteIndirect buffer):
1. [Gfx] Clear visibility and depth buffers
2. [VS,PS] Visibility buffer pass
[PS] Output triangle / instance IDs
3. [PS] Interpolate attributes from gradients and shade pixel
* Using a dedicated Visibility Buffer for the shadow pass is overkill, but you can still use the
filtered data for it.
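The core of step 2 can be sketched on the CPU as a loop that touches every triangle once and tests it against each of the N views, appending its indices to the index list of every view that keeps it. The per-view test is passed in as a callable here; Triangle, ViewTest and filterForViews are hypothetical names, and the real filter runs one triangle per compute thread with the culling tests described earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct Triangle { std::uint32_t i0, i1, i2; };

// Returns true if the triangle survives filtering for one particular view.
using ViewTest = std::function<bool(const Triangle&)>;

// Produces one filtered index list per view. Each triangle is read once
// and then tested against every view.
std::vector<std::vector<std::uint32_t>> filterForViews(
        const std::vector<Triangle>& triangles,
        const std::vector<ViewTest>& views) {
    std::vector<std::vector<std::uint32_t>> filtered(views.size());
    for (const Triangle& tri : triangles) {
        for (std::size_t v = 0; v < views.size(); ++v) {
            if (views[v](tri)) {
                filtered[v].insert(filtered[v].end(), {tri.i0, tri.i1, tri.i2});
            }
        }
    }
    return filtered;
}
```

Each inner list then plays the role of one of the N index buffers, consumed by the matching pass (shadow view, main view and so on).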
Results San Miguel Scene average for main view
• 8 Million triangles
• 5 Million vertices
Total triangles | Rendered | Culled
8,010,146 | 851,517 (10.6%) | 7,158,629 (89.4%)
Culled per test (a triangle can fail more than one test, so the counts overlap):
• Back-face: 3,185,203 (39.8%)
• Frustum: 5,244,787 (65.5%)
• Small primitives: 1,950,030 (24.4%)
Visibility Buffer 3840 x 2160
GPU | Culling | Shadow Map | Fill VB | HDAO | Shade VB | Resolve MSAA | UI | Overall
Visibility Buffer 4k – No MSAA
AMD RADEON R9 380 | 2.66 | 1.42 | 5.07 | 2.51 | 3.43 | - | 0.02 | 15.19
NVIDIA GeForce GTX 970 | 3.46 | 1.12 | 3.36 | 1.82 | 3.29 | - | 0.02 | 13.52
Visibility Buffer 4k – No MSAA No Culling
AMD RADEON R9 380 | - | 5.18 | 9.25 | 2.51 | 3.43 | - | 0.02 | 20.45
NVIDIA GeForce GTX 970 | - | 3.74 | 5.31 | 1.79 | 3.25 | - | 0.02 | 14.39
Visibility Buffer 4k – 2x MSAA
AMD RADEON R9 380 | 2.72 | 1.42 | 8.27 | 3.65 | 8.27 | 2.29 | 0.03 | 25.87
NVIDIA GeForce GTX 970 | 3.34 | 1.09 | 6.17 | 3.58 | 6.17 | 0.34 | 0.02 | 19.87
Visibility Buffer 4k – 4x MSAA
AMD RADEON R9 380 | 2.70 | 1.43 | 8.73 | 5.70 | 15.58 | 3.61 | 0.03 | 37.86
NVIDIA GeForce GTX 970 | 3.37 | 1.11 | 7.86 | 6.86 | 12.35 | 0.87 | 0.02 | 32.68
Deferred Shading 3840 x 2160
GPU | Culling | Shadow Map | Fill Buffer | HDAO | Shade Buffer | Resolve MSAA | UI | Overall
Deferred Shading 4k – No MSAA
AMD RADEON R9 380 | 2.67 | 1.42 | 12.19 | 2.51 | 1.29 | - | 0.03 | 20.19
NVIDIA GeForce GTX 970 | 3.36 | 1.20 | 9.04 | 1.79 | 1.21 | - | 0.02 | 16.82
Deferred Shading 4k – No MSAA No Culling
AMD RADEON R9 380 | - | 5.18 | 15.00 | 2.51 | 1.29 | - | 0.03 | 24.06
NVIDIA GeForce GTX 970 | - | 3.75 | 10.44 | 1.79 | 1.22 | - | 0.02 | 17.49
Deferred Shading 4k – 2x MSAA
AMD RADEON R9 380 | 2.70 | 1.42 | 21.65 | 3.65 | 10.86 | 2.29 | 0.03 | 42.68
NVIDIA GeForce GTX 970 | 3.35 | 1.10 | 15.36 | 3.59 | 2.41 | 0.34 | 0.02 | 26.27
Deferred Shading 4k – 4x MSAA
AMD RADEON R9 380 | 2.72 | 1.44 | 35.88 | 5.74 | 20.13 | 3.60 | 0.02 | 69.64
NVIDIA GeForce GTX 970 | 3.40 | 1.18 | 30.29 | 6.87 | 5.39 | 0.87 | 0.02 | 48.12
XBOX One - Visibility Buffer 1080p
GPU | Culling | Shadow Map | Fill VB | HDAO | Shade VB | Resolve MSAA | UI | Overall
Visibility Buffer 1080p – No MSAA
Xbox One | 7.19 | 3.32 | 3.90 | 1.73 | 3.90 | - | 0.02 | 19.78
Visibility Buffer 1080p – No MSAA No Culling
Xbox One | - | 9.16 | 9.09 | 1.73 | 3.89 | - | 0.02 | 23.98
Visibility Buffer 1080p – 2x MSAA
Xbox One | 7.28 | 3.20 | 7.57 | 5.17 | 7.57 | 0.46 | 0.02 | 28.00
XBOX One - Deferred Shading 1080p
GPU | Culling | Shadow Map | Fill Buffer | HDAO | Shade Buffer | Resolve MSAA | UI | Overall
Deferred Shading 1080p – No MSAA
Xbox One | 7.19 | 3.14 | 11.18 | 1.74 | 1.38 | - | 0.02 | 24.77
Deferred Shading 1080p – No MSAA No Culling
Xbox One | - | 9.09 | 15.07 | 1.73 | 1.39 | - | 0.02 | 27.45
Deferred Shading 1080p – 2x MSAA
Xbox One | 7.21 | 3.18 | 21.85 | 5.18 | 8.46 | 0.47 | 0.02 | 46.41
Summary
Deferred Shading
AMD RADEON R9 380 | 1080p | 1440p | 2160p
No MSAA | 9.75 | 12.30 | 20.19
No MSAA – No Culling | 14.16 | 16.66 | 24.06
2x MSAA | 16.16 | 23.09 | 42.68
4x MSAA | 24.90 | 36.37 | 69.64
NVIDIA GeForce GTX 970 | 1080p | 1440p | 2160p
No MSAA | 8.72 | 10.67 | 16.82
No MSAA – No Culling | 10.30 | 11.91 | 17.49
2x MSAA | 11.58 | 15.23 | 26.27
4x MSAA | 17.00 | 24.76 | 48.12
Xbox One | 1080p | 1440p | 2160p
No MSAA | 24.77 | - | -
No MSAA – No Culling | 27.45 | - | -
2x MSAA | 46.41 | - | -
Visibility Buffer
AMD RADEON R9 380 | 1080p | 1440p | 2160p
No MSAA | 8.57 | 10.72 | 15.19
No MSAA – No Culling | 14.52 | 15.86 | 20.45
2x MSAA | 11.44 | 16.38 | 25.87
4x MSAA | 15.27 | 20.82 | 37.86
NVIDIA GeForce GTX 970 | 1080p | 1440p | 2160p
No MSAA | 7.68 | 8.83 | 13.52
No MSAA – No Culling | 9.44 | 10.64 | 14.39
2x MSAA | 9.63 | 12.21 | 19.87
4x MSAA | 12.47 | 17.47 | 32.68
Xbox One | 1080p | 1440p | 2160p
No MSAA | 19.78 | - | -
No MSAA – No Culling | 23.98 | - | -
2x MSAA | 28.00 | - | -
How about VR?
• We are working on the StarVR SDK. StarVR uses a large
field of view with very high resolution
• The Visibility Buffer helps substantially with
performance here
• We cull and prepare the data for all views and the shadow map views in
one go
Executive Summary
We built a rendering system that
• Cluster culls and filters triangles for different views like main
view, shadow view, reflection view, GI view etc. at the same time
• The optimized triangles are used to fill a screen-space Visibility
Buffer or more Visibility Buffers for more views
• We then render lights, shadows, bounce lights with the
optimized geometry based on visibility
• We can decouple the visibility of geometry from the shading
frequency
• We can light per triangle or in so-called object space
Future work
• Re-use culled triangles over several frames
• Use intrinsics for several parts of the pipeline [Chajdas
Compaction]
• Further improve asynchronous scheduling
• Async compute is powerful
Source Code
• Source code: https://we.tl/MOdVkljptr
• Probably later on GitHub if we can fit everything on there (1 GB
limit)
Credits
• Christoph Schied – wrote implementation of his paper with the OpenGL 4.5 run-time at our office
• Confetti People
• Marijn Tamis – wrote the initial OpenGL 4.5 run-time
• Leroy Sikkes – wrote the initial DirectX 12 run-time and added hardware performance
counters
• Max Oomen (intern) – added linear lighting and fixed many bugs
• Jesús Gumbau – added triangle filtering, came up with the idea and implemented re-usage of
filtered triangle data and made it cross-platform running on NVIDIA and AMD GPUs and then
brought it to DirectX 12
• Jordan Logan – brought it to XBOX One and optimized for this console
Credits
• Most of the code for triangle culling and filtering is based on
[Chajdas]. We added three features:
• MSAA support for small triangle removal
• Triangle in front of Near Plane culling
• Multi-Viewport support
Acknowledgements
• Graham Wihlidal DICE Frostbite
• Nicolas Thibieroz, Gareth Thomas, Matthaeus Chajdas, Steven Tovey AMD
• Mike Acton Insomniac
• James McLaren, Q-Games
• Remi Arnaud, Starbreeze / StarVR
• Kev Gee Microsoft
References
• [Burns] Christopher A. Burns, Warren A. Hunt, “The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading”, Journal of Computer Graphics Techniques (JCGT) 2:2 (2013), 55-69. Available online at http://jcgt.org/published/0002/02/04
• [Chajdas] Matthaeus Chajdas, “GeometryFX”, http://gpuopen.com/gaming-product/geometryfx/
• [Chajdas Compaction] Matthaeus Chajdas, “Fast compaction with mbcnt”, http://gpuopen.com/fast-compaction-with-mbcnt/
• [Edge] Edge Library, PS3 SDK
• [Engel2009] Wolfgang Engel, “Light Pre-Pass”, “Advances in Real-Time Rendering in 3D Graphics and Games”, SIGGRAPH 2009, http://halo.bungie.net/news/content.aspx?link=Siggraph_09
• [Lagarde] Sebastien Lagarde, Charles de Rousiers, “Moving Frostbite to Physically Based Rendering”, Course notes SIGGRAPH 2014
• [Olano] Marc Olano, Trey Greer, “Triangle Scan Conversion using 2D Homogeneous Coordinates”, http://www.cs.unc.edu/~olano/papers/2dh-tri/2dh-tri.pdf
• [Schied2015] Christoph Schied, Carsten Dachsbacher, “Deferred Attribute Interpolation for Memory-Efficient Deferred Shading”, http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf
• [Schied2016] Christoph Schied, Carsten Dachsbacher, “Deferred Attribute Interpolation Shading”, GPU Pro 7, CRC Press / shorter and free version: http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf
• [Wihlidal] Graham Wihlidal, “Optimizing the Graphics Pipeline with Compute”, GDC 2016, http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/
wolf@conffx.com

Triangle Visibility buffer

  • 1. The filtered and culled Visibility Buffer Wolfgang Engel Confetti October 13th, 2016
  • 3. Table of contents • The Visibility Buffer • Cluster Culling / Triangle Filtering • Re-using triangle filtered results for multiple rendering passes
  • 5. Motivation • Forward rendering shades all fragments in triangle-submission order • Wastes rendering power on pixels that don’t contribute to the final image • Deferred shading solves this problem in 2 steps: • First, surface attributes are stored in screen buffers -> G-Buffer • Second, shading is computed for visible fragments only • However, deferred shading increases memory bandwidth consumption: • Screen buffers for: normal, depth, albedo, material ID,… • G-Buffer size becomes challenging at high resolutions
  • 6. High-Level View 2009: G-Buffer (taken from [Engel2009], Killzone 2 layout). Diagram: opaque objects are rendered into the Depth Buffer (32-bit) and three 8:8:8:8 targets (Normals, Specular / Motion Vec, Albedo / Shadow), followed by Deferred Lighting; transparent objects are sorted back-to-front and forward rendered with depth write switched off.
  • 7. High-Level View 2014: G-Buffer – Frostbite Engine [Lagarde]. Render target formats: 10:10:10:2, 8:8:8:8, 8:8:8:8, 11:11:10; Depth: R32G8X24_TYPELESS.
  • 8. High-Level View 2014: G-Buffer – Frostbite Engine [Lagarde]. Diagram: opaque objects are rendered into the Depth Buffer (32-bit) and the targets Normals etc., BaseColor etc., MetalMask etc. and IBL / GI (formats 10:10:10:2, 8:8:8:8, 8:8:8:8, 11:11:10), followed by Deferred Lighting; transparent objects are sorted back-to-front and forward rendered with depth write switched off.
  • 9. High-Level View: Visibility Buffer (similar to [Burns][Schied]). Diagram: the Vertex Buffer holds opaque and transparent objects; the pipeline uses a Depth Buffer and a Visibility Buffer (8:8:8:8), followed by Lighting and Forward Rendering of transparent objects sorted back-to-front. Each Visibility Buffer pixel encodes a 1-bit alpha mask, an 8-bit drawID and a 23-bit triangleID / primID.
  • 10. Visibility Buffer PC 1080p – Memory (NoMSAA / 2xMSAA / 4xMSAA)
      • Visibility Buffer, keeps the address of each triangle in 32 bits per pixel (4 bytes * 1920 * 1080): 7.9 MB / 15.8 MB / 31.6* MB
      • Depth Buffer (4 bytes * 1920 * 1080): 7.9 MB / 15.8 MB / 31.6* MB
      • Hierarchical Z (4 bytes * 1920 * 1080 * 1/64): 0.12 MB / - / -
      • Vertex Buffer, 28 bytes per vertex + padding (float3 position; uint normal; uint tangent; uint texCoord; uint materialID; uint pad; // to align to 128 bits – 32 bytes for NVIDIA): does not increase with resolution
      • Textures: 21* MB / - / -
      • Draw arguments, Uniform, Descriptors etc.: 2* MB / - / -
      • Overall: 38.80* MB / 54.60* MB / 86.20* MB
      * rough estimate; driver might increase it
  • 11. G-Buffer PC 1080p – Memory (NoMSAA / 2xMSAA / 4xMSAA)
      • Normals 10:10:10:2 (4 bytes * 1920 * 1080): 7.9 MB / 15.8* MB / 31.6* MB
      • PBR 8:8:8:8 (4 bytes * 1920 * 1080): 7.9 MB / 15.8* MB / 31.6* MB
      • Albedo 8:8:8:8 (4 bytes * 1920 * 1080): 7.9 MB / 15.8* MB / 31.6* MB
      • GI / IBL 11:11:10 (4 bytes * 1920 * 1080): 7.9 MB / 15.8* MB / 31.6* MB
      • Depth 8:8:8:8 (4 bytes * 1920 * 1080): 7.9 MB / 15.8* MB / 31.6* MB
      • Hierarchical Z (4 bytes * 1920 * 1080 * 1/64): 0.49 MB / - / -
      • Draw arguments etc.: 2* MB
      • Overall: 41.99* MB / 81.49* MB / 160.49* MB
      * rough estimate; driver might increase it
  • 12. Visibility Buffer PC 4k – Memory (NoMSAA / 2xMSAA / 4xMSAA)
      • Visibility Buffer, keeps the address of each triangle in 32 bits per pixel (4 bytes * 3840 * 2160): 31.64 MB / 63.28* MB / 126.56* MB
      • Depth Buffer (4 bytes * 3840 * 2160): 31.64 MB / 63.28* MB / 126.56* MB
      • Hierarchical Z (4 bytes * 3840 * 2160 * 1/64): 0.49 MB / - / -
      • Vertex Buffer, 28 bytes per vertex + padding (float3 position; uint normal; uint tangent; uint texCoord; uint materialID; uint pad; // to align to 128 bits – 32 bytes for NVIDIA): does not increase with resolution
      • Textures: 21* MB
      • Draw arguments, Uniform, Descriptors etc.: 2* MB
      • Overall: 86.77 MB / 150.05 MB / 276.61 MB
      * rough estimate; driver might increase it
  • 13. G-Buffer PC 4k – Memory (NoMSAA / 2xMSAA / 4xMSAA)
      • Normals 10:10:10:2 (4 bytes * 3840 * 2160): 31.64 MB / 63.28* MB / 126.56* MB
      • PBR 8:8:8:8 (4 bytes * 3840 * 2160): 31.64 MB / 63.28* MB / 126.56* MB
      • Albedo 8:8:8:8 (4 bytes * 3840 * 2160): 31.64 MB / 63.28* MB / 126.56* MB
      • GI / IBL 11:11:10 (4 bytes * 3840 * 2160): 31.64 MB / 63.28* MB / 126.56* MB
      • Depth 8:8:8:8 (4 bytes * 3840 * 2160): 31.64 MB / 63.28* MB / 126.56* MB
      • Hierarchical Z (4 bytes * 3840 * 2160 * 1/64): 0.49 MB / - / -
      • Draw arguments etc.: 2* MB / - / -
      • Overall: 160.69* MB / 318.89* MB / 635.29* MB
      * rough estimate; driver might increase it
  • 14. Visibility Buffer XOne 1080p – Memory, ESRAM (NoMSAA / 2xMSAA)
      • Visibility Buffer, keeps the address of each triangle in 32 bits per pixel (4 bytes * 1920 * 1080): 7.9 MB / 15.8 MB
      • Depth Buffer (4 bytes * 1920 * 1080): 7.9 MB / 15.8 MB
      • Hierarchical Z (4 bytes * 1920 * 1080 * 1/64): 0.12 MB / -
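The per-target figures in the memory slides follow from width * height * bytes per pixel * MSAA samples. A minimal sketch of that arithmetic, assuming "MB" on the slides means 2^20 bytes (the helper name `render_target_mib` is made up for illustration):

```python
def render_target_mib(width, height, bytes_per_pixel=4, msaa_samples=1):
    """Size of one screen-sized render target in MiB (1 MiB = 2**20 bytes)."""
    return width * height * bytes_per_pixel * msaa_samples / 2**20


# One 32-bit target at 1080p and at 4k, without MSAA.
target_1080p = render_target_mib(1920, 1080)
target_4k = render_target_mib(3840, 2160)

# Screen-sized targets only: the Visibility Buffer layout uses two of them
# (visibility + depth), the G-Buffer layout on the slides uses five.
visibility_4k = 2 * target_4k
gbuffer_4k = 5 * target_4k
```

Only the screen-sized targets scale with resolution and sample count; the vertex buffer, textures and draw arguments listed on the slides stay constant.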
  • 15. Visibility Buffer – Filling pseudocode • Visibility Buffer generation step • For each pixel on screen: • Pack (alpha masked bit, drawID, primitiveID) into 1 32-bit UINT • Write that into a screen-sized buffer • The tuple (alpha masked bit, drawID, primitiveID) will allow a shader to access the triangle data in the shading step
  • 16. Visibility Buffer – Shader code
      // in visibilityPass.shd
      uint calculateOutputVBID(bool opaque, uint drawID, uint primitiveID){
        uint drawID_primID = ((drawID << 23) & 0x7F800000) | (primitiveID & 0x007FFFFF);
        if (opaque) return drawID_primID;
        else return (1 << 31) | drawID_primID;
      }
      float4 unpackUnorm4x8(uint p){
        return float4(float(p & 0x000000FF) / 255.0,
                      float((p & 0x0000FF00) >> 8) / 255.0,
                      float((p & 0x00FF0000) >> 16) / 255.0,
                      float((p & 0xFF000000) >> 24) / 255.0);
      }
      cbuffer RootConstants : register(b0, space0){
        uint DrawID;
      };
      float4 main(uint primitiveID : SV_PrimitiveID) : SV_Target{
        return unpackUnorm4x8(calculateOutputVBID(true, DrawID, primitiveID));
      }
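The bit layout used by calculateOutputVBID (1-bit alpha mask, 8-bit drawID, 23-bit primitiveID) can be checked on the CPU. The following is a sketch in Python, not the demo's code; `pack_visibility` and `unpack_visibility` are hypothetical names, and the unpack step is an assumption about how a shading pass would recover the tuple:

```python
PRIM_ID_BITS = 23
PRIM_ID_MASK = (1 << PRIM_ID_BITS) - 1   # 0x007FFFFF
DRAW_ID_MASK = 0xFF                      # 8 bits, stored in bits 23..30
ALPHA_BIT = 1 << 31                      # set for alpha-masked geometry


def pack_visibility(opaque, draw_id, prim_id):
    """Pack (alpha-masked bit, drawID, primitiveID) into one 32-bit value."""
    packed = ((draw_id & DRAW_ID_MASK) << PRIM_ID_BITS) | (prim_id & PRIM_ID_MASK)
    return packed if opaque else packed | ALPHA_BIT


def unpack_visibility(packed):
    """Recover (alpha_masked, draw_id, prim_id) from a packed 32-bit value."""
    alpha_masked = bool(packed & ALPHA_BIT)
    draw_id = (packed >> PRIM_ID_BITS) & DRAW_ID_MASK
    prim_id = packed & PRIM_ID_MASK
    return alpha_masked, draw_id, prim_id
```

With this layout a single draw can address at most 2^23 triangles and a pass at most 256 draws, which is why the draw batches are kept small.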
  • 17. Visibility Buffer – Shading pseudocode • For each pixel in screen-space we do: 1. Get drawID/triangleID at pixel pos 2. Load data for the 3 vertices from the VB 3. Compute the partial derivatives of the barycentric coordinates – triangle gradients i. Compute x and y value of each vertex of a triangle in projected screen-space ii. Compute Barycentric Coordinates iii. Compute partial derivatives of the barycentric coordinates 4. Interpolate vertex attributes at pixelpos using gradients (could do triangle / object- space lighting) a) Attribs use w from position to compute perspective correct interpolation b) MVP matrix is applied to position 5. We have all data ready: shade and calculate final color
  • 18. Visibility Buffer – Shading pseudocode Barycentric coordinates λ1, λ2, λ3 on an equilateral triangle and on a right triangle
  • 19. Visibility Buffer – Shading pseudocode
      The barycentric gradients follow [Schied2015], Appendix A, Equation (4).
      /* taken from computeDerivaties.shd
      void computeBarycentricGradients(float2 v[3], out float3 db_dx, out float3 db_dy){
        float d = 1.0 / determinant(float2x2(v[2] - v[1], v[0] - v[1]));
        db_dx = float3(v[1].y - v[2].y, v[2].y - v[0].y, v[0].y - v[1].y) * d;
        db_dy = float3(v[2].x - v[1].x, v[0].x - v[2].x, v[1].x - v[0].x) * d;
      }*/
      // in shadeSun.shd
      float3 one_over_w = 1.0 / float3(position0.w, position1.w, position2.w);
      // projected vertices
      float2 pos_scr[3] = {position0.xy * one_over_w.x, position1.xy * one_over_w.y, position2.xy * one_over_w.z};
      float3 db_dx, db_dy; // gradient barycentric coords x/y
      computeBarycentricGradients(pos_scr, db_dx, db_dy);
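To see why these gradients are enough to evaluate barycentric coordinates anywhere on the triangle, here is a CPU-side sketch in Python that mirrors computeBarycentricGradients. The helper `barycentrics_at` is an illustrative addition: it starts from the barycentrics at vertex 0, which are (1, 0, 0), and steps along the gradients.

```python
def barycentric_gradients(v0, v1, v2):
    """Return (db_dx, db_dy), the screen-space derivatives of the barycentrics."""
    # Twice the signed triangle area; same value as the 2x2 determinant in the shader.
    det = (v2[0] - v1[0]) * (v0[1] - v1[1]) - (v2[1] - v1[1]) * (v0[0] - v1[0])
    d = 1.0 / det
    db_dx = ((v1[1] - v2[1]) * d, (v2[1] - v0[1]) * d, (v0[1] - v1[1]) * d)
    db_dy = ((v2[0] - v1[0]) * d, (v0[0] - v2[0]) * d, (v1[0] - v0[0]) * d)
    return db_dx, db_dy


def barycentrics_at(p, v0, v1, v2):
    """Barycentric coordinates of 2D point p, extrapolated from vertex v0."""
    db_dx, db_dy = barycentric_gradients(v0, v1, v2)
    dx, dy = p[0] - v0[0], p[1] - v0[1]
    start = (1.0, 0.0, 0.0)  # barycentrics at v0
    return tuple(s + dx * gx + dy * gy for s, gx, gy in zip(start, db_dx, db_dy))
```

Because barycentric coordinates are linear in screen space, the extrapolation is exact: evaluating at v1 gives (0, 1, 0) and at v2 gives (0, 0, 1).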
  • 20. Visibility Buffer – Shading pseudocode • For each pixel in screen-space we do: 1. Get drawID/triangleID at pixel pos 2. Load data for the 3 vertices from the VB 3. Compute the partial derivatives of the barycentric coordinates – triangle gradients 4. Interpolate vertex attributes at pixelpos using gradients (could do triangle / object-space lighting) a) Attribs use w from position to compute perspective correct interpolation b) MVP matrix is applied to position 5. We have all data ready: shade and calculate final color
  • 21. Visibility Buffer – Shading pseudocode
      // in shadeSun.shd
      float3 interpolateAttribute(float3x3 attributes, float3 db_dx, float3 db_dy, float2 d){
        float3 attribute_x = mul(db_dx, attributes);
        float3 attribute_y = mul(db_dy, attributes);
        float3 attribute_s = attributes[0];
        return (attribute_s + d.x * attribute_x + d.y * attribute_y);
      }
      float2 d = In.screenPos - pos_scr[0];
      float w = 1.0 / interpolateAttribute(one_over_w, db_dx, db_dy, d);
      float z = w * projection[2][2] + projection[2][3];
      float3 position = mul(invVP, float4(In.screenPos * w, z, w)).xyz;
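The slide interpolates 1/w linearly in screen space and inverts it to recover w. The same idea gives perspective-correct attributes: interpolate attribute/w and 1/w linearly, then divide. A small Python sketch under that assumption (the function name and the scalar attribute are illustrative, not taken from shadeSun.shd):

```python
def perspective_correct(bary, attrs, ws):
    """Perspective-correct interpolation of a scalar vertex attribute.

    bary  -- screen-space barycentric coordinates of the pixel
    attrs -- attribute value at each of the three vertices
    ws    -- clip-space w at each of the three vertices
    """
    inv_w = [1.0 / w for w in ws]
    interp_inv_w = sum(b * iw for b, iw in zip(bary, inv_w))
    interp_attr_over_w = sum(b * a * iw for b, a, iw in zip(bary, attrs, inv_w))
    return interp_attr_over_w / interp_inv_w
```

When all three w values are equal this reduces to plain linear interpolation; otherwise the result is pulled towards the vertices with smaller w, that is, the ones closer to the camera.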
  • 22. Visibility Buffer – Shading pseudocode • For each pixel in screen-space we do: 1. Get drawID/triangleID at pixel pos 2. Load data for the 3 vertices from the VB 3. Compute the partial derivatives of the barycentric coordinates – triangle gradients 4. Interpolate vertex attributes at pixelpos using gradients (could do triangle / object-space lighting) a) Attribs use w from position to compute perspective correct interpolation b) MVP matrix is applied to position 5. We have all data ready: shade and calculate final color -> we skip the shading source code here … just Blinn-Phong
  • 23. Visibility Buffer - Benefits • Better decouples visibility from shading • Calculating derivatives can be done separately from the shading phase (at the moment we include it there) • We can then shade with a different frequency or quality • Improves memory efficiency • Improves cache utilization • Memory accesses are highly coherent -> high cache hit rates • A G-Buffer needs to store data per screen-space pixel. Compared to a vertex / index buffer some of this data is redundant -> we can see 99% L2 cache hits for the Visibility Buffer for textures, vertex and index buffers
  • 24. Visibility Buffer - Benefits • Stores less data for complex lighting models like e.g. PBR compared to a G-Buffer • PBR data for Visibility Buffer is a struct in constant memory indexed by material id in vertex structure • This struct holds indices into a texture array for various PBR textures • It also holds per-material descriptions of what is necessary to drive BRDF • Any data that changes per-pixel is stored in textures that are referenced by the struct • Decouples G-buffer footprint from screen resolution • Improves performance at high resolutions: 2K, 4K, MSAA … • Improves performance on bandwidth-limited platforms
  • 25. Visibility Buffer – Triangle Counts
      • San Miguel Scene: 8 Million triangles, 5 Million vertices
      • Triangles rendered after culling: 1.87 Million triangles main view, 2.40 Million triangles shadow map
      • Modern Game in 2016: 1.8 Million triangles combined in Ultra
  • 26. How do you do lighting? • We calculate for each pixel the lighting based on the triangle data and therefore replace the whole rendering pipeline incl. the vertex shader, primitive assembly and rasterizer in the pixel shader • Because all the triangle data is available, Object-Space lighting should be possible -> the idea is that the whole triangle is lit in object-space and the result is cached
  • 27. What about Tessellation? •Following [Wihlidal][Brainerd] this is a pre-step before or in parallel to Triangle filtering
  • 28. Why didn’t we implement this earlier? •Two recent developments made the Visibility Buffer more attractive compared to a G-Buffer • DirectX 12 / Vulkan with ExecuteIndirect or multi draw indirect • Triangle culling / filtering [EDGE][Chajdas] [Wihlidal] … see the next slides …
  • 29. Cluster Culling / Triangle filtering
  • 30. Motivation • Polygonal complexity of games increases every year • Efficient triangle removal is an important aspect • 2 culling stages • Cluster culling : cull groups of triangles before sending them to the GPU (following [Chajdas] on the CPU; [Wihlidal] on GPU) • Triangle Filtering: cull individual triangles after being sent to the GPU
  • 31. Cluster culling • Groups triangles in small chunks of 256 triangles with similar orientations • Chunks have a model matrix associated (they can be moved around) • Each chunk must pass a quick visibility test before being sent to the GPU: • Cone test
  • 32. Cone test for fast cluster culling Triangles Normals If the eye is in the safe area then we can NOT see any triangle because they are back-facing Exclusion volume Triangle cluster
  • 34. Exclusion volume Normals Then, negatively accumulate normal starting at cluster center
  • 35. Exclusion volume Normals Negatively accumulate second normal after the first one
  • 37. Exclusion volume Normals After accumulating the last one we have the starting point of the exclusion volume and the direction
  • 38. Exclusion volume This is the calculated exclusion volume If the eye is in this area then the eye can NOT see any triangle Most restrictive triangle planes used to calculate cone open angle Cone angle
  • 39. Cluster Culling • Effectiveness depends on the orientation of the faces in the cluster • The more similar the orientations, the bigger the exclusion / culling volume • Depending on the triangles, the exclusion volume cannot always be calculated • Invalid cluster for cluster culling -> just pass it! Invalid cluster -> cluster culling not possible
  • 40. Cluster Culling • Code can be found in TriangleFiltering.h • CreateClusters() // creates triangle clusters • AddRequest() // tests the triangle clusters
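The exclusion volume is a cheap stand-in for a per-triangle question: is the eye behind the plane of every triangle in the cluster? The Python sketch below answers that question by brute force. It is a reference check for illustration only, not the cone construction from the slides and not the code in TriangleFiltering.h, and it assumes counter-clockwise triangles are front-facing.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def cluster_is_backfacing(eye, triangles):
    """True if every triangle (a, b, c) in the cluster faces away from eye."""
    for a, b, c in triangles:
        normal = cross(sub(b, a), sub(c, a))
        if dot(normal, sub(eye, a)) > 0.0:  # eye is on the front side of this triangle
            return False
    return True
```

A cone test replaces the loop over up to 256 triangles with a single comparison, at the price of being conservative: a cluster that this reference check would reject can still pass the cone test and is then handled by per-triangle filtering.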
  • 41. Compute-based triangle filtering • Motivation: • Filter triangles before they go into the graphics pipeline • Use the unused compute units during graphics pipeline execution with async compute • Compute-based filter -> one triangle per thread • Degenerate triangle culling • Back-face culling • Frustum culling • Small primitives culling • Depth culling (requires coarse depth buffer) (not in this demo) • Triangle indices that pass these tests are appended to index buffer
  • 42. Compute-based triangle filtering
      • Degenerate triangle culling
      • Allows to cull invisible zero-area triangles
      • Cost: quick test (discard if at least two triangle indices are equal)
      • Effectiveness: low
      // in filterTriangles.shd
      #if ENABLE_CULL_INDEX
      if (indices[0] == indices[1] || indices[1] == indices[2] || indices[0] == indices[2]) {
        cull = true;
      }
      #endif
  • 43. Compute-based triangle filtering • Back-face culling • Allows to cull triangles that face away from the viewer • If tessellation is used must take into account max patch height • Cost: calculate the determinant of a 3x3 matrix [Olano] (homogeneous 2D coordinates) • Effectiveness: high (potentially cull 50% of the geometry)
  • 44. Compute-based triangle filtering
      // in filterTriangles.shd
      #if ENABLE_CULL_BACKFACE
      // Culling in homogeneous coordinates
      // Read: "Triangle Scan Conversion using 2D Homogeneous Coordinates"
      // by Marc Olano, Trey Greer
      // http://www.cs.unc.edu/~olano/papers/2dh-tri/2dh-tri.pdf
      float3x3 m = float3x3(vertices[0].xyw, vertices[1].xyw, vertices[2].xyw);
      if (cullBackFace)
        cull = cull || (determinant(m) > 0);
      #endif
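The back-face test is the sign of a 3x3 determinant built from the clip-space (x, y, w) of the three vertices. A Python sketch of the same test follows; it keeps the slide's convention that a positive determinant means back-facing, which depends on the winding order and projection used by the demo and may be the opposite in another engine.

```python
def det3(r0, r1, r2):
    """Determinant of the 3x3 matrix with rows r0, r1, r2."""
    return (r0[0] * (r1[1] * r2[2] - r1[2] * r2[1])
            - r0[1] * (r1[0] * r2[2] - r1[2] * r2[0])
            + r0[2] * (r1[0] * r2[1] - r1[1] * r2[0]))


def is_backfacing(xyw0, xyw1, xyw2):
    """Back-face test in 2D homogeneous coordinates; each argument is (x, y, w)."""
    return det3(xyw0, xyw1, xyw2) > 0.0
```

Working on (x, y, w) avoids the perspective divide, so the test stays valid for triangles with vertices behind the eye.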
  • 45. Compute-based triangle filtering
      • Near Plane clipping -> triangles behind the near clipping plane of the frustum need to be clipped
      // in filterTriangles.shd
      #if ENABLE_CULL_FRUSTUM || ENABLE_CULL_SMALL_PRIMITIVES
      int verticesInFrontOfNearPlane = 0;
      // Transform vertices[i].xy into normalized 0..1 screen space
      for (uint i = 0; i < 3; ++i) {
        if (vertices[i].w < 0) { // check for behind near clipping plane
          ++verticesInFrontOfNearPlane;
          // flip the w so that any triangle that straddles the plane won't be projected onto
          // two sides of the frustum
          vertices[i].w *= (-1.f);
        }
        vertices[i].xy /= vertices[i].w * 2;
        vertices[i].xy += float2(0.5, 0.5);
      }
      #endif
  • 46. Compute-based triangle filtering
      • Part 2 source code – triangle behind near clipping plane
      // in filterTriangles.shd
      if (verticesInFrontOfNearPlane == 3) {
        cull = true;
      }
  • 47. Compute-based triangle filtering • Frustum culling • Allows to cull triangles that are projected outside the clipping cube • Takes into account near and far planes • Cost: check if all vertices lie in the negative side of the clip-space cube • Effectiveness: medium-high (depends on the size of the scene and eye pos)
  • 48. Compute-based triangle filtering
      // in filterTriangles.shd
      #if ENABLE_CULL_FRUSTUM
      {
        if (verticesInFrontOfNearPlane == 3) {
          cull = true;
        }
        float minx = min(min(vertices[0].x, vertices[1].x), vertices[2].x);
        float miny = min(min(vertices[0].y, vertices[1].y), vertices[2].y);
        float maxx = max(max(vertices[0].x, vertices[1].x), vertices[2].x);
        float maxy = max(max(vertices[0].y, vertices[1].y), vertices[2].y);
        cull = cull || (maxx < 0) || (maxy < 0) || (minx > 1) || (miny > 1);
      }
      #endif
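The near-plane handling and the screen-space bounding-box test from the last slides combine into one small routine. This is a Python sketch of that combination, assuming clip-space input given as (x, y, w) with w != 0; the function names are illustrative.

```python
def to_screen01(clip_xyw):
    """Map a clip-space (x, y, w) vertex to 0..1 screen space; assumes w != 0."""
    x, y, w = clip_xyw
    behind_near = w < 0.0
    if behind_near:
        w = -w  # flip w, as on the slide, so straddling triangles stay on one side
    return (x / (w * 2.0) + 0.5, y / (w * 2.0) + 0.5), behind_near


def frustum_cull(clip_vertices):
    """True if the triangle is entirely behind the near plane or off screen."""
    projected = [to_screen01(v) for v in clip_vertices]
    if all(behind for _, behind in projected):
        return True
    xs = [p[0] for p, _ in projected]
    ys = [p[1] for p, _ in projected]
    return max(xs) < 0.0 or max(ys) < 0.0 or min(xs) > 1.0 or min(ys) > 1.0
```

The bounding-box comparison is conservative: a triangle whose box overlaps the screen is kept even if the triangle itself misses it, and the rasterizer discards it later.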
  • 49. Compute-based triangle filtering • Small-primitives culling • Allows to cull triangles that are too small to be seen • Triangles that do not touch any sample point after projection • Long and thin triangles that do not touch any sample are culled as well • More efficient use of hardware resources • Cost: triangle touches any subpixel samples • Effectiveness: medium (depends on the size of the triangles and screen res) Primitive-rate bound • only one primitive per cycle per tile can be scanned (see [Wihlidal]) • Very inefficient use of rasterization units
  • 50. Compute-based triangle filtering • Implementation is a two-step approach • Test if one of the vertices is outside the Guardband -> can’t be a small triangle • If all vertices of a triangle are inside the Guardband -> do the actual small size test
  • 51. Compute-based triangle filtering • Guard band clipping Usually part of rasterization: there is a guard band around the viewport • Medium gray triangles -> outside viewport -> rejected • Light gray triangles -> completely in the guard band -> rasterized and the pixels outside of the viewport are ignored by rasterizer • Dark triangle needs to be clipped
  • 52. Compute-based triangle filtering
      • Test if vertex outside the Guardband
      // in filterTriangles.shd
      int2 minBB = int2(1 << 30, 1 << 30);
      int2 maxBB = int2(-(1 << 30), -(1 << 30));
      bool insideGuardBand = true;
      for (uint i = 0; i < 3; ++i) {
        float2 screenSpacePositionFP = vertices[i].xy * windowSize;
        // Check if we would overflow after conversion
        if (screenSpacePositionFP.x < -(1 << 23) || screenSpacePositionFP.x > (1 << 23) ||
            screenSpacePositionFP.y < -(1 << 23) || screenSpacePositionFP.y > (1 << 23)) {
          insideGuardBand = false;
        }
        // scale based on distance from center to msaa sample point
        int2 screenSpacePosition = int2(screenSpacePositionFP * (SUBPIXEL_SAMPLES * samples));
        minBB = min(screenSpacePosition, minBB);
        maxBB = max(screenSpacePosition, maxBB);
      }
  • 53. Compute-based triangle filtering
      • Is the triangle of “small” size?
      // in filterTriangles.shd
      if (insideGuardBand) {
        const uint SUBPIXEL_SAMPLE_CENTER = SUBPIXEL_SAMPLES / 2;
        const uint SUBPIXEL_SAMPLE_SIZE = SUBPIXEL_SAMPLES - 1;
        /* Test is: Is the minimum of the bounding box right or above the sample point and is the width
           less than the pixel width in samples in one direction.
           This will also cull very long triangles which fall between multiple samples. */
        cull = cull || any(((minBB & SUBPIXEL_MASK) > SUBPIXEL_SAMPLE_CENTER) &&
                           ((maxBB - ((minBB & ~SUBPIXEL_MASK) + SUBPIXEL_SAMPLE_CENTER)) < (SUBPIXEL_SAMPLE_SIZE)));
      }
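The fixed-point shader test is compact but hard to read. The idea behind it can be shown with a simplified floating-point version in Python. This sketch assumes a single sample at each pixel center (no MSAA) and vertex positions already in pixel units; it is not a port of the shader's SUBPIXEL_MASK arithmetic.

```python
import math


def misses_all_pixel_centers(v0, v1, v2):
    """True if, on some axis, the triangle's bounding box contains no pixel center."""
    for axis in (0, 1):
        lo = min(v0[axis], v1[axis], v2[axis])
        hi = max(v0[axis], v1[axis], v2[axis])
        # Pixel centers sit at k + 0.5 for integer k; one of them lies in
        # [lo, hi] exactly when ceil(lo - 0.5) <= floor(hi - 0.5).
        if math.ceil(lo - 0.5) > math.floor(hi - 0.5):
            return True
    return False
```

Because the check runs per axis, a long, thin triangle that slips between two columns (or rows) of sample points is culled no matter how long it is, which matches the behaviour described on the slide.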
  • 54. Compute-based triangle filtering • Depth culling (not used in this demo) • Allows to cull triangles that are occluded by the scene • This test requires a coarse depth buffer • Cost: load depth values from map and check triangle/BB intersection • Effectiveness: medium-high (depends on scene complexity and the size of the triangles) Can be generated by • Downsampling previous z-buffer and reprojecting depths • Rendering selected LOD geometry at low res
  • 55. Compute-based triangle filtering
      • Pipeline: culling tests (degenerate triangles, back-faces, frustum, small primitives, depth) -> culled results -> compaction -> draw this using multi draw indirect / ExecuteIndirect
      • Triangle filtering is executed on groups of 256 triangles (one batch), which can leave empty draws
      • Draw batch compaction to the rescue
          • Can be run in parallel in a compute shader
          • Eliminates empty draws from the multi indirect draw buffer
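Draw batch compaction removes the batches whose filtered index count is zero. The slide only states that it runs in a compute shader; the prefix-sum formulation below is an assumption chosen because it maps well to parallel hardware (an atomic counter is another common option). Sketched in Python:

```python
from itertools import accumulate


def compact_draws(index_counts):
    """Return the indices of the non-empty draw batches, in their original order."""
    flags = [1 if count > 0 else 0 for count in index_counts]
    # Exclusive prefix sum: each surviving batch learns its output slot.
    slots = [0] + list(accumulate(flags))[:-1]
    compacted = [None] * sum(flags)
    for batch, (flag, slot) in enumerate(zip(flags, slots)):
        if flag:
            compacted[slot] = batch
    return compacted
```

Every batch can write its own slot independently once the scan is done, so the final loop has no ordering dependency between batches.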
  • 56. Compute-based triangle filtering - Frame pseudocode 1. [CPU] Early discard invisible geometry using triangle cluster culling 2. [CS] Generate unculled indices and multi draw indirect buffers using triangle filtering (one triangle per thread) 3. Like before
  • 57. Triangle cluster culling / filtering – Draw Calls • For this static scene one large vertex buffer and an index buffer generated by triangle culling and filtering is used • Draw batches that hold a block of geometry each for one material -> Only two “materials” opaque and alpha masked, transparent objects and other materials would go into the same buffer • For dynamic objects we would use a dedicated VB/IB pair for each; this is optional
  • 58. Triangle cluster culling / filtering – Draw Calls • ExecuteIndirect makes it possible to completely defer the workload to the GPU, so that the actual work items can be batched in a more coherent way • Grouping multiple draw calls into a single one • reduces CPU overhead -> only a single Graphics API call • reduces GPU overhead (alleviates pressure on VS, CP and triangle assembly) by letting the user provide an optimal workload through • culling indices == invisible triangles through async compute • culling draw calls -> the resulting indirect argument buffer only includes valid draw calls
  • 59. Triangle cluster culling / filtering – Draw Calls San Miguel Scene Number of Draw calls (ExecuteIndirect) • Shadow opaque 214 • Shadow alpha masked 59 • Main view opaque 200 • Main view alpha masked 60 • Dispatch calls for filtering 81
  • 60. Triangle cluster culling / filtering – Benefits • Allows to cull triangles before sending them to the graphics pipeline • Avoid overwhelming parts of the graphics pipeline (rasterizer) • Graphics pipeline is better utilized with the visible triangles (rasterizer efficiency, command processor,…) • Can make use of async compute to potentially overlap with the graphics pipeline
  • 61. Re-using triangle filtered results for multiple Views / Rendering Passes
  • 62. Motivation • Compute-based triangle filtering comes at a cost • For every triangle: load indices and vertices, transform vertices, append (lock) triangle data to index buffer • We came up with the idea to generate filtered data for several rendering passes like main view, shadows etc. • Use filtered data to cull the same triangle set from different views • Load indices/vertices, transform vertices only once for all views
  • 63. Motivation • Reduces the effectiveness of cluster culling / triangle filtering • harder to cull a cluster for N views • however, it was worth it
  • 64. Use filtered data to cull triangles from different views
      • The algorithm is generalized to test against N different views
      • Load indices / vertices once, transform vertices for every view
      Algorithm
      [CPU]           unculledClusters <- ClusterCulling(sceneObjs, views)
      [GPU, Compute]  filteredIndicesArray <- TriangleCulling(unculledClusters, views)
      [GPU, Graphics] ShadowPass( filteredIndices[shadowView] )
      [GPU, Graphics] MainPass1( filteredIndices[mainView] )
      [GPU, Graphics] MainPass2( filteredIndices[mainView] )
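The point of the multi-view variant is that each triangle is loaded once and then tested against every view, producing one filtered index list per view. A minimal Python sketch of that loop structure, where `view_tests` is a hypothetical mapping from a view name to a visibility predicate standing in for the per-view culling tests:

```python
def filter_for_views(triangles, view_tests):
    """Return {view name: list of indices of triangles visible in that view}."""
    filtered = {name: [] for name in view_tests}
    for tri_index, triangle in enumerate(triangles):  # load each triangle once
        for name, is_visible in view_tests.items():   # test it against every view
            if is_visible(triangle):
                filtered[name].append(tri_index)
    return filtered
```

On the GPU the inner loop would transform the vertices with each view's matrix and append to that view's index buffer; the saving comes from sharing the index and vertex loads across views.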
  • 65. Adding re-usage of triangle filtered data - Frame pseudocode 1. [CPU] Early discard geometry not visible from any view using cluster culling 2. [CS] Generate N index and N multi draw indirect / ExecuteIndirect buffers using triangle filtering testing against the N views (one triangle per thread) 3. For each i view use (ith index buffer and ith ExecuteIndirect buffer): 1. [Gfx] Clear visibility and depth buffers 2. [VS,PS] Visibility buffer pass [PS] Output triangle / instance IDs 3. [PS] Interpolate attributes from gradients and shade pixel * Using a dedicated Visibility Buffer for shadow pass is overkill, but you can still use the filtered data for it.
  • 66. Results: San Miguel Scene, average for main view (8 Million triangles, 5 Million vertices)
      • Total triangles: 8,010,146
      • Rendered: 851,517 (10.6%)
      • Culled: 7,158,629 (89.4%)
      • Culled by test: Back-face 3,185,203 (39.8%), Frustum 5,244,787 (65.5%), Small primitives 1,950,030 (24.4%)
  • 67. GPU Culling Shadow Map Fill VB HDAO Shade VB Resolve MSAA UI Overall Visibility Buffer 4k – No MSAA AMD RADEON R9 380 2.66 1.42 5.07 2.51 3.43 - 0.02 15.19 NVIDIA GeForce GTX 970 3.46 1.12 3.36 1.82 3.29 - 0.02 13.52 Visibility Buffer 4k – No MSAA No Culling AMD RADEON R9 380 - 5.18 9.25 2.51 3.43 - 0.02 20.45 NVIDIA GeForce GTX 970 - 3.74 5.31 1.79 3.25 - 0.02 14.39 Visibility Buffer 4k – 2x MSAA AMD RADEON R9 380 2.72 1.42 8.27 3.65 8.27 2.29 0.03 25.87 NVIDIA GeForce GTX 970 3.34 1.09 6.17 3.58 6.17 0.34 0.02 19.87 Visibility Buffer 4k – 4x MSAA AMD RADEON R9 380 2.70 1.43 8.73 5.70 15.58 3.61 0.03 37.86 NVIDIA GeForce GTX 970 3.37 1.11 7.86 6.86 12.35 0.87 0.02 32.68 Visibility Buffer 3840 x2160
  • 68. GPU Culling Shadow Map Fill Buffer HDAO Shade Buffer Resolve MSAA UI Overall Deferred Shading 4k – No MSAA AMD RADEON R9 380 2.67 1.42 12.19 2.51 1.29 - 0.03 20.19 NVIDIA GeForce GTX 970 3.36 1.20 9.04 1.79 1.21 - 0.02 16.82 Deferred Shading 4k – No MSAA No Culling AMD RADEON R9 380 - 5.18 15.00 2.51 1.29 - 0.03 24.06 NVIDIA GeForce GTX 970 - 3.75 10.44 1.79 1.22 - 0.02 17.49 Deferred Shading 4k – 2x MSAA AMD RADEON R9 380 2.70 1.42 21.65 3.65 10.86 2.29 0.03 42.68 NVIDIA GeForce GTX 970 3.35 1.10 15.36 3.59 2.41 0.34 0.02 26.27 Deferred Shading 4k – 4x MSAA AMD RADEON R9 380 2.72 1.44 35.88 5.74 20.13 3.60 0.02 69.64 NVIDIA GeForce GTX 970 3.40 1.18 30.29 6.87 5.39 0.87 0.02 48.12 Deferred Shading 3840 x2160
  • 69. GPU Culling Shadow Map Fill VB HDAO Shade VB Resolve MSAA UI Overall Visibility Buffer 1080p – No MSAA Xbox One 7.19 3.32 3.90 1.73 3.90 - 0.02 19.78 Visibility Buffer 1080p – No MSAA No Culling Xbox One - 9.16 9.09 1.73 3.89 - 0.02 23.98 Visibility Buffer 1080p – 2x MSAA Xbox One 7.28 3.20 7.57 5.17 7.57 0.46 0.02 28.00 XBOX One - Visibility Buffer 1080p
  • 70. XBOX One - Deferred Shading 1080p
        Configuration             Culling  Shadow Map  Fill Buffer  HDAO  Shade Buffer  Resolve MSAA  UI    Overall
        No MSAA                   7.19     3.14        11.18        1.74  1.38          -             0.02  24.77
        No MSAA, No Culling       -        9.09        15.07        1.73  1.39          -             0.02  27.45
        2x MSAA                   7.21     3.18        21.85        5.18  8.46          0.47          0.02  46.41
  • 71. Summary
        Deferred Shading
          AMD RADEON R9 380         1080p  1440p  2160p
            No MSAA                 9.75   12.30  20.19
            No MSAA – No Culling    14.16  16.66  24.06
            2x MSAA                 16.16  23.09  42.68
            4x MSAA                 24.90  36.37  69.64
          NVIDIA GeForce GTX 970    1080p  1440p  2160p
            No MSAA                 8.72   10.67  16.82
            No MSAA – No Culling    10.30  11.91  17.49
            2x MSAA                 11.58  15.23  26.27
            4x MSAA                 17.00  24.76  48.12
          Xbox One                  1080p  1440p  2160p
            No MSAA                 24.77  -      -
            No MSAA – No Culling    27.45  -      -
            2x MSAA                 46.41  -      -
        Visibility Buffer
          AMD RADEON R9 380         1080p  1440p  2160p
            No MSAA                 8.57   10.72  15.19
            No MSAA – No Culling    14.52  15.86  20.45
            2x MSAA                 11.44  16.38  25.87
            4x MSAA                 15.27  20.82  37.86
          NVIDIA GeForce GTX 970    1080p  1440p  2160p
            No MSAA                 7.68   8.83   13.52
            No MSAA – No Culling    9.44   10.64  14.39
            2x MSAA                 9.63   12.21  19.87
            4x MSAA                 12.47  17.47  32.68
          Xbox One                  1080p  1440p  2160p
            No MSAA                 19.78  -      -
            No MSAA – No Culling    23.98  -      -
            2x MSAA                 28.00  -      -
  • 72. How about VR? • We are working on the StarVR SDK. StarVR uses a large field of view with very high resolution • The Visibility Buffer helps substantially with performance here • We cull and prepare the data for all views and the shadow map views in one go
  • 73. Executive Summary We built a rendering system that • Cluster-culls and filters triangles for different views (main view, shadow view, reflection view, GI view, etc.) at the same time • Uses the optimized triangles to fill a screen-space Visibility Buffer, or several Visibility Buffers for several views • Renders lights, shadows, and bounce lights with the optimized geometry based on visibility • Can differentiate between visibility of geometry and shading frequency • Can light per triangle or in so-called object space
  • 74. Future work • Re-use culled triangles over several frames • Use intrinsics for several parts of the pipeline [Chajdas Compaction] • Further improve asynchronous scheduling • Async compute is powerful
  • 75. Source Code • Source code: https://we.tl/MOdVkljptr • Probably later on GitHub if we can fit everything on there (1 GB limit)
  • 76. Credits • Christoph Schied – wrote an implementation of his paper with the OpenGL 4.5 run-time at our office • Confetti People • Marijn Tamis – wrote the initial OpenGL 4.5 run-time • Leroy Sikkes – wrote the initial DirectX 12 run-time and added hardware performance counters • Max Oomen (intern) – added linear lighting and fixed many bugs • Jesús Gumbau – added triangle filtering, came up with the idea of re-using filtered triangle data and implemented it, made it cross-platform (running on NVIDIA and AMD GPUs), and then brought it to DirectX 12 • Jordan Logan – brought it to XBOX One and optimized it for this console
  • 77. Credits • Most of the code for triangle culling and filtering is based on [Chajdas]. We added three features: • MSAA support for small triangle removal • Triangle in front of Near Plane culling • Multi-Viewport support
  • 78. Acknowledgements • Graham Wihlidal, DICE Frostbite • Nicolas Thibieroz, Gareth Thomas, Matthaeus Chajdas, Steven Tovey, AMD • Mike Acton, Insomniac • James McLaren, Q-Games • Remi Arnaud, Starbreeze / StarVR • Kev Gee, Microsoft
  • 79. References
        • [Burns] Christopher A. Burns, Warren A. Hunt, “The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading”, Journal of Computer Graphics Techniques (JCGT) 2:2 (2013), 55-69, http://jcgt.org/published/0002/02/04
        • [Chajdas] Matthaeus Chajdas, “GeometryFX”, http://gpuopen.com/gaming-product/geometryfx/
        • [Chajdas Compaction] Matthaeus Chajdas, “Fast compaction with mbcnt”, http://gpuopen.com/fast-compaction-with-mbcnt/
        • [Edge] Edge Library, PS3 SDK
        • [Engel2009] Wolfgang Engel, “Light Pre-Pass”, “Advances in Real-Time Rendering in 3D Graphics and Games”, SIGGRAPH 2009, http://halo.bungie.net/news/content.aspx?link=Siggraph_09
        • [Lagarde] Sebastien Lagarde, Charles de Rousiers, “Moving Frostbite to Physically Based Rendering”, Course notes, SIGGRAPH 2014
        • [Olano] Marc Olano, http://www.cs.unc.edu/~olano/papers/2dh-tri/2dh-tri.pdf
        • [Schied2015] Christoph Schied, Carsten Dachsbacher, “Deferred Attribute Interpolation for Memory-Efficient Deferred Shading”, http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf
        • [Schied2016] Christoph Schied, Carsten Dachsbacher, “Deferred Attribute Interpolation Shading”, GPU Pro 7, CRC Press / shorter and free version: http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf
        • [Wihlidal] Graham Wihlidal, “Optimizing the Graphics Pipeline with Compute”, GDC 2016, http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/

Editor's Notes

  • #10: The 1-bit alpha mask tells which buffer to look at: there is one buffer with alpha-masked geometry and one buffer with opaque geometry, and each one is drawn with a dedicated indirect command buffer. 8-bit drawID: we use multiDrawIndirect (or ExecuteIndirect in DirectX 12) to draw all batches at once; drawID identifies which of those draw batches the triangle belongs to. 23-bit triangleID: this identifies the i-th triangle in batch j. instanceID: depends on how we want to distribute between non-instanced and instanced geometry.
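To make the 1 + 8 + 23 bit split concrete, here is a minimal CPU-side sketch of packing and unpacking one 32-bit Visibility Buffer value. The bit order (alpha mask in the top bit, drawID next, triangleID in the low bits) and the function names `pack_visibility` / `unpack_visibility` are assumptions for illustration; the notes only give the field widths.

```python
# Assumed layout (not confirmed by the slides):
# bit 31 = alpha mask, bits 23..30 = drawID, bits 0..22 = triangleID.
DRAW_BITS = 8
TRI_BITS = 23


def pack_visibility(alpha_masked: bool, draw_id: int, triangle_id: int) -> int:
    """Pack the three fields into one 32-bit unsigned value."""
    if not 0 <= draw_id < (1 << DRAW_BITS):
        raise ValueError("drawID must fit in 8 bits")
    if not 0 <= triangle_id < (1 << TRI_BITS):
        raise ValueError("triangleID must fit in 23 bits")
    return (int(alpha_masked) << 31) | (draw_id << TRI_BITS) | triangle_id


def unpack_visibility(value: int):
    """Recover (alpha_masked, drawID, triangleID) from a packed value."""
    alpha_masked = bool((value >> 31) & 0x1)
    draw_id = (value >> TRI_BITS) & ((1 << DRAW_BITS) - 1)
    triangle_id = value & ((1 << TRI_BITS) - 1)
    return alpha_masked, draw_id, triangle_id
```

With this split a single indirect call can address at most 256 draw batches, and each batch at most 2^23 triangles, which is far more than the 256-triangle batches used in the demo.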
  • #15: Normals and tangents are packed octahedrally. An advantage of using a vertex-centric struct rather than a triangle-centric struct is that we can use vertex indexing: the index buffer keeps the vertices unique, as in indexed triangle lists. Basically, it is a vertex buffer mapped as an SRV. Material ID represents the texture set that is used; in PBR this would be a set of textures.
        Texture2D textures[] : register(t7, space0); // new in SM5.1
        Texture2D textures[512] : register(t7, space0); // XBOX One -> no unbounded array
    Overall data read in the worst case for shading the scene is 31 MB for textures and vertices (XBOX One, 1080p).
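For readers unfamiliar with octahedral packing, the following is a sketch of the standard octahedral mapping of a unit vector to two values in [-1, 1] and back. It is not the demo's shader code; the function names are made up here, and the quantization to a specific bit width (which the notes do not state) is left out.

```python
import math


def _sign_not_zero(v: float) -> float:
    """Sign function that returns +1 for zero, as the mapping requires."""
    return 1.0 if v >= 0.0 else -1.0


def oct_encode(n):
    """Map a unit vector (x, y, z) to a point (u, v) in [-1, 1]^2."""
    x, y, z = n
    l1 = abs(x) + abs(y) + abs(z)
    u, v = x / l1, y / l1
    if z < 0.0:
        # Fold the lower hemisphere over the diagonals of the square.
        u, v = (1.0 - abs(v)) * _sign_not_zero(u), (1.0 - abs(u)) * _sign_not_zero(v)
    return u, v


def oct_decode(e):
    """Inverse mapping: recover the unit vector from (u, v)."""
    u, v = e
    z = 1.0 - abs(u) - abs(v)
    x, y = u, v
    if z < 0.0:
        x, y = (1.0 - abs(v)) * _sign_not_zero(u), (1.0 - abs(u)) * _sign_not_zero(v)
    length = math.sqrt(x * x + y * y + z * z)
    return x / length, y / length, z / length
```

The appeal for a compact vertex struct is that two components store a full direction, so a normal and a tangent can each fit in one 32-bit word once the two values are quantized.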
  • #18: Not using rasterizer
  • #21: Not using rasterizer? Advantage?
  • #23: Not using rasterizer? Advantage?
  • #24: In the G-buffer you need to fetch all pixel data because it is significant. We added an indirection to that, so all pixels in a triangle hold the same value, which points to a small region in the VB. Example: a triangle that has one color.
  • #25: Why does it store less data for PBR? There is PBR data per material and per pixel. The index in the Visibility Buffer goes into a constant memory struct. This struct holds indices for the various PBR textures and per-material descriptions of what is necessary to drive the BRDF. Any data that changes per pixel is stored in textures that are referenced in that struct.
  • #29: We have one extra level of indirection compared to deferred; in fact, our approach needs bindless textures to work. That is now available with DirectX 12 / Vulkan in a generic way.
  • #31: Volume construction: I think he is just saying that, since the method we use is only an approximation of that paper, it might not be able to obtain the optimal / mathematically correct solution the paper describes. However, the worst case for the paper is O(n^3), so the solution we are using (which is always O(n)) might be better suited for real-time applications like games.
        ----------------------
        From Jesus: Hi Matthaeus, Thanks for pointing us to that paper. We think it might be worth implementing it, since producing more accurate cones could help the algorithm discard more geometry earlier (and more efficiently). Anyway, I was considering implementing the cluster culling part on the GPU as well. That way we might be able to introduce more accurate tests at cluster culling time (frustum culling / depth culling) that should allow us to discard more geometry earlier. Also, I think it is necessary since we want to be able to feed the culling system with dynamically tessellated geometry... What do you think? Cheers.
        ----------------
        From Matthaeus: Hi Jesus, absolutely – doing a pass over the clusters on the GPU would allow per-cluster frustum, depth, and back-face culling, and is going to be beneficial going forward – more so than the fine-grained culling, which can’t be made more efficient and which highly depends on the actual raster pipeline throughput. Basically the whole thing needs to be two-phase, coarse and fine-grained culling, and GeometryFX does the fine-grained culling pretty well (except for depth cull, but fine-grained depth cull is borderline on efficiency because you have one texture fetch per triangle), while coarse-grained culling is very basic. Extending the coarse-grained culling is going to provide the most !/$ going forward. Cheers, Matthäus
  • #45: It is in section 5.2 of this paper: it states that if the matrix composed of the homogeneous coordinates of the vertices has no inverse, then the triangle shouldn't be rendered. To see whether it has an inverse or not, you check the determinant; its sign also distinguishes front-facing from back-facing triangles. In which space is vertices[]? In eye space.
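The determinant test from this note can be sketched on the CPU as follows. Each vertex is given as homogeneous (x, y, w). The helper names and the sign convention (counter-clockwise triangles are front-facing, so a determinant less than or equal to zero is culled) are assumptions; the actual convention depends on winding order and handedness in the demo.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def is_back_facing_or_degenerate(v0, v1, v2):
    """v0..v2 are homogeneous (x, y, w) vertex positions.

    Assumed convention: positive determinant = front-facing.
    A zero determinant means the matrix has no inverse, i.e. the
    triangle is edge-on or degenerate and covers no area.
    """
    return det3((v0, v1, v2)) <= 0.0
```

Because the test works directly on (x, y, w), it needs no perspective divide, so it also behaves for triangles with vertices behind the eye.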
  • #49:
        // normalize to the 0..1 range
        vertices[i].xy /= vertices[i].w * 2;
        vertices[i].xy += float2(0.5, 0.5);
  • #53: We are testing to see whether the bounding box of the triangle contains any samples. Floating point wastes a lot of precision with its distribution, so we use a fixed-point representation that has a uniform distribution; that way we don't run into precision errors when we subtract two numbers. The compiler might cast the fixed-point number to floating point. The guard band is a way of detecting whether we are going to overflow when we go to 23.8 fixed point. If we overflow, we skip culling the triangle, because it is either outside the view frustum or it is a really big triangle. screenSpacePositionFP is in the 0..1 range. SUBPIXEL_SAMPLES converts the number to the fixed-point representation. samples is the scaling factor for MSAA, which conveniently for 2x, 4x and 8x is the number of samples per pixel. My idea for MSAA is that we scale the positions by the distance of the farthest sample from the center, so that 1 unit will occupy one pixel position. It will have false positives, since it does not consider the layout of the samples. No idea if it would be worth it to add the default sample patterns to the filtering. Might be something interesting to try.
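A rough CPU-side sketch of the small-primitive test described in this note is shown below. It is not the GeometryFX or demo shader: the function names, the way `samples` scales both axes, and the exact guard-band comparison are assumptions made to illustrate the idea (convert to 23.8 fixed point, build the bounding box, and cull when the box contains no sample center).

```python
import math

SUBPIXEL_BITS = 8
SUBPIXEL_SAMPLES = 1 << SUBPIXEL_BITS   # 256 fixed-point steps per grid cell
HALF_CELL = SUBPIXEL_SAMPLES // 2       # a cell center sits at .5 of a cell
GUARD_BAND = 1 << 23                    # integer range of the 23.8 format


def interval_covers_center(lo: int, hi: int) -> bool:
    """True if the closed fixed-point interval [lo, hi] contains a cell center."""
    # Index of the first center at or after lo: ceil((lo - HALF_CELL) / SUBPIXEL_SAMPLES).
    cell = -((HALF_CELL - lo) // SUBPIXEL_SAMPLES)
    return cell * SUBPIXEL_SAMPLES + HALF_CELL <= hi


def is_small_triangle(vertices01, width, height, samples=1):
    """vertices01: three (x, y) positions in the 0..1 screen range.

    Returns True when the bounding box misses every center of the
    (assumed regular) sample grid, so the triangle can be culled.
    """
    xs, ys = [], []
    for x, y in vertices01:
        px = x * width * samples
        py = y * height * samples
        if abs(px) >= GUARD_BAND or abs(py) >= GUARD_BAND:
            return False  # would overflow 23.8 fixed point: skip culling
        xs.append(math.floor(px * SUBPIXEL_SAMPLES))
        ys.append(math.floor(py * SUBPIXEL_SAMPLES))
    covered = interval_covers_center(min(xs), max(xs)) and \
        interval_covers_center(min(ys), max(ys))
    return not covered
```

Treating MSAA as a finer regular grid ignores the real sample pattern, which is the source of the inaccuracy the note mentions.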
  • #55: One of the main questions here is: do we do depth culling against triangles, triangle clusters, or something else? Matthaeus gave me a good response regarding depth culling on the GPU yesterday. He said doing the depth test at triangle level would be a little overkill, since we would have to sample the texture for every triangle, and that it was better to use depth culling to discard the whole cluster. That makes sense. However, what I understood from Graham's slides is that they did depth culling per triangle, since they classified depth culling at the same level as the other triangle tests. Moreover, he showed images where the character seemed to be depth-culled at triangle level.
  • #56: Triangles in batches are always compacted, since the compute shader appends unculled triangles one after the other inside a batch. That's OK; in fact, after culling maybe none of those batches has 256 triangles, because some were culled. Each batch is associated with the number of triangles in the batch and the starting index in the index buffer. For example: batch1 --> start index: 0, num indices: 12; batch2 --> start index: 12, num indices: 256 (this is full); batch3 --> start index: 268, num indices: 120. Compaction is done in a CS …
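The start indices in the example above are simply an exclusive prefix sum over the per-batch index counts. A small sketch (the function name `batch_ranges` is made up here, and the real compaction runs in a compute shader rather than on the CPU):

```python
from itertools import accumulate


def batch_ranges(num_indices_per_batch):
    """Return (start index, num indices) per batch for tightly packed batches."""
    totals = list(accumulate(num_indices_per_batch))
    starts = [0] + totals[:-1]          # exclusive prefix sum
    return list(zip(starts, num_indices_per_batch))
```

Feeding in the counts from the note, `batch_ranges([12, 256, 120])` yields the same three (start, count) pairs listed there.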
  • #58: In our scene we have 5M vertices (VB) and 21M indices (IB). However, we don't need to fully duplicate the IB if we do not want to draw it all at once. Yes, each vertex is 24 bytes (including position and compressed texcoords, tangent and normal). VB – 5 million vertices – 114 MB. IB – 21 million indices – 80 MB with 32-bit integers; we use 16-bit integers, since draw batches are limited to 256 triangles. We also have the draw arguments buffer for the multi-draw indirect, but that's small: that buffer contains the batch definitions in GPU memory, and it's only 5 uints per batch.
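The sizes quoted in this note can be reproduced with a few lines of arithmetic. The figures match if "MB" is read as 2^20 bytes, which is an assumption here; the helper name is illustrative, and the 16-bit index size is derived from the note rather than stated in it.

```python
MIB = 1024 * 1024  # assumption: the note's "MB" means 2**20 bytes


def buffer_size_mib(element_count: int, bytes_per_element: int) -> float:
    """Size of a tightly packed buffer in MiB."""
    return element_count * bytes_per_element / MIB


vb_mib = buffer_size_mib(5_000_000, 24)     # 24-byte vertices, about 114.4
ib32_mib = buffer_size_mib(21_000_000, 4)   # 32-bit indices, about 80.1
ib16_mib = buffer_size_mib(21_000_000, 2)   # 16-bit indices, half of the above
draw_args_bytes_per_batch = 5 * 4           # 5 uints of 4 bytes each
```

Switching to 16-bit indices halves the index buffer, which is possible because a batch of at most 256 triangles references at most 768 vertices.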
  • #59: ExecuteIndirect replaces / enhances DrawIndirect and DispatchIndirect. It might require Material Layer Merging: a single ExecuteIndirect for each set of shader combinations, to reduce state switching without breaking batching. It might allow Render Group Proxies: a collection of nearby meshes can be drawn as a batch, as shown in the Visibility Buffer demo. Other potential use cases: GPU skinning with async compute, tile-based lighting, GPU sorting, GPU particles.
  • #67: one triangle may be culled in different tests