Get Multicore Differentiation and Great Integrated Graphics Performance

The Game Tick thread runs at ~10 fps

Main Thread and Render Thread are running in sync

Game Tick builds the game state and passes it to the main thread

We calculate intersection with all the frustums at once

Intersection results are stored in a bitfield
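As an illustration of testing one object against several frustums and packing the results into a bitfield, here is a minimal sketch. It assumes bounding spheres and frustums given as lists of inward-facing planes; the names `Plane`, `visibility_mask` and `axis_box_frustum` are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    # Plane nx*x + ny*y + nz*z + d = 0 with the normal pointing into the frustum.
    nx: float
    ny: float
    nz: float
    d: float

    def signed_distance(self, point):
        x, y, z = point
        return self.nx * x + self.ny * y + self.nz * z + self.d


def sphere_in_frustum(center, radius, planes):
    """Conservative test: reject only if the sphere is fully behind some plane."""
    return all(p.signed_distance(center) >= -radius for p in planes)


def visibility_mask(center, radius, frustums):
    """Return a bitfield in which bit i is set if the sphere touches frustum i."""
    mask = 0
    for i, planes in enumerate(frustums):
        if sphere_in_frustum(center, radius, planes):
            mask |= 1 << i
    return mask


def axis_box_frustum(lo, hi):
    """Hypothetical helper: an axis-aligned box as six inward-facing planes."""
    return [
        Plane(1, 0, 0, -lo[0]), Plane(-1, 0, 0, hi[0]),
        Plane(0, 1, 0, -lo[1]), Plane(0, -1, 0, hi[1]),
        Plane(0, 0, 1, -lo[2]), Plane(0, 0, -1, hi[2]),
    ]
```

A later stage (for example a shadow cascade) can then check its own bit with `mask & (1 << frustum_index)` instead of repeating the intersection test.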

We calculate the area of the projected bounding box and cull small models

Then we select the LOD level based on the
approximated average triangle size on screen
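The two steps above (small-model culling and LOD selection) could be sketched roughly as follows. This assumes the bounding box has already been projected to a screen-space rectangle in pixels; the thresholds `min_area` and `min_triangle_area` and the function names are illustrative assumptions, not values from the talk.

```python
def projected_area(rect):
    """Area in pixels of a screen-space rectangle (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = rect
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def select_lod(rect, lod_triangle_counts, min_area=16.0, min_triangle_area=4.0):
    """Return None if the model is culled, else the chosen LOD index.

    lod_triangle_counts is ordered from most detailed (index 0) to least.
    The average triangle size is approximated as projected area / triangle count.
    """
    area = projected_area(rect)
    if area < min_area:
        return None  # too small on screen to be worth drawing
    for lod, triangles in enumerate(lod_triangle_counts):
        if area / triangles >= min_triangle_area:
            return lod  # most detailed LOD whose triangles are still large enough
    return len(lod_triangle_counts) - 1  # fall back to the coarsest LOD
```

For a model with LODs of 10000, 2500 and 500 triangles, a 100x100 pixel footprint selects LOD 1 under these assumed thresholds, because LOD 0 would average only one pixel per triangle.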

Next step is building instance data for the individual parts

Based on the mesh, the instance data and the material,
we figure out which pipeline stages the mesh needs to be rendered in.

Each pipeline stage uses the same instance data.

Every single mesh has a number of instance lists.
One instance list per pipeline stage.
The instance is added to all the relevant instance lists.
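A minimal sketch of this layout, assuming three made-up pipeline stages and a made-up rule for deriving stages from a material (`Stage`, `stages_for` and the material keys are all hypothetical):

```python
from enum import IntFlag


class Stage(IntFlag):
    # Hypothetical pipeline stages, one bit each.
    SHADOW = 1
    GBUFFER = 2
    TRANSPARENT = 4


class Mesh:
    def __init__(self, name):
        self.name = name
        # One instance list per pipeline stage.
        self.instance_lists = {stage: [] for stage in Stage}


def stages_for(material):
    """Assumed rule: shadow casters go to SHADOW; opaque vs transparent picks the rest."""
    mask = Stage(0)
    if material.get("casts_shadow", True):
        mask |= Stage.SHADOW
    mask |= Stage.TRANSPARENT if material.get("transparent", False) else Stage.GBUFFER
    return mask


def add_instance(mesh, instance, material):
    """Append the same instance object to every relevant stage list."""
    mask = stages_for(material)
    for stage in Stage:
        if stage & mask:
            mesh.instance_lists[stage].append(instance)
    return mask
```

Because every list holds a reference to the same instance object, the stages share one copy of the instance data rather than each building their own.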

The instance lists are duplicated for every worker thread to ensure lockless access.

They are also double buffered, because while the render thread
is rendering the current frame, the main thread (and its workers)
are adding instances to the next frame’s instance lists.
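The per-worker duplication and the double buffering can be pictured with a small sketch. The class below is an assumption about how such a container might look for one mesh and one pipeline stage; `FrameInstanceLists` and its methods are hypothetical names.

```python
class FrameInstanceLists:
    """Per-worker, double-buffered instance lists for one mesh and one stage."""

    def __init__(self, worker_count):
        # buffers[b][w] is the list that worker w writes while buffer b is active.
        self.buffers = [[[] for _ in range(worker_count)] for _ in range(2)]
        self.write_index = 0

    def add(self, worker_id, instance):
        # Each worker appends only to its own list, so no lock is required.
        self.buffers[self.write_index][worker_id].append(instance)

    def read_lists(self):
        # The render thread reads the buffer that is not currently being written.
        return self.buffers[1 - self.write_index]

    def flip(self):
        # Called once per frame at the sync point between main and render thread.
        self.write_index = 1 - self.write_index
        for per_worker in self.buffers[self.write_index]:
            per_worker.clear()
```

In this sketch the only synchronization is the per-frame `flip()`: between flips, writers and the reader never touch the same list.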

We have one mesh list per mesh type.
Every mesh with at least a single instance in the frame
will get added to exactly one mesh list matching its mesh type.

Just like the instance lists, we have one mesh list for every single worker thread.

We start by processing the mesh lists
We run one task per mesh type

Per-thread mesh lists are combined into a single list of meshes

Then we process all the meshes in the list one by one

Each mesh has instance lists per pipeline stage

We start with the first pipeline stage and
combine the instances there into a single list

Then process all subsequent stages sequentially

Instance data is now uploaded to the GPU and prepared for batching
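The preparation steps above can be sketched as one task per mesh type. The data layout is an assumption: `instance_lists[mesh][stage]` is taken to be a list of per-worker lists, and meshes are deduplicated on merge in case several workers registered the same mesh (the slides do not say whether that can happen). The GPU upload itself is left out.

```python
def merge_mesh_lists(per_worker_mesh_lists):
    """Combine per-worker mesh lists into one list, first-seen order, no duplicates."""
    merged, seen = [], set()
    for mesh_list in per_worker_mesh_lists:
        for mesh in mesh_list:
            if mesh not in seen:
                seen.add(mesh)
                merged.append(mesh)
    return merged


def prepare_mesh_type(per_worker_mesh_lists, instance_lists, stages):
    """One preparation task: returns {mesh: {stage: combined instance list}}."""
    prepared = {}
    for mesh in merge_mesh_lists(per_worker_mesh_lists):
        prepared[mesh] = {}
        for stage in stages:  # stages are handled sequentially, in pipeline order
            per_worker = instance_lists[mesh].get(stage, [])
            # Flatten the per-worker lists into a single list for this stage.
            prepared[mesh][stage] = [inst for worker in per_worker for inst in worker]
    return prepared
```

The flattened list per stage is the point at which a real renderer would copy the data into a GPU buffer and build batches.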

The actual rendering follows the different pipeline stages

Each stage renders a selection of mesh types

The meshes are prepared by the render thread workers

We have to start by waiting for the preparation tasks to be finished

The tasks process all meshes/instances for all pipeline stages
We can use an atomic counter per pipeline stage

No need to further split the tasks
We just increased the granularity of waiting for other sub-tasks

Usually everything is ready after the short initial wait
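To illustrate waiting per pipeline stage rather than for whole tasks, here is a rough sketch. Python has no atomic integer type, so a lock-protected counter with a condition variable stands in for the atomic counter mentioned above; `StageCounters` and `run_frame` are hypothetical names, and "rendering" is reduced to a log entry.

```python
import threading


class StageCounters:
    """One countdown per pipeline stage; a stage is ready when it reaches zero."""

    def __init__(self, stages, task_count):
        self._cond = threading.Condition()
        self._remaining = {stage: task_count for stage in stages}

    def stage_done(self, stage):
        with self._cond:
            self._remaining[stage] -= 1
            if self._remaining[stage] == 0:
                self._cond.notify_all()

    def wait_for(self, stage):
        with self._cond:
            self._cond.wait_for(lambda: self._remaining[stage] == 0)


def run_frame(stages, mesh_types):
    counters = StageCounters(stages, task_count=len(mesh_types))
    log, log_lock = [], threading.Lock()

    def prepare(mesh_type):
        # One task per mesh type walks through all stages in order.
        for stage in stages:
            with log_lock:
                log.append(("prepared", stage, mesh_type))
            counters.stage_done(stage)

    workers = [threading.Thread(target=prepare, args=(m,)) for m in mesh_types]
    for w in workers:
        w.start()
    for stage in stages:
        counters.wait_for(stage)  # wait only for this stage, not for whole tasks
        with log_lock:
            log.append(("rendered", stage))
    for w in workers:
        w.join()
    return log
```

The render loop can start the first stage as soon as every task has finished that stage, while the same tasks are still preparing later stages.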

The basic building blocks of the system are emitters

Emitters are emitting a number of particles over time

Particles are often just sorted per emitter

Problems start happening when the emitters overlap

We sort all the particles together
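A small sketch of sorting across emitters instead of within each one, assuming particles are plain position tuples and that the desired order is back to front for alpha blending (the slides do not state the sort direction):

```python
def sort_back_to_front(emitters, camera_pos):
    """Flatten particles from all emitters and sort them farthest-first."""
    def dist_sq(p):
        # Squared distance is enough for ordering and avoids a square root.
        return sum((a - b) ** 2 for a, b in zip(p, camera_pos))

    particles = [p for emitter in emitters for p in emitter]
    return sorted(particles, key=dist_sq, reverse=True)
```

With two interleaved emitters, for example smoke at depths 1 and 5 and fire at depths 3 and 7, sorting each emitter separately cannot produce the correct 7, 5, 3, 1 order, whereas the combined sort does.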

Emission takes place on the CPU

Sorting is responsible for moving dead particles to one end of the GPU buffer

This makes uploading new particles trivial

When we run out of particle space, new particles overwrite the ones farthest from the camera
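One way these three properties could fall out of a single sort is sketched below, on the CPU for readability. It assumes a fixed-size buffer sorted farthest-first in which dead particles are treated as infinitely far away, so they gather at the front; this ordering, the `life` field and the function names are assumptions, not details confirmed by the talk.

```python
import math


def sort_key(particle, camera_pos):
    """Dead particles sort as infinitely far away; live ones by camera distance."""
    if particle["life"] <= 0.0:
        return math.inf
    return math.dist(particle["pos"], camera_pos)


def sort_buffer(buffer, camera_pos):
    # Farthest first: dead slots end up at the front, nearest particles at the back.
    buffer.sort(key=lambda p: sort_key(p, camera_pos), reverse=True)


def upload(buffer, new_particles):
    """Write new particles from the front: dead slots first, then the farthest live ones."""
    count = min(len(new_particles), len(buffer))
    buffer[:count] = new_particles[:count]
    return count
```

Under these assumptions the upload is a single contiguous write, and overflow needs no special case: once the dead slots are used up, the write simply continues into the most distant live particles. The buffer is re-sorted on the next frame.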

(for Total War: THREE KINGDOMS)
We were confident that order-independent transparency was the solution
