SPU Optimizations - Part 2

Introduction to SPU Optimizations
Part 2: An Optimization Example

Pål-Kristian Engstad
pal_engstad@naughtydog.com

March 5, 2010


Introduction

As seen in part 1 of these slides, the SPUs have a very powerful instruction set,
but how can we utilize this power efficiently?

Solution:

Analyze the problem.
Get C/C++ code working.
Optimize!


Analyzing for the SPU

Analyze memory requirements.
Partition your problem and data into sub-problems and sub-data until they
fit on a single SPU.
Re-arrange your data to facilitate a streaming model.

SPUs only have 256 kB local memory, so in most cases you need to partition
your input data to fit. Sometimes, this can be hard. If your input data is a
“spaghetti-tree” with pointers to pointers, etc., then your task will be very hard.

The SPU wants:

Data in contiguous memory chunks, preferably each larger than 128 bytes.
Data aligned to 128 byte boundaries, or at least to 16 byte boundaries.
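
As a rough illustration of the partitioning step, the sketch below sizes chunks so that a number of equally sized buffers fit in local memory. Only the 256 kB figure comes from the slides; the 64 kB reserve for code and stack, and all names (`kReservedBytes`, `itemsPerChunk`, `chunkCount`), are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Only the 256 kB local store size comes from the slides; the reserve for
// code, stack and other data is an assumed, illustrative figure.
const std::size_t kLocalStoreBytes = 256 * 1024;
const std::size_t kReservedBytes   = 64 * 1024;   // assumption
const std::size_t kPreferredAlign  = 128;         // bytes

// Round n down to a multiple of align (align must be a power of two).
inline std::size_t alignDown(std::size_t n, std::size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return n & ~(align - 1);
}

// Items per chunk when `buffers` equally sized buffers (2 for double
// buffering) share the local store that is left after the reserve.
inline std::size_t itemsPerChunk(std::size_t itemBytes, std::size_t buffers)
{
    assert(itemBytes != 0 && buffers != 0);
    std::size_t bufferBytes =
        alignDown((kLocalStoreBytes - kReservedBytes) / buffers, kPreferredAlign);
    return bufferBytes / itemBytes;
}

// Number of chunks needed to stream `count` items through the SPU.
inline std::size_t chunkCount(std::size_t count, std::size_t perChunk)
{
    assert(perChunk != 0);
    return (count + perChunk - 1) / perChunk;
}
```

With 144-byte items and double buffering, `itemsPerChunk(144, 2)` yields 682 items per buffer under these assumed numbers; a real budget depends on how much local memory the code itself needs.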


Optimize for the SPU

First things first. We have to:

Get (C/C++) code working on a single SPU using synchronous memory
transfers.
Enable asynchronous memory transfers.
Enable asynchronous execution (multiple SPUs).

When everything works, optimize:

Parallelize inner loops, using intrinsics.
Translate inner loops to assembly.
Employ software pipelining.
Optimize setup, prologue and epilogue code.

Also test your loops on known input data to make sure they still produce
correct results.


Optimizing Example - Lighting Calculation

Given a vertex position $P_v$ and a light position $P_l$, we calculate the light
direction,

$$\hat{l} = \frac{P_l - P_v}{\|P_l - P_v\|} = \frac{L}{\|L\|}, \qquad (1)$$

where we use the distance to the light, $r = \|L\| = \sqrt{L \cdot L}$. The attenuation
factor is a function of the distance to the light, using two constants $\alpha_1$ and $\alpha_2$,

$$\alpha(r) = \mathrm{clamp}\left(\frac{\alpha_1}{r^2} - \alpha_2,\ 0,\ 1\right), \qquad (2)$$

and then finally get the color at that vertex, $V_c$, through

$$V_c = \alpha(r)\, \max(0,\ \hat{n}_v \cdot \hat{l})\, L_c, \qquad (3)$$

where $L_c$ is the $(r, g, b)$-color of the light source, and $\hat{n}_v$ is the vertex normal.



// Cg-function:
float3 lcalc(float3 vpos, float3 vnrm,
             float att0, float att1,
             float3 lpos, float3 lclr)
{
    float3 L = lpos - vpos;
    float r = sqrt(dot(L, L));
    float3 ldir = L / r;
    float at = saturate(att0 / (r*r) - att1);
    return at * max(0, dot(vnrm, ldir)) * lclr;
}



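A trivial re-arrangement to avoid two divides is to notice that, with
$d = L \cdot L$ and $r = \sqrt{d}$,

$$\frac{1}{r} = \frac{1}{\sqrt{d}} = \mathrm{rsqrt}(d), \qquad (4)$$

and

$$\frac{1}{r^2} = \frac{1}{(\sqrt{d})^2} = \left(\frac{1}{\sqrt{d}}\right)^2 = (\mathrm{rsqrt}(d))^2, \qquad (5)$$

thus: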



// Core function
float3 lcalc(float3 vpos, float3 vnrm,
             float att0, float att1,
             float3 lpos, float3 lclr)
{
    float3 L = lpos - vpos;
    float d = dot(L, L);        // squared distance to the light
    float irs = rsqrt(d);       // 1/r
    float3 ldir = irs * L;      // normalized light direction
    float at = min(att0 * irs * irs - att1, 1);
    return max(0, at) * max(0, dot(vnrm, ldir)) * lclr;
}
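
Since the rearranged function should give the same colors as the original one, the two are easy to check against each other on known input. The following is a portable sketch of such a check, not SPU code: `Vec3`, `sub`, `scale`, `lcalcRef`, `lcalcFast` and `nearlyEqual` are names made up here, and `1.0f / std::sqrt(d)` stands in for a hardware reciprocal square root.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };   // stand-in for the slides' float3

inline Vec3  sub(Vec3 a, Vec3 b)    { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
inline float dot(Vec3 a, Vec3 b)    { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3  scale(float k, Vec3 a) { Vec3 r = { k * a.x, k * a.y, k * a.z }; return r; }

// Reference: square root plus two divides, as in the Cg function.
inline Vec3 lcalcRef(Vec3 vpos, Vec3 vnrm, float att0, float att1, Vec3 lpos, Vec3 lclr)
{
    Vec3  L    = sub(lpos, vpos);
    float r    = std::sqrt(dot(L, L));
    Vec3  ldir = scale(1.0f / r, L);
    float at   = std::min(std::max(att0 / (r * r) - att1, 0.0f), 1.0f);  // saturate
    return scale(at * std::max(0.0f, dot(vnrm, ldir)), lclr);
}

// Rearranged: a single reciprocal square root, reused for 1/r and 1/r^2.
inline Vec3 lcalcFast(Vec3 vpos, Vec3 vnrm, float att0, float att1, Vec3 lpos, Vec3 lclr)
{
    Vec3  L    = sub(lpos, vpos);
    float irs  = 1.0f / std::sqrt(dot(L, L));
    Vec3  ldir = scale(irs, L);
    float at   = std::min(att0 * irs * irs - att1, 1.0f);
    return scale(std::max(0.0f, at) * std::max(0.0f, dot(vnrm, ldir)), lclr);
}

// Component-wise comparison with an absolute tolerance.
inline bool nearlyEqual(Vec3 a, Vec3 b, float tol)
{
    return std::fabs(a.x - b.x) <= tol &&
           std::fabs(a.y - b.y) <= tol &&
           std::fabs(a.z - b.z) <= tol;
}
```

A light two units above a vertex whose normal points at it, with att0 = 4 and att1 = 0.5, gives an attenuation of 0.5 in both versions, so the result is half the light color.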



We want to use the lcalc function to accumulate the light from several light
sources. So, our algorithm is:

foreach vertex do
    vertex.clr = float3(0,0,0);
    foreach light do
        vertex.clr += lcalc(vertex.pos, vertex.nrm,
                            light.att0, light.att1,
                            light.pos, light.clr)
    endfor
endfor



Let’s see how well GCC can do. We’ll define some structures:

#define ALIGN(x) __attribute__((aligned(x)))
#define INLINE inline __attribute__((always_inline))
#define VECRET /* __attribute__((vecreturn)) PPU only */
struct float3 { float x, y, z; } VECRET;
struct light { float3 pos; float att0;
               float3 clr; float att1; } ALIGN(16);
struct vertex { float3 pos; float dummy1;
                float3 nrm; float dummy2;
                float3 clr; float dummy3; } ALIGN(16);
INLINE float3 lcalc(vertex &v, light& l) // For convenience.
{ return lcalc(v.pos, v.nrm, l.att0, l.att1, l.pos, l.clr); }



Some math functions:

INLINE float3 mkvec3(float x, float y, float z)
{ float3 res = { x, y, z }; return res; }
INLINE float dot(float3 a, float3 b)
{ return a.x*b.x + a.y*b.y + a.z*b.z; }
INLINE float3 operator-(float3 a, float3 b)
{ return mkvec3(a.x-b.x, a.y-b.y, a.z-b.z); }
INLINE void operator+=(float3& a, float3 b)
{ a.x += b.x; a.y += b.y; a.z += b.z; }
INLINE float3 operator*(float k, float3 b)
{ return mkvec3(k*b.x, k*b.y, k*b.z); }
INLINE float rsqrt(float x) { return 1/sqrtf(x); }
INLINE float max(float a, float b) { return (a > b) ? a : b; }
INLINE float min(float a, float b) { return (a < b) ? a : b; }



Our loop now becomes:

void accLights(int nLights, light *light,
               int nVerts, vertex *vtx)
{
    for (int v = 0; v < nVerts; v++)
        for (int l = 0; l < nLights; l++)
            if (l == 0)
                vtx[v].clr = lcalc(vtx[v], light[l]);
            else
                vtx[v].clr += lcalc(vtx[v], light[l]);
}

Results: 231.6 cycles/vtx/light.



Where does GCC go wrong?

It inlines all code. [Good.]
Cycles per instruction = 2.07.
Lots of stalls, 50.7% due to dependencies.
No vectorization!

Let’s try to help the compiler.



Our loop now becomes:

void accLights(int nLights, light *light,
               int nVerts, vertex *vtx) {
    // Assumes nVerts is a multiple of 4.
    if (nLights > 0)
        for (int v = 0; v < nVerts; v+=4)
            for (int i = 0; i < 4; i++)
                vtx[v+i].clr = lcalc(vtx[v+i], light[0]);
    for (int l = 1; l < nLights; l++)
        for (int v = 0; v < nVerts; v+=4)
            for (int i = 0; i < 4; i++)
                vtx[v+i].clr += lcalc(vtx[v+i], light[l]);
}

Results: 208.5 cycles/vtx/light.



Let’s vectorize by hand. We will reorganize our vertexes to be groups of 4
vertexes organised such that 4 x-s are located in a single vector register:

typedef vector float real; // 4 real numbers
struct vec3 {
    real xs, ys, zs;
    vec3() {}
    vec3(real x, real y, real z) : xs(x), ys(y), zs(z) {}
};
struct vertex4 { vec3 pos, nrm, clr; };

Think of a hardware vector register as four real values, not as a vector. A vec3
is four (3-)vectors, and a vertex4 is four vertexes.
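
To make the layout concrete, here is a portable sketch of packing four ordinary vertexes into one such group. It uses plain float arrays instead of `vector float`, and the type and function names (`VertexAoS`, `Real4`, `Vec3x4`, `Vertex4`, `packVertex4`) are stand-ins chosen for this example rather than the types used on the slides.

```cpp
#include <cassert>
#include <cstddef>

struct Float3    { float x, y, z; };           // stand-in for the slides' float3
struct VertexAoS { Float3 pos, nrm, clr; };    // one vertex, array-of-structures
struct Real4     { float lane[4]; };           // stand-in for `vector float`
struct Vec3x4    { Real4 xs, ys, zs; };        // four 3-vectors (vec3 on the slides)
struct Vertex4   { Vec3x4 pos, nrm, clr; };    // four vertexes (vertex4 on the slides)

// Nine 16-byte vectors per group, assuming 4-byte floats.
static_assert(sizeof(Vertex4) == 0x90, "expected nine 16-byte vectors per group");

// Write one 3-vector into lane i of a group of four.
inline void setLane(Vec3x4& dst, std::size_t i, const Float3& src)
{
    dst.xs.lane[i] = src.x;
    dst.ys.lane[i] = src.y;
    dst.zs.lane[i] = src.z;
}

// Gather four consecutive vertexes so that their four x-s share one
// vector, their four y-s the next, and so on.
inline Vertex4 packVertex4(const VertexAoS* in)
{
    Vertex4 out = Vertex4();
    for (std::size_t i = 0; i < 4; ++i) {
        setLane(out.pos, i, in[i].pos);
        setLane(out.nrm, i, in[i].nrm);
        setLane(out.clr, i, in[i].clr);
    }
    return out;
}
```

After packing, lane `i` of every vector in the group belongs to input vertex `i`, which is what lets one vector instruction process four vertexes at once.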



As before, we need to define addition, subtraction, dot-product, scalar
multiplication, and a couple of other functions:

INLINE vec3 operator+(vec3 a, vec3 b)
{ return vec3(a.xs + b.xs, a.ys + b.ys, a.zs + b.zs); }
INLINE vec3 operator-(vec3 a, vec3 b)
{ return vec3(a.xs - b.xs, a.ys - b.ys, a.zs - b.zs); }
INLINE real dot(vec3 a, vec3 b)
{ return a.xs * b.xs + a.ys * b.ys + a.zs * b.zs; }
INLINE vec3 operator*(real as, vec3 b)
{ return vec3(as * b.xs, as * b.ys, as * b.zs); }
INLINE vec3 splat(float3 a)
{ return vec3(spu_splats(a.x), spu_splats(a.y), spu_splats(a.z)); }



Also, since the light values are the same for each of the four components, we
must “splat” the light components into the vector:

struct lightenv { // 8 vector floats.
    vec3 pos, clr;
    real att0s, att1s;
    lightenv(light in) {
        pos = splat(in.pos); clr = splat(in.clr);
        att0s = spu_splats(in.att0);
        att1s = spu_splats(in.att1);
    }
};



The core function:

// Light calculation for 4 vertexes.
// zeros and ones are assumed to be splatted constants (0.0f and 1.0f in every lane).
static INLINE vec3 lcalc4(vertex4 v, lightenv l)
{
    vec3 L = l.pos - v.pos;
    real Lsq = dot(L, L);
    real irs = rsqrtf4(Lsq);
    vec3 ldir = irs * L;
    real at = fminf4(l.att0s * irs * irs - l.att1s, ones);
    real dp = dot(v.nrm, ldir);
    return fmaxf4(zeros, at) * fmaxf4(zeros, dp) * l.clr;
}



The calling function:

void accLights(int nLights, light *light,
               int nVerts, vertex4 *vtx)
{
    for (int l = 0; l < nLights; l++) {
        lightenv lgt(light[l]);
        for (int v = 0; v < nVerts/4; v++) {
            vec3 clr = lcalc4(vtx[v], lgt);
            vtx[v].clr = clr + ((l==0) ? zerovec : vtx[v].clr);
        }
    }
}

Result: 31.8 cycles/vertex/light! This is a factor of 6.56 better than the
previous 208.5, but we can do much more...



It is more than 4 times better since, in addition to having vectorized the code (4
times speedup), the compiler no longer has to do awkward shuffles when loading
and storing values.

However, using spusim, we still see that more than 50% of the cycles are
wasted by dependency stalls. Of the rest, 76% are still single-issued. So, GCC
does a terrible job of optimizing our loop.

We can try to unroll the loop manually:



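for (int v = 0; v < nVerts/4; v+=4) {
    vec3 clr0 = lcalc4(vtx[v+0], lgt);
    // ... likewise clr1, clr2, clr3 for vtx[v+1], vtx[v+2], vtx[v+3]
    vtx[v+0].clr = clr0 + ((l==0) ? zerovec : vtx[v+0].clr);
    // ... likewise for vtx[v+1], vtx[v+2], vtx[v+3]
}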

Result: 15.0 cycles/vertex/light!! Pretty good, but spusim tells us that
although most stalls are gone (about 10% remain), 60% of the cycles are still
single-issue rather than dual-issue.



What is the minimum number of cycles in our loop? We have:

L = l.pos - v.pos // 3 fs
Lsq = dot(L, L) // fm + 2 fma
dp = dot(v.nrm, L) // fm + 2 fma
irs = rsqrtf4(Lsq) // frsqest+fi+(2fm+fnms+fma)
at = l.att0s * irs * irs - l.att1s // fm, fms
at = fminf4(at, ones) // fcgt, selb
m0 = fmaxf4(zeros, at) // fcgt, selb
m1 = fmaxf4(zeros, dp * irs) // fm, fcgt, selb
ms = m0 * m1 // fm
tmp = lclr * irs // 3 fm
clr = ms * tmp + pclr // 3 fma

So, adding one instruction for updating the loop counter, we have 31 even-pipe
instructions per iteration of four vertexes, and we should be able to
software pipeline this down to 31/4 = 7.75 cycles/vertex/light!
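
The tally behind that number can be written out as a small table. This is only a bookkeeping sketch: the names (`PipeCounts`, `kLoopBody`, `intervalFromCounts`) are invented here, and the odd column holds only the `frsqest` from the listing above, since loads, stores, shuffles and the branch are not part of that listing.

```cpp
#include <algorithm>
#include <cassert>

struct PipeCounts { int even; int odd; };

// Per-iteration instruction counts (four vertexes, one light), taken from
// the listing above. Odd-pipe loads, stores and the branch are not listed
// there and are left out here.
const PipeCounts kLoopBody[] = {
    { 3, 0 },  // L   = l.pos - v.pos          : 3 fs
    { 3, 0 },  // Lsq = dot(L, L)              : fm + 2 fma
    { 3, 0 },  // dp  = dot(v.nrm, L)          : fm + 2 fma
    { 5, 1 },  // irs = rsqrtf4(Lsq)           : fi, 2 fm, fnms, fma + frsqest (odd)
    { 2, 0 },  // at  = att0s*irs*irs - att1s  : fm, fms
    { 2, 0 },  // at  = fminf4(at, ones)       : fcgt, selb
    { 2, 0 },  // m0  = fmaxf4(zeros, at)      : fcgt, selb
    { 3, 0 },  // m1  = fmaxf4(zeros, dp*irs)  : fm, fcgt, selb
    { 1, 0 },  // ms  = m0 * m1                : fm
    { 3, 0 },  // tmp = lclr * irs             : 3 fm
    { 3, 0 },  // clr = ms * tmp + pclr        : 3 fma
    { 1, 0 },  // loop counter update          : ai
};
const int kLoopBodySize = sizeof(kLoopBody) / sizeof(kLoopBody[0]);

// Sum the per-operation counts for each pipe.
inline PipeCounts total(const PipeCounts* ops, int n)
{
    PipeCounts sum = { 0, 0 };
    for (int i = 0; i < n; ++i) {
        sum.even += ops[i].even;
        sum.odd  += ops[i].odd;
    }
    return sum;
}

// The busier pipe bounds how often a new iteration can start.
inline int intervalFromCounts(PipeCounts t) { return std::max(t.even, t.odd); }

// Each iteration handles several vertexes at once.
inline double cyclesPerVertex(int interval, int vertexesPerIteration)
{
    assert(vertexesPerIteration > 0);
    return static_cast<double>(interval) / vertexesPerIteration;
}
```

The even pipe carries 31 instructions, so as long as the odd pipe needs fewer than 31 slots per iteration, the even pipe sets the bound of 31 cycles per four vertexes.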



We can improve this loop a lot by using a process called software pipelining.
It is a fairly mechanical process:

Write out the instruction sequence in assembly.
Add up even instructions and odd instructions. The maximum is the II,
the initiation interval.
Also note (write down) the latency of each instruction.
Create a table with two columns and as many rows as the II.
Insert instructions from the top adhering to latency demands.
If you get to row II, then wrap around.
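
The steps above can be sketched as a small greedy scheduler. This is a simplified illustration under stated assumptions, not the tool used for these slides: it places instructions in program order, honors only the listed dependencies and latencies, and ignores register lifetimes and loop-carried dependencies (which is why the hand-made schedule needs renamed registers and moves). All type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum Pipe { Even = 0, Odd = 1 };

struct Instr {
    Pipe             pipe;     // which issue pipe the instruction uses
    int              latency;  // cycles before its result can be read
    std::vector<int> deps;     // indices of earlier instructions it reads from
};

struct Placement {
    int cycle;  // issue cycle, counted from the start of iteration 0
    int row;    // cycle % II: the row in the two-column table
    int stage;  // cycle / II: how many times the placement wrapped around
};

// II = the larger of the even and odd instruction counts.
inline int initiationInterval(const std::vector<Instr>& prog)
{
    int count[2] = { 0, 0 };
    for (std::size_t i = 0; i < prog.size(); ++i) ++count[prog[i].pipe];
    return std::max(1, std::max(count[0], count[1]));
}

// Place each instruction at the first cycle where its inputs are ready and
// its pipe's slot in row (cycle % ii) is still free, wrapping past row ii.
inline std::vector<Placement> schedule(const std::vector<Instr>& prog, int ii)
{
    assert(ii >= initiationInterval(prog));  // guarantees a free row exists
    std::vector<std::vector<bool> > used(2, std::vector<bool>(ii, false));
    std::vector<Placement> out;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        int cycle = 0;
        for (std::size_t k = 0; k < prog[i].deps.size(); ++k) {
            int d = prog[i].deps[k];
            assert(d >= 0 && static_cast<std::size_t>(d) < i);
            cycle = std::max(cycle, out[d].cycle + prog[d].latency);
        }
        while (used[prog[i].pipe][cycle % ii]) ++cycle;  // slot taken: next row
        used[prog[i].pipe][cycle % ii] = true;
        Placement p = { cycle, cycle % ii, cycle / ii };
        out.push_back(p);
    }
    return out;
}
```

For a chain of one odd load feeding three dependent even instructions, all with latency 6, the II is 3 and the even instructions land in rows 0, 1 and 2 at cycles 6, 13 and 20, each one wrapping around the table several times before its inputs are ready.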

In the following slides, we’ll show how we do it.



We’ll start with

vec3 L = l.pos - v.pos;

This is going to be translated (in assembly) to:

{o6} lqx vposx, pxptr, iofs ; vposx := mem[pxptr+iofs]
{o6} lqx vposy, pyptr, iofs ; vposy := mem[pyptr+iofs]
{o6} lqx vposz, pzptr, iofs ; vposz := mem[pzptr+iofs]
{e6} fs Lx, lposx, vposx ; even pipe, latency = 6.
{e6} fs Ly, lposy, vposy ; {text} are comments.
{e6} fs Lz, lposz, vposz

The reason we use the x-form is that we then will have only one offset variable,
simplifying our loop control. Of course, this means that we have to initialize
pxptr, pyptr and pzptr accordingly. All light variables must also be initialized.

Let’s start!


;;; Start of loop (page 1)
{nop} {o6:0} lqx vposx, pxptr, iofs; 0
{nop} {o6:0} lqx vposy, pyptr, iofs; 1
{nop} {o6:0} lqx vposz, pzptr, iofs; 2
{nop} {lnop} ; 3
{nop} {lnop} ; 4
{nop} {lnop} ; 5
{e6:0} fs Lx, lposx, vposx {lnop} ; 6
{e6:0} fs Ly, lposy, vposy {lnop} ; 7
{e6:0} fs Lz, lposz, vposz {lnop} ; 8
{nop} {lnop} ; 9
{nop} {lnop} ; 10
{nop} {lnop} ; 11
{nop} {lnop} ; 12
{nop} {lnop} ; 13
{nop} {lnop} ; 14
{nop} {lnop} ; 15



Then we just continue with

Lsq = dot(L, L)
dp = dot(v.nrm, L)

Translated:

{e6} fm Lsq, Lx, Lx ; Lsq = Lx*Lx
{e6} fma Lsq, Ly, Ly, Lsq ; Lsq = Ly*Ly + Lsq
{e6} fma Lsq, Lz, Lz, Lsq ; Lsq = Lz*Lz + Lsq
{e6} fm dp, vnrmx, Lx ; dp = Nx*Lx
{e6} fma dp, vnrmy, Ly, dp ; dp = Ny*Ly + dp
{e6} fma dp, vnrmz, Lz, dp ; dp = Nz*Lz + dp


{nop} {o6:0} lqx vnrmx, nxptr, iofs; 3
{nop} {o6:0} lqx vnrmy, nyptr, iofs; 4
{nop} {o6:0} lqx vnrmz, nzptr, iofs; 5
{e6:0} fs Lx, lposx, vposx {lnop} ; 6
{nop} {lnop} ; 9
{nop} {lnop} ; 10
{nop} {lnop} ; 11
{e6:0} fm Lsq, Lx, Lx {lnop} ; 12
{e6:0} fm dp, vnrmx, Lx {lnop} ; 13
{nop} {lnop} ; 14
{nop} {lnop} ; 15


;;; Continued (page 2)
{nop} {lnop} ; 16
{nop} {lnop} ; 17
{nop} {lnop} ; 18
{e6:0} fma Lsq, Ly, Ly, Lsq {lnop} ; 19
{e6:0} fma dp, vnrmy, Ly, dp {lnop} ; 20
{nop} {lnop} ; 21
{nop} {lnop} ; 22
{nop} {lnop} ; 23
{nop} {lnop} ; 24
{e6:0} fma Lsq_, Lz, Lz, Lsq {lnop} ; 25
{e6:0} fma dp_, vnrmz, Lz, dp {lnop} ; 26
{nop} {lnop} ; 27
{nop} {lnop} ; 28
{nop} {lnop} ; 29
{nop} {o1:-} brnz ..., looptop ; 30
;;; Must wrap around!

Notice that we rename the output variables from Lsq and dp to Lsq_ and dp_,
so that they don’t clash with registers when they are carried into the next loop.



The irs = rsqrtf4(Lsq) function is implemented using:

{o4} frsqest t0, Lsq
{e7} fi t1, Lsq, t0
{e6} fm t2, t1, Lsq
{e6} fm t3, t1, onehalf
{e6} fnms t2, t2, t1, one
{e6} fma irs, t3, t2, t1

After this sequence, irs is good to 24 bits precision. Notice the huge latency
(∆L = 4 + 7 + 6 + 6 + 6 + 6 = 35), so the result is ready the next iteration.
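
The last four instructions of that sequence are one Newton-Raphson step for 1/sqrt(x). A scalar sketch of the same arithmetic is shown below; the function name `refineRsqrt` is made up, and the starting estimate is passed in as a parameter because the hardware estimate produced by `frsqest` and `fi` has no portable equivalent.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for y ~ 1/sqrt(x), mirroring the
// fm / fm / fnms / fma sequence. Given an estimate y0 it returns
//   y1 = y0 + (y0 / 2) * (1 - x * y0 * y0).
inline float refineRsqrt(float x, float y0)
{
    float t2 = y0 * x;        // fm   t2, t1, Lsq
    float t3 = y0 * 0.5f;     // fm   t3, t1, onehalf
    t2 = 1.0f - t2 * y0;      // fnms t2, t2, t1, one   (one - t2*t1)
    return t3 * t2 + y0;      // fma  irs, t3, t2, t1
}
```

Starting from an estimate that is 4% off, for example 0.48 for x = 4, a single step lands within about 0.25% of the exact value 0.5, and an exact estimate is left unchanged.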


{nop} {o6:0} lqx vnrmz, nzptr, iofs; 5
{e6:0} fs Lx, lposx, vposx {o4:1} frsqest t0, Lsq_ ; 6
{nop} {lnop} ; 9
{e7:1} fi t1, Lsq_, t0 {lnop} ; 10
{nop} {lnop} ; 11
{nop} {lnop} ; 14
{nop} {lnop} ; 15

We annotate with :1 to indicate that these instructions are delayed into the
next iteration of the loop.


{nop} {lnop} ; 16
{nop} {lnop} ; 17
{e6:1} fm t2, t1, Lsq_ {lnop} ; 18
{nop} {lnop} ; 21
{nop} {lnop} ; 22
{nop} {lnop} ; 23
{e6:1} fm t3, t1, onehalf {lnop} ; 24
{e6:0} fma dp, vnrmz, Lz, dp {lnop} ; 26
{nop} {lnop} ; 27
{nop} {lnop} ; 28
{nop} {lnop} ; 29
{e6:1} fnms t2, t2, t1, one {o1:-} brnz ..., looptop ; 30


{e6:2} fma irs, t3, t2, t1 {o6:0} lqx vnrmz, nzptr, iofs; 5
{nop} {lnop} ; 9
{e7:1} fi t1, Lsq_, t0 {lnop} ; 10
{nop} {lnop} ; 11
{nop} {lnop} ; 14
{nop} {lnop} ; 15



Continuing with:

at = l.att0s * irs * irs - l.att1s // fm, fms
at = fminf4(at, ones) // fcgt, selb

This translates to:

{e6} fm ir, irs, irs
{e6} fms at, att0s, ir, att1s
{e2} fcgt c0, at, ones ; c0.w = at.w>1 ? 111..111 : 000..000
{e2} selb at, c0, at, ones
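
The compare-and-select idiom can be emulated on a single 32-bit lane in portable C++. This is a sketch: the operand order of `selb` below (mask first, then the two sources) is inferred from how the listings on these slides use it, and the helper names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float as its 32 bits, and back.
inline std::uint32_t bitsOf(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
inline float floatOf(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// fcgt on one lane: all ones where a > b, all zeros otherwise.
inline std::uint32_t fcgt(float a, float b) { return a > b ? 0xFFFFFFFFu : 0u; }

// Bitwise select as used in the listings: take bits of b where the mask
// is 1 and bits of a where it is 0.
inline float selb(std::uint32_t mask, float a, float b)
{
    return floatOf((bitsOf(a) & ~mask) | (bitsOf(b) & mask));
}

// min(a, b): if a > b pick b, else keep a.
inline float minViaSelect(float a, float b) { return selb(fcgt(a, b), a, b); }

// max(a, b): if b > a pick b, else keep a.
inline float maxViaSelect(float a, float b) { return selb(fcgt(b, a), a, b); }
```

Because the mask is either all ones or all zeros per lane, the select never mixes bits of the two inputs; it simply forwards one of them without a branch.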


{nop} {lnop} ; 9
{e7:1} fi t1, Lsq_, t0 {lnop} ; 10
{e6:2} fm ir, irs, irs {lnop} ; 11
{nop} {lnop} ; 14
{nop} {lnop} ; 15


{nop} {lnop} ; 16
{nop} {lnop} ; 17
{e6:1} fm t2, t1, Lsq_ {lnop} ; 18
{e6:2} fms at, att0s, ir, att1s {lnop} ; 20
{nop} {lnop} ; 22
{nop} {lnop} ; 23
{e2:2} fcgt c0, at, ones {lnop} ; 26
{e6:0} fma dp, vnrmz, Lz, dp {lnop} ; 27
{e2:2} selb at, c0, at, ones {lnop} ; 28
{nop} {lnop} ; 29



Continuing with:

m1 = fmaxf4(zeros, dp*irs) // fm, fcgt, selb
m0 = fmaxf4(zeros, at) // fcgt, selb
ms = m0 * m1 // fm

This translates to:

{e6} fm dprs, dp, irs
{e2} fcgt c1, zeros, dprs
{e2} selb m1, c1, dprs, zeros
{e2} fcgt c2, zeros, at
{e2} selb m0, c2, at, zeros


;;; Start of loop (page 1) : lmove dst, src == rotm dst, src, zeros
{nop} {lnop} ; 9
{e7:1} fi t1, Lsq_, t0 {lnop} ; 10
{e6:2} fm dprs, dp__, irs {lnop} ; 13
{e6:0} fm dp, vnrmx, Lx {o4:1} lmove dp__, dp_ ; 14
{nop} {lnop} ; 15


{nop} {lnop} ; 16
{nop} {lnop} ; 17
{e6:1} fm t2, t1, Lsq_ {lnop} ; 18
{e2:2} fcgt c1, zeros, dprs {lnop} ; 22
{nop} {lnop} ; 23
{e2:2} selb m1, c1, dprs, zeros {lnop} ; 29


{e2:3} fcgt c2, zeros, at {o6:0} lqx vposx, pxptr, iofs; 0
{e2:3} selb m0, c2, at, zeros {o6:0} lqx vposz, pzptr, iofs; 2
{e6:3} fm ms, m0, m1 {o6:0} lqx vnrmy, nyptr, iofs; 4
{nop} {lnop} ; 9
{e7:1} fi t1, Lsq_, t0 {lnop} ; 10
{e6:2} fm dprs, dp__, irs {lnop} ; 13
{nop} {lnop} ; 15



Finally, we can squeeze in our last 6 instructions:

tmp = lclr * irs // 3 fm
clr = ms * tmp + prev // 3 fma

Translation:

{e6} fm tmpr, lclrr, irs
{e6} fm tmpg, lclrg, irs
{e6} fm tmpb, lclrb, irs
{e6} fma clrr, ms, tmpr, prevr ; prev color
{e6} fma clrg, ms, tmpg, prevg
{e6} fma clrb, ms, tmpb, prevb


{e6:3} fm tmpr, lclrr, irs_ {o6:0} lqx vposy, pyptr, iofs; 1
{e6:3} fm tmpg, lclrg, irs_ {o6:0} lqx vnrmx, nxptr, iofs; 3
{e6:3} fm ms, m0, m1 {o6:0} lqx vnrmy, nyptr, iofs; 4
{e6:3} fm tmpb, lclrb, irs_ {lnop} ; 9
{e7:1} fi t1, Lsq_, t0 {lnop} ; 10
{e6:0} fm Lsq, Lx, Lx {o4:2} lmove irs_, irs ; 12
{e6:2} fm dprs, dp__, irs_ {lnop} ; 13
{e6:3} fma clrr, ms, tmpr, prevr {lnop} ; 15

Here, we needed to extend the lifetime of irs, thus introducing the move to
irs_ in iteration 2, in order to have a clean copy in iteration 3.


{nop} {lnop} ; 16
{e6:3} fma clrg, ms, tmpg, prevg {lnop} ; 17
{e6:1} fm t2, t1, Lsq_ {lnop} ; 18
{e2:2} fcgt c1, zeros, dprs {lnop} ; 22
{e6:3} fma clrb, ms, tmpb, prevb {lnop} ; 23



We’re almost done! We must finish up with a technique to select prevr, prevg
and prevb, which must be zero for the first light but hold the previous color for
all later lights. We can do this by using a different shuffle mask. For light 0, we’ll
use mask = m_0000, which creates all zeros, and mask = m_ABCD for all others.
So, for each of the color’s (R,G,B)-fields:

{o6} lqx vclrr, crptr, oofs ; load using output offset
{o4} shufb prevr, vclrr, vclrr, mask ; copy vclrr or set to zeros
;; use prevr
{o6} stqx clrr, crptr, oofs ; store using output offset


{e6:3} fm tmpr, lclrr, irs_ {o6:0} lqx vposy, pyptr, iofs; 1
{e6:3} fm tmpg, lclrg, irs_ {o6:0} lqx vnrmx, nxptr, iofs; 3
{e6:3} fm ms, m0, m1 {o6:3} lqx vclrr, crptr, oofs; 4
{e6:2} fma irs, t3, t2, t1 {o6:3} lqx vclrg, cgptr, oofs; 5
{e6:0} fs Ly, lposy, vposy {o6:3} lqx vclrb, cbptr, oofs; 7
{e6:0} fs Lz, lposz, vposz {o6:0} lqx vnrmy, nyptr, iofs; 8
{e6:3} fm tmpb, lclrb, irs_ {o6:0} lqx vnrmz, nzptr, iofs; 9
{e7:1} fi t1, Lsq_, t0 {o4:3} shufb prevr, vclrr, vclrr, mask; 10
{e6:2} fm ir, irs, irs {o4:3} shufb prevg, vclrg, vclrg, mask; 11
{e6:0} fm Lsq, Lx, Lx {o4:2} lmove irs_, irs ; 12
{e6:2} fm dprs, dp__, irs {o4:3} shufb prevb, vclrb, vclrb, mask; 13
{e6:3} fma clrr, ms, tmpr, prevr {lnop} ; 15


{nop} {lnop} ; 16
{e6:1} fm t2, t1, Lsq_ {lnop} ; 18
{e2:2} fcgt c1, zeros, dprs {o6:3} stqx clrr, crptr, oofs; 22
{e6:1} fm t3, t1, onehalf {o6:3} stqx clrg, cgptr, oofs; 24
{e2:2} selb at, c0, at, ones {o6:3} stqx clrb, cbptr, oofs; 28



We’re finally close to done. The last thing we need is to make sure our input
and output offsets are correct throughout the execution. As seen, we count
downwards by 0x90, which equals sizeof(vertex4), per loop.

So, we initialize iofs = (numVerts/4 - 1) * 0x90, and that should take
care of the input loads. For the output offsets we need a delay unit, initialized
such that the first 3 times we’ll overwrite the last vertex4 position. We’ll use
the “delay machine” shuffle m_bcdA for this.

{o4} shufb oofs, iofs, oofs, m_bcdA ; oofs.x <- oofs.y,
                                    ; oofs.y <- oofs.z,
                                    ; oofs.z <- oofs.w,
                                    ; oofs.w <- iofs.x



Example, (numVerts/4 = 4), at cycle 0, N=0x90:
Loop 0: iofs = 3N, oofs = { 3N, 3N, 3N, 3N }
Loop 4: iofs =-1N, oofs = { 2N, 1N, 0N,-1N }
Loop 5: iofs =-2N, oofs = { 1N, 0N,-1N,-2N }
Loop 6: iofs =-3N, oofs = { 0N,-1N,-2N,-3N }
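
The table can be reproduced by stepping the delay machine in plain C++. This is a sketch rather than SPU code: offsets are kept in units of N = 0x90, the names (`Offsets`, `initialOffsets`, `step`, `afterLoops`) are made up, and we assume each loop performs one `ai` on iofs followed by one m_bcdA shuffle.

```cpp
#include <array>
#include <cassert>

struct Offsets {
    int                iofs;  // input offset, in units of N = 0x90
    std::array<int, 4> oofs;  // output offset vector { x, y, z, w }
};

// State at the start of loop 0 for `groups` groups of four vertexes.
inline Offsets initialOffsets(int groups)
{
    assert(groups > 0);
    Offsets s;
    s.iofs = groups - 1;
    s.oofs.fill(groups - 1);
    return s;
}

// One loop: decrement iofs, then shift oofs left by one word and feed
// the new iofs into the last word (the m_bcdA shuffle).
inline Offsets step(Offsets s)
{
    s.iofs -= 1;
    std::array<int, 4> next = {{ s.oofs[1], s.oofs[2], s.oofs[3], s.iofs }};
    s.oofs = next;
    return s;
}

// State at the start of loop number `loops`.
inline Offsets afterLoops(int groups, int loops)
{
    Offsets s = initialOffsets(groups);
    for (int i = 0; i < loops; ++i) s = step(s);
    return s;
}
```

With four groups, the first word of oofs stays at 3N through the start of loop 3 and only then begins to count down, so the stores trail the loads by the pipeline depth.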


{e2:-} ai iofs, iofs, -0x90 {lnop} ; 16
{e6:1} fm t2, t1, Lsq_ {o4:3} shufb oofs, iofs, oofs, m_bcdA
{e2:2} fcgt c1, zeros, dprs {o6:3} stqx clrr, crptr, oofs; 22
{e6:1} fm t3, t1, onehalf {o6:3} stqx clrg, cgptr, oofs; 24
{e2:2} selb at, c0, at, ones {o6:3} stqx clrb, cbptr, oofs; 28
{e6:1} fnms t2, t2, t1, one {o1:3} brnz oofs, looptop ; 30



And now we are done! Well, not entirely: we’ll have to code the prologue (the
entry code) and the epilogue (exit code), test it and make sure it works. But it
will!

Looking at the code, there are no stalls. After four iterations, taking 4 × 31 = 124
cycles, the first four results are produced, and for every following interval of 31
cycles, another four results are produced. That means that our loop runs at
7.75 cycles/vertex/light!

Was this hard to do?

Fairly mechanical. We have tools that can schedule automatically!
Requires simple loops.
Usually a factor of 2-4 better than what unrolled GCC can achieve.
Our “frontend” tool automatically colors registers.
And it has some other nice features as well.
So: Just do it!



Improvements on algorithm:

The strange storage format can be changed to allow for regular
array-of-structures.
All we need to do is to add in transposing of input data.
9 loads turn into 12 loads.
Transposing 4 vector regs into 3 takes: 7o, 1e+6o, 2e+5o or 4e+4o.
Transposing 3 vector regs into 4 takes: 6o, 1e+5o, 2e+4o.
3 stores turn into 4 stores.
Overall addition: odd = 3 + 3 × 4 + 6 + 1 = 22, even = 3 × 4 = 12. Since
we had 10 odd slots for free: 22 − 10 = 12 extra cycles per iteration of four
vertexes, or 12/4 = 3.0 extra cycles/vertex/light.



Conclusion:

The original algorithm can be coded, without changing memory layouts, down
to 10.75 cycles/vertex/light, or 7.75 cycles/vertex/light with our proposed
change in memory layout.

If you remember nothing else, then remember that:

vectorization can give huge improvements without going to assembly,
and that
software loop-scheduling can give huge improvements on top of that.

And finally, a question:

Why on earth can’t the compiler do a better job?

