5. What is FMA?
• The FMA instruction set is an extension to the 128- and 256-bit Streaming SIMD Extensions (SSE/AVX) instructions.
• FMA instructions perform fused multiply–add (FMA) operations.
• An FMA4 operation has the form d = round(a · b + c).
• An FMA3 operation has the form a = round(a · b + c): the three-operand form requires the destination to be the same register as one of the sources a, b, or c.
• An FMA involves only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while a separate MUL followed by an ADD rounds twice.
18. Application Workflows
Bandwidth Bound:

numactl:
▷ Simply run the whole program in MCDRAM.
▷ No code modification required.

Memkind:
▷ Manually allocate BW-critical memory to MCDRAM.
▷ Memkind calls need to be added.

Cache mode:
▷ Allow the chip to figure out how to use MCDRAM.
▷ No code modification required.
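A sketch of what the first two modes look like in practice (the NUMA node number is an assumption — in flat mode MCDRAM typically appears as a separate NUMA node, but check `numactl -H` on the actual machine):

```shell
# numactl approach: bind all of the program's allocations to the MCDRAM node.
numactl --membind=1 ./my_app

# Memkind approach: only BW-critical buffers go to MCDRAM, via the memkind
# library's hbw_malloc/hbw_free from <hbwmalloc.h> (link with -lmemkind):
#   double *buf = hbw_malloc(n * sizeof(double));
#   ...
#   hbw_free(buf);

# Cache mode needs neither: MCDRAM acts as a transparent last-level cache.
```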
27. Example: Two Simple Matrix Multiplications
Before:
#pragma omp parallel for
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
#pragma vector aligned
    for (int k = 0; k < n; k++)   // stride-n access to B: vectorizes poorly
      C[i*n+j] += A[i*n+k] * B[k*n+j];

After:
#pragma omp parallel for
for (int i = 0; i < n; i++)
  for (int k = 0; k < n; k++)
#pragma vector aligned
    for (int j = 0; j < n; j++)   // unit-stride access to B and C: vectorizes well
      C[i*n+j] += A[i*n+k] * B[k*n+j];
34. Loop Blocking: Unroll/Register Blocking
// Original code:
for (int i = 0; i < m; i++)
  for (int j = 0; j < n; j++)
    compute(a[i], b[j]); // Memory access is unit-stride in j

// Step 1: strip-mine the outer loop (assumes m is a multiple of TILE;
// real code needs a remainder loop)
for (int ii = 0; ii < m; ii += TILE)
  for (int i = ii; i < ii + TILE; i++)
    for (int j = 0; j < n; j++)
      compute(a[i], b[j]); // Same order of operations as the original

// Step 2: permute the loops and vectorize in j
for (int ii = 0; ii < m; ii += TILE)
#pragma simd
  for (int j = 0; j < n; j++)
    for (int i = ii; i < ii + TILE; i++)
      compute(a[i], b[j]); // each vector load of b[j] is reused a total of TILE times
35. Loop Fusion
Cache reuse by fusing the loops of a pipelined processing workflow.
// Before fusion: three separate passes over the data
MyData* data = new MyData[n];
for (int i = 0; i < n; i++)
  Initialize(data[i]);
for (int i = 0; i < n; i++)
  Stage1(data[i]);
for (int i = 0; i < n; i++)
  Stage2(data[i]);

// After fusion: one pass; each element goes through all stages while still cached
MyData* data = new MyData[n];
for (int i = 0; i < n; i++) {
  Initialize(data[i]);
  Stage1(data[i]);
  Stage2(data[i]);
}
Positive side effects: less data moves between the stages, fewer memory references, and higher performance.
45. Per-Cycle FLOPs of Some Processors
AMD Bobcat:
1.5 DP FLOPs/cycle: scalar SSE2 addition every cycle + scalar SSE2 multiplication every other cycle
4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle
AMD Jaguar:
3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication every four cycles
8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
ARM Cortex-A9:
1.5 DP FLOPs/cycle: scalar addition every cycle + scalar multiplication every other cycle
4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle
ARM Cortex-A15:
2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
Qualcomm Krait:
2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
IBM PowerPC A2 (Blue Gene/Q), per core:
8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
SP elements are extended to DP and processed on the same units
IBM PowerPC A2 (Blue Gene/Q), per thread:
4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
SP elements are extended to DP and processed on the same units
Intel Xeon Phi (Knights Corner), per core:
16 DP FLOPs/cycle: 8-wide FMA every cycle
32 SP FLOPs/cycle: 16-wide FMA every cycle