Intel® Xeon® Phi Coprocessor
High Performance Programming
Parallelizing a Simple Image Blurring
Algorithm
Brian Gesiak
April 16th, 2014
Research Student, The University of Tokyo
@modocache
Today
• Image blurring with a 9-point stencil algorithm
• Comparing performance
• Intel® Xeon® Dual Processor
• Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
• Worst: Completely serial
• Better: Adding loop vectorization
• Best: Supporting multiple threads
• Further optimizations
• Padding arrays for improved cache performance
• Read-less writes, i.e. streaming stores
• Using huge memory pages
Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

typedef double real;

typedef struct {
  real center;    // weight applied to the pixel itself
  real next;      // weight applied to the north, south, east, and west neighbors
  real diagonal;  // weight applied to the four diagonal neighbors
} weight_t;
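To make the weights concrete, the blur can be written as a single weighted average (a sketch added here, not on the original slide; fin and fout are the input and output images):

  fout(x, y) = weight.center   *  fin(x, y)
             + weight.next     * (fin(x-1, y) + fin(x+1, y) + fin(x, y-1) + fin(x, y+1))
             + weight.diagonal * (fin(x-1, y-1) + fin(x+1, y-1) + fin(x-1, y+1) + fin(x+1, y+1))

With weights that sum to 1 (e.g. center = 0.99, next = diagonal = 0.00125, as in the sample application later), each output pixel is a gentle average of itself and its eight neighbors.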
Image Blurring
Applying a 9-Point Stencil to a Bitmap
Halo Effect: the outermost one-pixel border is left unblurred, since edge pixels do not have a full set of nine neighbors (the loops below run from 1 to height - 2 and from 1 to width - 2).
Sample Application
• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
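A quick memory sanity check, not on the original slides: with real defined as double, each buffer holds 5,900 x 10,000 x 8 bytes ≈ 472 MB, and the program keeps two of them (fin and fout), roughly 0.9 GB in total, so the working set fits comfortably in the coprocessor's 8 GB of GDDR5 (see the comparison table below).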
Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

Processor                     | Clock Frequency | Number of Cores | Memory Size/Type | Peak DP/SP FLOPs          | Peak Memory Bandwidth
Intel® Xeon® Dual Processor   | 2.6 GHz         | 16 (8 x 2 CPUs) | 63 GB / DDR3     | 345.6 / 691.2 GigaFLOP/s  | 85.3 GB/s
Intel® Xeon® Phi Coprocessor  | 1.091 GHz       | 61              | 8 GB / GDDR5     | 1.065 / 2.130 TeraFLOP/s  | 352 GB/s
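The Phi's peak numbers can be reproduced from the table's own parameters (a back-of-the-envelope sketch, assuming the 512-bit vector units issue one fused multiply-add per cycle):

  Double precision: 61 cores x 1.091 GHz x 8 DP lanes x 2 FLOP (FMA) ≈ 1.065 TeraFLOP/s
  Single precision: 61 cores x 1.091 GHz x 16 SP lanes x 2 FLOP (FMA) ≈ 2.130 TeraFLOP/s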
1st Comparison: Serial Execution
void stencil_9pt(real *fin, real *fout, int width, int height,
                 weight_t weight, int count) {
  for (int i = 0; i < count; ++i) {
    for (int y = 1; y < height - 1; ++y) {
      // ...calculate center, east, northwest, etc.

      // Assumed vector dependency: the compiler cannot prove that fin and fout
      // never alias, so this inner loop is not vectorized.
      for (int x = 1; x < width - 1; ++x) {
        fout[center] = weight.diagonal * fin[northwest] +
                       weight.next * fin[west] +
                       // ...add weighted, adjacent pixels
                       weight.center * fin[center];
        // ...increment locations
        ++center; ++north; ++northeast;
      }
    }

    // Swap buffers for next iteration
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
}
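The slide elides the index setup and most of the nine terms; the following is a fuller sketch of one sweep (my reconstruction, assuming a row-major layout with stride width, not the deck's verbatim code):

  for (int y = 1; y < height - 1; ++y) {
    // Linear indices of the center pixel and its eight neighbors at x = 1.
    int center    = y * width + 1;
    int north     = center - width;
    int south     = center + width;
    int east      = center + 1;
    int west      = center - 1;
    int northwest = north - 1;
    int northeast = north + 1;
    int southwest = south - 1;
    int southeast = south + 1;

    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * (fin[northwest] + fin[northeast] +
                                        fin[southwest] + fin[southeast]) +
                     weight.next     * (fin[north] + fin[south] +
                                        fin[east]  + fin[west]) +
                     weight.center   *  fin[center];
      // Slide every index one pixel to the right.
      ++center; ++north; ++south; ++east; ++west;
      ++northwest; ++northeast; ++southwest; ++southeast;
    }
  }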
1st Comparison: Serial Execution
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                     | Elapsed Wall Time                | MegaFLOPS
Intel® Xeon® Dual Processor   | 244.178 seconds (4 minutes)      | 4,107.658
Intel® Xeon® Phi Coprocessor  | 2,838.342 seconds (47.3 minutes) | 353.375

The Dual Processor is 11 times faster than the Phi.
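For reference, the MegaFLOPS figures follow from the stencil's arithmetic (a sketch of the bookkeeping, assuming 9 multiplies and 8 adds, i.e. 17 floating-point operations per interior pixel):

  (5,900 - 2) x (10,000 - 2) pixels x 17 FLOP x 1,000 iterations ≈ 1.0 x 10^12 FLOP
  1.0 x 10^12 FLOP / 244.178 s ≈ 4,100 MegaFLOP/s, matching the Dual Processor row above.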
2nd Comparison: Vectorization

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.
    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}

Ignoring Assumed Vector Dependencies
ivdep
Tells the compiler to ignore assumed dependencies
• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so it assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies are still respected; only assumed ones may be ignored.
Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
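An alternative to the pragma, not shown in the deck, is to promise at the function boundary that the buffers never alias, using C99 restrict:

  // Sketch: with restrict-qualified parameters the compiler no longer assumes
  // fin and fout overlap, so the inner loop can vectorize without #pragma ivdep.
  // (Compile as C99, e.g. icc -std=c99, or with icc's -restrict option.)
  void stencil_9pt(real *restrict fin, real *restrict fout,
                   int width, int height, weight_t weight, int count);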
2nd Comparison: Vectorization
Results

$ icc -openmp -O3 stencil.c -o stencil
$ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Processor                     | Elapsed Wall Time               | MegaFLOPS  | Speedup over serial
Intel® Xeon® Dual Processor   | 186.585 seconds (3.1 minutes)   | 5,375.572  | 1.3 times faster
Intel® Xeon® Phi Coprocessor  | 623.302 seconds (10.3 minutes)  | 1,609.171  | 4.5 times faster

The Dual Processor is now only 4 times faster than the Phi.
3rd Comparison: Multithreading
Work Division Using Parallel For Loops
3rd Comparison: Multithreading

for (int i = 0; i < count; ++i) {
  // Successive blur iterations depend on each other, so the parallelism goes
  // on the row loop (as on the padded-array slide later), not the count loop.
  #pragma omp parallel for
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.
    #pragma ivdep
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}

Work Division Using Parallel For Loops
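The thread counts in the results below are presumably set through the usual OpenMP environment variables; one way a native coprocessor run might be launched (the launcher and affinity setting are my assumptions, not stated on the slides):

  $ OMP_NUM_THREADS=16 ./stencil                                        # host
  $ scp stencil_phi mic0: && ssh mic0 \
      "OMP_NUM_THREADS=122 KMP_AFFINITY=balanced ./stencil_phi"         # coprocessor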
3rd Comparison: Multithreading
Results

Processor                     | Elapsed Wall Time (seconds) | MegaFLOPS
Xeon® Dual Proc., 16 Threads  | 43.862                      | 22,867.185
Xeon® Dual Proc., 32 Threads  | 46.247                      | 21,688.103
Xeon® Phi, 61 Threads         | 11.366                      | 88,246.452
Xeon® Phi, 122 Threads        | 8.772                       | 114,338.399
Xeon® Phi, 183 Threads        | 10.546                      | 94,946.364
Xeon® Phi, 244 Threads        | 12.696                      | 78,999.44

Compared with the vectorized runs, the best Dual Processor result is about 4x faster and the best Phi result about 71x faster. The Phi is now 5 times faster than the Dual Processor.
Further Optimizations
1. Padded arrays
2. Streaming stores
3. Huge memory pages
Optimization 1: Padded Arrays
• We can add extra, unused data to the end of each row
• Doing so aligns heavily used memory addresses for
efficient cache line access
Optimizing Cache Access
Optimization 1: Padded Arrays

static const size_t kPaddingSize = 64;

int main(int argc, const char **argv) {
  int height = 10000;
  int width = 5900;
  int count = 1000;

  size_t size = sizeof(real) * width * height;
  real *fin = (real *)malloc(size);
  real *fout = (real *)malloc(size);

  weight_t weight = { .center = 0.99,
                      .next = 0.00125,
                      .diagonal = 0.00125 };
  stencil_9pt(fin, fout, width, height, weight, count);

  // ...save results

  free(fin);
  free(fout);
  return 0;
}

Replacement lines highlighted on the slide:

  ((5900*sizeof(real)+63)/64)*(64/sizeof(real));     // padded row width, in elements
  sizeof(real) * width*kPaddingSize * height;
  (real *)_mm_malloc(size, kPaddingSize);            // aligned allocation for fin
  (real *)_mm_malloc(size, kPaddingSize);            // aligned allocation for fout
  _mm_free(fin);
  _mm_free(fout);
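Read together, the highlighted replacements round each row up to a whole number of 64-byte cache lines and switch to aligned allocation. One consistent reading (widthP is my name for the padded row length; it does not appear in the deck):

  // Sketch: pad each row to a multiple of kPaddingSize (64) bytes.
  size_t bytes_per_row = width * sizeof(real);                   // 5,900 * 8 = 47,200
  size_t padded_bytes  = ((bytes_per_row + kPaddingSize - 1) / kPaddingSize) * kPaddingSize;
  size_t widthP        = padded_bytes / sizeof(real);            // 5,904 doubles per row
  size_t size          = sizeof(real) * widthP * height;

  real *fin  = (real *)_mm_malloc(size, kPaddingSize);           // 64-byte-aligned buffers
  real *fout = (real *)_mm_malloc(size, kPaddingSize);
  // ...
  _mm_free(fin);
  _mm_free(fout);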
Optimization 1: Padded Arrays
Accommodating for Padding
#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {
  // ...calculate center, east, northwest, etc.
  // (widthP is the padded row length in elements -- see the allocation sketch
  //  above; indexing must use the padded stride, not the original width)
  int center    = y * widthP + 1;
  int north     = center - widthP;
  int south     = center + widthP;
  int east      = center + 1;
  int west      = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}
Optimization 1: Padded Arrays
Results

Processor               | Elapsed Wall Time (seconds) | MegaFLOPS
Xeon® Phi, 61 Threads   | 11.644                      | 86,138.371
Xeon® Phi, 122 Threads  | 8.973                       | 111,774.803
Xeon® Phi, 183 Threads  | 10.326                      | 97,132.546
Xeon® Phi, 244 Threads  | 11.469                      | 87,452.707
Optimization 2: Streaming Stores
Read-less Writes
• By default, the Xeon® Phi coprocessor reads the cache line containing an address before writing to that address (a "read for ownership").
• When calculating the weighted average for a pixel in our program, we never use the original value of that output pixel, so enabling streaming stores, which skip the read, should result in better performance.
Optimization 2: Streaming Stores

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.
    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}

Read-less Writes with Vector Nontemporal
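As an aside not on the slides, the Intel compiler can also request streaming stores globally at compile time rather than per loop; if I recall the classic option correctly, it is -opt-streaming-stores:

  $ icc -openmp -mmic -O3 -opt-streaming-stores always stencil.c -o stencil_phi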
Optimization 2: Streaming Stores
Results

Processor               | Elapsed Wall Time (seconds) | MegaFLOPS
Xeon® Phi, 61 Threads   | 13.588                      | 73,978.915
Xeon® Phi, 122 Threads  | 8.491                       | 111,774.803
Xeon® Phi, 183 Threads  | 8.663                       | 115,773.405
Xeon® Phi, 244 Threads  | 9.507                       | 105,498.781
Optimization 3: Huge Memory Pages
• Memory pages map the virtual memory used by our program to physical memory
• Mappings are stored in a translation look-aside buffer (TLB)
• On a TLB miss, the mappings are traversed in a "page table walk"
• malloc and _mm_malloc use 4KB memory pages by default
• By increasing the size of each memory page, fewer pages are needed and traversal time may be reduced
Optimization 3: Huge Memory Pages

size_t size = sizeof(real) * width * kPaddingSize * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);

The aligned allocation is replaced with an anonymous huge-page mapping:

real *fin = (real *)mmap(0, size,
                         PROT_READ|PROT_WRITE,
                         MAP_ANON|MAP_PRIVATE|MAP_HUGETLB,
                         -1, 0);
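MAP_HUGETLB only succeeds if the kernel has huge pages reserved (e.g. via vm.nr_hugepages), so a hedged sketch with a fallback; alloc_buffer is a hypothetical helper, not from the deck:

  #include <sys/mman.h>

  // Sketch: try a huge-page-backed anonymous mapping first, fall back to
  // ordinary 4 KB pages if the huge-page mmap fails.
  static real *alloc_buffer(size_t size) {
    void *p = mmap(0, size, PROT_READ|PROT_WRITE,
                   MAP_ANON|MAP_PRIVATE|MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
      p = mmap(0, size, PROT_READ|PROT_WRITE,
               MAP_ANON|MAP_PRIVATE, -1, 0);
    return (real *)p;
  }

  // Buffers obtained this way are released with munmap rather than free().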
Optimization 3: Huge Memory Pages
Results

Processor               | Elapsed Wall Time (seconds) | MegaFLOPS
Xeon® Phi, 61 Threads   | 14.486                      | 69,239.365
Xeon® Phi, 122 Threads  | 8.226                       | 121,924.389
Xeon® Phi, 183 Threads  | 8.749                       | 114,636.799
Xeon® Phi, 244 Threads  | 9.466                       | 105,955.358
Takeaways
• The key to achieving high performance is to use loop vectorization and multiple threads
• Completely serial programs run faster on standard processors
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor
• Other optimizations may be used to tweak performance
  • Data padding
  • Streaming stores
  • Huge memory pages
Sources and Additional Resources
• Today's slides
  • http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders)
  • http://www.amazon.com/dp/0124104142
• Intel Documentation
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm

More Related Content

PPTX
PDF
TensorFlow example for AI Ukraine2016
PPTX
Introduction to TensorFlow 2 and Keras
PPTX
H2 o berkeleydltf
PPTX
Introduction to Deep Learning, Keras, and Tensorflow
PDF
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
PPT
noise removal in matlab
PDF
Google TensorFlow Tutorial
TensorFlow example for AI Ukraine2016
Introduction to TensorFlow 2 and Keras
H2 o berkeleydltf
Introduction to Deep Learning, Keras, and Tensorflow
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
noise removal in matlab
Google TensorFlow Tutorial

What's hot (20)

PPTX
Sound analysis and processing with MATLAB
PPTX
Introduction to TensorFlow 2
PPTX
Introduction to TensorFlow 2
PPTX
Working with tf.data (TF 2)
PDF
TensorFlow Tutorial
PPTX
Introduction to PyTorch
PPTX
TensorFlow in Your Browser
PDF
Scientific visualization with_gr
PPTX
Introduction to Tensorflow
PDF
Power ai tensorflowworkloadtutorial-20171117
PDF
Natural language processing open seminar For Tensorflow usage
PDF
Dive Into PyTorch
PDF
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
PPTX
TensorFlow
PPTX
Tensorflow - Intro (2017)
PDF
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
PPTX
Explanation on Tensorflow example -Deep mnist for expert
PPTX
Machine Learning - Introduction to Tensorflow
PDF
Tensor board
PDF
Introduction to TensorFlow 2.0
Sound analysis and processing with MATLAB
Introduction to TensorFlow 2
Introduction to TensorFlow 2
Working with tf.data (TF 2)
TensorFlow Tutorial
Introduction to PyTorch
TensorFlow in Your Browser
Scientific visualization with_gr
Introduction to Tensorflow
Power ai tensorflowworkloadtutorial-20171117
Natural language processing open seminar For Tensorflow usage
Dive Into PyTorch
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
TensorFlow
Tensorflow - Intro (2017)
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
Explanation on Tensorflow example -Deep mnist for expert
Machine Learning - Introduction to Tensorflow
Tensor board
Introduction to TensorFlow 2.0
Ad

Viewers also liked (7)

PDF
RSpec 3.0: Under the Covers
PDF
Apple Templates Considered Harmful
PDF
iOS UI Component API Design
PDF
iOS Behavior-Driven Development
PDF
アップルのテンプレートは有害と考えられる
PDF
iOSビヘイビア駆動開発
PDF
iOS UI Component API Design
RSpec 3.0: Under the Covers
Apple Templates Considered Harmful
iOS UI Component API Design
iOS Behavior-Driven Development
アップルのテンプレートは有害と考えられる
iOSビヘイビア駆動開発
iOS UI Component API Design
Ad

Similar to Intel® Xeon® Phi Coprocessor High Performance Programming (20)

PPTX
#OOP_D_ITS - 2nd - C++ Getting Started
PPTX
#OOP_D_ITS - 3rd - Pointer And References
PDF
how to reuse code
PPT
ch08.ppt
PDF
C Recursion, Pointers, Dynamic memory management
DOCX
#include stdafx.h using namespace std; #include stdlib.h.docx
PDF
Memory Management for C and C++ _ language
PPTX
Node.js behind: V8 and its optimizations
PDF
Let’s talk about microbenchmarking
PDF
Workshop 10: ECMAScript 6
PDF
Learn C program in Complete c programing string and its functions like array...
PDF
Write a function in C++ to generate an N-node random binary search t.pdf
PDF
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdf
PDF
Adam Sitnik "State of the .NET Performance"
PDF
State of the .Net Performance
PPTX
Go Programming Language (Golang)
PPTX
Computer Programming for Engineers Spring 2023Lab 8 - Pointers.pptx
PPT
C++ Language
PDF
Introduction to programming - class 11
PPTX
functions
#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 3rd - Pointer And References
how to reuse code
ch08.ppt
C Recursion, Pointers, Dynamic memory management
#include stdafx.h using namespace std; #include stdlib.h.docx
Memory Management for C and C++ _ language
Node.js behind: V8 and its optimizations
Let’s talk about microbenchmarking
Workshop 10: ECMAScript 6
Learn C program in Complete c programing string and its functions like array...
Write a function in C++ to generate an N-node random binary search t.pdf
Please do Part A, Ill be really gratefulThe main.c is the skeleto.pdf
Adam Sitnik "State of the .NET Performance"
State of the .Net Performance
Go Programming Language (Golang)
Computer Programming for Engineers Spring 2023Lab 8 - Pointers.pptx
C++ Language
Introduction to programming - class 11
functions

Recently uploaded (20)

PDF
Hindi spoken digit analysis for native and non-native speakers
PDF
DP Operators-handbook-extract for the Mautical Institute
PDF
1 - Historical Antecedents, Social Consideration.pdf
PPTX
Tartificialntelligence_presentation.pptx
PDF
A review of recent deep learning applications in wood surface defect identifi...
PPTX
Modernising the Digital Integration Hub
PDF
Enhancing emotion recognition model for a student engagement use case through...
PPT
Geologic Time for studying geology for geologist
PDF
Five Habits of High-Impact Board Members
PDF
Developing a website for English-speaking practice to English as a foreign la...
PDF
Architecture types and enterprise applications.pdf
PPTX
Group 1 Presentation -Planning and Decision Making .pptx
PDF
A contest of sentiment analysis: k-nearest neighbor versus neural network
PDF
A novel scalable deep ensemble learning framework for big data classification...
PPTX
O2C Customer Invoices to Receipt V15A.pptx
PDF
Getting Started with Data Integration: FME Form 101
PDF
CloudStack 4.21: First Look Webinar slides
PPTX
Web Crawler for Trend Tracking Gen Z Insights.pptx
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
PDF
A Late Bloomer's Guide to GenAI: Ethics, Bias, and Effective Prompting - Boha...
Hindi spoken digit analysis for native and non-native speakers
DP Operators-handbook-extract for the Mautical Institute
1 - Historical Antecedents, Social Consideration.pdf
Tartificialntelligence_presentation.pptx
A review of recent deep learning applications in wood surface defect identifi...
Modernising the Digital Integration Hub
Enhancing emotion recognition model for a student engagement use case through...
Geologic Time for studying geology for geologist
Five Habits of High-Impact Board Members
Developing a website for English-speaking practice to English as a foreign la...
Architecture types and enterprise applications.pdf
Group 1 Presentation -Planning and Decision Making .pptx
A contest of sentiment analysis: k-nearest neighbor versus neural network
A novel scalable deep ensemble learning framework for big data classification...
O2C Customer Invoices to Receipt V15A.pptx
Getting Started with Data Integration: FME Form 101
CloudStack 4.21: First Look Webinar slides
Web Crawler for Trend Tracking Gen Z Insights.pptx
Final SEM Unit 1 for mit wpu at pune .pptx
A Late Bloomer's Guide to GenAI: Ethics, Bias, and Effective Prompting - Boha...

Intel® Xeon® Phi Coprocessor High Performance Programming

  • 1. Intel® Xeon® Phi Coprocessor High Performance Programming Parallelizing a Simple Image Blurring Algorithm Brian Gesiak April 16th, 2014 Research Student, The University of Tokyo @modocache
  • 2. Today • Image blurring with a 9-point stencil algorithm • Comparing performance • Intel® Xeon® Dual Processor • Intel® Xeon® Phi Coprocessor • Iteratively improving performance • Worst: Completely serial • Better: Adding loop vectorization • Best: Supporting multiple threads • Further optimizations • Padding arrays for improved cache performance • Read-less writes, i.e.: streaming stores • Using huge memory pages
  • 3. Stencil Algorithms A 9-Point Stencil on a 2D Matrix
  • 4. Stencil Algorithms A 9-Point Stencil on a 2D Matrix
  • 5. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; A 9-Point Stencil on a 2D Matrix
  • 6. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; A 9-Point Stencil on a 2D Matrix
  • 7. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; weight.next; A 9-Point Stencil on a 2D Matrix
  • 8. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; weight.diagonal; weight.next; A 9-Point Stencil on a 2D Matrix
  • 9. Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 10. Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 11. Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 12. Halo Effect Image Blurring Applying a 9-Point Stencil to a Bitmap
  • 13. • Apply a 9-point stencil to a 5,900 x 10,000 px image • Apply the stencil 1,000 times Sample Application
  • 14. • Apply a 9-point stencil to a 5,900 x 10,000 px image • Apply the stencil 1,000 times Sample Application
  • 15. • Apply a 9-point stencil to a 5,900 x 10,000 px image • Apply the stencil 1,000 times Sample Application
  • 16. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth
  • 17. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth Intel® Xeon® Dual Processor 2.6 GHz 16 (8 x 2 CPUs) 63 GB / DDR3 345.6 / 691.2 GigaFLOP/s 85.3 GB/s
  • 18. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth Intel® Xeon® Dual Processor 2.6 GHz 16 (8 x 2 CPUs) 63 GB / DDR3 345.6 / 691.2 GigaFLOP/s 85.3 GB/s Intel® Xeon® Phi Coprocessor 1.091 GHz 61 8 GB/ GDDR5 1.065/2.130 TeraFLOP/s 352 GB/s
  • 20. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 21. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 22. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 23. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 24. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 25. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 26. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 27. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 28. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }
  • 29. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } } Assumed vector dependency
  • 30. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results
  • 31. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results $ icc -openmp -O3 stencil.c -o stencil
  • 32. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results $ icc -openmp -mmic -O3 stencil.c -o stencil_phi
  • 33. Dual is 11 times faster than Phi Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results
  • 34. 2nd Comparison: Vectorization for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Ignoring Assumed Vector Dependencies
  • 35. 2nd Comparison: Vectorization for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Ignoring Assumed Vector Dependencies
  • 36. ivdep Tells compiler to ignore assumed dependencies Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm
  • 37. ivdep Tells compiler to ignore assumed dependencies • In our program, the compiler cannot determine whether the two pointers refer to the same block of memory. So the compiler assumes they do. Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm
  • 38. ivdep Tells compiler to ignore assumed dependencies • In our program, the compiler cannot determine whether the two pointers refer to the same block of memory. So the compiler assumes they do. • The ivdep pragma negates this assumption. Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm
  • 39. ivdep Tells compiler to ignore assumed dependencies • In our program, the compiler cannot determine whether the two pointers refer to the same block of memory. So the compiler assumes they do. • The ivdep pragma negates this assumption. • Proven dependencies may not be ignored. Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm
  • 40. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results
  • 41. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results $ icc -openmp -O3 stencil.c -o stencil 1.3 times faster
  • 42. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results $ icc -openmp -mmic -O3 stencil.c -o stencil_phi 4.5 times faster 1.3 times faster
  • 43. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results 4.5 times faster 1.3 times faster Dual is now only 4 times faster than Phi
  • 44. 3rd Comparison: Multithreading Work Division Using Parallel For Loops
  • 45. 3rd Comparison: Multithreading #pragma omp parallel for for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Work Division Using Parallel For Loops
  • 46. 3rd Comparison: Multithreading #pragma omp parallel for for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Work Division Using Parallel For Loops
  • 47. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results
  • 48. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results
  • 49. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results
  • 50. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results 4x 71x
  • 51. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results 4x 71x Phi now 5 times faster
  • 54. Further Optimizations 1. Padded arrays 2. Streaming stores
  • 55. Further Optimizations 1. Padded arrays 2. Streaming stores 3. Huge memory pages
  • 56. Optimization 1: Padded Arrays Optimizing Cache Access
  • 57. Optimization 1: Padded Arrays • We can add extra, unused data to the end of each row Optimizing Cache Access
  • 58. Optimization 1: Padded Arrays • We can add extra, unused data to the end of each row • Doing so aligns heavily used memory addresses for efficient cache line access Optimizing Cache Access
  • 60. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }
  • 61. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }
  • 62. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }
  • 63. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }
  • 64. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }
  • 65. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }
  • 66. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }
  • 67. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; } ((5900*sizeof(real)+63)/64)*(64/sizeof(real));
  • 68. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; } (real *)_mm_malloc(size, kPaddingSize); (real *)_mm_malloc(size, kPaddingSize); sizeof(real)* width*kPaddingSize * height; ((5900*sizeof(real)+63)/64)*(64/sizeof(real));
  • 69. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; } _mm_free(fin); _mm_free(fout); (real *)_mm_malloc(size, kPaddingSize); (real *)_mm_malloc(size, kPaddingSize); sizeof(real)* width*kPaddingSize * height; ((5900*sizeof(real)+63)/64)*(64/sizeof(real));
• 71. Optimization 1: Padded Arrays (Accommodating for Padding)

#pragma omp parallel for
for (int y = 1; y < height - 1; ++y) {
  // ...calculate center, east, northwest, etc., using the padded row width
  int center = 1 + y * kPaddingSize + 1;
  int north = center - kPaddingSize;
  int south = center + kPaddingSize;
  int east = center + 1;
  int west = center - 1;
  int northwest = north - 1;
  int northeast = north + 1;
  int southwest = south - 1;
  int southeast = south + 1;

  #pragma ivdep
  // ...
}
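For context, here is a minimal sketch of the padded loop nest with the elided x loop and buffer swap from the earlier serial version filled back in. It assumes the deck's real, weight_t, and kPaddingSize definitions; the name stencil_9pt_padded and the grouping of the nine weighted terms are my own:

void stencil_9pt_padded(real *fin, real *fout, int width, int height,
                        weight_t weight, int count) {
  for (int i = 0; i < count; ++i) {
    #pragma omp parallel for
    for (int y = 1; y < height - 1; ++y) {
      // Row stride is the padded width, kPaddingSize, not the image width.
      int center = 1 + y * kPaddingSize + 1;
      int north = center - kPaddingSize;
      int south = center + kPaddingSize;
      int east = center + 1;
      int west = center - 1;
      int northwest = north - 1;
      int northeast = north + 1;
      int southwest = south - 1;
      int southeast = south + 1;

      #pragma ivdep
      for (int x = 1; x < width - 1; ++x) {
        // Weighted average of the pixel and its eight neighbors.
        fout[center] = weight.diagonal * (fin[northwest] + fin[northeast] +
                                          fin[southwest] + fin[southeast]) +
                       weight.next     * (fin[north] + fin[south] +
                                          fin[east]  + fin[west]) +
                       weight.center   *  fin[center];
        ++center; ++north; ++south; ++east; ++west;
        ++northwest; ++northeast; ++southwest; ++southeast;
      }
    }
    // Swap buffers for the next iteration.
    real *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
}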
• 73. Optimization 1: Padded Arrays Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     11.644                         86,138.371
Xeon® Phi, 122 Threads     8.973                        111,774.803
Xeon® Phi, 183 Threads    10.326                         97,132.546
Xeon® Phi, 244 Threads    11.469                         87,452.707
• 78. Optimization 2: Streaming Stores (Read-less Writes)

• By default, Xeon® Phi processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel, our program never reads the value it is about to overwrite in the output buffer. Enabling streaming stores should therefore improve performance.
• 79. Optimization 2: Streaming Stores (Read-less Writes with Vector Nontemporal)

for (int i = 0; i < count; ++i) {
  for (int y = 1; y < height - 1; ++y) {
    // ...calculate center, east, northwest, etc.

    #pragma ivdep
    #pragma vector nontemporal
    for (int x = 1; x < width - 1; ++x) {
      fout[center] = weight.diagonal * fin[northwest] +
                     weight.next * fin[west] +
                     // ...add weighted, adjacent pixels
                     weight.center * fin[center];
      // ...increment locations
      ++center; ++north; ++northeast;
    }
  }
  // ...
}
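If memory serves from the Intel pragma documentation cited at the end of this deck, the nontemporal hint can also be restricted to specific arrays. That keeps the hint away from fin, whose values we do want cached for the neighboring stencil reads. A minimal sketch, treating the variable-list form as an assumption to verify against the compiler docs:

// Hint streaming (non-temporal) stores for fout only; fin is still
// read through the cache for the reuse across neighboring pixels.
#pragma ivdep
#pragma vector nontemporal (fout)
for (int x = 1; x < width - 1; ++x) {
  fout[center] = weight.diagonal * fin[northwest] +
                 weight.next * fin[west] +
                 // ...add weighted, adjacent pixels
                 weight.center * fin[center];
  ++center; ++north; ++northeast;
}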
• 81. Optimization 2: Streaming Stores Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     13.588                         73,978.915
Xeon® Phi, 122 Threads     8.491                        111,774.803
Xeon® Phi, 183 Threads     8.663                        115,773.405
Xeon® Phi, 244 Threads     9.507                        105,498.781
• 89. Optimization 3: Huge Memory Pages

• Memory pages map the virtual memory used by our program to physical memory.
• Mappings are stored in a translation look-aside buffer (TLB).
• Mappings are traversed in a "page table walk".
• malloc and _mm_malloc use 4 KB memory pages by default.
• Increasing the size of each memory page may reduce traversal time.
• 91. Optimization 3: Huge Memory Pages

// Before: aligned, padded allocation with _mm_malloc
size_t size = sizeof(real) * kPaddingSize * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);

// After: back the allocation with huge pages instead
real *fin = (real *)mmap(0, size, PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
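The slide shows only the allocation call. As a hedged sketch of what a complete allocate/free pair might look like on Linux: MAP_HUGETLB is Linux-specific and fails unless huge pages have been reserved (for example via vm.nr_hugepages), so a fallback to ordinary pages is included; the helper names allocate_buffer and free_buffer are my own.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

typedef double real;  /* matches the deck's typedef */

/* Try a huge-page-backed anonymous mapping first, then fall back to
   ordinary 4 KB pages if no huge pages are available. */
static real *allocate_buffer(size_t size) {
  void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    perror("mmap(MAP_HUGETLB) failed, falling back to 4 KB pages");
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  }
  return p == MAP_FAILED ? NULL : (real *)p;
}

/* mmap'd memory is released with munmap, not free or _mm_free.
   Note: for huge-page mappings the length passed to munmap should be
   rounded up to a multiple of the huge page size. */
static void free_buffer(real *p, size_t size) {
  if (p) {
    munmap(p, size);
  }
}

/* Usage, following the deck's dimensions:
     size_t size = sizeof(real) * kPaddingSize * height;
     real *fin = allocate_buffer(size);
     ...
     free_buffer(fin, size);                                        */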
• 93. Optimization 3: Huge Memory Pages Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     14.486                         69,239.365
Xeon® Phi, 122 Threads     8.226                        121,924.389
Xeon® Phi, 183 Threads     8.749                        114,636.799
Xeon® Phi, 244 Threads     9.466                        105,955.358
• 96. Takeaways

• The key to achieving high performance is to use loop vectorization and multiple threads.
• Completely serial programs run faster on standard processors.
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor.
• Other optimizations may be used to fine-tune performance:
  • data padding,
  • streaming stores,
  • huge memory pages.
• 97. Sources and Additional Resources

• Today's slides
  • http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders)
  • http://www.amazon.com/dp/0124104142
• Intel Documentation
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm