Intel® Xeon® Phi Coprocessor High Performance Programming

1. Intel® Xeon® Phi Coprocessor High Performance Programming Parallelizing a Simple Image Blurring Algorithm Brian Gesiak April 16th, 2014 Research Student, The University of Tokyo @modocache

2. Today • Image blurring with a 9-point stencil algorithm • Comparing performance • Intel® Xeon® Dual Processor • Intel® Xeon® Phi Coprocessor • Iteratively improving performance • Worst: Completely serial • Better: Adding loop vectorization • Best: Supporting multiple threads • Further optimizations • Padding arrays for improved cache performance • Read-less writes, i.e.: streaming stores • Using huge memory pages

3. Stencil Algorithms A 9-Point Stencil on a 2D Matrix

4. Stencil Algorithms A 9-Point Stencil on a 2D Matrix

5. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; A 9-Point Stencil on a 2D Matrix

6. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; A 9-Point Stencil on a 2D Matrix

7. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; weight.next; A 9-Point Stencil on a 2D Matrix

8. Stencil Algorithms typedef double real; typedef struct { real center; real next; real diagonal; } weight_t; weight.center; weight.diagonal; weight.next; A 9-Point Stencil on a 2D Matrix

9. Image Blurring Applying a 9-Point Stencil to a Bitmap

12. Halo Eﬀect Image Blurring Applying a 9-Point Stencil to a Bitmap

13. • Apply a 9-point stencil to a 5,900 x 10,000 px image • Apply the stencil 1,000 times Sample Application

16. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth

17. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth Intel® Xeon® Dual Processor 2.6 GHz 16 (8 x 2 CPUs) 63 GB / DDR3 345.6 / 691.2 GigaFLOP/s 85.3 GB/s

18. Comparing Processors Xeon® Dual Processor vs. Xeon® Phi Coprocessor Processor Clock Frequency Number of Cores Memory Size/Type Peak DP/SP FLOPs Peak Memory Bandwidth Intel® Xeon® Dual Processor 2.6 GHz 16 (8 x 2 CPUs) 63 GB / DDR3 345.6 / 691.2 GigaFLOP/s 85.3 GB/s Intel® Xeon® Phi Coprocessor 1.091 GHz 61 8 GB/ GDDR5 1.065/2.130 TeraFLOP/s 352 GB/s

19. 1st Comparison: Serial Execution

20. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } }

29. 1st Comparison: Serial Execution void stencil_9pt(real *fin, real *fout, int width, int height, weight_t weight, int count) { for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } ! // Swap buffers for next iteration real *ftmp = fin; fin = fout; fout = ftmp; } } Assumed vector dependency

30. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results

31. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results $ icc -openmp -O3 stencil.c -o stencil

32. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results $ icc -openmp -mmic -O3 stencil.c -o stencil_phi

33. Dual is 11 times faster than Phi Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 244.178 seconds (4 minutes) 4,107.658 Intel® Xeon® Phi Coprocessor 2,838.342 seconds (47.3 minutes) 353.375 1st Comparison: Serial Execution Results

34. 2nd Comparison: Vectorization for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Ignoring Assumed Vector Dependencies

35. 2nd Comparison: Vectorization for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Ignoring Assumed Vector Dependencies

36. ivdep Tells compiler to ignore assumed dependencies Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm

37. ivdep Tells compiler to ignore assumed dependencies • In our program, the compiler cannot determine whether the two pointers refer to the same block of memory. So the compiler assumes they do. Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm

38. ivdep Tells compiler to ignore assumed dependencies • In our program, the compiler cannot determine whether the two pointers refer to the same block of memory. So the compiler assumes they do. • The ivdep pragma negates this assumption. Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm

39. ivdep Tells compiler to ignore assumed dependencies • In our program, the compiler cannot determine whether the two pointers refer to the same block of memory. So the compiler assumes they do. • The ivdep pragma negates this assumption. • Proven dependencies may not be ignored. Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599- AEDF-2434F4676E1B.htm

40. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results

41. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results $ icc -openmp -O3 stencil.c -o stencil 1.3 times faster

42. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results $ icc -openmp -mmic -O3 stencil.c -o stencil_phi 4.5 times faster 1.3 times faster

43. Processor Elapsed Wall Time MegaFLOPS Intel® Xeon® Dual Processor 186.585 seconds (3.1 minutes) 5,375.572 Intel® Xeon® Phi Coprocessor 623.302 seconds (10.3 minutes) 1,609.171 2nd Comparison: Vectorization Results 4.5 times faster 1.3 times faster Dual is now only 4 times faster than Phi

44. 3rd Comparison: Multithreading Work Division Using Parallel For Loops

45. 3rd Comparison: Multithreading #pragma omp parallel for for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Work Division Using Parallel For Loops

46. 3rd Comparison: Multithreading #pragma omp parallel for for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Work Division Using Parallel For Loops

47. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results

50. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results 4x 71x

51. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Dual Proc., 16 Threads 43.862 22,867.185 Xeon® Dual Proc., 32 Threads 46.247 21,688.103 Xeon® Phi, 61 Threads 11.366 88,246.452 Xeon® Phi, 122 Threads 8.772 114,338.399 Xeon® Phi, 183 Threads 10.546 94,946.364 Xeon® Phi, 244 Threads 12.696 78,999.44 3rd Comparison: Multithreading Results 4x 71x Phi now 5 times faster

52. Further Optimizations

53. Further Optimizations 1. Padded arrays

54. Further Optimizations 1. Padded arrays 2. Streaming stores

55. Further Optimizations 1. Padded arrays 2. Streaming stores 3. Huge memory pages

56. Optimization 1: Padded Arrays Optimizing Cache Access

57. Optimization 1: Padded Arrays • We can add extra, unused data to the end of each row Optimizing Cache Access

58. Optimization 1: Padded Arrays • We can add extra, unused data to the end of each row • Doing so aligns heavily used memory addresses for eﬃcient cache line access Optimizing Cache Access

59. Optimization 1: Padded Arrays

60. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; }

67. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; } ((5900*sizeof(real)+63)/64)*(64/sizeof(real));

68. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; } (real *)_mm_malloc(size, kPaddingSize); (real *)_mm_malloc(size, kPaddingSize); sizeof(real)* width*kPaddingSize * height; ((5900*sizeof(real)+63)/64)*(64/sizeof(real));

69. Optimization 1: Padded Arrays static const size_t kPaddingSize = 64; int main(int argc, const char **argv) { int height = 10000; int width = 5900; int count = 1000; ! size_t size = sizeof(real) * width * height; real *fin = (real *)malloc(size); real *fout = (real *)malloc(size); ! weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 }; stencil_9pt(fin, fout, width, height, weight, count); ! // ...save results ! free(fin); free(fout); return 0; } _mm_free(fin); _mm_free(fout); (real *)_mm_malloc(size, kPaddingSize); (real *)_mm_malloc(size, kPaddingSize); sizeof(real)* width*kPaddingSize * height; ((5900*sizeof(real)+63)/64)*(64/sizeof(real));

70. Optimization 1: Padded Arrays Accommodating for Padding

71. Optimization 1: Padded Arrays #pragma omp parallel for for (int y = 1; y < height - 1; ++y) { ! // ...calculate center, east, northwest, etc. int center = 1 + y * kPaddingSize + 1; int north = center - kPaddingSize; int south = center + kPaddingSize; int east = center + 1; int west = center - 1; int northwest = north - 1; int northeast = north + 1; int southwest = south - 1; int southeast = south + 1; ! #pragma ivdep // ... } Accommodating for Padding

72. Optimization 1: Padded Arrays #pragma omp parallel for for (int y = 1; y < height - 1; ++y) { ! // ...calculate center, east, northwest, etc. int center = 1 + y * kPaddingSize + 1; int north = center - kPaddingSize; int south = center + kPaddingSize; int east = center + 1; int west = center - 1; int northwest = north - 1; int northeast = north + 1; int southwest = south - 1; int southeast = south + 1; ! #pragma ivdep // ... } Accommodating for Padding

73. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Phi, 61 Threads 11.644 86,138.371 Xeon® Phi, 122 Threads 8.973 111,774.803 Xeon® Phi, 183 Threads 10.326 97,132.546 Xeon® Phi, 244 Threads 11.469 87,452.707 Optimization 1: Padded Arrays Results

76. Optimization 2: Streaming Stores Read-less Writes

77. Optimization 2: Streaming Stores Read-less Writes • By default, Xeon® Phi processors read the value at an address before writing to that address.

78. Optimization 2: Streaming Stores Read-less Writes • By default, Xeon® Phi processors read the value at an address before writing to that address. • When calculating the weighted average for a pixel in our program, we do not use the original value of that pixel. Therefore, enabling streaming stores should result in better performance.

79. Optimization 2: Streaming Stores for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep #pragma vector nontemporal for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Read-less Writes with Vector Nontemporal

80. Optimization 2: Streaming Stores for (int i = 0; i < count; ++i) { for (int y = 1; y < height - 1; ++y) { // ...calculate center, east, northwest, etc. #pragma ivdep #pragma vector nontemporal for (int x = 1; x < width - 1; ++x) { fout[center] = weight.diagonal * fin[northwest] + weight.next * fin[west] + // ...add weighted, adjacent pixels weight.center * fin[center]; // ...increment locations ++center; ++north; ++northeast; } } // ... } Read-less Writes with Vector Nontemporal

81. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Phi, 61 Threads 13.588 73,978.915 Xeon® Phi, 122 Threads 8.491 111,774.803 Xeon® Phi, 183 Threads 8.663 115,773.405 Xeon® Phi, 244 Threads 9.507 105,498.781 Optimization 2: Streaming Stores Results

84. Optimization 3: Huge Memory Pages

85. • Memory pages map virtual memory used by our program to physical memory Optimization 3: Huge Memory Pages

86. • Memory pages map virtual memory used by our program to physical memory • Mappings are stored in a translation look-aside buﬀer (TLB) Optimization 3: Huge Memory Pages

87. • Memory pages map virtual memory used by our program to physical memory • Mappings are stored in a translation look-aside buﬀer (TLB) • Mappings are traversed in a“page table walk” Optimization 3: Huge Memory Pages

88. • Memory pages map virtual memory used by our program to physical memory • Mappings are stored in a translation look-aside buﬀer (TLB) • Mappings are traversed in a“page table walk” • malloc and _mm_malloc use 4KB memory pages by default Optimization 3: Huge Memory Pages

89. • Memory pages map virtual memory used by our program to physical memory • Mappings are stored in a translation look-aside buﬀer (TLB) • Mappings are traversed in a“page table walk” • malloc and _mm_malloc use 4KB memory pages by default • By increasing the size of each memory page, traversal time may be reduced Optimization 3: Huge Memory Pages

90. Optimization 3: Huge Memory Pages size_t size = sizeof(real) * width * kPaddingSize * height; real *fin = (real *)_mm_malloc(size, kPaddingSize);

91. Optimization 3: Huge Memory Pages size_t size = sizeof(real) * width * kPaddingSize * height; real *fin = (real *)_mm_malloc(size, kPaddingSize);real *fin = (real *)mmap(0, size, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE|MAP_HUGETLB, -1.0);

92. Optimization 3: Huge Memory Pages size_t size = sizeof(real) * width * kPaddingSize * height; real *fin = (real *)_mm_malloc(size, kPaddingSize);real *fin = (real *)mmap(0, size, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE|MAP_HUGETLB, -1.0);

93. Processor Elapsed Wall Time (seconds) MegaFLOPS Xeon® Phi, 61 Threads 14.486 69,239.365 Xeon® Phi, 122 Threads 8.226 121,924.389 Xeon® Phi, 183 Threads 8.749 114,636.799 Xeon® Phi, 244 Threads 9.466 105,955.358 Results Optimization 3: Huge Memory Pages

96. Takeaways • The key to achieving high-performance is to use loop vectorization and multiple threads • Completely serial programs run faster on standard processors • Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor • Other optimizations may be used to tweak performance • Data padding, • Streaming stores • Huge memory pages

97. Sources and Additional Resources • Today’s slides • http://modocache.io/xeon-phi-high-performance • Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeﬀers, James Reinders) • http://www.amazon.com/dp/0124104142 • Intel Documentation • ivdep: https://software.intel.com/sites/products/ documentation/doclib/iss/2013/compiler/cpp-lin/ GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm • vector: https://software.intel.com/sites/products/ documentation/studio/composer/en-us/2011Update/ compiler_c/cref_cls/common/ cppref_pragma_vector.htm

Intel® Xeon® Phi Coprocessor High Performance Programming

More Related Content

What's hot (20)

Viewers also liked (7)

Similar to Intel® Xeon® Phi Coprocessor High Performance Programming (20)

Recently uploaded (20)

Intel® Xeon® Phi Coprocessor High Performance Programming