Deep Convolutional Network
Evaluation on the Intel Xeon Phi
Gaurav Raina
MSc Graduation Project
5-1-2016
Cameras are ubiquitous
Vision processing on mobile devices
• Most processing is currently done off-line
• High compute and energy demands
• Goal: move processing to the edge
Motivation
• Convolutional neural nets are very generic (they support many vision tasks)
− Traffic sign detection
− Pedestrian detection
− Face detection
• Accelerate with a power-efficient core
Problem statement
“Efficiently parallelize a Convolutional Neural Network on a highly-parallel power efficient processor platform”
Overview
1. Convolutional Network (ConvNet) algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work
Introduction to Neural Networks
• Artificial neuron model
Convolution example
Image credit: deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
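The convolution on this slide can be sketched in a few lines of C. This is a minimal illustration, assuming unit stride, no padding and integer data; the function name conv2d_valid and its parameters are hypothetical and do not come from the thesis code. As is usual in ConvNets, the kernel is not flipped.

```c
#include <stddef.h>

/* Valid (no padding, unit stride) 2D convolution of an in_h x in_w image
 * with a k x k kernel. All arrays are row-major; out must hold
 * (in_h - k + 1) x (in_w - k + 1) values. */
void conv2d_valid(const int *in, size_t in_h, size_t in_w,
                  const int *kernel, size_t k, int *out)
{
    size_t out_h = in_h - k + 1;
    size_t out_w = in_w - k + 1;

    for (size_t m = 0; m < out_h; m++) {
        for (size_t n = 0; n < out_w; n++) {
            int acc = 0;                      /* one sum per output pixel */
            for (size_t i = 0; i < k; i++)
                for (size_t j = 0; j < k; j++)
                    acc += in[(m + i) * in_w + (n + j)] * kernel[i * k + j];
            out[m * out_w + n] = acc;
        }
    }
}
```

Sliding a 3x3 kernel over a 5x5 image this way produces a 3x3 output, one weighted sum per window position.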
Speed sign detection application
Image courtesy: Maurice Peemen
ConvNet Application in action
Video courtesy: Maurice Peemen, https://youtu.be/kkha3sPoU70
ConvNet Code Structure
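for (r = 0; r < 6; r++) {                 /* output feature maps */
  for (m = 0; m < YL1; m++) {             /* output rows */
    for (n = 0; n < XL1; n++) {           /* output columns */
      acc = bias[r];                      /* restart the sum for every output neuron */
      for (k = 0; k < 6; k++) {           /* kernel rows */
        for (l = 0; l < 6; l++) {         /* kernel columns */
          /* Compute. Window index written out assuming unit stride;
             the slide abbreviates it as in_layer[m,n,l]. */
          acc = acc + in_layer[m + k][n + l] * weight[r][k][l];
        }
      }
      index = saturate_shift(acc);        /* 10-bit fixed-point format */
      output_layer[r][m][n] = fixact[index];   /* Store */
    }
  }
}
r = output feature maps (6)
m, n = row and column of the neuron outputs
k, l = row and column within the 6x6 convolution kernel
fixact = sigmoid activation function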
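The slide does not show saturate_shift() itself. The sketch below is one possible shape for it, under loudly stated assumptions: the scaling factor ACC_SHIFT and the centring offset are hypothetical, and only the 10-bit index width comes from the slide comment.

```c
#include <stdint.h>

#define FIXACT_BITS 10                    /* index width, from the slide comment */
#define FIXACT_SIZE (1 << FIXACT_BITS)    /* 1024 table entries */
#define ACC_SHIFT   6                     /* hypothetical accumulator scaling */

/* Scale the accumulator down and clamp it to a valid table index.
 * Division by a power of two stands in for the shift so that negative
 * sums are well defined in C. The offset puts zero in the middle of the
 * table (an assumption, not taken from the thesis). */
int saturate_shift(int32_t acc)
{
    int32_t idx = acc / (1 << ACC_SHIFT) + FIXACT_SIZE / 2;

    if (idx < 0)
        idx = 0;                          /* saturate at the low end */
    if (idx > FIXACT_SIZE - 1)
        idx = FIXACT_SIZE - 1;            /* saturate at the high end */
    return (int)idx;
}
```

With an index guaranteed to lie in 0..1023, the sigmoid can be read from a precomputed table (fixact[index]) instead of being evaluated per neuron.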
Optimization Approach
• Methodology:
− Test on Core i7 (Haswell, AVX2)
− Move to Xeon Phi (Knights Corner, IMCI)
• Steps:
1. Loop unrolling (1 core)
2. Vectorization using SIMD intrinsics (DLP, 1 core)
− Fused Multiply-Add instruction
3. Parallelization using OpenMP (TLP, many-core)
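Step 1 can be illustrated on one row of the 6x6 kernel. This is a sketch only: the function names are hypothetical, and the element types (unsigned 8-bit inputs, signed 16-bit weights) are assumptions based on the byte counts given later in the Ops/Byte calculation.

```c
#include <stdint.h>

/* Rolled reference: one 6-tap kernel row. */
int32_t dot6_rolled(const uint8_t *in, const int16_t *w)
{
    int32_t acc = 0;
    for (int l = 0; l < 6; l++)
        acc += in[l] * w[l];
    return acc;
}

/* Same computation with the l-loop fully unrolled: no loop counter,
 * no branch, and six independent products the compiler can schedule. */
int32_t dot6_unrolled(const uint8_t *in, const int16_t *w)
{
    return in[0] * w[0] + in[1] * w[1] + in[2] * w[2]
         + in[3] * w[3] + in[4] * w[4] + in[5] * w[5];
}
```

Unrolling the fixed-size inner loop also exposes the straight-line multiply-add sequence that the SIMD intrinsics in step 2 replace.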
SIMD Vectorization example
Courtesy: www.kernel.org
Intel MIC Programming models
Credit: Dr. Volker Weinberg, Introduction into Intel Xeon Phi Programming, LRZ, 28.4.2015
Roofline Model
[Figure: roofline model. x-axis: actual FLOP/Byte ratio (1/8 to 16); y-axis: attainable GFLOP/s (0.5 to 256). The flat roof is the processor's peak performance (the y coordinate is performance); the sloped roof is the memory bandwidth limit (the slope is the bandwidth). Kernel 1 and Kernel 2 are marked, each with its own performance bound.]
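The roofline bound itself is a one-line formula: attainable performance is the lower of the compute ceiling and bandwidth times operational intensity. A minimal sketch, with a hypothetical function name and parameter names:

```c
/* Attainable performance (Gops/s) under the roofline model:
 * the lower of the compute ceiling and bandwidth * operational intensity. */
double roofline_attainable(double peak_gops, double bw_gbytes_per_s,
                           double ops_per_byte)
{
    double mem_bound = bw_gbytes_per_s * ops_per_byte;
    return mem_bound < peak_gops ? mem_bound : peak_gops;
}
```

A kernel whose operational intensity puts it under the sloped part of the roof is memory bound; one under the flat part is compute bound.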
Intel Core i7
• Intel Core i7 @ 3.5 GHz
• Haswell micro-architecture
• AVX2 vector instructions
− 256-bit vectors
Multiply Accumulate intrinsic – AVX2
Calculation of Ops/Byte
• acc += in_layer[i]*weight[j]
• Intrinsics used
• add(acc, madd(in_layer,weight))
• Bytes Loaded
• in_layer[i] - 1 byte
• weight[j] - 2 bytes
• Operational Intensity
• 2 ops / 3 bytes = 0.67 Ops/Byte
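The calculation above can be written out and combined with the roofline bound. The bandwidth and ceiling figures are the ones listed on the Core i7 roofline slide (16.6 GB/s STREAM, 224 GB/s L1 read, 56 Gops/s vector ceiling); the symbols I, B and P are introduced here for illustration.

```latex
\[
I = \frac{2\ \text{ops}}{1\ \text{byte} + 2\ \text{bytes}} \approx 0.67\ \text{ops/byte},
\qquad
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B\right)
\]
\[
B = 16.6\ \text{GB/s (STREAM)}:\; I \cdot B \approx 11.1\ \text{Gops/s},
\qquad
B = 224\ \text{GB/s (L1 read)}:\; I \cdot B \approx 149\ \text{Gops/s} > 56\ \text{Gops/s}
\]
```

Read this way, data served from DRAM would cap the kernel at roughly 11 Gops/s, while data served from L1 leaves the 56 Gops/s vector ceiling as the limit.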
Speedup after SIMD intrinsics

Compiler  Layer        w.r.t. non-vectorized code  w.r.t. auto-vectorized code
ICC       Layer 1      4.7x                        4.9x
ICC       Layer 2      5.7x                        11.3x
ICC       Layer 3      4.88x                       4.8x
ICC       Overall CNN  5.6x                        5x
GCC       Layer 1      4.7x                        same
GCC       Layer 2      6.8x                        same
GCC       Layer 3      6.7x                        same
GCC       Overall CNN  6.3x                        6.3x
Roofline - Core i7 - manual v/s auto
[Figure: single-core SIMD ops roofline, Intel i7 5930K @ 3.5 GHz. x-axis: Operational Intensity (Ops/Byte), 0.125 to 2; y-axis: Performance (GigaOps/s), 8 to 64.]
Ceilings: 56 Gops/s vector ops ceiling; 224 GBytes/s L1 cache read BW; 112 GBytes/s L1 cache write BW; 68 GBytes/s BW to DDR RAM; 16.6 GBytes/s STREAM BW
Plotted series: Layer1 Hand-optimized (gcc); Layer2 Hand-optimized (icc); Layer3 Hand-optimized (gcc); Complete CNN Hand-optimized (gcc); Complete CNN Auto-vectorized (gcc); Complete CNN no-vectorization (gcc)
Labelled points (Ops/Byte, GigaOps/s):
• Layer3 Hand-optimized: 0.67, 35.54
• Complete CNN Hand-optimized: 0.67, 32.46
• Complete CNN Auto-vectorized: 0.67, 5.134
Intel Xeon Phi
• Knights Corner
− Initial Many Core Instructions (IMCI)
− 57-61 cores
• Knights Landing
− AVX-512
Credit: Intel
Intel Xeon Phi
Intel Many Integrated Core Architecture
Credit: http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/
Core Architecture Overview
• 60+ in-order, low-power IA cores
• Bi-directional ring interconnect
• Two pipelines (u & v)
• Scalar unit based on Pentium
• 512-bit SIMD vector processing unit
• 4 hardware threads
• Coherent 512 KB L2 cache per core
Image courtesy: Intel PRACE MIC Summer School, July 2013, CINECA
Ref.: pp. 18-19, section 2.1.2, Xeon Phi Coprocessor System Software Developers Guide
Going from Core i7 to Xeon Phi (AVX to IMCI)
• Kernel statement: acc = acc + in_layer[m,n,l] x weight[r,k,l]
• Core i7 (AVX2): madd()
• Xeon Phi (IMCI): fmadd()
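The difference between the two instructions is easier to see in scalar form. The sketch below models their lane behaviour in portable C, assuming that madd() stands for the AVX2 intrinsic _mm256_madd_epi16 and fmadd() for the IMCI intrinsic _mm512_fmadd_epi32; the slides do not name the exact intrinsics, and the model function names are hypothetical.

```c
#include <stdint.h>

/* AVX2-style madd on 16 lanes of signed 16-bit input: multiply lane by
 * lane into 32-bit products, then add each adjacent pair -> 8 results. */
void madd16_model(const int16_t a[16], const int16_t b[16], int32_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* IMCI-style fused multiply-add on 16 lanes of 32-bit input:
 * out = a * b + c per lane, with no pairwise reduction.
 * Assumes the products do not overflow 32 bits. */
void fmadd32_model(const int32_t a[16], const int32_t b[16],
                   const int32_t c[16], int32_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = a[i] * b[i] + c[i];
}
```

In this model the AVX2 path needs a separate add to fold the eight partial sums into the accumulator, matching add(acc, madd(in_layer, weight)) from the Ops/Byte slide, while the IMCI path folds the accumulation into the same instruction but works on 32-bit lanes.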
Fused Multiply-Add on Xeon Phi
Intrinsics Kernel implementation
Speedup after SIMD intrinsics (Intel C Compiler)

Layer        w.r.t. non-vectorized code  w.r.t. auto-vectorized code
Layer 1      5.7x                        5.6x
Layer 2      10.2x                       6.3x
Layer 3      12.4x                       10.7x
Overall CNN  11x                         9.2x

• ~0.75 frames per second on one core
− 57 cores => 43 FPS (0.75 x 57, assuming linear scaling)
Roofline – Xeon Phi
[Figure: single-core roofline, Xeon Phi @ 1.1 GHz. x-axis: Operational Intensity (FLOP/Byte), 0.25 to 1; y-axis: Performance (GigaFLOP/s), 0.125 to 64.]
Ceilings: 35.2 GFLOP/s vector compute ceiling; 0.48 GFLOP/s scalar compute ceiling; 70.4 GBytes/s L1 cache BW; 35.2 GBytes/s L2 cache BW; 5.8 GB/s STREAM BW to DDR RAM
Roofline – Xeon Phi
[Figure: single-core roofline, Xeon Phi @ 1.1 GHz, same axes and ceilings as the previous slide, with the measured layers added.]
Plotted series: Layer 1, Layer 2 and Layer 3, each hand optimized and auto vectorized
Roofline – Xeon Phi - Complete
[Figure: single-core roofline, Xeon Phi @ 1.1 GHz, same axes and ceilings as the previous slides, for the complete CNN.]
Labelled points (FLOP/Byte, GigaFLOP/s):
• Complete - hand optimized: 0.67, 1.5626
• Complete - auto vectorized: 0.67, 0.1702
Demo
• Speed sign application running on:
• The Core i7
• The Xeon Phi
Conclusion
• Contribution: overall CNN speedup from SIMD intrinsics (w.r.t. non-vectorized code)
− Core i7 – 6.3x
− Xeon Phi – 11x
• Design trade-off:
− Developer time v/s optimized code
− Architecture-specific intrinsics v/s generic OpenMP
Future Work
• OpenMP number of threads
− Varying number of threads per core
− 1T x 57 cores = 57T
− 4T x 57 cores = 228T
• Varying thread distribution on cores
− KMP_AFFINITY (environment variable)
• Splitting work using OpenMP directives
− #pragma omp for
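A minimal sketch of the work splitting proposed above, assuming the loop over output feature maps is the one shared among threads. The per-map workload process_map() is a hypothetical stand-in for one feature map's convolution, not the thesis kernel.

```c
#include <stdint.h>

#define N_MAPS 6    /* output feature maps, as in the ConvNet loop nest */

/* Hypothetical per-map workload standing in for one feature map. */
static int32_t process_map(int r, const int16_t *data, int len)
{
    int32_t acc = 0;
    for (int i = 0; i < len; i++)
        acc += data[i] * (r + 1);
    return acc;
}

/* Feature maps are independent of each other, so the r-loop can be
 * divided among OpenMP threads; each thread writes its own result[r]. */
void process_all_maps(const int16_t *data, int len, int32_t result[N_MAPS])
{
    #pragma omp parallel for
    for (int r = 0; r < N_MAPS; r++)
        result[r] = process_map(r, data, len);
}
```

Built without OpenMP support the pragma is ignored and the loop runs serially with the same result. With OpenMP enabled, the thread count can be set through OMP_NUM_THREADS (57 or 228 in the configurations above) and thread placement through KMP_AFFINITY on the Intel runtime. With only six feature maps, a loop over output rows or a collapsed loop would be needed to keep that many threads busy.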
Test code on Xeon Phi
• Baseline - simulate diffusion of a solute through a volume of liquid
• OpenMP Scaling
− #pragma omp for collapse(2)
• Vectorization
− #pragma simd

                   Baseline  OpenMP Scaling  Vectorization  Peeling
Elapsed time (s)   5605.027  127.616         17.767         15.619
FLOPS (MFlops)     254.991   11199.45        80442.24       91506.41
Throughput (GB/s)  0.235     10.338          74.254         84.467

Credit: Jeffers, James, and James Reinders. Intel Xeon Phi coprocessor high-performance programming. Newnes, 2013.
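The two directives can be shown together on a small diffusion stencil. This is an illustrative sketch, not the code from the cited book: the grid layout, the function name diffusion_step and the weights are assumptions, and the portable "#pragma omp simd" is used in place of the Intel-specific "#pragma simd".

```c
/* Row-major index into an n x n x n grid. */
#define IDX(x, y, z, n) (((z) * (n) + (y)) * (n) + (x))

/* One explicit diffusion step with a 7-point stencil.
 * w is the weight of each of the six neighbours; boundary cells are
 * copied unchanged. in and out must not overlap. */
void diffusion_step(const float *in, float *out, int n, float w)
{
    const float c = 1.0f - 6.0f * w;    /* centre weight; weights sum to 1 */

    /* collapse(2) merges the z and y loops into one iteration space,
     * giving the runtime n*n chunks of work to distribute. */
    #pragma omp parallel for collapse(2)
    for (int z = 0; z < n; z++) {
        for (int y = 0; y < n; y++) {
            /* The unit-stride x loop is the one to vectorize. */
            #pragma omp simd
            for (int x = 0; x < n; x++) {
                int i = IDX(x, y, z, n);
                if (x == 0 || y == 0 || z == 0 ||
                    x == n - 1 || y == n - 1 || z == n - 1) {
                    out[i] = in[i];
                } else {
                    out[i] = c * in[i]
                           + w * (in[i - 1] + in[i + 1]
                                + in[i - n] + in[i + n]
                                + in[i - n * n] + in[i + n * n]);
                }
            }
        }
    }
}
```

Collapsing the two outer loops matters on a many-core part because a single outer loop of n iterations may not provide enough parallel work for a few hundred hardware threads.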
Thank You.
Questions?
