Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav Raina small

Deep Convolutional Network
Evaluation on the Intel Xeon Phi
Gaurav Raina
MSc Graduation Project
5-1-2016

Vision processing on mobile devices
• Currently most processing off-line
• High compute demands + energy
• Move to edge processing

Motivation
• Convolutional neural nets are very generic (they support many vision tasks)
− Traffic sign
− Pedestrian
− Face detection
• Accelerate with a power-efficient core

Problem statement
“Efficiently parallelize a Convolutional Neural Network on a highly-parallel, power-efficient processor platform”

Overview
1. ConvNet algorithm
2. Optimization Approach
3. Mapping on the Core i7
4. Mapping on the Xeon Phi
5. Conclusion & Future Work

Overview
1. Convolution Network (ConvNet) algorithm
3. Mapping on the core i7

Introduction to Neural Networks
• Artificial neuron model
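As a minimal sketch of the artificial neuron model: a neuron forms a weighted sum of its inputs plus a bias and passes it through an activation function. The names `neuron_output` and `fast_sigmoid` are hypothetical, and the cheap sigmoid-like squashing function is an assumption made here to keep the example free of libm; the deck itself uses a fixed-point sigmoid lookup table.

```c
#include <stddef.h>

/* Assumed stand-in for a sigmoid: maps any real x into (0, 1) with
 * fast_sigmoid(0) = 0.5. It is NOT the exact logistic function. */
static double fast_sigmoid(double x)
{
    double ax = x < 0.0 ? -x : x;
    return 0.5 * (x / (1.0 + ax)) + 0.5;
}

/* Hypothetical helper: y = f(bias + sum_i w[i] * x[i]). */
double neuron_output(const double *x, const double *w, size_t n, double bias)
{
    double acc = bias;              /* start from the bias term */
    for (size_t i = 0; i < n; i++)
        acc += w[i] * x[i];         /* weighted sum of the inputs */
    return fast_sigmoid(acc);       /* squash to the range (0, 1) */
}
```

With two inputs whose weighted contributions cancel, the output sits at the midpoint 0.5.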

Convolution example
Image credit: deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
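To make the convolution example concrete, here is a small sketch of a "valid" 2-D convolution in the ConvNet sense (the kernel slides over the image without flipping). The function name `conv2d_valid`, the integer element type and the row-major layout are assumptions chosen for illustration.

```c
#include <stddef.h>

/* "Valid" 2-D convolution (no kernel flip, no padding).
 * in is ih x iw, kernel is kh x kw, out is (ih-kh+1) x (iw-kw+1);
 * all arrays are row-major. */
void conv2d_valid(const int *in, size_t ih, size_t iw,
                  const int *kernel, size_t kh, size_t kw,
                  int *out)
{
    size_t oh = ih - kh + 1, ow = iw - kw + 1;
    for (size_t m = 0; m < oh; m++) {
        for (size_t n = 0; n < ow; n++) {
            int acc = 0;
            for (size_t k = 0; k < kh; k++)
                for (size_t l = 0; l < kw; l++)
                    acc += in[(m + k) * iw + (n + l)] * kernel[k * kw + l];
            out[m * ow + n] = acc;   /* one output per kernel position */
        }
    }
}
```

A 3x3 image convolved with a 2x2 kernel yields a 2x2 feature map, one value per position where the kernel fits entirely inside the image.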

Speed sign detection application
Image courtesy: Maurice Peemen

ConvNet Application in action
Video courtesy: Maurice Peemen
https://youtu.be/kkha3sPoU70

ConvNet Code Structure
for (r = 0; r < 6; r++) {                 // 6 output feature maps
  for (m = 0; m < YL1; m++) {             // output rows
    for (n = 0; n < XL1; n++) {           // output columns
      acc = bias[r];                      // reset for every output neuron
      for (k = 0; k < 6; k++) {           // 6x6 convolution kernel
        for (l = 0; l < 6; l++) {
          // Compute (unit stride assumed)
          acc += in_layer[m + k][n + l] * weight[r][k][l];
        }
      }
      index = saturate_shift(acc);        // 10-bit fixed-point format
      output_layer[r][m][n] = fixact[index];   // Store
    }
  }
}
• r = output feature maps (6)
• m, n = neuron outputs (row, column)
• k, l = 6x6 convolution kernel
• fixact = sigmoid activation function
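The slide does not show how `saturate_shift` or the `fixact` table are built. The sketch below is one possible reading, not the author's implementation: the shift amount `ACC_SHIFT`, the signed clamp range and the 1024-entry table size are all assumptions derived only from the "10-bit fixed-point" comment.

```c
#include <stdint.h>

#define FIXACT_SIZE 1024   /* assumed: 10-bit index -> 1024 table entries */
#define ACC_SHIFT   8      /* assumed number of fractional bits to drop */

/* Hypothetical saturate_shift: scale the accumulator down, clamp it to a
 * signed 10-bit range and bias it into a table index in [0, 1023]. */
static int saturate_shift(int32_t acc)
{
    int32_t v = acc / (1 << ACC_SHIFT);  /* division rounds toward zero */
    if (v < -512) v = -512;              /* saturate on the negative side */
    if (v >  511) v =  511;              /* saturate on the positive side */
    return (int)(v + 512);
}

/* Activation by table lookup; fixact must hold FIXACT_SIZE
 * precomputed sigmoid values. */
uint8_t activate(int32_t acc, const uint8_t *fixact)
{
    return fixact[saturate_shift(acc)];
}
```

Replacing the sigmoid by a lookup keeps the inner loop in integer arithmetic, which is what makes the integer SIMD instructions discussed later applicable.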

Optimization Approach
• Methodology:
− Test on Core i7 (Haswell – AVX2)
− Move to Xeon Phi (Knights Corner – IMCI)
• Steps:
1. Loop unrolling (1 core)
2. Vectorization using SIMD intrinsics (DLP) (1 core)
− Fused Multiply Add instruction
3. Parallelization using OpenMP (TLP) (many-core)
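Step 1 can be illustrated with a multiply-accumulate loop unrolled by four. This is a sketch only: the function name `mac_unrolled4` is hypothetical, and the element types (8-bit inputs, 16-bit weights) are inferred from the byte counts given on the Ops/Byte slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate over n elements, unrolled by 4. Four independent
 * accumulators break the dependency chain between iterations. */
int32_t mac_unrolled4(const uint8_t *in, const int16_t *w, size_t n)
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += in[i]     * w[i];
        a1 += in[i + 1] * w[i + 1];
        a2 += in[i + 2] * w[i + 2];
        a3 += in[i + 3] * w[i + 3];
    }
    for (; i < n; i++)            /* remainder when n is not a multiple of 4 */
        a0 += in[i] * w[i];
    return a0 + a1 + a2 + a3;
}
```

The unrolled body has the same shape as one SIMD instruction operating on several lanes, which is why unrolling is a natural first step before vectorization.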

SIMD Vectorization example
Courtesy: www.kernel.org

Intel MIC Programming models
Credit: Dr. Volker Weinberg, Introduction into Intel Xeon Phi Programming, LRZ, 28.4.2015

Roofline Model
[Figure: roofline plot. X axis: actual FLOP/Byte ratio (1/8 to 16). Y axis: attainable GFLOP/s (0.5 to 256); the Y coordinate is performance.]
• Performance Roofline: the horizontal compute ceiling
• Processor BW Roofline: the sloped part (slope is BW)
• Kernel 1 and Kernel 2: each kernel's performance bound is where its FLOP/Byte ratio meets the roofline
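The roofline bound can be expressed as a one-line function: attainable performance is the smaller of the compute ceiling and bandwidth times operational intensity. The function name and parameter names below are illustrative, not taken from the deck.

```c
/* Attainable performance under the roofline model:
 * min(peak compute, bandwidth * operational intensity).
 * Units: peak in Gops/s, bandwidth in GBytes/s, intensity in Ops/Byte. */
double roofline_attainable(double peak_gops,
                           double bw_gbytes_per_s,
                           double ops_per_byte)
{
    double mem_bound = bw_gbytes_per_s * ops_per_byte;
    return mem_bound < peak_gops ? mem_bound : peak_gops;
}
```

A kernel whose intensity puts it left of the ridge point is bandwidth-bound; to the right of it, the compute ceiling applies.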

Intel Core i7
• Intel Core i7 @3.5GHz
• Haswell micro-architecture
• AVX2 vector instructions
− 256-bit vectors

Multiply Accumulate intrinsic – AVX2
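A hedged sketch of how a multiply-accumulate kernel might use the AVX2 `madd` intrinsic: 8-bit inputs are widened to 16 bits, multiplied with 16-bit weights, and adjacent products are summed into 32-bit lanes. The function name `dot_u8_i16` and the data layout are assumptions; the code falls back to a scalar loop when the compiler does not define `__AVX2__`.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Dot product of 8-bit inputs and 16-bit weights, 16 elements per step. */
int32_t dot_u8_i16(const uint8_t *in, const int16_t *w, size_t n)
{
    int32_t sum = 0;
    size_t i = 0;
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    for (; i + 16 <= n; i += 16) {
        /* widen 16 unsigned bytes to 16 signed 16-bit lanes */
        __m256i x = _mm256_cvtepu8_epi16(
            _mm_loadu_si128((const __m128i *)(in + i)));
        __m256i y = _mm256_loadu_si256((const __m256i *)(w + i));
        /* madd: multiply 16-bit pairs, add neighbours into 8 x 32-bit */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(x, y));
    }
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    for (int j = 0; j < 8; j++)      /* horizontal sum of the 8 lanes */
        sum += lanes[j];
#endif
    for (; i < n; i++)               /* scalar tail, or whole loop without AVX2 */
        sum += in[i] * w[i];
    return sum;
}
```

Building with `-mavx2` (GCC) or an equivalent flag enables the vector path; the result is the same either way.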

Calculation of Ops/Byte
• acc += in_layer[i] * weight[j]
• Intrinsics used
− add(acc, madd(in_layer, weight))
• Bytes loaded
− in_layer[i] - 1 byte
− weight[j] - 2 bytes
• Operational intensity
− 2 ops / 3 bytes = 0.67 Ops/Byte
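Written out as a worked calculation, the operational intensity and the resulting memory bound look as follows. The peak and bandwidth figures are taken from the Core i7 roofline slide (56 Gops/s vector ceiling, 16.6 GBytes/s STREAM bandwidth); treating the STREAM figure as the relevant memory roof is an assumption.

```latex
\begin{align*}
I &= \frac{\text{ops}}{\text{bytes loaded}}
   = \frac{1\,\text{multiply} + 1\,\text{add}}{1\,\text{byte} + 2\,\text{bytes}}
   = \frac{2}{3} \approx 0.67\ \text{Ops/Byte} \\
P_{\text{attainable}} &= \min\left(P_{\text{peak}},\; B \cdot I\right)
   = \min\left(56,\; 16.6 \times 0.67\right) \approx 11.1\ \text{Gops/s}
\end{align*}
```

The measured 32.46 Gops/s for the complete hand-optimized CNN lies above this DRAM-bound figure, which suggests the working set is served mostly from cache; that reading is an inference, not a claim made on the slides.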

Speedup after SIMD intrinsics
Compiler            Layer        w.r.t. non-vectorized code  w.r.t. auto-vectorized code
Intel C Compiler    Layer 1      4.7x                        4.9x
Intel C Compiler    Layer 2      5.7x                        11.3x
Intel C Compiler    Layer 3      4.88x                       4.8x
Intel C Compiler    Overall CNN  5.6x                        5x
GCC                 Layer 1      4.7x                        same
GCC                 Layer 2      6.8x                        same
GCC                 Layer 3      6.7x                        same

Roofline - Core i7 - manual v/s auto
[Figure: single-core SIMD ops roofline, Intel i7 5930K @ 3.5 GHz. X axis: Operational Intensity (Ops/Byte), 0.125 to 2. Y axis: Performance (GigaOps/s), 8 to 64.]
Ceilings:
• 56 Gops/s vector ops ceiling
• 224 GBytes/s read BW, L1 cache
• 112 GBytes/s write BW, L1 cache
• 68 GBytes/s BW to DDR RAM
• 16.6 GBytes/s STREAM BW
Labeled points (Ops/Byte, GigaOps/s):
• Layer 3 hand-optimized: 0.67, 35.54
• Complete CNN hand-optimized: 0.67, 32.46
• Complete CNN auto-vectorized: 0.67, 5.134
Legend: Layer 1 hand-optimized (gcc), Layer 2 hand-optimized (icc), Layer 3 hand-optimized (gcc), complete CNN hand-optimized (gcc), complete CNN auto-vectorized (gcc), complete CNN no vectorization (gcc)

Overview
3. Mapping on the Intel core i7

Intel Xeon Phi
• Knights Corner
− Initial Many Core Instructions (IMCI)
• Knights Landing
− AVX-512
• 57-61 cores
Credit: Intel

Intel Xeon Phi
Intel Many Integrated Core Architecture
Credit: http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/

Core Architecture Overview
• 60+ in-order, low power IA cores
• Bi-directional Ring interconnect
• Two pipelines (u & v)
• Scalar Unit based on Pentium
• 512bit SIMD Vector Processing unit
• 4 hardware threads
• Coherent 512KB L2 Cache per core
Image courtesy: Intel PRACE MIC Summer School, July 2013, CINECA
Ref. pg. 18-19, section 2.1.2 Xeon Phi Co-proc system software devs guide

Going from Core i7 to Xeon Phi (AVX to KNC)

Going from Core i7 to Xeon Phi (AVX to IMCI)
• madd() (AVX2) → fmadd() (IMCI)
• acc = acc + in_layer[m,n,l] x weight[r,k,l]

Fused Multiply-Add on Xeon Phi
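The difference between the two instructions can be shown with a plain-C model of their lane semantics. These are not the intrinsics themselves: the function names are hypothetical, and the model assumes that the `fmadd()` on the slide is a 512-bit multiply-add over sixteen 32-bit integer lanes, whereas AVX2 `madd` works on 16-bit pairs.

```c
#include <stdint.h>

/* Scalar model of AVX2 madd on 16-bit lanes: multiply 16 pairs, then add
 * adjacent products, giving 8 32-bit results. */
void madd_epi16_model(const int16_t a[16], const int16_t b[16],
                      int32_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Scalar model of a 512-bit integer multiply-add on 32-bit lanes:
 * acc[i] += a[i] * b[i] for 16 lanes, with no pairwise addition. */
void fmadd_epi32_model(const int32_t a[16], const int32_t b[16],
                       int32_t acc[16])
{
    for (int i = 0; i < 16; i++)
        acc[i] += a[i] * b[i];
}
```

Under this model the Xeon Phi version folds the separate `add` into the multiply, so the kernel statement `acc = acc + in_layer * weight` maps to a single instruction per vector.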

Intrinsics Kernel implementation

Speedup after SIMD intrinsics
Intel C Compiler (ICC):
Layer        w.r.t. non-vectorized code  w.r.t. auto-vectorized code
Layer 1      5.7x                        5.6x
Layer 2      10.2x                       6.3x
Layer 3      12.4x                       10.7x
Overall CNN  11x                         not given
• ~0.75 frames per second (single core)
− 57 cores => ~43 FPS (0.75 x 57, assuming linear scaling)

Roofline – Xeon Phi
[Figure: single-core roofline, Xeon Phi @ 1.1 GHz. X axis: Operational Intensity (FLOP/Byte), 0.25 to 1. Y axis: Performance (GigaFLOP/s), 0.125 to 64.]
Ceilings:
• 35.2 GFLOP/s vector compute ceiling
• 0.48 GFLOP/s scalar compute ceiling
• 70.4 GBytes/s BW, L1 cache
• 35.2 GBytes/s BW, L2 cache
• 5.8 GB/s STREAM BW to DDR RAM

Roofline – Xeon Phi
[Figure: Xeon Phi roofline with per-layer results. X-axis ticks 0.25 to 1; Y-axis ticks 0.125 to 64. Roof shown: 5.8 GB/s STREAM BW to DDR RAM.]
Legend: Layer 1, Layer 2 and Layer 3, each plotted hand-optimized and auto-vectorized

Roofline – Xeon Phi - Complete
[Figure: Xeon Phi roofline for the complete CNN. X-axis ticks 0.25 to 1; Y-axis ticks 0.125 to 64. Roof shown: 5.8 GB/s BW to DDR RAM.]
Labeled points (FLOP/Byte, GigaFLOP/s):
• Complete, hand-optimized: 0.67, 1.5626
• Complete, auto-vectorized: 0.67, 0.1702
Demo
• Speed sign application running on:
• The Core i7
• The Xeon Phi

Conclusion
• Contribution
− Core i7 – 6.3x speedup
− Xeon Phi – 11x speedup
• Design trade-off:
− Developer time vs. optimized code
− Architecture-specific intrinsics vs. generic OpenMP

Future Work
OpenMP number of threads
• Varying number of threads per core
− 1T x 57 cores = 57T
− 4T x 57 cores = 228T
• Varying thread distribution on cores
− KMP_AFFINITY (environment variable)
• Splitting work using OpenMP directives
− #pragma omp for
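A minimal sketch of how work could be split across threads with an OpenMP directive, assuming the output feature maps are independent of each other. The function `feature_map_dots` is a hypothetical stand-in for a layer, not the thesis code; it uses the combined `parallel for` form rather than a bare `#pragma omp for` so that it is self-contained.

```c
#include <stddef.h>

/* For each of `maps` output feature maps, compute the dot product of the
 * shared input with that map's weights. Iterations of the outer loop do
 * not depend on each other, so they can run on different threads. */
void feature_map_dots(const int *in, const int *weight,
                      size_t maps, size_t len, long *out)
{
    #pragma omp parallel for
    for (long r = 0; r < (long)maps; r++) {
        long acc = 0;                       /* private to each iteration */
        for (size_t i = 0; i < len; i++)
            acc += (long)in[i] * weight[(size_t)r * len + i];
        out[r] = acc;
    }
}
```

When built with OpenMP enabled (for example `-fopenmp` with GCC), the thread count can be varied through `OMP_NUM_THREADS`, and with the Intel runtime the placement through `KMP_AFFINITY` (for example `compact` or `scatter`). Without OpenMP the pragma is ignored and the loop runs serially.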

Test code on Xeon Phi
                    Baseline   OpenMP Scaling  Vectorization  Peeling
Elapsed time (s)    5605.027   127.616         17.767         15.619
FLOPS (MFlops)      254.991    11199.45        80442.24       91506.41
Throughput (GB/s)   0.235      10.338          74.254         84.467
• Baseline - simulate diffusion of a solute through a volume of liquid
• OpenMP Scaling
− #pragma omp for collapse(2)
• Vectorization
− #pragma simd
Credit: Jeffers, James, and James Reinders. Intel Xeon Phi coprocessor high-performance programming. Newnes, 2013.
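The two directives can be sketched on a simplified diffusion stencil. This is not the code from the cited book: the function `diffusion_step`, the 7-point averaging weights and the grid layout are assumptions, and the portable `#pragma omp simd` is swapped in for the Intel-specific `#pragma simd`.

```c
#include <stddef.h>

/* One time step of a simplified 3-D diffusion stencil on an nx*ny*nz grid.
 * Only interior points are updated; each becomes the average of itself
 * and its six neighbours (an assumed, illustrative weighting). */
void diffusion_step(const float *in, float *out, int nx, int ny, int nz)
{
    size_t plane = (size_t)nx * (size_t)ny;   /* elements per z-slice */
    #pragma omp parallel for collapse(2)      /* merge z and y loops for threads */
    for (int z = 1; z < nz - 1; z++) {
        for (int y = 1; y < ny - 1; y++) {
            #pragma omp simd                  /* vectorize the contiguous x loop */
            for (int x = 1; x < nx - 1; x++) {
                size_t c = ((size_t)z * ny + y) * nx + x;
                out[c] = (in[c] + in[c - 1] + in[c + 1]
                          + in[c - (size_t)nx] + in[c + (size_t)nx]
                          + in[c - plane] + in[c + plane]) / 7.0f;
            }
        }
    }
}
```

Collapsing the two outer loops gives the runtime more iterations to distribute, which matters when the thread count (up to 228 on the cards discussed here) exceeds the extent of a single loop.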
