Optimizing Machine Learning Workloads on Intel® Platforms
Colfax International — colfaxresearch.com
November 2016
colfaxresearch.com/ Welcome © Colfax International, 2013–2016
Disclaimer
While best efforts have been used in preparing this training, Colfax International makes no
representations or warranties of any kind and assumes no liabilities of any kind with respect to
the accuracy or completeness of the contents and specifically disclaims any implied warranties
of merchantability or fitness of use for a particular purpose. The publisher shall not be held
liable or responsible to any person or entity with respect to any loss or incidental or
consequential damages caused, or alleged to have been caused, directly or indirectly, by the
information or programs contained herein. No warranty may be created or extended by sales
representatives or written sales materials.
Colfax Research
http://colfaxresearch.com/
§2. Code Modernization
What is Code Modernization?
Code Modernization: optimizing software to better utilize features available in modern computer architectures.
▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch
[Figure: bar chart "Optimization of NeuralTalk2" (colfaxresearch.com). Vertical axis: Performance (images/s), 0 to 30. Horizontal axis, optimization stages: Original; Intel Compiler + MKL; Middleware Changes; User Code Changes; Parallel Strategy; MCDRAM as Cache.]
Intel® Xeon® processor E5-2650 v4 (2 sockets), bar labels: 0.91, 1.5, 11, 15, 25
Intel® Xeon Phi™ processor 7210 (KNL), bar labels: 5.7, 10, 21, 28
Overall speedups annotated on the chart: 55x and 28x
Source: Colfax Research Summary Paper
Intel Python Performance
[Figure: bar chart "Intel Python on Knights Landing Processors (N=5000)" (colfaxresearch.com). Vertical axis: Relative Performance, 0 to 180. Benchmarks: LU Decomposition, Cholesky Decomposition, Singular Value Decomposition, DGEMM. Series: CPython, SciPy; CPython, NumPy; Intel Python, SciPy.]
Bar labels as extracted: 1.0, 1.0, 1.0, 1.0, 3.5, 3.6, 1.1, 7.0, 29.0, 17.0, 8.3, 154.0
Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
Three Approaches
High Level Approach: use high-level libraries that are pre-optimized for modern architectures.
▷ IntelCaffe, TensorFlow, Scikit-learn, etc.
Low Level Approach: apply code modernization techniques to frameworks/applications.
▷ Colfax Research website, HOW series, Intel Modern Code page, etc.
Middle Ground Approach: integrate pre-optimized kernels into frameworks/applications.
▷ Intel® MKL DNN primitives, Intel® DAAL, Intel® MKL-DNN, etc.
§3. The High Level Approach
Intel Libraries for Machine Learning
[Figure: two bar charts, each with bars for a Xeon Phi Processor and a Broadwell Xeon Processor. Legend: BVLC, Intel.]
LeNet (Cifar10, minibatch 64), Forward/Backward Perf (k-img/s), bar labels: 0.15k, 0.75k, 13.27k, 25.16k
VGG16 (ImageNet, minibatch 64), Forward/Backward Perf (img/s), bar labels: 0.91, 3.82, 54.40, 28.57
References for Intel Machine Learning Libraries
▷ Intel MKL (https://software.intel.com/en-us/intel-mkl)
▷ Intel® MKL-DNN (https://github.com/01org/MKL-DNN)
▷ IntelCaffe (https://github.com/intel/caffe)
▷ Intel Theano (https://github.com/intel/theano)
▷ Intel DAAL (https://software.intel.com/en-us/intel-daal)
▷ Intel Torch (https://github.com/xhzhao/Optimized-Torch)
▷ IntelPython (https://software.intel.com/en-us/intel-distribution-for-python)
  • Scikit-learn, NumPy, SciPy, etc.
▷ And more coming...
  • TensorFlow, CNTK, etc.
Intel Distribution for Python
[Diagram labels: SciPy, Caffe; Intel Distribution for Python → Intel Math Kernel Library → Intel DAAL]
Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
§4. Low Level Approach
Optimization Areas
▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
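As a hedged illustration of the Threading and Vectorization areas listed above, the sketch below applies ReLU to a flat array using a single OpenMP directive. It is not the code from the Torch case study; the function name `relu_inplace` and the flat `std::vector<float>` layout are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): in-place ReLU over a flat buffer.
// The pragma asks the compiler to split the iterations across threads and to
// use SIMD instructions within each thread. If OpenMP is not enabled at
// compile time, the pragma is ignored and the loop runs serially.
void relu_inplace(std::vector<float>& x) {
    float* data = x.data();
    const long n = static_cast<long>(x.size());
    assert(n == 0 || data != nullptr);
#pragma omp parallel for simd
    for (long i = 0; i < n; ++i) {
        data[i] = data[i] > 0.0f ? data[i] : 0.0f;
    }
}
```

Each iteration touches only its own element, so the loop has no cross-iteration dependence; that property is what makes both the threading and the SIMD part of the directive safe here.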
Case Study: VGG-Net on Torch
[Figure: bar chart "Optimization of NeuralTalk2" (colfaxresearch.com). Vertical axis: Performance (images/s), 0 to 30. Horizontal axis, optimization stages: Original; Intel Compiler + MKL; Middleware Changes; User Code Changes; Parallel Strategy; MCDRAM as Cache.]
Intel® Xeon® processor E5-2650 v4 (2 sockets), bar labels: 0.91, 1.5, 11, 15, 25
Intel® Xeon Phi™ processor 7210 (KNL), bar labels: 5.7, 10, 21, 28
Overall speedups annotated on the chart: 55x and 28x
Source: Colfax Research Summary Paper
Base Torch Performance
[Figure: line plot "Comp. Perf. (64 threads)". Vertical axis: images/s, 0 to 18. Horizontal axis: Batch Count (images), 10 to 60.]
By Layer:
▷ ReLU: 66%
▷ Conv: 30%
▷ MaxPool: 3%
▷ Other: <1%
Performance After ReLU Optimization
[Figure: line plot comparing "Original (64 threads)" and "ReLU optimized (64 threads)". Vertical axis: images/s, 0 to 40. Horizontal axis: Batch Count (images), 10 to 60.]
ReLU: 160x boost
By Layer:
▷ ReLU: 1%
▷ Conv: 85%
▷ MaxPool: 11%
▷ Other: 3%
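The layer breakdown can be sanity-checked with an Amdahl-style estimate. The sketch below assumes that the "By Layer" percentages are shares of run time and that the 160x boost applies to the ReLU layer alone; both are readings of the slides, not statements from them. The function names are made up for this example.

```cpp
#include <cassert>

// Amdahl-style estimate: a fraction `p` of the run time is accelerated by a
// factor `s`, the rest is unchanged. Returns the overall speedup.
double overall_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

// Share of the new run time that is still spent in the accelerated part.
double remaining_share(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return (p / s) / ((1.0 - p) + p / s);
}
```

With p = 0.66 and s = 160, `overall_speedup` comes out at roughly 2.9 and `remaining_share` at roughly 0.012, which is in line with the "ReLU: 1%" entry after optimization. The estimate also shows why the convolution layer becomes the next target: once ReLU is cheap, the untouched 34% dominates.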
FALCON paper
https://colfaxresearch.com/falcon-library/
Learn More
Colfax Research
http://colfaxresearch.com/
→ HowSeries.com
§5. The Middle Ground Approach
Intel MKL and Intel MKL-DNN
[Figure omitted. Slide credit: Intel Corp.]
Stand-alone Example: Convolution
// Create the MKL DNN primitive object
dnnPrimitive_t convFwd;
dnnConvolutionCreateForward_F32(&convFwd, NULL, dnnAlgorithmConvolutionDirect,
                                dim, input_dims, output_dims, filter_dims,
                                conv_strides, padding, dnnBorderZeros);

// Set up the resource array that points to the data buffers
void* conv_res[dnnResourceNumber];
conv_res[dnnResourceSrc]    = (void*) input;
conv_res[dnnResourceFilter] = (void*) filter;
conv_res[dnnResourceDst]    = (void*) output;

// Execute the workload with the primitive created above
dnnExecute_F32(convFwd, conv_res);
For more: Intel MKL documentation on DNN primitives
Example Integration: IntelCaffe
GitHub link: https://github.com/intel/caffe/
Example layer implementations: caffe/src/caffe/layers/mkl_*.cpp
// Grab parameters from the Caffe layer
PoolingParameter pool_param = this->layer_param_.pooling_param();
channels_ = bottom[0]->channels();
height_ = bottom[0]->height();
width_ = bottom[0]->width();
num_ = bottom[0]->num();
// ... //
kernel_h_ = pool_param.kernel_h(); kernel_w_ = pool_param.kernel_w();
// ..... //

// Create the math kernel from these parameters
status = dnnPoolingCreateForward<Dtype>( /* ... */ );
§6. Distributed Memory Computation
"FLOPs Are Cheap"?
Theoretical estimates, Intel® Xeon E5-2697 V3 processor:
Performance = 28 cores × 2.7 GHz × (256/64) vec. lanes × 2 FMA × 2 FPU ≈ 1.2 TFLOP/s
Required Data Rate = 1.2 TFLOP/s × 8 bytes ≈ 10 TB/s
OPA Max Bandwidth = 12.5 GB/s ≈ 0.01 TB/s
Ratio = 10/0.01 ≈ 1000 (FLOPs)/(Memory Transferred)
To put it short...
Difficulty of Distributed Computation: in the time it takes to transfer one data element over the network, the processors can perform on the order of a thousand operations on it.
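The estimate above can be reproduced step by step. The sketch below is only a worked version of the slide's arithmetic; the struct and function names are hypothetical, and the inputs are the slide's own figures (28 cores, 2.7 GHz, 256/64 vector lanes, 2 for FMA, 2 FPUs, 8 bytes per element, 12.5 GB/s network bandwidth).

```cpp
#include <cassert>

// Hypothetical container for the three quantities on the slide.
struct NodeEstimate {
    double peak_gflops;      // theoretical peak, GFLOP/s
    double required_gbytes;  // GB/s needed to feed every FLOP a fresh operand
    double ratio;            // required data rate divided by network bandwidth
};

NodeEstimate estimate(double cores, double ghz, double vec_lanes,
                      double fma, double fpus,
                      double bytes_per_element, double network_gbytes) {
    assert(network_gbytes > 0.0);
    NodeEstimate e;
    e.peak_gflops = cores * ghz * vec_lanes * fma * fpus;
    e.required_gbytes = e.peak_gflops * bytes_per_element;
    e.ratio = e.required_gbytes / network_gbytes;
    return e;
}
```

Calling `estimate(28, 2.7, 256.0 / 64.0, 2, 2, 8, 12.5)` gives a peak of about 1210 GFLOP/s and a required data rate of about 9700 GB/s. The unrounded ratio is about 770; the slide rounds the intermediate values to 10 TB/s and 0.01 TB/s, which yields the quoted ≈ 1000. Either way the conclusion is the same order of magnitude.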
Distributed Computation for Neural Networks
[Diagram: two training schemes, each drawn for node 1 and node 2. Every node is shown with the steps Forward, Backward, Loss, Update.]
▷ Data Parallel: the nodes are linked by a "Gather Gradients" step. Gradient transferred but not data.
▷ Model Parallel: the nodes exchange "Partial Results". Data transferred but not gradient.
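A minimal single-process sketch of the data-parallel scheme follows. It simulates the nodes with a loop instead of real communication, and it uses a one-parameter linear model with a squared-error loss; the model, the function names, and the averaging rule are assumptions chosen to keep the example small, not details from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gradient of 0.5 * (w * x - y)^2 with respect to w, averaged over one shard.
double shard_gradient(double w, const std::vector<double>& x,
                      const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    double g = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) g += (w * x[i] - y[i]) * x[i];
    return g / static_cast<double>(x.size());
}

// One data-parallel step. Each simulated node keeps its own shard of the data
// and computes a local gradient; only the gradients are combined (this sum
// stands in for the gather step), then one update is applied to the weight.
double data_parallel_step(double w, double lr,
                          const std::vector<std::vector<double>>& xs,
                          const std::vector<std::vector<double>>& ys) {
    assert(xs.size() == ys.size() && !xs.empty());
    double sum = 0.0;
    for (std::size_t node = 0; node < xs.size(); ++node)
        sum += shard_gradient(w, xs[node], ys[node]);
    return w - lr * sum / static_cast<double>(xs.size());
}
```

The point of the sketch is what crosses the node boundary: one gradient value per node per step, never the shard itself. In a real run the loop over nodes would be replaced by a collective operation.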
Caffe Scaling
[Figure omitted.]
Source: Intel® Corporation, "Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family".
Machine Learning Framework: Intel® DAAL
Algorithms in DAAL
Analysis
- Low Order Moments
- Quantile
- Correlation and Variance
- Cosine Distance Matrix
- Correlation Distance Matrix
- K-Means Clustering
- Principal Component Analysis
- Cholesky Decomposition
- Singular Value Decomposition
- QR Decomposition
- Expectation-Maximization
- Multivariate Outlier Detection
- Univariate Outlier Detection
- Association Rules
- Kernel Functions
- Quality Metrics
Training & prediction
- Regression
  - Linear/Ridge Regression
- Classification
  - Naive Bayes Classifier
  - Boosting
  - SVM
  - Neural Networks
  - Multi-Class Classifier
Portal: DAAL page. See also: intro article, CR papers.
Algorithms in DAAL
[Diagram: three computation modes.]
▷ Distributed Mode: each Data Set goes through its own Partial Computation; the partial results are combined into the Final Result.
▷ Batch Mode: one Data Set goes through a Full Computation that produces the Final Result.
▷ Online Mode: several Data Sets feed a Full Computation that produces the Final Result.
Portal: DAAL page. See also: intro article, CR papers.
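The three modes can be illustrated with the simplest possible statistic, a mean. The sketch below is not DAAL code and uses no DAAL types; `Partial`, `update`, `merge_partials`, and `finalize` are hypothetical names that only mimic the partial-result pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical partial result for a mean; not a DAAL type.
struct Partial {
    double sum = 0.0;
    std::size_t count = 0;
};

// Fold one block of data into a partial result.
// Batch mode calls this once; online mode calls it once per arriving block.
void update(Partial& p, const std::vector<double>& block) {
    for (double v : block) { p.sum += v; ++p.count; }
}

// Combine the partial results of several nodes (distributed mode).
Partial merge_partials(const std::vector<Partial>& parts) {
    Partial total;
    for (const Partial& p : parts) { total.sum += p.sum; total.count += p.count; }
    return total;
}

// Turn a partial result into the final result.
double finalize(const Partial& p) {
    assert(p.count > 0);
    return p.sum / static_cast<double>(p.count);
}
```

All three paths end in the same `finalize` call, so they agree on the answer as long as the statistic can be expressed through mergeable partial results; only the way the data reaches the computation differs.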
Communication Framework: MPI
Structure of MPI Applications: Hello World
#include "mpi.h"
#include <cstdio>
int main (int argc, char *argv[]) {
  MPI_Init (&argc, &argv);                   // Initialize MPI environment
  int rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);     // ID of current process
  MPI_Get_processor_name (name, &namelen);   // Hostname of node
  MPI_Comm_size (MPI_COMM_WORLD, &size);     // Number of processes
  printf ("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);
  MPI_Finalize ();                           // Terminate MPI environment
}
Collective Communication: Gather
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);
[Diagram: Gather. Four senders each hold a piece of data; a single receiver collects all of the pieces.]
Collective Communication: Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);
[Diagram: Broadcast. One sender holds the data; four receivers each get a copy.]
Implementation
Example Distributed Image Processing: DAAL
▷ Algorithm <step1Local> is responsible for the forward/backward propagation.
training::Distributed<step1Local> local_net;      // local net algorithm
local_net.compute();                              // forward/backward
part_res = local_net.getPartialResult();          // get the partial result
local_net.input.get(training::inputModel)
    ->setWeightsAndBiases(wb);                    // update the weights/biases
▷ Algorithm <step2Master> is responsible for accumulating the gradient.
training::Distributed<step2Master> master_net;    // master net algorithm
master_net.input.add(training::partialResults,    // add a partial result
                     0, part_res);
master_net.compute();                             // accumulate gradients
wbModel = master_net.getPartialResult()           // get the current model
    ->get(training::resultFromMaster)
    ->get(training::model);
wb = wbModel->getWeightsAndBiases();              // extract weights/biases
Example Distributed Image Processing (Part 1)
// Computation part of the node with the master net
// Local forward and backward propagation
local_net.compute();
part_res[master_node_id] = local_net.getPartialResult();

// ... Code to store the result into a buffer (char *) ... //

// Send the result to the master node
MPI_Gather(....);

// ... Code to reconstruct the partial result from the buffer... //

// Accumulate the partial results from all nodes
for (int i = 0; i < num_nodes; i++)
  master_net.input.add(training::partialResults, i, part_res[i]);
master_net.compute();
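The slide leaves out the step that stores a result into a `char` buffer before the gather. The sketch below shows one generic way to do such packing, assuming the payload can be viewed as a flat array of `float` values and that all nodes share the same byte order. Both are assumptions for illustration; the slides do not say how DAAL partial results are serialized, and `pack` and `unpack` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Pack a float array into a byte buffer: element count first, then raw values.
std::vector<char> pack(const std::vector<float>& values) {
    const std::size_t n = values.size();
    std::vector<char> buf(sizeof(n) + n * sizeof(float));
    std::memcpy(buf.data(), &n, sizeof(n));
    if (n > 0)
        std::memcpy(buf.data() + sizeof(n), values.data(), n * sizeof(float));
    return buf;
}

// Rebuild the float array from a buffer produced by pack().
std::vector<float> unpack(const std::vector<char>& buf) {
    assert(buf.size() >= sizeof(std::size_t));
    std::size_t n = 0;
    std::memcpy(&n, buf.data(), sizeof(n));
    assert(buf.size() == sizeof(n) + n * sizeof(float));
    std::vector<float> values(n);
    if (n > 0)
        std::memcpy(values.data(), buf.data() + sizeof(n), n * sizeof(float));
    return values;
}
```

Storing the element count in the buffer lets the receiving side validate the size before reconstructing the array. A gather of such buffers works most simply when every node sends the same number of bytes.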
Example Distributed Image Processing (Part 2)
// ... Continuing on the master compute ... //

// Extract the weights/biases from the master net
training::ModelPtr wbModel = master_net.getPartialResult()
    ->get(training::resultFromMaster)
    ->get(training::model);
NumericTablePtr wb = wbModel->getWeightsAndBiases();

// ... Code to store weights/bias into a buffer (char*) ... //

// Broadcast the weights/biases to all nodes
MPI_Bcast(.....);

// ... Code to reconstruct the weights/bias from buffer ... //

// Update the weights on the local node
local_net.input.get(training::inputModel)->setWeightsAndBiases(wb);
Parallel Efficiency
[Figure: line plot for "Distributed Lenet" compared with "Linear Scaling (Theoretical)". Vertical axis labeled Parallel Efficiency, 0 to 4.5. Horizontal axis: Number of Nodes, 1 to 4. Point labels: 93%, 91%, 87%.]
Further performance optimizations and model parallelism are coming soon...
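For reference, parallel efficiency is commonly defined as the measured speedup divided by the number of nodes. The sketch below states that definition and its inverse; the function names are made up for this example, and the mapping of the plot labels to node counts used in the note that follows is an assumption.

```cpp
#include <cassert>

// Parallel efficiency: measured speedup (t1 / tn) divided by the ideal,
// linear speedup, which equals the number of nodes.
double parallel_efficiency(double t1, double tn, int nodes) {
    assert(t1 > 0.0 && tn > 0.0 && nodes > 0);
    return (t1 / tn) / static_cast<double>(nodes);
}

// Inverse relation: the speedup implied by a given efficiency.
double implied_speedup(double efficiency, int nodes) {
    assert(efficiency > 0.0 && nodes > 0);
    return efficiency * static_cast<double>(nodes);
}
```

If the labels 93%, 91%, and 87% belong to 2, 3, and 4 nodes in that order, the implied speedups are about 1.86, 2.73, and 3.48, which fits a vertical axis that runs to 4.5.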
§7. Final Words
Colfax Research
http://colfaxresearch.com/
Thank you for your Attention!
Join us at Booth #2407 at SC16!