Software AI Accelerators: The Next Frontier | Software for AI Optimization Summit 2021 Keynote

Software AI Accelerators
The Next Frontier
Software for AI Optimization Summit
Wei Li
VP & GM Machine Learning Performance
Intel Corporation

HARDWARE AI ACCELERATORS
HW Acceleration

SOFTWARE AI ACCELERATORS
HW Acceleration
With SW Acceleration: Up to 10 - 100x
Photo Source: NASA

AI HARDWARE SPECTRUM
GENERAL PURPOSE to PURPOSE BUILT: CPU, GPU, ACCELERATORS

UNSCALABLE TO SCALABLE SOFTWARE
[Figure: unscalable vs. scalable software stacks]
Unscalable: Services & Solutions and Applications sit on a separate stack per device (CPU, GPU, ACCELERATOR [1] … ACCELERATOR [N]). Each stack repeats:
  Middleware, Frameworks and Runtimes
  Low Level Libraries
  Virtualization / Orchestration
  OS
  Drivers
  FW IP & BIOS
Scalable: Services & Solutions, Applications, and a single layer of Middleware, Frameworks and Runtimes span CPU, GPU, and ACCELERATORS.

AI SOFTWARE STACK
Layers:
  Data Scientists & Developers
  AI/Analytics Tools, Toolkits, Verticals
  Deep Learning, Machine Learning, Big Data Frameworks
  Libraries & Compilers
  HW: CPU, GPU, ACCELERATORS
Components shown: Intel® LPOT (Low precision optimization tool), Analytics Zoo, Intel® oneAPI AI Analytics Toolkit, SigOpt, PaddlePaddle, TensorFlow, Python/Numba, TVM, PyTorch, MXNet, Spark SQL + ML/DL scaleout, Modin, NumPy, XGBoost, Scikit-Learn, Pandas, OpenVINO

KERNEL OPTIMIZATION EXAMPLE
Optimizations: vectorization, data reuse, parallelization
Optimized convolution in oneDNN
A simple program is good, but may be slow
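To make the contrast concrete, here is a minimal sketch comparing a simple nested-loop 2D convolution with a version that vectorizes the inner work using NumPy. It is not oneDNN's implementation and covers only the vectorization idea, not data reuse (blocking) or parallelization. The function names and the 32x32 / 3x3 sizes are hypothetical.

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Direct 2D convolution (cross-correlation, no padding): simple but slow."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow), dtype=np.float64)
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy, x + dx] * kernel[dy, dx]
            out[y, x] = acc
    return out

def conv2d_vectorized(image, kernel):
    """Same result; loops only over kernel taps, each tap is one whole-array multiply-add."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            # A shifted view of the image is scaled by one kernel weight.
            out += kernel[dy, dx] * image[dy:dy + oh, dx:dx + ow]
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((3, 3))
same = np.allclose(conv2d_naive(image, kernel), conv2d_vectorized(image, kernel))
```

Both functions compute the same output; the second moves the per-pixel loops into array operations, which is the kind of restructuring an optimized kernel library does at a much lower level.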

GRAPH OPTIMIZATION EXAMPLE
Baseline graph nodes: Sum, ReLU, Conv1x1, BatchNorm, ReLU, Conv3x3, BatchNorm, ReLU, Conv1x1, ReLU, Sum, ReLU, Conv1x1, BatchNorm
INT8 Optimized Model (generated by Intel Low Precision Optimization Tool)
Fusion steps: BN Folding, Conv + ReLU, Conv + Sum
After BN Folding: Sum, ReLU, Conv1x1’, ReLU, Conv3x3’, ReLU, Conv1x1’, Sum, ReLU, Conv1x1’
After Conv + ReLU: Sum’, Conv1x1’’, Conv3x3’’, Conv1x1’’, Sum’, Conv1x1’’
After Conv + Sum: Sum’, Conv1x1’’, Conv3x3’’, Conv1x1’’’, Conv1x1’’
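The BN Folding step can be sketched for a single pixel of a 1x1 convolution, where the convolution reduces to a matrix-vector product over channels. This is an illustration of the general technique, not the Low Precision Optimization Tool's code. All weights, statistics, and function names below are hypothetical.

```python
import math

def conv1x1(weights, bias, x):
    """1x1 convolution at one pixel: a matrix-vector product over channels."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BatchNorm with fixed statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def relu(y):
    return [max(0.0, v) for v in y]

def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weights', bias') so that conv1x1' equals batch_norm(conv1x1)."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    folded_w = [[w * k for w in row] for row, k in zip(weights, scale)]
    folded_b = [(b - m) * k + be for b, m, k, be in zip(bias, mean, scale, beta)]
    return folded_w, folded_b

# Hypothetical layer with 3 input channels and 2 output channels.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -1.0]

# Baseline: three graph nodes (Conv1x1, BatchNorm, ReLU).
baseline = relu(batch_norm(conv1x1(W, b, x), gamma, beta, mean, var))
# Folded: BatchNorm disappears into the convolution's weights and bias.
W_f, b_f = fold_bn(W, b, gamma, beta, mean, var)
folded = relu(conv1x1(W_f, b_f, x))
```

Because BatchNorm at inference time is a per-channel affine transform, folding changes only the stored parameters, so the two results agree up to floating-point rounding.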
Dot-product accumulation over inputs A0 … A63 and B0 … B63 into accumulators C0 … C15:
C0 = A0*B0 + A1*B1 + A2*B2 + A3*B3 + C0
…
C15 = A60*B60 + A61*B61 + A62*B62 + A63*B63 + C15
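The accumulation pattern on this slide (64 inputs per operand, groups of four products added into 16 accumulators) can be modeled in a few lines. The layout resembles an 8-bit dot-product instruction with 32-bit accumulators; that reading is an assumption, since the slide does not name the instruction. The function name and input values are hypothetical.

```python
def dot4_accumulate(a, b, c):
    """For each lane i: c[i] + a[4i]*b[4i] + a[4i+1]*b[4i+1] + a[4i+2]*b[4i+2] + a[4i+3]*b[4i+3]."""
    assert len(a) == len(b) == 4 * len(c)
    return [
        c[i] + sum(a[4 * i + k] * b[4 * i + k] for k in range(4))
        for i in range(len(c))
    ]

# 64 inputs per operand and 16 accumulators, as on the slide.
A = list(range(64))                        # hypothetical values 0..63
B = [2 if j % 2 == 0 else -2 for j in range(64)]  # hypothetical values +2, -2, ...
C = [100] * 16                             # hypothetical starting accumulators
C_out = dot4_accumulate(A, B, C)
```

Each output lane consumes four products at once, so one such step does the work of four separate multiply-adds per accumulator.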

Intel Optimization for TENSORFLOW
IMMEDIATE PERFORMANCE BENEFITS
Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots/ 64GB/3200) total DDR4 memory, ucode 0xd000280, HT on, Turbo on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic1, Intel 900GB SSD OS Drive;
ResNet50 v1.5, FP32/INT8, BS=128, https://github.com/IntelAI/models/blob/master/benchmarks/image_recognition/tensorflow/resnet50v1_5/README.md; SSD-MobileNetv1, FP32/INT8, BS=448, https://github.com/IntelAI/models/blob/master/benchmarks/object_detection/tensorflow/ssd-mobilenet/README.md. Software: TensorFlow 2.4.0 for FP32 & Intel-Tensorflow (icx-base) for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.

Intel Optimization for PYTORCH
ResNet50 v1.5, FP32/INT8, BS=128, https://github.com/IntelAI/models/blob/icx-launch-public/quickstart/ipex-bkc/resnet50-icx/inference; DLRM, FP32/INT8, BS=16, https://github.com/IntelAI/models/blob/icx-launch-public/quickstart/ipex-bkc/dlrm-icx/inference/fp32/README.md. Software: PyTorch v1.5 w/o DNNL build for FP32 & PyTorch v1.5 + IPEX (icx) for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.

Intel Optimization for MXNET
ResNet50 v1, FP32/INT8, BS=128, https://github.com/apache/incubator-mxnet/blob/v2.0.0.alpha/python/mxnet/gluon/model_zoo/vision/resnet.py; MobileNetv2, FP32/INT8, BS=128, https://github.com/apache/incubator-mxnet/blob/v2.0.0.alpha/python/mxnet/gluon/model_zoo/vision/mobilenet.py. Software: MXNet 2.0.0.alpha w/o DNNL build for FP32 & MXNet 2.0.0.alpha for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.

Intel Extension for Scikit-learn
Intel Xeon Platinum 8276L CPU @ 2.20 GHz, 2 sockets, 28 cores per socket; For workloads and configurations visit www.Intel.com/PerformanceIndex.
Details: https://medium.com/intel-analytics-software/accelerate-your-scikit-learn-applications-a06cacf44912
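For readers who want to try the extension, the usual pattern is to patch scikit-learn before importing any estimators. The sketch below wraps that in a helper that falls back to stock scikit-learn when the `sklearnex` package is not installed. The helper name `enable_sklearn_acceleration` is hypothetical; `patch_sklearn` is the package's entry point.

```python
def enable_sklearn_acceleration() -> bool:
    """Patch scikit-learn with Intel Extension for Scikit-learn if it is installed."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # Extension not available: stock scikit-learn will be used unchanged.
        return False
    patch_sklearn()
    return True

accelerated = enable_sklearn_acceleration()

# Import estimators only after patching so the accelerated versions are picked up,
# for example:
#   from sklearn.svm import SVC
#   from sklearn.neighbors import KNeighborsClassifier
```

The rest of an existing scikit-learn script stays the same, which is why the gains on this slide are described as coming from software alone.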

PERFORMANCE IN KAGGLE COMPETITIONS
Kaggle challenge | Domain | Algorithm(s) | Stock E2E Time (minutes) | Intel Extension for Scikit-learn E2E Time (minutes) | Speed up
KDD Cup 1999 | Computer Networks | kNN | 282 | 1.24 | 227.4x
Credit Card Default | Finance | SVC | 11.9 | 0.2 | 59.5x
Digit Recognizer (KNN) | Image Classification | SVC | 84.32 | 1.47 | 57.5x
Melanoma Identification | Image Classification | kNN | 99.89 | 2.08 | 48x
Digit Recognizer (SVM) | Image Classification | PCA, SVC | 125.5 | 4.92 | 25.5x
What's cooking? | Natural Language Processing | SVC, XGBoost | 35.8 | 2.66 | 13.5x
Real or Not? Disaster Tweets | Natural Language Processing | SVC | 37.8 | 4.27 | 8.9x
Home Credit Default | Finance | Random Forest | 2.9 | 1.44 | 2x
Intel Xeon Gold 5218 @ 2.3 GHz (2nd generation Intel Xeon Scalable processors): 2 sockets, 16 cores per socket, HT:off, Turbo:off. For workloads and configurations visit www.Intel.com/PerformanceIndex.
Details: https://medium.com/intel-analytics-software/accelerate-kaggle-challenges-using-intel-ai-analytics-toolkit-beb148f66d5a
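The Speed up column is the ratio of stock end-to-end time to the time with Intel Extension for Scikit-learn. A short sketch recomputing it from the slide's reported times follows; small differences from the slide's figures can come from rounding of the reported minutes.

```python
# (challenge, stock E2E minutes, Intel Extension for Scikit-learn E2E minutes), from the slide
rows = [
    ("KDD Cup 1999", 282, 1.24),
    ("Credit Card Default", 11.9, 0.2),
    ("Digit Recognizer (KNN)", 84.32, 1.47),
    ("Melanoma Identification", 99.89, 2.08),
    ("Digit Recognizer (SVM)", 125.5, 4.92),
    ("What's cooking?", 35.8, 2.66),
    ("Real or Not? Disaster Tweets", 37.8, 4.27),
    ("Home Credit Default", 2.9, 1.44),
]

# Speed up = stock time / accelerated time.
speedups = {name: stock / accelerated for name, stock, accelerated in rows}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```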

GRAPH ANALYTICS WITH oneDAL
Triangle Counting Algorithm
V = Vertices, E = Edges; speed up due to relabeling
[Chart: Speed Up (log scale, 1 to 1000) by Data Set. Values as extracted: 1.38, 1.67, 1.74, 1.82, 2.98, 8.02, 166.1]
Data Sets: Enron (V: 0.03M, E: 0.4M); Pokec (V: 1.6M, E: 30.6M); Google (V: 0.9M, E: 5.1M); Indochina-2004 (V: 7.4M, E: 151M); Wikipedia (V: 12.1M, E: 378M); Twitter (V: 61M, E: 1202M); Web (V: 50M, E: 1810M)
Intel Xeon Platinum 8280 CPU @ 2.70 GHz, 2x28 cores, HT: on; For workloads and configurations visit www.Intel.com/PerformanceIndex.
Data sets: https://github.com/sbeamer/gapbs | https://snap.stanford.edu/data
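One common way relabeling helps triangle counting is to rank vertices by degree and keep only edges that point from lower rank to higher rank, which shrinks the neighbor sets that must be intersected. The sketch below illustrates that general idea in plain Python. Whether oneDAL relabels by degree in this way is an assumption; the slide says only that the speed up is due to relabeling. The function name is hypothetical.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Relabel: rank vertices by (degree, id), lowest degree first.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}

    # Keep only neighbors with a higher rank, so every edge has one direction.
    forward = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    # Each triangle is found exactly once, from its two lowest-ranked vertices.
    return sum(len(forward[u] & forward[v]) for u in forward for v in forward[u])

# Complete graph on 4 vertices: every triple of vertices forms a triangle.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
triangles_k4 = count_triangles(k4)
```

With this orientation, high-degree vertices end up with short forward lists, which matters most on skewed graphs where a few vertices have very many neighbors.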

E2E WORKLOAD PERFORMANCE
[Chart: CENSUS Phase-wise % breakdown (Readcsv, ETL, Train Test Split, ML)]
[Chart: CENSUS Performance improvement with hyperparameter optimizations. Speed up for Readcsv, ETL, Train Test Split, ML, Total Time; series: Unoptimized, Software Optimized, Optimized hyperparameters]
[Chart: PLAsTiCC Phase-wise % breakdown (Readcsv, ETL, ML)]
[Chart: PLAsTiCC Performance improvement with hyperparameter optimizations. Speed up for Readcsv, ETL, ML, Total Time; series: Unoptimized, Software Optimized, Optimized hyperparameters]
Callouts: 23x, 29x. Higher is better.
Details: https://medium.com/intel-analytics-software/performance-optimizations-for-end-to-end-ai-pipelines-231e0966505a
Intel® Xeon Platinum 8280L @ 28 cores; For workloads and configurations visit www.Intel.com/PerformanceIndex.
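The relationship between a phase-wise time breakdown and the Total Time speed up can be sketched with a simple model: each phase's share of baseline time is divided by that phase's speed up, and the overall speed up is the inverse of the remaining time. The numbers below are hypothetical and are not read from the slide's charts; the function name is also hypothetical.

```python
def total_speedup(time_fractions, phase_speedups):
    """Overall speed up from each phase's share of baseline time and its own speed up."""
    assert abs(sum(time_fractions.values()) - 1.0) < 1e-9
    # Time remaining after optimization, as a fraction of the baseline total.
    new_time = sum(time_fractions[p] / phase_speedups[p] for p in time_fractions)
    return 1.0 / new_time

# Hypothetical values for illustration only.
fractions = {"Readcsv": 0.10, "ETL": 0.30, "Train Test Split": 0.10, "ML": 0.50}
speedups = {"Readcsv": 5.0, "ETL": 10.0, "Train Test Split": 2.0, "ML": 50.0}
overall = total_speedup(fractions, speedups)
```

In this example the ML phase is sped up 50x, yet the total comes out near 9x because the less-optimized phases dominate the remaining time. That is why the slide reports every pipeline phase and not only the ML step.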

AI APPLICATIONS FROM PARTNERSHIPS
Athlete Training, Telecom Network Quality, Drug Discovery

SUMMARY AND CALL-TO-ACTION
Software AI Accelerators can deliver orders-of-magnitude performance gains
Even more potential for the AI software community
▪ Create compiler technologies to automate kernel optimizations
▪ Increase parallelism to achieve higher compute utilization
▪ Optimize for memory bandwidth, memory size, NUMA
▪ Scale to large distributed compute
Find more at: ai.intel.com

NOTICES & DISCLAIMERS
▪ Results have been estimated or simulated.
▪ Performance varies by use, configuration and other factors. Learn more at
www.Intel.com/PerformanceIndex.
▪ Performance results are based on testing as of dates shown in configurations and may not reflect
all publicly available updates. See backup for configuration details. No product or component
can be absolutely secure.
▪ Your costs and results may vary.
▪ Intel technologies may require enabled hardware, software or service activation.
▪ All product plans and roadmaps are subject to change without notice.
▪ Intel contributes to the development of benchmarks by participating in, sponsoring, and/or
contributing technical support to various benchmarking groups, including the BenchmarkXPRT
Development Community administered by Principled Technologies.
▪ © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel
Corporation or its subsidiaries. Other names and brands may be claimed as the property of
others.
