Software AI Accelerators
The Next Frontier
Software for AI Optimization Summit
Wei Li
VP & GM Machine Learning Performance
Intel Corporation
HARDWARE AI ACCELERATORS
HW Acceleration
SOFTWARE AI ACCELERATORS
Up to 10 - 100x
HW Acceleration With SW Acceleration
Photo Source: NASA
AI HARDWARE SPECTRUM
General purpose to purpose built: CPU, GPU, Accelerators
UNSCALABLE TO SCALABLE SOFTWARE
[Diagram, unscalable: Services & Solutions and Applications sit on top of a separate stack per device (CPU, GPU, Accelerator [1] … Accelerator [N]). Each stack repeats the same layers: Middleware, Frameworks and Runtimes; Low Level Libraries; Virtualization / Orchestration; OS; Drivers; FW IP & BIOS.]
[Diagram, scalable: Services & Solutions and Applications sit on a single layer of Middleware, Frameworks and Runtimes that spans CPU, GPU and Accelerators.]
AI SOFTWARE STACK
Layers, top to bottom:
▪ Data Scientists & Developers
▪ AI/Analytics Tools, Toolkits, Verticals
▪ Deep Learning, Machine Learning, Big Data Frameworks
▪ Libraries & Compilers
▪ HW: CPU, GPU, Accelerators
Components named on the slide: Intel® LPOT (Low Precision Optimization Tool), Analytics Zoo, Intel® oneAPI AI Analytics Toolkit, SigOpt, PaddlePaddle, TensorFlow, Python/Numba, TVM, PyTorch, MXNet, Spark SQL + ML/DL scale-out, Modin, NumPy, XGBoost, Scikit-learn, Pandas, OpenVINO
KERNEL OPTIMIZATION EXAMPLE
Optimizations: vectorization, data reuse, parallelization
Optimized convolution in oneDNN
A simple program is good, but may be slow
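As a rough illustration of the slide's point, the sketch below puts a direct 2D convolution written as plain nested loops next to a vectorized NumPy version. It is not oneDNN's implementation. The function names `conv2d_naive` and `conv2d_vectorized` are hypothetical, and the vectorized form only demonstrates vectorization and data reuse, not parallelization.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_naive(x, w):
    """'Valid' 2D cross-correlation, one scalar multiply-add at a time."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += x[i + di, j + dj] * w[di, dj]
            out[i, j] = acc
    return out


def conv2d_vectorized(x, w):
    """Same result, expressed as one contraction over all input windows."""
    # windows[i, j] is a view of the kh x kw patch at (i, j); no data is copied,
    # so overlapping patches reuse the same underlying memory.
    windows = sliding_window_view(x, w.shape)
    return np.einsum("ijkl,kl->ij", windows, w)


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
w = rng.standard_normal((3, 3))
assert np.allclose(conv2d_naive(x, w), conv2d_vectorized(x, w))
```

The simple version is easy to read and check, which makes it a useful reference when validating a faster kernel.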
GRAPH OPTIMIZATION EXAMPLE
[Diagram: Baseline graph built from Conv1x1, Conv3x3, BatchNorm, ReLU and Sum nodes.]
[Diagram: INT8 Optimized Model (generated by Intel Low Precision Optimization Tool), reached in three steps: BN Folding, Conv + ReLU, Conv + Sum. Rewritten nodes are marked with primes, e.g. Conv1x1’, Conv3x3’’, Conv1x1’’’, Sum’.]
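The BN Folding step can be sketched with NumPy as follows. This is a minimal illustration, not the Intel Low Precision Optimization Tool's code. It assumes inference-time BatchNorm with fixed statistics, and it models a 1x1 convolution as a matrix multiply over channels. All function names and the `eps` default are hypothetical.

```python
import numpy as np


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weight', bias') such that conv'(x) == batchnorm(conv(x))."""
    scale = gamma / np.sqrt(var + eps)        # one factor per output channel
    folded_weight = weight * scale[:, None]   # scale each output-channel row
    folded_bias = (bias - mean) * scale + beta
    return folded_weight, folded_bias


def conv1x1(x, weight, bias):
    # x: (C_in, pixels), weight: (C_out, C_in), bias: (C_out,)
    return weight @ x + bias[:, None]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return gamma[:, None] * (y - mean[:, None]) / np.sqrt(var[:, None] + eps) + beta[:, None]


rng = np.random.default_rng(0)
c_in, c_out, pixels = 8, 4, 16
x = rng.standard_normal((c_in, pixels))
weight = rng.standard_normal((c_out, c_in))
bias = rng.standard_normal(c_out)
gamma, beta = rng.standard_normal(c_out), rng.standard_normal(c_out)
mean, var = rng.standard_normal(c_out), rng.uniform(0.5, 2.0, c_out)

reference = batchnorm(conv1x1(x, weight, bias), gamma, beta, mean, var)
w_folded, b_folded = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, w_folded, b_folded)
assert np.allclose(reference, fused)
```

After folding, the BatchNorm node disappears from the graph and the convolution does the same work with adjusted parameters.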
[Diagram: 64 input pairs A0 … A63 and B0 … B63 accumulate into 16 outputs C0 … C15, four products per output:]
C0 = A0*B0 + A1*B1 + A2*B2 + A3*B3 + C0
…
C15 = A60*B60 + A61*B61 + A62*B62 + A63*B63 + C15
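The accumulation pattern in this diagram can be written out in NumPy as below. The element types (8-bit inputs, 32-bit accumulators) are an assumption, chosen to match the INT8 model on the same slide. The function name `dot4_accumulate` is hypothetical.

```python
import numpy as np


def dot4_accumulate(a, b, c):
    """Return c[i] + a[4i]*b[4i] + ... + a[4i+3]*b[4i+3] for i in 0..15."""
    assert a.shape == b.shape == (64,) and c.shape == (16,)
    # Widen before multiplying so that 8-bit products cannot overflow.
    products = a.astype(np.int32) * b.astype(np.int32)
    # Group the 64 products into 16 rows of 4 and sum each row.
    return c + products.reshape(16, 4).sum(axis=1, dtype=np.int32)


a = np.arange(64, dtype=np.int8)
b = np.ones(64, dtype=np.int8)
c = np.zeros(16, dtype=np.int32)
result = dot4_accumulate(a, b, c)
# result[0] is 0 + 1 + 2 + 3 = 6 and result[15] is 60 + 61 + 62 + 63 = 246
```

Each output consumes four narrow products, so one wide operation updates all 16 accumulators at once.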
Intel Optimization for TENSORFLOW
IMMEDIATE PERFORMANCE BENEFITS
Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots/ 64GB/3200) total DDR4 memory, ucode 0xd000280, HT on, Turbo on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic1, Intel 900GB SSD OS Drive; ResNet50 v1.5, FP32/INT8, BS=128, https://github.com/IntelAI/models/blob/master/benchmarks/image_recognition/tensorflow/resnet50v1_5/README.md; SSD-MobileNetv1, FP32/INT8, BS=448, https://github.com/IntelAI/models/blob/master/benchmarks/object_detection/tensorflow/ssd-mobilenet/README.md. Software: TensorFlow 2.4.0 for FP32 & Intel-Tensorflow (icx-base) for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.
Intel Optimization for PYTORCH
IMMEDIATE PERFORMANCE BENEFITS
Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots/ 64GB/3200) total DDR4 memory, ucode 0xd000280, HT on, Turbo on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic1, Intel 900GB SSD OS Drive; ResNet50 v1.5, FP32/INT8, BS=128, https://github.com/IntelAI/models/blob/icx-launch-public/quickstart/ipex-bkc/resnet50-icx/inference; DLRM, FP32/INT8, BS=16, https://github.com/IntelAI/models/blob/icx-launch-public/quickstart/ipex-bkc/dlrm-icx/inference/fp32/README.md. Software: PyTorch v1.5 w/o DNNL build for FP32 & PyTorch v1.5 + IPEX (icx) for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.
Intel Optimization for MXNET
IMMEDIATE PERFORMANCE BENEFITS
Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots/ 64GB/3200) total DDR4 memory, ucode 0xd000280, HT on, Turbo on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic1, Intel 900GB SSD OS Drive; ResNet50 v1, FP32/INT8, BS=128, https://github.com/apache/incubator-mxnet/blob/v2.0.0.alpha/python/mxnet/gluon/model_zoo/vision/resnet.py; MobileNetv2, FP32/INT8, BS=128, https://github.com/apache/incubator-mxnet/blob/v2.0.0.alpha/python/mxnet/gluon/model_zoo/vision/mobilenet.py. Software: MXNet 2.0.0.alpha w/o DNNL build for FP32 & MXNet 2.0.0.alpha for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.
Intel Extension for Scikit-learn
Intel Xeon Platinum 8276L CPU @ 2.20 GHz, 2 sockets, 28 cores per socket; For workloads and configurations visit www.Intel.com/PerformanceIndex.
Details: https://medium.com/intel-analytics-software/accelerate-your-scikit-learn-applications-a06cacf44912
PERFORMANCE IN KAGGLE COMPETITIONS
Kaggle challenge | Domain | Algorithm(s) | Stock E2E Time (minutes) | Intel Extension for Scikit-learn E2E Time (minutes) | Speed up
KDD Cup 1999 | Computer Networks | kNN | 282 | 1.24 | 227.4x
Credit Card Default | Finance | SVC | 11.9 | 0.2 | 59.5x
Digit Recognizer (KNN) | Image Classification | SVC | 84.32 | 1.47 | 57.5x
Melanoma Identification | Image Classification | kNN | 99.89 | 2.08 | 48x
Digit Recognizer (SVM) | Image Classification | PCA, SVC | 125.5 | 4.92 | 25.5x
What's cooking? | Natural Language Processing | SVC, XGBoost | 35.8 | 2.66 | 13.5x
Real or Not? Disaster Tweets | Natural Language Processing | SVC | 37.8 | 4.27 | 8.9x
Home Credit Default | Finance | Random Forest | 2.9 | 1.44 | 2x
Intel Xeon Gold 5218 @ 2.3 GHz (2nd generation Intel Xeon Scalable processors): 2 sockets, 16 cores per socket, HT:off, Turbo:off. For workloads and configurations visit www.Intel.com/PerformanceIndex.
Details: https://medium.com/intel-analytics-software/accelerate-kaggle-challenges-using-intel-ai-analytics-toolkit-beb148f66d5a
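The Speed up column is the ratio of the two end-to-end times, and a short script can recompute it as a sanity check. The numbers below are copied from the table. The helper name `speedup` is hypothetical.

```python
# (challenge, stock minutes, optimized minutes, reported speed-up)
rows = [
    ("KDD Cup 1999", 282, 1.24, 227.4),
    ("Credit Card Default", 11.9, 0.2, 59.5),
    ("Digit Recognizer (KNN)", 84.32, 1.47, 57.5),
    ("Melanoma Identification", 99.89, 2.08, 48),
    ("Digit Recognizer (SVM)", 125.5, 4.92, 25.5),
    ("What's cooking?", 35.8, 2.66, 13.5),
    ("Real or Not? Disaster Tweets", 37.8, 4.27, 8.9),
    ("Home Credit Default", 2.9, 1.44, 2),
]


def speedup(stock_minutes, optimized_minutes):
    """Speed-up is baseline time divided by optimized time."""
    return stock_minutes / optimized_minutes


for name, stock, optimized, reported in rows:
    print(f"{name}: computed {speedup(stock, optimized):.1f}x, reported {reported}x")
```

The recomputed ratios land within about 0.2x of the reported column. The small gaps are presumably due to the times having been rounded for display.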
GRAPH ANALYTICS WITH oneDAL
Triangle Counting Algorithm
[Chart: speed-up by data set on a log scale from 1 to 1000; V = vertices, E = edges; speed-up due to relabeling.]
Speed-up values shown on the chart: 1.38, 1.67, 1.74, 1.82, 2.98, 8.02, 166.1
Data sets shown on the chart: Enron (V: 0.03M, E: 0.4M); Pokec (V: 1.6M, E: 30.6M); Google (V: 0.9M, E: 5.1M); Indochina-2004 (V: 7.4M, E: 151M); Wikipedia (V: 12.1M, E: 378M); Twitter (V: 61M, E: 1202M); Web (V: 50M, E: 1810M)
Intel Xeon Platinum 8280 CPU @ 2.70 GHz, 2x28 cores, HT: on; For workloads and configurations visit www.Intel.com/PerformanceIndex.
Data sets: https://github.com/sbeamer/gapbs | https://snap.Stanford.edu/data
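For readers unfamiliar with the technique, here is a minimal pure-Python sketch of triangle counting with degree-based relabeling. It is not oneDAL's implementation. The function name `count_triangles` and the choice of ascending-degree ordering are assumptions made for illustration.

```python
def count_triangles(edges, relabel=True):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    if relabel:
        # New label = position in ascending-degree order.
        order = sorted(adj, key=lambda vertex: len(adj[vertex]))
    else:
        order = sorted(adj)
    rank = {vertex: i for i, vertex in enumerate(order)}

    # Keep only neighbors with a higher rank, so each triangle is found once.
    forward = {v: {n for n in adj[v] if rank[n] > rank[v]} for v in adj}

    total = 0
    for u in adj:
        for v in forward[u]:
            total += len(forward[u] & forward[v])
    return total


# The complete graph on 4 vertices contains 4 triangles.
k4 = [(a, b) for a in range(4) for b in range(a + 1, 4)]
assert count_triangles(k4) == 4
```

With degree ordering, high-degree vertices end up with small forward sets, which limits the set-intersection work on skewed graphs. That is one plausible reason relabeling helps most on the largest data sets.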
E2E WORKLOAD PERFORMANCE
[Charts: CENSUS phase-wise % breakdown (Readcsv, ETL, Train Test Split, ML) and CENSUS performance improvement with hyperparameter optimizations; PLAsTiCC phase-wise % breakdown (Readcsv, ETL, ML) and PLAsTiCC performance improvement with hyperparameter optimizations. The speed-up charts compare Unoptimized, Software Optimized, and Optimized hyperparameters per phase and for Total Time; higher is better. Labeled speed-ups: 23x and 29x.]
Details: https://medium.com/intel-analytics-software/performance-optimizations-for-end-to-end-ai-pipelines-231e0966505a
Intel® Xeon Platinum 8280L @ 28 cores; For workloads and configurations visit www.Intel.com/PerformanceIndex.
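A phase-wise breakdown matters because the total speed-up depends on how much of the baseline time each phase accounts for. The sketch below applies the standard Amdahl-style formula. The time shares and per-phase speed-ups are hypothetical numbers, not values read from the charts, and the function name `overall_speedup` is also hypothetical.

```python
def overall_speedup(time_share, phase_speedup):
    """Total speed-up from each phase's share of baseline time and its own speed-up."""
    assert abs(sum(time_share.values()) - 1.0) < 1e-9
    # New total time, as a fraction of the baseline total.
    new_time = sum(share / phase_speedup[phase] for phase, share in time_share.items())
    return 1.0 / new_time


# Hypothetical pipeline: shares of baseline time and per-phase speed-ups.
share = {"readcsv": 0.2, "etl": 0.3, "ml": 0.5}
speed = {"readcsv": 4.0, "etl": 10.0, "ml": 2.0}
total = overall_speedup(share, speed)
print(f"overall speed-up: {total:.2f}x")
```

In this made-up case the ML phase dominates the remaining time even though ETL sped up 10x, so the total stays close to 3x. Optimizing every phase of the pipeline is what makes a large end-to-end gain possible.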
AI APPLICATIONS FROM PARTNERSHIPS
Athlete Training; Telecom Network Quality; Drug Discovery
SUMMARY AND CALL-TO-ACTION
Software AI Accelerators can deliver orders-of-magnitude performance gains
Even more potential for the AI software community
▪ Create compiler technologies to automate kernel optimizations
▪ Increase parallelism to achieve higher compute utilization
▪ Optimize for memory bandwidth, memory size, NUMA
▪ Scale to large distributed compute
Find more at: ai.intel.com
NOTICES & DISCLAIMERS
▪ Results have been estimated or simulated.
▪ Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
▪ Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
▪ Your costs and results may vary.
▪ Intel technologies may require enabled hardware, software or service activation.
▪ All product plans and roadmaps are subject to change without notice.
▪ Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies.
▪ © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Software AI Accelerators: The Next Frontier | Software for AI Optimization Summit 2021 Keynote
