AI NEXTCon
Agenda
Applications
Framework
Optimization
Fundamental
Feature Ask
Optimize for Application
Make optimization re-usable for others
Junhua wang ai_next_con
Applications
Q: {is it legal for 17 year old to buy a car}
• Semantic similarity: Vector Representation, Nearest Neighbor Search in Semantic Space
[Figure: vectors such as (0.78, 0.8, 0.4, 0.3, 0.9,...) and (0.75, 0.6, 0.1, 0.7, 0.2,...)]
• Bag of Words Inverted Index Matching: L1: BM25F Ranking, then L2/L3/L4 ReRanking
[Figure: query terms (car, legal, buy, own, …) combined with AND / OR and matched against Posting 1, Posting 2, Posting 3, Posting 4, …]
Semantic search can help with recall issues, which account for nearly a third of relevance DSATs.
Query: {how many women voices in Switchboard telephone corpus}
Query term alteration and term matching cannot recall the good URLs.
A DL model captures the full context and builds semantic meaning into vectors. The query vector and the document vector are near each other in vector space.
Applications
Query: Where's the nearest fruit smoothies
Location: Omaha, Nebraska
Applications
Framework
Deep Learning Platform
DLIS Pluggable Runtime
Linux Container, Native Windows
Microsoft CNTK, TensorFlow (Windows), DeepCPU, TensorFlow (Linux), Caffe
Hardware Acceleration
CPU, GPU, FPGA
Self-Serve Portal
Model Development toolkit
Model Repository
Theano ...
Workloads
Web Text Speech Image Enterprise
DLVS Pluggable Runtime
HNSW K-D Tree Faiss
ANN Index Build
on Multi-Tenancy
FrontDoor
DLIS DLVS
• Customizable runtime
• Privacy and Compliance Certification
• In production globally





Millions of QPS, hundreds of models, hundreds of billions of vectors, 20+ regions
Framework
Optimal distribution to match model requirements to server fleet
Framework
[Figure: instances of Model1 through Model8 placed across Windows machines (SKU-1, SKU-2, SKU-3) and Linux machines (SKU-4, SKU-5)]
• Multiple model instances across multiple machines
• Multiple model instances share same machine
• Different Operating System in same bed
• Different machine SKU in same bed
• Different runtime: CNTK, TensorFlow Windows, DeepCPU, TensorFlow Linux
Query: coffee in Melbourne
Online Inferencing: query to Semantic Representation Vector
Batch Inferencing: Document 1, Document 2, ... to Vector Set
Similarity Search
Framework
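The similarity-search step on this slide can be sketched as an exhaustive scan. This is a minimal illustration only: the vectors, document names, and the function names `cosine_similarity` and `similarity_search` are hypothetical, and cosine similarity is an assumed metric (the slides do not name one). A production service would use an ANN index instead of scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vec, vector_set, k=2):
    """Return the k (doc_id, score) pairs most similar to the query (exhaustive scan)."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in vector_set.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical vectors; in the pipeline they would come from batch inferencing
# (documents) and online inferencing (query).
vector_set = {
    "Document 1": [0.9, 0.1, 0.3],
    "Document 2": [0.1, 0.8, 0.5],
    "Document 3": [0.8, 0.2, 0.4],
}
query_vec = [1.0, 0.0, 0.3]
top = similarity_search(query_vec, vector_set, k=2)
print(top)
```

The scan costs one dot product per stored vector, which is why the next slide turns to nearest neighbor indexes.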
Vector Recall by Nearest Neighbor Search
Hash query to this bucket, then search among points in bucket
NNG, HNSW, KD-tree, TP-tree
[Figure: points plotted against axes Semantic word 1, Semantic word 2, Semantic word 3]
Wang, Jingdong, and Shipeng Li. "Query-driven iterated neighborhood graph search for large scale indexing." Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012.
Framework
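The "hash query to this bucket, search among points in bucket" idea can be sketched with random-hyperplane hashing. This is an assumed hashing scheme chosen for illustration, not necessarily the one used in the service, and all function names and data below are hypothetical. The search is approximate: neighbors that fall in a different bucket are missed.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Random hyperplane normals; one hash bit per plane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def bucket_key(vec, hyperplanes):
    """Hash a vector to a bucket: the sign of its projection on each hyperplane."""
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in hyperplanes)

def build_index(vectors, hyperplanes):
    """Group document ids by bucket key."""
    index = {}
    for doc_id, vec in vectors.items():
        index.setdefault(bucket_key(vec, hyperplanes), []).append(doc_id)
    return index

def search(query, vectors, index, hyperplanes):
    """Hash the query to its bucket, then rank only the points in that bucket."""
    candidates = index.get(bucket_key(query, hyperplanes), [])
    def squared_distance(doc_id):
        return sum((q - v) ** 2 for q, v in zip(query, vectors[doc_id]))
    return sorted(candidates, key=squared_distance)

# Hypothetical data.
vectors = {
    "a": [1.0, 0.2, -0.3],
    "b": [0.9, 0.1, -0.2],
    "c": [-1.0, 0.5, 0.8],
}
planes = make_hyperplanes(num_planes=4, dim=3)
index = build_index(vectors, planes)
result = search(vectors["a"], vectors, index, planes)
print(result)
```

Graph and tree indexes such as HNSW or KD-tree trade this bucket lookup for a guided traversal, which usually gives better recall at the same cost.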
Optimization
Hardware +
Software
Acceleration
DeepCPU
BrainWave / FPGA
DeepGPU
RNN Serving Performance Challenges
Language Modeling
Machine Translation
Machine Reading Comprehension
Conversation Bot
Speech Recognition
…
Limited Parallelism
• Small batch size
• Sequential dependency
Limited Bandwidth
• Vector-matrix multiplication
• Low data reuse
[Figure: RNN unrolled over time steps t-1, t, t+1, with inputs X, states S, outputs O, and weight matrices W, U, V repeated at every step]
Optimization
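Why small batches and vector-matrix products are bandwidth-bound can be seen from a rough count of arithmetic per byte moved. The sketch below is a back-of-the-envelope model under stated assumptions: FP32 values, a square hidden-to-hidden weight matrix, and every weight read from memory once per step. The function name and the sizes are hypothetical.

```python
def arithmetic_intensity(hidden_dim, batch_size, bytes_per_value=4):
    """FLOPs per byte for a (hidden_dim x hidden_dim) weight times a (hidden_dim x batch_size) input."""
    # One multiply and one add per weight, per input column.
    flops = 2 * hidden_dim * hidden_dim * batch_size
    weight_bytes = hidden_dim * hidden_dim * bytes_per_value
    # Input read plus output write.
    io_bytes = 2 * hidden_dim * batch_size * bytes_per_value
    return flops / (weight_bytes + io_bytes)

for n in (1, 8, 64):
    print(n, round(arithmetic_intensity(1024, n), 2))
```

With a batch of one, each weight is used for a single multiply-add, so the ratio stays below 0.5 FLOPs per byte; larger batches reuse each weight and raise it.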
1. Matrix computation
2. Activation function
3. Operation Fusing
4. Affinity
5. Locality
6. Parallelism
7. Task scheduling
Collaborating with Yuxiong He, Minjia Zhang, Samyam Rajbhandari, Wenhan Wang,
Microsoft AI and Research.
Optimization
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
On a machine with 12 cores…
a) 1 core per operation, multiplications done in parallel
[Figure: cores (ticks at 6 and 12) vs. time; six multiplications, each on 1 core]
b) 12 cores per operation, multiplications done sequentially
[Figure: cores (ticks at 6 and 12) vs. time; six multiplications, each on 12 cores]
Issues noted: many idle cores, unbalanced load, poor speedup of intra-op parallelism
Optimization
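One GRU step from the equations on this slide can be written out to make the six matrix products visible. This is a plain-Python sketch for illustration, not DeepCPU code; the parameter dictionary `p` and the helper names are hypothetical.

```python
import math

def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h."""
    # Five of the six products are mutually independent and could run in parallel.
    wz, uz = matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev)
    wr, ur = matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev)
    wh = matvec(p["W_h"], x_t)
    z = [sigmoid(a + b + c) for a, b, c in zip(wz, uz, p["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in zip(wr, ur, p["b_r"])]
    # The sixth product needs r_t first, so it sits on the critical path.
    uh = matvec(p["U_h"], [ri * hi for ri, hi in zip(r, h_prev)])
    cand = [math.tanh(a + b + c) for a, b, c in zip(wh, uh, p["b_h"])]
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]

# Hypothetical all-zero parameters: input size 3, hidden size 2.
zeros_w = [[0.0] * 3 for _ in range(2)]
zeros_u = [[0.0] * 2 for _ in range(2)]
params = {"W_z": zeros_w, "W_r": zeros_w, "W_h": zeros_w,
          "U_z": zeros_u, "U_r": zeros_u, "U_h": zeros_u,
          "b_z": [0.0, 0.0], "b_r": [0.0, 0.0], "b_h": [0.0, 0.0]}
h_next = gru_step([1.0, 2.0, 3.0], [2.0, -4.0], params)
print(h_next)
```

With all-zero parameters both gates equal 0.5 and the candidate state is 0, so the step simply halves the previous state, which makes the update rule easy to check by hand.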
Optimization
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
On a machine with 12 cores…
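c) 4 cores per operation
[Figure: cores (ticks at 6 and 12) vs. time; six multiplications, each on 4 cores]
d) an optimized configuration, reducing latency
[Figure: cores (ticks at 6 and 12) vs. time; multiplications assigned differing core counts]
[Figure: a schedule with differing core counts per multiplication, labeled "Bad scheduling order"]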
✓ Workload size
✓ Parallelism efficiency
✓ Critical path
✓ Load balancing
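The trade-off between configurations a) through d) can be explored with a toy scheduler. Everything numeric here is an assumption made for illustration: each multiplication is given one unit of work, and running it on c cores is assumed to take 1 / c**0.5 time units (a made-up sublinear speedup, not a measurement). The operation names and the `latency` function are hypothetical; the only dependency modeled is that the product with U_h needs r_t.

```python
import heapq

# The six GRU multiplications; the last one needs r_t, hence both r-gate products.
DEPS = {"Uh_rh": {"Wr_x", "Ur_h"}}

def latency(order, cores_per_op, total_cores=12, alpha=0.5):
    """Simulate in-order scheduling; return the finish time of the last operation.

    ASSUMPTION: one multiplication takes 1 time unit on one core and
    1 / cores**alpha on several cores (hypothetical speedup model).
    """
    running = []        # min-heap of (finish_time, op, cores)
    finished = set()
    free, now = total_cores, 0.0
    for op in order:
        need = cores_per_op[op]
        # Wait for enough free cores and for all dependencies to complete.
        while need > free or not DEPS.get(op, set()) <= finished:
            now, done, released = heapq.heappop(running)
            finished.add(done)
            free += released
        heapq.heappush(running, (now + 1.0 / need ** alpha, op, need))
        free -= need
    return max(finish for finish, _, _ in running)

ops = ["Wz_x", "Uz_h", "Wr_x", "Ur_h", "Wh_x", "Uh_rh"]
a = latency(ops, dict.fromkeys(ops, 1))      # 1 core per operation
b = latency(ops, dict.fromkeys(ops, 12))     # 12 cores per operation
good = ["Wr_x", "Ur_h", "Wz_x", "Uz_h", "Wh_x", "Uh_rh"]
bad = ["Wz_x", "Uz_h", "Wh_x", "Wr_x", "Ur_h", "Uh_rh"]
c_good = latency(good, dict.fromkeys(ops, 4))
c_bad = latency(bad, dict.fromkeys(ops, 4))
print(a, b, c_good, c_bad)
```

Under this assumed model, starting the two r-gate products first lets the dependent product begin earlier, so the same 4-core allocation finishes sooner than with the bad ordering. That is the critical-path point in the checklist above.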
Optimization
Cache-Aware Partitioning
DL Scenarios | Original Latency | Latency Target | Optimized Latency | Latency reduction | Throughput improvement
Turing Prototype 2 | ~100ms | 10ms | 9ms | >10X | >10X
Turing Prototype 3 | ~107ms | 10ms | 4.1ms | >20X | >50X
Deep Query Document Similarity | 10~12ms for [query, 1 doc] x 33 docs | 6ms | 1.5ms for [query, 1 doc]; <6ms for [query, 33 docs] | >6X | >30X
Malta Click Features | 10ms for [query, 1 passage] x 150 passages | 5ms | <1ms for [query, 1 passage]; <5ms for [query, 150 passages] | >10X | >100X
Ads seq2seq model for query rewriting | 51ms | 5ms | 4ms | >10X | >3X
AGI Encoder V2 | ~29ms | 10ms | 5.4ms | 5X | 5X
RNet (InfoBot + Bing) | ~45ms for 1 [query, passage] | 10ms | 4.0ms for 1 [query, passage]; <8.5ms for 20 [query, passage] | 11X | >100X
Bing query tagging | 9~16ms on CNTK | 3ms | 0.95ms | 10X | >10X
WideDeepRight Model (TP3 L1) | ~25ms for [query, 1 title url] | 7ms for a batch size of 33 | 5.4ms for [query, 33 title url] | 10X | >100X
TP3 L2 Classifier | 60ms | 3ms | 3ms | 20X | 20X
TP3 L1 | 8ms | 3ms | 1ms | 8X | 8X
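The latency-reduction column appears to be the original latency divided by the optimized latency; the short check below recomputes it for the rows that give single figures. The ratio interpretation is an assumption, and the original latencies marked "~" are approximate, so the results are indicative only.

```python
# (scenario, original latency in ms, optimized latency in ms), copied from the table above.
rows = [
    ("Turing Prototype 2", 100, 9),
    ("Turing Prototype 3", 107, 4.1),
    ("Ads seq2seq model for query rewriting", 51, 4),
    ("AGI Encoder V2", 29, 5.4),
    ("TP3 L2 Classifier", 60, 3),
    ("TP3 L1", 8, 1),
]

def latency_reduction(original_ms, optimized_ms):
    """Speedup factor: how many times lower the optimized latency is."""
    return original_ms / optimized_ms

for name, original, optimized in rows:
    print(f"{name}: {latency_reduction(original, optimized):.1f}X")
```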
Optimization
ONNX/WinML
Optimization
Original TensorFlow model
TensorFlow model with DeepCPU operator
Optimization
Pretrained DNN Model in TF/CNTK/ONNX, etc.
Scalable DNN Hardware Microservice
BrainWave Soft DPU: Instr Decoder & Control, Neural FU
Network switches, FPGAs
[Figure: network diagram with node labels F, L0, L1]
Optimization
Optimization
Production Bing DNN Model 1
| CPU only | Brainwave accelerated | Improvement
Model Details | GRU 128X200 (X2) + W2Vec | LSTM 500X200 (x8) + W2Vec | Brainwave accelerated model is > 10X larger and > 10X lower latency
End-to-End latency per Batch 1 request at 95% | 9ms | 0.85ms |
Production Bing DNN Model 2
| CPU only | Brainwave accelerated | Improvement
Model Details | 1D CNN + W2Vec (RNNs removed) | 1D CNN + W2Vec + GRU 500x500 (x4) | Brainwave accelerated model is > 10X larger and 3X lower latency
End-to-End latency per Batch 1 request at 95% | 15ms | 5ms |
Optimization




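Layer GEMM: $[W_i; W_f; W_o; W_c]$ of size $(G \cdot H) \times S$, times $x_t$ of size $S \times N$, gives a $(G \cdot H) \times N$ result
Recurrent GEMM: $[U_i; U_f; U_o; U_c]$ of size $(G \cdot H) \times H$, times $h_{t-1}$ of size $H \times N$, gives a $(G \cdot H) \times N$ result
S = synthetic_dim
H = hidden_dim
N = batch_size
G = num_gates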
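Stacking the per-gate weight matrices turns G separate products into a single GEMM with a G*H row dimension, as the shapes above suggest. The sketch below checks that equivalence on tiny matrices; the sizes and values are hypothetical and the `matmul` helper is plain Python written for this example.

```python
def matmul(a, b):
    """Plain (rows x inner) times (inner x cols) matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical sizes: S = 2, H = 2, N = 1, G = 4 gates.
W_i = [[1, 0], [0, 1]]
W_f = [[2, 0], [0, 2]]
W_o = [[0, 1], [1, 0]]
W_c = [[1, 1], [1, -1]]
x_t = [[3], [5]]   # S x N

# Four separate (H x S) products ...
separate = [matmul(w, x_t) for w in (W_i, W_f, W_o, W_c)]
# ... versus one (G*H x S) product on the vertically stacked weights.
stacked = W_i + W_f + W_o + W_c
fused = matmul(stacked, x_t)   # G*H x N
print(fused)
```

The fused result is the four gate pre-activations laid end to end, so one larger GEMM call can replace four small ones.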
Optimization
Optimization
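GRU P4 - FP32, batch_size = 1
[Figure: Weights ($(G \cdot H) \times H$) held in the RF; $h_{t-1}$ ($H \times N$) and the result ($(G \cdot H) \times N$) in Shared Memory; also labeled: Other variables]
H | N | RF Usage | SMEM
100 | 1 | $\frac{3 \cdot 100 \cdot 100 \cdot 4}{256 \cdot 1024} \approx 46\%$ | $\frac{(100 + 3 \cdot 100) \cdot 4}{96 \cdot 1024} \approx 2\%$
20 | 1 | $\frac{3 \cdot 20 \cdot 20 \cdot 4}{256 \cdot 1024} \approx 2\%$ | $\frac{(20 + 3 \cdot 20) \cdot 4}{96 \cdot 1024} \ll 1\%$
*Can add more work in this instance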
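The usage figures on this slide follow from simple byte counts, which the sketch below reproduces. The constants are read off the slide's fractions (256*1024 and 96*1024 bytes, factor 3 for the GRU gates, 4 bytes per FP32 value); scaling the shared-memory term by batch size is an assumption, since the slide only shows batch size 1. The function names are hypothetical.

```python
RF_BYTES = 256 * 1024     # denominator used on the slide for RF usage
SMEM_BYTES = 96 * 1024    # denominator used on the slide for SMEM usage
FP32_BYTES = 4
GRU_GATES = 3             # G for a GRU, matching the factor 3 on the slide

def rf_usage(hidden_dim):
    """Fraction of the register file taken by the (G*H x H) weights."""
    return GRU_GATES * hidden_dim * hidden_dim * FP32_BYTES / RF_BYTES

def smem_usage(hidden_dim, batch_size=1):
    """Fraction of shared memory taken by h_{t-1} and the result.

    ASSUMPTION: both terms scale linearly with batch_size.
    """
    values = (hidden_dim + GRU_GATES * hidden_dim) * batch_size
    return values * FP32_BYTES / SMEM_BYTES

for h in (100, 20):
    print(h, f"RF {rf_usage(h):.1%}", f"SMEM {smem_usage(h):.1%}")
```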
Summary
• Significant gain from deep learning in search, speech, vision and machine reading comprehension.
• Large-scale, low-latency inference and vector search service in production.
• Heterogeneous hardware and pluggable framework support.