MLPerf: An Industry Standard Performance
Benchmark Suite for Machine Learning
(Inference)
Christine Cheng
Intel, MLPerf Inference Co-chair
christine.cheng@intel.com
(Work by many people in MLPerf community)
Edge AI and Vision Innovation Forum – Invited Talk – July 2020
Agenda
● Why do we need MLPerf Inference?
● What’s in MLPerf inference?
● Typical Submission Workflow
● Impact of MLPerf inference 2019
● What’s in MLPerf inference 2020?
Why Benchmark Machine Learning Systems?
• Machine learning needs the entire software stack and hardware
working seamlessly together
• Exponential growth in research & innovations
• Need MLPerf to level the playing field
Source: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8259424
[Chart: ML arXiv papers per year; 50+ publications every single day!]
Let’s imagine…
• If you have new AI HW that you are trying to sell…
• Market promotion
• Claim performance against competition
• Demonstrate value to customers
[Image: The Cost of Latency in High-Frequency Trading]
State of ML Benchmarking Today

System X vs. System Y:
What task?
What model?
What dataset?
What batch size?
What quantization?
What software libraries?
…

MLPerf is an ML performance benchmarking effort with wide industry and academic support.
[Logos: researchers from supporting companies and universities]
MLPerf Goals
• Enforce replicability to ensure reliable results
• Use representative workloads, reflecting production use-cases
• Encourage innovation to improve the state-of-the-art of ML
• Accelerate progress in ML via fair and useful measurement
• Serve both the commercial and research communities
• Keep benchmarking affordable (so that all can play)
Agenda
● Why do we need MLPerf Inference?
● What’s in MLPerf inference?
● Typical Submission Workflow
● Impact of MLPerf inference 2019
● What’s in MLPerf inference 2020?
MLPerf inference measures rate of inference

Input (e.g. an image) → Trained model (e.g. ResNet) → Result (e.g. "cat", with required quality, e.g. 75.1%)

Do you specify the model? The Closed division does; the Open division does not.
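To make "rate of inference with required quality" concrete, here is a minimal throughput sketch for the ResNet example, assuming PyTorch and torchvision are installed. It is illustrative only; the official reference implementations live in the mlcommons/inference repository.

    import time
    import torch
    import torchvision

    # Illustrative sketch, not the official MLPerf reference code.
    model = torchvision.models.resnet50(pretrained=True).eval()
    batch = torch.randn(8, 3, 224, 224)  # stand-in for preprocessed images

    with torch.no_grad():
        model(batch)  # warm-up
        start = time.time()
        for _ in range(10):
            model(batch)
        elapsed = time.time() - start

    print(f"throughput ~ {10 * batch.shape[0] / elapsed:.1f} inferences/s")
    # A real submission must also keep top-1 accuracy on the ImageNet
    # validation set at or above the quality target (e.g. 75.1%).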
MLPerf v0.5 Inference Workloads (Datacenter / Edge Inference)

Use Case   Neural Network
Vision     ResNet-50 v1.5
           SSD ResNet-34
           SSD MobileNet v1 (edge only)
Speech     (none)
Language   (none)
Commerce   (none)

Minimal viable set for initial launch (v0.5)
MLPerf v0.7 Inference Workloads (Datacenter / Edge Inference)

Use Case   Neural Network
Vision     ResNet-50 v1.5
           SSD ResNet-34
           SSD MobileNet v1 (edge only)
           3D UNET (new in v0.7)
Speech     RNN-T (new in v0.7)
Language   BERT Large (new in v0.7)
Commerce   DLRM (datacenter only; new in v0.7)

We evolved from a minimum benchmark set to a broad suite (v0.7)
MLPerf v0.7 Inference Workloads

Datacenter / Edge Inference: as on the previous slide (new in v0.7: 3D UNET, RNN-T, BERT Large, DLRM)

Mobile Inference
Use Case   Neural Network
Vision     MobileNetEdgeTPU
           SSD-MobileNet v2
           DeepLabv3
Language   Mobile-BERT

We evolved from a minimum benchmark set to a broad suite (v0.7)
(Cat by Alvesgaspar, Dog by December21st2012Freak)
Four scenarios to handle different use cases
• Single stream (e.g. cell phone augmented vision)
• Multiple stream (e.g. multiple camera driving assistance)
• Server (e.g. translation app)
• Offline (e.g. photo sorting app)
Different metric for each scenario

Scenario          Example                               Metric
Single stream     cell phone augmented vision           Latency
Multiple stream   multiple camera driving assistance    Number of streams subject to latency bound
Server            translation site                      QPS subject to latency bound
Offline           photo sorting                         Throughput
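As intuition for how these metrics differ, the sketch below computes them from one set of hypothetical per-query latencies. The numbers and the 15 ms bound are assumptions, and the official metrics are computed by the load generator, not like this.

    import numpy as np

    rng = np.random.default_rng(0)
    latencies = rng.exponential(scale=0.004, size=10_000)  # hypothetical, seconds
    LATENCY_BOUND_S = 0.015  # assumed; real bounds are set per benchmark

    single_stream = np.percentile(latencies, 90)  # single stream: tail latency
    offline = len(latencies) / latencies.sum()    # offline: pure throughput,
                                                  # assuming serialized queries
    p99 = np.percentile(latencies, 99)            # server: QPS counts only if
    server_ok = p99 <= LATENCY_BOUND_S            # the tail meets the bound

    print(f"single-stream p90 latency: {1e3 * single_stream:.2f} ms")
    print(f"offline throughput: {offline:.0f} samples/s")
    print(f"server latency bound {'met' if server_ok else 'violated'} "
          f"(p99 = {1e3 * p99:.2f} ms)")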
Inference Submitters' Implementations
• Even greater range of software and hardware solutions
• So, allow submitters to reimplement, subject to inference rules
• Use a standard set of pre-trained weights for the Closed division
• Use a standard C++ "load generator" that handles scenarios and metrics

[Diagram: the load generator generates queries for the system under test (SUT), times the responses, and validates results; the SUT must use the common weights]
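Here is a sketch of how a submitter's system under test hooks into LoadGen, using the Python bindings (mlperf_loadgen) that ship with the mlcommons/inference repository. Constructor and callback signatures have shifted slightly between LoadGen versions, so treat the exact calls as illustrative.

    import mlperf_loadgen as lg

    def issue_queries(query_samples):
        # Run inference on each sample here, then tell LoadGen we are done;
        # LoadGen timestamps completions to compute the scenario's metric.
        responses = [lg.QuerySampleResponse(qs.id, 0, 0) for qs in query_samples]
        lg.QuerySamplesComplete(responses)

    def flush_queries():
        pass

    def load_samples(indices):    # stage samples into memory
        pass

    def unload_samples(indices):
        pass

    settings = lg.TestSettings()
    settings.scenario = lg.TestScenario.Offline  # or SingleStream / MultiStream / Server
    settings.mode = lg.TestMode.PerformanceOnly

    sut = lg.ConstructSUT(issue_queries, flush_queries)
    qsl = lg.ConstructQSL(1024, 1024, load_samples, unload_samples)
    lg.StartTest(sut, qsl, settings)
    lg.DestroyQSL(qsl)
    lg.DestroySUT(sut)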
Not a quantization contest!
• Quantization is key to efficient inference, but we do not want a quantization contest
• Can the Closed division quantize?
  • Yes, but it must be principled: describe a reproducible method
• Can the Closed division calibrate?
  • Yes, but it must use a fixed set of calibration data
• Can the Closed division retrain?
  • No, it is not a retraining contest. But we provide retrained 8-bit models.

[Diagram: FP32 weights → FP/INTx weights?]
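For intuition about what calibrating on a fixed dataset means, here is a generic max-calibration sketch that derives a per-tensor INT8 scale from a fixed batch of values. This is one common technique, not MLPerf's prescribed procedure.

    import numpy as np

    rng = np.random.default_rng(0)
    calibration_batch = rng.normal(size=(256, 1024)).astype(np.float32)  # fixed set

    # Max calibration: choose a scale so the observed range maps onto int8.
    scale = np.abs(calibration_batch).max() / 127.0

    def quantize(x):
        return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

    def dequantize(q):
        return q.astype(np.float32) * scale

    x = rng.normal(size=(8, 1024)).astype(np.float32)
    err = np.abs(dequantize(quantize(x)) - x).mean()
    print(f"scale = {scale:.5f}, mean abs quantization error = {err:.5f}")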
Agenda
● Why do we need MLPerf Inference?
● What’s in MLPerf inference?
● Typical Submission Workflow
● Impact of MLPerf inference 2019
● What’s in MLPerf inference 2020?
Typical Workflow by week number

[MLPerf timeline, weeks -12 through +5 relative to the week-0 submission deadline: the benchmark list freezes first, then code and rules freeze closer to the deadline; after submission, results are reviewed by the committee and submitters, ending in result publication.]
Typical Workflow by week number

[Combined timeline, weeks -12 through +5. MLPerf track: benchmark list freeze, then code & rule freeze, submission deadline at week 0, result review by committee & submitters, then result publication. Submitter track: sign the CLA; propose models and read the current rules; attend weekly submitter meetings to discuss details on rules, models, and implementations; develop SW and clarify rules; tune SW for HW and pass the compliance checker; after submission, marketing preparation and, ideally, a SW release.]
Agenda
● Why do we need MLPerf Inference?
● What’s in MLPerf inference?
● Typical Submission Workflow
● Impact of MLPerf inference 2019
● What’s in MLPerf inference 2020?
Inference Results: Diverse Systems & Applications
● 600+ inference results
● Over 30 systems submitted
● 10,000x difference in performance
MLPerf is increasing market transparency
ML hardware is projected to be a ~$60B industry in 2025.
(Tractica.com: $66.3B; Marketsandmarkets.com: $59.2B)

"What gets measured gets improved." — Peter Drucker

Benchmarking aligns research with development, engineering with marketing, and competitors across the industry in pursuit of the same clear objective.
Agenda
● Why do we need MLPerf Inference?
● What’s in MLPerf inference?
● Typical Submission Workflow
● Impact of MLPerf inference 2019
● What’s in MLPerf inference 2020?
Quarterly Cadence (post COVID)

[Timeline, May through December: four rounds roughly one quarter apart, each with a submission deadline followed by result publication: Training 0.7, Inference 0.7, Training 0.8, Inference 0.8.]
Evolving benchmark suite

Vision / Image classification: ResNet (training, inference)
Vision / Object detection: SSD-MobileNet v1 (inference), SSD-ResNet34 (training, inference)
Vision / Image segmentation: Mask R-CNN (training)
Vision / Medical Imaging: 3D UNET (inference)
Speech / Speech-to-text: RNN-T (inference)
Language / Translation: GNMT, Transformer (training)
Language / NLP: BERT (training, inference)
Commerce / Recommendation: DLRM (training, inference)
Research / Reinforcement Learning: MiniGo (training)

(Version legend in the original slide: v0.6, v0.7, v0.8, N/A)
Evolving benchmark suite (now including Mobile)

Vision / Image classification: ResNet (training, inference); MobileNetEdgeTPU (mobile)
Vision / Object detection: SSD-MobileNet v1 (inference), SSD-ResNet34 (training, inference); SSD-MN v2 (mobile)
Vision / Image segmentation: Mask R-CNN (training); DeepLab-MN v3 (mobile)
Vision / Medical Imaging: 3D UNET (inference)
Speech / Speech-to-text: RNN-T (inference)
Language / Translation: GNMT, Transformer (training)
Language / NLP: BERT (training, inference); MobileBERT (mobile)
Commerce / Recommendation: DLRM (training, inference)
Research / Reinforcement Learning: MiniGo (training)

(Version legend in the original slide: v0.6, v0.7, v0.8, N/A)
Evolving benchmark suite (planned v0.8 changes)

Vision / Image classification: ResNet (training, inference); MobileNetEdgeTPU (mobile)
Vision / Object detection: SSD-MobileNet v1 (inference; upgrade in v0.8), SSD-ResNet34 (training, inference; upgrade in v0.8); SSD-MN v2 (mobile)
Vision / Image segmentation: Mask R-CNN (training); DeepLab-MN v3 (mobile)
Vision / Medical Imaging: 3D UNET (inference)
Speech / Speech-to-text: RNN-T (inference)
Language / Translation: GNMT (training; remove in v0.8), Transformer (training)
Language / NLP: BERT (training, inference); MobileBERT (mobile)
Commerce / Recommendation: DLRM (training, inference)
Research / Reinforcement Learning: MiniGo (training)

(Version legend in the original slide: v0.6, v0.7, v0.8, N/A)
Challenges in 2020
● Evolve the benchmark suites fairly
● Improve efficiency information
● Reduce result sparsity
● Reduce benchmarking cost
Evolve the benchmark suites fairly
● What does fair mean?
○ Reflect the most impactful industry and research needs
● What is most impactful?
○ Convene Advisory Boards of 3-5 industry users + 3-5 academics, and ask
○ Existing: Recommendation, Medical Imaging
○ Forming: Automotive, Speech
Improve efficiency information

The problem: inference in particular is infinitely scalable. Benchmark results alone are not enough; we need more information to determine efficiency.

Now: number of chips. Crude, but better than nothing.
Future: power, cloud cost, others? Not simple.

System   Offline Inference: ResNet   Chips   Power
Foo      800 ips                     1       1200 W
Bar      1000 ips                    4       400 W
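The table makes the point: the snippet below recomputes it under the two lenses (per chip, today's crude normalization, and per watt, one possible future metric), and the "winner" flips.

    # Illustrative numbers from the table above.
    systems = {
        "Foo": {"ips": 800, "chips": 1, "watts": 1200},
        "Bar": {"ips": 1000, "chips": 4, "watts": 400},
    }

    for name, s in systems.items():
        per_chip = s["ips"] / s["chips"]   # today: normalize by chip count
        per_watt = s["ips"] / s["watts"]   # future: normalize by power
        print(f"{name}: {per_chip:.0f} ips/chip, {per_watt:.2f} ips/W")

    # Foo: 800 ips/chip, 0.67 ips/W
    # Bar: 250 ips/chip, 2.50 ips/W  -> the winner depends on the metric.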
Reduce result sparsity

The problem:
          Benchmark X   Benchmark Y
Chip A    23            (no result)
Chip B    (no result)   17,583

Why do we allow sparse results?
• Specialized chips
• Vendors need to focus investment

Ways to make results denser?
• Prune benchmarks x scenarios?
• Make a small number required?
Reduce benchmarking cost
● Why is good ML benchmarking relatively expensive?
a. We benchmark real user value
b. Software stack and system diversity; need to allow reimplementation
c. Strict requirements on inference model equivalence to the reference
● Ways to reduce cost:
a. Out-of-the-box code division?
b. Improve our reference code and best practices
Good ML benchmarking is hard.
We welcome ideas and contributions.
How to get involved?
mlperf.org/get-involved
info@mlperf.org
Backup
MLPerf is the work of many
Aaron Zhong David Patterson Jared Duke Peter Bailis
Abid Muslim Debajyoti Pal Jeff Jiao Peter Baldwin
Andrew Hock Debo Dutta Jeffery Liao Peter Mattson
Ankur Ankur Deepak Narayanan Jonah Alben Ramesh Chukka
Anton Lokhmotov Dehao Chen Jonathan Cohen Sachin Idgunji
Arun Rajan Dilip Sequeira Kim Hazelwood Sam Davis
Ashish Sirasao Ephrem Wu Koichi Yamada Sarah Bird
Atsushi Ike Fei Sun Lillian Pentecost Sergey Serebryakov
Bill Jia Francisco Massa Lingjie Xu Steve Farrell
Bing Yu Frank Wei Mark Charlebois Taylor Robie
Brian Anderson Gennady Pekhimenko Masafumi Yamazaki Tayo Oguntebi
Carole-Jean Wu George Yuan Matei Zaharia Thomas B. Jablin
Christine Cheng Greg Diamos Maximilien Breughe Tom St. John
Cliff Young Gu-Yeon Wei Michael Thomson Tsuguchika Tabaru
Cody Coleman Guenther Schmuelling Naveen Kumar Udit Gupta
Colin Osborne Guokai Ma Pan Deng Victor Bittorf
Daniel Kang Hanlin Tang Pankaj Kanwar Vijay Janapa Reddi
Dave Fick Itay Hubara Paulius Micikevicious William Chou
David Brooks J. Scott Gardner Peizhao Zhang Xinyuan Huang
David Kanter Jacob Balma Peng Meng Yuchen Zhou
Agenda
● Why do we need MLPerf Inference?
● What’s in MLPerf inference?
● Typical Submission Workflow
● Impact of MLPerf inference 2019
● What’s in MLPerf inference 2020?
● How else can we make ML better?
We are creating a non-profit called
MLCommons with a mission to
“Accelerate ML innovation.”
Recipe for accelerated innovation

Benchmarks + Large public datasets + Best practices + Outreach

Photo credits (left to right): Simon A. Eugster CC BY-SA 3.0; Riksantikvarieämbetet / Pål-Nils Nilsson CC BY 2.5 SE; Public Domain; Public Domain
Best practice example: MLBox

Need a lightweight way to share models for benchmarking or experiments.
Basic idea: a Docker with a standard file system interface.
Goal: a shipping container for ML models.

MLBox (a Docker) layout, running on a Platform (HW + OS):
  datasets/           (directory)
  params              (file)
  platform_spec       (file)
  platform_instance   (file)
  trained_model/      (directory)
  outputs/            (directory)
  logs/               (directory)

[Photo: shipping container; Wikimedia, https://commons.wikimedia.org/wiki/File:Container_01_KMJ.jpg]
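As a sketch of the idea (MLBox was an early proposal, so the helper below and its names are hypothetical, derived only from the layout on the slide):

    from pathlib import Path

    # Hypothetical sketch of the MLBox standard file-system interface.
    LAYOUT = {
        "datasets": "dir", "trained_model": "dir", "outputs": "dir", "logs": "dir",
        "params": "file", "platform_spec": "file", "platform_instance": "file",
    }

    def scaffold(root: Path) -> None:
        """Create the standard layout under `root`."""
        root.mkdir(parents=True, exist_ok=True)
        for name, kind in LAYOUT.items():
            path = root / name
            path.mkdir(exist_ok=True) if kind == "dir" else path.touch()

    def validate(root: Path) -> bool:
        """Check that a shared MLBox exposes the expected interface."""
        return all(
            (root / n).is_dir() if k == "dir" else (root / n).is_file()
            for n, k in LAYOUT.items()
        )

    scaffold(Path("my_mlbox"))
    print(validate(Path("my_mlbox")))  # True once the layout is complete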
Dataset example: People's Speech

● Public datasets fuel innovation
  ○ ~80% of a sample of large tech company research papers cite public datasets [study by Dr. Vijay Janapa Reddi, Harvard]
● "People's Speech" dataset
  ○ Goal:
    ■ 100,000+ hours of transcribed speech in diverse languages by diverse speakers
    ■ Public-use license
  ○ Why?
    ■ Smart speakers / assistants expected to reach the entire Earth population by 2025
    ■ 1000+ languages with 1M+ speakers

[Map: https://commons.wikimedia.org/wiki/File:List_of_languages_by_number_of_native_speakers.png]