CNN Quantization
Performance evaluation
Emanuele Ghelfi Emiliano Gagliardi
June 18, 2017
Politecnico di Milano
Contents
1 Purposes
2 Quantization
3 Our work
Purposes
Problem
Accuracy/inference speed trade-off
For real-world applications, a convolutional neural network (CNN) model
can take more than 100 MB of space and can be computationally too
expensive
How can embedded devices like smartphones be enabled with the power of
neural networks?
Is floating-point precision really needed?
No, deep neural networks tend to cope well with noise in their input
Training still needs floating-point precision to work, since it is an iteration
of small incremental adjustments of the weights
Purposes
Solution
The solution is quantization.
Deep networks can be trained with floating-point precision; a quantization
algorithm can then be applied to obtain smaller models and speed up the
inference phase by reducing memory requirements:
Fixed-point compute units are typically faster and consume far less
hardware resources and power than floating-point engines
Low-precision data representation reduces the memory footprint,
enabling larger models to fit within the given memory capacity
Purposes
Project purpose
Using a machine learning framework with support for convolutional neural
networks:
Define different kinds of networks
Train
Quantize
Evaluate the original and the quantized models
Compare them in terms of model size, cache misses, and inference time
Quantization
What is quantization?
While developing this project we examined two different approaches:
TensorFlow
Caffe Ristretto
Quantization
TensorFlow quantization
Unsupervised approach
Get a trained network
Obtain for each layer the min and the max of the weight values
Represent the weights, distributed linearly between the minimum and maximum, with 8-bit precision
The operations have to be reimplemented for the 8-bit format
The resulting data structure is composed of an array containing the quantized values, plus the two float values min and max (a minimal sketch follows below)
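Below is a minimal NumPy sketch of this min/max linear quantization scheme. It is illustrative only, under our own assumptions, and not the actual TensorFlow implementation; the function names are ours.

```python
import numpy as np

def quantize_weights(w, num_bits=8):
    # Map float weights linearly onto the [min, max] range using 8-bit integers.
    w_min, w_max = float(w.min()), float(w.max())
    levels = 2 ** num_bits - 1                      # 255 steps for 8 bits
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, w_min, w_max                          # quantized array + the two float bounds

def dequantize_weights(q, w_min, w_max, num_bits=8):
    # Recover approximate float weights from the 8-bit representation.
    levels = 2 ** num_bits - 1
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    return q.astype(np.float32) * scale + w_min

# The reconstruction error is bounded by one quantization step.
w = np.random.randn(5, 5).astype(np.float32)
q, lo, hi = quantize_weights(w)
assert np.abs(dequantize_weights(q, lo, hi) - w).max() <= (hi - lo) / 255
```

This is why each quantized tensor has to travel together with its two float bounds, matching the data structure described above.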
Quantization
Caffe Ristretto quantization
Supervised approach
Get a trained network
Three different methods:
Dynamic fixed point: a modified fixed-point format (see the sketch below)
Minifloat: bit-width reduced floating-point numbers
Power of two: layers with power-of-two parameters don't need any multipliers when implemented in hardware
Evaluate the performance of the network during quantization in order
to keep the accuracy higher than a given threshold
Support for training of quantized networks (fine-tuning)
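As a rough illustration of the first and third methods, here is a NumPy sketch of dynamic fixed point and power-of-two quantization under our own simplifying assumptions; it is not Ristretto's actual code and the function names are ours.

```python
import numpy as np

def to_dynamic_fixed_point(w, bit_width=8):
    # One shared fractional length per tensor/layer, chosen so that the largest
    # magnitude still fits in `bit_width` bits (including the sign bit).
    int_length = int(np.ceil(np.log2(np.abs(w).max() + 1e-12))) + 1
    frac_length = bit_width - int_length
    step = 2.0 ** (-frac_length)
    max_val = (2 ** (bit_width - 1) - 1) * step
    min_val = -(2 ** (bit_width - 1)) * step
    return np.clip(np.round(w / step) * step, min_val, max_val)

def to_power_of_two(w):
    # Round each weight to the nearest power of two (keeping its sign), so a
    # hardware multiplication by the weight reduces to a bit shift.
    sign = np.sign(w)
    exponent = np.round(np.log2(np.abs(w) + 1e-12))
    return sign * 2.0 ** exponent

w = np.random.randn(3, 3).astype(np.float32)
print(to_dynamic_fixed_point(w))
print(to_power_of_two(w))
```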
Quantization
Caffe Ristretto quantization
Our work
Caffe Ristretto
The results obtained with Ristretto on a simple network for the MNIST
dataset are not so satisfying...
network          accuracy   model size (MB)   time (ms)   LL_d misses (10^6)   L1_d misses (10^6)
Original         0.9903     1.7               29.2        32.098               277.189
Dynamic f. p.    0.9829     1.7               126.41      42.077               303.209
Minifloat        0.9916     1.7               29.5        37.149               282.396
Power of two     0.9899     1.7               61.1        35.774               280.819
Linux running on a MacBook Pro, Cachegrind tool for cache statistics. Intel i5 2.9 GHz, 3 MB L3 cache, 16 GB RAM.
The quantized values are still stored at float size after quantization
The quantized layer implementations work with float variables:
they perform the computation with low-precision values stored in float variables
and quantize the results, which are again stored in float variables (see the sketch below)
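In other words the quantization is only simulated. A tiny sketch of the idea, with our own illustrative names rather than Ristretto's API:

```python
import numpy as np

def fake_quantize(w, step=2.0 ** -5):
    # Snap weights to a coarse grid but keep them as float32 ("simulated" quantization).
    return (np.round(w / step) * step).astype(np.float32)

x = np.random.randn(1, 784).astype(np.float32)
w = np.random.randn(784, 10).astype(np.float32)

w_q = fake_quantize(w)   # still float32: same memory footprint as the original weights
y = x @ w_q              # ordinary float32 matmul: no speed-up either
```

This is consistent with the table above: model size, cache misses, and inference time do not improve.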
Our work
TensorFlow
Quantization is better supported
The quantized model is stored with low-precision weights
Some low-precision operations are already implemented
We tried different network topologies to see how quantization affects
different architectures
Our work
How we used the TensorFlow quantization tool
We used Python (with a bit of OO, since we needed a way to use it with
different networks)
An abstract class defines the pattern of the network that the main
script can handle (a sketch follows after this list)
The core methods of the pattern are
prepare: load the data and build the computational graph and the
training step of the network
train: iterate the train step
The main script takes an instance as input and:
calls prepare and train
quantizes the obtained network
evaluates the accuracy
evaluates cache performance using linux-perf
plots the data
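A minimal sketch of this pattern, assuming hypothetical class and method names (only prepare and train come from the slides; everything else, including the graph_transforms invocation details, is illustrative):

```python
from abc import ABC, abstractmethod
import subprocess

class QuantizableNet(ABC):
    """Pattern the main script can handle (sketch; the repository's actual names may differ)."""

    @abstractmethod
    def prepare(self):
        """Load the data and build the computational graph and the training step."""

    @abstractmethod
    def train(self):
        """Iterate the training step."""

def run_experiment(net, in_graph="model.pb", out_graph="model_quantized.pb"):
    net.prepare()
    net.train()
    # Quantize the trained graph with TensorFlow's graph_transforms tool
    # (illustrative invocation: binary path, node names, and transform list
    # depend on the local build and on the network).
    subprocess.run([
        "transform_graph",
        "--in_graph=" + in_graph,
        "--out_graph=" + out_graph,
        "--inputs=input",
        "--outputs=output",
        "--transforms=quantize_weights quantize_nodes",
    ], check=True)
    # ...then evaluate accuracy, collect cache statistics with linux-perf,
    # and plot the data.
```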
Our work
Topologies on MNIST
(a) Big (b) More convolutional (c) More FC (d) Tf Example (e) Only FC
Our work
Some data on MNIST - accuracy
Our work
Some data on MNIST - model size
Our work
Some data on MNIST - data cache misses
Our work
Some data on MNIST - inference time
Our work
Some data on CIFAR10 - accuracy
Our work
Some data on CIFAR10 - model size
Our work
Some data on CIFAR10 - data cache misses
Our work
Some data on CIFAR10 - inference time
Our work
Why is the inference time worse?
We see an improvement in performance only for the size of the model,
and therefore for the data cache misses
Inference time and last-level cache misses are worse in quantized
networks
From the TensorFlow GitHub page:
Only a subset of ops are supported, and on many platforms the quantized
code may actually be slower than the float equivalents, but this is a way of
increasing performance substantially when all the circumstances are right.
Our work
Original net - TensorFlow benchmark tool
Original network on MNIST dataset:
Our work
Quantized net - TensorFlow benchmark tool
Quantized network on MNIST dataset:
Our work
References
TensorFlow quantization: https://www.tensorflow.org/performance/quantization
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
Ristretto: http://lepsucd.com/?page_id=621
GitHub repository of the project:
https://github.com/EmilianoGagliardiEmanueleGhelfi/CNN-compression-performance
Deep Learning with Limited Numerical Precision (Suyog Gupta, Ankur
Agrawal, Kailash Gopalakrishnan)