CNN Quantization
Performance evaluation
Emanuele Ghelfi Emiliano Gagliardi
June 18, 2017
Politecnico di Milano
Contents
1 Purposes
2 Quantization
3 Our work
Purposes
Problem
Accuracy/inference speed trade-off
For real-world applications, a convolutional neural network (CNN) model
can take more than 100 MB of space and can be computationally too
expensive
How can embedded devices like smartphones be enabled with the power of
neural networks?
Is floating-point precision really needed?
No, deep neural networks tend to cope well with noise in their input
Training still needs floating-point precision to work, since it is an iteration
of small incremental adjustments of the weights
Purposes
Solution
The solution is quantization.
Deep networks can be trained with floating-point precision; a quantization
algorithm can then be applied to obtain smaller models and speed up the
inference phase by reducing memory requirements:
Fixed-point compute units are typically faster and consume far less
hardware resources and power than floating-point engines
Low-precision data representation reduces the memory footprint,
enabling larger models to fit within the given memory capacity
Purposes
Project purpose
Using a machine learning framework with support for convolutional neural
networks:
Define different kinds of networks
Train
Quantize
Evaluate the original and the quantized models
Compare them in terms of model size, cache misses, and inference time
Quantization
What is quantization?
While developing this project we examined two different approaches:
TensorFlow
Caffe Ristretto
Quantization
TensorFlow quantization
Unsupervised approach
Get a trained network
Obtain for each layer the min and the max of the weight values
Represent the weights, distributed linearly between the minimum and maximum, with 8-bit precision
The operations have to be reimplemented for the 8-bit format
The resulting data structure is composed of an array containing the quantized values, plus the two float values min and max (a minimal sketch follows below)
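Below is a minimal NumPy sketch of this min/max linear quantization scheme. It is illustrative only, under our own assumptions, and not the actual TensorFlow implementation; the function names are ours.

```python
import numpy as np

def quantize_weights(w, num_bits=8):
    # Map float weights linearly onto the [min, max] range using 8-bit integers.
    w_min, w_max = float(w.min()), float(w.max())
    levels = 2 ** num_bits - 1                      # 255 steps for 8 bits
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, w_min, w_max                          # quantized array + the two float bounds

def dequantize_weights(q, w_min, w_max, num_bits=8):
    # Recover approximate float weights from the 8-bit representation.
    levels = 2 ** num_bits - 1
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    return q.astype(np.float32) * scale + w_min

# The reconstruction error is bounded by one quantization step.
w = np.random.randn(5, 5).astype(np.float32)
q, lo, hi = quantize_weights(w)
assert np.abs(dequantize_weights(q, lo, hi) - w).max() <= (hi - lo) / 255
```

This is why each quantized tensor has to travel together with its two float bounds, matching the data structure described above.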
Quantization
Caffe Ristretto quantization
Supervised approach
Get a trained network
Three different methods:
Dynamic fixed point: a modified fixed-point format (see the sketch below)
Minifloat: bit-width reduced floating-point numbers
Power of two: layers with power-of-two parameters don't need any multipliers when implemented in hardware
Evaluate the performance of the network during quantization in order
to keep the accuracy higher than a given threshold
Support for training of quantized networks (fine-tuning)
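As a rough illustration of the first and third methods, here is a NumPy sketch of dynamic fixed point and power-of-two quantization under our own simplifying assumptions; it is not Ristretto's actual code and the function names are ours.

```python
import numpy as np

def to_dynamic_fixed_point(w, bit_width=8):
    # One shared fractional length per tensor/layer, chosen so that the largest
    # magnitude still fits in `bit_width` bits (including the sign bit).
    int_length = int(np.ceil(np.log2(np.abs(w).max() + 1e-12))) + 1
    frac_length = bit_width - int_length
    step = 2.0 ** (-frac_length)
    max_val = (2 ** (bit_width - 1) - 1) * step
    min_val = -(2 ** (bit_width - 1)) * step
    return np.clip(np.round(w / step) * step, min_val, max_val)

def to_power_of_two(w):
    # Round each weight to the nearest power of two (keeping its sign), so a
    # hardware multiplication by the weight reduces to a bit shift.
    sign = np.sign(w)
    exponent = np.round(np.log2(np.abs(w) + 1e-12))
    return sign * 2.0 ** exponent

w = np.random.randn(3, 3).astype(np.float32)
print(to_dynamic_fixed_point(w))
print(to_power_of_two(w))
```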
Quantization
Caffe Ristretto quantization
Our work
Caffe Ristretto
The results obtained with Ristretto on a simple network for the MNIST
dataset are not so satisfying...
network          accuracy   model size (MB)   time (ms)   LL_d misses (10^6)   L1_d misses (10^6)
Original         0.9903     1.7               29.2        32.098               277.189
Dynamic f. p.    0.9829     1.7               126.41      42.077               303.209
Minifloat        0.9916     1.7               29.5        37.149               282.396
Power of two     0.9899     1.7               61.1        35.774               280.819
Linux running on a MacBook Pro, Cachegrind tool for cache statistics. Intel i5 2.9 GHz, 3 MB L3 cache, 16 GB RAM.
The quantized values are still stored at float size after quantization
The quantized layer implementations work with float variables:
they perform the computation with low-precision values stored in float variables
and quantize the results, which are again stored in float variables (see the sketch below)
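In other words the quantization is only simulated. A tiny sketch of the idea, with our own illustrative names rather than Ristretto's API:

```python
import numpy as np

def fake_quantize(w, step=2.0 ** -5):
    # Snap weights to a coarse grid but keep them as float32 ("simulated" quantization).
    return (np.round(w / step) * step).astype(np.float32)

x = np.random.randn(1, 784).astype(np.float32)
w = np.random.randn(784, 10).astype(np.float32)

w_q = fake_quantize(w)   # still float32: same memory footprint as the original weights
y = x @ w_q              # ordinary float32 matmul: no speed-up either
```

This is consistent with the table above: model size, cache misses, and inference time do not improve.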
Our work
TensorFlow
Quantization is better supported
The quantized model is stored with low-precision weights
Some low-precision operations are already implemented
We tried different network topologies to see how quantization affects
different architectures
Our work
How we used the TensorFlow quantization tool
We used Python (with a bit of OO, since we needed a way to use it with
different networks)
An abstract class defines the pattern of the network that the main
script can handle (a sketch follows after this list)
The core methods of the pattern are
prepare: load the data and build the computational graph and the
training step of the network
train: iterate the train step
The main script takes an instance as input and:
calls prepare and train
quantizes the obtained network
evaluates the accuracy
evaluates cache performance using linux-perf
plots the data
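A minimal sketch of this pattern, assuming hypothetical class and method names (only prepare and train come from the slides; everything else, including the graph_transforms invocation details, is illustrative):

```python
from abc import ABC, abstractmethod
import subprocess

class QuantizableNet(ABC):
    """Pattern the main script can handle (sketch; the repository's actual names may differ)."""

    @abstractmethod
    def prepare(self):
        """Load the data and build the computational graph and the training step."""

    @abstractmethod
    def train(self):
        """Iterate the training step."""

def run_experiment(net, in_graph="model.pb", out_graph="model_quantized.pb"):
    net.prepare()
    net.train()
    # Quantize the trained graph with TensorFlow's graph_transforms tool
    # (illustrative invocation: binary path, node names, and transform list
    # depend on the local build and on the network).
    subprocess.run([
        "transform_graph",
        "--in_graph=" + in_graph,
        "--out_graph=" + out_graph,
        "--inputs=input",
        "--outputs=output",
        "--transforms=quantize_weights quantize_nodes",
    ], check=True)
    # ...then evaluate accuracy, collect cache statistics with linux-perf,
    # and plot the data.
```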
Our work
Topologies on MNIST
(a) Big (b) More convolutional (c) More FC (d) Tf Example (e) Only FC
Our work
Some data on MNIST - accuracy
Our work
Some data on MNIST - model size
Our work
Some data on MNIST - data cache misses
Our work
Some data on MNIST - inference time
Our work
Some data on CIFAR10 - accuracy
Our work
Some data on CIFAR10 - model size
Our work
Some data on CIFAR10 - data cache misses
Our work
Some data on CIFAR10 - inference time
Our work
Why is the inference time worse?
We see an improvement in performance only for the size of the model,
and therefore for the data cache misses
Inference time and last-level cache misses are worse in quantized
networks
From the TensorFlow GitHub page:
Only a subset of ops are supported, and on many platforms the quantized
code may actually be slower than the float equivalents, but this is a way of
increasing performance substantially when all the circumstances are right.
Our work
Original net - TensorFlow benchmark tool
Original network on MNIST dataset:
Our work
Quantized net - TensorFlow benchmark tool
Quantized network on MNIST dataset:
Our work
References
TensorFlow quantization: https://www.tensorflow.org/performance/quantization
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
Ristretto: http://lepsucd.com/?page_id=621
GitHub repository of the project:
https://github.com/EmilianoGagliardiEmanueleGhelfi/CNN-compression-performance
Deep Learning with Limited Numerical Precision (Suyog Gupta, Ankur
Agrawal, Kailash Gopalakrishnan)