Francesco Pugliese, PhD
Italian National Institute of Statistics, Division "Information and Application Architecture", Directorate for Methodology and Statistical Design
Matteo Testi, MSc
Data Science at Sferanet S.r.l.
Email Francesco Pugliese: francesco.pugliese@istat.it
Email Matteo Testi: testi@sferaspa.com
• Image classification is the task of taking an input image and outputting a class (cat, dog, etc.) or a probability over the classes that best describes the image. For humans, this kind of recognition is one of the first skills we learn.
• When we see an image, or simply look at the world around us, most of the time we can immediately characterize the scene and give each object a label, without even consciously noticing it.
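To make "a probability over the classes" concrete, here is a minimal Python sketch; the label set and raw scores are made up for illustration. A softmax turns a network's raw output scores into per-class probabilities.

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    exp = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exp / exp.sum()

labels = ["cat", "dog", "bird"]       # hypothetical label set
logits = np.array([3.1, 1.2, -0.5])   # hypothetical network outputs for one image
for label, p in zip(labels, softmax(logits)):
    print(f"{label}: {p:.2%}")        # here the image is most likely a cat
```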
What we see vs. what computers see (an array of pixel intensities).
Convolutional Neural Networks (CNNs) are biologically inspired variants of MLPs. From Hubel and Wiesel's early work on the cat's visual cortex, we know the visual cortex contains a complex arrangement of cells. These cells are sensitive to small sub-regions of the visual field, called receptive fields.
• For example, some neurons fired when exposed to vertical edges and others when shown horizontal or diagonal edges.
• Such neurons act as feature identifiers.
LeNet was one of the very first convolutional neural networks and helped propel the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet-5 after several successful earlier iterations dating back to 1988.
The MNIST database of handwritten digits comprises 70,000 patterns:
• 60,000 training images
• 10,000 test images
LeNet has been applied to this dataset with an error rate of 0.95% (about 99% accuracy).
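A minimal LeNet-5-style sketch in Keras, assuming the standard tf.keras API; the layer sizes follow the classic architecture, while the optimizer and epoch count are illustrative rather than tuned.

```python
from tensorflow.keras import datasets, layers, models

# Load MNIST: 60,000 training and 10,000 test images, 28x28 grayscale.
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0  # add a channel axis, scale to [0, 1]
x_test = x_test[..., None] / 255.0

# LeNet-5-style stack: two conv/pool stages followed by three dense layers.
model = models.Sequential([
    layers.Conv2D(6, 5, padding="same", activation="tanh", input_shape=(28, 28, 1)),
    layers.AveragePooling2D(2),
    layers.Conv2D(16, 5, activation="tanh"),
    layers.AveragePooling2D(2),
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),  # one output per digit class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_data=(x_test, y_test))
```

A few epochs of this sketch typically reach roughly 98-99% test accuracy, consistent with the error rate quoted above.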
So far, across dozens of experiments, the resulting performance has been orders of magnitude better than that of other machine learning techniques available today. Three factors made this possible:
GPU
• The advent of GPUs makes it possible to train very large neural networks, even with more than 150 million parameters.
BIG DATA
• A new generation of larger training and test sets.
DROPOUT
• Better model regularization techniques have been discovered, such as "Dropout" and "Data Augmentation".
- A recent study supports a relationship between vision capabilities and intelligence (Tsukahara et al., 2016).
- Computer Vision needs human-like abilities.
EVERYDAY LIFE vs. BIOMEDICAL IMAGES
• A new generation of machines might accomplish typical human tasks such as recognizing and moving objects, driving cars, cultivating fields, cleaning streets, collecting city garbage, etc.
ImageNet: a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.
• Since 2010, a competition called the "ImageNet Large-Scale Visual Recognition Challenge" (ILSVRC) has used a subset of ImageNet with roughly 1,000 images in each of 1,000 categories:
Train Set: 1.2 million training images
Validation Set: 50,000 images
Test Set: 150,000 images
Kaggle: founded in 2010 as a platform for predictive modeling and analytics competitions, on which companies and researchers post their data.
• Statisticians and data scientists from all over the world compete to produce the best models.
• Data Science Bowl 2017 was the biggest competition, focused on lung cancer detection. It was funded by the Arnold Foundation and awarded $1 million in prizes ($500,000 for 1st place).
Train Set: around 150 labelled CT scan images per patient, from 1,200 patients, encoded in DICOM format.
Stage 1 Test Set: CT scans from 190 patients.
Stage 2 Test Set: CT scans from 500 patients.
Grand Challenges in Biomedical Image Analysis: a website hosting new competitions in the biomedical field. Specifically, LUNA (LUng Nodule Analysis) focuses on a large-scale evaluation of automatic nodule detection algorithms.
Train Set: the LIDC/IDRI database, consisting of 888 CT scans labelled by 4 expert radiologists.
Each neuron in the convolutional layer is connected only to a local region of the input volume spatially. In this example there are 5 neurons along the depth, all looking at the same region.
Convolutional Neural Networks (CNNs) are biologically inspired variants of MLPs. We know the visual cortex contains a complex arrangement of cells (Hubel, D. and Wiesel, T., 1968). These cells are sensitive to small sub-regions of the visual field, called receptive fields. Other layers are: ReLU layer, Pool layer. Typical CNN settings are: a) number of kernels (filters), b) receptive field size F, c) padding P, d) stride S. For an input of spatial size W, these parameters are tied by the standard output-size equation:
    output size = (W - F + 2P) / S + 1
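A small Python helper applying this relation; the worked example in the comment uses AlexNet's well-known first layer, not a figure from these slides.

```python
def conv_output_size(w, f, p, s):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    assert (w - f + 2 * p) % s == 0, "hyperparameters do not tile the input evenly"
    return (w - f + 2 * p) // s + 1

# AlexNet's first layer: 227x227 input, 11x11 kernels, no padding, stride 4.
print(conv_output_size(w=227, f=11, p=0, s=4))  # -> 55
```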
ConvNet timeline: AlexNet (2012) → VGG Net (2nd ranked in 2014) → GoogLeNet (2014) → Residual Nets (2015).
Traditional issues with convolutional layers:
• Wide convolutional layers lead to overfitting and to the vanishing-gradient problem with the solver (SGD, Adam, etc.).
• Shallow architectures produce raw features (the depth needs to be pushed further).
Model regularization (a Keras sketch follows this list):
• Dropout, which counters co-adaptation and acts as a models ensemble (Srivastava, N. et al., 2014).
• Weight penalty L1/L2.
• Data Augmentation (crop, flip, rotation, ZCA whitening, etc.).
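A hedged Keras sketch of these regularizers; the layer choices and hyperparameters are illustrative stand-ins, not a recipe from the slides (ZCA whitening has no built-in Keras layer and is omitted).

```python
from tensorflow.keras import layers, models, regularizers

# Data augmentation: crop, flip, and rotation applied on the fly during training.
augment = models.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomCrop(28, 28),  # assumes 32x32 inputs; takes a random 28x28 window
])

model = models.Sequential([
    augment,
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),  # L2 weight penalty
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dropout(0.5),  # Dropout (Srivastava, N. et al., 2014)
    layers.Dense(10, activation="softmax"),
])
```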
Critical Features (Krizhevsky, A. et al., 2012):
• 8 trainable layers: 5 convolutional layers and 3 fully connected layers (see the Keras sketch below).
• Max pooling layers after the 1st, 2nd and 5th convolutional layers.
• Rectified Linear Units (ReLUs) (Nair, V., & Hinton, G. E., 2010).
• Local Response Normalization.
• 60 million parameters, 650 thousand neurons.
• Regularization: Dropout (prob. 0.5 in the first 2 fully connected layers) and Data Augmentation (translations, horizontal reflections, PCA on RGB channels).
• Trained on 2 GTX 580 3 GB GPUs.
Results:
• 1 CNN: 40.7% Top-1 Error, 18.2% Top-5 Error.
• 5 CNNs: 38.1% Top-1 Error, 16.4% Top-5 Error.
• SIFT+FVs: 26.2% Top-5 Error (Sánchez, J., et al., 2013).
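A compact, single-stream Keras sketch of the 8 trainable layers listed above. Local Response Normalization is left out because tf.keras has no built-in LRN layer (only the low-level op tf.nn.local_response_normalization), and the original two-GPU split is not reproduced.

```python
from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Conv2D(96, 11, strides=4, activation="relu",
                  input_shape=(227, 227, 3)),
    layers.MaxPooling2D(3, strides=2),               # pooling after the 1st conv
    layers.Conv2D(256, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),               # pooling after the 2nd conv
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),               # pooling after the 5th conv
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),                             # dropout on the 1st FC layer
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),                             # dropout on the 2nd FC layer
    layers.Dense(1000, activation="softmax"),        # 1,000 ImageNet classes
])
```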
Critical Features (Simonyan, K., & Zisserman, A., 2014):
• Kernels with small receptive fields: 3x3, the smallest size that captures the notion of left/right, up/down, and center. It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and so on (checked in the snippet below).
• Small receptive fields are a way to increase the nonlinearity of the decision function of the conv. layers.
• Increasing-depth architectures: VGG-16 (2xConv3-64, 2xConv3-128, 3xConv3-256, 6xConv3-512, 3xFC), VGG-19 (same as VGG-16 but with 8xConv3-512).
• Upside: less complex topology; outperforms GoogLeNet on single-network classification accuracy.
• Downside: 138 million parameters for VGG-16!
Results:
• Multi-ConvNet model: (D/[256;512]/256,384,512), (E/[256;512]/256,384,512), multi-crop & dense eval: 23.7% Top-1 Error, 6.8% Top-5 Error.
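A two-line check of the stacked-3×3 claim: with stride 1 and no pooling, n stacked k×k convolutions see a region of size 1 + n(k − 1).

```python
def effective_receptive_field(n_layers, kernel=3):
    """Receptive field of n stacked k x k convs, stride 1, no pooling."""
    return 1 + n_layers * (kernel - 1)

print(effective_receptive_field(2))  # -> 5: two 3x3 convs act like one 5x5
print(effective_receptive_field(3))  # -> 7: three 3x3 convs act like one 7x7
```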
Critical Features (Szegedy, C., et al., 2015):
• Computationally efficient deep architecture: 22 layers.
• Why the name "Inception"? Because the module represents a network within a network. If you don't get the reference, go watch Christopher Nolan's "Inception"; computer scientists are hilarious.
• Inception module: basically the parallel combination of 1×1, 3×3, and 5×5 convolutional filters (sketched below).
• Bottleneck layer: the great insight of the Inception module is the use of 1×1 convolutional blocks (NiN) to reduce the number of features before the expensive parallel blocks.
• Upside: 4 million parameters!
• Downside: not scalable!
Results:
• 7-model ensemble: 6.67% Top-5 Error.
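A hedged Keras sketch of a single Inception module, with 1×1 bottleneck blocks placed before the expensive 3×3 and 5×5 branches; the input shape and filter counts are illustrative.

```python
from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(28, 28, 192))
b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(inputs)  # 1x1 branch
b2 = layers.Conv2D(96, 1, padding="same", activation="relu")(inputs)  # bottleneck
b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(b2)     # 3x3 branch
b3 = layers.Conv2D(16, 1, padding="same", activation="relu")(inputs)  # bottleneck
b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)      # 5x5 branch
b4 = layers.MaxPooling2D(3, strides=1, padding="same")(inputs)
b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)      # pool projection
outputs = layers.Concatenate()([b1, b2, b3, b4])  # branches stacked along the depth
inception_module = Model(inputs, outputs)
```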
Critical Features (He, K., et al., 2016):
• Degradation problem: stacking more and more layers IS NOT automatically better. As network depth increases, accuracy saturates and then degrades rapidly! It is an issue of "solvers".
• ResNet solves the degradation problem by fitting a residual mapping, which is easier to optimize.
• Shortcut connections: identity connections that skip one or more layers (see the sketch below).
• Very deep architectures: up to 1,202 layers (and WideResNet with only 19.4 million parameters)!
• Upside: accuracy keeps increasing with more depth.
• Downside: they don't consider other architectural breakthroughs.
Results:
• ResNet: 3.57% Top-5 Error.
• CNNs show superhuman abilities at image recognition! Human Top-5 error is estimated at 5% (Johnson, R. C., 2015).
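A minimal Keras sketch of a residual block: the identity shortcut means the stacked layers only have to fit the residual F(x) = H(x) − x, and the block outputs F(x) + x. The filter count and block layout are illustrative.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic two-conv residual block with an identity shortcut."""
    # Assumes x already has `filters` channels; otherwise the shortcut
    # would need a 1x1 conv projection before the addition.
    shortcut = x                                   # shortcut connection (identity)
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                # output = F(x) + x
    return layers.Activation("relu")(y)
```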
Problems:
• Feature extraction: in biomedicine, feature extraction is not as easy as in an ImageNet competition with general images. A prior image-preprocessing step is needed; this is called Segmentation.
• On the Kaggle website there are whole competitions dedicated just to segmentation. One of these was called "Ultrasound Nerve Segmentation".
Critical Features (Ronneberger, O., et al., 2015):
• U-NET can be trained end-to-end from very few images and outperforms the prior best methods.
• It consists of a contracting path (left side) to capture context and a symmetric expansive path (right side) enabling precise localization (a reduced sketch follows).
• The upsampling part (repeating rows and columns) has a large number of feature channels, which allows the network to propagate context information to higher-resolution layers.
• Spatial Dropout: dropout of whole feature maps.
• Upside: works with a small training set.
• Downside: risk of overfitting.
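A heavily reduced Keras sketch of the U-Net idea: one contracting level, one expansive level, and a skip connection carrying high-resolution features across; the real network has four such levels and many more channels.

```python
from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(128, 128, 1))
c1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
p1 = layers.MaxPooling2D(2)(c1)                          # contracting path
c2 = layers.Conv2D(128, 3, padding="same", activation="relu")(p1)
u1 = layers.UpSampling2D(2)(c2)                          # upsampling repeats rows/cols
m1 = layers.Concatenate()([u1, c1])                      # skip connection: context + detail
c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(m1)
outputs = layers.Conv2D(1, 1, activation="sigmoid")(c3)  # per-pixel segmentation mask
mini_unet = Model(inputs, outputs)
```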
Pipeline: candidate nodule selection via UNET → dilation, erosion, and nodule distance merging → false positive reduction via WideResNet.
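A hedged SciPy sketch of the middle, morphological stage of this pipeline; the random mask, iteration counts, and merging step are illustrative stand-ins for the real UNET output and thresholds.

```python
import numpy as np
from scipy import ndimage

mask = np.random.rand(64, 64) > 0.95                # stand-in for a UNET output mask
mask = ndimage.binary_dilation(mask, iterations=2)  # dilation: close small gaps
mask = ndimage.binary_erosion(mask, iterations=1)   # erosion: remove speckle
labels, n = ndimage.label(mask)                     # connected nodule candidates
centroids = ndimage.center_of_mass(mask, labels, range(1, n + 1))
# Candidates whose centroids fall within some distance threshold would then be
# merged, and the survivors passed to the WideResNet false-positive filter.
print(f"{n} candidate nodules, centroids: {centroids}")
```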
Tsukahara, J. S., Harrison, T. L., & Engle, R. W. (2016). The relationship between baseline pupil size and intelligence. Cognitive Psychology, 91, 109-123.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215-243.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Sánchez, J., Perronnin, F., Mensink, T., & Verbeek, J. (2013). Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3), 222-245.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with
convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Johnson, R. C. (2015). Microsoft, Google beat humans at image recognition. EE Times.
Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image
segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp.
234-241). Springer International Publishing.
Thank you for your attention.
Francesco Pugliese
Matteo Testi