Comparison of Fine-tuning and Extension
Strategies for Deep Convolutional Neural
Networks
Nikiforos Pittaras1, Foteini Markatopoulou1,2, Vasileios Mezaris1,
and Ioannis Patras2
1Information Technologies Institute / Centre for Research and Technology Hellas
2Queen Mary University of London
Problem
[figure slide]

Typical solution
[figure slides; annotation: "We evaluate DCNN-based approaches"]

Transfer learning
[figure slide]
Literature review: Fine-tuning strategies

• Replacing the classification layer with a new output layer [2,5,18] (FT1-def in the results below)
  • The new layer is learned from scratch
  • All the other layers are fine-tuned
(a minimal sketch follows this list)
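As a hedged illustration only (the paper's experiments use Caffe-trained networks, not this code), a minimal PyTorch sketch of the replace-the-classifier idea, with torchvision's AlexNet standing in for CaffeNet and a placeholder concept count:

```python
# Minimal sketch of the "replace the classification layer" strategy
# (FT1-def); torchvision's AlexNet stands in for CaffeNet and the
# concept count is a placeholder.
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 345  # placeholder: number of target-domain concepts

model = models.alexnet(weights="IMAGENET1K_V1")  # pre-trained on ImageNet
# New output layer, learned from scratch.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CONCEPTS)
# All other layers keep their pre-trained weights and are fine-tuned
# together with the new layer during training.
```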
• Re-initializing the last N layers [1,9,18] (FT2-reN in the results below)
  • The last N layers are learned from scratch
  • The first M-N layers, M being the network depth, are fine-tuned with a low learning rate [18] or could remain frozen [1,9]
(a minimal sketch follows)
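A corresponding sketch of the re-initialization strategy for N = 2, again in PyTorch with AlexNet as a stand-in; here the earlier layers are frozen, as in [1,9] (fine-tuning them with a low learning rate, as in [18], is the alternative):

```python
# Minimal sketch of "re-initialize the last N layers" (FT2-re2, N = 2);
# AlexNet stands in for CaffeNet, the concept count is a placeholder.
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 345

model = models.alexnet(weights="IMAGENET1K_V1")
# The last two fully-connected layers are re-initialized (learned from scratch).
model.classifier[4] = nn.Linear(4096, 4096)
model.classifier[6] = nn.Linear(4096, NUM_CONCEPTS)
# The first M-N layers remain frozen here; with a small non-zero learning
# rate instead, they would be fine-tuned as in [18].
for p in model.features.parameters():
    p.requires_grad = False
for p in model.classifier[:4].parameters():
    p.requires_grad = False
```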
• Extending the network by one or more fully connected layers [1,8,9,15] (the FT3-ex strategy, detailed next)
FT3-ex: extension strategy
• Add one or more FC layers before the classification layer
• Insert one or more FC layers for each auxiliary classifier
• We use the output of the last three layers as features to train LR classifiers
• We also evaluate the direct output of each network
(a minimal sketch of the extension follows)
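A minimal PyTorch sketch of the extension idea (FT3-ex1-d), under the same stand-in assumptions as above; d and the concept count are placeholders:

```python
# Minimal sketch of FT3-ex1-d: one new fully-connected "extension" layer
# of dimension d is inserted before a new classification layer; both are
# learned from scratch while the pre-trained layers are fine-tuned.
import torch.nn as nn
from torchvision import models

D, NUM_CONCEPTS = 128, 345  # placeholders, e.g. FT3-ex1-128

model = models.alexnet(weights="IMAGENET1K_V1")
in_dim = model.classifier[6].in_features  # 4096 in AlexNet
model.classifier[6] = nn.Sequential(
    nn.Linear(in_dim, D),        # extension layer (new)
    nn.ReLU(inplace=True),
    nn.Linear(D, NUM_CONCEPTS),  # classification layer (new)
)
# FT3-ex2-d would chain two extension layers before the classifier.
```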
Evaluation setup

Dataset: TRECVID SIN 2013
• 800 and 200 hours of Internet Archive videos for training and testing
• One keyframe per video shot
• Evaluated concepts: 38; evaluation measure: MXinfAP (see the AP sketch after this list)

Dataset: PASCAL VOC 2012
• 5717 training, 5823 validation and 10991 test images
• Evaluation on the validation set instead of the original test set
• Evaluated concepts: 20; evaluation measure: MAP

We fine-tuned 3 pre-trained ImageNet DCNNs:
• CaffeNet-1k, trained on 1000 ImageNet categories
• GoogLeNet-1k, trained on the same 1000 ImageNet categories
• GoogLeNet-5k, trained on 5055 ImageNet categories
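Both measures build on average precision per concept; below is a toy sketch of non-interpolated AP (MXinfAP, the TRECVID-style extended inferred AP computed on partially annotated ground truth, is more involved and not reproduced here):

```python
# Toy average-precision computation for one concept; MAP is the
# arithmetic mean of AP over all evaluated concepts (38 or 20 here).
def average_precision(scores, labels):
    """Non-interpolated AP: `labels` are 0/1, `scores` are detector outputs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

print(average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.8333...
```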
Evaluation setup

For each pair of utilized network and fine-tuning strategy we evaluate:
• The direct output of the network
• Logistic regression (LR) classifiers trained on DCNN-based features
  • One LR classifier per concept, trained using the output of one layer
  • The late-fused output (arithmetic mean) of LR classifiers trained using the last three layers (sketched below)
We also evaluate the two auxiliary classifiers of the GoogLeNet-based networks
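A hedged scikit-learn sketch of this protocol for one concept, not the authors' pipeline; feature arrays, dimensions and labels are dummies:

```python
# Sketch: one LR classifier per layer on last-3-layer features, fused by
# arithmetic mean. Feature arrays, dimensions and labels are dummies.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dims = (4096, 1024, 345)  # placeholder dimensions of the last three layers
train_feats = [rng.standard_normal((100, d)) for d in dims]
test_feats = [rng.standard_normal((10, d)) for d in dims]
y = rng.integers(0, 2, 100)  # dummy binary labels for one concept

layer_scores = []
for X_train, X_test in zip(train_feats, test_feats):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y)
    layer_scores.append(clf.predict_proba(X_test)[:, 1])

fused = np.mean(layer_scores, axis=0)  # late fusion: arithmetic mean
```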
Preliminary experiments for parameter selection
• Classification accuracy for the FT1-def strategy and CaffeNet-1k-60-SIN
  • k: the learning-rate multiplier of the pre-trained layers
  • e: the number of training epochs
• [Accuracy table not reproduced here: the best accuracy per e is underlined; the globally best accuracy is bold and underlined]
• Accuracy improves with:
  • Smaller learning rates for the pre-trained layers
  • More training epochs
(a configuration sketch follows)
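The learning-rate multiplier maps naturally onto per-parameter-group learning rates; a PyTorch sketch with placeholder values of k and e, not the values selected in the paper:

```python
# Sketch of the k/e parameterization: pre-trained layers train at
# base_lr * k, the new output layer at base_lr. Values are placeholders.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, 345)  # new output layer (FT1-def)

new_params = list(model.classifier[6].parameters())
new_ids = {id(p) for p in new_params}
pretrained_params = [p for p in model.parameters() if id(p) not in new_ids]

base_lr, k, epochs = 0.001, 0.1, 30  # placeholder k and e
optimizer = torch.optim.SGD(
    [{"params": pretrained_params, "lr": base_lr * k},  # fine-tuned slowly
     {"params": new_params, "lr": base_lr}],            # learned from scratch
    momentum=0.9,
)
```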
Experimental results – # target concepts
• Goal: Assess the impact of the number of target concepts
• Experiment: We fine-tune the network for either 60 or all 345 concepts; we evaluate the same 38 concepts (a subset of the 60) in both cases
• Conclusion: Fine-tuning a network for more concepts improves concept detection accuracy
[Bar chart: MXinfAP(%) per fine-tuning strategy (FT1-def through FT3-ex2-4096) on the TRECVID SIN dataset, comparing CaffeNet-1k-60-SIN and CaffeNet-1k-345-SIN]
Experimental results – direct output
• Goal: Assess DCNNs as standalone classifiers (direct output)
• Experiment: Fine-tuning on the TRECVID SIN dataset
• Conclusion: FT3-ex1-64 and FT3-ex1-128 constitute the top-two methods
irrespective of the employed DCNN
[Bar chart: MXinfAP(%) per fine-tuning strategy on the TRECVID SIN dataset, for CaffeNet-1k-SIN, GoogLeNet-1k-SIN and GoogLeNet-5k-SIN]
Experimental results – direct output
• Goal: Assess DCNNs as standalone classifiers (direct output)
• Experiment: Fine-tuning on the PASCAL VOC dataset
• Conclusion: FT3-ex1-512 and FT3-ex1-1024 are the best-performing strategies for the CaffeNet network
[Bar chart: MAP(%) per fine-tuning strategy on the PASCAL VOC dataset, for CaffeNet-1k-VOC, GoogLeNet-1k-VOC and GoogLeNet-5k-VOC]
Experimental results – direct output
• Goal: Assess DCNNs as standalone classifiers (direct output)
• Experiment: Fine-tuning on the PASCAL VOC dataset
• Conclusion: FT3-ex1-2048 and FT3-ex1-4096 are the top-two methods for the GoogLeNet-based networks
[Same MAP(%) bar chart as above, repeated here to highlight the GoogLeNet-based results]
Experimental results – direct output
• Main conclusion: The FT3-ex strategy with one extension layer is always the best solution
• The optimal dimension of the extension layer depends on the dataset and the network architecture
Experimental results – DCNN features
• Goal: Assess DCNNs as feature generators (DCNN-based features)
• Experiment: LR concept detectors trained on the output of the last 3 layers and fused by arithmetic mean
• Conclusion: FT3-ex1-512 is in the top-five methods; FT3-ex2-64 is always among the five worst fine-tuning methods
[Bar chart: MXinfAP(%) per fine-tuning strategy on the TRECVID SIN dataset (DCNN-feature results), for CaffeNet-1k-SIN, GoogLeNet-1k-SIN and GoogLeNet-5k-SIN]
Experimental results – DCNN features
• The same conclusions hold for the PASCAL VOC Dataset
[Bar chart: MAP(%) per fine-tuning strategy on the PASCAL VOC dataset (DCNN-feature results), for CaffeNet-1k-VOC, GoogLeNet-1k-VOC and GoogLeNet-5k-VOC]
Experimental results – DCNN features
• Main conclusion: The FT3-ex strategy almost always outperforms the other two fine-tuning strategies
• FT3-ex1-512 is in the top-five methods
• Additional conclusions (drawn from results presented in the paper):
  • Features extracted from the top layers are more accurate than those from layers positioned lower in the network; the optimal layer varies, depending on the target-domain dataset
  • It is better to combine features extracted from many layers
  • The presented results correspond to the fused output of the last 3 layers
Conclusions
• The extension strategy almost always outperforms all the other strategies
  • Increase the depth with one fully-connected layer
  • Fine-tune the rest of the layers
• DCNN-based features significantly outperform the direct output
  • In a few cases the direct output works comparably well
  • Choose based on the application in which the DCNN will be used, e.g., the time and memory limitations of real-time applications
• It is better to combine features extracted from many layers
References
[1] Campos, V., Salvador, A., Giro-i Nieto, X., Jou, B.: Diving deep into sentiment: understanding fine-tuned
CNNs for visual sentiment prediction. In: 1st International Workshop on Affect and Sentiment in Multimedia
(ASM 2015), pp. 57–62. ACM, Brisbane (2015)
[2] Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: delving deep into
convolutional nets. In: British Machine Vision Conference (2014)
[5] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and
semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR 2014) (2014)
[8] Markatopoulou, F., et al.: ITI-CERTH participation in TRECVID 2015. In: TRECVID 2015 Workshop. NIST,
Gaithersburg (2015)
[9] Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using
convolutional neural networks. In: Computer Vision and Pattern Recognition (CVPR 2014) (2014)
[15] Snoek, C., Fontijne, D., van de Sande, K.E., Stokman, H., et al.: Qualcomm Research and University of
Amsterdam at TRECVID 2015: recognizing concepts, objects, and events in video. In: TRECVID 2015 Workshop.
NIST, Gaithersburg (2015)
[18] Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? CoRR
abs/1411.1792 (2014)
Thank you for your attention!
Questions?
More information and contact:
Dr. Vasileios Mezaris
bmezaris@iti.gr
http://www.iti.gr/~bmezaris