Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman



Update – 15th September 2014
We won ‘Best Scientific Paper Award’ at BMVC 2014!
Update – 15th July 2014
Precompiled MEX files and models for computing the ConvNet features in the paper are now available (see below). Full source code is also available.

Overview

The latest generation of Convolutional Neural Networks (CNNs) has achieved impressive results on challenging benchmarks for image recognition and object detection, significantly raising the community's interest in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This work conducts a rigorous evaluation of these new techniques: we explore different deep architectures and compare them on a common ground, identify and disclose important implementation details in a similar vein to our previous work on shallow encoding methods, and identify the aspects of deep and shallow methods that can successfully be shared.

We evaluate on four datasets (PASCAL VOC 2007 and 2012, Caltech-101, and Caltech-256), and our best method achieves state-of-the-art performance on all of them. We release the full source code and CNN models for the experiments on this page, in the hope that they will provide good baselines for future image representation research.
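To make the comparison concrete, the sketch below illustrates the common evaluation protocol applied to every representation in the paper: a fixed-length image descriptor is L2-normalised and classified with a one-vs-rest linear SVM. This is a minimal illustration only; the random placeholder data and variable names are our own, not taken from the released code, and it requires MATLAB's Statistics and Machine Learning Toolbox.

    % Minimal sketch of the shared evaluation protocol: L2-normalised
    % features + one-vs-rest linear SVM. Random placeholder data stands
    % in for real CNN or Fisher Vector descriptors.
    n = 200; d = 4096; nclass = 5;                 % e.g. d = 4096 for CNN features
    X = randn(n, d);                               % placeholder feature matrix
    y = randi(nclass, n, 1);                       % placeholder class labels

    X = bsxfun(@rdivide, X, sqrt(sum(X.^2, 2)));   % L2-normalise each descriptor

    tmpl  = templateSVM('KernelFunction', 'linear');
    model = fitcecoc(X, y, 'Learners', tmpl, 'Coding', 'onevsall');
    acc   = mean(predict(model, X) == y);          % training accuracy (sanity check)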

Results


Method                     ILSVRC-2012     VOC-2007   VOC-2012      Caltech-101     Caltech-256
                           (top-5 error)   (mAP)      (mAP)         (accuracy)      (accuracy)
FK IN 512                  -               68.0       -             -               -
CNN M 2048                 13.5            80.1       82.3          -               -
CNN S                      13.1            79.6       82.7          88.54 ± 0.33    78.82 ± 0.31
CNN S TUNE-CLS             13.1            -          83.0          88.35 ± 0.56    -
CNN S TUNE-RNK             13.1            82.4       83.2          -               -
Zeiler & Fergus [2]        16.1            -          79.0          86.5 ± 0.5      74.2 ± 0.3
Razavian et al. [3], [4]   14.7            77.2       -             -               -
Oquab et al. [5]           18              77.7       78.7 / 82.8   -               -

Software & Paper Updates

Software to compute the ConvNet features used in the paper is now available from the software page:

Encoder Package Download Page
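As a rough illustration of how the released feature computation code fits into a pipeline, the sketch below follows the preprocessing described in the paper: resize the image to the 224x224 network input, subtract the mean image, run a forward pass, and L2-normalise the resulting descriptor. The model file name and the function compute_cnn_features are hypothetical placeholders, not the package's actual interface; consult the package documentation for the real MEX entry points.

    % HYPOTHETICAL usage sketch -- compute_cnn_features and the model file
    % name below are placeholders for the actual MEX interface shipped in
    % the encoder package.
    load('CNN_S.mat');                        % assumed to provide: net, mean_image

    im = single(imread('example.jpg'));       % input image
    im = imresize(im, [224 224]);             % networks take 224x224 RGB input
    im = im - mean_image;                     % subtract the training-set mean image

    feat = compute_cnn_features(net, im);     % placeholder: forward pass, returning
                                              % the last hidden layer's activations
    feat = feat / norm(feat(:));              % L2-normalise, as in the paper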

In addition to the feature computation binaries and CNN models available above, the full source code has now been released, and we will continue to update the paper, with new versions available on the arXiv page.

Related publications


K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman
Return of the Devil in the Details: Delving Deep into Convolutional Nets   Best Scientific Paper Award
British Machine Vision Conference, 2014

K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman
The devil is in the details: an evaluation of recent feature encoding methods
British Machine Vision Conference, 2011

Acknowledgements

Funding is provided by the EPSRC, ERC grant VisRec no. 228180, and EU Project FP7 AXES ICT-269980. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
