Fine tuning a convolutional network for cultural event recognition

FINE-TUNING A CONVOLUTIONAL NETWORK
FOR CULTURAL EVENT RECOGNITION
ADVISORS:
Andrea Calafell
Xavier Giró-i-Nieto Amaia Salvador
20/07/2015
AUTHOR:
Matthias Zeppelzauer

OUTLINE
1. Motivation and State of the art
2. Baseline
3. Study of the dataset bias
4. Denoising
5. Fracking
6. Fine-tuning deeper layers only
7. Ensemble of event detectors
8. Conclusions and future work

MOTIVATION: Cultural Heritage
Chinese New Year

MOTIVATION: Cultural Heritage
Carnival Rio

Onsite social media is big data...

...and online explorers need our help

CHALEARN: Looking at People
Training set: 5,875 images
Validation set: 2,332 images
Test set: 3,569 images
50 events

MOTIVATION: Goals
● Improve the results obtained in the ChaLearn Challenge.
● Exploit the noisy data collected from Flickr.

STATE OF THE ART: CaffeNet
Content: Visual
Context: Time stamp, Geolocation, Text
Zaharieva’15 X X X
Mattivi’11 X X
Bossard’13 X X
Cao’08 X X X
Sutanto’13 X
Schinas’12 X X
Brenner’13 X X
Nguyen’13 X X
MediaEval Social Event Detection

STATE OF THE ART: CaffeNet
CaffeNet
Architecture: [Krizhevsky’12]
Software: [Jia’14]
Data: [Deng’09]

STATE OF THE ART: CNN ARCHITECTURE
Convolutional Neural Network architecture
Babenko et al, Neural codes for image retrieval. In Computer Vision-ECCV, 2014

STATE OF THE ART: Object+Scene CNNs
Object-Scene Convolutional Neural Network for event recognition
Wang et al, Object-scene convolutional neural networks for event recognition in images. In CVPRW, 2015


BASELINE: Fine-tuning a ConvNet
50 events

BASELINE: ChaLearn @ CVPRW 2015
Awarded the 2nd prize of the Cultural Event Recognition Challenge in the ChaLearn Workshop at CVPR 2015.
Salvador, A., Giró-i-Nieto, X., Calafell, A., et al. Cultural Event Recognition with Visual ConvNets and Temporal Models. In CVPRW, 2015.



ConvNets need to be trained with...
a large amount of labeled images

but clean data is expensive...
and downloading noisy data in an unsupervised fashion is easier and cheaper.

NOISY DATA: Flickr Dataset
Flickr dataset: 4,068
50 events

DATASET BIAS
Dataset bias when fine-tuning with the ChaLearn or Flickr dataset:


DENOISING THE FLICKR DATASET
Mosaic of Queens Day from ChaLearn
Mosaic of Queens Day from Flickr

Example event: Annual Buffalo Roundup
Fine-tuned model with ChaLearn
New subset from
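
One way to read the denoising step is as a filter: the model fine-tuned on ChaLearn scores each Flickr image, and an image is kept only when the model agrees with its Flickr label. The sketch below illustrates that reading only. The function name `denoise`, the agreement rule and the `min_confidence` threshold are assumptions for illustration and are not taken from the slides.

```python
def denoise(samples, min_confidence=0.5):
    """Keep noisy samples whose label the reference model confirms.

    samples: list of (image_id, noisy_label, probs), where probs maps each
             event name to the probability predicted by the reference model.
    min_confidence: hypothetical threshold on the predicted probability.
    """
    kept = []
    for image_id, noisy_label, probs in samples:
        predicted = max(probs, key=probs.get)  # most likely event
        if predicted == noisy_label and probs[predicted] >= min_confidence:
            kept.append(image_id)
    return kept


# Toy predictions standing in for the output of the fine-tuned model.
flickr = [
    ("a.jpg", "Queens Day", {"Queens Day": 0.9, "Carnival Rio": 0.1}),
    ("b.jpg", "Queens Day", {"Queens Day": 0.2, "Carnival Rio": 0.8}),
    ("c.jpg", "Carnival Rio", {"Queens Day": 0.45, "Carnival Rio": 0.55}),
]
clean_subset = denoise(flickr, min_confidence=0.6)
print(clean_subset)
```

A stricter threshold yields a smaller but cleaner subset; a looser one keeps more images at the cost of more label noise.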

BASELINE: Dataset ordering during fine-tuning
CaffeNet
FINE-TUNING JOINT:

Joint fine-tuning of the clean and noisy datasets:
baseline: 0.6136

CaffeNet
FINE-TUNING: FINE-TUNING:

Sequential fine-tuning of the clean and noisy datasets:
baseline: 0.6136

CaffeNet
FINE-TUNING:FINE-TUNING:

Sequential fine-tuning of the noisy and clean datasets:
baseline: 0.6136
+1.3%
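
Sequential fine-tuning can be sketched as a chain of Caffe training runs in which each stage starts from the weights saved by the previous one. The helper below only builds the `caffe train` command lines. All file names are hypothetical, and the snapshot path of each stage is assumed to be known in advance (in practice it depends on the solver's snapshot settings).

```python
def stage_commands(pretrained, stages):
    """Build one `caffe train` command per fine-tuning stage.

    pretrained: path to the initial weights (e.g. CaffeNet).
    stages: list of (solver_path, snapshot_path); each stage is initialised
            from the snapshot written by the stage before it.
    """
    commands = []
    weights = pretrained
    for solver, snapshot in stages:
        commands.append(
            ["caffe", "train", "--solver=" + solver, "--weights=" + weights]
        )
        weights = snapshot  # the next stage starts from this stage's output
    return commands


# Hypothetical file names: noisy data (Flickr) first, clean data (ChaLearn) last.
commands = stage_commands(
    "caffenet.caffemodel",
    [
        ("solver_flickr.prototxt", "flickr_stage.caffemodel"),
        ("solver_chalearn.prototxt", "chalearn_stage.caffemodel"),
    ],
)
for cmd in commands:
    print(" ".join(cmd))
```

Swapping the two entries in `stages` gives the clean-then-noisy ordering, so both sequential variants can be compared with the same helper.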


FRACKING: MINING +/- SAMPLES

FRACKING THE TRAINING DATASET
Example event: Pingxi Lantern Festival
Fine-tuned model with ChaLearn
New subset from
Hard negatives
Hard positives
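
The slides do not spell out how the hard samples are selected. A common interpretation of hard-example mining is sketched below: hard positives are true examples of the event that the model scores lowest, and hard negatives are images of other events that the model scores highest. The function name `frack` and the cut-off `k` are assumptions for illustration.

```python
def frack(scores, labels, event, k=2):
    """Select the k hardest positives and negatives for one event.

    scores: dict image_id -> model score for `event` (higher = more confident)
    labels: dict image_id -> ground-truth event name
    """
    positives = [i for i in scores if labels[i] == event]
    negatives = [i for i in scores if labels[i] != event]
    # Hard positives: true examples of the event with the lowest scores.
    hard_pos = sorted(positives, key=lambda i: scores[i])[:k]
    # Hard negatives: examples of other events with the highest scores.
    hard_neg = sorted(negatives, key=lambda i: scores[i], reverse=True)[:k]
    return hard_pos, hard_neg


event = "Pingxi Lantern Festival"
scores = {"p1": 0.95, "p2": 0.30, "n1": 0.80, "n2": 0.05}
labels = {"p1": event, "p2": event, "n1": "Queens Day", "n2": "Carnival Rio"}
hard_pos, hard_neg = frack(scores, labels, event, k=1)
print(hard_pos, hard_neg)
```

The selected images form the new subset on which the network is fine-tuned again.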

CaffeNet
FINE-TUNING: Fine-tuning with fracking subset from:

FRACKING THE TRAINING DATASET
Results of fine-tuning using fracking on images from ChaLearn:
baseline: 0.61365
+0.9%


FINE-TUNING DEEPER LAYERS ONLY
Layer 2 responds to corners and other edge/color conjunctions.

Layer 3 has more complex invariances, capturing similar textures.
Zeiler et al, Visualizing and Understanding Convolutional Networks. In Computer Vision-ECCV, 2014

[Figure: CNN diagram with fully connected layers FC6, FC7, FC8 and the label 50]
Andrej Karpathy. Convolutional neural networks for visual recognition. In Stanford CS class CS231n.

Results of fine-tuning only the deeper layers:
baseline: 0.61365
+3%

Results of fine-tuning only the deeper layers:
baseline: 0.6136
+4%
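
In Caffe, a layer can be frozen by setting the `lr_mult` of its parameters to 0 in the network definition. The sketch below computes such multipliers for the CaffeNet layers. Which layer the slides start training from is not stated, so `"fc6"` is only an example value, and the helper name `lr_multipliers` is hypothetical.

```python
CAFFENET_LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]


def lr_multipliers(first_trainable, layers=CAFFENET_LAYERS):
    """Return a learning-rate multiplier per layer: 0.0 freezes, 1.0 trains.

    Layers before `first_trainable` keep their pretrained weights; that layer
    and every deeper layer are fine-tuned.
    """
    start = layers.index(first_trainable)
    return {name: (0.0 if i < start else 1.0) for i, name in enumerate(layers)}


mults = lr_multipliers("fc6")  # example choice, not taken from the slides
for name in CAFFENET_LAYERS:
    print(name, mults[name])
```

Moving `first_trainable` earlier or later in the list changes how much of the pretrained network is kept fixed.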



ENSEMBLE OF EVENT DETECTORS
SINGLE CONVNET FOR THE 50 EVENTS:

ONE CONVNET FOR EACH EVENT:

Results of the ensemble of binary detectors:
baseline: 0.6136
+6.6%
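
An ensemble of binary detectors can be combined by running every per-event detector on an image and collecting one score per event. The sketch below uses toy functions in place of the fine-tuned binary convnets; the detector functions, the image features and the helper names are all hypothetical.

```python
def ensemble_scores(detectors, image):
    """Run every binary detector on one image and collect its positive score."""
    return {event: detect(image) for event, detect in detectors.items()}


def top_event(scores):
    """Return the event whose detector gave the highest score."""
    return max(scores, key=scores.get)


# Toy detectors standing in for one fine-tuned binary convnet per event.
detectors = {
    "Carnival Rio": lambda image: image["feathers"],
    "Chinese New Year": lambda image: image["lanterns"],
}
image = {"feathers": 0.2, "lanterns": 0.7}
scores = ensemble_scores(detectors, image)
print(top_event(scores))
```

Because each detector is trained separately, its scores are not guaranteed to be on a comparable scale, so a real system may need to calibrate them before ranking.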


CONCLUSIONS
● The Flickr dataset helped us improve the score when we swapped the order in which we used the clean and noisy datasets (+1.3%).

CONCLUSIONS
● The network succeeds in improving its performance by learning from its own mistakes when applying fracking (+0.9%).

CONCLUSIONS
● The results are better if we keep the weights learned in the earlier layers from a very large dataset (+4%).

CONCLUSIONS
● Fine-tuning one convnet for each class increases the score (+6.6%).

FUTURE WORK
● Mix our current solution with a scene CNN fine-tuned on PLACES, and with other local solutions.
● Compete (and try to win) in ChaLearn @ ICCV 2015!
