An In-Depth Evaluation of Multimodal Video Genre Categorization

University Politehnica of Bucharest
Ionuț MIRONICĂ1, imironica@imag.pub.ro
Bogdan IONESCU1,2, bionescu@imag.pub.ro
Peter KNEES3, peter.knees@jku.at
Patrick LAMBERT2, patrick.lambert@univ-savoie.fr
11th International Workshop on Content-Based Multimedia Indexing,
CBMI 2013, Veszprém, Hungary, June 17-19, 2013.
Monday, April 29, 2013 CBMI 2013

Presentation outline
• Introduction
• Video Content Description
• Fusion Techniques
• Experimental results
• Conclusions

Problem Statement
Concepts
• Content Based Video Retrieval
• Genre Retrieval
[Figure: a genre query is run against the video database and returns the query results]

Global Approach
> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm;
[Figure: a classifier is trained on a tagged video database (labeled data, e.g. web, food, autos) and is then used to label the unlabeled data of the video database]

Global Approach
• the entire process relies on the concept of “similarity” computed between
content annotations (numeric features),
• we focus on:
objective 1: go (truly) multimodal: visual, audio, text & metadata
objective 2: test a broad range of classifiers
objective 3: test a broad range of fusion techniques

Video Content Description - audio
Standard audio features (audio frame-based)
• Zero-Crossing Rate,
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.
global feature = mean & variance of the frame-level features f1, f2, …, fn over time
[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]
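The aggregation above (frame-level features collapsed into a mean and a variance) can be sketched as follows. This is only an illustration, not the Yaafe pipeline used by the authors: the frame length, the hop size, the toy sine signal and all function names are assumptions, and only the Zero-Crossing Rate is computed.

```python
import math
from statistics import fmean, pvariance

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    changes = sum((a < 0) != (b < 0) for a, b in zip(frame, frame[1:]))
    return changes / (len(frame) - 1)

def global_feature(frame_values):
    """Collapse a per-frame feature track into [mean, variance]."""
    return [fmean(frame_values), pvariance(frame_values)]

def zcr_descriptor(signal, frame_len=1024, hop=512):
    """Frame-level ZCR followed by mean/variance aggregation over time."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    track = [zero_crossing_rate(signal[s:s + frame_len]) for s in starts]
    return global_feature(track)

# Toy input (assumption): one second of a 440 Hz sine sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
zcr_mean, zcr_var = zcr_descriptor(signal)
```

The same `global_feature` step would apply to any other frame-level feature track, so the final audio descriptor is the concatenation of one mean and one variance per feature.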

Video Content Description - visual
MPEG-7 & color/texture descriptors (visual frame-based)
• Classic color histogram,
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Structure Color Descriptor,
• Color moments.
[OpenCV toolbox, http://opencv.willowgarage.com]
global feature = mean & dispersion & skewness & kurtosis & median & root mean square of the frame-level features f1, f2, …, fn over time
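A minimal sketch of the six statistics listed above, computed for one feature dimension tracked over the frames of a video. The slide does not define "dispersion"; the standard deviation is assumed here, and the kurtosis is the plain (non-excess) fourth standardized moment. The function name and the toy values are hypothetical.

```python
import math
from statistics import fmean, median

def global_statistics(track):
    """Six statistics of one feature dimension tracked over frames."""
    n = len(track)
    mean = fmean(track)
    var = sum((x - mean) ** 2 for x in track) / n
    std = math.sqrt(var)  # assumption: "dispersion" means standard deviation
    # Standardized third and fourth moments; 0.0 for a constant track.
    skew = sum((x - mean) ** 3 for x in track) / (n * std ** 3) if std else 0.0
    kurt = sum((x - mean) ** 4 for x in track) / (n * std ** 4) if std else 0.0
    rms = math.sqrt(sum(x * x for x in track) / n)
    return {"mean": mean, "dispersion": std, "skewness": skew,
            "kurtosis": kurt, "median": median(track), "rms": rms}

# Toy track (assumption): one histogram bin observed over five frames.
stats = global_statistics([0.2, 0.4, 0.4, 0.6, 0.9])
```

Applying this to every dimension of a frame-level descriptor yields a global descriptor six times as long as the frame-level one.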

Feature descriptors
Bag-of-Visual-Words Framework
• we train the model with 4,096 words
• rgbSIFT and spatial pyramids (2x2)
Pipeline: detection of interest points, codewords dictionary, generation of BoW histograms, classifier training
[CIVR 2009, J. Uijlings et al.]
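The histogram-generation step can be illustrated with a small sketch. It assumes hard assignment of each local descriptor to its nearest codeword and an L1-normalized histogram; the slides do not state the assignment scheme, and the tiny 3-word dictionary below stands in for the 4,096-word one. Interest-point detection, rgbSIFT extraction and the spatial pyramid are not shown.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean)."""
    def dist(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))

def bow_histogram(descriptors, codebook):
    """L1-normalized histogram of codeword assignments."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]

# Toy data (assumption): 2-D descriptors and a 3-word dictionary.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [0.8, 0.9], [0.1, 0.8]]
hist = bow_histogram(descriptors, codebook)
```

The resulting histogram is the fixed-length vector that is passed to the classifier.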

Feature descriptors
Histogram of Oriented Gradients (HoG)
• divides the image into 3x3 cells and, for each cell, builds a pixel-wise histogram of
edge orientations.
[CITS 2009, O. Ludwig et al.]
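A rough sketch of the 3x3-cell orientation histogram described above. It is not the cited implementation: central-difference gradients, unsigned orientations, 8 bins and magnitude weighting are all assumptions made for illustration.

```python
import math

def hog_3x3(image, n_bins=8):
    """Concatenate per-cell orientation histograms over a 3x3 grid."""
    h, w = len(image), len(image[0])
    hist = [[0.0] * n_bins for _ in range(9)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue  # flat pixel, no orientation to record
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            b = min(int(angle / math.pi * n_bins), n_bins - 1)
            cell = (y * 3 // h) * 3 + (x * 3 // w)
            hist[cell][b] += mag  # each pixel votes with its gradient magnitude
    return [v for cell_hist in hist for v in cell_hist]

# Toy 9x9 image (assumption) with a vertical step edge: gradients point along x.
image = [[0.0] * 4 + [1.0] * 5 for _ in range(9)]
feature = hog_3x3(image)
```

For this toy image, all gradient energy falls into the first orientation bin of the cells that contain the edge.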

Structural descriptors
Objective: describe structural information in terms of contours and
their relations;
Contour properties:
• b: degree of curvature (proportional to the maximum amplitude of the
bowness space); straight vs. bow
• z: degree of circularity; ½ circle vs. full circle
• e: edginess parameter; zig-zag vs. sinusoid
• y: symmetry parameter; irregular vs. “even”
+ Appearance parameters:
• c_m, c_s: mean, std. dev. of intensity along the contour;
• f_m, f_s: fuzziness, obtained from a blob (DOG) filter: I * DOG
[IJCV, C. Rasche’10]

Video Content Description - text
TF-IDF descriptors
(Term Frequency-Inverse Document Frequency)
Text sources: ASR and metadata
1. remove XML markup,
2. remove terms below the 5th percentile of the frequency distribution,
3. select the term corpus: for each genre class, retain the m terms (e.g. m =
150 for ASR and 20 for metadata) with the highest χ2 values that
occur more frequently than in the complement classes,
4. represent each document by the TF-IDF values of the selected terms.
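Step 4 can be sketched as below. The slides do not give the exact weighting, so a common variant is assumed: term frequency as relative frequency within the document and IDF as log(N / document frequency). The vocabulary is assumed to be the output of the χ2 selection in step 3, which is not shown; the toy documents and names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents, vocabulary):
    """One TF-IDF vector per tokenized document, restricted to `vocabulary`."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vec = []
        for term in vocabulary:
            tf = counts[term] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Toy corpus (assumption): three tokenized documents, three selected terms.
docs = [["goal", "match", "team"],
        ["recipe", "oven", "team"],
        ["goal", "goal", "score"]]
vocab = ["goal", "recipe", "team"]
vectors = tfidf_vectors(docs, vocab)
```

Each document ends up as a fixed-length vector over the selected terms, which is what the classifiers consume.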

Classifiers
We test a broad range of classifiers:
• SVM with linear, RBF and Chi kernels
• 5-NN
• Random Trees and Extremely Random Trees
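The slide names a "Chi" kernel without giving its formula. One widely used form, assumed here, is the exponential chi-square kernel for non-negative features such as histograms; the `gamma` parameter and the function name are illustrative.

```python
import math

def chi_square_kernel(x, y, gamma=1.0):
    """exp(-gamma * sum((x_i - y_i)^2 / (x_i + y_i))) for non-negative inputs."""
    # Dimensions where both entries are zero contribute nothing.
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)

# Toy histograms (assumption).
h1 = [0.25, 0.50, 0.25]
h2 = [0.50, 0.25, 0.25]
k_same = chi_square_kernel(h1, h1)
k_diff = chi_square_kernel(h1, h2)
```

Identical histograms give a kernel value of 1, and the value decreases as the histograms diverge.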

Fusion Techniques
Early Fusion
1. Feature extraction: Descriptor 1, Descriptor 2, …, Descriptor n
2. Feature normalization: Descriptor 1 normalized, Descriptor 2 normalized, …, Descriptor n normalized
3. Feature concatenation: the normalized descriptors form the Global Descriptor
4. Classification step: a single Classifier
5. Decision: obtain the Global Confidence Score
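The early fusion steps (normalize each descriptor, then concatenate) can be sketched as follows. The slides do not say which normalization is used; per-dimension min-max scaling is assumed, and the toy "audio" and "visual" descriptors are hypothetical.

```python
def min_max_normalize(vectors):
    """Scale one descriptor (a list of per-video vectors) to [0, 1] per dimension."""
    dims = range(len(vectors[0]))
    lo = [min(v[d] for v in vectors) for d in dims]
    hi = [max(v[d] for v in vectors) for d in dims]
    return [[(v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
             for d in dims]
            for v in vectors]

def early_fusion(descriptors):
    """Normalize each descriptor, then concatenate them per video."""
    normalized = [min_max_normalize(desc) for desc in descriptors]
    n_videos = len(descriptors[0])
    return [sum((desc[i] for desc in normalized), []) for i in range(n_videos)]

# Toy data (assumption): 3 videos, a 2-D and a 1-D descriptor.
audio = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
visual = [[100.0], [300.0], [200.0]]
global_descriptors = early_fusion([audio, visual])
```

A single classifier is then trained on the concatenated vectors, which is what distinguishes early fusion from late fusion.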

Fusion Techniques
Late Fusion
1. Feature extraction: Descriptor 1, Descriptor 2, …, Descriptor n
2. Classification step: one classifier per descriptor (Classifier 1, Classifier 2, …, Classifier n)
3. Confidence scores normalization: Confidence value 1, Confidence value 2, …, Confidence value n (normalized)
4. Decision: the normalized confidence values are combined into the Global Confidence Score

Fusion Techniques
Late Fusion (CombSUM, CombMean, CombMNZ, CombRank)
In the combination rules:
- cv_i is the confidence value of classifier i for class q, d is the current
video, each classifier i has an associated weight, and N is the number of classifiers to be aggregated;
- rank() is the rank given by classifier i.
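The combination formulas themselves are not reproduced in this text. The sketch below uses the definitions commonly given for these rules in the retrieval literature, which may differ in detail from the authors' versions: CombSUM as a weighted sum, CombMean as that sum divided by N, CombMNZ as the sum multiplied by the number of classifiers with non-zero confidence, and CombRank as a weighted sum of rank scores. The rank direction (highest confidence gets the highest score) and all names are assumptions.

```python
def comb_sum(cv, weights):
    """Weighted sum of the confidence values of one video for one class."""
    return sum(w * c for w, c in zip(weights, cv))

def comb_mean(cv, weights):
    """Weighted sum divided by the number of classifiers N."""
    return comb_sum(cv, weights) / len(cv)

def comb_mnz(cv, weights):
    """Weighted sum multiplied by the count of non-zero confidence values."""
    nonzero = sum(1 for c in cv if c > 0)
    return nonzero * comb_sum(cv, weights)

def comb_rank(cv_per_video, weights):
    """Weighted sum of rank scores; needs the scores of all videos."""
    n_videos = len(cv_per_video)
    fused = [0.0] * n_videos
    for i, w in enumerate(weights):
        order = sorted(range(n_videos), key=lambda v: cv_per_video[v][i])
        for score, v in enumerate(order, start=1):  # lowest confidence scores 1
            fused[v] += w * score
    return fused

# Toy data (assumption): 2 videos, 3 classifiers, one class.
weights = [0.5, 0.25, 0.25]
video_a = [0.8, 0.4, 0.0]
video_b = [0.9, 0.6, 0.5]
fused_ranks = comb_rank([video_a, video_b], weights)
```

CombMNZ rewards videos on which several classifiers agree, since a zero confidence from one classifier lowers the multiplier.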

Experimental Setup
MediaEval 2012 Dataset - Tagging Task
• 14,838 episodes from 2,249 shows (about 3,260 hours of data)
• split into Development and Test sets:
5,288 episodes for development / 9,550 for test
• focuses on semi-professional video on the Internet

Experimental Setup
MediaEval 2012 Dataset
• 26 genre labels
1000 art 1001 autos_and_vehicles
1002 business 1003 citizen_journalism
1004 comedy 1005 conferences_and_other_events
1006 default_category 1007 documentary
1008 educational 1009 food_and_drink
1010 gaming 1011 health
1012 literature 1013 movies_and_television
1014 music_and_entertainment 1015 personal_or_auto-biographical
1016 politics 1017 religion
1018 school_and_education 1019 sports
1020 technology 1021 the_environment
1022 the_mainstream_media 1023 travel
1024 videoblogging 1025 web_development_and_sites

Experimental Setup
• Mean Average Precision
summarizes rankings from multiple queries by averaging average precision
• Classifier parameters and late fusion weights were
optimized on the training dataset
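A minimal sketch of Mean Average Precision as described above: average precision is computed per query (here, per genre) from a ranked list of relevance flags, and the values are then averaged. It assumes every relevant item appears somewhere in the ranked list; the official evaluation tool may handle ties and missing items differently.

```python
def average_precision(ranked_relevance):
    """AP of one ranked list; entries are True where the item is relevant."""
    hits, precisions = 0, []
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)  # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Toy rankings (assumption): two queries with four ranked videos each.
rankings = [[True, False, True, False],
            [False, True, False, False]]
map_score = mean_average_precision(rankings)
```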

Evaluation
(1) Classification performance on individual modalities
Feature                 SVM Linear  SVM RBF  SVM Chi  5-NN    Random Forest  Ext. Random Forests
HoG                     9.08%       25.63%   22.44%   17.92%  16.62%         23.44%
Bag of Words            14.63%      17.61%   19.96%   8.55%   14.89%         16.32%
MPEG-7                  6.12%       4.26%    17.49%   9.61%   20.90%         26.17%
Structural Descriptors  7.55%       17.17%   22.76%   8.65%   13.85%         14.85%
Audio descriptors       20.68%      24.52%   35.56%   18.31%  34.41%         42.33%
TF-IDF on ASR           32.96%      35.05%   28.85%   12.96%  30.56%         27.93%
TF-IDF on Metadata      56.33%      58.14%   47.95%   57.19%  58.66%         57.52%
(MAP values)

Evaluation
(1) Classification performance on individual modalities (visual)
Visual Performance (MAP values)
- Best performance with MPEG-7 (Extremely Random Forests, 26.17%) and HoG (SVM RBF, 25.63%)
- Bag-of-Visual-Words performs comparatively poorly (at most 19.96%, with SVM Chi)

Evaluation
(1) Classification performance on individual modalities (audio)
Audio Performance (MAP values)
- Best performance with Extremely Random Forests (42.33%)
- Audio descriptors provide higher discriminative power than visual features

Evaluation
(1) Classification performance on individual modalities (text)
Text Performance (MAP values)
- Best performance with metadata and Random Forests (58.66%)
- ASR provides lower performance than audio
- Metadata features outperform all other features

Evaluation
(2) Performance on Multimodal Integration
            CombSUM  CombMean  CombMNZ  CombRank  Early Fusion
All Visual  35.82%   36.76%    38.21%   30.90%    30.11%
All Audio   43.86%   44.19%    44.50%   41.81%    42.33%
All Text    62.62%   62.81%    62.69%   50.60%    55.68%
All         64.24%   65.61%    65.82%   53.84%    60.12%
(MAP values)
Fusion Techniques Performance
- score-based late fusion (CombSUM, CombMean, CombMNZ) provides higher performance than early fusion
- CombMNZ tends to provide the most accurate results

Evaluation
(3) Comparison to MediaEval 2012 Tagging task results
Team           Modality       MAP     Method
proposed       all            65.82%  Late Fusion CombMNZ with all descriptors
proposed       text           62.81%  Late Fusion CombMean with TF-IDF of ASR and metadata
TUB            text           52.25%  Naive Bayes with Bag of Words on text (metadata)
proposed       all            51.9%   Late Fusion CombMNZ with all descriptors except for metadata
proposed       audio          44.50%  Late Fusion CombMean with standard audio descriptors
proposed       visual         38.21%  Late Fusion CombMean with MPEG-7 related, structural, HoG and B-o-VW with rgbSIFT
ARF            text           37.93%  SVM linear on early fusion of TF-IDF of ASR and metadata
TUD            visual & text  35.81%  Late Fusion of SVM with B-o-W (visual word, ASR & metadata)
KIT            visual         35.81%  SVM with Visual descriptors (color, texture, B-o-VW with rgbSIFT)
TUD-MM         text           25.00%  Dynamic Bayesian networks on text (ASR & metadata)
UNICAMP-UFMG   visual         21.12%  Late fusion (KNN, Naive Bayes, SVM, Random Forests) with BOW (text ASR)
ARF            audio          18.92%  SVM linear with block-based audio features
(MAP values)

Conclusions
> we provided an in-depth evaluation of truly multimodal video description in the
context of a real-world genre-categorization scenario;
> we demonstrated the potential of appropriate late fusion for genre categorization,
which achieves very high categorization performance (65.82% MAP with all descriptors);
> we showed that late fusion can boost the performance of automated content
descriptors (51.9% MAP without metadata) close to that of metadata-based approaches;
> we set a new baseline for the Genre Tagging Task by outperforming the
other participants;
Acknowledgement: we would like to thank Prof. Nicu Sebe and Dr. J.
Uijlings from University of Trento for their support.
We also acknowledge the 2012 Genre Tagging Task of the MediaEval
Multimedia Benchmark for the dataset (http://www.multimediaeval.org/).

Thank you!
Questions?
