University Politehnica of Bucharest
An In-Depth Evaluation of Multimodal Video Genre Categorization
Ionuț MIRONICĂ (1), imironica@imag.pub.ro
Bogdan IONESCU (1,2), bionescu@imag.pub.ro
Peter KNEES (3), peter.knees@jku.at
Patrick LAMBERT (2), patrick.lambert@univ-savoie.fr
11th International Workshop on Content-Based Multimedia Indexing, CBMI 2013, Veszprém, Hungary, June 17-19, 2013.
Monday, April 29, 2013 CBMI 2013
Presentation outline 
• Introduction 
• Video Content Description 
• Fusion Techniques 
• Experimental results 
• Conclusions 
Problem Statement
Concepts
• Content Based Video Retrieval
• Genre Retrieval
[Figure: a genre query is submitted to the video database, which returns the query results.]
Global Approach
> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm;
[Figure: a classifier is trained on a tagged video database (example tags: web, food, autos) and then labels the unlabeled data of the video database.]
Global Approach
• the entire process relies on the concept of “similarity” computed between content annotations (numeric features),
• We focus on:
objective 1: go (truly) multimodal: visual, audio, text & metadata
objective 2: test a broad range of classifiers
objective 3: test a broad range of fusion techniques
Video Content Description - audio
Standard audio features (audio frame-based)
• Zero-Crossing Rate,
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.
global feature = mean & variance of the frame-level features f1, f2, …, fn over time
[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]
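The aggregation of frame-level audio features into one global feature can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy frame values and the use of the population variance are assumptions, and the frame-level features would in practice come from a toolbox such as Yaafe.

```python
from statistics import fmean, pvariance

def global_audio_feature(frames):
    """Collapse frame-level feature vectors into one global descriptor.

    frames: list of equal-length lists, one per audio frame (f1 ... fn).
    Returns the per-dimension means followed by the per-dimension variances.
    """
    if not frames:
        raise ValueError("need at least one frame")
    columns = list(zip(*frames))                     # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]  # population variance (assumption)
    return means + variances

# Three hypothetical frames with two features each.
frames = [[0.10, 1200.0], [0.20, 1400.0], [0.30, 1600.0]]
descriptor = global_audio_feature(frames)
```

The resulting descriptor has twice the dimensionality of a single frame, independent of the video length.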
Video Content Description - visual
MPEG-7 & color/texture descriptors (visual frame-based)
• Classic color histogram,
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Structure Color Descriptor,
• Color moments.
global feature = mean & dispersion & skewness & kurtosis & median & root mean square of the frame-level features f1, f2, …, fn over time
[OpenCV toolbox, http://opencv.willowgarage.com]
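The six temporal statistics used for the visual global feature could be computed per feature dimension roughly as below. This is a sketch under assumptions: "dispersion" is taken to be the standard deviation, kurtosis is the non-excess form, and the function name and toy values are hypothetical.

```python
import math
from statistics import fmean, median

def temporal_statistics(values):
    """Return mean, dispersion, skewness, kurtosis, median and RMS of one
    feature dimension tracked over the frames of a video."""
    n = len(values)
    mu = fmean(values)
    # Central moments; "dispersion" is taken to be the standard deviation (assumption).
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    std = math.sqrt(m2)
    skew = m3 / std ** 3 if std > 0 else 0.0
    kurt = m4 / m2 ** 2 if m2 > 0 else 0.0   # non-excess kurtosis (assumption)
    rms = math.sqrt(sum(v * v for v in values) / n)
    return {"mean": mu, "dispersion": std, "skewness": skew,
            "kurtosis": kurt, "median": median(values), "rms": rms}

# One hypothetical feature dimension observed over five frames.
stats = temporal_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
```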
Video Content Description - visual
Feature descriptors: Bag-of-Visual-Words framework
• we train the model with 4,096 words
• rgbSIFT and spatial pyramids (2x2)
Pipeline: detection of interest points, codewords dictionary, generation of BoW histograms, classifier training
[CIVR 2009, J. Uijlings et al.]
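The histogram generation step of the Bag-of-Visual-Words pipeline can be illustrated with the sketch below. It assumes a codebook is already available (typically learned by clustering, which is not shown), uses a toy 2-D codebook instead of 4,096 rgbSIFT words, and leaves out the spatial pyramid; all names are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """L1-normalized histogram of codeword assignments for one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy codebook with 3 words and 4 local descriptors (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.1], [0.9, 0.1], [0.8, -0.1], [0.1, 0.9]]
hist = bow_histogram(descriptors, codebook)
```

With a 2x2 spatial pyramid, one such histogram would be computed per image region and the results concatenated.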
Video Content Description - visual
Feature descriptors: Histogram of Oriented Gradients (HoG)
• divides the image into 3x3 cells and, for each cell, builds a pixel-wise histogram of edge orientations.
[CITS 2009, O. Ludwig et al.]
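A simplified version of the 3x3-cell orientation histogram might look like the sketch below. The number of orientation bins, the central-difference gradient and the magnitude-weighted voting are assumptions made for illustration; the cited HoG variant may differ in all three.

```python
import math

def hog_3x3(image, bins=8):
    """Concatenate per-cell histograms of gradient orientation over a 3x3 grid.

    image: list of rows of grayscale values. bins=8 is an assumption; the
    slide does not give the number of orientation bins.
    """
    h, w = len(image), len(image[0])
    hist = [[0.0] * bins for _ in range(9)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi      # unsigned orientation in [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            cell = (y * 3 // h) * 3 + (x * 3 // w)
            hist[cell][b] += mag                      # magnitude-weighted vote (assumption)
    return [v for cell_hist in hist for v in cell_hist]

# 9x9 horizontal ramp: every gradient points along x, so only bin 0 is filled.
ramp = [[float(x) for x in range(9)] for _ in range(9)]
features = hog_3x3(ramp)
```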
Video Content Description - visual
Structural descriptors
Objective: describe structural information in terms of contours and their relations;
Contour properties:
• b: degree of curvature (proportional to the maximum amplitude of the bowness space) – straight vs. bow;
• z: degree of circularity – ½ circle vs. full circle;
• e: edginess parameter – zig-zag vs. sinusoid;
• y: symmetry parameter – irregular vs. “even”;
+ Appearance parameters:
• c_m, c_s: mean, std. dev. of intensity along the contour;
• f_m, f_s: fuzziness, obtained from a blob (DOG) filter: I * DOG;
[IJCV, C. Rasche’10]
Video Content Description - text
TF-IDF descriptors (Term Frequency-Inverse Document Frequency)
Text sources: ASR and metadata
1. remove XML markup,
2. remove terms below the 5th percentile of the frequency distribution,
3. select the term corpus: for each genre class, retain the m terms (e.g. m = 150 for ASR and 20 for metadata) with the highest χ2 values that occur more frequently in that class than in the complement classes,
4. represent each document by its TF-IDF values.
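The final step of the text pipeline, representing each document by TF-IDF values over the selected term corpus, can be sketched as follows. The weighting used here (relative term frequency times log(N / df)) is one common variant and an assumption, since the slides do not state the exact formula; the documents and vocabulary are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents, vocabulary):
    """TF-IDF vector of each tokenized document over a fixed term corpus.

    Uses tf = relative term frequency and idf = log(N / df), one common
    variant (assumption; the slides do not state the weighting formula).
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc) & set(vocabulary))   # document frequency per term
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1
        vectors.append([
            (counts[t] / length) * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocabulary
        ])
    return vectors

# Hypothetical ASR snippets and a term corpus as it would come out of step 3.
docs = [["goal", "match", "goal"], ["recipe", "oven"], ["match", "recipe"]]
vocab = ["goal", "match", "recipe"]
vectors = tfidf_vectors(docs, vocab)
```

Terms outside the selected corpus (here "oven") are simply ignored, which keeps the descriptor length fixed.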
Classifiers
We test a broad range of classifiers:
• SVM with linear, RBF and Chi-square kernels
• 5-NN
• Random Trees and Extremely Random Trees
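Of the kernels listed, the Chi-square kernel is the least standard, so a small sketch may help. It shows the widely used exponential form for non-negative histogram features; whether the experiments used exactly this form and which gamma was chosen is not stated in the slides, so both are assumptions.

```python
import math

def chi_square_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel for non-negative histogram features.

    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)); bins where
    x_i + y_i == 0 contribute nothing. gamma=1.0 is an arbitrary default.
    """
    distance = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * distance)

# Two hypothetical normalized histograms.
h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
k_same = chi_square_kernel(h1, h1)   # identical inputs give the maximum value
k_diff = chi_square_kernel(h1, h2)
```

The kernel suits histogram-like descriptors, where a difference in a sparsely populated bin should weigh more than the same difference in a heavily populated one.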
Fusion Techniques
Early Fusion
1. Feature extraction: Descriptor 1, Descriptor 2, …, Descriptor n
2. Feature normalization: Descriptor 1 normalized, Descriptor 2 normalized, …, Descriptor n normalized
3. Feature concatenation: global descriptor
4. Classification step: a single classifier
5. Decision: obtain the global confidence score
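The normalization and concatenation steps of early fusion can be sketched as below. Min-max scaling is an assumption (the slides only say "normalized"), and the descriptor values are hypothetical.

```python
def min_max_normalize(vectors):
    """Rescale every dimension of one descriptor to [0, 1] across all videos."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]   # avoid division by zero
    return [[(v - lo) / s for v, lo, s in zip(vec, lows, spans)] for vec in vectors]

def early_fusion(descriptor_sets):
    """Normalize each descriptor, then concatenate them per video."""
    normalized = [min_max_normalize(vectors) for vectors in descriptor_sets]
    return [sum(per_video, []) for per_video in zip(*normalized)]

# Two hypothetical descriptors for three videos.
audio = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
visual = [[100.0], [300.0], [200.0]]
global_descriptors = early_fusion([audio, visual])
```

A single classifier is then trained on the concatenated global descriptors; normalizing first keeps a descriptor with a large numeric range from dominating the others.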
Fusion Techniques
Late Fusion
1. Feature extraction: Descriptor 1, Descriptor 2, …, Descriptor n
2. Classification step: one classifier per descriptor (Classifier 1, Classifier 2, …, Classifier n)
3. Confidence scores normalization: confidence value 1 (normalized), confidence value 2 (normalized), …, confidence value n (normalized)
4. Decision: global confidence score
Fusion Techniques
Late Fusion
[Equations: late fusion combination rules]
where
- cv_i is the confidence value of classifier i for class q, d is the current video, each classifier i has an associated weight, and N is the number of classifiers to be aggregated.
- rank() represents the rank of classifier i.
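For orientation, the combination rules named in the evaluation (CombSUM, CombMean, CombMNZ, CombRank) are sketched below using their textbook definitions from the information retrieval literature. These are assumptions and may differ in detail from the weighted formulas on the slide; in particular, mapping a rank r to the score 1 / r is only one possible choice.

```python
def comb_sum(scores, weights=None):
    """Weighted sum of the normalized confidence values of N classifiers."""
    weights = weights or [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

def comb_mean(scores, weights=None):
    """CombSUM divided by the number of classifiers."""
    return comb_sum(scores, weights) / len(scores)

def comb_mnz(scores, weights=None):
    """CombSUM multiplied by the number of classifiers with a non-zero score."""
    non_zero = sum(1 for s in scores if s > 0)
    return comb_sum(scores, weights) * non_zero

def comb_rank(ranks, weights=None):
    """Weighted sum of rank-based scores; rank 1 is the top position.

    Each rank r is mapped to 1 / r so that larger is better (assumption).
    """
    return comb_sum([1.0 / r for r in ranks], weights)

# Confidence values of three hypothetical classifiers for one (video, genre) pair.
cv = [0.8, 0.0, 0.5]
fused = {"sum": comb_sum(cv), "mean": comb_mean(cv), "mnz": comb_mnz(cv)}
```

CombMNZ rewards agreement: a video that several classifiers score above zero for a genre is ranked higher than one supported by a single classifier with the same total score.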
Experimental Setup
MediaEval 2012 Dataset - Tagging Task
• 14,838 episodes from 2,249 shows, ~3,260 hours of data
• split into Development and Test sets: 5,288 for development / 9,550 for test
• focuses on semi-professional video on the Internet
Experimental Setup
MediaEval 2012 Dataset
• 26 genre labels
1000 art 1001 autos_and_vehicles 
1002 business 1003 citizen_journalism 
1004 comedy 1005 conferences_and_other_events 
1006 default_category 1007 documentary 
1008 educational 1009 food_and_drink 
1010 gaming 1011 health 
1012 literature 1013 movies_and_television 
1014 music_and_entertainment 1015 personal_or_auto-biographical 
1016 politics 1017 religion 
1018 school_and_education 1019 sports 
1020 technology 1021 the_environment 
1022 the_mainstream_media 1023 travel 
1024 videoblogging 1025 web_development_and_sites 
Experimental Setup
• Mean Average Precision (MAP): summarizes rankings from multiple queries by averaging average precision
• Classifier parameters and late fusion weights were optimized on the training dataset
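Mean Average Precision can be computed as in the sketch below. It assumes each ranked list contains all relevant items for its query, so the average is taken over the relevant items found in the list; the rankings shown are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP of one ranked list; ranked_relevance[i] is True if item i is relevant."""
    hits, precision_sum = 0, 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / position   # precision at this relevant item
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query (here: per-genre) average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical genre rankings.
rankings = [[True, False, True, False], [False, True, True, True]]
map_value = mean_average_precision(rankings)
```

Because precision is accumulated at every relevant position, a relevant video ranked first contributes more than the same video ranked last.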
Evaluation
(1) Classification performance on individual modalities (MAP values)
Feature                  SVM Linear   SVM RBF   SVM Chi   5-NN     Random Forest   Ext. Random Forests
HoG                      9.08%        25.63%    22.44%    17.92%   16.62%          23.44%
Bag of Words             14.63%       17.61%    19.96%    8.55%    14.89%          16.32%
MPEG-7                   6.12%        4.26%     17.49%    9.61%    20.90%          26.17%
Structural Descriptors   7.55%        17.17%    22.76%    8.65%    13.85%          14.85%
Audio descriptors        20.68%       24.52%    35.56%    18.31%   34.41%          42.33%
TF-IDF on ASR            32.96%       35.05%    28.85%    12.96%   30.56%          27.93%
TF-IDF on Metadata       56.33%       58.14%    47.95%    57.19%   58.66%          57.52%
Evaluation
(1) Classification performance on individual modalities (visual)
Visual Performance
- Best performance with MPEG-7 (Extremely Random Forests, 26.17%) and HoG (SVM-RBF, 25.63%)
- Bag-of-Visual-Words does not perform very well (19.96% at best)
Evaluation
(1) Classification performance on individual modalities (audio)
Audio Performance
- Best performance with Extremely Random Forests (42.33%)
- Audio descriptors provide higher discriminative power than visual features
Evaluation
(1) Classification performance on individual modalities (text)
Text Performance
- Best performance with metadata and Random Forests (58.66%)
- ASR provides lower performance than audio
- Metadata features outperform all the other features
Evaluation
(2) Performance on Multimodal Integration (MAP values)
Descriptors   CombSUM   CombMean   CombMNZ   CombRank   Early Fusion
All Visual    35.82%    36.76%     38.21%    30.90%     30.11%
All Audio     43.86%    44.19%     44.50%    41.81%     42.33%
All Text      62.62%    62.81%     62.69%    50.60%     55.68%
All           64.24%    65.61%     65.82%    53.84%     60.12%
Fusion Techniques Performance
- late fusion with CombSUM, CombMean and CombMNZ provides higher performance than early fusion
- CombMNZ tends to provide the most accurate results
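The two observations can be checked directly against the table. The sketch below transcribes the MAP values and computes, per row, the best fusion method and the gain of CombMNZ over early fusion; the variable and function names are hypothetical.

```python
# MAP values (%) transcribed from the multimodal integration table.
fusion_map = {
    "All Visual": {"CombSUM": 35.82, "CombMean": 36.76, "CombMNZ": 38.21,
                   "CombRank": 30.90, "Early Fusion": 30.11},
    "All Audio":  {"CombSUM": 43.86, "CombMean": 44.19, "CombMNZ": 44.50,
                   "CombRank": 41.81, "Early Fusion": 42.33},
    "All Text":   {"CombSUM": 62.62, "CombMean": 62.81, "CombMNZ": 62.69,
                   "CombRank": 50.60, "Early Fusion": 55.68},
    "All":        {"CombSUM": 64.24, "CombMean": 65.61, "CombMNZ": 65.82,
                   "CombRank": 53.84, "Early Fusion": 60.12},
}

def best_method(row):
    """Name of the fusion method with the highest MAP in one table row."""
    return max(row, key=row.get)

def gain_over_early(row, method="CombMNZ"):
    """Absolute MAP gain (percentage points) of a method over early fusion."""
    return round(row[method] - row["Early Fusion"], 2)

best = {name: best_method(row) for name, row in fusion_map.items()}
gains = {name: gain_over_early(row) for name, row in fusion_map.items()}
```

CombMNZ is the best method in three of the four rows (CombMean is marginally ahead on text), and its gain over early fusion ranges from about 2 to 8 percentage points.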
Evaluation
(3) Comparison to MediaEval 2012 Tagging task results (MAP values)
Team           Modality        MAP      Method
proposed       all             65.82%   Late Fusion CombMNZ with all descriptors
proposed       text            62.81%   Late Fusion CombMean with TF-IDF of ASR and metadata
TUB            text            52.25%   Naive Bayes with Bag of Words on text (metadata)
proposed       all             51.9%    Late Fusion CombMNZ with all descriptors except for metadata
proposed       audio           44.50%   Late Fusion CombMean with standard audio descriptors
proposed       visual          38.21%   Late Fusion CombMean with MPEG-7 related, structural, HoG and B-o-VW with rgbSIFT
ARF            text            37.93%   SVM linear on early fusion of TF-IDF of ASR and metadata
TUD            visual & text   35.81%   Late Fusion of SVM with B-o-W (visual word, ASR & metadata)
KIT            visual          35.81%   SVM with Visual descriptors (color, texture, B-o-VW with rgbSIFT)
TUD-MM         text            25.00%   Dynamic Bayesian networks on text (ASR & metadata)
UNICAMP-UFMG   visual          21.12%   Late fusion (KNN, Naive Bayes, SVM, Random Forests) with BOW (text ASR)
ARF            audio           18.92%   SVM linear with block-based audio features
Conclusions
> we provided an in-depth evaluation of truly multimodal video description in the context of a real-world genre-categorization scenario;
> we demonstrated the potential of appropriate late fusion for genre categorization, achieving very high categorization performance (65.82% MAP with all descriptors);
> we showed that late fusion can boost the performance of automated content descriptors close to that of metadata-based ones (51.9% MAP with all descriptors except metadata);
> we set up a new baseline for the Genre Tagging Task by outperforming the other participants;
Acknowledgement: we would like to thank Prof. Nicu Sebe and Dr. J. Uijlings from University of Trento for their support.
We also acknowledge the 2012 Genre Tagging Task of the MediaEval Multimedia Benchmark for the dataset (http://www.multimediaeval.org/).
Thank you! 
Questions? 
