Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
Esra Acar1, Frank Hopfgartner2 and Sahin Albayrak1
1 DAI Laboratory (Competence Center Information Retrieval & Machine Learning), TU Berlin, Germany
2 Humanities Advanced Technology and Information Institute, University of Glasgow, UK
13th International Workshop on Content-Based Multimedia Indexing (CBMI), Prague, 10 June 2015
Presented by Esra Acar
Outline
► Introduction
► The Video Affective Analysis Method
  • Overview
  • Audio and static visual representation learning
  • Mid-level dynamic visual representations
  • Model generation
► Performance Evaluation
  • Dataset & ground-truth
  • Experimental setup
  • Results
  • Sample video clips
► Conclusions & Future Work
Introduction – 1
► Delivering personalized video content from the colossal amount of available multimedia is still a challenge.
► Video affective analysis can address this challenge from an original perspective by
  • analyzing video content at the affective level, and
  • providing access to videos based on the emotions expected to arise in the audience.
► Affective analysis can be either categorical or dimensional.
Introduction – 2
► In the context of categorical affective analysis, one direction followed by many researchers is to use machine learning methods.
► Machine learning approaches rely on a specific data representation.
  • One key issue is to find an effective representation of video content.
► Features can be classified by the level of semantic information they carry:
  • low-level (e.g., pixel values)
  • mid-level (e.g., bag of visual words)
  • high-level (e.g., a guitarist performing a song)
Introduction – 3
► Another possible feature distinction in video analysis is between static and dynamic (or temporal) features.
► Commonplace video affective content analysis methods use:
  • low-level audio-visual features,
  • mid-level representations built on low-level ones (e.g., horror sound, laughter), and
  • high-level semantic attributes (e.g., SentiBank, ObjectBank).
► Affective video analysis methods in the literature
  • rely mainly on handcrafted low- and mid-level features, and
  • exploit the temporal aspect of videos only in a limited manner.
The Video Affective Analysis Method – 1
► Two main issues addressed in this work:
  • learning mid-level audio and static visual features, and
  • deriving effective mid-level motion representations.
► Our approach is a categorical affective analysis solution that maps each video onto one of the four quadrants of the Valence-Arousal space (VA-space).
► The choice between categorical and dimensional is not critical.
  • In practice, categories can always be mapped onto dimensions and vice versa.
The Video Affective Analysis Method – 2
► arousal → intensity of emotion
► valence → type of emotion
Excerpt from Yazdani, A., Skodras, E., Fakotakis, N., & Ebrahimi, T. (2013).
Multimedia content analysis for emotional characterization of music video clips.
EURASIP Journal on Image and Video Processing, 2013(1), 26.
An Overview of the Steps in the System
► (1) One-minute highlight extracts of music video clips are first segmented into 5-second pieces;
► (2) audio and visual feature extraction;
► (3) learning mid-level audio and static visual representations (training);
► (4) generating mid-level audio-visual representations;
► (5) generating an affective analysis model (training);
► (6) classifying each 5-second video segment into one of the four quadrants of the VA-space (test); and
► (7) classifying an extract from the results of its 5-second segments (test) – a minimal aggregation sketch is given below.
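A hypothetical sketch, not the authors' code, of steps (1) and (7): splitting a one-minute extract into 5-second segments and combining the per-segment class-probability estimates into a clip-level label. The aggregation rule (averaging the probabilities) is an assumption made purely for illustration; the slides do not specify it.

```python
# Hypothetical sketch: 5-second segmentation and clip-level aggregation (rule assumed).
import numpy as np

SEGMENT_LEN, CLIP_LEN = 5, 60
QUADRANTS = ["ha-hv", "la-hv", "la-lv", "ha-lv"]

def segment_boundaries(clip_len: int = CLIP_LEN, seg_len: int = SEGMENT_LEN):
    """Start/end times (in seconds) of the 5-second segments of one highlight extract."""
    return [(t, t + seg_len) for t in range(0, clip_len, seg_len)]

def clip_label(segment_probs: np.ndarray) -> str:
    """segment_probs: (n_segments, 4) probability estimates from the segment-level model."""
    return QUADRANTS[segment_probs.mean(axis=0).argmax()]

rng = np.random.default_rng(0)
probs = rng.random((12, 4))
probs /= probs.sum(axis=1, keepdims=True)   # placeholder per-segment probabilities
print(segment_boundaries()[:3])             # [(0, 5), (5, 10), (10, 15)]
print(clip_label(probs))
```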
Audio and Static Visual Representation Learning – 1
► Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space are used as raw data.
► Convolutional neural networks (CNNs) are used for mid-level feature extraction (see the sketch below):
  • three convolution and two subsampling layers,
  • trained using the backpropagation algorithm,
  • the output of the last convolution layer → mid-level audio or visual representation.
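A minimal, hypothetical sketch of such a network, assuming PyTorch; kernel sizes, channel counts, activations and input shapes are illustrative assumptions, since the slides fix only the overall layer layout (three convolution layers, two subsampling layers, one fully connected output layer).

```python
# Hypothetical sketch, not the authors' implementation: a CNN with three convolution
# and two subsampling layers whose last convolution output serves as the mid-level feature.
import torch
import torch.nn as nn

class MidLevelCNN(nn.Module):
    def __init__(self, in_channels: int = 1, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=5), nn.Tanh(),   # convolution 1
            nn.AvgPool2d(2),                                       # subsampling 1
            nn.Conv2d(8, 16, kernel_size=5), nn.Tanh(),            # convolution 2
            nn.AvgPool2d(2),                                       # subsampling 2
            nn.Conv2d(16, 32, kernel_size=3), nn.Tanh(),           # convolution 3 ("C6")
        )
        # Output layer fully connected to the last convolution layer; used only for training.
        self.classifier = nn.LazyLinear(num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.flatten(self.features(x), 1))

    def mid_level(self, x: torch.Tensor) -> torch.Tensor:
        # After backpropagation training, the flattened output of the last convolution
        # layer is taken as the mid-level audio or visual representation.
        return torch.flatten(self.features(x), 1)

# Example: a batch of 32x32 single-channel inputs (e.g., MFCC patches); shapes are assumptions.
logits = MidLevelCNN()(torch.randn(4, 1, 32, 32))
```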
Audio and Static Visual Representation Learning – 2
(a) A high-level overview of our representation learning method; (b) the detailed CNN architectures for audio and visual representation learning. The architecture contains three convolution and two subsampling layers, and one output layer fully connected to the last convolution layer (C6). (CNN: Convolutional Neural Network, MFCC: Mel-Frequency Cepstral Coefficients, A: Audio, V: Visual)
Mid-Level Dynamic Visual Representations – 1
► Motion in edited videos (e.g., music video clips) has been shown to be an important cue for affective video analysis.
► We adopt the work of Wang et al. on dense trajectories.
► Dense trajectories
  • are dynamic visual features derived from tracking densely sampled feature points at multiple spatial scales,
  • were originally introduced for action recognition in unconstrained videos, and
  • constitute a powerful tool for motion description.
Mid-Level Dynamic Visual Representations – 2
► Steps to construct mid-level motion representations (see the sketch below):
  • Dense trajectories of length 15 frames are
    – extracted from each video segment, and
    – represented by HoG, HoF and motion boundary histograms in the x and y directions (MBHx and MBHy, respectively).
  • A separate dictionary is learned for each dense trajectory descriptor.
    – A sparse dictionary learning technique is used to generate a dictionary of size k (k = 512).
    – 400 × k feature vectors are sampled from the training data.
  • Sparse representations are generated using the LARS algorithm and max-pooling (i.e., sparse-coded Bag-of-Words).
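A hypothetical sketch of the dictionary learning and sparse-coded BoW step, assuming scikit-learn; the descriptor dimensionality, the subsampling of training descriptors and the sparsity level are illustrative assumptions, not values from the slides.

```python
# Hypothetical sketch, not the authors' code: learn a dictionary of size k = 512 for one
# trajectory descriptor type, sparse-code a segment's descriptors with LARS, and max-pool
# the codes into a single segment-level vector (sparse-coded Bag-of-Words).
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

k = 512                                              # dictionary size from the slide
rng = np.random.default_rng(0)
# Placeholder for the 400 x k descriptors sampled from the training data (96-dim, e.g., HoG);
# heavily subsampled here only to keep the sketch fast.
train_descriptors = rng.standard_normal((4000, 96))

dictionary = MiniBatchDictionaryLearning(n_components=k, batch_size=256,
                                         random_state=0).fit(train_descriptors).components_

def segment_representation(descriptors: np.ndarray) -> np.ndarray:
    """Sparse-code all trajectory descriptors of one 5-second segment and max-pool them."""
    codes = sparse_encode(descriptors, dictionary, algorithm="lars", n_nonzero_coefs=10)
    return np.abs(codes).max(axis=0)                 # max-pooling over trajectories -> k dims

print(segment_representation(rng.standard_normal((300, 96))).shape)   # (512,)
```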
Model Generation – 1
► Mid-level audio and static visual representations are created by
using the CNN models.
► Mid-level motion representations are derived using sparse
coded BoW.
► Mid-level audio, dynamic and static visual representations are
fed into separate multi-class SVMs (RBF kernel).
► The probability estimates of the models are merged using
linear or SVM-based fusion.
Model Generation – 2
► We investigated two distinct fusion techniques to combine the outputs of the SVM models (see the sketch below):
  • Linear fusion: probability estimates are fused at the decision level using a different weight for each modality; the weights are optimized on the training data.
  • SVM-based fusion: the probability estimates of the SVMs are concatenated into a higher-level representation vector, which is used to train another SVM that predicts the label of a video segment.
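A hypothetical sketch of the two fusion schemes, assuming scikit-learn SVMs with RBF kernels and probability estimates. The features, weights and hyper-parameters are placeholders, and for brevity the fusion SVM is trained on training-set probabilities rather than the cross-validated estimates one would prefer in practice.

```python
# Hypothetical sketch, not the authors' code: per-modality RBF SVMs plus linear and
# SVM-based fusion of their probability estimates.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test = 200, 50
y_train = rng.integers(0, 4, n_train)              # four VA-quadrant labels (placeholders)
modalities = {                                     # mid-level audio / static visual / motion
    "audio":  (rng.standard_normal((n_train, 64)),  rng.standard_normal((n_test, 64))),
    "visual": (rng.standard_normal((n_train, 64)),  rng.standard_normal((n_test, 64))),
    "motion": (rng.standard_normal((n_train, 512)), rng.standard_normal((n_test, 512))),
}

# One multi-class SVM (RBF kernel) per modality, producing class-probability estimates.
probs_train, probs_test = {}, {}
for name, (X_tr, X_te) in modalities.items():
    svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_train)
    probs_train[name] = svm.predict_proba(X_tr)
    probs_test[name] = svm.predict_proba(X_te)

# (1) Linear fusion: weighted sum of per-modality probabilities; the weights would be
# optimized on the training data (the values here are just examples).
weights = {"audio": 0.4, "visual": 0.2, "motion": 0.4}
fused = sum(w * probs_test[m] for m, w in weights.items())
linear_pred = fused.argmax(axis=1)

# (2) SVM-based fusion: concatenate per-modality probabilities and train a second-level SVM.
stack_train = np.hstack([probs_train[m] for m in modalities])
stack_test = np.hstack([probs_test[m] for m in modalities])
fusion_svm = SVC(kernel="rbf", random_state=0).fit(stack_train, y_train)
svm_pred = fusion_svm.predict(stack_test)
```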
Performance Evaluation
► The experiments compare the discriminative power of our method with that of a method that uses low-level audio-visual features (i.e., the baseline method), and with the works presented in [1] and [2].
  • [1] A. Yazdani, K. Kappeler, and T. Ebrahimi, “Affective content analysis of music video clips,” in MIRUM. ACM, 2011.
  • [2] E. Acar, F. Hopfgartner, and S. Albayrak, “Understanding affective content of music videos through learned representations,” in MMM, 2014.
Dataset & Ground-truth – 1
► We use the DEAP dataset (www.eecs.qmul.ac.uk/mmv/datasets/deap).
► The DEAP dataset
  • is for the analysis of human affective states using electroencephalogram, physiological and video signals, and
  • consists of the ratings from an online self-assessment in which 120 one-minute extracts of music videos were each rated by 14-16 volunteers on arousal, valence and dominance.
► Only the one-minute highlight extracts of the 74 videos available on YouTube have been used in the experiments (i.e., 888 video segments).
Dataset & Ground-truth – 2
► Four affective labels, each representing one quadrant of the VA-space, are used for classification:
  • high arousal-high valence (ha-hv) – 19 songs,
  • low arousal-high valence (la-hv) – 19 songs,
  • low arousal-low valence (la-lv) – 14 songs, and
  • high arousal-low valence (ha-lv) – 22 songs.
► The labels are provided in the dataset and are determined by the average ratings of the participants in the online self-assessment (see the sketch below).
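A hypothetical sketch of how average ratings could be mapped to the quadrant labels above. DEAP self-assessment ratings lie on a 1-9 scale, so 5 is assumed as the neutral midpoint; the slides do not state the exact thresholding.

```python
# Hypothetical sketch: map mean valence/arousal ratings to one of the four quadrant labels.
def va_quadrant(mean_valence: float, mean_arousal: float, midpoint: float = 5.0) -> str:
    arousal = "ha" if mean_arousal >= midpoint else "la"
    valence = "hv" if mean_valence >= midpoint else "lv"
    return f"{arousal}-{valence}"

print(va_quadrant(7.2, 6.8))   # 'ha-hv'
print(va_quadrant(3.1, 2.4))   # 'la-lv'
```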
Experimental Setup – 1
► MFCC extraction → frame size of 25 ms with 10 ms overlap, 13-dimensional.
► Mean and standard deviation of MFCC → low-level audio representation (LLR audio).
► Normalized histograms (16, 4, 4 bins) in the HSV color space → low-level visual representation (LLR visual); see the sketch below.
Experimental Setup – 2
► The most computationally expensive phase → training of the CNN models.
  • MFCC – 150 seconds, color – 350 seconds (on average per epoch).
► Generation of feature representations per video segment:
  • MFCC using CNNs – 0.5 seconds,
  • color using CNNs – 1.2 seconds, and
  • dense trajectory based sparse-coded BoW – 16 seconds.
► All timing evaluations were performed on a machine with a 2.40 GHz CPU and 8 GB RAM.
► A leave-one-song-out cross-validation scheme was used (see the sketch below).
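A hypothetical sketch of the leave-one-song-out scheme using scikit-learn's LeaveOneGroupOut, with placeholder features and labels: all 5-second segments from the same song form one group, so each fold holds out all twelve segments of one extract.

```python
# Hypothetical sketch, not the authors' code: leave-one-song-out cross-validation.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_songs, segments_per_song = 74, 12                  # 74 one-minute extracts, 12 x 5 s each
X = rng.standard_normal((n_songs * segments_per_song, 64))         # placeholder features
y = np.repeat(rng.integers(0, 4, n_songs), segments_per_song)      # quadrant label per song
groups = np.repeat(np.arange(n_songs), segments_per_song)          # song id per segment

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))
print(f"mean segment accuracy over {len(accuracies)} folds: {np.mean(accuracies):.3f}")
```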
Results – Unimodal Representations
► Motion and audio representations are more discriminative than static
visual features.
► Motion representation is superior → affect present in video clips is often characterized by motion (e.g., camera motion).
► Color values in the HSV space lead to more discriminative mid-level
representations than color values in the RGB space (when compared
to our previous work).
Classification Accuracies on the DEAP dataset (MLR: mid-level representation)
Results – Multi-modal Representations
► The performance gain over prior works is remarkable for SVM-based fusion → an advanced fusion mechanism is better.
► Differences with the setup of the work in [3]:
  • 40 video clips from the DEAP dataset were used in [3], and
  • only the clips that induce strong emotions were used in [3].
Classification Accuracies on the DEAP dataset (MLR: mid-level representation)
Results – Confusion Matrix
Confusion matrices on the DEAP dataset (Mean accuracy: 50% for (a) and 58.11% for (b)).
Lighter areas along the main diagonal correspond to better discrimination.
(a) MLR audio and static visual; (b) MLR audio, motion and static visual – linear fusion.
Correctly Classified (HA-HV)
Emiliana Torrini – Jungle Drum
Wrongly Classified (HA-HV)
Predicted → HA-LV
The Go! Team – Huddle Formation
Correctly Classified (LA-HV)
Grand Archives – Miniature Birds
Wrongly Classified (LA-HV)
Predicted → HA-LV
The Cardigans – Carnival
Correctly Classified (LA-LV)
James Blunt – Goodbye My Lover
Wrongly Classified (LA-LV)
Predicted → LA-HV
Porcupine Tree – Normal
Correctly Classified (HA-LV)
ARCH ENEMY - My Apocalypse
Wrongly Classified (HA-LV)
Predicted → HA-HV
The Cranberries – Zombie
Conclusions & Future Work – 1
► We presented an approach in which
  • higher-level representations are learned from raw data using CNNs, and
  • fused with dense trajectory based motion features at the decision level.
► Experimental results on the DEAP dataset support our assumptions
  • (1) that higher-level audio-visual representations learned using CNNs are more discriminative than low-level ones, and
  • (2) that including dense trajectories contributes to increasing the classification performance.
Conclusions & Future Work – 2
► Future work:
  • to concentrate on the modeling aspect of the problem and explore machine learning techniques such as ensemble learning,
  • to extend our approach to user-generated videos (i.e., usually not professionally edited), and
  • to incorporate high-level representations such as sentiment-level semantics.
Competence Center Information Retrieval &
Machine Learning
www.dai-labor.de
Fon: +49 (0) 30 / 314 – 74
Fax: +49 (0) 30 / 314 – 74 003
DAI-Labor
Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14
Ernst-Reuter-Platz 7
10587 Berlin, Deutschland
Esra Acar, M.Sc.
Researcher
esra.acar@tu-berlin.de
Thanks!