Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
Esra Acar1, Frank Hopfgartner2 and Sahin Albayrak1
1 DAI Laboratory (Competence Center Information Retrieval & Machine Learning), TU Berlin, Germany
2 Humanities Advanced Technology and Information Institute, University of Glasgow, UK
13th International Workshop on Content-Based Multimedia Indexing (CBMI), Prague, 10 June 2015
Presented by Esra Acar
Outline
► Introduction
► The Video Affective Analysis Method
  • Overview
  • Audio and static visual representation learning
  • Mid-level dynamic visual representations
  • Model generation
► Performance Evaluation
  • Dataset & ground-truth
  • Experimental setup
  • Results
  • Sample video clips
► Conclusions & Future Work
Introduction – 1
► Delivering personalized video content from the colossal amount of available multimedia is still a challenge.
► Video affective analysis can address this challenge from an original perspective by
  • analyzing video content at the affective level, and
  • providing access to videos based on the emotions expected to arise in the audience.
► Affective analysis can be either categorical or dimensional.
Introduction – 2
► In the context of categorical affective analysis, one direction followed by many researchers is to use machine learning methods.
► Machine learning approaches rely on a specific data representation.
  • One key issue is to find an effective representation of video content.
► Features can be classified by the level of semantic information they carry:
  • low-level (e.g., pixel values)
  • mid-level (e.g., bag of visual words)
  • high-level (e.g., a guitarist performing a song)
Introduction – 3
► Another possible feature distinction in video analysis is between static and dynamic (or temporal) features.
► Commonplace video affective content analysis methods use:
  • low-level audio-visual features,
  • mid-level representations built on low-level ones (e.g., horror sound, laughter), and
  • high-level semantic attributes (e.g., SentiBank, ObjectBank).
► Affective video analysis methods in the literature
  • rely mainly on handcrafted low- and mid-level features, and
  • exploit the temporal aspect of videos only in a limited manner.
The Video Affective Analysis Method – 1
► Two main issues addressed in this work:
  • learning mid-level audio and static visual features, and
  • deriving effective mid-level motion representations.
► Our approach is a categorical affective analysis solution that maps each video onto one of the four quadrants of the Valence-Arousal space (VA-space).
► The choice between categorical and dimensional is not critical.
  • In practice, categories can always be mapped onto dimensions and vice versa.
The Video Affective Analysis Method – 2
► arousal → intensity of emotion
► valence → type of emotion
Excerpt from Yazdani, A., Skodras, E., Fakotakis, N., & Ebrahimi, T. (2013).
Multimedia content analysis for emotional characterization of music video clips.
EURASIP Journal on Image and Video Processing, 2013(1), 26.
An Overview of the Steps in the System
► (1) One-minute highlight extracts of music video clips are first segmented into 5-second pieces;
► (2) audio and visual feature extraction;
► (3) learning mid-level audio and static visual representations (training);
► (4) generating mid-level audio-visual representations;
► (5) generating an affective analysis model (training);
► (6) classifying each 5-second video segment into one of the four quadrants of the VA-space (test); and
► (7) classifying an extract from the results of its 5-second segments (test) – a minimal aggregation sketch is given below.
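A hypothetical sketch, not the authors' code, of steps (1) and (7): splitting a one-minute extract into 5-second segments and combining the per-segment class-probability estimates into a clip-level label. The aggregation rule (averaging the probabilities) is an assumption made purely for illustration; the slides do not specify it.

```python
# Hypothetical sketch: 5-second segmentation and clip-level aggregation (rule assumed).
import numpy as np

SEGMENT_LEN, CLIP_LEN = 5, 60
QUADRANTS = ["ha-hv", "la-hv", "la-lv", "ha-lv"]

def segment_boundaries(clip_len: int = CLIP_LEN, seg_len: int = SEGMENT_LEN):
    """Start/end times (in seconds) of the 5-second segments of one highlight extract."""
    return [(t, t + seg_len) for t in range(0, clip_len, seg_len)]

def clip_label(segment_probs: np.ndarray) -> str:
    """segment_probs: (n_segments, 4) probability estimates from the segment-level model."""
    return QUADRANTS[segment_probs.mean(axis=0).argmax()]

rng = np.random.default_rng(0)
probs = rng.random((12, 4))
probs /= probs.sum(axis=1, keepdims=True)   # placeholder per-segment probabilities
print(segment_boundaries()[:3])             # [(0, 5), (5, 10), (10, 15)]
print(clip_label(probs))
```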
Audio and Static Visual Representation Learning – 1
► Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space are used as raw data.
► Convolutional neural networks (CNNs) are used for mid-level feature extraction (see the sketch below):
  • three convolution and two subsampling layers,
  • trained using the backpropagation algorithm,
  • the output of the last convolution layer → mid-level audio or visual representation.
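A minimal, hypothetical sketch of such a network, assuming PyTorch; kernel sizes, channel counts, activations and input shapes are illustrative assumptions, since the slides fix only the overall layer layout (three convolution layers, two subsampling layers, one fully connected output layer).

```python
# Hypothetical sketch, not the authors' implementation: a CNN with three convolution
# and two subsampling layers whose last convolution output serves as the mid-level feature.
import torch
import torch.nn as nn

class MidLevelCNN(nn.Module):
    def __init__(self, in_channels: int = 1, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=5), nn.Tanh(),   # convolution 1
            nn.AvgPool2d(2),                                       # subsampling 1
            nn.Conv2d(8, 16, kernel_size=5), nn.Tanh(),            # convolution 2
            nn.AvgPool2d(2),                                       # subsampling 2
            nn.Conv2d(16, 32, kernel_size=3), nn.Tanh(),           # convolution 3 ("C6")
        )
        # Output layer fully connected to the last convolution layer; used only for training.
        self.classifier = nn.LazyLinear(num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.flatten(self.features(x), 1))

    def mid_level(self, x: torch.Tensor) -> torch.Tensor:
        # After backpropagation training, the flattened output of the last convolution
        # layer is taken as the mid-level audio or visual representation.
        return torch.flatten(self.features(x), 1)

# Example: a batch of 32x32 single-channel inputs (e.g., MFCC patches); shapes are assumptions.
logits = MidLevelCNN()(torch.randn(4, 1, 32, 32))
```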
Audio and Static Visual Representation Learning – 2
(a) A high-level overview of our representation learning method; (b) the detailed CNN architectures for audio and visual representation learning. The architecture contains three convolution and two subsampling layers, and one output layer fully connected to the last convolution layer (C6). (CNN: Convolutional Neural Network, MFCC: Mel-Frequency Cepstral Coefficients, A: Audio, V: Visual)
Mid-Level Dynamic Visual Representations – 1
► Motion in edited videos (e.g., music video clips) has been shown to be an important cue for affective video analysis.
► We adopt the work of Wang et al. on dense trajectories.
► Dense trajectories
  • are dynamic visual features derived from tracking densely sampled feature points at multiple spatial scales,
  • were originally introduced for action recognition in unconstrained videos, and
  • constitute a powerful tool for motion description.
Mid-Level Dynamic Visual Representations – 2
► Steps to construct mid-level motion representations (see the sketch below):
  • Dense trajectories of length 15 frames are
    – extracted from each video segment, and
    – represented by HoG, HoF and motion boundary histograms in the x and y directions (MBHx and MBHy, respectively).
  • A separate dictionary is learned for each dense trajectory descriptor.
    – A sparse dictionary learning technique is used to generate a dictionary of size k (k = 512).
    – 400 × k feature vectors are sampled from the training data.
  • Sparse representations are generated using the LARS algorithm and max-pooling (i.e., sparse-coded Bag-of-Words).
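A hypothetical sketch of the dictionary learning and sparse-coded BoW step, assuming scikit-learn; the descriptor dimensionality, the subsampling of training descriptors and the sparsity level are illustrative assumptions, not values from the slides.

```python
# Hypothetical sketch, not the authors' code: learn a dictionary of size k = 512 for one
# trajectory descriptor type, sparse-code a segment's descriptors with LARS, and max-pool
# the codes into a single segment-level vector (sparse-coded Bag-of-Words).
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

k = 512                                              # dictionary size from the slide
rng = np.random.default_rng(0)
# Placeholder for the 400 x k descriptors sampled from the training data (96-dim, e.g., HoG);
# heavily subsampled here only to keep the sketch fast.
train_descriptors = rng.standard_normal((4000, 96))

dictionary = MiniBatchDictionaryLearning(n_components=k, batch_size=256,
                                         random_state=0).fit(train_descriptors).components_

def segment_representation(descriptors: np.ndarray) -> np.ndarray:
    """Sparse-code all trajectory descriptors of one 5-second segment and max-pool them."""
    codes = sparse_encode(descriptors, dictionary, algorithm="lars", n_nonzero_coefs=10)
    return np.abs(codes).max(axis=0)                 # max-pooling over trajectories -> k dims

print(segment_representation(rng.standard_normal((300, 96))).shape)   # (512,)
```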
Model Generation – 1
► Mid-level audio and static visual representations are created by
using the CNN models.
► Mid-level motion representations are derived using sparse
coded BoW.
► Mid-level audio, dynamic and static visual representations are
fed into separate multi-class SVMs (RBF kernel).
► The probability estimates of the models are merged using
linear or SVM-based fusion.
Model Generation – 2
► We investigated two distinct fusion techniques to combine the outputs of the SVM models (see the sketch below):
  • Linear fusion: probability estimates are fused at the decision level using a different weight for each modality; the weights are optimized on the training data.
  • SVM-based fusion: the probability estimates of the SVMs are concatenated into a higher-level representation vector, which is used to train another SVM that predicts the label of a video segment.
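A hypothetical sketch of the two fusion schemes, assuming scikit-learn SVMs with RBF kernels and probability estimates. The features, weights and hyper-parameters are placeholders, and for brevity the fusion SVM is trained on training-set probabilities rather than the cross-validated estimates one would prefer in practice.

```python
# Hypothetical sketch, not the authors' code: per-modality RBF SVMs plus linear and
# SVM-based fusion of their probability estimates.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test = 200, 50
y_train = rng.integers(0, 4, n_train)              # four VA-quadrant labels (placeholders)
modalities = {                                     # mid-level audio / static visual / motion
    "audio":  (rng.standard_normal((n_train, 64)),  rng.standard_normal((n_test, 64))),
    "visual": (rng.standard_normal((n_train, 64)),  rng.standard_normal((n_test, 64))),
    "motion": (rng.standard_normal((n_train, 512)), rng.standard_normal((n_test, 512))),
}

# One multi-class SVM (RBF kernel) per modality, producing class-probability estimates.
probs_train, probs_test = {}, {}
for name, (X_tr, X_te) in modalities.items():
    svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_train)
    probs_train[name] = svm.predict_proba(X_tr)
    probs_test[name] = svm.predict_proba(X_te)

# (1) Linear fusion: weighted sum of per-modality probabilities; the weights would be
# optimized on the training data (the values here are just examples).
weights = {"audio": 0.4, "visual": 0.2, "motion": 0.4}
fused = sum(w * probs_test[m] for m, w in weights.items())
linear_pred = fused.argmax(axis=1)

# (2) SVM-based fusion: concatenate per-modality probabilities and train a second-level SVM.
stack_train = np.hstack([probs_train[m] for m in modalities])
stack_test = np.hstack([probs_test[m] for m in modalities])
fusion_svm = SVC(kernel="rbf", random_state=0).fit(stack_train, y_train)
svm_pred = fusion_svm.predict(stack_test)
```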
Performance Evaluation
► The experiments compare the discriminative power of our method with that of a method that uses low-level audio-visual features (i.e., the baseline method), and with the works presented in [1] and [2].
  • [1] A. Yazdani, K. Kappeler, and T. Ebrahimi, “Affective content analysis of music video clips,” in MIRUM. ACM, 2011.
  • [2] E. Acar, F. Hopfgartner, and S. Albayrak, “Understanding affective content of music videos through learned representations,” in MMM, 2014.
Dataset & Ground-truth – 1
► We use the DEAP dataset (www.eecs.qmul.ac.uk/mmv/datasets/deap).
► The DEAP dataset
  • is for the analysis of human affective states using electroencephalogram, physiological and video signals, and
  • consists of the ratings from an online self-assessment in which 120 one-minute extracts of music videos were each rated by 14-16 volunteers on arousal, valence and dominance.
► Only the one-minute highlight extracts of the 74 videos available on YouTube have been used in the experiments (i.e., 888 video segments).
Dataset & Ground-truth – 2
► Four affective labels, each representing one quadrant of the VA-space, are used for classification:
  • high arousal-high valence (ha-hv) – 19 songs,
  • low arousal-high valence (la-hv) – 19 songs,
  • low arousal-low valence (la-lv) – 14 songs, and
  • high arousal-low valence (ha-lv) – 22 songs.
► The labels are provided in the dataset and are determined by the average ratings of the participants in the online self-assessment (see the sketch below).
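A hypothetical sketch of how average ratings could be mapped to the quadrant labels above. DEAP self-assessment ratings lie on a 1-9 scale, so 5 is assumed as the neutral midpoint; the slides do not state the exact thresholding.

```python
# Hypothetical sketch: map mean valence/arousal ratings to one of the four quadrant labels.
def va_quadrant(mean_valence: float, mean_arousal: float, midpoint: float = 5.0) -> str:
    arousal = "ha" if mean_arousal >= midpoint else "la"
    valence = "hv" if mean_valence >= midpoint else "lv"
    return f"{arousal}-{valence}"

print(va_quadrant(7.2, 6.8))   # 'ha-hv'
print(va_quadrant(3.1, 2.4))   # 'la-lv'
```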
Experimental Setup – 1
► MFCC extraction → frame size of 25 ms with 10 ms overlap, 13-dimensional.
► Mean and standard deviation of MFCC → low-level audio representation (LLR audio).
► Normalized histograms (16, 4, 4 bins) in the HSV color space → low-level visual representation (LLR visual); see the sketch below.
Experimental Setup – 2
► The most computationally expensive phase → training of the CNN models.
  • MFCC – 150 seconds, color – 350 seconds (on average per epoch).
► Generation of feature representations per video segment:
  • MFCC using CNNs – 0.5 seconds,
  • color using CNNs – 1.2 seconds, and
  • dense trajectory based sparse-coded BoW – 16 seconds.
► All timing evaluations were performed on a machine with a 2.40 GHz CPU and 8 GB RAM.
► A leave-one-song-out cross-validation scheme was used (see the sketch below).
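A hypothetical sketch of the leave-one-song-out scheme using scikit-learn's LeaveOneGroupOut, with placeholder features and labels: all 5-second segments from the same song form one group, so each fold holds out all twelve segments of one extract.

```python
# Hypothetical sketch, not the authors' code: leave-one-song-out cross-validation.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_songs, segments_per_song = 74, 12                  # 74 one-minute extracts, 12 x 5 s each
X = rng.standard_normal((n_songs * segments_per_song, 64))         # placeholder features
y = np.repeat(rng.integers(0, 4, n_songs), segments_per_song)      # quadrant label per song
groups = np.repeat(np.arange(n_songs), segments_per_song)          # song id per segment

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))
print(f"mean segment accuracy over {len(accuracies)} folds: {np.mean(accuracies):.3f}")
```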
Results – Unimodal Representations
► Motion and audio representations are more discriminative than static
visual features.
► Motion representation is superior → affect present in video clips is often characterized by motion (e.g., camera motion).
► Color values in the HSV space lead to more discriminative mid-level
representations than color values in the RGB space (when compared
to our previous work).
Classification Accuracies on the DEAP dataset (MLR: mid-level representation)
Results – Multi-modal Representations
► The performance gain over prior works is remarkable for SVM-based fusion → an advanced fusion mechanism is better.
► Differences with the setup of the work in [3]:
  • 40 video clips from the DEAP dataset were used in [3], and
  • only the clips that induce strong emotions were used in [3].
Classification Accuracies on the DEAP dataset (MLR: mid-level representation)
Results – Confusion Matrix
Confusion matrices on the DEAP dataset (Mean accuracy: 50% for (a) and 58.11% for (b)).
Lighter areas along the main diagonal correspond to better discrimination.
(a) MLR audio and static visual; (b) MLR audio, motion and static visual – linear fusion.
Correctly Classified (HA-HV)
Emiliana Torrini – Jungle Drum
Wrongly Classified (HA-HV)
Predicted → HA-LV
The Go! Team – Huddle Formation
Correctly Classified (LA-HV)
Grand Archives – Miniature Birds
Wrongly Classified (LA-HV)
Predicted → HA-LV
The Cardigans – Carnival
Correctly Classified (LA-LV)
James Blunt – Goodbye My Lover
Wrongly Classified (LA-LV)
Predicted → LA-HV
Porcupine Tree – Normal
Correctly Classified (HA-LV)
ARCH ENEMY - My Apocalypse
Wrongly Classified (HA-LV)
Predicted → HA-HV
The Cranberries – Zombie
Conclusions & Future Work – 1
► We presented an approach in which
  • higher-level representations are learned from raw data using CNNs, and
  • fused with dense trajectory based motion features at the decision level.
► Experimental results on the DEAP dataset support our assumptions
  • (1) that higher-level audio-visual representations learned using CNNs are more discriminative than low-level ones, and
  • (2) that including dense trajectories contributes to increasing the classification performance.
Conclusions & Future Work – 2
► Future work:
  • to concentrate on the modeling aspect of the problem and explore machine learning techniques such as ensemble learning,
  • to extend our approach to user-generated videos (i.e., usually not professionally edited), and
  • to incorporate high-level representations such as sentiment-level semantics.
Competence Center Information Retrieval &
Machine Learning
www.dai-labor.de
Fon: +49 (0) 30 / 314 – 74
Fax: +49 (0) 30 / 314 – 74 003
DAI-Labor
Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14
Ernst-Reuter-Platz 7
10587 Berlin, Deutschland
Esra Acar, M.Sc.
Researcher
esra.acar@tu-berlin.de
Thanks!