THE THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-20)
MULTIMODAL (MULTI-INPUT) EMOTION RECOGNITION
 Reasons:
o Richer information: cues from different modalities can augment or complement each other, and hence lead to more sophisticated inference algorithms.
o Robustness to Sensor Noise: information from different modalities captured through sensors can often be corrupted by signal noise, be missing altogether when the particular modality is not expressed, or fail to be captured due to occlusion, sensor artifacts, etc. We call such modalities ineffectual. Ineffectual modalities are especially prevalent in in-the-wild datasets.
DATASET
 IEMOCAP (2008)
 CMU-MOSEI (2018)
DATASET
 Comparison between CMU-MOSEI and IEMOCAP
CHALLENGE
 Challenges:
o Deciding which modalities should be combined, and how
o Lack of agreement on the most efficient mechanism for combining (fusing) multiple modalities
TECHNIQUES
 Early fusion:
Sikka et al. (2013): Multiple Kernel Learning for Emotion Recognition in the Wild
Majumder et al. (2018)
 Late fusion:
Gunes et al. (2007): Multimodal emotion recognition from expressive faces, body gestures
Lee et al. (2018): Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
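A minimal sketch contrasting the two fusion styles on fixed-length per-modality features; the logistic-regression classifiers and probability averaging are illustrative choices, not the methods of the papers cited above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_early_fusion(f_text, f_audio, f_video, y):
    # Early fusion: concatenate per-modality features, then train one classifier.
    X = np.concatenate([f_text, f_audio, f_video], axis=1)
    return LogisticRegression(max_iter=1000).fit(X, y)

def late_fusion_predict(models, feats):
    # Late fusion: one classifier per modality; average their predicted
    # class probabilities at decision time.
    probs = [m.predict_proba(f) for m, f in zip(models, feats)]
    return np.mean(probs, axis=0).argmax(axis=1)
```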
RELATED WORK
 Multimodality comparison

Dataset   | Method                                     | Modalities                     | F1 score | MA
----------|--------------------------------------------|--------------------------------|----------|-------
IEMOCAP   | Kim et al. (2013), Deep Belief Network     | Motion capture, audio, video   |          | 72.8%
IEMOCAP   | Yoon et al. (2019), multi-hop attention    | Text and speech                |          | 77.6%
IEMOCAP   | Majumder et al. (2018)                     | Text, audio, and video         |          | 76.5%
CMU-MOSEI | Zadeh et al. (2018), Dynamic Fusion Graph  | Language, vision, and acoustic | 76.3%    |
CMU-MOSEI | Lee et al. (2018)                          | Text and speech                | 89%      | 84.08%
CMU-MOSEI | Sahay et al. (2018), Tensor Fusion Network | Text and audio                 | 66.8%    |
SOLUTION
The general diagram of M3ER
MODALITIES CHECK
 Purpose: filter out ineffectual inputs to improve accuracy on real-world data
Use Canonical Correlation Analysis (CCA) to compute the correlation score, ρ, of every pair of input modalities (see the sketch below):
o Compute the correlation score for the pair {𝑓𝑖, 𝑓𝑗}
o Check it against an empirically chosen threshold (τ)
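A minimal sketch of this check, assuming each modality arrives as an (n_samples × dim) feature matrix; the threshold value used below is an illustrative placeholder, not the paper's setting:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def modality_check(f_i, f_j, tau=0.1):
    """Return True if the modality pair {f_i, f_j} is deemed effectual."""
    # Project both feature matrices onto their first canonical directions.
    u, v = CCA(n_components=1).fit_transform(f_i, f_j)
    # Correlation score rho between the canonical variates.
    rho = np.corrcoef(u[:, 0], v[:, 0])[0, 1]
    # Effectual iff rho clears the empirically chosen threshold tau.
    return rho >= tau
```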
REGENERATING PROXY FEATURE VECTORS
 Purpose: reduce feature noise by regenerating proxy feature vectors for the ineffectual (missing or corrupted) modalities
Find v_j = argmin_j d(v_j, 𝑓𝑖), where d(·, ·) is any distance metric
Compute constants a_i ∈ ℝ by solving the following linear system:
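The linear system itself did not survive extraction. A plausible reconstruction, assuming (as an illustration, not verbatim from the slide) that the observed effectual feature is written as a linear combination of its k nearest training vectors, f ≈ Σ_i a_i v_i, and that the same coefficients applied to the paired training bank of the missing modality yield the proxy p = Σ_i a_i w_i. The names V, W, and k below are hypothetical:

```python
import numpy as np

def regenerate_proxy(f, V, W, k=5):
    """Regenerate a proxy vector for an ineffectual modality (sketch)."""
    # 1) Nearest training vectors v_j to the observed effectual feature f,
    #    using Euclidean distance as the metric d.
    idx = np.argsort(np.linalg.norm(V - f, axis=1))[:k]
    V_k, W_k = V[idx], W[idx]
    # 2) Constants a_i from the linear system f ~ sum_i a_i * v_i,
    #    solved in the least-squares sense.
    a, *_ = np.linalg.lstsq(V_k.T, f, rcond=None)
    # 3) Proxy feature built from the paired bank of the missing modality.
    return W_k.T @ a
```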
MULTIPLICATIVE MODALITY FUSION
 Idea: explicitly suppress the weaker (less expressive) modalities, which indirectly boosts the stronger (more expressive) modalities
The loss for the 𝑖th modality:
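The formula did not survive extraction. A hedged reconstruction, assuming the slide shows the multiplicative combination loss of Liu et al. (2018) that M3ER builds on; the exact form is an assumption, not taken from the slide:

```latex
% Loss for modality i under multiplicative combination (assumed form):
% p_j^{(y)} is modality j's predicted probability of the true class y,
% M is the number of modalities, beta a down-weighting hyperparameter.
\mathcal{L}_i = -\Big(\prod_{j \neq i} \big(1 - p_j^{(y)}\big)\Big)^{\beta/(M-1)} \log p_i^{(y)}
```

The product term shrinks the penalty on modality i whenever the other modalities already classify the sample well, which is what suppresses weaker modalities relative to stronger ones.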
MODALITY COMBINATION
 Requirements:
o Be able to process sophisticated, data-driven, in-the-wild sources (CMU-MOSEI, YouTube, …) that contain noise, occlusion, …
o Increase reliability
 Proposed combination (see the sketch after this list):
o Use three single-hidden-layer LSTMs, one per modality, each with output dimension 32.
o Then use multiplicative fusion to combine the three 32-dimensional feature vectors.
o This fused feature vector is concatenated with the final value of the memory variable, and the resultant 160-dimensional feature vector is passed through a 64-dimensional fully connected layer followed by a 6-dimensional fully connected layer to generate the network outputs.
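A hedged PyTorch sketch of this combination: the 32-dimensional LSTM outputs, the 160-dimensional concatenated vector, and the 64- and 6-dimensional fully connected layers follow the slide, while the input dimensions, the 128-dimensional memory variable, and the element-wise product standing in for multiplicative fusion are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Per-modality LSTMs -> multiplicative fusion -> fully connected head."""

    def __init__(self, text_dim=300, audio_dim=74, video_dim=35, mem_dim=128):
        super().__init__()
        # One single-hidden-layer LSTM per modality, output dimension 32.
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, 32, batch_first=True) for d in (text_dim, audio_dim, video_dim)]
        )
        # 32 (fused) + 128 (assumed memory size) = 160, matching the slide.
        self.head = nn.Sequential(
            nn.Linear(32 + mem_dim, 64), nn.ReLU(), nn.Linear(64, 6)
        )

    def forward(self, seqs, memory):
        # seqs: three (batch, time, dim) tensors; memory: (batch, mem_dim).
        finals = [lstm(x)[0][:, -1, :] for lstm, x in zip(self.lstms, seqs)]
        # Element-wise product as a simple stand-in for multiplicative fusion.
        fused = finals[0] * finals[1] * finals[2]
        return self.head(torch.cat([fused, memory], dim=1))
```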
EXPERIMENTS
 Feature extraction:
 Text (ft): pre-trained 300-dimensional GloVe word embeddings (sketched below)
 Audio: the COVAREP software (Degottex et al., 2014) extracts acoustic features including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients.
 Video: a combination of face embeddings obtained from state-of-the-art facial recognition models, facial action units, and facial landmarks for CMU-MOSEI
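For the text stream, a minimal sketch of mapping tokens to the pre-trained 300-dimensional GloVe vectors; the file name and the zero-vector fallback for out-of-vocabulary words are assumptions, not the paper's stated choices:

```python
import numpy as np

def load_glove(path="glove.840B.300d.txt"):
    # Each line of a GloVe text file is: word followed by 300 floats.
    vecs = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *nums = line.rstrip().split(" ")
            vecs[word] = np.asarray(nums, dtype=np.float32)
    return vecs

def embed_tokens(tokens, vecs, dim=300):
    # Out-of-vocabulary tokens fall back to a zero vector (assumption).
    return np.stack([vecs.get(t, np.zeros(dim, dtype=np.float32)) for t in tokens])
```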
EVALUATION
LIMITATION
• Often confuses between certain class labels
• There is no absolute precision of the human perception of emotion in
an instant moment
• May consider adding context to emotional recognition
THANK YOU
ENA HO

Editor's Notes

  • #7: Classification pipeline of the proposed method. Once the visual and audio features are extracted, we construct a radial basis function (RBF) kernel from each descriptor. We then use MKL to optimally combine the feature kernels as input to the SVM classifier.
  • #8: A direct way to learn the relationship between these two feature vectors would be a shallow model, i.e., a simple concatenation of the two vectors. However, since the correlations between feature vectors from speech and text are highly non-linear, it is difficult for a shallow model to properly learn multimodal representations. Therefore, we utilize trainable attention mechanisms to learn non-linear correlations between these feature vectors. Attention mechanisms also help retain information in the time domain by forming temporal embeddings between the two feature vectors. 2: Using the cross-validation method to integrate.