THE THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-20)
MULTIMODAL (MULTI-INPUT) EMOTION RECOGNITION
 Reasons:
o Richer information: cues from different modalities can augment or complement each other, and hence lead to more sophisticated inference algorithms.
o Robustness to Sensor Noise: information from different modalities captured through sensors can often be corrupted by signal noise, be missing altogether when the particular modality is not expressed, or fail to be captured due to occlusion, sensor artifacts, etc. We call such modalities ineffectual. Ineffectual modalities are especially prevalent in in-the-wild datasets.
DATASET
 IEMOCAP (2008)
 CMU-MOSEI (2018)
DATASET
 Comparison between CMU-MOSEI and IEMOCAP
CHALLENGE
 Challenges:
o Deciding which modalities should be combined, and how
o Lack of agreement on the most efficient mechanism for combining (fusing) multiple modalities
TECHNIQUES
 Early fusion:
Sikka et al. (2013): Multiple Kernel Learning for Emotion Recognition in the Wild
Majumder et al. (2018)
 Late fusion:
Gunes et al. (2007): Multimodal emotion recognition from expressive faces, body gestures
Lee et al. (2018): Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
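A minimal sketch contrasting the two fusion styles on fixed-length per-modality features; the logistic-regression classifiers and probability averaging are illustrative choices, not the methods of the papers cited above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_early_fusion(f_text, f_audio, f_video, y):
    # Early fusion: concatenate per-modality features, then train one classifier.
    X = np.concatenate([f_text, f_audio, f_video], axis=1)
    return LogisticRegression(max_iter=1000).fit(X, y)

def late_fusion_predict(models, feats):
    # Late fusion: one classifier per modality; average their predicted
    # class probabilities at decision time.
    probs = [m.predict_proba(f) for m, f in zip(models, feats)]
    return np.mean(probs, axis=0).argmax(axis=1)
```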
RELATED WORK
 Multimodality comparison

Dataset   | Method                                     | Modalities                     | F1 score | MA
----------|--------------------------------------------|--------------------------------|----------|-------
IEMOCAP   | Kim et al. (2013), Deep Belief Network     | Motion capture, audio, video   |          | 72.8%
IEMOCAP   | Yoon et al. (2019), multi-hop attention    | Text and speech                |          | 77.6%
IEMOCAP   | Majumder et al. (2018)                     | Text, audio, and video         |          | 76.5%
CMU-MOSEI | Zadeh et al. (2018), Dynamic Fusion Graph  | Language, vision, and acoustic | 76.3%    |
CMU-MOSEI | Lee et al. (2018)                          | Text and speech                | 89%      | 84.08%
CMU-MOSEI | Sahay et al. (2018), Tensor Fusion Network | Text and audio                 | 66.8%    |
SOLUTION
The general diagram of M3ER
MODALITIES CHECK
 Purpose: filter out ineffectual inputs to improve accuracy on real-world data
Use Canonical Correlation Analysis (CCA) to compute the correlation score, ρ, of every pair of input modalities (see the sketch below):
o Compute the correlation score for the pair {𝑓𝑖, 𝑓𝑗}
o Check it against an empirically chosen threshold (τ)
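A minimal sketch of this check, assuming each modality arrives as an (n_samples × dim) feature matrix; the threshold value used below is an illustrative placeholder, not the paper's setting:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def modality_check(f_i, f_j, tau=0.1):
    """Return True if the modality pair {f_i, f_j} is deemed effectual."""
    # Project both feature matrices onto their first canonical directions.
    u, v = CCA(n_components=1).fit_transform(f_i, f_j)
    # Correlation score rho between the canonical variates.
    rho = np.corrcoef(u[:, 0], v[:, 0])[0, 1]
    # Effectual iff rho clears the empirically chosen threshold tau.
    return rho >= tau
```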
REGENERATING PROXY FEATURE VECTORS
 Purpose: reduce feature noise by regenerating proxy feature vectors for the ineffectual (missing or corrupted) modalities
Find v_j = argmin_j d(v_j, 𝑓𝑖), where d(·, ·) is any distance metric
Compute constants a_i ∈ ℝ by solving the following linear system:
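The linear system itself did not survive extraction. A plausible reconstruction, assuming (as an illustration, not verbatim from the slide) that the observed effectual feature is written as a linear combination of its k nearest training vectors, f ≈ Σ_i a_i v_i, and that the same coefficients applied to the paired training bank of the missing modality yield the proxy p = Σ_i a_i w_i. The names V, W, and k below are hypothetical:

```python
import numpy as np

def regenerate_proxy(f, V, W, k=5):
    """Regenerate a proxy vector for an ineffectual modality (sketch)."""
    # 1) Nearest training vectors v_j to the observed effectual feature f,
    #    using Euclidean distance as the metric d.
    idx = np.argsort(np.linalg.norm(V - f, axis=1))[:k]
    V_k, W_k = V[idx], W[idx]
    # 2) Constants a_i from the linear system f ~ sum_i a_i * v_i,
    #    solved in the least-squares sense.
    a, *_ = np.linalg.lstsq(V_k.T, f, rcond=None)
    # 3) Proxy feature built from the paired bank of the missing modality.
    return W_k.T @ a
```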
MULTIPLICATIVE MODALITY FUSION
 Idea: explicitly suppress the weaker (less expressive) modalities, which indirectly boosts the stronger (more expressive) modalities
The loss for the 𝑖th modality:
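The formula did not survive extraction. A hedged reconstruction, assuming the slide shows the multiplicative combination loss of Liu et al. (2018) that M3ER builds on; the exact form is an assumption, not taken from the slide:

```latex
% Loss for modality i under multiplicative combination (assumed form):
% p_j^{(y)} is modality j's predicted probability of the true class y,
% M is the number of modalities, beta a down-weighting hyperparameter.
\mathcal{L}_i = -\Big(\prod_{j \neq i} \big(1 - p_j^{(y)}\big)\Big)^{\beta/(M-1)} \log p_i^{(y)}
```

The product term shrinks the penalty on modality i whenever the other modalities already classify the sample well, which is what suppresses weaker modalities relative to stronger ones.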
MODALITY COMBINATION
 Requirements:
o Be able to process sophisticated, data-driven, in-the-wild sources (CMU-MOSEI, YouTube, …) that contain noise, occlusion, …
o Increase reliability
 Proposed combination (see the sketch after this list):
o Use three single-hidden-layer LSTMs, one per modality, each with output dimension 32.
o Then use multiplicative fusion to combine the three 32-dimensional feature vectors.
o This fused feature vector is concatenated with the final value of the memory variable, and the resultant 160-dimensional feature vector is passed through a 64-dimensional fully connected layer followed by a 6-dimensional fully connected layer to generate the network outputs.
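A hedged PyTorch sketch of this combination: the 32-dimensional LSTM outputs, the 160-dimensional concatenated vector, and the 64- and 6-dimensional fully connected layers follow the slide, while the input dimensions, the 128-dimensional memory variable, and the element-wise product standing in for multiplicative fusion are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Per-modality LSTMs -> multiplicative fusion -> fully connected head."""

    def __init__(self, text_dim=300, audio_dim=74, video_dim=35, mem_dim=128):
        super().__init__()
        # One single-hidden-layer LSTM per modality, output dimension 32.
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, 32, batch_first=True) for d in (text_dim, audio_dim, video_dim)]
        )
        # 32 (fused) + 128 (assumed memory size) = 160, matching the slide.
        self.head = nn.Sequential(
            nn.Linear(32 + mem_dim, 64), nn.ReLU(), nn.Linear(64, 6)
        )

    def forward(self, seqs, memory):
        # seqs: three (batch, time, dim) tensors; memory: (batch, mem_dim).
        finals = [lstm(x)[0][:, -1, :] for lstm, x in zip(self.lstms, seqs)]
        # Element-wise product as a simple stand-in for multiplicative fusion.
        fused = finals[0] * finals[1] * finals[2]
        return self.head(torch.cat([fused, memory], dim=1))
```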
EXPERIMENTS
 Feature extraction:
 Text (ft): pre-trained 300-dimensional GloVe word embeddings (sketched below)
 Audio: the COVAREP software (Degottex et al., 2014) extracts acoustic features including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients.
 Video: a combination of face embeddings obtained from state-of-the-art facial recognition models, facial action units, and facial landmarks for CMU-MOSEI
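For the text stream, a minimal sketch of mapping tokens to the pre-trained 300-dimensional GloVe vectors; the file name and the zero-vector fallback for out-of-vocabulary words are assumptions, not the paper's stated choices:

```python
import numpy as np

def load_glove(path="glove.840B.300d.txt"):
    # Each line of a GloVe text file is: word followed by 300 floats.
    vecs = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *nums = line.rstrip().split(" ")
            vecs[word] = np.asarray(nums, dtype=np.float32)
    return vecs

def embed_tokens(tokens, vecs, dim=300):
    # Out-of-vocabulary tokens fall back to a zero vector (assumption).
    return np.stack([vecs.get(t, np.zeros(dim, dtype=np.float32)) for t in tokens])
```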
EVALUATION
LIMITATION
• Often confuses between certain class labels
• There is no absolute precision of the human perception of emotion in
an instant moment
• May consider adding context to emotional recognition
THANK YOU
ENA HO

Editor's Notes

  • #7: Classification pipeline of the proposed method. Once the visual and audio features are extracted, we construct a radial basis function (RBF) kernel from each descriptor. We then use MKL to optimally combine the feature kernels as input to the SVM classifier.
  • #8: A direct way to learn the relationship between these two feature vectors would be a shallow model, i.e., a simple concatenation of the two vectors. However, since the correlations between feature vectors from speech and text are highly non-linear, it is difficult for a shallow model to properly learn multimodal representations. Therefore, we utilize trainable attention mechanisms to learn non-linear correlations between these feature vectors. Attention mechanisms also help retain information in the time domain by forming temporal embeddings between the two feature vectors. 2: Using the cross-validation method to integrate.