Conversational transfer learning for emotion recognition
2022-03-18 Paper Introduction
Devamanyu Hazarika, Soujanya Poria, Roger Zimmermann, Rada Mihalcea
Information Fusion 65 (2021), IF=12.975, Cited=29
Takato Hayashi
Abstract
• Recognizing emotions in conversations is a challenging task due to the presence of
contextual dependencies governed by self- and inter-personal influences.
• However, purely supervised strategies demand large amounts of annotated data, which
most of the available corpora for this task lack.
• This paper proposes an approach, TL-ERC, which pre-trains a hierarchical dialogue
model on multi-turn conversations (source) and then transfers its parameters to a
conversational emotion classifier (target).
• TL-ERC improves performance and robustness against limited training data. The
model also reaches better validation performance in significantly fewer epochs.
Introduction
• Several works in the literature have indicated that emotional goals and influences act as
latent controllers in dialogues [1, 2].
• Poria et al. [3] demonstrated the interplay of several factors, such as the topic of the
conversation, speakers’ personality, argumentation-logic, viewpoint, and intent, which
modulate the emotional state of the speaker and finally lead to an utterance.
• Conversation (a) illustrates emotional inertia, which occurs through self-influences in emotional states. The character
Snorri maintains a frustrated emotional state and is not affected by the other speakers.
• Conversations (b) and (c) demonstrate the role of inter-speaker influences in emotional transitions across turns.
• In (b), the character Josh undergoes an emotional shift triggered by his counterpart's responses.
• (c) demonstrates the effect of mirroring, which often arises from topical agreement between speakers.
Methodology
Ø Source : generative conversation modeling
• To perform the generative task of conversation modeling, we use the Hierarchical Recurrent Encoder-Decoder (HRED)
architecture. HRED is a classic framework for seq2seq conversational response generation that models conversations in
a hierarchical fashion.
• For a given conversation context with sentences $x_1, \cdots, x_t$, HRED generates the response $x_{t+1}$ as follows:
1. Sentence encoder : It encodes each sentence in the context using an encoder RNN, such that
$h_t^{enc} = f_\theta^{enc}(x_t, h_{t-1}^{enc})$
2. Context encoder : The sentence representations are then fed into a context RNN that models the conversational context
until time step $t$ as
$h_t^{ctx} = f_\theta^{ctx}(h_t^{enc}, h_{t-1}^{ctx})$
3. Sentence decoder : Finally, an auto-regressive decoder RNN generates sentence $x_{t+1}$ conditioned on $h_t^{ctx}$, i.e.,
$p_\theta(x_{t+1} \mid x_{\le t}) = f_\theta^{dec}(x \mid h_t^{ctx}) = \prod_i f_\theta^{dec}(x_{t+1,i} \mid h_t^{ctx}, x_{t+1,<i})$
• With the $i$th conversation being a sequence of utterances $C_i = [x_{i,1}, \cdots, x_{i,n_i}]$, HRED trains on all the
conversations in the dataset together using the maximum likelihood estimation objective $\arg\max_\theta \sum_i \log p_\theta(C_i)$.
• We denote the parameters of the sentence encoder as $\theta_{enc}^{source}$, those of the context encoder as
$\theta_{ctx}^{source}$, and those of the sentence decoder as $\theta_{dec}^{source}$.
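The two encoder levels of HRED can be illustrated with a minimal sketch. It assumes a plain tanh RNN as a stand-in for $f_\theta$, random toy weights, and toy token vectors; the names `rnn_step`, `encode_conversation`, and the `params` keys are hypothetical, and the decoder is omitted. It also assumes that the sentence encoder runs over the tokens of each sentence and hands its last state to the context RNN.

```python
import math
import random

def rnn_step(x, h, w_in, w_rec):
    """One step of a plain tanh RNN (stand-in for f_theta): h' = tanh(W_in x + W_rec h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[k], x))
            + sum(w * hi for w, hi in zip(w_rec[k], h))
        )
        for k in range(len(h))
    ]

def rand_matrix(rows, cols, rng):
    """Small random weight matrix, used only to make the sketch runnable."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def encode_conversation(conversation, params, hidden):
    """Return one context state h_t^ctx per sentence.

    conversation: list of sentences, each a list of token vectors.
    """
    h_ctx = [0.0] * hidden
    context_states = []
    for sentence in conversation:
        # Sentence encoder: read the tokens, keep the last hidden state.
        h_enc = [0.0] * hidden
        for token in sentence:
            h_enc = rnn_step(token, h_enc, params["enc_in"], params["enc_rec"])
        # Context encoder: update the conversation state with the sentence vector.
        h_ctx = rnn_step(h_enc, h_ctx, params["ctx_in"], params["ctx_rec"])
        context_states.append(h_ctx)
    return context_states

rng = random.Random(0)
emb, hidden = 3, 4  # toy sizes (assumed)
params = {
    "enc_in": rand_matrix(hidden, emb, rng),
    "enc_rec": rand_matrix(hidden, hidden, rng),
    "ctx_in": rand_matrix(hidden, hidden, rng),
    "ctx_rec": rand_matrix(hidden, hidden, rng),
}
# A toy conversation with three sentences of 2, 5, and 3 tokens.
conversation = [
    [[rng.random() for _ in range(emb)] for _ in range(n_tokens)]
    for n_tokens in (2, 5, 3)
]
states = encode_conversation(conversation, params, hidden)
print(len(states), len(states[0]))  # 3 4
```

In a full HRED model, each context state would condition a decoder that scores the next sentence token by token, and the summed log-probabilities would form the maximum likelihood objective.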
Ø Target : Emotion Recognition in conversation
• The input for this task is also a conversation $C_i$ with constituent utterances $x_{i,1}, \cdots, x_{i,n_i}$. Each utterance
$x_{i,j}$ is associated with an emotion label $y_{i,j} \in \mathbb{Y}$.
1. Sentence encoding
• To encode each utterance in the conversation, this paper uses BERT, with its parameters represented as $\theta_{BERT}$.
BERT is chosen over the HRED sentence encoder ($\theta_{enc}^{source}$) as it provides better performance. The hidden
vectors of the first token [CLS] across the considered transformer layers are mean-pooled to form the final sentence representation.
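The pooling step can be sketched as follows. This is an illustrative sketch only: the function name `cls_mean_pool`, the nested-list layout of the layer outputs, and the choice of which layers count as "considered" are assumptions, since the slides do not specify them.

```python
def cls_mean_pool(layer_outputs, layers):
    """Average the [CLS] vector over the selected transformer layers.

    layer_outputs[l][t] is the hidden vector of token t at layer l;
    token 0 is assumed to be [CLS].
    """
    cls_vectors = [layer_outputs[l][0] for l in layers]
    dim = len(cls_vectors[0])
    return [sum(v[d] for v in cls_vectors) / len(cls_vectors) for d in range(dim)]

# Toy input: 3 layers, 2 tokens per sentence, hidden size 2.
layer_outputs = [
    [[1.0, 2.0], [9.0, 9.0]],
    [[3.0, 4.0], [9.0, 9.0]],
    [[5.0, 6.0], [9.0, 9.0]],
]
sentence_vector = cls_mean_pool(layer_outputs, layers=[1, 2])
print(sentence_vector)  # [4.0, 5.0]
```

Only the first token of each selected layer contributes, so the other token vectors (the 9.0 entries) do not affect the result.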
2. Context encoding
• A context encoder RNN similar to the one in the source HRED model is used, with the option to transfer the learned
parameters $\theta_{ctx}^{source}$. The context RNN transforms the sentence representations into context-aware hidden states.
• Here, $V_{z,r,h}$, $W_{z,r,h}$, $\mathbf{b}_{z,r,h}$ are parameters of the RNN function and $W_o$, $\mathbf{b}_o$ are additional
parameters of a dense layer. For our setup, adhering to size considerations, we consider our transfer parameters to be
$\theta_{ctx}^{source} = \{W_{z,r,h,o}, \mathbf{b}_{z,r,h,o}\}$.
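The selective transfer described above can be sketched as a copy between two parameter dictionaries. This is a minimal sketch under assumptions: the gate names z, r, h are read as a GRU-style convention, the key names (`V_z`, `W_o`, ...), the helper functions, and the toy dimensions are hypothetical, and the dense layer is assumed to map hidden size to hidden size. Only the recurrent matrices W and the biases b are copied; the input matrices V keep their fresh initialisation because their shape depends on the sentence encoder in use.

```python
import copy
import random

GATES = ("z", "r", "h")  # GRU-style gate names (assumed convention)

def init_context_params(input_dim, hidden_dim, seed):
    """Randomly initialise V (input), W (recurrent), b (bias) and dense-layer parameters."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in GATES:
        params[f"V_{gate}"] = matrix(hidden_dim, input_dim)   # depends on encoder size
        params[f"W_{gate}"] = matrix(hidden_dim, hidden_dim)  # independent of encoder size
        params[f"b_{gate}"] = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    params["W_o"] = matrix(hidden_dim, hidden_dim)
    params["b_o"] = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    return params

def transfer_context_params(source, target):
    """Copy W_{z,r,h,o} and b_{z,r,h,o} from source; keep the target's V matrices."""
    transferred = dict(target)
    for name, value in source.items():
        if name.startswith(("W_", "b_")):
            transferred[name] = copy.deepcopy(value)
    return transferred

# Toy sizes: the source and target sentence encoders produce vectors of different width.
source = init_context_params(input_dim=6, hidden_dim=4, seed=1)
target = init_context_params(input_dim=12, hidden_dim=4, seed=2)
transferred = transfer_context_params(source, target)

print(transferred["W_z"] == source["W_z"])  # True
print(transferred["V_z"] == target["V_z"])  # True
```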
3. Classification
• For each turn in the conversation, the output of the context RNN is projected to the label space, which provides
the predicted emotion for the associated utterance. Similar to HRED, we train on all the utterances in the
conversation together using the standard cross-entropy loss. For regression targets, we use the mean squared
error (MSE) loss instead.
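The two training losses named above can be sketched in plain Python. The function names are hypothetical, and averaging over the utterances of a conversation (rather than summing) is an assumption of this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one utterance's label scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def conversation_cross_entropy(logits_per_utterance, labels):
    """Mean cross-entropy over all utterances of one conversation."""
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(logits_per_utterance, labels)
    ]
    return sum(losses) / len(losses)

def conversation_mse(predictions, targets):
    """Mean squared error for regression targets such as valence or arousal."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Two utterances, two emotion classes, uninformative scores: loss equals log(2).
ce = conversation_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
mse = conversation_mse([1.0, 2.0], [1.0, 4.0])
print(mse)  # 2.0
```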
Datasets
Ø Source task
• The Cornell movie dialog corpus is a popular collection of fictional conversations extracted from movie scripts. In this
dataset, conversations are sampled from a diverse set of 617 movies, leading to over 83k dialogues.
• The Ubuntu dialog corpus is a larger corpus with around 1 million dialogues, which, like the Cornell corpus, comprises
unstructured multi-turn dialogues based on Ubuntu chat logs (Internet Relay Chat).
Ø Target task
• Primarily, this research considers the textual modality of a small-sized multimodal dataset, IEMOCAP. Each
conversational video is segmented into utterances and annotated with the following emotion labels: anger, happiness,
sadness, neutral, excitement, and frustration.
• This research also analyzes results on a moderately-sized emotional dialogue dataset, DailyDialog, with labeled emotions:
anger, happiness, sadness, surprise, fear, disgust, and no_emotion. Unlike the spoken utterances in IEMOCAP, the
conversations are chat-based and cover daily-life topics.
• Finally, this research chooses a regression-based dataset, SEMAINE, with labeled valence, arousal, power, and expectancy,
which is a video-based corpus of human-agent emotional interactions.
Ø Metrics
• For ERC, this research uses the weighted F-score for the classification tasks on IEMOCAP and DailyDialog. For
DailyDialog, the no_emotion class is removed from the F-score calculation because it is the overwhelming majority class. For the
regression task on SEMAINE, the Pearson correlation coefficient (r) is used as the metric. This research also reports the
average best epoch (BE), i.e. the epoch at which the lowest validation loss is observed. A lower BE indicates that the model
reaches its optimum performance in fewer training epochs.
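The three metrics can be sketched in plain Python as follows. The function names are hypothetical, and the handling of the excluded class is an assumption of this sketch: excluded labels are dropped from the per-class average, but their utterances still count as false positives or false negatives for the remaining classes.

```python
import math

def weighted_f1(y_true, y_pred, exclude=()):
    """Support-weighted F-score over all labels except those in `exclude`."""
    labels = sorted(set(y_true) - set(exclude))
    total_support = sum(1 for y in y_true if y in labels)
    if total_support == 0:
        return 0.0
    score = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * (tp + fn) / total_support
    return score

def pearson_r(xs, ys):
    """Pearson correlation coefficient between predictions and targets."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def best_epoch(val_losses):
    """1-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

y_true = ["anger", "happiness", "no_emotion", "no_emotion", "anger"]
y_pred = ["anger", "anger", "no_emotion", "happiness", "anger"]
f_score = weighted_f1(y_true, y_pred, exclude=("no_emotion",))

# Average BE over several runs (toy validation-loss traces).
runs = [[0.9, 0.5, 0.7], [0.8, 0.6, 0.4, 0.5]]
average_be = sum(best_epoch(r) for r in runs) / len(runs)
print(average_be)  # 2.5
```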
Model variants and baselines
• This research experiments with different variants of TL-ERC, which differ in the parameter initialization procedure.
• Next, to compare TL-ERC with the existing literature, this research selects some prior state-of-the-art models evaluated
on the target datasets:
CNN, Memnet, C-LSTM, C-LSTM+Att, CMN, DialogueRNN
Results and Analysis
• On both IEMOCAP and DailyDialog, results indicate clear and statistically significant improvements of the
models that use pre-trained weights over the randomly initialized variant.
• Similar trends are observed in the regression task on the SEMAINE corpus. For the valence, arousal, and power
dimensions, the improvement is significant. For expectancy, the performance is marginally better but at a much lower
BE, indicating faster generalization.
• Results also indicate that the pre-trained models are significantly more robust against limited training resources compared
to models trained from scratch.
• The effect of bias in random splits is investigated: the relative performance within each split follows similar trends of
improvement for TL-based models.
• The trace of the validation loss indicates that initializing with pre-trained weights leads to faster convergence in terms of
the best validation loss.
• A comparative study is conducted between models initialized with HRED-based sentence encoders
($\theta_{enc}^{source}$) and models using the BERT encoder ($\theta_{BERT}$). Results demonstrate that BERT provides better
representations, which leads to better performance.
• Results for various baselines are also provided. The proposed TL-ERC comfortably outperforms both
non-contextual and contextual baselines.
