Conversational transfer learning for emotion recognition

2022-03-18 Paper Introduction
Devamanyu Hazarika, Soujanya Poria, Roger Zimmermann, Rada Mihalcea
Journal of Information Fusion 65(2021), IF=12.975, Cited=29
Takato Hayashi

Abstract
• Recognizing emotions in conversations is a challenging task due to the presence of
contextual dependencies governed by self- and inter-personal influences.
• However, purely supervised strategies demand large amounts of annotated data, which
most of the available corpora for this task lack.
• This paper proposes TL-ERC, an approach that pre-trains a hierarchical dialogue
model on multi-turn conversations (source) and then transfers its parameters to a
conversational emotion classifier (target).
• TL-ERC improves performance and robustness under limited training data. The
model also reaches better validation performance in significantly fewer epochs.

Introduction
• Several works in the literature have indicated that emotional goals and influences act as
latent controllers in dialogues [1, 2].
• Poria et al. [3] demonstrated the interplay of several factors, such as the topic of the
conversation, speakers’ personality, argumentation logic, viewpoint, and intent, which
modulate the emotional state of the speaker and finally lead to an utterance.

Introduction
• Conversation (a) illustrates emotional inertia, which occurs through self-influences in emotional states. The character
Snorri maintains a frustrated emotional state and is not affected by the other speakers.
• Conversations (b) and (c) demonstrate the role of inter-speaker influences in emotional transitions across turns.
• In (b), the character Josh undergoes an emotional shift triggered by his counterpart’s responses.
• (c) demonstrates the effect of mirroring, which often arises due to topical agreement between speakers.

Methodology
Ø Source : generative conversation modeling
• To perform the generative task of conversation modeling, we use the Hierarchical Recurrent Encoder-Decoder (HRED)
architecture. HRED is a classic framework for seq2seq conversational response generation that models conversations in
a hierarchical fashion.
• For a given conversation context with sentences $x_1, \dots, x_t$, HRED generates the response $x_{t+1}$ as follows:
1. Sentence encoder : It encodes each sentence in the context using an encoder RNN, such that
$h_t^{enc} = f_\theta^{enc}(x_t, h_{t-1}^{enc})$

Methodology
2. Context encoder : The sentence representations are then fed into a context RNN that models the conversational context
until time step $t$ as
$h_t^{ctx} = f_\theta^{ctx}(h_t^{enc}, h_{t-1}^{ctx})$
3. Sentence decoder : Finally, an auto-regressive decoder RNN generates sentence $x_{t+1}$ conditioned on $h_t^{ctx}$, i.e.,
$p_\theta(x_{t+1} \mid x_{\le t}) = f_\theta^{dec}(x \mid h_t^{ctx}) = \prod_i f_\theta^{dec}(x_{t+1,i} \mid h_t^{ctx}, x_{t+1,<i})$
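The encoder and context recurrences above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: a simple tanh RNN cell stands in for $f_\theta^{enc}$ and $f_\theta^{ctx}$, the encoder is assumed to run over the token vectors of each sentence (as in standard HRED), and all names (`rnn_step`, `init_rnn`, `encode_sentence`, `encode_context`) are hypothetical.

```python
import math
import random


def rnn_step(W_in, W_rec, b, x, h_prev):
    """One tanh RNN step: h = tanh(W_in x + W_rec h_prev + b)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_in[j], x))
            + sum(u * hi for u, hi in zip(W_rec[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]


def init_rnn(in_dim, hid_dim, rng):
    """Random toy parameters (W_in, W_rec, b) for one RNN (hypothetical helper)."""
    W_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hid_dim)]
    W_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hid_dim)] for _ in range(hid_dim)]
    return W_in, W_rec, [0.0] * hid_dim


def encode_sentence(enc_params, token_vectors):
    """Run the encoder RNN over one sentence; return its last hidden state."""
    h = [0.0] * len(enc_params[2])
    for tok in token_vectors:
        h = rnn_step(*enc_params, tok, h)
    return h


def encode_context(enc_params, ctx_params, conversation):
    """Return the context state h_t^ctx after every turn t of the conversation."""
    h_ctx = [0.0] * len(ctx_params[2])
    states = []
    for sentence in conversation:
        h_enc = encode_sentence(enc_params, sentence)  # sentence-level summary
        h_ctx = rnn_step(*ctx_params, h_enc, h_ctx)    # conversation-level update
        states.append(h_ctx)
    return states
```

The hierarchy is visible in the two loops: the inner loop summarizes one sentence, and the outer loop carries the conversational state from turn to turn. A decoder conditioned on each returned state would complete the HRED picture.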
• With the $i$-th conversation being a sequence of utterances $C_i = [x_{i,1}, \dots, x_{i,n_i}]$, HRED trains on all the conversations in the
dataset together using the maximum likelihood estimation objective $\arg\max_\theta \sum_i \log p_\theta(C_i)$.
• We call the parameters associated with the sentence encoder $\theta_{enc}^{source}$, those associated with the context encoder
$\theta_{ctx}^{source}$, and those associated with the sentence decoder $\theta_{dec}^{source}$.
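As a rough illustration of the maximum likelihood objective, the sketch below sums log-probabilities token by token, sentence by sentence, and conversation by conversation. It assumes the usual chain-rule factorization of a conversation into its sentences, and it takes the per-token probabilities produced by the decoder as given inputs; the function names are hypothetical.

```python
import math


def sentence_log_prob(token_probs):
    """log p(x_{t+1} | x_{<=t}): sum of the decoder's per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def conversation_log_prob(sentences_token_probs):
    """log p(C_i), assuming it factorizes over the sentences of the conversation."""
    return sum(sentence_log_prob(tp) for tp in sentences_token_probs)


def corpus_objective(corpus):
    """Sum over conversations of log p(C_i); training maximizes this over theta."""
    return sum(conversation_log_prob(conv) for conv in corpus)
```

For example, a corpus with one conversation whose single sentence has token probabilities 0.5 and 0.5 gives an objective of log 0.25.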

Methodology
Ø Target : Emotion Recognition in conversation
• The input for this task is also a conversation $C_i$ with constituent utterances $x_{i,1}, \dots, x_{i,n_i}$. Each utterance $x_{i,t}$ is associated with an
emotion label $y_{i,t} \in \mathbb{Y}$.
1. Sentence encoding
• To encode each utterance in the conversation, this paper uses BERT, with its parameters represented as $\theta_{BERT}$. BERT
is chosen over the HRED sentence encoder ($\theta_{enc}^{source}$) as it provides better performance. The hidden vectors of the first
token [CLS] are taken across the considered transformer layers and mean-pooled to form the final sentence representation.
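The pooling step can be illustrated with a minimal sketch. It assumes the per-layer hidden states are already available as nested lists, with the [CLS] token at position 0; which layers are "considered" is not stated here, so `layers` is left as a parameter, and `mean_pool_cls` is a hypothetical name.

```python
def mean_pool_cls(layer_outputs, layers):
    """Average the [CLS] vector over the chosen layers.

    layer_outputs[l][k] is the hidden vector of token k at layer l;
    token 0 is assumed to be [CLS].
    """
    cls_vectors = [layer_outputs[l][0] for l in layers]
    dim = len(cls_vectors[0])
    return [sum(v[d] for v in cls_vectors) / len(cls_vectors) for d in range(dim)]
```

With three layers whose [CLS] vectors are [1, 2], [3, 4], and [5, 6], pooling over the last two layers yields [4.0, 5.0].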

Methodology
2. Context encoding
• A context encoder RNN similar to that of the source HRED model is used, with the option to transfer the learned parameters
$\theta_{ctx}^{source}$. The context RNN transforms the sequence of sentence representations into contextual hidden states.
• In this RNN, $V^{z,r,h}$, $W^{z,r,h}$, $\mathbf{b}^{z,r,h}$ are parameters of the RNN function, and $W^o$, $\mathbf{b}^o$ are additional parameters of a
dense layer. For our setup, adhering to size considerations, we consider our transfer parameters to be
$\theta_{ctx}^{source} = \{W^{z,r,h,o}, \mathbf{b}^{z,r,h,o}\}$.
3. Classification
• For each turn in the conversation, the output from the context RNN is projected to the label space, which provides
the predicted emotion for the associated utterance. Similar to HRED, we train on all the utterances in the
conversation together using the standard cross-entropy loss. For regression targets, we use the Mean Squared
Error (MSE) loss instead.
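The parameter transfer described in the context-encoding step can be sketched as a selective copy. The sketch assumes the gate names z, r, h and the dense-layer name o, and it assumes that the recurrent matrices W and biases b depend only on the hidden size, while the input matrices V depend on the input size and are therefore left at their fresh initialization. That reading of "size considerations" is an interpretation, and every name below is hypothetical.

```python
import copy
import random

GATES = ("z", "r", "h")
# Transferred entries: recurrent weights and biases of the gates plus the dense layer.
TRANSFER_KEYS = ("W_z", "W_r", "W_h", "W_o", "b_z", "b_r", "b_h", "b_o")


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_context_rnn(input_dim, hidden_dim, rng):
    """Randomly initialize a toy context RNN as a dict of named parameters."""
    params = {}
    for g in GATES:
        params[f"V_{g}"] = rand_matrix(hidden_dim, input_dim, rng)   # input-to-hidden
        params[f"W_{g}"] = rand_matrix(hidden_dim, hidden_dim, rng)  # hidden-to-hidden
        params[f"b_{g}"] = [0.0] * hidden_dim
    params["W_o"] = rand_matrix(hidden_dim, hidden_dim, rng)         # dense layer
    params["b_o"] = [0.0] * hidden_dim
    return params


def transfer_context_params(source, target, keys=TRANSFER_KEYS):
    """Return a copy of `target` whose entries in `keys` are taken from `source`."""
    merged = copy.deepcopy(target)
    for key in keys:
        merged[key] = copy.deepcopy(source[key])
    return merged
```

In this toy setting the source model may use a 4-dimensional sentence encoder and the target a 6-dimensional one: the W and b entries can be copied because both models share the hidden size, whereas the V matrices have different shapes and stay untouched.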

Datasets
Ø Source task
• The Cornell Movie Dialog Corpus is a popular collection of fictional conversations extracted from movie scripts. In this
dataset, conversations are sampled from a diverse set of 617 movies, leading to over 83k dialogues.
• The Ubuntu Dialog Corpus is a larger corpus with around 1 million dialogues, which, like the Cornell corpus, comprises
unstructured multi-turn dialogues based on Ubuntu chat logs (Internet Relay Chat).

Datasets
Ø Target task
• Primarily, this research considers the textual modality of the small-sized multimodal dataset IEMOCAP. Each
conversational video is segmented into utterances and annotated with the following emotion labels: anger, happiness,
sadness, neutral, excitement, and frustration.
• This research also analyzes results on the moderately-sized emotional dialogue dataset DailyDialog, with labeled emotions
anger, happiness, sadness, surprise, fear, disgust, and no_emotion. Unlike the spoken utterances in IEMOCAP, its
conversations are chat-based and cover daily-life topics.
• Finally, this research chooses the regression-based dataset SEMAINE, a video-based corpus of human-agent emotional
interactions labeled with valence, arousal, power, and expectancy.
Ø Metrics
• For ERC, this research uses the weighted F-score for the classification tasks on IEMOCAP and DailyDialog. For
DailyDialog, the no_emotion class is removed from the F-score calculation because it forms the large majority of labels. For the
regression task on SEMAINE, the Pearson correlation coefficient (r) is used as the metric. This research also reports the
average best epoch (BE), the epoch at which the lowest validation loss is observed. A lower BE represents the model’s ability to
reach optimum performance in fewer training epochs.
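The two evaluation metrics can be sketched in plain Python as follows. This is one possible reading, not the evaluation script used in the paper: the excluded class is dropped from the weighted average, but predictions of other labels on its examples still count as false positives. The names `weighted_f1` and `pearson_r` are hypothetical.

```python
import math
from collections import Counter


def weighted_f1(y_true, y_pred, exclude=()):
    """Support-weighted mean of per-class F1, skipping labels listed in `exclude`."""
    labels = [label for label in sorted(set(y_true)) if label not in exclude]
    support = Counter(y_true)
    total = sum(support[label] for label in labels)
    if total == 0:
        return 0.0
    score = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = support[label] - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += support[label] / total * f1
    return score


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

Calling `weighted_f1(y_true, y_pred, exclude=("no_emotion",))` mirrors the DailyDialog setting, while `pearson_r` applies to each SEMAINE dimension separately.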

Model variants and baselines
• This research experiments with different variants of TL-ERC that differ in their parameter initialization procedure.
• Next, to compare TL-ERC with the existing literature, this research selects some prior state-of-the-art models evaluated
on the target datasets:
CNN, Memnet, C-LSTM, C-LSTM+Att, CMN, DialogueRNN

Result and Analysis
• On both IEMOCAP and DailyDialog, results indicate clear and statistically significant improvements of the
models that use pre-trained weights over the randomly initialized variant.
• Similar trends are observed in the regression task based on the SEMAINE corpus. For the valence, arousal, and power
dimensions, the improvement is significant. For expectancy, the performance is only marginally better but is reached at a much lower
BE, indicating faster generalization.
• Results also indicate that the pre-trained models are significantly more robust against limited training resources compared
to models trained from scratch.

Result and Analysis
• The effect of bias in random splits is investigated. The relative performance within each split follows similar trends of
improvement for TL-based models.
• The trace of the validation loss indicates that initializing with pre-trained weights leads to faster convergence to
the best validation loss.
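Faster convergence can be read off a validation-loss trace by locating the epoch with the lowest loss. The sketch below assumes epochs are counted from 1 and that the reported value is a plain average over runs; both conventions and both function names are assumptions made for illustration.

```python
def best_epoch(val_losses):
    """1-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1


def average_best_epoch(runs):
    """Mean best epoch over several runs, each given as a list of validation losses."""
    return sum(best_epoch(losses) for losses in runs) / len(runs)
```

A trace such as [0.9, 0.5, 0.6] has its best epoch at 2, and a model whose traces bottom out earlier gets a lower average.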

Result and Analysis
• A comparative study is conducted between models initialized with HRED-based sentence encoders
($\theta_{enc}^{source}$) and those using the BERT encoder ($\theta_{BERT}$). Results demonstrate that BERT provides better representations, which
leads to better performance.
• Results for various baselines are also provided. As seen, the proposed TL-ERC comfortably outperforms both
non-contextual and contextual baselines.
