Recognizing Emotions in Spontaneous Speech
[Keynote Address | CSCITA 2017 | Mumbai]
Dr. Sunil Kumar Kopparapu
SunilKumar.Kopparapu@TCS.COM
TCS Innovation Labs - Mumbai
Tata Consultancy Services Ltd, Yantra Park,
Thane (West), Maharashtra 400601,
INDIA.
April 8, 2017
Talk Motivated by ...
RMS Do You Mean What You Say? Recognizing Emotions in
Spontaneous Speech, Invited Talk, ICNETS2, Chennai,
March 2017.
[R]upayan, [M]eghna, [S]unil [Published in 2017]
Talk Motivated by ...
RS ”Improved Speech Emotion Recognition using Error
Correcting Codes”, ICME 2016 Workshop on Affective
Social Multimedia Computing 2016, Seattle, USA.
RS ”Validating Is ECC-ANN Combination Equivalent to
DNN? for Speech Emotion Recognition”, 2016 IEEE
International Conference on Systems, Man, and
Cybernetics (SMC), Budapest, Hungary, 2016.
RMS ”Knowledge based framework for intelligent emotion
recognition in spontaneous speech”, 20th International
Conference on Knowledge Based and Intelligent
Information and Engineering Systems, KES2016, York,
UK, Sep 2016
RMS ”Mining call center conversations exhibiting similar
affective states”, 30th Pacific Asia Conference on
Language, Information and Computation (PACLIC 30),
Seoul, South Korea (28-30 October 2016)
RMS ”Spontaneous Speech Emotion Recognition Using Prior
Knowledge”, 23rd International Conference on Pattern
Recognition (ICPR) in Cancun, Mexico, Dec 2016
[R]upayan, [M]eghna, [S]unil [Published in 2016]
Talk Motivated by ...
Created using http://tagcrowd.com/
Acknowledging the efforts of Rupayan and Meghna
Title (Recognize - Emotion - Spontaneous Speech)
• Recognizing
assigning a predefined label to a sample from a set of known labels!
(example) Given n labels L = {l1, l2, · · · , ln}; assign a label to a sample
• Emotions
(fact) no consensus on a definition
feelings or thoughts that arise spontaneously instead of conscious
thought
(Wiki) brief conscious experience characterized by arousal (intense
mental activity) and valence | mood (a high degree of pleasure or
displeasure)
(example) happiness (positive mood, high arousal), anger (negative
mood, high arousal) and sadness (negative mood, low arousal).
• Spontaneous Speech
(example) day to day conversation
(opposite!) not acted!
natural language speech (lax grammar)
Importance of Emotion
Self Help | Car Insurance
• System (:-)): /Thank you for calling xyz Insurance. How may I help
you?/
• Customer (:-(): /I just had an accident and need to claim insurance
and tow the car/
Reactions
• System (:-)): /Can I have your 18-digit insurance policy number?/
• System (:-)): /Sorry to hear this. Where is your vehicle currently?/
• System (:-(): /Sorry to hear this. Where is your vehicle currently?/
Detecting emotions in conversations can lead to a better user experience (UX)
Information in Speech
Audio −→ 1. Play 2. Play
• Non-linguistic, (who said it)
gender (Male),
emotional states,
speaker name (Mahatma Gandhi)
• Linguistic (what (s)he said)
Language name (English) and
what was said (written text, ”truth persists, in the midst of darkness
light persists”)
• Para-linguistic (how well said, quality; manner, clarity, accent)
Deliberately | Subconsciously added by speaker; not inferable from
written text.
Goal
Automatically extract information in speech signal
(≡ Speech, Speaker, Gender, Emotion, Language, · · · → Recognition)
What do we do in Speech and Natural Language Processing? Anything else?
Work in Speech and NL: A Bird’s eye view
#1 Robust Speaker and Speech Recognition
#2 Emotion Recognition in Spontaneous Speech
#3 Remote Speech Therapy
#4 SuNo - Speech and NL for Self Help
#5 Speech Analysis for Laryngeal cancer detection
Indian Scenario
• Large number of Languages (dialects, accents)
• Noisy Environment
• Use of more than one language in the same sentence
• Spoken Variations
Kopparapu | Yerraguntla | Thangal Kunju Musaliar
• People-Machine interaction comfort
• Non-availability of speech corpus
Work in Speech and NL: A Bird’s eye view
Emotion in Audio
Will concentrate on Emotion in Spontaneous Speech (e.g., Call Center Audio)!
Emotion | Sentiment from Audio
#1 Speech → Text → Emotion (2 Steps)
Step 1 Speech to text using automatic speech recognition (ASR)
Step 2 Analyzing linguistic (text) content
Based on what is spoken
Learning | Classification (text)
Language dependent; dependent on performance of ASR
#2 Speech → Emotion (1 Step)
Step 1 Analyzing non-linguistic (audio) content
based on how spoken
Extraction of features (pitch, energy, spectral, speaking rate etc.)
from audio
Learning | Classification (Rule based, Statistical)
Independent of ASR performance; can handle mixed language
Q
Do we always mean what we speak?
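The one-step approach rests on acoustic features such as pitch, energy, and speaking rate. A minimal NumPy sketch of such front-end features follows (short-time energy, zero-crossing rate, and an autocorrelation-based pitch estimate); the function name, frame size, and pitch range are illustrative assumptions, not the talk's actual system:

```python
import numpy as np

def frame_features(signal, sr=8000, frame_ms=25):
    """Split a mono signal into frames and compute simple per-frame
    acoustic features of the kind used for emotion recognition:
    short-time energy, zero-crossing rate, and a crude
    autocorrelation-based pitch (F0) estimate."""
    n = int(sr * frame_ms / 1000)
    feats = []
    for start in range(0, len(signal) - n + 1, n):
        frame = signal[start:start + n]
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)
        # crude F0: lag of the autocorrelation peak in an 80-400 Hz range
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        lo, hi = sr // 400, sr // 80
        lag = lo + int(np.argmax(ac[lo:hi]))
        feats.append((energy, zcr, sr / lag))
    return feats

# a pure 200 Hz tone should give a pitch estimate near 200 Hz
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)
feats = frame_features(tone, sr)
```

Real systems would add spectral features (e.g., MFCCs) and speaking-rate statistics on top of such frame-level values.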
Can we Trust Words?
• ”Thanks” during or after a conversation
Quote!
”I don’t Trust Words.
I even question Actions.
But I never Doubt Patterns.”
=⇒ Second (Speech → Emotion) Approach
Emotion in Speech: Acted vs Spontaneous
Acted Play
• Speaker (acts) to express emotion (Expressive, Loud)
• One emotion in the entire speech
Spontaneous Play
• Emotional state of speaker in (spontaneous) conversation (Subtle)
• Several emotions in the entire speech
Does this make a difference?
Spontaneous Speech Challenge
• Emotion ∈ {Anger, Sad, Neutral, Happy, · · · }
• Can be expressed in 2D space
(Valence, Arousal)
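The 2D representation above can be sketched directly. The (valence, arousal) coordinates below are illustrative assumptions; only the quadrants follow the talk (happiness: positive valence, high arousal; anger: negative valence, high arousal; sadness: negative valence, low arousal):

```python
# Illustrative (valence, arousal) coordinates for the categorical labels;
# the exact values are assumptions, only the sign pattern matters.
EMOTION_VA = {
    "happy":   ( 0.8,  0.6),
    "anger":   (-0.7,  0.8),
    "sad":     (-0.6, -0.5),
    "neutral": ( 0.0,  0.0),
}

def nearest_emotion(valence, arousal):
    """Map a point in the 2D valence-arousal plane back to the closest
    categorical label (squared Euclidean distance)."""
    return min(EMOTION_VA,
               key=lambda e: (EMOTION_VA[e][0] - valence) ** 2
                           + (EMOTION_VA[e][1] - arousal) ** 2)
```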
Emotion in Speech: Encoded vs Decoded
• Speaker → Audio → Listener
Encoded emotion → (exact emotion) felt by the speaker
Decoded emotion → (perceived emotion) (of the speaker) by listener
Challenge
Which is the True Emotion?
Emotion Extraction
Generic Emotion Recognition
Learning Process ( =⇒ a system needs to be trained; typical AI)
• Training Phase (using annotated | labeled audio segments)
Segment audio into smaller parts
(appropriate) Feature extraction from audio
(select) Train Classifiers (SVM, Neural Networks, Deep Networks)
• Testing Phase
Use (learnt) classifier to label audio
Performance is dependent on
• the choice of speech features,
• the choice of classifier
#1 availability of training data
#2 the accuracy of the annotated audio,
#3 sufficient data for all classes
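The training and testing phases above can be sketched as follows. The talk's systems use SVMs and neural networks; to keep this sketch dependency-free, a nearest-class-mean classifier stands in, and the feature vectors and labels are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["anger", "neutral", "happy"]
centers = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 0.0]])

# Training phase: annotated | labelled segments -> feature vectors + labels
X_train = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
y_train = np.repeat(labels, 50)

# "Train" the classifier: one mean feature vector per emotion class
means = {l: X_train[y_train == l].mean(axis=0) for l in labels}

def classify(x):
    """Testing phase: label an unseen segment by the closest class mean."""
    return min(means, key=lambda l: np.linalg.norm(x - means[l]))

pred = [classify(c) for c in centers]
```

The same two-phase structure applies unchanged when the stand-in classifier is replaced by an SVM or a (deep) neural network.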
Availability of Speech Corpus
Available
• Speech corpus available for acted speech
Challenge
• What works for acted does not work for spontaneous speech
(train-test mismatch)
Need to
Build Spontaneous Speech Corpus!
Building Spontaneous Speech Corpus
• Encoded emotion: felt by the speaker
• Decoded emotion: the speaker's emotion as perceived by the listener
Challenge | Building Corpus
Building a realistic spontaneous speech corpus would need
• a person to speak in a certain emotional state and/or
• annotate what he or she spoke;
generating such a realistic spontaneous speech corpus is extremely
difficult and is a huge challenge.
We need to make do with the decoded emotion (the listener's perspective
of the speaker's emotion)!
Emotion Annotation Challenge
Listener's perspective of the speaker's emotion
• Time, costs, human efforts (minor challenges)
• Consistency and reliability in annotations (major challenge)
Observations
• fair amount of disagreement among the evaluators (κ = 0.12)
(κ = 1 is best)
• κ improves (0.65) when knowledge is provided to annotators
Σing Challenges in Spontaneous Speech Emotion
Recognition
#1 Intensity of emotion: mostly subtle
(Acted versus Spontaneous)
#2 What works for acted speech does not work for spontaneous speech
(Data mismatch for ML based systems)
#3 Spontaneous speech corpus
(Difficulty in building corpus - Encoded emotion)
#4 Difficulties in annotation
(Decoded emotion disagreement)
#5 Difficulties in uniformity of data in each class
(Difficulty in getting happy data in a call center!)
=⇒
emotion recognition literature on acted speech does not help in
spontaneous speech emotion recognition
Observation
Recall
• fair amount of disagreement among the evaluators (κ = 0.12)
(κ = 1 is best)
• κ improves (0.65) when knowledge is provided to annotators
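The agreement figures above are inter-annotator kappa scores. A minimal sketch of Cohen's kappa for two annotators follows (assuming that is the variant intended; the talk does not specify):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    p_o = sum(a == b for a, b in zip(ann1, ann2)) / n  # observed agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    # chance agreement from each annotator's label frequencies
    p_e = sum(c1[l] * c2[l] for l in set(ann1) | set(ann2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

κ = 1 means perfect agreement; κ near 0 means agreement no better than chance, which is why 0.12 signals serious annotator disagreement.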
Hypothesis
the use of
prior knowledge
can help in
recognizing emotions in spontaneous speech
Our Effort: Addressing Spontaneous Speech Emotion
• Knowledge based framework (applicable to acted speech also!)
The Crux!
Tested on two datasets.
• SERES Details
• Call Center Conversations Demo 1 Demo 2
Our Effort: Addressing Spontaneous Speech Emotion
• Classifier error corrections
Like a conventional pattern recognition (PR) problem: feature extraction
at the front end, classification by an ANN
Error correction decoding to correct classifier errors
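The error-correction idea can be sketched as error-correcting-output-code decoding: each emotion class is assigned a binary codeword, one binary classifier (e.g., an ANN) predicts each bit, and decoding the predicted bit-vector to the nearest codeword can absorb a single classifier error. The 5-bit codebook below is illustrative, not the one from the cited papers:

```python
# Illustrative codebook with minimum Hamming distance 3, so any single
# bit (classifier) error is still decoded to the correct class.
CODEBOOK = {
    "anger":   (0, 0, 0, 0, 0),
    "happy":   (1, 1, 1, 0, 0),
    "neutral": (0, 0, 1, 1, 1),
    "sad":     (1, 1, 0, 1, 1),
}

def hamming(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    """Map the bit predictions of the binary classifiers to the emotion
    whose codeword is closest in Hamming distance."""
    return min(CODEBOOK, key=lambda c: hamming(CODEBOOK[c], bits))
```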
Concluding Remarks
More Challenges
• Real Time Emotion
Little work exists on the minimum or maximum time duration of audio
that exhibits an emotion.
Our work shows that 600 ms of audio is sufficient to reliably extract
emotion (Audio Segmentation ... for Improved Emotion Recognition)
• Feature extraction and representation: optimal features for
spontaneous speech (no agreement)
• Acoustic variability: different speakers, speaking styles, speaking
rates, different sentences, different semantics
• Speaker's gender, culture, and environment
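The 600 ms finding above suggests a simple segmentation front end. A sketch that splits audio into 600 ms windows follows; the drop-the-last-partial-window policy is an assumption, and the cited paper's exact segmentation may differ:

```python
def segment_audio(signal, sr, seg_ms=600):
    """Split a mono signal into consecutive seg_ms windows (the duration
    the talk cites as sufficient to reliably extract emotion); any
    trailing partial window is dropped."""
    n = int(sr * seg_ms / 1000)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

# 2 s of audio at 16 kHz -> three full 600 ms segments (last 200 ms dropped)
segments = segment_audio([0.0] * 32000, 16000)
```

Each segment would then be passed independently to the emotion classifier, yielding a per-600-ms emotion track over a conversation.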
Concluding Remarks
Difficulty Level Less Challenging −→ Very Challenging
Type Acted −→ Spontaneous
Media Video −→ Only Audio
Read Speech −→ Natural Speech
Language Single Language −→ Mixed Language
Clarity Clean Speech −→ Noisy Speech
Distance Close Talk −→ Far Field
Corpus Available −→ Not Available
Annotation Good −→ Poor
Context Available −→ Not Available
Good ASR Available −→ Not Available
• Applicable to most Speech Signal Processing problems!
• Definitely for Emotion Recognition
Concluding Remarks
There is a large playing field ...
go ahead, Research and Innovate!
Good Luck
Thank You
• Acknowledgements (Rupayan and Meghna)
TCS Innovation Labs - Mumbai
• Queries? | Comments | Suggestions?
Dr Sunil Kopparapu
SunilKumar.Kopparapu@TCS.Com
TCS Innovation Lab - Mumbai
Tata Consultancy Services Limited
Loc: 72.977265, 19.225129 Yantra Park, Thane (West), India.
END
Fun
• Listen to this Play
• Not Loud? Listen Again Play
• Your response? Noise?
• Let us see if this helps Hidden (some seconds)
• Greetings!
Can this be used?
A System & Method For Visual Message Comm
• Indian Patent Application [1116/MUM/2013; Mar 25, 2013]
• US Publication number US20140287676 A1
Speech Enabled Railway Enquiry System
• Languages
Indian English
Marathi
Hindi
• Trains
Passing through Mumbai
Real time Info
• Functionality
Arrival Departure
Fare
Availability
PNR
Running Status
Demonstration Menu Based
  • 29. Emotion | Sentiment from Audio #1 Speech → Text → Emotion (2 Steps) Step 1 Speech to text using automatic speech recognition (ASR) Step 2 Analyzing linguistic (text) content Based on what is spoken Learning | Classification (text) Language dependent; dependent on performance of ASR #2 Speech → Emotion (1 Step) Step 1 Analyzing non-linguistic (audio) content based on how spoken Extraction of features (pitch, energy, spectral, speaking rate etc.) from audio Learning | Classification (Rule based, Statistical) 7
  • 30. Emotion | Sentiment from Audio #1 Speech → Text → Emotion (2 Steps) Step 1 Speech to text using automatic speech recognition (ASR) Step 2 Analyzing linguistic (text) content Based on what is spoken Learning | Classification (text) Language dependent; dependent on performance of ASR #2 Speech → Emotion (1 Step) Step 1 Analyzing non-linguistic (audio) content based on how spoken Extraction of features (pitch, energy, spectral, speaking rate etc.) from audio Learning | Classification (Rule based, Statistical) Independent of ASR performances, Can handle mixed language 7
  • 31. Emotion | Sentiment from Audio #1 Speech → Text → Emotion (2 Steps) Step 1 Speech to text using automatic speech recognition (ASR) Step 2 Analyzing linguistic (text) content Based on what is spoken Learning | Classification (text) Language dependent; dependent on performance of ASR #2 Speech → Emotion (1 Step) Step 1 Analyzing non-linguistic (audio) content based on how spoken Extraction of features (pitch, energy, spectral, speaking rate etc.) from audio Learning | Classification (Rule based, Statistical) Independent of ASR performances, Can handle mixed language Q Do we always mean what we speak? 7
  • 32. Can we Trust Words? 8
  • 33. Can we Trust Words? • ”Thanks” during or after a conversation 8
  • 34. Can we Trust Words? • ”Thanks” during or after a conversation Quote! ”I don’t Trust Words. I even question Actions. But I never Doubt Patterns.” 8
  • 35. Can we Trust Words? • ”Thanks” during or after a conversation Quote! ”I don’t Trust Words. I even question Actions. But I never Doubt Patterns.” =⇒ Second (Speech → Emotion) Approach 8
  • 36. Emotion in Speech: Acted vs Spontaneous Acted Play • Speaker (acts) to express emotion (Expressive, Loud) • One emotion in the entire speech Spontaneous Play • Emotional state of speaker in (spontaneous) conversation (Subtle) • Several emotions in the entire speech Does this make a difference? 9
  • 37. Spontaneous Speech Challenge • Emotion ∈ {Anger, Sad, Neutral, Happy, · · · } • Can be expressed in 2D space (Valence, Arousal) 10
  • 38. Emotion in Speech: Encoded vs Decoded • Speaker → Audio → Listener Encoded emotion → (exact emotion) felt by the speaker Decoded emotion → (perceived emotion) (of the speaker) by listener Challenge Which is the True Emotion? Emotion Extraction 11
  • 39. Generic Emotion Recognition Learning Process ( =⇒ a system needs to be trained; typical AI) • Training Phase (using annotated | labeled audio segments) Segment audio into smaller parts (appropriate) Feature extraction from audio (select) Train Classifiers (SVM, Neural Networks, Deep Networks) • Testing Phase Use (learnt) classifier to label audio 12
  • 40. Generic Emotion Recognition Learning Process ( =⇒ a system needs to be trained; typical AI) • Training Phase (using annotated | labeled audio segments) Segment audio into smaller parts (appropriate) Feature extraction from audio (select) Train Classifiers (SVM, Neural Networks, Deep Networks) • Testing Phase Use (learnt) classifier to label audio Performance is dependent on • the choice of speech features, • the choice of classifier #1 availability of training data #2 the accuracy of the annotated audio, #3 sufficient data for all classes 12
  • 41. Availability of Speech Corpus Available • Speech corpus available for acted speech 13
  • 42. Availability of Speech Corpus Available • Speech corpus available for acted speech Challenge • What works for acted does not work for spontaneous speech (tain-test mismatch) 13
  • 43. Availability of Speech Corpus Available • Speech corpus available for acted speech Challenge • What works for acted does not work for spontaneous speech (tain-test mismatch) Need to Build Spontaneous Speech Corpus! 13
  • 44. Building Spontaneous Speech Corpus • Encoded emotion: felt by the speaker • Decoded emotion: perceived emotion of the speaker by listener 14
  • 45. Building Spontaneous Speech Corpus • Encoded emotion: felt by the speaker • Decoded emotion: perceived emotion of the speaker by listener Challenge | Building Corpus Building a realistic spontaneous speech corpus would need • a person to speak in a certain emotional state and/or • annotate what he or she spoke; generating such realistic spontaneous data corpus is extremely difficult and is a huge challenge. 14
  • 46. Building Spontaneous Speech Corpus • Encoded emotion: felt by the speaker • Decoded emotion: perceived emotion of the speaker by listener Challenge | Building Corpus Building a realistic spontaneous speech corpus would need • a person to speak in a certain emotional state and/or • annotate what he or she spoke; generating such realistic spontaneous data corpus is extremely difficult and is a huge challenge. We need to do with decoded (listener’s perspective of the speakers) emotion! 14
  • 47. Emotion Annotation Challenge Listener’s perspective of the speakers emotion • Time, costs, human efforts (minor challenges) • Consistency and reliability in annotations (major challenge) 15
  • 48. Emotion Annotation Challenge Listener’s perspective of the speakers emotion • Time, costs, human efforts (minor challenges) • Consistency and reliability in annotations (major challenge) 15
  • 49. Emotion Annotation Challenge Listener’s perspective of the speakers emotion • Time, costs, human efforts (minor challenges) • Consistency and reliability in annotations (major challenge) Observations • fair amount of disagreement among the evaluators (κ = 0.12) (κ = 1 is best) • κ improves (0.65) when knowledge is provided to annotators 15
  • 50. Σing Challenges in Spontaneous Speech Emotion Recognition #1 Intensity of emotion: mostly subtle (Acted versus Spontaneous) #2 What works for acted speech does not work for spontaneous speech (Data mismatch for ML based systems) #3 Spontaneous speech corpus (Difficulty in building corpus - Encoded emotion) #4 Difficulties in annotation (Decoded emotion disagreement) #5 Difficulties in uniformity of data in each class (Difficulty in getting happy data in a call center!) 16
  • 51. Σing Challenges in Spontaneous Speech Emotion Recognition #1 Intensity of emotion: mostly subtle (Acted versus Spontaneous) #2 What works for acted speech does not work for spontaneous speech (Data mismatch for ML based systems) #3 Spontaneous speech corpus (Difficulty in building corpus - Encoded emotion) #4 Difficulties in annotation (Decoded emotion disagreement) #5 Difficulties in uniformity of data in each class (Difficulty in getting happy data in a call center!) =⇒ emotion recognition literature on acted speech does not help in spontaneous speech emotion recognition 16
  • 52. Observation Recall • fair amount of disagreement among the evaluators (κ = 0.12) (κ = 1 is best) • κ improves (0.65) when knowledge is provided to annotators 17
  • 53. Observation Recall • fair amount of disagreement among the evaluators (κ = 0.12) (κ = 1 is best) • κ improves (0.65) when knowledge is provided to annotators Hypothesis the use of prior knowledge can help address recognizing emotions in spontaneous speech 17
  • 54. Our Effort: Addressing -Spontaneous Speech- Emotion • Knowledge based framework (applicable to acted speech also!) The Crux! 18
  • 55. Our Effort: Addressing -Spontaneous Speech- Emotion • Knowledge based framework (applicable to acted speech also!) Tested on two datasets. • SERES Details • Call Center Conversations Demo 1 Demo 2 18
  • 56. Our Effort: Addressing -Spontaneous Speech- Emotion • Knowledge based framework (applicable to acted speech also!) 18
  • 57. Our Effort: Addressing -Spontaneous Speech- Emotion • Classifier error corrections Like conventional PR problem: feature extraction at the front end, classification done by ANN Error correction decoding to correct errors 19
  • 58. Our Effort: Addressing -Spontaneous Speech- Emotion • Classifier error corrections Like conventional PR problem: feature extraction at the front end, classification done by ANN Error correction decoding to correct errors 19
  • 59. Concluding Remarks More Challenges • Real Time Emotion Not much work regarding the minimum or maximum time duration that exhibit an emotion. Our work shows that 600 ms of audio is sufficient to reliably extract emotion (Audio Segmentation ... for Improved Emotion Recognition) • Feature extraction and representation: optimal features for spontaneous speech (no agreement) • Acoustic variability: different speakers, speaking styles, speaking rates, different sentences, different semantics • Speakers gender, culture, and environment 20
  • 60. Concluding Remarks Difficulty Level Less Challenging −→ Very Challenging Type Acted −→ Spontaneous Media Video −→ Only Audio Read Speech −→ Natural Speech Language Single Language −→ Mixed Language Clarity Clean Speech −→ Noisy Speech Distance Close Talk −→ Far Field Corpus Available −→ Not Available Annotation Good −→ Poor Context Available −→ Not Available Good ASR Available −→ Not Available 20
  • 61. Concluding Remarks Difficulty Level Less Challenging −→ Very Challenging Type Acted −→ Spontaneous Media Video −→ Only Audio Read Speech −→ Natural Speech Language Single Language −→ Mixed Language Clarity Clean Speech −→ Noisy Speech Distance Close Talk −→ Far Field Corpus Available −→ Not Available Annotation Good −→ Poor Context Available −→ Not Available Good ASR Available −→ Not Available • Applicable to most Speech Signal Processing problems! • Definitely for Emotion Recognition 20
  • 62. Concluding Remarks There is a large play field ... go ahead Research and Innovate! Good Luck 20
  • 63. Thank You • Acknowledgements (Rupayan and Meghna) TCS Innovation Labs - Mumbai • Queries? | Comments | Suggestions? Dr Sunil Kopparapu SunilKumar.Kopparapu@TCS.Com TCS Innovation Lab - Mumbai Tata Consultancy Services Limited Loc: 72.977265, 19.225129 Yantra Park, Thane (West), India. END 21
  • 64. Fun • Listen to this Play 22
  • 65. Fun • Listen to this Play • Not Loud? Listen Again Play 22
  • 66. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? 22
  • 67. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? • Let us see if this helps Hidden (some seconds) 22
  • 68. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? • Let us see if this helps Hidden (some seconds) • Greetings! 22
  • 69. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? • Let us see if this helps Hidden (some seconds) • Greetings! Can this be used? 22
  • 70. A System & Method For Visual Message Comm • Indian Patent Application [1116/MUM/2013; Mar 25, 2013] • US Publication number US20140287676 A1 Back 23
  • 71. Speech Enabled Railway Enquiry System • Languages Indian English Marathi Hindi • Trains Passing thru Mumbai Real time Info • Functionality Arrival Departure Fare Availability PNR Running Status Demonstration Menu Based Back 24