Recognizing Emotions in Spontaneous Speech
[Keynote Address | CSCITA 2017 | Mumbai]
Dr. Sunil Kumar Kopparapu
SunilKumar.Kopparapu@TCS.COM
TCS Innovation Labs - Mumbai
Tata Consultancy Services Ltd, Yantra Park,
Thane (West), Maharashtra 400601,
INDIA.
April 8, 2017
Talk Motivated by ...
RMS Do You Mean What You Say? Recognizing Emotions in
Spontaneous Speech, Invited Talk, ICNETS2, Chennai,
March 2017.
[R]upayan, [M]eghna, [S]unil [Published in 2017]
Talk Motivated by ...
RS ”Improved Speech Emotion Recognition using Error
Correcting Codes”, ICME 2016 Workshop on Affective
Social Multimedia Computing 2016, Seattle, USA.
RS ”Validating Is ECC-ANN Combination Equivalent to
DNN? for Speech Emotion Recognition”, 2016 IEEE
International Conference on Systems, Man, and
Cybernetics (SMC), Budapest, Hungary, 2016.
RMS ”Knowledge based framework for intelligent emotion
recognition in spontaneous speech”, 20th International
Conference on Knowledge Based and Intelligent
Information and Engineering Systems, KES2016, York,
UK, Sep 2016
RMS ”Mining call center conversations exhibiting similar
affective states”, 30th Pacific Asia Conference on
Language, Information and Computation (PACLIC 30),
Seoul, South Korea (28-30 October 2016)
RMS ”Spontaneous Speech Emotion Recognition Using Prior
Knowledge”, 23rd International Conference on Pattern
Recognition (ICPR) in Cancun, Mexico, Dec 2016
[R]upayan, [M]eghna, [S]unil [Published in 2016]
Talk Motivated by ...
Created using http://tagcrowd.com/
Acknowledging the efforts of Rupayan and Meghna
Title (Recognize - Emotion - Spontaneous Speech)
• Recognizing
assigning a predefined label to a sample from a set of known labels!
(example) Given n labels L = {l1, l2, · · · , ln}; assign a label to a sample
• Emotions
(fact) no consensus on a definition
feelings or thoughts that arise spontaneously instead of conscious
thought
(Wiki) brief conscious experience characterized by arousal (intense
mental activity) and valence | mood (a high degree of pleasure or
displeasure)
(example) happiness (positive mood, high arousal), anger (negative
mood, high arousal) and sadness (negative mood, low arousal).
• Spontaneous Speech
(example) day to day conversation
(opposite!) not acted!
natural language speech (lax grammar)
Importance of Emotion
Self Help | Car Insurance
• System (:-)): /Thank you for calling xyz Insurance. How may I help
you?/
• Customer (:-(): /I just had an accident and need to claim insurance
and tow the car/
Reactions
• System (:-)): /Can I have your 18-digit insurance policy number?/
• System (:-)): /Sorry to hear this. Where is your vehicle currently?/
• System (:-(): /Sorry to hear this. Where is your vehicle currently?/
Detecting emotions in conversations can lead to a better user experience (UX)
Information in Speech
Audio −→ 1. Play 2. Play
• Non-linguistic, (who said it)
gender (Male),
emotional states,
speaker name (Mahatma Gandhi)
• Linguistic (what (s)he said)
Language name (English) and
what was said (written text, ”truth persists, in the midst of darkness
light persists”)
• Para-linguistic (how well said, quality; manner, clarity, accent)
Deliberately | Subconsciously added by speaker; not inferable from
written text.
Goal
Automatically extract information in speech signal
(≡ Speech, Speaker, Gender, Emotion, Language, · · · → Recognition)
What do we do in Speech and Natural Language Processing? Anything else?
Work in Speech and NL: A Bird’s eye view
#1 Robust Speaker and Speech Recognition
#2 Emotion Recognition in Spontaneous Speech
#3 Remote Speech Therapy
#4 SuNo - Speech and NL for Self Help
#5 Speech Analysis for Laryngeal cancer detection
Indian Scenario
• Large number of Languages (dialects, accents)
• Noisy Environment
• Use of more than one language in the same sentence
• Spoken Variations
Kopparapu | Yerraguntla | Thangal Kunju Musaliar
• People-Machine interaction comfort
• Non-availability of speech corpus
Work in Speech and NL: A Bird’s eye view
Emotion in Audio
Will concentrate on Emotion in Spontaneous Speech (e.g., Call Center Audio)!
Emotion | Sentiment from Audio
#1 Speech → Text → Emotion (2 Steps)
Step 1 Speech to text using automatic speech recognition (ASR)
Step 2 Analyzing linguistic (text) content
Based on what is spoken
Learning | Classification (text)
Language dependent; dependent on performance of ASR
#2 Speech → Emotion (1 Step)
Step 1 Analyzing non-linguistic (audio) content
based on how spoken
Extraction of features (pitch, energy, spectral, speaking rate etc.)
from audio
Learning | Classification (Rule based, Statistical)
Independent of ASR performance; can handle mixed language
Q
Do we always mean what we speak?
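The one-step approach rests on acoustic features such as pitch, energy, and speaking rate. A minimal NumPy sketch of such front-end features follows (short-time energy, zero-crossing rate, and an autocorrelation-based pitch estimate); the function name, frame size, and pitch range are illustrative assumptions, not the talk's actual system:

```python
import numpy as np

def frame_features(signal, sr=8000, frame_ms=25):
    """Split a mono signal into frames and compute simple per-frame
    acoustic features of the kind used for emotion recognition:
    short-time energy, zero-crossing rate, and a crude
    autocorrelation-based pitch (F0) estimate."""
    n = int(sr * frame_ms / 1000)
    feats = []
    for start in range(0, len(signal) - n + 1, n):
        frame = signal[start:start + n]
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)
        # crude F0: lag of the autocorrelation peak in an 80-400 Hz range
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        lo, hi = sr // 400, sr // 80
        lag = lo + int(np.argmax(ac[lo:hi]))
        feats.append((energy, zcr, sr / lag))
    return feats

# a pure 200 Hz tone should give a pitch estimate near 200 Hz
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)
feats = frame_features(tone, sr)
```

Real systems would add spectral features (e.g., MFCCs) and speaking-rate statistics on top of such frame-level values.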
Can we Trust Words?
• ”Thanks” during or after a conversation
Quote!
”I don’t Trust Words.
I even question Actions.
But I never Doubt Patterns.”
=⇒ Second (Speech → Emotion) Approach
Emotion in Speech: Acted vs Spontaneous
Acted Play
• Speaker (acts) to express emotion (Expressive, Loud)
• One emotion in the entire speech
Spontaneous Play
• Emotional state of speaker in (spontaneous) conversation (Subtle)
• Several emotions in the entire speech
Does this make a difference?
Spontaneous Speech Challenge
• Emotion ∈ {Anger, Sad, Neutral, Happy, · · · }
• Can be expressed in 2D space
(Valence, Arousal)
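The 2D representation above can be sketched directly. The (valence, arousal) coordinates below are illustrative assumptions; only the quadrants follow the talk (happiness: positive valence, high arousal; anger: negative valence, high arousal; sadness: negative valence, low arousal):

```python
# Illustrative (valence, arousal) coordinates for the categorical labels;
# the exact values are assumptions, only the sign pattern matters.
EMOTION_VA = {
    "happy":   ( 0.8,  0.6),
    "anger":   (-0.7,  0.8),
    "sad":     (-0.6, -0.5),
    "neutral": ( 0.0,  0.0),
}

def nearest_emotion(valence, arousal):
    """Map a point in the 2D valence-arousal plane back to the closest
    categorical label (squared Euclidean distance)."""
    return min(EMOTION_VA,
               key=lambda e: (EMOTION_VA[e][0] - valence) ** 2
                           + (EMOTION_VA[e][1] - arousal) ** 2)
```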
Emotion in Speech: Encoded vs Decoded
• Speaker → Audio → Listener
Encoded emotion → (exact emotion) felt by the speaker
Decoded emotion → (perceived emotion) (of the speaker) by listener
Challenge
Which is the True Emotion?
Emotion Extraction
Generic Emotion Recognition
Learning Process ( =⇒ a system needs to be trained; typical AI)
• Training Phase (using annotated | labeled audio segments)
Segment audio into smaller parts
(appropriate) Feature extraction from audio
(select) Train Classifiers (SVM, Neural Networks, Deep Networks)
• Testing Phase
Use (learnt) classifier to label audio
Performance is dependent on
• the choice of speech features,
• the choice of classifier
#1 availability of training data
#2 the accuracy of the annotated audio,
#3 sufficient data for all classes
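The training and testing phases above can be sketched as follows. The talk's systems use SVMs and neural networks; to keep this sketch dependency-free, a nearest-class-mean classifier stands in, and the feature vectors and labels are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["anger", "neutral", "happy"]
centers = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 0.0]])

# Training phase: annotated | labelled segments -> feature vectors + labels
X_train = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
y_train = np.repeat(labels, 50)

# "Train" the classifier: one mean feature vector per emotion class
means = {l: X_train[y_train == l].mean(axis=0) for l in labels}

def classify(x):
    """Testing phase: label an unseen segment by the closest class mean."""
    return min(means, key=lambda l: np.linalg.norm(x - means[l]))

pred = [classify(c) for c in centers]
```

The same two-phase structure applies unchanged when the stand-in classifier is replaced by an SVM or a (deep) neural network.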
Availability of Speech Corpus
Available
• Speech corpus available for acted speech
Challenge
• What works for acted does not work for spontaneous speech
(train-test mismatch)
Need to
Build Spontaneous Speech Corpus!
Building Spontaneous Speech Corpus
• Encoded emotion: felt by the speaker
• Decoded emotion: the speaker's emotion as perceived by the listener
Challenge | Building Corpus
Building a realistic spontaneous speech corpus would need
• a person to speak in a certain emotional state and/or
• annotate what he or she spoke;
generating such a realistic spontaneous speech corpus is extremely
difficult and is a huge challenge.
We need to make do with the decoded emotion (the listener's perspective
of the speaker's emotion)!
Emotion Annotation Challenge
Listener's perspective of the speaker's emotion
• Time, costs, human efforts (minor challenges)
• Consistency and reliability in annotations (major challenge)
Observations
• fair amount of disagreement among the evaluators (κ = 0.12)
(κ = 1 is best)
• κ improves (0.65) when knowledge is provided to annotators
Σing Challenges in Spontaneous Speech Emotion
Recognition
#1 Intensity of emotion: mostly subtle
(Acted versus Spontaneous)
#2 What works for acted speech does not work for spontaneous speech
(Data mismatch for ML based systems)
#3 Spontaneous speech corpus
(Difficulty in building corpus - Encoded emotion)
#4 Difficulties in annotation
(Decoded emotion disagreement)
#5 Difficulties in uniformity of data in each class
(Difficulty in getting happy data in a call center!)
=⇒
emotion recognition literature on acted speech does not help in
spontaneous speech emotion recognition
Observation
Recall
• fair amount of disagreement among the evaluators (κ = 0.12)
(κ = 1 is best)
• κ improves (0.65) when knowledge is provided to annotators
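The agreement figures above are inter-annotator kappa scores. A minimal sketch of Cohen's kappa for two annotators follows (assuming that is the variant intended; the talk does not specify):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    p_o = sum(a == b for a, b in zip(ann1, ann2)) / n  # observed agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    # chance agreement from each annotator's label frequencies
    p_e = sum(c1[l] * c2[l] for l in set(ann1) | set(ann2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

κ = 1 means perfect agreement; κ near 0 means agreement no better than chance, which is why 0.12 signals serious annotator disagreement.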
Hypothesis
the use of
prior knowledge
can help in
recognizing emotions in spontaneous speech
Our Effort: Addressing Spontaneous Speech Emotion
• Knowledge based framework (applicable to acted speech also!)
The Crux!
Tested on two datasets.
• SERES Details
• Call Center Conversations Demo 1 Demo 2
Our Effort: Addressing Spontaneous Speech Emotion
• Classifier error corrections
Like a conventional pattern recognition (PR) problem: feature extraction
at the front end, classification by an ANN
Error correction decoding to correct classifier errors
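The error-correction idea can be sketched as error-correcting-output-code decoding: each emotion class is assigned a binary codeword, one binary classifier (e.g., an ANN) predicts each bit, and decoding the predicted bit-vector to the nearest codeword can absorb a single classifier error. The 5-bit codebook below is illustrative, not the one from the cited papers:

```python
# Illustrative codebook with minimum Hamming distance 3, so any single
# bit (classifier) error is still decoded to the correct class.
CODEBOOK = {
    "anger":   (0, 0, 0, 0, 0),
    "happy":   (1, 1, 1, 0, 0),
    "neutral": (0, 0, 1, 1, 1),
    "sad":     (1, 1, 0, 1, 1),
}

def hamming(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    """Map the bit predictions of the binary classifiers to the emotion
    whose codeword is closest in Hamming distance."""
    return min(CODEBOOK, key=lambda c: hamming(CODEBOOK[c], bits))
```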
Concluding Remarks
More Challenges
• Real Time Emotion
Little work exists on the minimum or maximum time duration of audio
that exhibits an emotion.
Our work shows that 600 ms of audio is sufficient to reliably extract
emotion (Audio Segmentation ... for Improved Emotion Recognition)
• Feature extraction and representation: optimal features for
spontaneous speech (no agreement)
• Acoustic variability: different speakers, speaking styles, speaking
rates, different sentences, different semantics
• Speaker's gender, culture, and environment
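The 600 ms finding above suggests a simple segmentation front end. A sketch that splits audio into 600 ms windows follows; the drop-the-last-partial-window policy is an assumption, and the cited paper's exact segmentation may differ:

```python
def segment_audio(signal, sr, seg_ms=600):
    """Split a mono signal into consecutive seg_ms windows (the duration
    the talk cites as sufficient to reliably extract emotion); any
    trailing partial window is dropped."""
    n = int(sr * seg_ms / 1000)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

# 2 s of audio at 16 kHz -> three full 600 ms segments (last 200 ms dropped)
segments = segment_audio([0.0] * 32000, 16000)
```

Each segment would then be passed independently to the emotion classifier, yielding a per-600-ms emotion track over a conversation.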
Concluding Remarks
Difficulty Level Less Challenging −→ Very Challenging
Type Acted −→ Spontaneous
Media Video −→ Only Audio
Read Speech −→ Natural Speech
Language Single Language −→ Mixed Language
Clarity Clean Speech −→ Noisy Speech
Distance Close Talk −→ Far Field
Corpus Available −→ Not Available
Annotation Good −→ Poor
Context Available −→ Not Available
Good ASR Available −→ Not Available
• Applicable to most Speech Signal Processing problems!
• Definitely for Emotion Recognition
Concluding Remarks
There is a large playing field ...
go ahead, Research and Innovate!
Good Luck
Thank You
• Acknowledgements (Rupayan and Meghna)
TCS Innovation Labs - Mumbai
• Queries? | Comments | Suggestions?
Dr Sunil Kopparapu
SunilKumar.Kopparapu@TCS.Com
TCS Innovation Lab - Mumbai
Tata Consultancy Services Limited
Loc: 72.977265, 19.225129 Yantra Park, Thane (West), India.
END
Fun
• Listen to this Play
• Not Loud? Listen Again Play
• Your response? Noise?
• Let us see if this helps Hidden (some seconds)
• Greetings!
Can this be used?
A System & Method For Visual Message Comm
• Indian Patent Application [1116/MUM/2013; Mar 25, 2013]
• US Publication number US20140287676 A1
Speech Enabled Railway Enquiry System
• Languages
Indian English
Marathi
Hindi
• Trains
Passing through Mumbai
Real time Info
• Functionality
Arrival Departure
Fare
Availability
PNR
Running Status
Demonstration Menu Based
  • 29. Emotion | Sentiment from Audio #1 Speech → Text → Emotion (2 Steps) Step 1 Speech to text using automatic speech recognition (ASR) Step 2 Analyzing linguistic (text) content Based on what is spoken Learning | Classification (text) Language dependent; dependent on performance of ASR #2 Speech → Emotion (1 Step) Step 1 Analyzing non-linguistic (audio) content based on how spoken Extraction of features (pitch, energy, spectral, speaking rate etc.) from audio Learning | Classification (Rule based, Statistical) 7
  • 30. Emotion | Sentiment from Audio #1 Speech → Text → Emotion (2 Steps) Step 1 Speech to text using automatic speech recognition (ASR) Step 2 Analyzing linguistic (text) content Based on what is spoken Learning | Classification (text) Language dependent; dependent on performance of ASR #2 Speech → Emotion (1 Step) Step 1 Analyzing non-linguistic (audio) content based on how spoken Extraction of features (pitch, energy, spectral, speaking rate etc.) from audio Learning | Classification (Rule based, Statistical) Independent of ASR performances, Can handle mixed language 7
  • 31. Emotion | Sentiment from Audio #1 Speech → Text → Emotion (2 Steps) Step 1 Speech to text using automatic speech recognition (ASR) Step 2 Analyzing linguistic (text) content Based on what is spoken Learning | Classification (text) Language dependent; dependent on performance of ASR #2 Speech → Emotion (1 Step) Step 1 Analyzing non-linguistic (audio) content based on how spoken Extraction of features (pitch, energy, spectral, speaking rate etc.) from audio Learning | Classification (Rule based, Statistical) Independent of ASR performances, Can handle mixed language Q Do we always mean what we speak? 7
  • 32. Can we Trust Words? 8
  • 33. Can we Trust Words? • ”Thanks” during or after a conversation 8
  • 34. Can we Trust Words? • ”Thanks” during or after a conversation Quote! ”I don’t Trust Words. I even question Actions. But I never Doubt Patterns.” 8
  • 35. Can we Trust Words? • ”Thanks” during or after a conversation Quote! ”I don’t Trust Words. I even question Actions. But I never Doubt Patterns.” =⇒ Second (Speech → Emotion) Approach 8
  • 36. Emotion in Speech: Acted vs Spontaneous Acted Play • Speaker (acts) to express emotion (Expressive, Loud) • One emotion in the entire speech Spontaneous Play • Emotional state of speaker in (spontaneous) conversation (Subtle) • Several emotions in the entire speech Does this make a difference? 9
  • 37. Spontaneous Speech Challenge • Emotion ∈ {Anger, Sad, Neutral, Happy, · · · } • Can be expressed in 2D space (Valence, Arousal) 10
  • 38. Emotion in Speech: Encoded vs Decoded • Speaker → Audio → Listener Encoded emotion → (exact emotion) felt by the speaker Decoded emotion → (perceived emotion) (of the speaker) by listener Challenge Which is the True Emotion? Emotion Extraction 11
  • 39. Generic Emotion Recognition Learning Process ( =⇒ a system needs to be trained; typical AI) • Training Phase (using annotated | labeled audio segments) Segment audio into smaller parts (appropriate) Feature extraction from audio (select) Train Classifiers (SVM, Neural Networks, Deep Networks) • Testing Phase Use (learnt) classifier to label audio 12
  • 40. Generic Emotion Recognition Learning Process ( =⇒ a system needs to be trained; typical AI) • Training Phase (using annotated | labeled audio segments) Segment audio into smaller parts (appropriate) Feature extraction from audio (select) Train Classifiers (SVM, Neural Networks, Deep Networks) • Testing Phase Use (learnt) classifier to label audio Performance is dependent on • the choice of speech features, • the choice of classifier #1 availability of training data #2 the accuracy of the annotated audio, #3 sufficient data for all classes 12
  • 41. Availability of Speech Corpus Available • Speech corpus available for acted speech 13
  • 42. Availability of Speech Corpus Available • Speech corpus available for acted speech Challenge • What works for acted does not work for spontaneous speech (tain-test mismatch) 13
  • 43. Availability of Speech Corpus Available • Speech corpus available for acted speech Challenge • What works for acted does not work for spontaneous speech (tain-test mismatch) Need to Build Spontaneous Speech Corpus! 13
  • 44. Building Spontaneous Speech Corpus • Encoded emotion: felt by the speaker • Decoded emotion: perceived emotion of the speaker by listener 14
  • 45. Building Spontaneous Speech Corpus • Encoded emotion: felt by the speaker • Decoded emotion: perceived emotion of the speaker by listener Challenge | Building Corpus Building a realistic spontaneous speech corpus would need • a person to speak in a certain emotional state and/or • annotate what he or she spoke; generating such realistic spontaneous data corpus is extremely difficult and is a huge challenge. 14
  • 46. Building Spontaneous Speech Corpus • Encoded emotion: felt by the speaker • Decoded emotion: perceived emotion of the speaker by listener Challenge | Building Corpus Building a realistic spontaneous speech corpus would need • a person to speak in a certain emotional state and/or • annotate what he or she spoke; generating such realistic spontaneous data corpus is extremely difficult and is a huge challenge. We need to do with decoded (listener’s perspective of the speakers) emotion! 14
  • 47. Emotion Annotation Challenge Listener’s perspective of the speakers emotion • Time, costs, human efforts (minor challenges) • Consistency and reliability in annotations (major challenge) 15
  • 48. Emotion Annotation Challenge Listener’s perspective of the speakers emotion • Time, costs, human efforts (minor challenges) • Consistency and reliability in annotations (major challenge) 15
  • 49. Emotion Annotation Challenge Listener’s perspective of the speakers emotion • Time, costs, human efforts (minor challenges) • Consistency and reliability in annotations (major challenge) Observations • fair amount of disagreement among the evaluators (κ = 0.12) (κ = 1 is best) • κ improves (0.65) when knowledge is provided to annotators 15
  • 50. Σing Challenges in Spontaneous Speech Emotion Recognition #1 Intensity of emotion: mostly subtle (Acted versus Spontaneous) #2 What works for acted speech does not work for spontaneous speech (Data mismatch for ML based systems) #3 Spontaneous speech corpus (Difficulty in building corpus - Encoded emotion) #4 Difficulties in annotation (Decoded emotion disagreement) #5 Difficulties in uniformity of data in each class (Difficulty in getting happy data in a call center!) 16
  • 51. Σing Challenges in Spontaneous Speech Emotion Recognition #1 Intensity of emotion: mostly subtle (Acted versus Spontaneous) #2 What works for acted speech does not work for spontaneous speech (Data mismatch for ML based systems) #3 Spontaneous speech corpus (Difficulty in building corpus - Encoded emotion) #4 Difficulties in annotation (Decoded emotion disagreement) #5 Difficulties in uniformity of data in each class (Difficulty in getting happy data in a call center!) =⇒ emotion recognition literature on acted speech does not help in spontaneous speech emotion recognition 16
  • 52. Observation Recall • fair amount of disagreement among the evaluators (κ = 0.12) (κ = 1 is best) • κ improves (0.65) when knowledge is provided to annotators 17
  • 53. Observation Recall • fair amount of disagreement among the evaluators (κ = 0.12) (κ = 1 is best) • κ improves (0.65) when knowledge is provided to annotators Hypothesis the use of prior knowledge can help address recognizing emotions in spontaneous speech 17
  • 54. Our Effort: Addressing -Spontaneous Speech- Emotion • Knowledge based framework (applicable to acted speech also!) The Crux! 18
  • 55. Our Effort: Addressing -Spontaneous Speech- Emotion • Knowledge based framework (applicable to acted speech also!) Tested on two datasets. • SERES Details • Call Center Conversations Demo 1 Demo 2 18
  • 56. Our Effort: Addressing -Spontaneous Speech- Emotion • Knowledge based framework (applicable to acted speech also!) 18
  • 57. Our Effort: Addressing -Spontaneous Speech- Emotion • Classifier error corrections Like conventional PR problem: feature extraction at the front end, classification done by ANN Error correction decoding to correct errors 19
  • 58. Our Effort: Addressing -Spontaneous Speech- Emotion • Classifier error corrections Like conventional PR problem: feature extraction at the front end, classification done by ANN Error correction decoding to correct errors 19
  • 59. Concluding Remarks More Challenges • Real Time Emotion Not much work regarding the minimum or maximum time duration that exhibit an emotion. Our work shows that 600 ms of audio is sufficient to reliably extract emotion (Audio Segmentation ... for Improved Emotion Recognition) • Feature extraction and representation: optimal features for spontaneous speech (no agreement) • Acoustic variability: different speakers, speaking styles, speaking rates, different sentences, different semantics • Speakers gender, culture, and environment 20
  • 60. Concluding Remarks Difficulty Level Less Challenging −→ Very Challenging Type Acted −→ Spontaneous Media Video −→ Only Audio Read Speech −→ Natural Speech Language Single Language −→ Mixed Language Clarity Clean Speech −→ Noisy Speech Distance Close Talk −→ Far Field Corpus Available −→ Not Available Annotation Good −→ Poor Context Available −→ Not Available Good ASR Available −→ Not Available 20
  • 61. Concluding Remarks Difficulty Level Less Challenging −→ Very Challenging Type Acted −→ Spontaneous Media Video −→ Only Audio Read Speech −→ Natural Speech Language Single Language −→ Mixed Language Clarity Clean Speech −→ Noisy Speech Distance Close Talk −→ Far Field Corpus Available −→ Not Available Annotation Good −→ Poor Context Available −→ Not Available Good ASR Available −→ Not Available • Applicable to most Speech Signal Processing problems! • Definitely for Emotion Recognition 20
  • 62. Concluding Remarks There is a large play field ... go ahead Research and Innovate! Good Luck 20
  • 63. Thank You • Acknowledgements (Rupayan and Meghna) TCS Innovation Labs - Mumbai • Queries? | Comments | Suggestions? Dr Sunil Kopparapu SunilKumar.Kopparapu@TCS.Com TCS Innovation Lab - Mumbai Tata Consultancy Services Limited Loc: 72.977265, 19.225129 Yantra Park, Thane (West), India. END 21
  • 64. Fun • Listen to this Play 22
  • 65. Fun • Listen to this Play • Not Loud? Listen Again Play 22
  • 66. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? 22
  • 67. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? • Let us see if this helps Hidden (some seconds) 22
  • 68. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? • Let us see if this helps Hidden (some seconds) • Greetings! 22
  • 69. Fun • Listen to this Play • Not Loud? Listen Again Play • Your response? Noise? • Let us see if this helps Hidden (some seconds) • Greetings! Can this be used? 22
  • 70. A System & Method For Visual Message Comm • Indian Patent Application [1116/MUM/2013; Mar 25, 2013] • US Publication number US20140287676 A1 Back 23
  • 71. Speech Enabled Railway Enquiry System • Languages Indian English Marathi Hindi • Trains Passing thru Mumbai Real time Info • Functionality Arrival Departure Fare Availability PNR Running Status Demonstration Menu Based Back 24