Text-based Speaker Identification on Multiparty Dialogues
Using Multi-document Convolutional Neural Networks
Kaixin Ma, Catherine Xiao, Jinho D. Choi
Department of Mathematics and Computer Science, Emory University
Objective
• Withhold the identities of the speakers in a multiparty dialogue.
• Assign each utterance in the dialogue to its speaker.
• This work attempts to identify the six main characters in the first eight seasons of the TV show Friends.
• The minor characters in the show are identified collectively as Other.
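The objective above amounts to a seven-way classification: six main characters plus one collective Other class. A minimal sketch of that label scheme follows. The character names are the six leads of Friends; the function name, the id order, and the name normalization are assumptions made for illustration and do not come from the poster.

```python
# Six main characters of Friends; every other speaker falls into "Other".
MAIN_CHARACTERS = ("Monica", "Rachel", "Phoebe", "Ross", "Chandler", "Joey")
LABELS = MAIN_CHARACTERS + ("Other",)
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


def speaker_to_label(speaker: str) -> int:
    """Map a raw speaker name to one of the seven class ids.

    Hypothetical helper: the corpus's actual name format is not shown
    on the poster, so only simple first names are handled here.
    """
    name = speaker.strip().title()
    return LABEL_TO_ID.get(name, LABEL_TO_ID["Other"])


print(speaker_to_label("Ross"))     # 3
print(speaker_to_label("Gunther"))  # 6 (minor character, mapped to Other)
```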
Corpus Description
• The corpus consists of 194 episodes, 2,579 scenes, and 49,755 utterances.
[Figure: Corpus Structure (levels: Seasons, Episodes, Scenes, Utterances; Utterance Text + Speaker + Statement)]
[Figure: Speaker Distribution]
• Each utterance may contain one or more sentences.
• Consecutive utterances always have different speakers.
• The frequency of interactions between pairs of speakers varies.
• The transcripts contain a large number of misspellings and colloquialisms.
• Many utterances are too short and too general to distinguish speakers.
• A second dataset is created by utterance concatenation.
• Utterances from the same speaker within a scene are concatenated.
Example: U1 U2 U3 U4 U5 → U1+U3+U5, U2, U4
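The concatenation step described above can be sketched as follows. This is an illustration only: the `(speaker, text)` tuple layout, the function name, and joining with a single space are assumptions, and the speaker names attached to U1 through U5 are made up to match the example.

```python
def concatenate_by_speaker(scene):
    """Merge all utterances of each speaker within one scene.

    scene: list of (speaker, text) pairs in dialogue order.
    Returns one (speaker, merged_text) pair per speaker, ordered by
    the speaker's first appearance in the scene.
    """
    merged = {}  # dicts keep insertion order in Python 3.7+
    for speaker, text in scene:
        merged.setdefault(speaker, []).append(text)
    return [(speaker, " ".join(texts)) for speaker, texts in merged.items()]


# Toy scene mirroring the U1..U5 example (speaker names are illustrative).
scene = [
    ("Ross", "U1"),
    ("Rachel", "U2"),
    ("Ross", "U3"),
    ("Joey", "U4"),
    ("Ross", "U5"),
]
print(concatenate_by_speaker(scene))
# [('Ross', 'U1 U3 U5'), ('Rachel', 'U2'), ('Joey', 'U4')]
```

Concatenation produces fewer but longer documents per scene, one for each speaker present.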
Approaches
Baseline CNN Structure
• Each utterance is predicted independently.
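The poster does not list the baseline's layers or hyperparameters, so the following is only a generic sketch of a convolutional text classifier applied to one utterance at a time: filters slide over word vectors, each filter is max-pooled over time, and a softmax layer scores the seven classes. The embedding size, filter widths, and random weights are all hypothetical.

```python
import math
import random

DIM = 4          # toy embedding size (hypothetical)
NUM_CLASSES = 7  # six main characters plus Other


def conv_max_pool(embeddings, filt):
    """Apply one filter over consecutive word vectors, then ReLU and
    max-over-time pooling. Short inputs are zero-padded to the filter width."""
    width, dim = len(filt), len(filt[0])
    padded = embeddings + [[0.0] * dim] * max(0, width - len(embeddings))
    best = 0.0  # starting at 0 is equivalent to ReLU followed by max
    for start in range(len(padded) - width + 1):
        window = padded[start:start + width]
        activation = sum(
            w * x
            for f_row, e_row in zip(filt, window)
            for w, x in zip(f_row, e_row)
        )
        best = max(best, activation)
    return best


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(embeddings, filters, weights, bias):
    """Class probabilities for a single utterance, with no scene context."""
    features = [conv_max_pool(embeddings, f) for f in filters]
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


utterance = rand_matrix(5, DIM)                             # five word vectors
filters = [rand_matrix(width, DIM) for width in (2, 3, 4)]  # assumed widths
weights = rand_matrix(NUM_CLASSES, len(filters))
bias = [0.0] * NUM_CLASSES
probs = predict(utterance, filters, weights, bias)
print(len(probs), round(sum(probs), 6))
```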
Multi-document CNN Structure
• The model takes one scene as an input batch.
• The original order of the utterances in the dialogue is preserved.
• The tensor is sliced and padded to represent the previous and next utterances.
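One way to read the slice-and-pad step is shown below: the per-utterance vectors of a scene are shifted by one position in each direction, with a zero vector filling the gap at the scene boundary. The poster does not say which tensor is shifted or how the context is combined, so the plain-list representation and the concatenation `[previous | current | next]` are assumptions.

```python
def with_context(scene_vectors):
    """Attach the previous and next utterance to every utterance in a scene.

    scene_vectors: one feature vector per utterance, in dialogue order.
    Returns, for each utterance, the concatenation
    [previous | current | next], using zero vectors at scene boundaries.
    """
    if not scene_vectors:
        return []
    zero = [0.0] * len(scene_vectors[0])
    previous = [zero] + scene_vectors[:-1]   # slice off the last, pad the front
    following = scene_vectors[1:] + [zero]   # slice off the first, pad the end
    return [p + c + n for p, c, n in zip(previous, scene_vectors, following)]


scene = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
for row in with_context(scene):
    print(row)
# [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
# [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
# [2.0, 2.0, 3.0, 3.0, 0.0, 0.0]
```

Because the whole scene is processed as one batch in its original order, the shift needs no lookup beyond neighbouring rows.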
Results
• The multi-document CNN improves identification accuracy by 6% over the baseline CNN.
• The model captures different speech patterns better on longer documents.
• When the prediction labels are restricted, accuracy improves by 10% and 12% on the two datasets, respectively.
• Speakers identified with higher accuracy are also confused by the model more often than others.
• The frequency of interactions between a pair of speakers correlates with the rate at which the model confuses them.
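Per-speaker accuracy and pairwise confusion, the two quantities discussed above, can be computed from gold and predicted labels as in this sketch. The label sequences are toy values invented for the example and are not results from the poster; the function names are likewise hypothetical.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count how often each (gold speaker, predicted speaker) pair occurs."""
    return Counter(zip(gold, predicted))


def per_speaker_accuracy(gold, predicted):
    """Fraction of each speaker's utterances that were labeled correctly."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {speaker: correct[speaker] / total[speaker] for speaker in total}


def pair_confusion(counts, a, b):
    """Errors between speakers a and b, counted in both directions."""
    return counts[(a, b)] + counts[(b, a)]


# Toy labels for illustration only.
gold = ["Ross", "Rachel", "Ross", "Joey", "Rachel", "Ross"]
pred = ["Ross", "Ross", "Rachel", "Joey", "Rachel", "Ross"]

counts = confusion_counts(gold, pred)
print(pair_confusion(counts, "Ross", "Rachel"))  # 2
print(per_speaker_accuracy(gold, pred)["Joey"])  # 1.0
```

Comparing `pair_confusion` against how often two speakers talk to each other in the corpus is one way to check the correlation noted above.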
Conclusion
• We present a neural-network-based approach to speaker identification in multiparty dialogue that relies on textual transcripts.
• Contextual information is essential to the performance of text-based speaker identification.
• Because our model can identify speakers in the absence of audio data, we expect interest from the intelligence and surveillance community.
• We plan to incorporate text-based features into a larger audio-based speaker identification system to enhance its security.
Acknowledgement
• We gratefully acknowledge the Department of Mathematics and Computer Science at Emory University for supporting this work. Any content presented here is solely the responsibility of the authors and does not necessarily represent the official view of the organization.
