DIALOGUE ACT MODELING
FOR AUTOMATIC TAGGING
AND RECOGNITION OF
CONVERSATIONAL SPEECH
Paper published by Andreas Stolcke, Noah
Coccaro, Rebecca Bates, Paul Taylor, Carol Van
Ess-Dykema, Klaus Ries, Elizabeth Shriberg, Daniel
Jurafsky, Rachel Martin, Marie Meteer
Presented by Vipul Munot
Indiana University Bloomington
MS Data Science
AGENDA
 Objective
 Dialogue Acts
 Dialogue Act Labelling Task
 Utterance Segmentation
 Tag set
 Major Dialogue Act Types
 Hidden Markov Modeling of Dialogue
 Dialogue Act Likelihoods
 Dialogue Act Decoding
 Discourse Grammars
 N-gram Discourse Models
 Dialogue Act Classification
 Dialogue Act Classification Using Words
 Dialogue Act Classification Using Prosody
 Prosodic Decision Trees
 Neural Network Classifiers
 Intonation Event Likelihoods
 Focused Classifications
 Speech Recognition
 Integrating DA Modeling and ASR
 Experiments and Results
 Related Work
 Future Work
 Conclusions
OBJECTIVE
 Present a comprehensive framework for modelling and automatic
classification of DAs, founded on well-known statistical methods.
 Present results obtained with this approach on a large, widely
available corpus of spontaneous conversational speech.
DIALOGUE ACTS
 A tag set that classifies utterances according to a combination of
pragmatic, semantic, and syntactic criteria.
E.g., a stenographer needs to keep track of who said what to whom.
DIALOGUE ACT LABELLING TASK
 Domain: the Switchboard corpus of human-human conversational
telephone speech (Godfrey, Holliman, and McDaniel 1992),
distributed by the Linguistic Data Consortium.
 Used human hand-coding of DAs for each utterance, together with a
variety of automatic and semiautomatic tools.
 96% accuracy based on correct word transcripts, and 78% accuracy
with automatically recognized words.
UTTERANCE SEGMENTATION
 Switchboard data is not segmented in a linguistically consistent way.
 They used a version of the corpus that had been hand-segmented into
sentence-level units previously.
 The relation between utterances and speaker turns is not one-to-one:
a single turn can contain multiple utterances.
 Also, automatic segmentation of spontaneous speech was still an open
research problem at the time.
TAG SET
 Started with a standard for shallow discourse structure annotation, the
Dialogue Act Markup in Several Layers (DAMSL) tag set.
 Then modified it to make it more relevant to the corpus and task.
 Labelled categories that seemed both inherently interesting linguistically and
that could be identified reliably.
 The resulting SWBD-DAMSL tag set was multidimensional, comprising
approximately 50 basic tags.
 The tag set is structured so as to allow labelers to annotate a Switchboard
conversation from transcripts alone (i.e., without listening) in about 30
minutes.
MAJOR DIALOGUE ACT TYPES
 Statements and Opinions
 Questions
 Backchannels
 E.g.: Hmm, Uh-huh, Um, Okay, etc.
 Turn Exits and Abandoned Utterances
 Answers and Agreements
HIDDEN MARKOV MODELING OF DIALOGUE
 Given all available evidence E about a conversation, the goal is
to find the DA sequence U that has the highest posterior
probability P(U|E) given that evidence. Applying Bayes’ Rule we get
P(U|E) = P(U) P(E|U) / P(E); since P(E) does not depend on U, it
suffices to maximize P(U) P(E|U).
 Here P(U) represents the prior probability of a DA sequence, and
P(E|U) is the likelihood of U given the evidence (i.e., the probability
of observing the evidence E if the DA sequence is U).
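Sketching the full model in LaTeX, following the paper's decomposition: the prior P(U) is supplied by the discourse grammar, and the evidence for each utterance is assumed to depend only on its own DA (this conditional independence is the paper's stated modeling assumption):

```latex
P(U \mid E) \;\propto\; P(U)\, P(E \mid U),
\qquad
P(E \mid U) \;\approx\; \prod_{i} P(E_i \mid U_i)
```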
DIALOGUE ACT LIKELIHOODS
 The computation of likelihoods P(E|U) depends on the types of
evidence used.
 The following sources of evidence were used:
 Transcribed words
 Recognized words
 Prosodic features, capturing various aspects of pitch, duration,
energy, etc.
DIALOGUE ACT DECODING
 The HMM representation allows us to use efficient dynamic
programming algorithms to compute relevant aspects of the model,
such as
 the most probable DA sequence (the Viterbi algorithm)
 the posterior probability of various DAs for a given utterance, after
considering all the evidence (the forward-backward algorithm)
 The Viterbi Algorithm for HMM finds globally the most probable state
sequence, but it does not necessarily find the sequence that has the
most DA labels correct.
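A minimal Viterbi sketch for DA decoding in Python; the array layout and the bigram discourse grammar are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def viterbi(prior, trans, loglik):
    """Most probable DA sequence under a bigram discourse grammar.

    prior:  (K,)   log P(u_1)            -- initial DA log-probabilities
    trans:  (K, K) log P(u_t | u_{t-1})  -- discourse bigram
    loglik: (T, K) log P(E_t | u_t)      -- per-utterance evidence likelihoods
    """
    T, K = loglik.shape
    delta = prior + loglik[0]             # best log-score ending in each DA
    back = np.zeros((T, K), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + trans   # scores[j, k]: prev DA j -> DA k
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):         # trace backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The forward-backward algorithm replaces the max with a sum to yield per-utterance DA posteriors, which optimize per-label accuracy rather than whole-sequence probability.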
DISCOURSE GRAMMARS
 A statistical model of DA sequencing, supplying the prior P(U) in the
HMM.
 More broadly, discourse is the use of spoken or written language
in a social context.
N-GRAM DISCOURSE MODELS
 Models of various orders were compared by their perplexities,
i.e., the average number of choices the model predicts for each
tag, conditioned on the preceding tags.
 As expected, we see an improvement (decreasing perplexities)
for increasing n-gram order. However, the incremental gain of a
trigram is small, and higher-order models did not prove useful.
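A toy sketch of measuring tag perplexity with a bigram discourse model; the add-alpha smoothing is a simplifying assumption (the paper used standard backoff n-gram estimates):

```python
import math
from collections import Counter

def bigram_perplexity(train_tags, test_tags, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram tag model on test_tags."""
    vocab = set(train_tags) | set(test_tags)
    bigrams = Counter(zip(train_tags, train_tags[1:]))
    contexts = Counter(train_tags[:-1])
    logprob = 0.0
    for prev, cur in zip(test_tags, test_tags[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * len(vocab))
        logprob += math.log(p)
    return math.exp(-logprob / (len(test_tags) - 1))

tags = ["stmt", "backchannel", "stmt", "question", "answer", "stmt"]
print(bigram_perplexity(tags, tags))  # lower = more predictable tag sequence
```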
DIALOGUE ACT CLASSIFICATION
 Dialogue Act Classification Using Words
 Dialogue Act Classification Using Prosody
 Using Multiple Knowledge Sources
DIALOGUE ACT CLASSIFICATION
USING WORDS
 Classification from True Words
 Classification from Recognized Words
CLASSIFICATION FROM RECOGNIZED
WORDS
 For fully automatic DA classification, the true-word approach is only
a partial solution, since recognizers cannot transcribe words in
spontaneous speech with perfect accuracy.
 The classification framework is modified such that the recognizer’s
acoustic information (spectral features) A appears as the evidence.
 P(A|W): the acoustic likelihood for every recognized word sequence
W.
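The evidence likelihood then marginalizes over word hypotheses, assuming the acoustics depend on the DA only through the words; in practice the sum is approximated over the recognizer's N-best list (a LaTeX sketch consistent with the paper's formulation):

```latex
P(A_i \mid U_i) \;=\; \sum_{W} P(A_i \mid W)\, P(W \mid U_i)
\;\approx\; \sum_{W \in \text{N-best}} P(A_i \mid W)\, P(W \mid U_i)
```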
RESULTS
 The best accuracy obtained from transcribed words, 71%, is encouraging
given a comparable human performance of 84%.
DIALOGUE ACT CLASSIFICATION
USING PROSODY
 Prosodic Features
 Prosodic Decision Trees
 Neural Network Classifiers
 Intonation Event Likelihoods
PROSODIC FEATURES
 Prosodic DA classification was based on a large set of features
computed automatically from the waveform, without reference to
word or phone information.
 duration (e.g., utterance duration, with and without pauses),
 pauses (e.g., total and mean of nonspeech regions exceeding 100 ms),
 pitch (e.g., F0 statistics),
 energy (e.g., mean and range of RMS energy; same for signal-to-noise
ratio [SNR]),
 speaking rate (based on the “enrate” measure of Morgan, Fosler, and
Mirghafori [1997]), and
 gender (of both speaker and listener).
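A hedged sketch of computing a few such waveform-level features in Python; librosa is an assumed stand-in tool (the paper used its own extraction pipeline, including the "enrate" speaking-rate measure, which is not reproduced here):

```python
import numpy as np
import librosa  # assumed tooling, not used in the original work

def prosodic_features(wav_path):
    """Duration, energy, and pitch statistics from the raw waveform alone."""
    y, sr = librosa.load(wav_path, sr=16000)
    rms = librosa.feature.rms(y=y)[0]                      # frame-level RMS energy
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)   # F0 track
    f0 = f0[~np.isnan(f0)]                                 # voiced frames only
    return {
        "duration_s": len(y) / sr,
        "rms_mean": float(rms.mean()),
        "rms_range": float(rms.max() - rms.min()),
        "f0_mean": float(f0.mean()) if f0.size else 0.0,
        "f0_range": float(f0.max() - f0.min()) if f0.size else 0.0,
    }
```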
PROSODIC DECISION TREES
 A prosodic tree trained on this task revealed that agreements
have consistently longer durations and greater energy (as
reflected by the SNR measure) than do backchannels.
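A sketch of this kind of prosodic tree using scikit-learn on synthetic stand-in data (the paper grew CART-style trees on its own prosodic database; the feature distributions below merely mimic the agreement-vs.-backchannel regularity described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)                          # 0 = backchannel, 1 = agreement
duration = rng.normal(0.5 + 0.6 * y, 0.2, n)       # agreements are longer
snr_energy = rng.normal(10.0 + 5.0 * y, 3.0, n)    # ...and louder (SNR proxy)
X = np.column_stack([duration, snr_energy])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print("held-out accuracy:", tree.score(X_te, y_te))
```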
RESULTS WITH DECISION TREES
 The task used the five most frequent DA types (STATEMENT, BACKCHANNEL, OPINION,
ABANDONED, and AGREEMENT, totaling 79% of the data) plus an Other
category comprising all remaining DA types.
 The tree achieved a classification accuracy of 45.4% on an independent
test set with the same uniform six-class distribution.
 The chance accuracy on this set was 16.6%, so the tree clearly extracts
useful information from the prosodic features.
NEURAL NETWORK CLASSIFIERS
 They tested various neural network models on the same six-class
downsampled data as used for decision tree training, using a
variety of network architectures and output layer functions.
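A stand-in sketch with scikit-learn's MLPClassifier on synthetic six-class data (the paper compared several architectures and output functions; none of its specific setups are reproduced here):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1200
y = rng.integers(0, 6, n)                        # six downsampled DA classes
X = rng.normal(y[:, None] * 0.5, 1.0, (n, 4))    # synthetic prosodic features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
print("held-out accuracy:", mlp.fit(X_tr, y_tr).score(X_te, y_te))
```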
INTONATION EVENT LIKELIHOODS
 The approach relies on the intuition that different utterance types
are characterized by different intonational “tunes” (Kowtko
1996), and has been successfully applied to the classification of
move types in the DCIEM Map Task corpus (Wright and Taylor
1997).
 Unfortunately, the event classification accuracy on the
Switchboard corpus was poor, but this approach could prove
valuable in the future.
USING MULTIPLE KNOWLEDGE
SOURCES
 Improved performance was expected from combining word and
prosodic information. Combining these knowledge sources requires
estimating a combined likelihood, P(Ai, Fi | Ui).
 One important respect in which the independence assumption is
violated is in the modeling of utterance length. While utterance
length itself is not a prosodic feature, it is an important feature to
condition on when examining prosodic characteristics of utterances,
and is thus best included in the decision tree.
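In LaTeX, the combination rests on the paper's independence assumption between the acoustic and prosodic evidence given the DA (violated, as noted, where utterance length is concerned):

```latex
P(A_i, F_i \mid U_i) \;\approx\; P(A_i \mid U_i)\, P(F_i \mid U_i)
```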
FOCUSED CLASSIFICATIONS
 Tested a prosodic classifier, a word-based classifier (with both
transcribed and recognized words), and a combined classifier on
pairwise DA classification tasks, downsampling the DA distribution to
equate the class sizes in each case. Chance performance in all
experiments is therefore 50%.
SPEECH RECOGNITION
 The wave file is cleaned by removing background noise and normalizing volume.
 The resulting filtered waveform is then broken down into phonemes.
(Phonemes are the basic building-block sounds of a language; English
has 44 of them, including sounds such as “wh”, “th”, and “t”.)
 Each phoneme is like a chain link: by analyzing them in sequence, starting
from the first phoneme, the ASR software uses statistical probability analysis to
deduce whole words and, from there, complete sentences.
 The ASR system, having “understood” your words, can then respond to you in a
meaningful way.
INTEGRATING DA MODELING AND ASR
 The likelihoods P(Ai|Wi) are estimated by the recognizer’s
acoustic model. In a standard recognizer the language model
P(Wi) is the same for all utterances; the idea here is to obtain
better-quality language models by conditioning on the DA type
Ui.
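One way to realize this during rescoring, sketched in LaTeX: mix DA-conditioned language models weighted by the DA posteriors from the discourse model (consistent with the slide, though the exact weighting scheme is an assumption here):

```latex
P(W_i) \;=\; \sum_{u} P(U_i = u \mid E)\; P(W_i \mid U_i = u)
```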
EXPERIMENTS AND RESULTS
RELATED WORK
 Previous research on DA modeling has generally focused on task-
oriented dialogue, with three tasks in particular garnering much
of the research effort.
 The Map Task corpus (Anderson et al. 1991; Bard et al. 1995) consists
of conversations between two speakers with slightly different maps
of an imaginary territory.
 The VERBMOBIL corpus consists of two-party scheduling dialogues:
 Reithinger et al. (1996),
 Reithinger and Klesen (1997), and
 Samuel, Carberry, and Vijay-Shanker (1998).
FUTURE RESEARCH
 Dialogue grammars for conversational speech need to be made
more aware of the temporal properties of utterances.
 For example, current models do not capture the fact that utterances by the
conversants may actually overlap (e.g., backchannels interrupting
an ongoing utterance).
 There is a need for fairly large, standardized corpora that allow comparisons
over time and across approaches.
CONCLUSIONS
 The approach combines models for lexical and prosodic
realizations of DAs, as well as a statistical discourse grammar.
 All components of the model are automatically trained, and are
thus applicable to other domains for which labeled data is
available.
THANK YOU