DIALOGUE ACT MODELING
FOR AUTOMATIC TAGGING
AND RECOGNITION OF
CONVERSATIONAL SPEECH
Paper published by Andreas Stolcke, Noah
Coccaro, Rebecca Bates, Paul Taylor, Carol Van
Ess-Dykema, Klaus Ries, Elizabeth Shriberg, Daniel
Jurafsky, Rachel Martin, Marie Meteer
Presented by Vipul Munot
Indiana University Bloomington
MS Data Science
AGENDA
 Objective
 Dialogue Acts
 Dialogue Act Labelling Task
 Utterance Segmentation
 Tag set
 Major Dialogue Act Types
 Hidden Markov Modeling of Dialogue
 Dialogue Act Likelihoods
 Dialogue Act Decoding
 Discourse Grammars
 N-gram Discourse Models
 Dialogue Act Classification
 Dialogue Act Classification Using Words
 Dialogue Act Classification Using Prosody
 Prosodic Decision Trees
 Neural Network Classifiers
 Intonation Event Likelihoods
 Focused Classifications
 Speech Recognition
 Integrating DA Modeling and ASR
 Experiments and Results
 Related Work
 Future Work
 Conclusions
OBJECTIVE
 Present a comprehensive framework for modelling and automatic
classification of DAs, founded on well-known statistical methods.
 Present results obtained with this approach on a large, widely
available corpus of spontaneous conversational speech.
DIALOGUE ACTS
 A tag set that classifies utterances according to a combination of
pragmatic, semantic, and syntactic criteria.
E.g., a stenographer needs to keep track of who said what to whom.
DIALOGUE ACT LABELLING TASK
 Domain: the Switchboard corpus of human-human conversational
telephone speech (Godfrey, Holliman, and McDaniel 1992),
distributed by the Linguistic Data Consortium.
 Used human hand-coding of DAs for each utterance, together with a
variety of automatic and semiautomatic tools.
 96% accuracy based on correct word transcripts, and 78% accuracy
with automatically recognized words.
UTTERANCE SEGMENTATION
 Switchboard data is not segmented in a linguistically consistent way.
 They used a version of the corpus that had been hand-segmented into
sentence-level units previously.
 The relation between utterances and speaker turns is not one-to-one:
a single turn can contain multiple utterances.
 Also, automatic segmentation of spontaneous speech was still an open
research problem at the time.
TAG SET
 Started with a standard for shallow discourse structure annotation, the
Dialogue Act Markup in Several Layers (DAMSL) tag set.
 Then modified it to make it more relevant to the corpus and task.
 Labelled categories that seemed both inherently interesting linguistically and
that could be identified reliably.
 The resulting SWBD-DAMSL tag set was multidimensional, comprising
approximately 50 basic tags.
 The tag set is structured so as to allow labelers to annotate a Switchboard
conversation from transcripts alone (i.e., without listening) in about 30
minutes.
MAJOR DIALOGUE ACT TYPES
 Statements and Opinions
 Questions
 Backchannels
 E.g.: Hmm, Uh-huh, Um, Okay, etc.
 Turn Exits and Abandoned Utterances
 Answers and Agreements
HIDDEN MARKOV MODELING OF DIALOGUE
 Given all available evidence E about a conversation, the goal is
to find the DA sequence U that has the highest posterior
probability P(U|E) given that evidence. Applying Bayes’ Rule we get
P(U|E) = P(U) P(E|U) / P(E); since P(E) does not depend on U, it
suffices to maximize P(U) P(E|U).
 Here P(U) represents the prior probability of a DA sequence, and
P(E|U) is the likelihood of U given the evidence (i.e., the probability
of observing the evidence E if the DA sequence is U).
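Sketching the full model in LaTeX, following the paper's decomposition: the prior P(U) is supplied by the discourse grammar, and the evidence for each utterance is assumed to depend only on its own DA (this conditional independence is the paper's stated modeling assumption):

```latex
P(U \mid E) \;\propto\; P(U)\, P(E \mid U),
\qquad
P(E \mid U) \;\approx\; \prod_{i} P(E_i \mid U_i)
```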
DIALOGUE ACT LIKELIHOODS
 The computation of likelihoods P(E|U) depends on the types of
evidence used.
 The following sources of evidence were used:
 Transcribed words
 Recognized words
 Prosodic features, capturing various aspects of pitch, duration,
energy, etc.
DIALOGUE ACT DECODING
 The HMM representation allows us to use efficient dynamic
programming algorithms to compute relevant aspects of the model,
such as
 the most probable DA sequence (the Viterbi algorithm)
 the posterior probability of various DAs for a given utterance, after
considering all the evidence (the forward-backward algorithm)
 The Viterbi Algorithm for HMM finds globally the most probable state
sequence, but it does not necessarily find the sequence that has the
most DA labels correct.
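A minimal Viterbi sketch for DA decoding in Python; the array layout and the bigram discourse grammar are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def viterbi(prior, trans, loglik):
    """Most probable DA sequence under a bigram discourse grammar.

    prior:  (K,)   log P(u_1)            -- initial DA log-probabilities
    trans:  (K, K) log P(u_t | u_{t-1})  -- discourse bigram
    loglik: (T, K) log P(E_t | u_t)      -- per-utterance evidence likelihoods
    """
    T, K = loglik.shape
    delta = prior + loglik[0]             # best log-score ending in each DA
    back = np.zeros((T, K), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + trans   # scores[j, k]: prev DA j -> DA k
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):         # trace backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The forward-backward algorithm replaces the max with a sum to yield per-utterance DA posteriors, which optimize per-label accuracy rather than whole-sequence probability.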
DISCOURSE GRAMMARS
 A statistical model of DA sequencing, supplying the prior P(U) in the
HMM.
 More broadly, discourse is the use of spoken or written language
in a social context.
N-GRAM DISCOURSE MODELS
 Models of various orders were compared by their perplexities,
i.e., the average number of choices the model predicts for each
tag, conditioned on the preceding tags.
 As expected, we see an improvement (decreasing perplexities)
for increasing n-gram order. However, the incremental gain of a
trigram is small, and higher-order models did not prove useful.
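A toy sketch of measuring tag perplexity with a bigram discourse model; the add-alpha smoothing is a simplifying assumption (the paper used standard backoff n-gram estimates):

```python
import math
from collections import Counter

def bigram_perplexity(train_tags, test_tags, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram tag model on test_tags."""
    vocab = set(train_tags) | set(test_tags)
    bigrams = Counter(zip(train_tags, train_tags[1:]))
    contexts = Counter(train_tags[:-1])
    logprob = 0.0
    for prev, cur in zip(test_tags, test_tags[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * len(vocab))
        logprob += math.log(p)
    return math.exp(-logprob / (len(test_tags) - 1))

tags = ["stmt", "backchannel", "stmt", "question", "answer", "stmt"]
print(bigram_perplexity(tags, tags))  # lower = more predictable tag sequence
```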
DIALOGUE ACT CLASSIFICATION
 Dialogue Act Classification Using Words
 Dialogue Act Classification Using Prosody
 Using Multiple Knowledge Sources
DIALOGUE ACT CLASSIFICATION
USING WORDS
 Classification from True Words
 Classification from Recognized Words
CLASSIFICATION FROM RECOGNIZED
WORDS
 For fully automatic DA classification, the true-word approach is only
a partial solution, since recognizers cannot transcribe words in
spontaneous speech with perfect accuracy.
 The classification framework is modified such that the recognizer’s
acoustic information (spectral features) A appears as the evidence.
 P(A|W): the acoustic likelihood for every recognized word sequence
W.
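The evidence likelihood then marginalizes over word hypotheses, assuming the acoustics depend on the DA only through the words; in practice the sum is approximated over the recognizer's N-best list (a LaTeX sketch consistent with the paper's formulation):

```latex
P(A_i \mid U_i) \;=\; \sum_{W} P(A_i \mid W)\, P(W \mid U_i)
\;\approx\; \sum_{W \in \text{N-best}} P(A_i \mid W)\, P(W \mid U_i)
```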
RESULTS
 The best accuracy obtained from transcribed words, 71%, is encouraging
given a comparable human performance of 84%.
DIALOGUE ACT CLASSIFICATION
USING PROSODY
 Prosodic Features
 Prosodic Decision Trees
 Neural Network Classifiers
 Intonation Event Likelihoods
PROSODIC FEATURES
 Prosodic DA classification was based on a large set of features
computed automatically from the waveform, without reference to
word or phone information.
 duration (e.g., utterance duration, with and without pauses),
 pauses (e.g., total and mean of nonspeech regions exceeding 100 ms),
 pitch (e.g., F0 statistics),
 energy (e.g., mean and range of RMS energy; same for signal-to-noise
ratio [SNR]),
 speaking rate (based on the “enrate” measure of Morgan, Fosler, and
Mirghafori [1997]), and
 gender (of both speaker and listener).
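A hedged sketch of computing a few such waveform-level features in Python; librosa is an assumed stand-in tool (the paper used its own extraction pipeline, including the "enrate" speaking-rate measure, which is not reproduced here):

```python
import numpy as np
import librosa  # assumed tooling, not used in the original work

def prosodic_features(wav_path):
    """Duration, energy, and pitch statistics from the raw waveform alone."""
    y, sr = librosa.load(wav_path, sr=16000)
    rms = librosa.feature.rms(y=y)[0]                      # frame-level RMS energy
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)   # F0 track
    f0 = f0[~np.isnan(f0)]                                 # voiced frames only
    return {
        "duration_s": len(y) / sr,
        "rms_mean": float(rms.mean()),
        "rms_range": float(rms.max() - rms.min()),
        "f0_mean": float(f0.mean()) if f0.size else 0.0,
        "f0_range": float(f0.max() - f0.min()) if f0.size else 0.0,
    }
```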
PROSODIC DECISION TREES
 A prosodic tree trained on this task revealed that agreements
have consistently longer durations and greater energy (as
reflected by the SNR measure) than do backchannels.
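A sketch of this kind of prosodic tree using scikit-learn on synthetic stand-in data (the paper grew CART-style trees on its own prosodic database; the feature distributions below merely mimic the agreement-vs.-backchannel regularity described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)                          # 0 = backchannel, 1 = agreement
duration = rng.normal(0.5 + 0.6 * y, 0.2, n)       # agreements are longer
snr_energy = rng.normal(10.0 + 5.0 * y, 3.0, n)    # ...and louder (SNR proxy)
X = np.column_stack([duration, snr_energy])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print("held-out accuracy:", tree.score(X_te, y_te))
```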
RESULTS WITH DECISION TREES
 The task used the five most frequent DA types (STATEMENT, BACKCHANNEL, OPINION,
ABANDONED, and AGREEMENT, totaling 79% of the data) plus an Other
category comprising all remaining DA types.
 The tree achieved a classification accuracy of 45.4% on an independent
test set with the same uniform six-class distribution.
 The chance accuracy on this set was 16.6%, so the tree clearly extracts
useful information from the prosodic features.
NEURAL NETWORK CLASSIFIERS
 They tested various neural network models on the same six-class
downsampled data as used for decision tree training, using a
variety of network architectures and output layer functions.
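A stand-in sketch with scikit-learn's MLPClassifier on synthetic six-class data (the paper compared several architectures and output functions; none of its specific setups are reproduced here):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1200
y = rng.integers(0, 6, n)                        # six downsampled DA classes
X = rng.normal(y[:, None] * 0.5, 1.0, (n, 4))    # synthetic prosodic features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
print("held-out accuracy:", mlp.fit(X_tr, y_tr).score(X_te, y_te))
```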
INTONATION EVENT LIKELIHOODS
 The approach relies on the intuition that different utterance types
are characterized by different intonational “tunes” (Kowtko
1996), and has been successfully applied to the classification of
move types in the DCIEM Map Task corpus (Wright and Taylor
1997).
 Unfortunately, the event classification accuracy on the
Switchboard corpus was poor, but this approach could prove
valuable in the future.
USING MULTIPLE KNOWLEDGE
SOURCES
 Improved performance was expected from combining word and
prosodic information. Combining these knowledge sources requires
estimating a combined likelihood, P(Ai, Fi | Ui).
 One important respect in which the independence assumption is
violated is in the modeling of utterance length. While utterance
length itself is not a prosodic feature, it is an important feature to
condition on when examining prosodic characteristics of utterances,
and is thus best included in the decision tree.
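In LaTeX, the combination rests on the paper's independence assumption between the acoustic and prosodic evidence given the DA (violated, as noted, where utterance length is concerned):

```latex
P(A_i, F_i \mid U_i) \;\approx\; P(A_i \mid U_i)\, P(F_i \mid U_i)
```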
FOCUSED CLASSIFICATIONS
 Tested a prosodic classifier, a word-based classifier (with both
transcribed and recognized words), and a combined classifier on
pairwise DA classification tasks, downsampling the DA distribution to
equate the class sizes in each case. Chance performance in all
experiments is therefore 50%.
SPEECH RECOGNITION
 The wave file is cleaned by removing background noise and normalizing volume.
 The resulting filtered waveform is then broken down into phonemes.
(Phonemes are the basic building-block sounds of a language; English
has 44 of them, including sounds such as “wh”, “th”, and “t”.)
 Each phoneme is like a chain link: by analyzing them in sequence, starting
from the first phoneme, the ASR software uses statistical probability analysis to
deduce whole words and, from there, complete sentences.
 The ASR system, having “understood” your words, can then respond to you in a
meaningful way.
INTEGRATING DA MODELING AND ASR
 The likelihoods P(Ai|Wi) are estimated by the recognizer’s
acoustic model. In a standard recognizer the language model
P(Wi) is the same for all utterances; the idea here is to obtain
better-quality language models by conditioning on the DA type
Ui.
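One way to realize this during rescoring, sketched in LaTeX: mix DA-conditioned language models weighted by the DA posteriors from the discourse model (consistent with the slide, though the exact weighting scheme is an assumption here):

```latex
P(W_i) \;=\; \sum_{u} P(U_i = u \mid E)\; P(W_i \mid U_i = u)
```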
EXPERIMENTS AND RESULTS
RELATED WORK
 Previous research on DA modeling has generally focused on task-
oriented dialogue, with three tasks in particular garnering much
of the research effort.
 The Map Task corpus (Anderson et al. 1991; Bard et al. 1995) consists
of conversations between two speakers with slightly different maps
of an imaginary territory.
 The VERBMOBIL corpus consists of two-party scheduling dialogues:
 Reithinger et al. (1996),
 Reithinger and Klesen (1997), and
 Samuel, Carberry, and Vijay-Shanker (1998).
FUTURE RESEARCH
 Dialogue grammars for conversational speech need to be made
more aware of the temporal properties of utterances.
 For example, current models do not capture the fact that utterances by the
conversants may actually overlap (e.g., backchannels interrupting
an ongoing utterance).
 There is a need for fairly large, standardized corpora that allow comparisons
over time and across approaches.
CONCLUSIONS
 The approach combines models for lexical and prosodic
realizations of DAs, as well as a statistical discourse grammar.
 All components of the model are automatically trained, and are
thus applicable to other domains for which labeled data is
available.
THANK YOU