Spoken Content Retrieval
― Lattices and Beyond
Lin-shan Lee
http://speech.ee.ntu.edu.tw/previous_version/lslNew.htm
National Taiwan University
Taipei, Taiwan, ROC
Outline
• Introduction
• Fundamentals
• Recent Research Examples
– Parameter weighting, relevance feedback, graph-based, SVM,
concept matching, interactive retrieval, etc.
• Demo
• Conclusion
Introduction:
Spoken Content Retrieval
• Text content retrieval extremely successful
– information desired by the users can be obtained very efficiently
– all users like it
– producing a very successful industry
• All roles of texts can be accomplished by voice
– spoken content or multimedia content with voice in audio part
– voice instructions/queries via handheld devices
• Spoken content retrieval
[Figure: the user sends instructions/queries over the Internet to servers holding documents/information]
Text/Spoken Content Retrieval (1/3)
Text/Spoken Content Retrieval (2/3)
[Figure: text or spoken instructions/queries (e.g. "US president?") matched against text or spoken content (multimedia content including an audio part, e.g. "Barack Obama ...")]
• User instructions and/or network content can be in the form of voice
– text queries/spoken content : spoken document retrieval, spoken term
detection
– spoken queries/text content : voice search
[Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08]
– spoken queries/spoken content : query by example
Text/Spoken Content Retrieval (3/3)
[Figure repeated from the previous slide: text/spoken queries matched against text/spoken content]
• If spoken content/queries can be accurately recognized
– Reduced to text content retrieval
• Correct in principle, but sufficiently accurate recognition is never possible
• Many hand-held devices with multimedia functionalities available
• Unlimited quantities of multimedia content fast growing over the
Internet
• User-content interaction necessary for retrieval can be accomplished
by spoken and multi-modal dialogues
• Network access is primarily text-based today, but almost all roles of
texts can be accomplished by voice
[Figure: over the Internet, text content feeds Text Content Retrieval (text information), while multimedia and spoken content feed Spoken Content Retrieval via Multimedia Content Analysis (voice information); Text-to-Speech Synthesis and Spoken and Multi-modal Dialogue connect the user to both]
Wireless and Multimedia Technologies are Creating An
Environment for Spoken Content Retrieval
Fundamentals
• Low recognition accuracies for spontaneous speech, including Out-of-Vocabulary (OOV) words, under adverse environments
– considering lattices with multiple alternatives rather than the 1-best output
higher probability of including the correct words, but also more noisy words
correct words may still be excluded (OOV and others)
huge memory and computation requirements
[Chelba, Hazen, Saraclar, IEEE SPM 08][Saraclar & Sproat, HLT 04]
[Vergyri, et al, Interspeech 07]
Lattices for Spoken Content Retrieval
[Figure: a word lattice between a start node and an end node along the time index; Wi: word hypotheses]
• Confusion Matrices
– use of confusion matrices to model recognition errors and expand
the query/document, etc.
[Ng, ICSLP 98][Pinto & Hermansky, SSCS 08][Chaudhari, ASRU 09]
[Parada, ICASSP 10][Wallace, ICASSP 10]
• Pronunciation Modeling
– use of pronunciation models to expand the query, etc.
[Mertens, Interspeech 09][Mertens, ASRU 09][Can, ICASSP 09]
[Wang, Interspeech 09][Wang, ICASSP 10][Wang, Interspeech 10]
• Fuzzy Matching
– query/content matching not necessarily exact
[Shao, Interspeech 07][Schneider, SSCS 08][Mamou, Interspeech 08]
Other Approach Examples in addition to Lattices
• Lattices
• An example of lattice indexing approach
– position information for words readily available
– more compatible with existing text indexing techniques
– reduced memory and computation requirements (still huge…)
Lattice Indexing Approaches
[Figure: the word lattice converted into an index of word hypotheses with scores (Wi, pi)]
• Lattices
• An example of lattice indexing approach
– position information for words readily available
– more compatible with existing text indexing techniques
– reduced memory and computation requirements (still huge…)
– added possible paths
– noisy words discriminated by posterior probabilities or similar scores
– n-grams matched with the query and accumulated for all possible n
Lattice Indexing Approaches
[Figure repeated: the word lattice and its index of (Wi, pi) entries]
• Position Specific Posterior Lattices (PSPL)
[Chelba & Acero, ACL 05][Chelba & Acero, Computer Speech and Language 07]
• Confusion Networks (CN)
[Mamou, et al, SIGIR 06][Hori, Hazen, Glass, ICASSP 07][Mamou, et al, SIGIR 07]
• Time-based Merging for Indexing (TMI)
[Zhou, Chelba, Seide, HLT 06][Seide, et al, ASRU 07]
• Time-anchored Lattice Expansion (TALE)
[Seide, et al, ASRU 07][Seide, et al, ICASSP 08]
• WFST
– directly compile the lattice into a weighted finite state
transducer
[Allauzen, et al, HLT 04][Parlak & Saraclar, ICASSP 08][Can, ICASSP 09]
[Parada, ASRU 09]
Examples of Lattice Indexing Approaches
Position Specific Posterior Lattices (PSPL) and Confusion Networks (CN)
• PSPL:
─ locating a word in a segment according to the order of the word in a path
[Figure: a lattice whose paths (e.g. W1W2, W3W4W5, W6W8W9, W7W8W9, W10W10) are mapped into PSPL segments 1–4, each segment listing word hypotheses with their probabilities]
Position Specific Posterior Lattices (PSPL) and Confusion Networks (CN)
• PSPL:
─ locating a word in a segment according to the order of the word in a path
• CN:
─ clustering several words in a segment according to similar time spans and word pronunciation
[Figure: the same lattice mapped both into PSPL segments 1–4 and into CN segments 1–4, each segment listing word hypotheses with their probabilities]
OOV or Rare Words Handled by Subword Units
• An OOV word W=w1w2w3w4 can’t be recognized and never appears in the lattice
– wi : subword units : phonemes, syllables…
– a, b, c, d, e : other subword units
• W=w1w2w3w4 hidden at subword level
– can be matched at subword level without being recognized
• Subword-based PSPL (S-PSPL) or CN (S-CN), for Example
[Pan & Lee, Interspeech 07][Pan & Lee, ASRU 07][Mamou, et al, SIGIR 07]
[Hori, Hazen, Glass, ICASSP 07][Turunen & Kurimo, SIGIR 07]
[Gao & Shao, ISCSLP 08]
[Figure: a lattice in which the OOV word W = w1w2w3w4 appears only as subword sequences (w1w2, w3w4) among other subword units a, b, c, d, e]
Frequently Used Subword Units
• Linguistically motivated units
– phonemes, syllables/characters, morphemes, etc.
[Ng, MIT 00][Wallace, et al, Interspeech 07][Chen & Lee, IEEE T. SAP 02]
[Pan & Lee, ASRU 07][Meng & Seide, ASRU 07][Meng & Seide, Interspeech 08]
[Mertens, ICASSP 09][Itoh, Interspeech 07][Itoh, Interspeech 11]
[Pan & Lee, IEEE T. ASL 10]
• Data-driven units
– particles, word fragments, phone multigrams, morphs, etc.
[Turunen & Kurimo, SIGIR 07] [Turunen, Interspeech 08]
[Parlak & Saraclar, ICASSP 08][Logan, et al, IEEE T. Multimedia 05]
[Gouvea, Interspeech 10][Gouvea, Interspeech 11][Lee & Lee, ASRU 09]
Directly Matching Signals without Recognition/Lattices for
Spoken Queries (1/3)
• Query in voice form available
– avoiding recognition errors
– unsupervised
• Frame-based matching (DTW)
[Hazen, ASRU 09]
[Zhang & Glass, ASRU 09]
[Chan & Lee, Interspeech 10]
[Zhang & Glass, ICASSP 11]
[Gupta, Interspeech 11]
[Zhang & Glass, Interspeech 11]
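A minimal sketch of the frame-based idea behind these DTW approaches; it assumes the query and a candidate segment are already given as feature matrices (e.g. MFCCs), and is not the formulation of any specific cited paper (real query-by-example systems typically use subsequence or segmental variants of DTW).

```python
import numpy as np

def dtw_distance(query, segment):
    """Length-normalized DTW alignment cost between two feature
    sequences of shape (frames, dims), with Euclidean frame distance
    and the standard three-way recursion."""
    nq, ns = len(query), len(segment)
    cost = np.full((nq + 1, ns + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nq + 1):
        for j in range(1, ns + 1):
            d = np.linalg.norm(query[i - 1] - segment[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[nq, ns] / (nq + ns)

# Toy usage: rank two hypothetical segments by distance to a spoken query.
rng = np.random.default_rng(0)
query = rng.normal(size=(20, 13))                      # 20 frames of 13-dim features
segments = [rng.normal(size=(25, 13)),                 # unrelated segment
            query + 0.01 * rng.normal(size=(20, 13))]  # near-copy of the query
print(sorted(range(2), key=lambda k: dtw_distance(query, segments[k])))  # [1, 0]
```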
Directly Matching Signals without Recognition/Lattices for
Spoken Queries (2/3)
• Query in voice form available
– avoiding recognition errors
– unsupervised
• Segment-based matching
[Chan & Lee, Interspeech 10]
[Chan & Lee, ICASSP 11]
Directly Matching Signals without Recognition/Lattices for
Spoken Queries (3/3)
• Query in voice form available
– avoiding recognition errors
– unsupervised
• Model-based matching
[Zhang & Glass, ASRU 09]
[Huijbregts, ICASSP 11]
[Chan & Lee, Interspeech 11]
Recent Research Examples
Spoken Content Retrieval
[Figure: the spoken archive is decoded by a recognition engine with acoustic models into lattices; a search engine with a retrieval model serves the user's query Q and produces the retrieval output]
• Search engine
– indexing the lattices, search over indices
• Retrieval model
– confusion matrices, matching algorithms, weighting, learning, etc.
Recent Research Examples (1)
- Integration and Weighting
Integrating Different Clues from Recognition
• Integrating the outputs from different recognition systems
[Natori, Interspeech 10]
• Integrating results based on different subword units
[S.-w. Lee, ICASSP 05][Pan & Lee, Interspeech 07][Meng, Interspeech 10]
[Itoh, Interspeech 11]
• Considering phone duration information
[Wollmer, ICASSP 09][Ma, Interspeech 11]
Training Retrieval Model Parameters (1/7)
• Some training data needed
– a set of queries and associated relevant/irrelevant segments
– collected from relevance feedback, e.g. click-through data [Joachims, SIGKDD 02] or long-term context user relevance feedback [Shen, SIGIR 05]
[Figure: for each training query Q1, Q2, …, Qn, a list of time-stamped utterances labeled T (relevant) or F (irrelevant)]
Training Retrieval Model Parameters (2/7)
[Figure: the same recognition/retrieval architecture, with the labeled queries Q1 … Qn used to train the retrieval model]
Training Retrieval Model Parameters (3/7)
• Some parameters in the retrieval model estimated by optimizing retrieval-related criteria
– Weights of different clues (e.g. different recognition outputs,
different subword units, phone duration information)
[Meng & Lee, ICASSP 09][Chen & Lee, ICASSP 10][Meng, Interspeech 10]
[Wollmer, ICASSP 09]
– Phone confidence
[Li, Interspeech 11]
– Phone confusion matrix
[Wallace, ICASSP 10]
Training Retrieval Model Parameters (4/7)
• Weights for Integrating 1,2,3-grams for different
word/subword units and different indices
[Figure: 1-gram, 2-gram and 3-gram scores from word, character and syllable indices, each derived from both a Position Specific Posterior Lattice and a Confusion Network, integrated with different weights]
– weights trained by maximizing the lower bound of MAP with SVM-MAP
[Meng & Lee, ICASSP 09]
• MAP (mean average precision)
– area under recall-precision curve
– a performance measure frequently used for information retrieval
Training Retrieval Model Parameters (5/7)
[Figure: two recall-precision curves, with MAP = 0.484 and MAP = 0.586]
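As a concrete reference, a minimal sketch of how MAP can be computed from ranked results with binary relevance labels (purely illustrative; evaluation toolkits differ in details such as the normalizing denominator).

```python
def average_precision(ranked_relevance):
    """Average precision for one query: the mean of precision@k taken at
    every rank k where a relevant item appears (binary labels, 1/0)."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_query_relevance):
    """MAP: the average of per-query average precisions."""
    return sum(average_precision(r) for r in per_query_relevance) / len(per_query_relevance)

# Toy example: relevance labels of the ranked results for two queries.
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 1, 1, 0]]))  # ~0.736
```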
• Integrating different n-grams, word/subword units and
indices
[Figure: MAP of each single clue vs. all clues integrated]
Training Retrieval Model Parameters (6/7)
[Meng & Lee, ICASSP 09]
• Context-dependent term weighting
– the same term may have different weights depending on the
context
– e.g. “speech information retrieval” and “information theory”
– such different weights can be trained
Training Retrieval Model Parameters (7/7)
[Chen & Lee, ICASSP 10]
[Figure: MAP of the baseline vs. context-dependent weighting, for character- and syllable-based indices]
Recent Research Examples (2)
- Acoustic Modeling
Retrieval-Oriented Acoustic Modeling (1/4)
[Figure: the recognition/retrieval architecture with labeled training queries Q1 … Qn, used here to re-estimate the acoustic models]
[Lee & Lee, ICASSP 10]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
• Retrieval considered on top of recognition output in the past
– recognition and retrieval as two cascaded stages
– retrieval performance relying on recognition accuracy
• Considering retrieval and recognition processes as a whole
– acoustic models re-estimated by optimizing retrieval performance
– acoustic models better matched to each respective data set
Retrieval-Oriented Acoustic Modeling (2/4)
[Lee & Lee, ICASSP 10]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
[Figure repeated: recognition and retrieval as cascaded stages, with retrieval performance fed back to the acoustic models]
Retrieval-Oriented Acoustic Modeling (3/4)
• Objective Function

$$\hat{\theta} = \arg\max_{\theta} \sum_{Q \in \mathbb{Q}_{train}} \sum_{(X_T^Q,\, X_F^Q)} \left[\, S(Q, X_T^Q \mid \theta) - S(Q, X_F^Q \mid \theta) \,\right]$$

– θ : acoustic model
– Q_train : training query set
– X_T^Q : positive example for query Q
– X_F^Q : negative example for query Q
– S(Q, X | θ) : relevance score of utterance X given query Q and θ

Estimate the acoustic models such that the relevance scores of positive and negative examples are separated.

The above formulation can be improved by:
1. considering the ranking property of the retrieval performance measure
2. considering unlabeled utterances

[Lee & Lee, ICASSP 10]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
MAP SI SA1 SA2
Baseline 48.19 61.89 73.07
Model Re-estimation 50.52 62.81 73.62
Retrieval-Oriented Acoustic Modeling (4/4)
– SI : speaker independent models
– SA1 : adapted by global MLLR
– SA2 : adapted by global MLLR + class-based MLLR + MAP adaptation
• 40 training queries each with relevance information of top
5 utterances
• Improvements achievable but relatively limited; more improvement for poorer models [Lee & Lee, ICASSP 10]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
• Limited Improvements for Retrieval-Oriented Acoustic
Modeling
– different queries with quite different characteristics
• Re-estimate the acoustic models for each query on-line
– based on the first several utterances the user clicks through
when browsing the retrieval results
– utterances not yet browsed can be re-ranked
– short-term context user relevance feedback
– models updated and lattices rescored quickly due to very
limited training data
Query-Specific Retrieval-Oriented Acoustic Modeling (1/4)
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Query-Specific Retrieval-Oriented Acoustic Modeling (2/4)
[Figure: in the retrieval output, the user clicks some utterances, indicating them as relevant (T) or irrelevant (F)]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Query-Specific Retrieval-Oriented Acoustic Modeling (2/4) (cont.)
[Figure: the clicked utterances are used to re-estimate a query-specific acoustic model θ̂_Q; the lattices are rescored and the remaining retrieval results re-ranked]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Query-Specific Retrieval-Oriented Acoustic Modeling (3/4)
• Objective Function

$$\hat{\theta}_Q = \arg\max_{\theta} \sum_{(X_T^Q,\, X_F^Q)} \left[\, S(Q, X_T^Q \mid \theta) - S(Q, X_F^Q \mid \theta) \,\right]$$

– θ̂_Q : specific acoustic model for query Q
– X_T^Q : positive example of query Q
– X_F^Q : negative example of query Q
– S(Q, X) : relevance score function of utterance X

Estimate the acoustic models such that the relevance scores of positive and negative examples are separated.

The above formulation can be improved by:
1. considering the ranking property of the retrieval performance measure
2. considering unlabelled utterances

[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
• 5 utterances were clicked by the user (Speaker Independent Model)
• Improvements achievable, but relatively limited because the clicked utterances are frozen in the ranking yet dominate the MAP scores
Query-Specific Retrieval-Oriented Acoustic Modeling (4/4)
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Recent Research Examples (3)
- Acoustic Features and Pseudo Relevance Feedback
Similarity in Acoustic Features (1/3)
• When an utterance is known to be relevant/irrelevant, other utterances similar to it are more likely to be relevant/irrelevant
– same scenario: the first several items clicked by the user, the rest re-ranked
[Figure: the recognition/retrieval architecture, with the user's clicked relevance labels on the retrieval output]
[Chen & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Similarity in Acoustic Features (1/3) (cont.)
[Figure: the utterances not yet browsed are compared with the clicked ones based on acoustic similarity and re-ranked]
[Chen & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
• Acoustic feature similarities between hypothesized regions
• Hypothesized region : feature vector sequence corresponding
to query Q in the lattice with the highest score
[Figure: for utterance xi, the feature vector sequence spanning the highest-scoring occurrence of query Q in its lattice is the hypothesized region of xi]
Similarity in Acoustic Features (2/3)
[Chen & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Similarity in Acoustic Features (2/3) (cont.)
• Acoustic feature similarities between hypothesized regions
[Figure: hypothesized regions of two utterances xi and xj, each the feature vector sequence of the highest-scoring occurrence of Q in the respective lattice]
[Chen & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Similarity in Acoustic Features (2/3) (cont.)
• Acoustic feature similarities between hypothesized regions
• SIM(xi,xj) is based on the DTW distance between the two hypothesized regions
[Figure: similarity estimation between the two hypothesized regions yields the acoustic similarity SIM(xi,xj)]
[Chen & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
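A minimal sketch of this re-ranking step, reusing the dtw_distance function sketched earlier for frame-based matching: utterances not yet browsed have their first-pass scores pushed toward clicked positive examples and away from clicked negative ones. The weight alpha and the max-similarity pooling over clicked examples are illustrative assumptions, not the cited formulation.

```python
def rerank_by_similarity(results, positives, negatives, alpha=0.5):
    """Re-rank retrieval results by acoustic similarity to user feedback.

    results: list of (utt_id, first_pass_score, hypothesized_region).
    positives/negatives: hypothesized regions of clicked utterances.
    Similarity is the negative DTW distance (closer = more similar).
    """
    def sim(region, refs):
        return max(-dtw_distance(region, r) for r in refs) if refs else 0.0

    rescored = [(uid, score + alpha * (sim(reg, positives) - sim(reg, negatives)))
                for uid, score, reg in results]
    return sorted(rescored, key=lambda item: -item[1])
```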
Similarity in Acoustic Features (3/3)

MAP                  SI     SA1    SA2
Baseline             48.19  61.89  73.07
Model Re-estimation  50.52  62.81  73.62
Acoustic Similarity  51.89  64.71  74.03

– SI: speaker independent models
– SA1: adapted by global MLLR
– SA2: adapted by global MLLR + class-based MLLR + MAP adaptation
• Slightly more improvement than with the re-estimated models
[Chen & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
Pseudo Relevance Feedback (PRF) (1/3)
• Not relying on users to give the feedback
• Derive relevance information automatically
– “assume” the top N utterances on the first-pass retrieval
results are relevant (pseudo relevant)
– scores of utterances similar to the pseudo-relevant
utterances increased
– Pseudo Relevance Feedback (PRF)
[Chen & Lee, Interspeech 10]
Pseudo Relevance Feedback (PRF) (2/3)
[Figure: for query Q, the search engine produces first-pass retrieval results; the top N utterances are "assumed" relevant, and all first-pass retrieved utterances are compared with these pseudo-relevant utterances based on acoustic feature similarity]
[Chen & Lee, Interspeech 10]
Pseudo Relevance Feedback (PRF) (2/3) (cont.)
[Figure: after the comparison, the first-pass results are re-ranked to give the final results]
[Chen & Lee, Interspeech 10]
MAP SI SA SD
Baseline 45.57 71.20 80.49
PRF 52.10 75.60 82.72
– SI : speaker independent models
– SA : adapted by global MLLR + class-based MLLR + MAP adaptation
– SD : speaker dependent models
Pseudo Relevance Feedback (PRF) (3/3)
• Improvements achievable, and seemingly more significant here
[Chen & Lee, Interspeech 10]
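PRF reuses the same similarity-based re-ranking, but with labels taken from the first pass instead of the user; a minimal sketch under the same assumptions as the earlier snippets (names illustrative):

```python
def pseudo_relevance_feedback(first_pass, top_n=10):
    """first_pass: ranked list of (utt_id, score, hypothesized_region).
    'Assume' the top N results are relevant, then re-rank everything
    against them with the similarity-based function sketched above."""
    pseudo_positives = [region for _, _, region in first_pass[:top_n]]
    return rerank_by_similarity(first_pass, pseudo_positives, negatives=[])
```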
Recent Research Examples (4)
- Improved Pseudo Relevance Feedback
Improved PRF – Graph-based Approach (1/5)
• Graph-based approach
– only the top N utterances are taken as references in PRF
– not necessarily reliable
– considering the acoustic similarity structure of all utterances
in the first-pass retrieval results globally using a graph
[Chen & Lee, ICASSP 11]
• Construct a graph for all utterances in the first-pass
retrieval results
– nodes : utterances
– edge weights: acoustic similarities between utterances
[Figure: the first-pass retrieval results x1 … x5 become nodes of a graph whose edge weights are the acoustic similarities between utterances]
Improved PRF – Graph-based Approach (2/5)
[Chen & Lee, ICASSP 11]
• Utterances strongly connected to (similar to) utterances
with high relevance scores should have relevance scores
increased
[Figure: on the graph over the first-pass results, nodes strongly connected to high-scoring nodes]
Improved PRF – Graph-based Approach (3/5)
[Chen & Lee, ICASSP 11]
[Figure: on the graph over the first-pass results, nodes strongly connected to low-scoring nodes]
• Utterances strongly connected to (similar to) utterances
with low relevance scores should have relevance scores
reduced
Improved PRF – Graph-based Approach (3/5)
[Chen & Lee, ICASSP 11]
• Relevance scores propagate on the graph
– relevance scores smoothed among strongly connected
nodes
[Figure: relevance scores propagate over the graph, smoothing the scores of strongly connected nodes, and the first-pass results are re-ranked]
Improved PRF – Graph-based Approach (4/5)
[Chen & Lee, ICASSP 11]
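A minimal sketch of the score-propagation idea, in the spirit of random-walk re-ranking (not necessarily the exact formulation in [Chen & Lee, ICASSP 11]): each utterance's score is repeatedly smoothed toward the similarity-weighted average of its neighbors' scores. The mixing weight alpha and iteration count are illustrative.

```python
import numpy as np

def propagate_scores(first_pass_scores, similarity, alpha=0.8, iters=50):
    """Smooth relevance scores over an acoustic-similarity graph.

    first_pass_scores: (n,) scores of the first-pass results.
    similarity: (n, n) nonnegative pairwise acoustic similarities.
    alpha balances the graph against the original scores.
    """
    W = similarity / similarity.sum(axis=1, keepdims=True)  # row-stochastic
    scores = first_pass_scores.astype(float)
    for _ in range(iters):
        # each node moves toward the similarity-weighted average of its neighbors
        scores = (1 - alpha) * first_pass_scores + alpha * W @ scores
    return scores
```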
MAP SI SA SD
Baseline 45.57 71.20 80.49
PRF 52.10 75.60 82.72
Graph-based 53.42 79.08 82.96
• Graph-based > PRF > Baseline
– the graph-based approach, which considers the similarity structure globally, outperforms PRF, which references only the top N utterances
Improved PRF – Graph-based Approach (5/5)
[Chen & Lee, ICASSP 11]
Improved PRF - Machine Learning (1/4)
• Machine learning shown very useful in spoken term
detection with a training set
[Wang, Interspeech 09][Wang, ICASSP 10][Tejedor, Interspeech 10]
• An example : use of Support Vector Machine (SVM) in the
scenario of pseudo relevance feedback
[Tu & Lee, ASRU 11]
Improved PRF - Machine Learning (2/4)
• Train an SVM for each query
[Figure: from the first-pass retrieval results for query Q, the top N utterances are "assumed" relevant (positive examples) and the bottom N "assumed" irrelevant (negative examples); features are extracted from both, an SVM is trained, and all first-pass results are re-ranked by the SVM to give the final results]
[Tu & Lee, ASRU 11]
• Representing each utterance by its hypothesized region,
segmented by states, with feature vectors in each state
averaged and concatenated
[Figure: the hypothesized region of an utterance is segmented by HMM state boundaries; the feature vectors within each state are averaged, and the averaged vectors are concatenated into one fixed-length feature vector]
Improved PRF - Machine Learning (3/4)
[Tu & Lee, ASRU 11]
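A minimal sketch of this pseudo-relevance SVM step using scikit-learn (an assumption; the cited work's own implementation may differ). The feature matrix stands for the state-averaged concatenations described above, and the kernel choice is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def svm_prf_rerank(features, top_n=10, bottom_n=10):
    """features: (n, d) fixed-length vectors of the ranked first-pass
    results. Train on pseudo-labels, then re-rank by SVM margin."""
    X = np.vstack([features[:top_n], features[-bottom_n:]])
    y = np.array([1] * top_n + [0] * bottom_n)   # pseudo relevance labels
    clf = SVC(kernel="rbf").fit(X, y)
    margins = clf.decision_function(features)    # signed distance to boundary
    return np.argsort(-margins)                  # new ranking (indices)
```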
MAP SI SA SD
Baseline 45.57 71.20 80.49
PRF 52.10 75.60 82.72
Graph-based 53.42 79.08 82.96
SVM 59.31 81.63 84.66
Improved PRF - Machine Learning (4/4)
• SVM > Graph-based > PRF > Baseline
– SVM offered significant improvements over PRF
[Tu & Lee, ASRU 11]
Context Consistency (1/4)
• All above discussions primarily considering acoustic
features/models
– linguistic information?
• Context consistency
– the same term usually has similar contexts, while quite different contexts usually imply different terms
[Schneider & Mertens, Interspeech 10][Lee & Lee, ICASSP 11][Tu & Lee, ASRU 11]
• Extract context information from lattices
– used in SVM for PRF scenario
[Lee & Lee, ICASSP 11][Tu & Lee, ASRU 11]
Context Consistency (2/4)
• Train a context model for each query by SVM
[Figure: the same pseudo relevance feedback pipeline, with the top N "assumed" relevant as positive examples and the bottom N "assumed" irrelevant as negative examples, but with context features extracted from the lattices used to train the SVM that re-ranks the first-pass results]
[Lee & Lee, ICASSP 11]
[Tu & Lee, ASRU 11]
Context Consistency (3/4)
• Feature Extraction
[Figure: from the lattice around the hypothesized occurrence of Q, word posteriors are accumulated for the immediate left context, the immediate right context, and the whole segment; each gives a V-dimensional vector (V : lexicon size), and the three are concatenated into a 3V-dimensional feature vector]
[Lee & Lee, ICASSP 11]
[Tu & Lee, ASRU 11]
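A minimal sketch of building such a context vector from accumulated word posteriors; the dictionary-based lattice representation and all names are illustrative assumptions.

```python
import numpy as np

def context_feature(left_posteriors, right_posteriors, segment_posteriors, vocab):
    """Each *_posteriors maps word -> accumulated posterior in one zone
    (immediate left context, immediate right context, whole segment).
    Returns the concatenated 3V-dimensional context feature."""
    def to_vec(posteriors):
        v = np.zeros(len(vocab))
        for w, p in posteriors.items():
            v[vocab[w]] = p
        return v
    return np.concatenate([to_vec(left_posteriors),
                           to_vec(right_posteriors),
                           to_vec(segment_posteriors)])

# Toy usage with a 5-word lexicon.
vocab = {w: i for i, w in enumerate(["A", "B", "C", "D", "Q"])}
feat = context_feature({"B": 0.6}, {"C": 0.5}, {"B": 0.3, "C": 0.5, "Q": 0.4}, vocab)
print(feat.shape)  # (15,)
```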
MAP SI SA SD
Baseline 45.57 71.20 80.49
Context 54.93 80.72 84.81
Context Consistency (4/4)
• Context consistency helpful
[Lee & Lee, ICASSP 11]
[Tu & Lee, ASRU 11]
Recent Research Examples (5)
- Concept Matching
Concept Matching for Spoken Content Retrieval (1/4)
• Concept matching rather than literal matching
• Returning utterances/documents semantically related to the query (e.g. Obama)
– not necessarily containing the query (e.g. utterances mentioning only "US" and "White House")
• Approach Examples
– Clustering the document collection
[Hu, Interspeech 10]
– Using web data for document or query expansion
[Masumura, Interspeech 11][Akiba, Interspeech 11]
– Using latent topic models
[Chang & Lee, SLT 08][Molgaard, ICASSP 07][Chen, ICASSP 09][Chen, Interspeech 11]
Concept Matching for Spoken Content Retrieval (2/4)
• An example: Probabilistic Latent Semantic Analysis
(PLSA)
• Creating a set of latent topics between a set of terms and
a set of documents
– modeling the relationships by probabilistic models trained with the EM algorithm
• Other well-known approaches: Latent Semantic Analysis
(LSA), Non-negative Matrix Factorization (NMF), Latent
Dirichlet Allocation (LDA) … …
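The PLSA model referred to above writes each document's word distribution as a mixture over the latent topics (a standard statement of the model, using $T_k$ for the topics, consistent with the $P(T_k \mid t_i)$ plots later in this talk):

$$P(w \mid d) = \sum_{k=1}^{K} P(w \mid T_k)\, P(T_k \mid d)$$

Both factors are trained with the EM algorithm; query/document relevance can then be evaluated through the shared topic distributions rather than by literal term overlap.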
Concept Matching for Spoken Content Retrieval (3/4)
[Figure: the recognition output (lattices) of the spoken archive and the query are both mapped through a Probabilistic Latent Semantic Analysis (PLSA) model; semantic relevance evaluation between the two produces the retrieval results]
[Chang & Lee, SLT 08]
Concept Matching for Spoken Content Retrieval (4/4)
[Figure: retrieval performance of literal matching (baseline) vs. concept matching]
[Chang & Lee, SLT 08]
Recent Research Examples (6)
- User-content Interaction
User-Content Interaction for Spoken Content Retrieval (1/2)
• Problems
– Unlike text content, spoken content not easily summarized on screen, thus
retrieved results difficult to scan and select
– User-content interaction always important even for text content
• Possible Approaches
– Automatic summary/title generation and key term extraction for spoken content
– Semantic structuring for spoken content
– Multi-modal dialogue with improved interaction
[Figure: the user interacts through a user interface and multi-modal dialogue with a retrieval engine over the spoken archives; retrieved results are enriched with key terms / titles / summaries and semantic structuring]
[Pan & Lee, ASRU 05]
[L.-s. Lee, IEEE SPM 05]
[L.-s. Lee, Interspeech 06]
[Kong & Lee, ICASSP 09]
Key Term Extraction from Spoken Content (1/3)
• Key Terms : key phrases and keywords
• Key Phrase Boundary Detection
– left/right boundary of a key phrase detected by context statistics
• An Example
– "hidden" almost always followed by the same word
– "hidden Markov" almost always followed by the same word
– "hidden Markov model" is followed by many different words, so a right boundary is detected after "model"
[Figure: "hidden Markov model" followed by many different words (is, can, represent, of, in, …), marking the boundary]
[Chen & Lee, SLT 10]
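A minimal sketch of this boundary test using successor variety, one simple form of such context statistics (the cited work may use a different statistic, e.g. branching entropy):

```python
from collections import Counter

def successor_variety(corpus, prefix):
    """Number of distinct words that follow `prefix` in a corpus.

    corpus: list of sentences, each a list of word tokens.
    prefix: list of word tokens, e.g. ["hidden", "Markov"].
    """
    followers = Counter()
    n = len(prefix)
    for sentence in corpus:
        for i in range(len(sentence) - n):
            if sentence[i:i + n] == prefix:
                followers[sentence[i + n]] += 1
    return len(followers)

# A right boundary is hypothesized where the count of distinct followers
# jumps: small for ["hidden"] and ["hidden", "Markov"], but large for
# ["hidden", "Markov", "model"].
```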
Key Term Extraction from Spoken Content (2/3)
• Prosodic Features
– key terms probably produced with longer duration, wider
pitch range and higher energy
• Semantic Features (e.g. PLSA)
– key terms usually focused on smaller number of topics
• Lexical Features
– TF/IDF, POS tag, etc.
[Figure: topic distributions P(Tk|ti) over topics k, peaked over a few topics for a key term, flat for a non-key term]
[Hsieh & Lee, ICASSP 06]
[Chen & Lee, SLT 10]
[Kong & Lee, IEEE T. ASL 11]
Key Term Extraction from Spoken Content (3/3)
• All three sets of features useful [Chen & Lee, SLT 10]
• Trained with Neural Net

F-measure:  Pr 20.78 | Lx 42.86 | Sm 35.63 | Pr+Lx 48.15 | Pr+Lx+Sm 56.55
(Pr: Prosodic, Lx: Lexical, Sm: Semantic)
Extractive Summarization of Spoken Documents (1/3)
[Figure: a spoken document d of utterances X1 … X6 over time, with some words correctly and some wrongly recognized]
[Furui, et al, ICASSP 05][Furui, et al, IEEE T. SAP 04]
[Hirschberg, et al, Interspeech 05]
[Murray, Renals, et al, ACL 05][Murray, Renals, et al, HLT 06]
[Kawahara, et al, ICASSP 04][Nakagawa, et al, SLT 06]
[Zhu & Penn, Interspeech 06][Fung, et al, ICASSP 08]
[Kong & Lee, ICASSP 06][Kong & Lee, SLT 06]
[Gillick, ICASSP 09][Li, ASRU 09][Lin, ICASSP 10]
[Liu, ICASSP 08][Xie, ASRU 09][Xie, Interspeech 10]
[Kong & Lee, IEEE T. ASL 11]
Extractive Summarization of Spoken Documents (1/3) (cont.)
• Selecting the most representative utterances in the original document while avoiding redundancy
[Figure: utterances X1 and X3 selected from document d as its summary]
Extractive Summarization of Spoken Documents (2/3)
• An example: the utterances topically similar to the
representative utterances should also be considered as
representative
• Graph-based analysis used in finding representative
utterances, but corrected by avoiding redundancy
[Figure: a graph over utterances x1 … x5 with similarities between pairs (xi, xj); representative utterances are selected from the graph to construct the summary]
[Chen & Lee, Interspeech 11]
Extractive Summarization of Spoken Documents (3/3)
• Graph-based analysis helpful
[Figure: F-measures for ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-L at 10%, 20% and 30% summarization ratios; the graph-based approach stays above the baseline throughout]
[Chen & Lee, Interspeech 11]
• Titles for retrieved documents/segments helpful in
browsing and selection of retrieved results
• Short, readable, telling what the document/segment is
about
• One example: Scored Viterbi Search
[Witbrock & Mittal, SIGIR 99][Jin & Hauptmann, HLT 01]
[Chen & Lee, Interspeech 03][Wang & Lee, SLT 08]
Title Generation for Spoken Documents
[Figure: a training corpus provides term selection, term ordering and title length models; the spoken document passes through ASR and automatic summarization, and a Viterbi algorithm combines the summary with the three models to output the title]
• Example 1: retrieved results clustered by Latent Topics
and organized in a two-dimensional tree structure (multi-
layered map)
– each cluster labeled by a set of key terms representing a group of
retrieved documents/segments
– each cluster expanded into a map in the next layer
[Li & Lee, Interspeech 05]
[L.-s. Lee, et al, Interspeech 06]
[Kong & Lee, IEEE T. ASL 11]
Semantic Structuring (1/2)
• Example 2: Key-term Graph
– each retrieved spoken document/segment labeled by a set of key
terms
– relationships between key terms represented by a graph
Semantic Structuring (2/2)
[Figure: retrieved spoken documents, each labeled by a set of key terms; the key terms (e.g. Acoustic Modeling, HMM, Viterbi search, Language Modeling, Perplexity) and their relationships form a key-term graph]
[Kong & Lee, ICASSP 09]
• An example: user-system interaction modeled as a Markov
Decision Process (MDP) very similar to spoken dialogues
Multi-modal Dialogue (1/3)
[Figure: the interactive retrieval architecture, with user interface, multi-modal dialogue, retrieval engine, spoken archives, key terms / titles / summaries, and semantic structuring]
• Example goals
– high task success rate (success: user’s information need satisfied)
– small average number of dialogue turns (average number of query
terms entered) for successful tasks
• A reward function defined and maximized with simulated users
[Pan & Lee, ASRU 05][Pan & Lee, Interspeech 06]
[Pan & Lee, SLT 06][Pan & Lee, ASRU 07]
Multi-modal Dialogue (2/3)
• An example application scenario: retrieving broadcast
news
– in each step system returns retrieved results plus a list of key
terms for user to select
– user looks through the list from the top, and selects the first
relevant to his information need
– key terms on the list ranked by MDP
[Figure: Step 1 — the user, looking for news about "the meeting of Obama and Hu", enters the query "U.S. President" and receives retrieved documents plus a ranked key-term list (Diplomatic, Economic, Middle East, China, U.S.); Step 2 — the user selects "Diplomatic" and receives refined results plus a new key-term list (Israel, China, Taiwan, Iraq, …)]
[Pan & Lee, ASRU 05][Pan & Lee, Interspeech 06]
[Pan & Lee, SLT 06][Pan & Lee, ASRU 07]
[Pan & Lee, ASRU 05][Pan & Lee, Interspeech 06]
[Pan & Lee, SLT 06][Pan & Lee, ASRU 07]
Multi-modal Dialogue (3/3)
• Many fewer steps needed for successful retrieval sessions
• Many fewer failed sessions
[Figure: number of retrieval sessions vs. number of interaction steps needed to complete a successful retrieval session, plus the number of failed sessions; the proposed approach outperforms the wpq and lca baselines (standard approaches used in text retrieval)]
User-Content Interaction for Spoken Content Retrieval (2/2)
• Problems
– Unlike text content, spoken content not easily summarized on-screen,
thus difficult to scan and select
– User-content interaction always important even for text content
• Possible Approaches
– Automatic summary/title generation and key term extraction for spoken
content
– Semantic structuring for spoken content
– Multi-modal dialogue with improved interaction
[Figure repeated: user interface, multi-modal dialogue, retrieval engine, spoken archives, key terms / titles / summaries, semantic structuring]
Demo
Link of demo system:
http://speech.ee.ntu.edu.tw/~RA/lecture/
(please browse it with Firefox)
Course Lectures
• Many course lectures available over the Internet
– it takes a very long time to listen to a complete course (e.g. 45 hrs)
– not easy for engineers or industry leaders to learn new knowledge via course lectures
– goal : to help people learn easily
• Lecture browsers available over the Internet
– knowledge in course lectures is structured, one concept following the other
– the retrieved lecture segment is possibly not easy to understand for a learner without enough background knowledge
– given the retrieved segment, there is no information for the learner regarding what should be learned next
[Kong & Lee, ICASSP 09]
• Structuring Course Lectures by Slides and Key Terms
– dividing the course lecture by slides
– deriving the core content of the slides by key terms
– constructing the semantic relationships among slides by a key
term graph
– each slide given its length, timing information within the course, summary, key terms, and related key terms and slides based on the key term graph, to help the learner decide whether to listen to it or not
– retrieved spoken segments include all the above information about the slides they belong to, to help the user browse the retrieved results
Proposed Approach
[Kong & Lee, ICASSP 09]
• A Course of Digital Speech Processing
– code-mixing: lectures given in the host language, Mandarin Chinese, with all terminology produced directly in the guest language, English, inserted into the Mandarin utterances just like Chinese words; all slides in English
• Corpus Transcription
– bi-lingual phone set of 75 units
– bi-lingual lexicon of 12.3k words
– class-based tri-gram language model adapted by slide
information
– character/word accuracy of 64.77%-81.63% for different slides
A Prototype System
[Lee & Lee, ASRU 09]
[Kong & Lee, ICASSP 09]
[Yeh & Lee, Interspeech 11]
Conclusion –
Spoken Language Processing over the Internet
• User interface
– Successful but not very easy since users usually expect technology to
replace human beings
• Content analysis/user-content interaction
– Technology can handle massive quantities of content, while human
beings cannot
• Spoken content retrieval
– Integrates user interface with content analysis/user-content interaction
– Offering very attractive applications for spoken language processing
technologies
[Figure: spoken content retrieval ties together the user interface, content analysis and user-content interaction over the Internet]
More Related Content

PPTX
Automatic Key Term Extraction from Spoken Course Lectures
PDF
Recent advances in LVCSR : A benchmark comparison of performances
PDF
Relation Extraction
PDF
The VoiceMOS Challenge 2022
PDF
Grosof haley-talk-semtech2013-ver6-10-13
PDF
Natural language processing for requirements engineering: ICSE 2021 Technical...
PDF
Lecture: Question Answering
PDF
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Automatic Key Term Extraction from Spoken Course Lectures
Recent advances in LVCSR : A benchmark comparison of performances
Relation Extraction
The VoiceMOS Challenge 2022
Grosof haley-talk-semtech2013-ver6-10-13
Natural language processing for requirements engineering: ICSE 2021 Technical...
Lecture: Question Answering
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...

Similar to Spoken Content Retrieval - Lattices and Beyond (20)

PPTX
Spoken Content Retrieval
PPTX
Mediaeval 2013 Spoken Web Search results slides
PPTX
2015 04-15 research seminar
PPT
Reasoning on the Semantic Web
PPTX
Wreck a nice beach: adventures in speech recognition
PPTX
Doing Something We Never Could with Spoken Language Technologies_109-10-29_In...
PPTX
Accent conversion using Deep neural network
PDF
Word Segmentation and Lexical Normalization for Unsegmented Languages
PPTX
Voice Cloning
PPTX
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
PDF
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
PDF
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
PPTX
Morphological Analyzer and Generator for Tamil Language
PPTX
Intro to Auto Speech Recognition -- How ML Learns Speech-to-Text
PPTX
What is word2vec?
PPTX
Research_Wu.pptx
PDF
K. Seo, ICASSP 2023, MLILAB, KAISTAI
PPTX
Llauferseiler "OU Libraries: Opportunities Supporting Research and Education"
PPT
What might a spoken corpus tell us about language
Spoken Content Retrieval
Mediaeval 2013 Spoken Web Search results slides
2015 04-15 research seminar
Reasoning on the Semantic Web
Wreck a nice beach: adventures in speech recognition
Doing Something We Never Could with Spoken Language Technologies_109-10-29_In...
Accent conversion using Deep neural network
Word Segmentation and Lexical Normalization for Unsegmented Languages
Voice Cloning
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Morphological Analyzer and Generator for Tamil Language
Intro to Auto Speech Recognition -- How ML Learns Speech-to-Text
What is word2vec?
Research_Wu.pptx
K. Seo, ICASSP 2023, MLILAB, KAISTAI
Llauferseiler "OU Libraries: Opportunities Supporting Research and Education"
What might a spoken corpus tell us about language
Ad

More from linshanleearchive (19)

PDF
星雲教育獎頒獎典禮手冊
PDF
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf
PDF
新科學創造新文明 Part 2
PDF
新科學創造新文明 Part 1
PDF
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論
PPTX
From Semantics to Self-supervised Learning for Speech and Beyond (Opening Ke...
PPTX
2022 國際語音學會科學成就獎章得獎致詞
PPTX
琳山老師榮退感言.pptx
PPTX
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」
PPTX
芝麻開門:語音技術的前世今生
PPTX
Towards A Spoken Version of Google
PPTX
From Semantics to Self-supervised Learning for Speech and Beyond
PDF
輕舟已過萬重山
PDF
2016《華語語音辨識研究的先驅者》科學月刊專訪
PDF
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪
PDF
2017《推動產業轉型 大學必修課程先鬆綁》
PDF
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談
PDF
芝麻開門 - 語音技術的前世今生
PDF
105-08-17 輕舟已過萬重山
星雲教育獎頒獎典禮手冊
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf
新科學創造新文明 Part 2
新科學創造新文明 Part 1
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論
From Semantics to Self-supervised Learning for Speech and Beyond (Opening Ke...
2022 國際語音學會科學成就獎章得獎致詞
琳山老師榮退感言.pptx
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」
芝麻開門:語音技術的前世今生
Towards A Spoken Version of Google
From Semantics to Self-supervised Learning for Speech and Beyond
輕舟已過萬重山
2016《華語語音辨識研究的先驅者》科學月刊專訪
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪
2017《推動產業轉型 大學必修課程先鬆綁》
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談
芝麻開門 - 語音技術的前世今生
105-08-17 輕舟已過萬重山
Ad

Recently uploaded (20)

PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Current and future trends in Computer Vision.pptx
PDF
PPT on Performance Review to get promotions
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
737-MAX_SRG.pdf student reference guides
PPTX
Construction Project Organization Group 2.pptx
PPTX
additive manufacturing of ss316l using mig welding
PDF
R24 SURVEYING LAB MANUAL for civil enggi
DOCX
573137875-Attendance-Management-System-original
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
PPTX
Sustainable Sites - Green Building Construction
PDF
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Internet of Things (IOT) - A guide to understanding
Current and future trends in Computer Vision.pptx
PPT on Performance Review to get promotions
Foundation to blockchain - A guide to Blockchain Tech
737-MAX_SRG.pdf student reference guides
Construction Project Organization Group 2.pptx
additive manufacturing of ss316l using mig welding
R24 SURVEYING LAB MANUAL for civil enggi
573137875-Attendance-Management-System-original
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
Fundamentals of safety and accident prevention -final (1).pptx
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
Sustainable Sites - Green Building Construction
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks

Spoken Content Retrieval - Lattices and Beyond

  • 1. Spoken Content Retrieval ― Lattices and Beyond Lin-shan Lee http://speech.ee.ntu.edu.tw/previous_version/lslNew.htm National Taiwan University Taipei, Taiwan, ROC
  • 2. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Outline • Introduction • Fundamentals • Recent Research Examples – Parameter weighting, relevance feedback, graph-based, SVM, concept matching, interactive retrieval, etc. • Demo • Conclusion
  • 4. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • Text content retrieval extremely successful – information desired by the users can be obtained very efficiently – all users like it – producing very successful industry • All roles of texts can be accomplished by voice – spoken content or multimedia content with voice in audio part – voice instructions/queries via handheld devices • Spoken content retrieval user instructions/ queries Internet Server Server Documents/Information Text/Spoken Content Retrieval (1/3)
  • 5. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Text/Spoken Content Retrieval (2/3) Spoken Instructions/Queries Spoken content (multimedia content including audio part) US president? Text Instructions/Queries Text Content Barack Obama….Barack Obama…. • User instructions and/or network content can be in form of voice – text queries/spoken content : spoken document retrieval, spoken term detection – spoken queries/text content : voice search [Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08] – spoken queries/spoken content : query by example
  • 6. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Text/Spoken Content Retrieval (3/3) Spoken Instructions/Queries Spoken content (multimedia content including audio part) US president? Text Instructions/Queries Text Content Barack Obama….Barack Obama…. • If spoken content/queries can be accurately recognized – Reduced to text content retrieval • Correct but never possible
  • 7. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • Many hand-held devices with multimedia functionalities available • Unlimited quantities of multimedia content fast growing over the Internet • User-content interaction necessary for retrieval can be accomplished by spoken and multi-modal dialogues • Network access is primarily text-based today, but almost all roles of texts can be accomplished by voice Multimedia Content Analysis voice information Multimedia and Spoken Content Internet text information Text Content Retrieval Text Content Spoken Content Retrieval Text-to-Speech Synthesis Spoken and multi-modal Dialogue Wireless and Multimedia Technologies are Creating An Environment for Spoken Content Retrieval
  • 9. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • Low recognition accuracies for spontaneous speech including Out-of-Vocabulary (OOV) words under adverse environment – considering lattices with multiple alternatives rather than 1-best output higher probability of including correct words, but also including more noisy words correct words may still be excluded (OOV and others) huge memory and computation requirements [Chelba, Hazen, Saraclar, IEEE SPM 08][Saraclar & Sproat, HLT 04] [Vergyri, et al, Interspeech 07] Lattices for Spoken Content Retrieval W6 W8 W4 W1 W8 W7 W9 W3 W2 W5 W10 Start node End node Time index Wi: word hypotheses
  • 10. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • Confusion Matrices – use of confusion matrices to model recognition errors and expand the query/document, etc. [Ng, ICSLP 98][Pint & Hermansky, SSCS 08][Chaudhari, ASRU 09] [Parada, ICASSP 10][Wallace, ICASSP 10] • Pronunciation Modeling – use of pronunciation models to expand the query, etc. [Mertens, Interspeech 09][Mertens, ASRU 09][Can, ICASSP 09] [Wang, Interspeech 09][Wang, ICASSP 10][Wang, Interspeech 10] • Fuzzy Matching – query/content matching not necessarily exact [Shao, Interspeech 07][Schneide, SSCS 08][Mamou, Interspeech 08] Other Approach Examples in addition to Lattices
  • 11. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • Lattices • An example of lattice indexing approach – position information for words readily available – more compatible to existing text indexing techniques – reduced memory and computation requirements (still huge…) Lattice Indexing Approaches W6 W8 W4 W1 W8 W7 W9 W3 W2 W5 W10 Start node End node Time index W3, p3 W8, p8 W9, p9 W4, p4 W1, p1 W5, p5 W6, p6 W10, p10 W2, p2 W7, p7
  • 12. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • Lattices • An example of lattice indexing approach – position information for words readily available – more compatible to existing text indexing techniques – reduced memory and computation requirements (still huge…) – added possible paths – noisy words discriminated by posterior probabilities or similar scores – n-grams matched with query and accumulated for all possible n Lattice Indexing Approaches W6 W8 W4 W1 W8 W7 W9 W3 W2 W5 W10 Start node End node Time index W3, p3 W8, p8 W9, p9 W4, p4 W1, p1 W5, p5 W6, p6 W10, p10 W2, p2 W7, p7
  • 13. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • Position Specific Posterior Lattices (PSPL) [Chelba & Acero, ACL 05][Chelba & Acero, Computer Speech and Language 07] • Confusion Networks (CN) [Mamou, et al, SIGIR 06][Hori, Hazen, Glass, ICASSP 07][Mamou, et al, SIGIR 07] • Time-based Merging for Indexing (TMI) [Zhou, Chelba, Seide, HLT 06][Seide, et al, ASRU 07] • Time-anchored Lattice Expansion (TALE) [Seide, et al, ASRU 07][Seide, et al, ICASSP 08] • WFST – directly compile the lattice into a weighted finite state transducer [Allauzen, et al, HLT 04][Parlak & Saraclar, ICASSP 08][Can, ICASSP 09] [Parada, ASRU 09] Examples of Lattice Indexing Approaches
  • 14. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • PSPL: ─ Locating a word in a segment according to the order of the word in a path W3: prob W7: prob W2: probW1: prob W5: prob W9: prob PSPL: W6: prob segment 1 W4: prob W8: prob segment 2 segment 4segment 3 W2 W5 W7W8W9 W10 W6W8W9 W9 End node W6 W8 W4 W1 W8 W7 W3 Start node Time index W1W2, W3W4W5,All paths: Lattice: Position Specific Posterior Lattices (PSPL) and Confusion Networks (CN) W10W10, W10: prob
  • 15. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University • PSPL: ─ Locating a word in a segment according to the order of the word in a path • CN: ─ Clustering several words in a segment according to similar time spans and word pronunciation W3: prob W7: prob W2: probW1: prob W5: prob W9: prob PSPL: W6: prob segment 1 W4: prob W8: prob segment 2 segment 4segment 3 segment 1 segment 2 segment 3 W6: prob W9: probW4: probW1: prob CN: W3: prob W7: prob W8: prob segment 4 W2 W5 W7W8W9 W10 W6W8W9 W9 End node W6 W8 W4 W1 W8 W7 W3 Start node Time index W1W2, W3W4W5,All paths: Lattice: Position Specific Posterior Lattices (PSPL) and Confusion Networks (CN) W10W10, W10: prob W2: prob W5: prob W10: prob
  • 16. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University OOV or Rare Words Handled by Subword Units • OOV Word W=w1w2w3w4 can’t be recognized and never appears in lattice – wi : subword units : phonemes, syllables… – a, b, c, d, e : other subword units • W=w1w2w3w4 hidden at subword level – can be matched at subword level without being recognized • Subword-based PSPL (S-PSPL) or CN (S-CN), for Example [Pan & Lee, Interspeech 07][Pan & Lee, ASRU 07][Mamou, et al, SIGIR 07] [Hori, Hazen, Glass, ICASSP 07][Turunen & Kurimo, SIGIR 07] [Gao & Shao, ISCSLP 08] w1w2 w2w3 bcd e w3w4b a Lattice: Time index w1w2 w3w4 w3w4
  • 17. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Frequently Used Subword Units • Linguistically motivated units – phonemes, syllables/characters, morphemes, etc. [Ng, MIT 00][Wallace, et al, Interspeech 07][Chen & Lee, IEEE T. SAP 02] [Pan & Lee, ASRU 07][Meng & Seide, ASRU 07][Meng & Seide, Interspeech 08] [Mertens, ICASSP 09][Itoh, Interspeech 07][Itoh, Interspeech 11] [Pan & Lee, IEEE T. ASL 10] • Data-driven units – particles, word fragments, phone multigrams, morphs, etc. [Turunen & Kurimo, SIGIR 07] [Turunen, Interspeech 08] [Parlak & Saraclar, ICASSP 08][Logan, et al, IEEE T. Multimedia 05] [Gouvea, Interspeech 10][Gouvea, Interspeech 11][Lee & Lee, ASRU 09]
  • 18. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Directly Matching Signals without Recognition/Lattices for Spoken Queries (1/3) • Query in voice form available – avoiding recognition errors – unsupervised • Frame-based matching (DTW) [Hazen, ASRU 09] [Zhang & Glass, ASRU 09] [Chan & Lee, Interspeech 10] [Zhang & Glass, ICASSP 11] [Gupta, Interspeech 11] [Zhang & Glass, Interspeech 11]
  • 19. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Directly Matching Signals without Recognition/Lattices for Spoken Queries (2/3) • Query in voice form available – avoiding recognition errors – unsupervised • Segment-based matching [Chan & Lee, Interspeech 10] [Chan & Lee, ICASSP 11]
  • 20. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Directly Matching Signals without Recognition/Lattices for Spoken Queries (3/3) • Query in voice form available – avoiding recognition errors – unsupervised • Model-based matching [Zhang & Glass, ASRU 09] [Huijbregts, ICASSP 11] [Chan & Lee, Interspeech 11]
  • 22. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Spoken Content Retrieval Spoken Archive Recognition Engine Acoustic Models lattices Search Engine Retrieval Model user Query Q Retrieval Output Recognition Retrieval • Search engine – indexing the lattices, search over indices • Retrieval model – confusion matrices, matching algorithms, weighting, learning, etc.
  • 23. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Recent Research Examples (1) - Integration and Weighting
  • 24. 金金金金 聲聲聲聲 玉玉玉玉 振振振振 National Taiwan University Integrating Different Clues from Recognition • Integrating the outputs from different recognition systems [Natori, Interspeech 10] • Integrating results based on different subword units [S.-w. Lee, ICASSP 05][Pan & Lee, Interspeech 07][Meng, Interspeech 10] [Itoh, Interspeech 11] • Considering phone duration information [Wollmer, ICASSP 09][Ma, Interspeech 11]
Training Retrieval Model Parameters (1/7)
• Some training data needed
  – a set of queries, each with relevant/irrelevant (T/F) labels for retrieved segments
  – collected from relevance feedback, e.g. click-through data (long-term context user relevance feedback)
[Joachims, SIGKDD 02][Shen, SIGIR 05]
(figure: for queries Q1, Q2, …, Qn, lists of time-stamped segments each labeled T or F)

Training Retrieval Model Parameters (2/7)
(figure: the labeled query-segment pairs above are fed into the retrieval model of the recognition/retrieval architecture for training)
Training Retrieval Model Parameters (3/7)
• Some parameters of the retrieval model are estimated by optimizing retrieval-related criteria
  – weights of different clues (e.g. different recognition outputs, different subword units, phone duration information)
[Meng & Lee, ICASSP 09][Chen & Lee, ICASSP 10][Meng, Interspeech 10]
[Wollmer, ICASSP 09]
  – phone confidence [Li, Interspeech 11]
  – phone confusion matrix [Wallace, ICASSP 10]

Training Retrieval Model Parameters (4/7)
• Weights for integrating 1-, 2- and 3-grams over different word/subword units (word, character, syllable) and different index structures (Position Specific Posterior Lattice and Confusion Network)
  – all these clues integrated with different weights
  – the weights trained by maximizing a lower bound of MAP with SVM-MAP
[Meng & Lee, ICASSP 09]
Training Retrieval Model Parameters (5/7)
• MAP (mean average precision)
  – area under the recall-precision curve
  – a performance measure frequently used for information retrieval
(figure: two recall-precision curves, with MAP = 0.484 and MAP = 0.586)
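Since the results below are all reported in MAP, here is the standard computation of average precision per query and its mean over queries:

```python
# Average precision per query and MAP over queries, shown here to make
# the evaluation metric used in the following tables concrete.

def average_precision(ranked_ids, relevant_ids):
    """ranked_ids: retrieval output in rank order; relevant_ids: ground-truth set."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant hit
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(mean_average_precision([
    (["u1", "u2", "u3", "u4"], {"u1", "u3"}),   # AP = (1/1 + 2/3) / 2
    (["u5", "u6", "u7"], {"u6"}),               # AP = 1/2
]))
```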
Training Retrieval Model Parameters (6/7)
• Integrating different n-grams, word/subword units and indices
(figure: MAP for each single clue compared with the integrated result)
[Meng & Lee, ICASSP 09]

Training Retrieval Model Parameters (7/7)
• Context-dependent term weighting
  – the same term may deserve different weights in different contexts
  – e.g. "information" in "speech information retrieval" vs. in "information theory"
  – such context-dependent weights can be trained
(figure: MAP of the baseline vs. context-dependent weighting, for character- and syllable-based indexing)
[Chen & Lee, ICASSP 10]
Recent Research Examples (2) - Acoustic Modeling

Retrieval-Oriented Acoustic Modeling (1/4)
(figure: the recognition/retrieval architecture, with the labeled query-segment pairs used to re-estimate the acoustic models)
[Lee & Lee, ICASSP 10][Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]

Retrieval-Oriented Acoustic Modeling (2/4)
• In the past, retrieval was considered on top of the recognition output
  – recognition and retrieval treated as two cascaded stages
  – retrieval performance relying on recognition accuracy
• Considering the retrieval and recognition processes as a whole
  – acoustic models re-estimated by optimizing retrieval performance
  – acoustic models better matched to each respective data set
[Lee & Lee, ICASSP 10][Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]
Retrieval-Oriented Acoustic Modeling (3/4)
• Objective: estimate the acoustic models such that the relevance scores of positive and negative examples are separated

  θ̂ = argmax_θ Σ_{Q∈Q_train} Σ_{(X_T^Q, X_F^Q)} [ S(Q, X_T^Q | θ) − S(Q, X_F^Q | θ) ]

  – θ : acoustic model
  – Q_train : training query set
  – X_T^Q : positive example for query Q
  – X_F^Q : negative example for query Q
  – S(Q, X | θ) : relevance score of utterance X given query Q and θ
• The above formulation can be improved by:
  1. considering the ranking property of the retrieval performance measure
  2. considering unlabeled utterances
[Lee & Lee, ICASSP 10][Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]
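A toy numerical illustration of this objective follows; in the cited work θ parameterizes HMM acoustic models and S(Q, X | θ) comes from lattices, whereas here both are simple stand-ins chosen only to show the score-separation idea.

```python
# Toy illustration of the retrieval-oriented objective: pick the model
# parameter theta that best separates the relevance scores of positive
# and negative examples. Both the scorer and theta are stand-ins.

def objective(theta, training_set, score):
    """training_set: {Q: (positives, negatives)}; score(Q, X, theta) -> float."""
    total = 0.0
    for Q, (positives, negatives) in training_set.items():
        for x_t in positives:
            for x_f in negatives:
                total += score(Q, x_t, theta) - score(Q, x_f, theta)
    return total

# Stand-in scorer: theta scales how much a cached lattice posterior counts.
posteriors = {("q1", "utt_pos"): 0.9, ("q1", "utt_neg"): 0.4}
score = lambda Q, X, theta: theta * posteriors[(Q, X)]

train = {"q1": (["utt_pos"], ["utt_neg"])}
best_theta = max([0.5, 1.0, 2.0], key=lambda t: objective(t, train, score))
print(best_theta)  # 2.0: the larger scale widens the positive/negative margin
```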
Retrieval-Oriented Acoustic Modeling (4/4)

  MAP                   SI      SA1     SA2
  Baseline              48.19   61.89   73.07
  Model Re-estimation   50.52   62.81   73.62

  – SI : speaker-independent models
  – SA1 : adapted by global MLLR
  – SA2 : adapted by global MLLR + class-based MLLR + MAP adaptation
• 40 training queries, each with relevance information for its top 5 utterances
• Improvements achievable but relatively limited; larger improvements for poorer models
[Lee & Lee, ICASSP 10][Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]

Query-Specific Retrieval-Oriented Acoustic Modeling (1/4)
• Improvements from retrieval-oriented acoustic modeling are limited
  – different queries have quite different characteristics
• Re-estimate the acoustic models for each query on-line
  – based on the first several utterances the user clicks through when browsing the retrieval results (short-term context user relevance feedback)
  – utterances not yet browsed can then be re-ranked
  – models updated and lattices rescored quickly, because the training data are very limited
[Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]

Query-Specific Retrieval-Oriented Acoustic Modeling (2/4)
(figure, step 1: the user clicks some of the retrieved utterances, marking them as relevant or irrelevant)
(figure, step 2: the acoustic models are re-estimated as θ̂_Q, the lattices rescored, and the retrieval results re-ranked)
[Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]
Query-Specific Retrieval-Oriented Acoustic Modeling (3/4)
• Objective: estimate a query-specific acoustic model such that the relevance scores of positive and negative examples are separated

  θ̂_Q = argmax_θ Σ_{(X_T^Q, X_F^Q)} [ S(Q, X_T^Q | θ) − S(Q, X_F^Q | θ) ]

  – θ̂_Q : acoustic model specific to query Q
  – X_T^Q : positive example of query Q
  – X_F^Q : negative example of query Q
  – S(Q, X | θ) : relevance score of utterance X
• The above formulation can be improved by:
  1. considering the ranking property of the retrieval performance measure
  2. considering unlabeled utterances
[Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]

Query-Specific Retrieval-Oriented Acoustic Modeling (4/4)
• 5 utterances clicked by the user (speaker-independent models)
• Improvements achievable but relatively limited, because the clicked utterances stay frozen in place yet dominate the MAP scores
[Lee & Lee, Interspeech 10][Lee & Lee, SLT 10]
Recent Research Examples (3) - Acoustic Features and Pseudo Relevance Feedback

Similarity in Acoustic Features (1/3)
• When an utterance is known to be relevant/irrelevant, other utterances similar to it are more likely to be relevant/irrelevant
  – same scenario as before: the first several items are clicked by the user; those not yet browsed are compared with the clicked ones based on acoustic similarity and re-ranked
[Chen & Lee, Interspeech 10][Lee & Lee, SLT 10]
Similarity in Acoustic Features (2/3)
• Acoustic feature similarities computed between hypothesized regions
  – hypothesized region of utterance xi : the feature vector sequence corresponding to query Q in the lattice path with the highest score
• SIM(xi, xj) based on the DTW distance between the two hypothesized regions (acoustic similarity)
(figure: two utterances xi and xj, each with a lattice and its hypothesized region for Q marked in the feature vector sequence; the two regions are compared by DTW)
[Chen & Lee, Interspeech 10][Lee & Lee, SLT 10]
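A minimal sketch of the resulting re-ranking step, assuming SIM has already been computed (e.g. a DTW distance turned into a similarity); the weighting scheme and the similarity values are illustrative, not the cited paper's formulas.

```python
# Sketch of re-ranking unbrowsed utterances by acoustic similarity to the
# clicked ones: similarity to relevant clicks raises the score, similarity
# to irrelevant clicks lowers it. Weights and similarities are illustrative.

def rerank(results, clicked, sim, alpha=0.5):
    """results: {utt_id: first_pass_score}; clicked: {utt_id: True/False};
    sim(a, b) -> similarity between the two hypothesized regions."""
    reranked = {}
    for utt, score in results.items():
        if utt in clicked:
            continue  # already browsed by the user
        boost = sum((1 if rel else -1) * sim(utt, c) for c, rel in clicked.items())
        reranked[utt] = score + alpha * boost
    return sorted(reranked.items(), key=lambda kv: -kv[1])

sim_table = {("u3", "u1"): 0.9, ("u3", "u2"): 0.1,
             ("u4", "u1"): 0.2, ("u4", "u2"): 0.8}
sim = lambda a, b: sim_table[(a, b)]
# u1 clicked relevant, u2 clicked irrelevant; u3/u4 not yet browsed.
print(rerank({"u3": 0.6, "u4": 0.5}, {"u1": True, "u2": False}, sim))
```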
Similarity in Acoustic Features (3/3)

  MAP                   SI      SA1     SA2
  Baseline              48.19   61.89   73.07
  Model Re-estimation   50.52   62.81   73.62
  Acoustic Similarity   51.89   64.71   74.03

  – SI : speaker-independent models
  – SA1 : adapted by global MLLR
  – SA2 : adapted by global MLLR + class-based MLLR + MAP adaptation
• Slightly more improvement than the re-estimated models
[Chen & Lee, Interspeech 10][Lee & Lee, SLT 10]

Pseudo Relevance Feedback (PRF) (1/3)
• Does not rely on users to give feedback
• Relevance information derived automatically
  – "assume" the top N utterances in the first-pass retrieval results are relevant (pseudo-relevant)
  – scores of utterances similar to the pseudo-relevant utterances are increased
[Chen & Lee, Interspeech 10]
Pseudo Relevance Feedback (PRF) (2/3)
• The top N utterances of the first-pass retrieval results are "assumed" relevant
• All first-pass retrieved utterances are compared with the pseudo-relevant utterances based on acoustic feature similarity
• The final results are re-ranked accordingly
[Chen & Lee, Interspeech 10]
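A compact sketch of this PRF re-ranking, with N, the interpolation weight and the similarity values all illustrative choices:

```python
# Pseudo relevance feedback sketch: assume the top-N first-pass results are
# relevant, then boost every candidate by its average acoustic similarity
# to that pseudo-relevant set.

def prf_rerank(first_pass, sim, n=2, beta=0.5):
    """first_pass: [(utt_id, score)] sorted by first-pass score."""
    pseudo_relevant = [u for u, _ in first_pass[:n]]
    rescored = [(utt, score + beta * sum(sim(utt, p) for p in pseudo_relevant) / n)
                for utt, score in first_pass]
    return sorted(rescored, key=lambda kv: -kv[1])

sim_table = {("u1", "u2"): 0.7, ("u3", "u1"): 0.8, ("u3", "u2"): 0.9,
             ("u4", "u1"): 0.1, ("u4", "u2"): 0.2}
def sim(a, b):
    if a == b:
        return 1.0
    return sim_table.get((a, b), sim_table.get((b, a), 0.0))

first_pass = [("u1", 0.9), ("u2", 0.8), ("u3", 0.5), ("u4", 0.45)]
# u3, acoustically close to the pseudo-relevant set, gains the most.
print(prf_rerank(first_pass, sim))
```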
Pseudo Relevance Feedback (PRF) (3/3)

  MAP        SI      SA      SD
  Baseline   45.57   71.20   80.49
  PRF        52.10   75.60   82.72

  – SI : speaker-independent models
  – SA : adapted by global MLLR + class-based MLLR + MAP adaptation
  – SD : speaker-dependent models
• Improvement achievable, and apparently more significant
[Chen & Lee, Interspeech 10]

Recent Research Examples (4) - Improved Pseudo Relevance Feedback

Improved PRF - Graph-based Approach (1/5)
• Graph-based approach
  – in PRF only the top N utterances are taken as references, which is not necessarily reliable
  – instead, consider the acoustic similarity structure of all utterances in the first-pass retrieval results globally, using a graph
[Chen & Lee, ICASSP 11]

Improved PRF - Graph-based Approach (2/5)
• Construct a graph over all utterances in the first-pass retrieval results
  – nodes : utterances
  – edge weights : acoustic similarities between utterances
[Chen & Lee, ICASSP 11]
Improved PRF - Graph-based Approach (3/5)
• Utterances strongly connected to (similar to) utterances with high relevance scores should have their relevance scores increased
• Utterances strongly connected to (similar to) utterances with low relevance scores should have their relevance scores reduced
[Chen & Lee, ICASSP 11]

Improved PRF - Graph-based Approach (4/5)
• Relevance scores propagate over the graph
  – scores smoothed among strongly connected nodes, and the results re-ranked
[Chen & Lee, ICASSP 11]
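One simple way to realize this propagation is a random-walk style smoothing anchored to the first-pass scores; the interpolation weight and iteration count below are illustrative choices, not the cited paper's exact formulation.

```python
# Score propagation on the similarity graph: iteratively interpolate each
# utterance's relevance score with its neighbours' scores, weighted by
# row-normalized acoustic similarity, anchored to the first-pass scores.

import numpy as np

def propagate_scores(W, s0, alpha=0.7, iters=50):
    """W: (n, n) symmetric similarity matrix; s0: first-pass relevance scores."""
    P = W / W.sum(axis=1, keepdims=True)   # row-normalize to a transition matrix
    s = s0.copy()
    for _ in range(iters):
        s = alpha * P @ s + (1 - alpha) * s0   # smooth, but stay near first pass
    return s

W = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.1],
              [0.1, 0.1, 0.0]], dtype=float)
s0 = np.array([1.0, 0.2, 0.9])
print(propagate_scores(W, s0))  # x2 rises: strongly tied to high-scoring x1
```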
Improved PRF - Graph-based Approach (5/5)

  MAP          SI      SA      SD
  Baseline     45.57   71.20   80.49
  PRF          52.10   75.60   82.72
  Graph-based  53.42   79.08   82.96

• Graph-based > PRF > Baseline
  – globally considering the similarity structure works better than PRF referenced on the top N utterances only
[Chen & Lee, ICASSP 11]

Improved PRF - Machine Learning (1/4)
• Machine learning shown to be very useful in spoken term detection when a training set is available
[Wang, Interspeech 09][Wang, ICASSP 10][Tejedor, Interspeech 10]
• An example: use of a Support Vector Machine (SVM) in the pseudo relevance feedback scenario
[Tu & Lee, ASRU 11]

Improved PRF - Machine Learning (2/4)
• Train an SVM for each query
  – top N first-pass results "assumed" relevant → positive examples
  – bottom N "assumed" irrelevant → negative examples
  – features extracted from both sets, an SVM trained, and all first-pass results re-ranked by the SVM to produce the final results
[Tu & Lee, ASRU 11]
Improved PRF - Machine Learning (3/4)
• Each utterance represented by its hypothesized region, segmented by HMM states, with the feature vectors within each state averaged and then concatenated into a single fixed-length feature vector
(figure: a hypothesized region's feature vector sequence divided by state boundaries, with the frames in each state averaged)
[Tu & Lee, ASRU 11]
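A sketch of this feature construction and the per-query SVM, using scikit-learn; in practice the state boundaries would come from forced alignment, and all the data below are synthetic stand-ins.

```python
# Sketch of the SVM step: represent each hypothesized region by per-state
# averaged feature vectors concatenated into one fixed-length vector, train
# an SVM on the pseudo-labeled top/bottom results, and re-rank by margin.

import numpy as np
from sklearn.svm import SVC

def region_vector(frames, state_boundaries):
    """frames: (T, D) features; state_boundaries: [(start, end), ...] per state.
    Averages the frames within each state and concatenates the averages."""
    return np.concatenate([frames[s:e].mean(axis=0) for s, e in state_boundaries])

rng = np.random.default_rng(0)
bounds = [(0, 10), (10, 20), (20, 30)]
pos = [region_vector(rng.normal(loc=+1, size=(30, 13)), bounds) for _ in range(5)]
neg = [region_vector(rng.normal(loc=-1, size=(30, 13)), bounds) for _ in range(5)]

svm = SVC(kernel="linear").fit(pos + neg, [1] * 5 + [0] * 5)
candidate = region_vector(rng.normal(loc=+1, size=(30, 13)), bounds)
print(svm.decision_function([candidate]))  # larger margin => ranked higher
```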
Improved PRF - Machine Learning (4/4)

  MAP          SI      SA      SD
  Baseline     45.57   71.20   80.49
  PRF          52.10   75.60   82.72
  Graph-based  53.42   79.08   82.96
  SVM          59.31   81.63   84.66

• SVM > Graph-based > PRF > Baseline
  – the SVM offered significant improvements over PRF
[Tu & Lee, ASRU 11]

Context Consistency (1/4)
• All the above discussions primarily consider acoustic features/models; what about linguistic information?
• Context consistency
  – the same term usually appears in similar contexts, while quite different contexts usually imply different terms
[Schneiderl & Mertens, Interspeech 10][Lee & Lee, ICASSP 11][Tu & Lee, ASRU 11]
• Extract context information from the lattices
  – used in the SVM under the PRF scenario
[Lee & Lee, ICASSP 11][Tu & Lee, ASRU 11]

Context Consistency (2/4)
• Train a context model for each query by SVM
  – same pipeline as above: top N first-pass results as positive examples, bottom N as negative examples, context features extracted, an SVM trained, and the results re-ranked
[Lee & Lee, ICASSP 11][Tu & Lee, ASRU 11]
Context Consistency (3/4)
• Feature extraction
  – for the hypothesized region of query Q in each lattice, collect a V-dimensional vector (V : lexicon size) of word posteriors for the immediate left context, the immediate right context, and the whole segment
  – the three vectors are concatenated into a 3V-dimensional feature vector
(figure: a lattice with the hypothesized "Q" arcs and neighboring word arcs with posteriors; the left-context, right-context, and whole-segment posterior vectors over the lexicon A, B, C, D, …, Q are concatenated)
[Lee & Lee, ICASSP 11][Tu & Lee, ASRU 11]
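A minimal sketch of building the 3V-dimensional context feature, with the lattice simplified to lists of (word, posterior) hypotheses:

```python
# Sketch of the 3V-dimensional context feature: posterior-weighted word
# counts over the lexicon for the immediate left context, immediate right
# context, and the whole segment, concatenated. Lattice handling is
# simplified to flat hypothesis lists.

import numpy as np

def context_vector(left_hyps, right_hyps, segment_hyps, lexicon):
    """Each *_hyps: [(word, posterior), ...]; lexicon: word -> index (size V).
    Returns a 3V-dimensional vector: [left | right | whole-segment]."""
    V = len(lexicon)
    vec = np.zeros(3 * V)
    for offset, hyps in enumerate((left_hyps, right_hyps, segment_hyps)):
        for word, posterior in hyps:
            vec[offset * V + lexicon[word]] += posterior
    return vec

lex = {"A": 0, "B": 1, "C": 2, "D": 3}
v = context_vector([("A", 0.2), ("B", 0.3)],   # left of the hypothesized region
                   [("D", 0.1), ("C", 0.5)],   # right of the hypothesized region
                   [("A", 0.2), ("B", 0.6), ("C", 0.5), ("D", 0.4)], lex)
print(v.reshape(3, 4))  # rows: left, right, whole-segment posterior counts
```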
Context Consistency (4/4)

  MAP       SI      SA      SD
  Baseline  45.57   71.20   80.49
  Context   54.93   80.72   84.81

• Context consistency is helpful
[Lee & Lee, ICASSP 11][Tu & Lee, ASRU 11]

Recent Research Examples (5) - Concept Matching

Concept Matching for Spoken Content Retrieval (1/4)
• Concept matching rather than literal matching
• Returning utterances/documents semantically related to the query (e.g. Obama)
  – not necessarily containing the query itself (e.g. documents mentioning US and White House)
• Approach examples
  – clustering the document collection [Hu, Interspeech 10]
  – using web data for document or query expansion [Masumura, Interspeech 11][Akiba, Interspeech 11]
  – using latent topic models [Chang & Lee, SLT 08][Molgaard, ICASSP 07][Chen, ICASSP 09][Chen, Interspeech 11]
Concept Matching for Spoken Content Retrieval (2/4)
• An example: Probabilistic Latent Semantic Analysis (PLSA)
• A set of latent topics created between the set of terms and the set of documents
  – the term-topic and topic-document relationships modeled probabilistically, with the models trained by the EM algorithm
• Other well-known approaches: Latent Semantic Analysis (LSA), Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), …
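To illustrate concept matching with such a model, the sketch below scores a document by how likely its topic mixture generates the query terms, P(t|d) = Σ_z P(t|z) P(z|d); the toy parameters stand in for EM-trained ones.

```python
# Concept matching sketch with a trained PLSA model: instead of literal
# term overlap, score a document through its topic mixture. The parameters
# below are toy stand-ins for EM-trained values.

import numpy as np

P_t_z = np.array([[0.5, 0.0],    # "Obama"       | topics z1, z2
                  [0.3, 0.1],    # "White House"
                  [0.2, 0.9]])   # "baseball"
terms = {"Obama": 0, "White House": 1, "baseball": 2}

def concept_score(query_terms, P_z_d):
    """P_z_d: topic mixture P(z|d) of one document; returns product of P(t|d)."""
    return float(np.prod([P_t_z[terms[t]] @ P_z_d for t in query_terms]))

doc_us_politics = np.array([0.9, 0.1])   # mentions "White House", not "Obama"
doc_sports = np.array([0.1, 0.9])
print(concept_score(["Obama"], doc_us_politics))  # high: semantically related
print(concept_score(["Obama"], doc_sports))       # low
```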
Concept Matching for Spoken Content Retrieval (3/4)
(block diagram: lattices recognized from the spoken archive and the query are scored by semantic relevance evaluation with a Probabilistic Latent Semantic Analysis (PLSA) model to produce the retrieval results)
[Chang & Lee, SLT 08]

Concept Matching for Spoken Content Retrieval (4/4)
(figure: retrieval performance of the literal matching baseline vs. concept matching)
[Chang & Lee, SLT 08]

Recent Research Examples (6) - User-content Interaction

User-Content Interaction for Spoken Content Retrieval (1/2)
• Problems
  – unlike text content, spoken content is not easily summarized on screen, so the retrieved results are difficult to scan and select
  – user-content interaction is always important, even for text content
• Possible approaches
  – automatic summary/title generation and key term extraction for spoken content
  – semantic structuring for spoken content
  – multi-modal dialogue with improved interaction
(figure: the retrieval engine over the spoken archives is augmented with key terms/titles/summaries, semantic structuring, and multi-modal dialogue in the user interface)
[Pan & Lee, ASRU 05][L.-s. Lee, IEEE SPM 05][L.-s. Lee, Interspeech 06][Kong & Lee, ICASSP 09]
Key Term Extraction from Spoken Content (1/3)
• Key terms: key phrases and keywords
• Key phrase boundary detection
  – the left/right boundary of a key phrase detected by context statistics
• An example
  – "hidden" is almost always followed by the same word
  – "hidden Markov" is almost always followed by the same word
  – "hidden Markov model" is followed by many different words (represent, is, can, of, in, …)
  – so the right boundary is detected after "model"
[Chen & Lee, SLT 10]
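One way to quantify "followed by many different words" is the branching entropy of the right context; the sketch below uses illustrative counts and treats a jump in entropy as the boundary signal.

```python
# Boundary test sketch: a phrase's right boundary is where the following
# word stops being predictable. Right-context branching entropy is one way
# to measure this; the counts and threshold idea are illustrative.

import math
from collections import Counter

def right_branching_entropy(phrase, corpus_ngrams):
    """corpus_ngrams: Counter of (phrase, next_word) occurrence counts."""
    followers = Counter({w: c for (p, w), c in corpus_ngrams.items() if p == phrase})
    total = sum(followers.values())
    return -sum((c / total) * math.log2(c / total) for c in followers.values())

counts = Counter({("hidden", "Markov"): 50,
                  ("hidden Markov", "model"): 48, ("hidden Markov", "chain"): 2,
                  ("hidden Markov model", "is"): 15, ("hidden Markov model", "of"): 12,
                  ("hidden Markov model", "can"): 10, ("hidden Markov model", "in"): 11})
for p in ["hidden", "hidden Markov", "hidden Markov model"]:
    print(p, round(right_branching_entropy(p, counts), 2))
# Entropy jumps at "hidden Markov model": the right boundary is detected there.
```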
Key Term Extraction from Spoken Content (2/3)
• Prosodic features
  – key terms are probably produced with longer duration, wider pitch range and higher energy
• Semantic features (e.g. PLSA)
  – key terms are usually focused on a smaller number of topics
(figure: the topic distribution P(Tk|ti) over topics k is sharp for a key term but flat for a non-key term)
• Lexical features
  – TF/IDF, POS tag, etc.
[Hsieh & Lee, ICASSP 06][Chen & Lee, SLT 10][Kong & Lee, IEEE T. ASL 11]

Key Term Extraction from Spoken Content (3/3)
• All three sets of features are useful (trained with a neural net)

  F-measure:  Pr 20.78 | Lx 42.86 | Sm 35.63 | Pr+Lx 48.15 | Pr+Lx+Sm 56.55
  (Pr: prosodic, Lx: lexical, Sm: semantic)

[Chen & Lee, SLT 10]

Extractive Summarization of Spoken Documents (1/3)
• Selecting the most representative utterances from the original document while avoiding redundancy
(figure: a document d with utterances X1 … X6 containing both correctly and wrongly recognized words; X1 and X3 are selected as the summary of d)
[Furui, et al, ICASSP 05][Furui, et al, IEEE T. SAP 04][Hirschberg, et al, Interspeech 05]
[Murray, Renals, et al, ACL 05][Murray, Renals, et al, HLT 06]
[Kawahara, et al, ICASSP 04][Nakagawa, et al, SLT 06]
[Zhu & Penn, Interspeech 06][Fung, et al, ICASSP 08]
[Kong & Lee, ICASSP 06][Kong & Lee, SLT 06]
[Gillick, ICASSP 09][Li, ASRU 09][Lin, ICASSP 10]
[Liu, ICASSP 08][Xie, ASRU 09][Xie, Interspeech 10]
[Kong & Lee, IEEE T. ASL 11]
Extractive Summarization of Spoken Documents (2/3)
• An example: utterances topically similar to the representative utterances should also be considered representative
• Graph-based analysis used to find representative utterances, corrected by avoiding redundancy
(figure: utterances x1 … x5 connected in a similarity graph; representative nodes are selected to construct the summary)
[Chen & Lee, Interspeech 11]
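A sketch in this spirit: rank utterances by similarity-weighted support from the others, then select greedily while penalizing similarity to what is already chosen (an MMR-style redundancy correction; the cited paper's exact formulation may differ).

```python
# Graph-based extractive selection sketch with redundancy control.
# Centrality and the MMR-style penalty are illustrative choices.

import numpy as np

def summarize(W, base_scores, k=2, lam=0.3):
    """W: (n, n) topical-similarity matrix; base_scores: per-utterance salience."""
    centrality = base_scores + W.sum(axis=1)   # support from similar utterances
    selected = []
    while len(selected) < k:
        best, best_val = None, -np.inf
        for i in range(len(centrality)):
            if i in selected:
                continue
            redundancy = max((W[i, j] for j in selected), default=0.0)
            val = lam * centrality[i] - (1 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
    return selected

W = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
# Picks x1 first, then x3: x2 is penalized for redundancy with x1.
print(summarize(W, np.array([0.9, 0.8, 0.5])))
```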
Extractive Summarization of Spoken Documents (3/3)
• Graph-based analysis is helpful
(figures: ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-L F-measures at 10%, 20% and 30% summarization ratios, comparing the graph-based approach with the baseline)
[Chen & Lee, Interspeech 11]

Title Generation for Spoken Documents
• Titles for retrieved documents/segments are helpful for browsing and selecting the retrieved results
• Short, readable, telling what the document/segment is about
• One example: scored Viterbi search
  – a training corpus provides term selection, term ordering and title length models
  – the spoken document is first summarized (ASR + automatic summarization), then the Viterbi algorithm searches over the summary for the output title
[Witbrock & Mittal, SIGIR 99][Jin & Hauptmann, HLT 01]
[Chen & Lee, Interspeech 03][Wang & Lee, SLT 08]

Semantic Structuring (1/2)
• Example 1: retrieved results clustered by latent topics and organized in a two-dimensional tree structure (multi-layered map)
  – each cluster labeled by a set of key terms representing a group of retrieved documents/segments
  – each cluster can be expanded into a map in the next layer
[Li & Lee, Interspeech 05][L.-s. Lee, et al, Interspeech 06][Kong & Lee, IEEE T. ASL 11]

Semantic Structuring (2/2)
• Example 2: key-term graph
  – each retrieved spoken document/segment labeled by a set of key terms
  – relationships between key terms represented by a graph (e.g. Acoustic Modeling, Viterbi search, HMM, Language Modeling, Perplexity linked by their relationships)
[Kong & Lee, ICASSP 09]
Multi-modal Dialogue (1/3)
• An example: user-system interaction modeled as a Markov Decision Process (MDP), very similar to spoken dialogues
• Example goals
  – high task success rate (success: the user's information need is satisfied)
  – small average number of dialogue turns (average number of query terms entered) for successful tasks
• A reward function is defined and maximized with simulated users
[Pan & Lee, ASRU 05][Pan & Lee, Interspeech 06][Pan & Lee, SLT 06][Pan & Lee, ASRU 07]

Multi-modal Dialogue (2/3)
• An example application scenario: retrieving broadcast news
  – in each step the system returns the retrieved results plus a list of key terms for the user to select
  – the user looks through the list from the top and selects the first key term relevant to his information need
  – the key terms on the list are ranked by the MDP
  – e.g. for the information need "news about the meeting of Obama and Hu": step 1, the user enters the query "U.S. President" and is offered key terms such as Diplomatic, Economic, Middle East, China, U.S.; step 2, the user selects "Diplomatic" and the results are refined
[Pan & Lee, ASRU 05][Pan & Lee, Interspeech 06][Pan & Lee, SLT 06][Pan & Lee, ASRU 07]

Multi-modal Dialogue (3/3)
• Far fewer steps needed for successful retrieval sessions
• Far fewer failed sessions
(figure: number of retrieval sessions vs. number of interaction steps needed to complete a successful retrieval session, plus the number of failed sessions, for the proposed approach and the wpq and lca baselines standard in text retrieval)
[Pan & Lee, ASRU 05][Pan & Lee, Interspeech 06][Pan & Lee, SLT 06][Pan & Lee, ASRU 07]
User-Content Interaction for Spoken Content Retrieval (2/2)
• Problems
  – unlike text content, spoken content is not easily summarized on screen, so the retrieved results are difficult to scan and select
  – user-content interaction is always important, even for text content
• Possible approaches
  – automatic summary/title generation and key term extraction for spoken content
  – semantic structuring for spoken content
  – multi-modal dialogue with improved interaction
Demo
• Link to the demo system: http://speech.ee.ntu.edu.tw/~RA/lecture/ (please browse it with Firefox)
Course Lectures
• Many course lectures are available over the Internet
  – it takes a very long time to listen to a complete course (e.g. 45 hours)
  – not easy for engineers or industry leaders to learn new knowledge via course lectures
  – goal: to help people learn easily
• Lecture browsers are available over the Internet, but
  – knowledge in course lectures is structured, one concept following another
  – a retrieved lecture segment may not be easy to understand for a learner without enough background knowledge
  – given the retrieved segment, there is no information for the learner about what should be learned next
[Kong & Lee, ICASSP 09]

Proposed Approach
• Structuring course lectures by slides and key terms
  – dividing the course lecture by slides
  – deriving the core content of each slide as key terms
  – constructing the semantic relationships among slides with a key term graph
  – every slide is given its length, its timing information within the course, a summary, its key terms, and related key terms and slides based on the key term graph, to help the learner decide whether to listen to it
  – retrieved spoken segments carry all the above information about the slides they belong to, to help the user browse the retrieved results
[Kong & Lee, ICASSP 09]
A Prototype System
• A course on Digital Speech Processing
  – code-mixing: lectures given in the host language, Mandarin Chinese, with all terminology produced directly in the guest language, English, inserted into the Mandarin utterances just like Chinese words; all slides in English
• Corpus transcription
  – bilingual phone set of 75 units
  – bilingual lexicon of 12.3k words
  – class-based tri-gram language model adapted with slide information
  – character/word accuracy of 64.77%-81.63% across different slides
[Lee & Lee, ASRU 09][Kong & Lee, ICASSP 09][Yeh & Lee, Interspeech 11]

Conclusion – Spoken Language Processing over the Internet
• User interface
  – successful, but not very easy, since users usually expect the technology to replace human beings
• Content analysis / user-content interaction
  – technology can handle massive quantities of content, while human beings cannot
• Spoken content retrieval
  – integrates the user interface with content analysis and user-content interaction
  – offers very attractive applications for spoken language processing technologies