Telefonica Research System for the Spoken Web Search task at Mediaeval 2012

Telefonica Research at Mediaeval 2012
Spoken Web Search Task

Xavier Anguera

Outline

•  System description
–  Speech activity detection
•  Proposed systems
–  Segmental-DTW
–  IR-DTW
•  Results

Proposed overall system

[Figure: block diagram of the overall system, with S-DTW and IR-DTW as the two matching alternatives]

Frontend

•  MFCC-39 features: (12 Cepstra + Energy) + Delta + DeltaDelta
•  Mean & variance normalization at sentence level
•  Posterior probabilities from a GMM background model
•  L2-normalization
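The frontend chain above can be sketched in a few lines. This is only an illustrative sketch, not the system's actual code: the function names are hypothetical, the GMM is assumed to have diagonal covariances, and MFCC extraction itself is left out (frames are taken as already computed feature vectors).

```python
import math

def cmvn(frames):
    """Mean and variance normalization over one sentence (a list of feature vectors)."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        # Guard against constant dimensions (assumption: leave them unscaled).
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]

def gmm_posteriors(frame, weights, means, variances):
    """Posterior of each Gaussian given one frame (diagonal covariances assumed)."""
    log_liks = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        log_liks.append(ll)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_liks)
    exps = [math.exp(ll - top) for ll in log_liks]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)
```

Under these assumptions, each sentence would go through `cmvn`, then every frame through `gmm_posteriors` with the background model parameters, and finally through `l2_normalize`.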

Background model training

•  Iterative 128 Gaussian Splitting
•  EM-ML GMM training
•  K-means assignment

[1] "Speaker Independent discriminant feature extraction for acoustic pattern matching", Xavier Anguera, ICASSP 2012

Silence modeling

10% lowest energy frames
•  1 Gauss for noise and 4 Gauss for speech

Silence/Speech GMM training
•  Perform 10 iterations or while % variation is high

Decode the data

2234444343322444444444443222222234444444444444444444444443210000011222443

Threshold set to values <2 (i.e. silence + lowest speech)
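As a rough illustration of the thresholding step, the sketch below turns a decoded label string like the one above into speech segments. It assumes that label 0 is the noise Gaussian and that labels 1 to 4 are the speech Gaussians ordered from lowest to highest, so that labels below 2 are treated as non-speech; the function names and the half-open segment convention are hypothetical.

```python
def speech_mask(labels, threshold=2):
    """True for frames whose decoded Gaussian label is at or above the threshold."""
    return [int(ch) >= threshold for ch in labels]

def speech_segments(mask):
    """Collapse a per-frame mask into (start, end) frame ranges, end exclusive."""
    segments = []
    start = None
    for i, is_speech in enumerate(mask):
        if is_speech and start is None:
            start = i                      # a speech run begins
        elif not is_speech and start is not None:
            segments.append((start, i))    # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments
```

For example, `speech_segments(speech_mask("4321000001122"))` yields two segments, one before and one after the low-label stretch in the middle.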

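Overlap postprocessing

•  We compute the percentage of overlap between all matching paths

$$\mathrm{Ovl} = \frac{\min(\mathrm{End}_1, \mathrm{End}_2) - \max(\mathrm{Start}_1, \mathrm{Start}_2)}{\min(\mathrm{End}_1 - \mathrm{Start}_1,\ \mathrm{End}_2 - \mathrm{Start}_2)}$$

•  For pairs with > 0.5 overlap
–  Select the match with highest score

[Figure: Match1 spans Start1 to End1 and Match2 spans Start2 to End2; in the example shown, Ovl = (min(ends) - max(starts)) / min(size1, size2) = 0.8]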
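The overlap measure and the selection rule above can be sketched as follows. The greedy pruning loop is one possible way to apply the rule and is an assumption, as are the function names, the `(start, end, score)` tuple layout, and the convention that a higher score is better; matches are assumed to have `end > start`.

```python
def overlap(start1, end1, start2, end2):
    """Shared span divided by the length of the shorter match."""
    shared = min(end1, end2) - max(start1, start2)
    shortest = min(end1 - start1, end2 - start2)
    return shared / shortest

def prune_overlaps(matches, max_overlap=0.5):
    """Keep, among matches overlapping by more than max_overlap, the best-scoring one.

    matches: list of (start, end, score) tuples.
    """
    kept = []
    # Visit matches from best to worst so that a kept match always outranks
    # any later match it suppresses.
    for m in sorted(matches, key=lambda m: m[2], reverse=True):
        if all(overlap(m[0], m[1], k[0], k[1]) <= max_overlap for k in kept):
            kept.append(m)
    return kept
```

With two matches spanning frames 0 to 10 and 2 to 12, `overlap(0, 10, 2, 12)` gives 0.8, so only the higher-scoring of the two would be kept.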

S-DTW submission

•  Based on last year's submission but with the system improvements above

DTW local constraints

•  No global constraints are applied, in order to allow for matching of any segment among both sequences
•  Local constraints are set to allow warping up to 2X

$$D(m, n) = \min \begin{cases} \dfrac{D(m-2,\, n) + d(x_m, y_n)}{\mathrm{jumps}(m-2,\, n) + 3} \\[2ex] \dfrac{D(m,\, n-2) + d(x_m, y_n)}{\mathrm{jumps}(m,\, n-2) + 3} \\[2ex] \dfrac{D(m-2,\, n-2) + d(x_m, y_n)}{\mathrm{jumps}(m-2,\, n-2) + 4} \end{cases}$$

[Figure: local path diagram with points labeled (m, n), (m-2, n-1), (m-1, n-2) and (m-1, n-1)]

•  Posteriorgram features distance:

$$d(x_m, y_n) = -\log\left(\sum_{i=0}^{N-1} x_m[i] \cdot y_n[i]\right)$$
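The posteriorgram distance above is the negative log of the inner product of the two feature vectors. A minimal sketch follows; the `floor` parameter is an assumption added here to avoid taking the log of zero when the two vectors share no probability mass, and the function name is hypothetical.

```python
import math

def posteriorgram_distance(x, y, floor=1e-12):
    """-log of the dot product between two posteriorgram frames."""
    dot = sum(a * b for a, b in zip(x, y))
    # Assumption: clamp the dot product so orthogonal frames give a large,
    # finite distance instead of raising a math domain error.
    return -math.log(max(dot, floor))
```

For L2-normalized inputs, identical frames give a distance of about 0 and the distance grows as the frames become less similar.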

S-DTW algorithm

[Figure: S-DTW alignment between the query term and the reference term]

IR-DTW

•  Total rework from last year's system
•  Aim at keeping the same accuracy, but:
–  Much less memory usage
–  Faster retrieval
•  IR (Information Retrieval) because we use reference features indexing for fast nearest neighbors retrieval

Official results

MTWV      Dev-dev   Dev-eval   Eval-dev   Eval-eval
IR-DTW    0.3903    0.3139     0.4983     0.3416
S-DTW     0.3745    0.3001     0.4716     0.3113

ATWV      Dev-dev   Dev-eval   Eval-dev   Eval-eval
IR-DTW    0.3866    0.3042     0.4219     0.3301
S-DTW     0.3644    0.292      0.3988     0.2942

[Figures: DET curves, miss probability (in %) vs. false alarm probability (in %), each with a random performance reference line]

DEV-DEV results: IR-DTW MTWV=0.390 (Scr=0.387), S-DTW MTWV=0.375 (Scr=0.695)
EVAL-EVAL results: IR-DTW MTWV=0.342, S-DTW MTWV=0.311
DEV-EVAL results: IR-DTW MTWV=0.314, S-DTW MTWV=0.300
EVAL-DEV results: IR-DTW MTWV=0.498, S-DTW MTWV=0.472

Summary

•  We propose 2 systems, both sharing the same framework
•  Some improvements in the framework were incorporated: speech/silence classification, new overlap detection, modified background model.
•  IR-DTW is a total reimplementation of S-DTW, using information retrieval concepts

Xavier Anguera
xanguera@tid.es
