Telefonica Research at MediaEval 2012 Spoken Web Search Task
Xavier Anguera
Outline
• System description
   – Speech activity detection
• Proposed systems
   – Segmental-DTW
   – IR-DTW
• Results
Proposed overall system
[Figure: overall system diagram; labels: S-DTW, IR-DTW]
Frontend
MFCC-39 features (12 cepstra + energy) + delta + delta-delta
Mean & variance normalization at sentence level
Posterior probabilities from a GMM background model
L2-normalization
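A minimal NumPy sketch of this frontend chain follows. It assumes the MFCC-39 matrix is already computed (frames × 39) and that the background GMM has diagonal covariances, which the slide does not state; the function names are hypothetical.

```python
import numpy as np


def cmvn(features):
    """Mean and variance normalization over one sentence (frames x 39)."""
    std = np.maximum(features.std(axis=0), 1e-8)     # guard against constant dims
    return (features - features.mean(axis=0)) / std


def gmm_posteriors(features, weights, means, variances):
    """Per-frame posteriors P(k | frame) of a diagonal-covariance GMM.

    features: (T, D); weights: (K,); means and variances: (K, D).
    """
    diff = features[:, None, :] - means[None, :, :]                   # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None])
    log_joint = np.log(weights)[None] + log_gauss                     # (T, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)                 # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def posteriorgram(mfcc39, weights, means, variances):
    """MFCC-39 -> sentence-level CMVN -> GMM posteriors -> L2-normalized rows."""
    post = gmm_posteriors(cmvn(mfcc39), weights, means, variances)
    return post / np.linalg.norm(post, axis=1, keepdims=True)
```

Each output row is a unit-length posterior vector, so the inner product between two frames behaves like a cosine similarity.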
  
   	
  
Background model training
Iterative Gaussian splitting (128 Gaussians)
EM-ML GMM training
K-means assignment

[1] X. Anguera, “Speaker independent discriminant feature extraction for acoustic pattern matching,” ICASSP 2012.
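The splitting loop can be sketched as a binary split followed by K-means re-assignment. This is a hypothetical illustration, not the system's code: the perturbation size `eps`, the number of K-means iterations and the function names are assumptions, and the EM-ML GMM re-estimation step on the slide is left out (the centroids and hard assignments returned here would typically seed it).

```python
import numpy as np


def kmeans_refine(data, centroids, iters=10):
    """K-means: assign every frame to its nearest centroid, then re-estimate centroids."""
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iters):
        dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(len(centroids)):
            members = data[labels == k]
            if len(members):                    # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids, labels


def split_to(data, target=128, eps=0.01, iters=10):
    """Grow 1 -> 2 -> 4 -> ... centroids by splitting until `target` is reached.

    `target` should be a power of two (128 on the slide); otherwise the last
    split overshoots it.
    """
    centroids = data.mean(axis=0, keepdims=True)
    labels = np.zeros(len(data), dtype=int)
    offset = eps * data.std(axis=0)             # perturbation applied at each split
    while len(centroids) < target:
        centroids = np.vstack([centroids - offset, centroids + offset])
        centroids, labels = kmeans_refine(data, centroids, iters)
    return centroids, labels
```

The full frames × centroids distance matrix keeps the sketch short; with real amounts of data it would be computed in chunks.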
  
Silence modeling
10% lowest-energy frames
Silence/speech GMM training
• 1 Gaussian for noise and 4 Gaussians for speech
• Perform 10 iterations, or iterate while the % variation is high
Decode the data
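A rough sketch of this loop on one-dimensional frame log-energies follows. Several details are assumptions rather than facts from the slide: the feature (log-energy only), the stopping rule (stop early once fewer than 1% of frames switch between silence and speech), and ordering the four speech Gaussians by mean energy so that index 1 is the lowest speech Gaussian. Function names are hypothetical.

```python
import numpy as np


def component_loglik(x, weights, means, variances):
    """log(w_k * N(x | mean_k, var_k)) for every frame and Gaussian -> (frames, K)."""
    return (np.log(weights) - 0.5 * np.log(2 * np.pi * variances)
            - 0.5 * (x[:, None] - means) ** 2 / variances)


def fit_gmm_1d(x, k, iters=20):
    """Small EM for a 1-D GMM with k Gaussians; returns (weights, means, variances)."""
    means = np.quantile(x, (np.arange(k) + 0.5) / k)    # spread the initial means
    variances = np.full(k, x.var() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        ll = component_loglik(x, weights, means, variances)
        resp = np.exp(ll - ll.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)         # E-step: responsibilities
        nk = resp.sum(axis=0) + 1e-10                   # M-step
        weights = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances


def decode_frames(energy, max_iters=10, min_change=0.01):
    """Per-frame Gaussian index: 0 = noise, 1..4 = speech Gaussians by rising energy."""
    labels = (energy > np.quantile(energy, 0.10)).astype(int)   # 10% lowest -> noise
    for _ in range(max_iters):
        noise_x, speech_x = energy[labels == 0], energy[labels > 0]
        if len(noise_x) < 2 or len(speech_x) < 4:
            break
        _, nm, nv = fit_gmm_1d(noise_x, 1)              # 1 Gaussian for noise
        sw, sm, sv = fit_gmm_1d(speech_x, 4)            # 4 Gaussians for speech
        order = np.argsort(sm)
        p_noise = len(noise_x) / len(energy)
        weights = np.concatenate([[p_noise], (1 - p_noise) * sw[order]])
        means = np.concatenate([nm, sm[order]])
        variances = np.concatenate([nv, sv[order]])
        new = component_loglik(energy, weights, means, variances).argmax(axis=1)
        changed = np.mean((new > 0) != (labels > 0))
        labels = new
        if changed < min_change:                        # the "% variation" is low: stop
            break
    return labels
```

The returned indices have the same form as the decoded digit sequence that follows, with 0 for noise and higher values for higher-energy speech Gaussians.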
  
2234444343322444444444443222222234444444444444444444444443210000011222443	
  




Threshold set to values < 2 (i.e., silence + lowest speech)
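Applying the threshold is then a simple scan over the decoded sequence. The sketch below assumes each digit is the index of the winning Gaussian (0 for noise, 1 to 4 for speech) and returns the speech regions as half-open frame ranges; the function name is hypothetical.

```python
from typing import List, Tuple


def speech_segments(decoded: str, threshold: int = 2) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose decoded value is >= threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for t, ch in enumerate(decoded):
        is_speech = int(ch) >= threshold       # values < 2: silence + lowest speech
        if is_speech and start is None:
            start = t                          # a speech region opens
        elif not is_speech and start is not None:
            segments.append((start, t))        # a speech region closes
            start = None
    if start is not None:                      # region still open at the last frame
        segments.append((start, len(decoded)))
    return segments


example = "2234444343322444444444443222222234444444444444444444444443210000011222443"
print(speech_segments(example))
```

On the example sequence this yields two speech regions separated by the single run of 0s and 1s near the end.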
  
Overlap postprocessing
• We compute the overlap ratio between all pairs of matching paths:
$$Ovl = \frac{\min(\mathrm{End}_1, \mathrm{End}_2) - \max(\mathrm{Start}_1, \mathrm{Start}_2)}{\min(\mathrm{End}_1 - \mathrm{Start}_1,\ \mathrm{End}_2 - \mathrm{Start}_2)}$$
• For pairs with > 0.5 overlap
   – Select the match with the highest score
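[Figure: two overlapping matches, Match1 from Start1 to End1 and Match2 from Start2 to End2]
$$Ovl = \frac{\min(\text{ends}) - \max(\text{starts})}{\min(\text{size}_1, \text{size}_2)} = 0.8$$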
S-DTW submission
• Based on last year's submission, but with the system improvements described above
DTW local constraints
• No global constraints are applied, so that any segment of one sequence can be matched against any segment of the other
• Local constraints are set to allow warping of up to 2X
$$
D(m,n) = \min \begin{cases}
\dfrac{D(m-2,\,n) + d(x_m, y_n)}{\mathrm{jumps}(m-2,\,n) + 3} \\[2ex]
\dfrac{D(m,\,n-2) + d(x_m, y_n)}{\mathrm{jumps}(m,\,n-2) + 3} \\[2ex]
\dfrac{D(m-2,\,n-2) + d(x_m, y_n)}{\mathrm{jumps}(m-2,\,n-2) + 4}
\end{cases}
$$
[Figure: local path diagram marking cells (m, n), (m−2, n−1), (m−1, n−2) and (m−1, n−1)]
• Posteriorgram features distance:
$$d(x_m, y_n) = -\log\left(\sum_{i=0}^{N-1} x_m[i] \cdot y_n[i]\right)$$
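For illustration, a Python sketch of the distance above and of a subsequence DTW that matches the whole query anywhere in the reference follows. It is not the submitted system: x is taken to be the query and y the reference, the step pattern uses the predecessor cells (m−1, n−1), (m−2, n−1) and (m−1, n−2) from the path diagram, and each candidate step is compared by its cost averaged over the path cells instead of the jumps(·) weighting of the recursion above. The function names are hypothetical.

```python
import numpy as np


def posteriorgram_distance(x, y):
    """d(x_m, y_n) = -log(sum_i x_m[i] * y_n[i]) for every frame pair -> (M, N)."""
    return -np.log(np.maximum(x @ y.T, 1e-12))      # floor avoids log(0)


def subsequence_dtw(query, reference):
    """Best reference region for the whole query; returns (avg_cost, start_n, end_n).

    No global constraint: the match may start and end at any reference frame.
    Steps into (m, n) come from (m-1, n-1), (m-2, n-1) or (m-1, n-2), which
    keeps the warping factor within 2X.
    """
    d = posteriorgram_distance(query, reference)
    M, N = d.shape
    acc = np.full((M, N), np.inf)                   # accumulated cost of the best path
    length = np.zeros((M, N), dtype=int)            # number of cells on that path
    start = np.zeros((M, N), dtype=int)             # reference frame where it started
    acc[0] = d[0]                                   # free start anywhere in the reference
    length[0] = 1
    start[0] = np.arange(N)
    for m in range(1, M):
        for n in range(1, N):
            best = np.inf
            for pm, pn in ((m - 1, n - 1), (m - 2, n - 1), (m - 1, n - 2)):
                if pm < 0 or pn < 0 or not np.isfinite(acc[pm, pn]):
                    continue
                cand = (acc[pm, pn] + d[m, n]) / (length[pm, pn] + 1)   # average cost
                if cand < best:
                    best = cand
                    acc[m, n] = acc[pm, pn] + d[m, n]
                    length[m, n] = length[pm, pn] + 1
                    start[m, n] = start[pm, pn]
    norm = acc[M - 1] / np.maximum(length[M - 1], 1)  # free end: best last-row cell
    end = int(np.argmin(norm))
    return float(norm[end]), int(start[M - 1, end]), end
```

Averaging over path cells is one simple way to avoid favouring paths that take the longer skip steps; the slide's jumps(·) denominators serve a similar purpose.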
S-DTW algorithm
[Figure: S-DTW alignment between the query term and the reference term]
IR-DTW
• Total rework of last year's system
• Aims at keeping the same accuracy, but with:
   – Much less memory usage
   – Faster retrieval
• "IR" (Information Retrieval) because we index the reference features for fast nearest-neighbor retrieval
Official results

MTWV      Dev-dev   Dev-eval   Eval-dev   Eval-eval
IR-DTW    0.3903    0.3139     0.4983     0.3416
S-DTW     0.3745    0.3001     0.4716     0.3113

ATWV      Dev-dev   Dev-eval   Eval-dev   Eval-eval
IR-DTW    0.3866    0.3042     0.4219     0.3301
S-DTW     0.3644    0.292      0.3988     0.2942
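ATWV and MTWV are derived from the term-weighted value (TWV) of the NIST spoken term detection evaluations: ATWV is the TWV at the system's actual decision threshold, and MTWV is the maximum TWV over all thresholds. The sketch below computes TWV from per-term counts at one threshold. The weight β = 999.9 is the NIST STD 2006 default, and the number of non-target trials is approximated as one per second of speech minus the true occurrences; the exact scoring parameters of this evaluation are not given on these slides, so treat them, and the names below, as assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TermCounts:
    n_true: int      # occurrences of the term in the reference
    n_correct: int   # detections matching a true occurrence
    n_false: int     # detections with no matching occurrence


def twv(terms: List[TermCounts], speech_seconds: float, beta: float = 999.9) -> float:
    """TWV = 1 - mean over terms of (P_miss + beta * P_FA) at one decision threshold."""
    losses = []
    for t in terms:
        if t.n_true == 0:                 # terms absent from the data are not scored
            continue
        p_miss = 1.0 - t.n_correct / t.n_true
        p_fa = t.n_false / (speech_seconds - t.n_true)   # non-target trials ~ seconds - n_true
        losses.append(p_miss + beta * p_fa)
    if not losses:
        raise ValueError("no term with reference occurrences")
    return 1.0 - sum(losses) / len(losses)
```

Sweeping the threshold, recomputing the counts and keeping the largest value would give MTWV. The large β means a single false alarm per hour of speech costs about as much as missing over a quarter of a term's occurrences.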
  
DEV-DEV results
[Figure: DET curves, miss probability (%) vs. false alarm probability (%); IR-DTW MTWV=0.390 (Scr=0.387), S-DTW MTWV=0.375 (Scr=0.695); random performance shown for reference]
EVAL-EVAL results
[Figure: DET curves, miss probability (%) vs. false alarm probability (%); IR-DTW MTWV=0.342, S-DTW MTWV=0.311; random performance shown for reference]
DEV-EVAL results
[Figure: DET curves, miss probability (%) vs. false alarm probability (%); IR-DTW MTWV=0.314, S-DTW MTWV=0.300; random performance shown for reference]
EVAL-DEV results
[Figure: DET curves, miss probability (%) vs. false alarm probability (%); IR-DTW MTWV=0.498, S-DTW MTWV=0.472; random performance shown for reference]
Summary
Xavier Anguera, xanguera@tid.es
• We propose 2 systems, both sharing the same framework
• Some improvements were incorporated into the framework: speech/silence classification, new overlap detection, and a modified background model
• IR-DTW is a total reimplementation of S-DTW, using information retrieval concepts
