Real-time Automatic Speaker Segmentation Luís Gustavo Martins UTM – INESC Porto [email_address] http://www.inescporto.pt/~lmartins LabMeetings March 16, 2006 INESC Porto
Notice This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/pt/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Summary System Overview Audio Analysis front-end Speaker Coarse Segmentation Speaker Change Validation Speaker Model Update Experimental Results Achievements Conclusions
Scope Objective Development of a Real-time, Automatic Speaker Segmentation module Already having in mind for future development: Speaker Tracking Speaker Identification Challenges No prior knowledge about the number and identities of the speakers On-line and Real-time operation Audio data is not available beforehand Must use only the small amounts of speaker data as they arrive  iterative and computationally intensive methods are unfeasible
System Overview
Audio Analysis front-end
Audio Analysis front-end Front-end Processing 8kHz, 16 bit, pre-emphasized, mono speech streams 25ms analysis  frames  with no overlap Speech  segments  with 2.075 secs and 1.4 secs overlap Consecutive  sub-segments  with 1.375 secs each
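The windowing scheme above can be sketched as follows (sample rate and window sizes taken from the slide; function names are illustrative, and `round()` is used so the 0.675 s hop implied by the 2.075 s / 1.4 s overlap comes out exact):

```python
def frame_bounds(n_samples, sr=8000, frame_ms=25):
    """Split a stream into non-overlapping 25 ms analysis frames
    (200 samples at 8 kHz)."""
    flen = sr * frame_ms // 1000
    return [(s, s + flen) for s in range(0, n_samples - flen + 1, flen)]

def segment_starts(n_samples, sr=8000, seg_s=2.075, overlap_s=1.4):
    """Start samples of 2.075 s speech segments with 1.4 s overlap,
    i.e. a 0.675 s hop between consecutive segments."""
    seg_len = round(sr * seg_s)
    hop = round(sr * (seg_s - overlap_s))
    return list(range(0, n_samples - seg_len + 1, hop))
```

At 8 kHz, five seconds of audio yields segments starting every 5400 samples (0.675 s).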
Audio Analysis front-end Feature Extraction (1) Speaker Modeling 10th-order LPC / LSP Source / Filter approach Other possible features… MFCC Pitch …
Audio Analysis front-end LPC Modeling (1)  [Rabiner93, Campbell97]   Linear Predictive Coding Order  p Yule-Walker equations Durbin’s recursive algorithm Toeplitz autocorrelation matrix Autocorrelation method
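The deck's actual LPC routines were C++ adapted from Marsyas; a minimal, illustrative numpy reimplementation of the autocorrelation method with the Levinson-Durbin recursion over the Toeplitz system is:

```python
import numpy as np

def lpc(x, order=10):
    """LPC coefficients a[0..order] (a[0] = 1) via the autocorrelation
    method, solving the Yule-Walker equations with Levinson-Durbin."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # autocorrelation lags r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For a first-order autoregressive signal x[n] = 0.5 x[n-1] + e[n], the order-1 fit recovers a[1] close to -0.5.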
Audio Analysis front-end LPC Modeling (2) Whitening Filter  [figure: FFT spectrum with pitch harmonics vs. smooth LPC spectrum envelope]
Audio Analysis front-end LSP Modeling [Campbell97] Line Spectral Pairs More robust to quantization, and commonly used in speech coding Derived from the LPC a_k coefficients Zeros of A(z) mapped to the unit circle in the Z-Domain Use of a pair of (p+1)-order polynomials
Speaker Modeling Speaker information is mostly contained in the voiced part of the speech signal… Can you identify who’s speaking? LPC / LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals Unvoiced/Silence data degrades speaker model accuracy!  Select only voiced data for processing… Audio Analysis front-end [figure: unvoiced vs. voiced speech frames]
Audio Analysis front-end Voiced / Unvoiced / Silence (V/U/S) detection Feature Extraction (2) Short Time Energy   (STE)    silence detection Zero Crossing Rate   (ZCR)  voiced / unvoiced detection
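The two frame-level features above are cheap to compute, which is what makes the V/U/S front-end real-time. A minimal sketch (normalizations are an assumption; the slide does not specify them):

```python
import numpy as np

def short_time_energy(frame):
    """Mean squared amplitude of one analysis frame: low for silence."""
    f = np.asarray(frame, dtype=float)
    return float(np.mean(f * f))

def zero_crossing_rate(frame):
    """Fraction of sign changes between consecutive samples:
    low for (quasi-periodic) voiced speech, high for noise-like
    unvoiced speech."""
    s = np.sign(np.asarray(frame, dtype=float))
    s[s == 0] = 1.0  # treat exact zeros as positive
    return float(np.mean(s[1:] != s[:-1]))
```

An all-silence frame has zero energy, while a maximally oscillating frame has a ZCR of 1.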
Audio Analysis front-end V/U/S speech classes modeled by 2-D Gaussian Distributions Simple and Fast  real-time operation Dataset: ~4 minutes of manually annotated speech signals 2 male and 2 female Portuguese speakers [figure: voiced, unvoiced and silence clusters in the ZCR vs. STE plane]
Audio Analysis front-end Manual Annotation of V/U/S segments in a speech signal
Audio Analysis front-end V/U/S Speech dataset Voiced / Unvoiced / Silence stratification in manually segmented audio files: Portuguese Male 1: Total Time = 60 secs; voiced = 37 secs (62%); unvoiced = 12 secs (20%); silence = 10 secs (17%); Voiced/(Voiced+Unvoiced) = 76%; Unvoiced/(Voiced+Unvoiced) = 24% Portuguese Male 2: Total Time = 60 secs; voiced = 30 secs (50.0%); unvoiced = 14 secs (23.3%); silence = 13 secs (21.6%); Voiced/(Voiced+Unvoiced) = 68%; Unvoiced/(Voiced+Unvoiced) = 32% Portuguese Female 1: Total Time = 60 secs; voiced = 32 secs (53.3%); unvoiced = 17 secs (28.3%); silence = 10 secs (17%); Voiced/(Voiced+Unvoiced) = 65.3%; Unvoiced/(Voiced+Unvoiced) = 34.7% Portuguese Female 2: Total Time = 60 secs; voiced = 30 secs (50.0%); unvoiced = 19 secs (31.6%); silence = 10 secs (17%); Voiced/(Voiced+Unvoiced) = 61.2%; Unvoiced/(Voiced+Unvoiced) = 38.7%
Audio Analysis front-end Automatic Classification of V/U/S speech frames: 10-fold Cross-Validation Confusion matrix (rows = classified as, columns = true class, values in %): voiced: 92.32 (voiced) / 4.17 (unvoiced) / 0.41 (silence) unvoiced: 6.8 (voiced) / 62.28 (unvoiced) / 34.66 (silence) silence: 0.88 (voiced) / 33.55 (unvoiced) / 64.92 (silence) Total Correct Classifications = 81.615 +/- 1.139% Total Classification Error = 18.385% (Theoretical random classifier correct classifications = 33.33%) Some voiced frames are being discarded as unvoiced  waste of relevant and scarce data… A few unvoiced and silence frames are being misclassified as voiced  contamination of the data to be analyzed
Audio Analysis front-end Voiced / Unvoiced / Silence (V/U/S) detection Advantages Only quasi-stationary parts of the speech signal are used Include most of the speaker information in a speech signal Avoids model degradation in LPC/LSP Potentially more robust to different speakers/languages Different languages may have distinct V/U/S stratification Speakers talk differently (e.g. more paused speech  more silence frames) Drawbacks May leave few data points per speech sub-segment Poor estimation of the covariance matrices number of data points (i.e. voiced frames) >= d(d+1)/2, with d = dim(cov matrix) = 10 (i.e. 10 LSP coefficients)  nr. data points / sub-segment >= 55 frames Not always guaranteed!!  use of dynamically sized windows Does this really work??
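The covariance-estimation constraint above can be made concrete with a trivial check (d = 10 LSP coefficients, as on the slide):

```python
def min_frames_for_cov(d=10):
    """A d-by-d covariance matrix has d(d+1)/2 free parameters, so at
    least that many data points (voiced frames) are needed per
    sub-segment for a non-degenerate estimate."""
    return d * (d + 1) // 2
```

With d = 10 this gives the 55-frame minimum quoted in the deck, which is why a dynamically sized window must grow until enough voiced frames have been collected.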
Speaker Coarse Segmentation
Speaker Coarse Segmentation Divergence Shape  Only uses  LSP features Assumes Gaussian Distribution Calculated between consecutive sub-segments Speech stream with 4 speech segments [Campbell97] [Lu2002]
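The divergence shape formula itself is in the original figure; as usually stated for [Campbell97], only the covariance terms of the two Gaussians survive, giving the sketch below (an illustrative reimplementation, not the deck's C++ routine):

```python
import numpy as np

def divergence_shape(c1, c2):
    """Divergence shape between two Gaussian models, keeping only the
    covariance-dependent term:
        D = 0.5 * tr[(C1 - C2)(inv(C2) - inv(C1))]
    It is symmetric and zero when the covariances coincide."""
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    return 0.5 * float(np.trace((c1 - c2) @ (np.linalg.inv(c2) - np.linalg.inv(c1))))
```

For identical covariances the distance is 0, and it grows as the two sub-segment covariances diverge.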
Speaker Coarse Segmentation Dynamic Threshold  [Lu2002]   Speaker change whenever:
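The threshold formula is in the original figure; one plausible reading of the [Lu2002] rule, with α and the number of previous distances as the tunable parameters listed later in the deck, is a scaled running average:

```python
def is_speaker_change(d_current, prev_dists, alpha=0.8):
    """Assumed reading of the dynamic threshold in [Lu2002]: flag a
    coarse speaker change when the newest inter-sub-segment distance
    exceeds alpha times the mean of the N previous distances."""
    threshold = alpha * sum(prev_dists) / len(prev_dists)
    return d_current > threshold
```

With alpha below 1 the threshold sits under the recent average, which is consistent with the high false-alarm rate of the coarse stage reported on the next slide.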
Speaker Coarse Segmentation Coarse Segmentation performance Presents high False Alarm Rate (FAR = Type I errors)   Possible solution: Use a Speaker Validation Strategy Should allow decreasing  FAR …   …  but should also avoid an increase in  Miss Detections  (MDR = Type II errors)  
Speaker Change Validation
Speaker Change Validation Bayesian Information Criterion (BIC) (1) Hypothesis 0: Single model θ z for the speaker data in segments X and Y High L 0  Same speaker in segments X and Y Low L 0  Different speakers in segments X and Y [figure: segments X and Y pooled into Z]
Speaker Change Validation Bayesian Information Criterion (BIC) (2) Hypothesis 1: Separate models θ x , θ y for the speakers in segments X and Y, respectively L 1  ≈ L 0  Same speaker in segments X and Y L 1  >> L 0  Different speakers in segments X and Y [figure: segments X and Y modeled separately]
Speaker Change Validation Bayesian Information Criterion (BIC) (3) Log Likelihood Ratio (LLR) However, this is not a fair comparison… The models do not have the same number of parameters! More complex models always fit the data better They should be penalized when compared with simpler models  Δ K  =  difference in the nr. of parameters between the two hypotheses  Need to define a  Threshold …   No  Threshold  needed! Or is it!?
Speaker Change Validation Bayesian Information Criterion (BIC) (4) Using  Gaussian models  for  θ x,  θ y  and  θ z :   Validate Speaker Change Point when: Threshold Free!   …  but  λ  must be set…  
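With full-covariance Gaussians the decision rule on the slide (formula in the original figure) can be sketched as below. Segment matrices are frames × features, λ = 0.6 as tuned later in the deck, and a positive ΔBIC validates the candidate change point; the exact penalty bookkeeping is the standard one and is an assumption here:

```python
import numpy as np

def delta_bic(x, y, lam=0.6):
    """BIC-based validation of a change point between segments x and y.
    Compares one pooled Gaussian (H0) against two separate Gaussians
    (H1); positive result -> accept the speaker change."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    z = np.vstack([x, y])
    nx, ny, n = len(x), len(y), len(x) + len(y)
    d = x.shape[1]
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # model-complexity penalty: d mean terms + d(d+1)/2 covariance terms
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - nx * logdet(x) - ny * logdet(y)) - penalty
```

Two clearly different sources give a positive ΔBIC; two draws from the same source give a negative one, with the penalty absorbing the sampling noise.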
Speaker Change Validation Bayesian Information Criterion (BIC) (5) BIC needs large amounts of data for good accuracy! Each speech segment only contains  55 data points … too few! Solution: Speaker Model Update…
Speaker Model Update
Speaker Model Update “ Quasi-GMM”  speaker modeling  [Lu2002]   Approximation to  GMM  (Gaussian Mixture Models)   using  segmental clustering  of Gaussian Models   instead of  EM Gaussian models incrementally updated with new arriving speaker data  less accurate than GMM… …  but feasible for real-time operation
Speaker Model Update “ Quasi-GMM”  speaker modeling  [Lu2002]   Segmental Clustering Start with one Gaussian Mixture (~GMM1) DO: Update mixture as speaker data is received WHILE: dissimilarity between mixture model before and after update is  sufficiently  small Create a new Gaussian mixture (GMMn+1) Up to a maximum of 32 mixtures (GMM32) Mixture Weight ( w m ):
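The segmental-clustering loop above can be sketched as follows. The incremental Gaussian merge is the standard weighted pooling of counts, means and covariances; the dissimilarity measure (Frobenius distance between the covariance before and after the update) and its threshold are illustrative assumptions, not the exact [Lu2002] criterion:

```python
import numpy as np

class QuasiGMM:
    """Segmental clustering: update the current Gaussian with each new
    speaker segment; open a new mixture (up to 32) when the update
    changes the model too much."""
    def __init__(self, max_mixtures=32, spawn_thresh=1.0):
        self.mixtures = []  # list of [count, mean, covariance]
        self.max_mixtures = max_mixtures
        self.spawn_thresh = spawn_thresh

    def _merge(self, m, data):
        # pool an existing Gaussian with a new batch of frames
        n0, mu0, c0 = m
        n1, mu1, c1 = len(data), data.mean(0), np.cov(data, rowvar=False)
        n = n0 + n1
        mu = (n0 * mu0 + n1 * mu1) / n
        d0, d1 = mu0 - mu, mu1 - mu
        cov = (n0 * (c0 + np.outer(d0, d0)) + n1 * (c1 + np.outer(d1, d1))) / n
        return [n, mu, cov]

    def update(self, data):
        data = np.asarray(data, dtype=float)
        if not self.mixtures:
            self.mixtures.append([len(data), data.mean(0), np.cov(data, rowvar=False)])
            return
        cur = self.mixtures[-1]
        merged = self._merge(cur, data)
        # illustrative dissimilarity between model before/after update
        dissim = np.linalg.norm(merged[2] - cur[2])
        if dissim > self.spawn_thresh and len(self.mixtures) < self.max_mixtures:
            self.mixtures.append([len(data), data.mean(0), np.cov(data, rowvar=False)])
        else:
            self.mixtures[-1] = merged

    def weights(self):
        # mixture weight w_m proportional to the data count per mixture
        total = sum(m[0] for m in self.mixtures)
        return [m[0] / total for m in self.mixtures]
```

Feeding two similar batches keeps a single mixture; a batch from a very different distribution spawns a second one, with weights proportional to the accumulated frame counts.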
Speaker Model Update “ Quasi-GMM”  speaker modeling  [Lu2002]   Gaussian Model on-line updating  μ-dependent terms are discarded  [Lu2002] Increases robustness to changes in noise and background sound ~ Cepstral Mean Subtraction (CMS)
Speaker Change Validation
Speaker Change Validation BIC and Quasi-GMM Speaker Models Validate Speaker Change Point when:
Complete System
Complete System
Experimental Results Speaker Datasets: INESC Porto dataset: Sources: MPEG-7 Content Set CD1  [MPEG.N2467] broadcast news from assorted sources male, female, various languages 43 minutes of speaker audio 16 bit @ 22.05kHz  PCM,  single-channel Ground Truth 181 speaker changes Manually annotated Speaker segments durations Maximum ~= 120 secs Minimum = 2.25 secs  Mean = 19.81 secs  Std.Dev. = 27.08 secs
Experimental Results Speaker Datasets: TIMIT/AUTH dataset: Sources: TIMIT  database   630 English speakers 6300 sentences 56 minutes of speaker audio 16 bit @ 22.05kHz  PCM,  single-channel Ground Truth 983 speaker changes Manually annotated Speaker segments durations Maximum ~= 12 secs Minimum = 1.139 secs Mean = 3.28 secs  Std.Dev. = 1.52 secs
Experimental Results Efficiency Measures
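The exact formulas for the efficiency measures are in the original figure; under one common convention (an assumption here), with CFC correctly found changes, FA false alarms and MD missed detections:

```python
def efficiency_measures(cfc, fa, md):
    """Assumed convention for the two rates used in the deck:
    FAR = FA / (CFC + FA): fraction of detected changes that are false
    MDR = MD / (CFC + MD): fraction of true changes that were missed"""
    far = fa / (cfc + fa)
    mdr = md / (cfc + md)
    return far, mdr
```

For example, 8 correct detections with 2 false alarms and 2 misses gives FAR = MDR = 0.2.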
Experimental Results System’s Parameters fine-tuning Parameters Dynamic Threshold:  α  and nr. of previous frames BIC:  λ  qGMM: mixture creation thresholds Detection Tolerance Interval: set to [-1;+1] secs. Tune the system to a higher FAR & lower MDR: Missed speaker changes cannot be recovered by subsequent processing False speaker changes will hopefully be discarded by subsequent processing Speaker Tracking module (future work) Merge adjacent segments identified as belonging to the same speaker
Experimental Results Dynamic Threshold and BIC parameters  ( α  and  λ ) Best Results found for:  α   = 0.8  λ  = 0.6
Experimental Results INESC Porto dataset evaluation (1) INESC System ver.1  INESC System ver.2  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC  Features: LSP Voiced Filter enabled On-line processing (realtime) Uses BIC
Experimental Results TIMIT/AUTH dataset evaluation (1) INESC System ver.1  INESC System ver.2  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC  Features: LSP Voiced Filter enabled On-line processing (realtime) Uses BIC
Experimental Results INESC Porto dataset evaluation (2) INESC System ver.2  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC  AUTH System 1  Features: AudioSpectrumCentroid AudioWaveformEnvelope Multiple-pass (non-realtime) Uses BIC
Experimental Results TIMIT/AUTH dataset evaluation (2) AUTH System 1  INESC System ver.2  Features: AudioSpectrumCentroid AudioWaveformEnvelope Multiple-pass (non-realtime) Uses BIC  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC
Experimental Results TIMIT/AUTH dataset evaluation (3) AUTH System 2  INESC System ver.2  Features: DFT Mag STE AudioWaveformEnvelope AudioSpectrumCentroid MFCC Fast system (realtime?) Uses BIC  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC
Experimental Results Time Shifts on the detected Speaker Change Points  Detection tolerance interval = [-1, 1] secs INESC System ver.1
Achievements Software C++ routines Numerical routines Matrix Determinant Polynomial Roots Levinson-Durbin LPC (adapted from Marsyas) LSP Divergence and Bhattacharyya Shape metrics BIC Quasi-GMM modeling class Automatic Speaker Segmentation prototype application As a Library (DLL) Integrated into “4VDO - Annotator” As a stand-alone application Reports VISNET deliverables D29, D30, D31, D40, D41 Publications (co-author) “Speaker Change Detection using BIC: A comparison on two datasets” Accepted at ISCCSP2006 “Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches” Submitted to ICME2006 DEMO
Conclusions Open Issues… Voiced detection procedure Results should improve…  Parameter fine-tuning Dynamic Threshold BIC parameter  Quasi-GMM Model Further Work Audio Features Evaluate other features for speaker segmentation, tracking and identification Pitch MFCC … Speaker Tracking Clustering of speaker segments Evaluation Ground Truth    Needs manual annotation work Speaker Identification Speaker Model Training Evaluation Ground Truth    Needs manual annotation work
Contributors INESC Porto Rui Costa Jaime Cardoso Luís Filipe Teixeira Sílvio Macedo VISNET Aristotle University of Thessaloniki (AUTH), Greece Margarita Kotti Emmanouil Benetos Constantine Kotropoulos
Thank you! Questions? [email_address]