Real-time Automatic Speaker Segmentation Luís Gustavo Martins UTM – INESC Porto [email_address] http://www.inescporto.pt/~lmartins LabMeetings March 16, 2006 INESC Porto
Notice This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/pt/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Summary System Overview Audio Analysis front-end Speaker Coarse Segmentation Speaker Change Validation Speaker Model Update Experimental Results Achievements Conclusions
Scope Objective Development of a Real-time, Automatic Speaker Segmentation module Already having in mind for future development: Speaker Tracking Speaker Identification Challenges No prior knowledge about the number and identities of the speakers On-line and Real-time operation Audio data is not available beforehand Must use only the small amounts of speaker data as they arrive  iterative and computationally intensive methods are unfeasible
System Overview
Audio Analysis front-end
Audio Analysis front-end Front-end Processing 8kHz, 16 bit, pre-emphasized, mono speech streams 25ms analysis  frames  with no overlap Speech  segments  with 2.075 secs and 1.4 secs overlap Consecutive  sub-segments  with 1.375 secs each
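The windowing scheme above can be sketched as follows (sample rate and window sizes taken from the slide; function names are illustrative, and `round()` is used so the 0.675 s hop implied by the 2.075 s / 1.4 s overlap comes out exact):

```python
def frame_bounds(n_samples, sr=8000, frame_ms=25):
    """Split a stream into non-overlapping 25 ms analysis frames
    (200 samples at 8 kHz)."""
    flen = sr * frame_ms // 1000
    return [(s, s + flen) for s in range(0, n_samples - flen + 1, flen)]

def segment_starts(n_samples, sr=8000, seg_s=2.075, overlap_s=1.4):
    """Start samples of 2.075 s speech segments with 1.4 s overlap,
    i.e. a 0.675 s hop between consecutive segments."""
    seg_len = round(sr * seg_s)
    hop = round(sr * (seg_s - overlap_s))
    return list(range(0, n_samples - seg_len + 1, hop))
```

At 8 kHz, five seconds of audio yields segments starting every 5400 samples (0.675 s).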
Audio Analysis front-end Feature Extraction (1) Speaker Modeling 10th-order LPC / LSP Source / Filter approach Other possible features… MFCC Pitch …
Audio Analysis front-end LPC Modeling (1)  [Rabiner93, Campbell97]   Linear Predictive Coding Order  p Yule-Walker equations Durbin’s recursive algorithm Toeplitz autocorrelation matrix Autocorrelation method
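The deck's actual LPC routines were C++ adapted from Marsyas; a minimal, illustrative numpy reimplementation of the autocorrelation method with the Levinson-Durbin recursion over the Toeplitz system is:

```python
import numpy as np

def lpc(x, order=10):
    """LPC coefficients a[0..order] (a[0] = 1) via the autocorrelation
    method, solving the Yule-Walker equations with Levinson-Durbin."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # autocorrelation lags r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For a first-order autoregressive signal x[n] = 0.5 x[n-1] + e[n], the order-1 fit recovers a[1] close to -0.5.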
Audio Analysis front-end LPC Modeling (2) Whitening Filter  [figure: FFT spectrum with pitch harmonics vs. smooth LPC spectrum envelope]
Audio Analysis front-end LSP Modeling [Campbell97] Line Spectral Pairs More robust to quantization, and commonly used in speech coding Derived from the LPC a_k coefficients Zeros of A(z) mapped to the unit circle in the Z-Domain Use of a pair of (p+1)-order polynomials
Speaker Modeling Speaker information is mostly contained in the voiced part of the speech signal… Can you identify who’s speaking? LPC / LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals Unvoiced/Silence data degrades speaker model accuracy!  Select only voiced data for processing… Audio Analysis front-end [figure: unvoiced vs. voiced speech frames]
Audio Analysis front-end Voiced / Unvoiced / Silence (V/U/S) detection Feature Extraction (2) Short Time Energy   (STE)    silence detection Zero Crossing Rate   (ZCR)  voiced / unvoiced detection
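The two frame-level features above are cheap to compute, which is what makes the V/U/S front-end real-time. A minimal sketch (normalizations are an assumption; the slide does not specify them):

```python
import numpy as np

def short_time_energy(frame):
    """Mean squared amplitude of one analysis frame: low for silence."""
    f = np.asarray(frame, dtype=float)
    return float(np.mean(f * f))

def zero_crossing_rate(frame):
    """Fraction of sign changes between consecutive samples:
    low for (quasi-periodic) voiced speech, high for noise-like
    unvoiced speech."""
    s = np.sign(np.asarray(frame, dtype=float))
    s[s == 0] = 1.0  # treat exact zeros as positive
    return float(np.mean(s[1:] != s[:-1]))
```

An all-silence frame has zero energy, while a maximally oscillating frame has a ZCR of 1.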
Audio Analysis front-end V/U/S speech classes modeled by 2-D Gaussian Distributions Simple and Fast  real-time operation Dataset: ~4 minutes of manually annotated speech signals 2 male and 2 female Portuguese speakers [figure: voiced, unvoiced and silence clusters in the ZCR vs. STE plane]
Audio Analysis front-end Manual Annotation of V/U/S segments in a speech signal
Audio Analysis front-end V/U/S Speech dataset Voiced / Unvoiced / Silence stratification in manually segmented audio files: Portuguese Male 1: Total Time = 60 secs; voiced = 37 secs (62%); unvoiced = 12 secs (20%); silence = 10 secs (17%); Voiced/(Voiced+Unvoiced) = 76%; Unvoiced/(Voiced+Unvoiced) = 24% Portuguese Male 2: Total Time = 60 secs; voiced = 30 secs (50.0%); unvoiced = 14 secs (23.3%); silence = 13 secs (21.6%); Voiced/(Voiced+Unvoiced) = 68%; Unvoiced/(Voiced+Unvoiced) = 32% Portuguese Female 1: Total Time = 60 secs; voiced = 32 secs (53.3%); unvoiced = 17 secs (28.3%); silence = 10 secs (17%); Voiced/(Voiced+Unvoiced) = 65.3%; Unvoiced/(Voiced+Unvoiced) = 34.7% Portuguese Female 2: Total Time = 60 secs; voiced = 30 secs (50.0%); unvoiced = 19 secs (31.6%); silence = 10 secs (17%); Voiced/(Voiced+Unvoiced) = 61.2%; Unvoiced/(Voiced+Unvoiced) = 38.7%
Audio Analysis front-end Automatic Classification of V/U/S speech frames: 10-fold Cross-Validation Confusion matrix (rows = classified as, columns = true class, values in %): voiced: 92.32 (voiced) / 4.17 (unvoiced) / 0.41 (silence) unvoiced: 6.8 (voiced) / 62.28 (unvoiced) / 34.66 (silence) silence: 0.88 (voiced) / 33.55 (unvoiced) / 64.92 (silence) Total Correct Classifications = 81.615 +/- 1.139% Total Classification Error = 18.385% (Theoretical random classifier correct classifications = 33.33%) Some voiced frames are being discarded as unvoiced  waste of relevant and scarce data… A few unvoiced and silence frames are being misclassified as voiced  contamination of the data to be analyzed
Audio Analysis front-end Voiced / Unvoiced / Silence (V/U/S) detection Advantages Only quasi-stationary parts of the speech signal are used Include most of the speaker information in a speech signal Avoids model degradation in LPC/LSP Potentially more robust to different speakers/languages Different languages may have distinct V/U/S stratification Speakers talk differently (e.g. more paused speech  more silence frames) Drawbacks May leave few data points per speech sub-segment Poor estimation of the covariance matrices number of data points (i.e. voiced frames) >= d(d+1)/2, with d = dim(cov matrix) = 10 (i.e. 10 LSP coefficients)  nr. data points / sub-segment >= 55 frames Not always guaranteed!!  use of dynamically sized windows Does this really work??
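The covariance-estimation constraint above can be made concrete with a trivial check (d = 10 LSP coefficients, as on the slide):

```python
def min_frames_for_cov(d=10):
    """A d-by-d covariance matrix has d(d+1)/2 free parameters, so at
    least that many data points (voiced frames) are needed per
    sub-segment for a non-degenerate estimate."""
    return d * (d + 1) // 2
```

With d = 10 this gives the 55-frame minimum quoted in the deck, which is why a dynamically sized window must grow until enough voiced frames have been collected.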
Speaker Coarse Segmentation
Speaker Coarse Segmentation Divergence Shape  Only uses  LSP features Assumes Gaussian Distribution Calculated between consecutive sub-segments Speech stream with 4 speech segments [Campbell97] [Lu2002]
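The divergence shape formula itself is in the original figure; as usually stated for [Campbell97], only the covariance terms of the two Gaussians survive, giving the sketch below (an illustrative reimplementation, not the deck's C++ routine):

```python
import numpy as np

def divergence_shape(c1, c2):
    """Divergence shape between two Gaussian models, keeping only the
    covariance-dependent term:
        D = 0.5 * tr[(C1 - C2)(inv(C2) - inv(C1))]
    It is symmetric and zero when the covariances coincide."""
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    return 0.5 * float(np.trace((c1 - c2) @ (np.linalg.inv(c2) - np.linalg.inv(c1))))
```

For identical covariances the distance is 0, and it grows as the two sub-segment covariances diverge.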
Speaker Coarse Segmentation Dynamic Threshold  [Lu2002]   Speaker change whenever:
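The threshold formula is in the original figure; one plausible reading of the [Lu2002] rule, with α and the number of previous distances as the tunable parameters listed later in the deck, is a scaled running average:

```python
def is_speaker_change(d_current, prev_dists, alpha=0.8):
    """Assumed reading of the dynamic threshold in [Lu2002]: flag a
    coarse speaker change when the newest inter-sub-segment distance
    exceeds alpha times the mean of the N previous distances."""
    threshold = alpha * sum(prev_dists) / len(prev_dists)
    return d_current > threshold
```

With alpha below 1 the threshold sits under the recent average, which is consistent with the high false-alarm rate of the coarse stage reported on the next slide.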
Speaker Coarse Segmentation Coarse Segmentation performance Presents high False Alarm Rate (FAR = Type I errors)   Possible solution: Use a Speaker Validation Strategy Should allow decreasing  FAR …   …  but should also avoid an increase in  Miss Detections  (MDR = Type II errors)  
Speaker Change Validation
Speaker Change Validation Bayesian Information Criterion (BIC) (1) Hypothesis 0: Single model θ z for the speaker data in segments X and Y High L 0  Same speaker in segments X and Y Low L 0  Different speakers in segments X and Y [figure: segments X and Y pooled into Z]
Speaker Change Validation Bayesian Information Criterion (BIC) (2) Hypothesis 1: Separate models θ x , θ y for the speakers in segments X and Y, respectively L 1  ≈ L 0  Same speaker in segments X and Y L 1  >> L 0  Different speakers in segments X and Y [figure: segments X and Y modeled separately]
Speaker Change Validation Bayesian Information Criterion (BIC) (3) Log Likelihood Ratio (LLR) However, this is not a fair comparison… The models do not have the same number of parameters! More complex models always fit the data better They should be penalized when compared with simpler models  Δ K  =  difference in the nr. of parameters between the two hypotheses  Need to define a  Threshold …   No  Threshold  needed! Or is it!?
Speaker Change Validation Bayesian Information Criterion (BIC) (4) Using  Gaussian models  for  θ x,  θ y  and  θ z :   Validate Speaker Change Point when: Threshold Free!   …  but  λ  must be set…  
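With full-covariance Gaussians the decision rule on the slide (formula in the original figure) can be sketched as below. Segment matrices are frames × features, λ = 0.6 as tuned later in the deck, and a positive ΔBIC validates the candidate change point; the exact penalty bookkeeping is the standard one and is an assumption here:

```python
import numpy as np

def delta_bic(x, y, lam=0.6):
    """BIC-based validation of a change point between segments x and y.
    Compares one pooled Gaussian (H0) against two separate Gaussians
    (H1); positive result -> accept the speaker change."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    z = np.vstack([x, y])
    nx, ny, n = len(x), len(y), len(x) + len(y)
    d = x.shape[1]
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # model-complexity penalty: d mean terms + d(d+1)/2 covariance terms
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - nx * logdet(x) - ny * logdet(y)) - penalty
```

Two clearly different sources give a positive ΔBIC; two draws from the same source give a negative one, with the penalty absorbing the sampling noise.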
Speaker Change Validation Bayesian Information Criterion (BIC) (5) BIC needs large amounts of data for good accuracy! Each speech segment only contains  55 data points … too few! Solution: Speaker Model Update…
Speaker Model Update
Speaker Model Update “ Quasi-GMM”  speaker modeling  [Lu2002]   Approximation to  GMM  (Gaussian Mixture Models)   using  segmental clustering  of Gaussian Models   instead of  EM Gaussian models incrementally updated with new arriving speaker data  less accurate than GMM… …  but feasible for real-time operation
Speaker Model Update “ Quasi-GMM”  speaker modeling  [Lu2002]   Segmental Clustering Start with one Gaussian Mixture (~GMM1) DO: Update mixture as speaker data is received WHILE: dissimilarity between mixture model before and after update is  sufficiently  small Create a new Gaussian mixture (GMMn+1) Up to a maximum of 32 mixtures (GMM32) Mixture Weight ( w m ):
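The segmental-clustering loop above can be sketched as follows. The incremental Gaussian merge is the standard weighted pooling of counts, means and covariances; the dissimilarity measure (Frobenius distance between the covariance before and after the update) and its threshold are illustrative assumptions, not the exact [Lu2002] criterion:

```python
import numpy as np

class QuasiGMM:
    """Segmental clustering: update the current Gaussian with each new
    speaker segment; open a new mixture (up to 32) when the update
    changes the model too much."""
    def __init__(self, max_mixtures=32, spawn_thresh=1.0):
        self.mixtures = []  # list of [count, mean, covariance]
        self.max_mixtures = max_mixtures
        self.spawn_thresh = spawn_thresh

    def _merge(self, m, data):
        # pool an existing Gaussian with a new batch of frames
        n0, mu0, c0 = m
        n1, mu1, c1 = len(data), data.mean(0), np.cov(data, rowvar=False)
        n = n0 + n1
        mu = (n0 * mu0 + n1 * mu1) / n
        d0, d1 = mu0 - mu, mu1 - mu
        cov = (n0 * (c0 + np.outer(d0, d0)) + n1 * (c1 + np.outer(d1, d1))) / n
        return [n, mu, cov]

    def update(self, data):
        data = np.asarray(data, dtype=float)
        if not self.mixtures:
            self.mixtures.append([len(data), data.mean(0), np.cov(data, rowvar=False)])
            return
        cur = self.mixtures[-1]
        merged = self._merge(cur, data)
        # illustrative dissimilarity between model before/after update
        dissim = np.linalg.norm(merged[2] - cur[2])
        if dissim > self.spawn_thresh and len(self.mixtures) < self.max_mixtures:
            self.mixtures.append([len(data), data.mean(0), np.cov(data, rowvar=False)])
        else:
            self.mixtures[-1] = merged

    def weights(self):
        # mixture weight w_m proportional to the data count per mixture
        total = sum(m[0] for m in self.mixtures)
        return [m[0] / total for m in self.mixtures]
```

Feeding two similar batches keeps a single mixture; a batch from a very different distribution spawns a second one, with weights proportional to the accumulated frame counts.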
Speaker Model Update “ Quasi-GMM”  speaker modeling  [Lu2002]   Gaussian Model on-line updating  μ-dependent terms are discarded  [Lu2002] Increases robustness to changes in noise and background sound ~ Cepstral Mean Subtraction (CMS)
Speaker Change Validation
Speaker Change Validation BIC and Quasi-GMM Speaker Models Validate Speaker Change Point when:
Complete System
Complete System
Experimental Results Speaker Datasets: INESC Porto dataset: Sources: MPEG-7 Content Set CD1  [MPEG.N2467] broadcast news from assorted sources male, female, various languages 43 minutes of speaker audio 16 bit @ 22.05kHz  PCM,  single-channel Ground Truth 181 speaker changes Manually annotated Speaker segments durations Maximum ~= 120 secs Minimum = 2.25 secs  Mean = 19.81 secs  Std.Dev. = 27.08 secs
Experimental Results Speaker Datasets: TIMIT/AUTH dataset: Sources: TIMIT  database   630 English speakers 6300 sentences 56 minutes of speaker audio 16 bit @ 22.05kHz  PCM,  single-channel Ground Truth 983 speaker changes Manually annotated Speaker segments durations Maximum ~= 12 secs Minimum = 1.139 secs Mean = 3.28 secs  Std.Dev. = 1.52 secs
Experimental Results Efficiency Measures
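The exact formulas for the efficiency measures are in the original figure; under one common convention (an assumption here), with CFC correctly found changes, FA false alarms and MD missed detections:

```python
def efficiency_measures(cfc, fa, md):
    """Assumed convention for the two rates used in the deck:
    FAR = FA / (CFC + FA): fraction of detected changes that are false
    MDR = MD / (CFC + MD): fraction of true changes that were missed"""
    far = fa / (cfc + fa)
    mdr = md / (cfc + md)
    return far, mdr
```

For example, 8 correct detections with 2 false alarms and 2 misses gives FAR = MDR = 0.2.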
Experimental Results System’s Parameters fine-tuning Parameters Dynamic Threshold:  α  and nr. of previous frames BIC:  λ  qGMM: mixture creation thresholds Detection Tolerance Interval: set to [-1;+1] secs. Tune the system to a higher FAR & lower MDR: Missed speaker changes cannot be recovered by subsequent processing False speaker changes will hopefully be discarded by subsequent processing Speaker Tracking module (future work) Merge adjacent segments identified as belonging to the same speaker
Experimental Results Dynamic Threshold and BIC parameters  ( α  and  λ ) Best Results found for:  α   = 0.8  λ  = 0.6
Experimental Results INESC Porto dataset evaluation (1) INESC System ver.1  INESC System ver.2  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC  Features: LSP Voiced Filter enabled On-line processing (realtime) Uses BIC
Experimental Results TIMIT/AUTH dataset evaluation (1) INESC System ver.1  INESC System ver.2  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC  Features: LSP Voiced Filter enabled On-line processing (realtime) Uses BIC
Experimental Results INESC Porto dataset evaluation (2) INESC System ver.2  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC  AUTH System 1  Features: AudioSpectrumCentroid AudioWaveformEnvelope Multiple-pass (non-realtime) Uses BIC
Experimental Results TIMIT/AUTH dataset evaluation (2) AUTH System 1  INESC System ver.2  Features: AudioSpectrumCentroid AudioWaveformEnvelope Multiple-pass (non-realtime) Uses BIC  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC
Experimental Results TIMIT/AUTH dataset evaluation (3) AUTH System 2  INESC System ver.2  Features: DFT Mag STE AudioWaveformEnvelope AudioSpectrumCentroid MFCC Fast system (realtime?) Uses BIC  Features: LSP Voiced Filter disabled On-line processing (realtime) Uses BIC
Experimental Results Time Shifts on the detected Speaker Change Points  Detection tolerance interval = [-1, 1] secs INESC System ver.1
Achievements Software C++ routines Numerical routines Matrix Determinant Polynomial Roots Levinson-Durbin LPC (adapted from Marsyas) LSP Divergence and Bhattacharyya Shape metrics BIC Quasi-GMM modeling class Automatic Speaker Segmentation prototype application As a Library (DLL) Integrated into “4VDO - Annotator” As a stand-alone application Reports VISNET deliverables D29, D30, D31, D40, D41 Publications (co-author) “Speaker Change Detection using BIC: A comparison on two datasets” Accepted at ISCCSP2006 “Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches” Submitted to ICME2006 DEMO
Conclusions Open Issues… Voiced detection procedure Results should improve…  Parameter fine-tuning Dynamic Threshold BIC parameter  Quasi-GMM Model Further Work Audio Features Evaluate other features for speaker segmentation, tracking and identification Pitch MFCC … Speaker Tracking Clustering of speaker segments Evaluation Ground Truth    Needs manual annotation work Speaker Identification Speaker Model Training Evaluation Ground Truth    Needs manual annotation work
Contributors INESC Porto Rui Costa Jaime Cardoso Luís Filipe Teixeira Sílvio Macedo VISNET Aristotle University of Thessaloniki (AUTH), Greece Margarita Kotti Emmanouil Benetos Constantine Kotropoulos
Thank you! Questions? [email_address]