Improved Cepstra Minimum-Mean-Square-Error Noise Reduction Algorithm for Robust Speech Recognition
Jinyu Li, Yan Huang, Yifan Gong
Microsoft
Robust Front-End
• Conventional noise-robust front-ends work very well with Gaussian mixture models (GMMs).
• Single-channel robust front-ends were reported to be unhelpful for multi-style deep neural network (DNN) training.
• The DNN's layer-by-layer structure provides a feature extraction strategy that automatically derives powerful noise-resistant features.
t-SNE Plots for Paired Clean and Noisy Utterances
• Training set: LFB, Layer 1, Layer 3, Layer 5
• Testing set: Layer 1, Layer 3, Layer 5
Robust Front-End
• Conventional noise-robust front-ends work very well with Gaussian mixture models (GMMs).
• Single-channel robust front-ends were reported to be unhelpful for multi-style deep neural network (DNN) training, although multi-channel signal processing still helps.
• In this study, we show that a single-channel robust front-end is still beneficial to deep learning models as long as it is well designed.
Cepstra Minimum Mean Square Error
(CMMSE)
• Reported to be very effective against noise when used with GMM-based acoustic models.
• The CMMSE solution for each MFCC dimension k:
$$\hat{c}_x(t,k) = E\left[c_x(t,k) \mid \mathbf{m}_y(t)\right] = \sum_b a_{k,b}\, E\left[\log m_x(t,b) \mid \mathbf{m}_y(t)\right]$$
• The problem reduces to finding the log-MMSE estimator of each Mel filter-bank output, $\hat{m}_x(t,b) \approx \exp\left(E\left[\log m_x(t,b) \mid m_y(t,b)\right]\right)$, given a weak independence assumption between Mel filter-bank channels.
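Because the cepstrum is a linear transform of the log Mel filter-bank outputs, estimating each log output is enough to obtain the cepstral estimate. A minimal sketch, assuming the coefficients a(k,b) form an unnormalized DCT-II basis (the slides do not define them) and using made-up Mel energies:

```python
import math

def dct_coeff(k, b, num_bins):
    """Assumed DCT-II basis a[k][b]; the slides do not spell out a_{k,b}."""
    return math.cos(math.pi * k * (b + 0.5) / num_bins)

def cepstra_from_log_mel(log_mel, num_ceps):
    """c(k) = sum over b of a[k][b] * log_mel[b], for k = 0 .. num_ceps - 1."""
    num_bins = len(log_mel)
    return [
        sum(dct_coeff(k, b, num_bins) * log_mel[b] for b in range(num_bins))
        for k in range(num_ceps)
    ]

# Hypothetical cleaned Mel energies for one frame (4 channels).
clean_mel_estimate = [2.0, 4.0, 4.0, 2.0]
log_mel = [math.log(m) for m in clean_mel_estimate]
ceps = cepstra_from_log_mel(log_mel, num_ceps=3)
```

Here `ceps[0]` is the sum of the log energies, and `ceps[1]` vanishes because the input is symmetric across channels.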
4-Step Processing of CMMSE
• Voice activity detection (VAD): detects the speech probability at every time-
filterbank bin;
• Noise spectrum estimation: uses the estimated speech probability to update
the estimation of noise spectrum;
• Gain estimation: uses the noisy speech spectrum and the estimated noise
spectrum to calculate the gain of every time-filterbank bin;
• Noise reduction: applies the estimated gain to the noisy speech spectrum to
generate the clean spectrum.
CMMSE Block Diagram
• Main path: input signal → spectrum calculation → Mel filtering → noise reduction → cleaned filter-bank spectrum
• Gain path: VAD (MCRA) → noise spectrum estimation (MCRA) → filter-bank gain estimation → noise reduction
VAD
• CMMSE uses a minima controlled recursive averaging (MCRA) noise tracker (Cohen and Berdugo, 2002) to estimate the speech probability $p(t,b)$ in each filterbank bin b at time t.
• I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12-15, 2002.
Noise Spectrum Estimation
• The noise power spectrum $m_n(t,b)$ is estimated using MCRA (Cohen and Berdugo, 2002) as
$$m_n(t,b) = \alpha\, m_n(t-1,b) + (1.0 - \alpha)\, m_y(t,b)$$
with
$$\alpha = \alpha_D + (1.0 - \alpha_D)\, p(t,b)$$
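The recursion can be sketched for a single frame as follows. The value of `alpha_d` and the input numbers are illustrative assumptions, not taken from the slides, and the speech probabilities are supplied directly rather than computed by MCRA:

```python
def update_noise(prev_noise, noisy, speech_prob, alpha_d=0.95):
    """One recursive noise update per filterbank bin.

    alpha_d = 0.95 is an illustrative smoothing constant (assumption).
    """
    updated = []
    for m_n, m_y, p in zip(prev_noise, noisy, speech_prob):
        # A higher speech probability pushes alpha toward 1,
        # so the noise estimate is updated less from the current frame.
        alpha = alpha_d + (1.0 - alpha_d) * p
        updated.append(alpha * m_n + (1.0 - alpha) * m_y)
    return updated

prev_noise = [1.0, 1.0, 1.0]      # m_n(t-1, b)
noisy = [3.0, 3.0, 3.0]           # m_y(t, b)
speech_prob = [0.0, 0.5, 1.0]     # p(t, b), assumed given
noise = update_noise(prev_noise, noisy, speech_prob)
```

With these inputs the estimate moves most in the bin where speech is absent (p = 0) and stays frozen where speech is certain (p = 1).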
Gain Estimation
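• The gain of each time-filterbank bin:
$$G(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)} \exp\left(\frac{1}{2}\int_{v(t,b)}^{\infty} \frac{e^{-\tau}}{\tau}\, d\tau\right)$$
• Posterior SNR: $\gamma(t,b) = \frac{m_y(t,b)}{m_n(t,b)}$
• The prior SNR is calculated using a decision-directed approach (DDA):
$$\xi(t,b) = \beta\, G(t-1,b)\, \gamma(t-1,b) + (1.0-\beta)\, \tilde{\xi}(t,b)$$
• Instantaneous estimate: $\tilde{\xi}(t,b) = \max(\gamma(t,b) - 1,\ 0.0)$
• $v(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)}\, \gamma(t,b)$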
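The integral in the gain is the exponential integral E1(v), so the gain for one bin can be sketched with the standard library alone. This is a sketch under stated assumptions: `beta = 0.98`, the small floor on v, and the example numbers are illustrative choices, and the E1 routine (power series for small arguments, continued fraction otherwise) is one common way to evaluate it, not something the slides specify:

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp_integral_e1(x):
    """E1(x): integral from x to infinity of exp(-tau)/tau, for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 40):
            term *= -x / k          # term is now (-x)^k / k!
            total += term / k
        return -EULER_GAMMA - math.log(x) - total
    # Continued fraction, evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-x)

def prior_snr_dda(prev_gain, prev_post_snr, post_snr, beta=0.98):
    """Decision-directed prior SNR; beta = 0.98 is an illustrative value."""
    instantaneous = max(post_snr - 1.0, 0.0)
    return beta * prev_gain * prev_post_snr + (1.0 - beta) * instantaneous

def lsa_gain(prior_snr, post_snr):
    """G = xi/(1+xi) * exp(0.5 * E1(v)), with v = xi/(1+xi) * gamma."""
    ratio = prior_snr / (1.0 + prior_snr)
    v = max(ratio * post_snr, 1e-10)   # floor is a numerical guard (assumption)
    return ratio * math.exp(0.5 * exp_integral_e1(v))

post_snr = 4.0 / 1.0   # gamma = m_y / m_n for hypothetical m_y = 4, m_n = 1
prior_snr = prior_snr_dda(prev_gain=0.5, prev_post_snr=4.0, post_snr=post_snr)
gain = lsa_gain(prior_snr, post_snr)
```

The exponential factor is always at least 1, so the resulting gain is never below the Wiener-style ratio ξ/(1+ξ).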
Noise Reduction
• $\hat{m}_x(t,b) \approx G(t,b)\, m_y(t,b)$
Speech Probability Estimation
• Reliable speech probability estimation is critical to noise spectrum estimation, because $p(t,b)$ sets the smoothing factor $\alpha$ in $m_n(t,b) = \alpha\, m_n(t-1,b) + (1.0-\alpha)\, m_y(t,b)$.
• We use IMCRA (Cohen, 2003) to estimate the speech probability $p(t,b)$ in each time-filterbank bin.
• I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Trans. Speech and Audio Processing, vol. 11, no. 5, pp. 466-475, 2003.
Improving CMMSE
• Main path: input signal → spectrum calculation → Mel filtering → noise reduction → cleaned filter-bank spectrum
• Gain path: VAD (IMCRA) → noise spectrum estimation (IMCRA) → filter-bank gain estimation → noise reduction
Refined Prior SNR Estimation
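• We run a further round of gain estimation to obtain a converged gain:
• $\xi(t,b) = G(t,b)\, \gamma(t,b)$
• $v(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)}\, \gamma(t,b)$
• $G(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)} \exp\left(\frac{1}{2}\int_{v(t,b)}^{\infty} \frac{e^{-\tau}}{\tau}\, d\tau\right)$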
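The refinement feeds the current gain back into the prior SNR and recomputes the gain. A minimal self-contained sketch follows. The `num_passes` parameter, the starting gain, and the numerical floor are illustrative assumptions; the slides describe the re-estimation without stating how many passes are run:

```python
import math

def exp_integral_e1(x):
    """E1(x): integral from x to infinity of exp(-tau)/tau, for x > 0."""
    if x <= 1.0:
        # Power series: -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 40):
            term *= -x / k
            total += term / k
        return -0.5772156649015329 - math.log(x) - total
    # Continued fraction (modified Lentz method) for larger arguments.
    b, c = x + 1.0, 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-x)

def lsa_gain(prior_snr, post_snr):
    """G = xi/(1+xi) * exp(0.5 * E1(v)), with v = xi/(1+xi) * gamma."""
    ratio = prior_snr / (1.0 + prior_snr)
    v = max(ratio * post_snr, 1e-10)   # numerical guard (assumption)
    return ratio * math.exp(0.5 * exp_integral_e1(v))

def refine_gain(gain, post_snr, num_passes=1):
    """Re-estimate the prior SNR from the current gain, then the gain itself."""
    for _ in range(num_passes):
        prior_snr = gain * post_snr        # xi = G * gamma
        gain = lsa_gain(prior_snr, post_snr)
    return gain

initial_gain = 0.675    # hypothetical gain from the decision-directed step
post_snr = 4.0          # hypothetical posterior SNR
refined = refine_gain(initial_gain, post_snr)
converged = refine_gain(initial_gain, post_snr, num_passes=30)
```

For this example, repeated passes settle quickly on a fixed value, which is one way to read "converged gain".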
OMLSA
• The optimally-modified log-spectral amplitude (OMLSA) speech estimator (Cohen and Berdugo, 2001) is used to modify the time-filterbank gain:
$$G(t,b) \leftarrow G(t,b)^{p(t,b)}\, G_0^{\,1-p(t,b)}$$
• I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise environments," Signal Processing, vol. 81, no. 11, pp. 2403-2418, 2001.
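The OMLSA modification is a geometric interpolation between the estimated gain and a constant G0, weighted by the speech probability. A minimal sketch, assuming G0 is a small constant gain floor (the slides do not give its value; 0.1 is purely illustrative):

```python
def omlsa_gain(gain, speech_prob, floor_gain=0.1):
    """G^p * G0^(1-p); floor_gain plays the role of G0 (assumed value)."""
    return (gain ** speech_prob) * (floor_gain ** (1.0 - speech_prob))

speech_present = omlsa_gain(0.4, speech_prob=1.0)   # keeps the estimated gain
speech_absent = omlsa_gain(0.4, speech_prob=0.0)    # falls back to the floor
uncertain = omlsa_gain(0.4, speech_prob=0.5)        # geometric mean of the two
```

An unreliable speech probability therefore pulls the gain toward the floor in bins that actually contain speech, which is why the estimator depends on a good probability estimate.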
Gain Smoothing
• OMLSA works better when the speech probability estimate is reliable.
• In this stage we instead smooth the gain across filterbanks, which partially addresses the weak independence assumption between Mel filterbanks:
$$G(t,b) \leftarrow \frac{G(t,b-1) + G(t,b) + G(t,b+1)}{3}$$
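The three-point average can be sketched as below. The slides do not say how the first and last filterbanks are handled; averaging over whichever neighbours exist is an assumption made here for illustration:

```python
def smooth_gain(gains):
    """Average each bin's gain with its immediate neighbours across filterbanks."""
    num_bins = len(gains)
    smoothed = []
    for b in range(num_bins):
        lo = max(b - 1, 0)
        hi = min(b + 1, num_bins - 1)
        window = gains[lo:hi + 1]          # 3 bins inside, 2 bins at the edges
        smoothed.append(sum(window) / len(window))
    return smoothed

frame_gains = [0.3, 0.9, 0.3, 0.6]         # hypothetical G(t, b) for one frame
smoothed = smooth_gain(frame_gains)
```

The isolated peak in the second bin is pulled toward its neighbours, which reduces bin-to-bin gain fluctuations.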
Improving CMMSE
• Main path: input signal → spectrum calculation → Mel filtering → noise reduction → cleaned filter-bank spectrum
• Gain path: VAD (IMCRA) → noise spectrum estimation (IMCRA) → filter-bank gain estimation → filter-bank gain smoothing → noise reduction
Two-stage Processing
• The noise reduction process is not perfect because of factors such as imperfect noise estimation.
• A second stage of noise reduction can be used to further reduce the noise.
• We use OMLSA together with gain smoothing for the gain modification in the second stage, because the residual noise has less impact on speech probability estimation after the first-stage noise reduction.
Improved Cepstra Minimum Mean Square Error (ICMMSE)
• Front end: input signal → spectrum calculation → Mel filtering
• First stage: VAD (IMCRA) → noise spectrum estimation (IMCRA) → filter-bank gain estimation → filter-bank gain smoothing → noise reduction → 1st-stage cleaned filter-bank spectrum
• Second stage: VAD (IMCRA) → noise spectrum estimation (IMCRA) → filter-bank gain estimation → filter-bank gain OMLSA + smoothing → noise reduction → final cleaned filter-bank spectrum
Experiments
| Task | Training Data | Acoustic Model |
| --- | --- | --- |
| Aurora 2 | 8440 utterances | GMM |
| Chime 3 | 1600 real utterances + 7138 simulated utterances | feed-forward DNN |
| Cortana | 3400 hr live data | LSTM-RNN |
Aurora 2
WER (%) by test condition:
| Condition | Baseline | CMMSE | 1-stage ICMMSE | 2-stage ICMMSE |
| --- | --- | --- | --- | --- |
| Clean | 1.39 | 1.48 | 1.20 | 1.10 |
| 20 dB | 2.69 | 2.21 | 1.99 | 1.85 |
| 15 dB | 3.6 | 3.21 | 2.91 | 2.71 |
| 10 dB | 6.04 | 5.65 | 5.30 | 4.97 |
| 5 dB | 14.38 | 13.01 | 12.59 | 11.57 |
| 0 dB | 43.41 | 38.36 | 34.02 | 32.06 |
| -5 dB | 75.93 | 72.8 | 68.47 | 67.45 |
| Avg. (0-20 dB) | 13.67 | 12.17 | 10.87 | 10.19 |
Figure: Relative WER reduction from baseline for CMMSE, 1-stage ICMMSE, and 2-stage ICMMSE under each condition (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB, Avg. 0-20 dB).
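The relative WER reduction plotted in the figure can be recomputed from the average row of the table. A short sketch, assuming the usual definition (baseline minus system, divided by baseline):

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

baseline_avg = 13.67
system_avg = {"CMMSE": 12.17, "1-stage ICMMSE": 10.87, "2-stage ICMMSE": 10.19}

reductions = {
    name: round(relative_wer_reduction(baseline_avg, wer), 2)
    for name, wer in system_avg.items()
}
# reductions == {"CMMSE": 10.97, "1-stage ICMMSE": 20.48, "2-stage ICMMSE": 25.46}
```

The 2-stage ICMMSE value, 25.46%, matches the Aurora 2 figure reported in the result summary.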
Improvement Breakdown
| Method | Avg. WER |
| --- | --- |
| Baseline | 13.67 |
| CMMSE | 12.17 |
| + IMCRA | 11.66 |
| + OMLSA | 10.99 |
| + refined prior SNR | 10.87 |
| - OMLSA + gain smoothing (1-stage ICMMSE) | 10.74 |
| + 2nd-stage processing (2-stage ICMMSE) | 10.19 |
Chime 3
| Model front-end (FE) | Test condition | Real | Simulated |
| --- | --- | --- | --- |
| Baseline | Clean | 7.56 | N/A |
| CMMSE | Clean | 7.64 | N/A |
| ICMMSE | Clean | 7.41 | N/A |
| Baseline | Noisy | 18.95 | 18.18 |
| CMMSE | Noisy | 17.32 | 16.76 |
| ICMMSE | Noisy | 16.73 | 15.79 |
Figure: Relative WER reduction from baseline for CMMSE and ICMMSE on the Clean, Real Noisy, and Simu Noisy test sets.
Figure: Chime 3 WER breakdown with real and simulated noisy test sets.
Cortana
WER by SNR range:
| SNR range | Baseline | CMMSE | ICMMSE |
| --- | --- | --- | --- |
| Above 20 dB | 13.17 | 12.64 | 12.71 |
| 10-20 dB | 20.8 | 19.83 | 18.51 |
| 0-10 dB | 27.03 | 26 | 24.71 |
Figure: Relative WER reduction from baseline for CMMSE and ICMMSE in each SNR range (above 20 dB, 10-20 dB, 0-10 dB).
Conclusion
• A new robust front-end, ICMMSE, is proposed; it improves the previous CMMSE front-end with several advanced components:
• The IMCRA algorithm generates more accurate speech probability estimates.
• The refined prior SNR estimation yields a converged gain.
• Either cross-filterbank gain smoothing or OMLSA helps to further modify the gain function.
• The two-stage processing reduces the residual noise left after the first stage.
• ICMMSE reduces WER relative to the baseline on all three tasks, across GMM, DNN, and LSTM-RNN acoustic models.
Result Summary
| Task | Training Data | Acoustic Model | Relative WER reduction |
| --- | --- | --- | --- |
| Aurora 2 | 8440 utterances | GMM | 25.46% |
| Chime 3 | 1600 real utterances + 7138 simulated utterances | feed-forward DNN | 11.98% |
| Cortana | 3400 hr live data | LSTM-RNN | 11.01% |
Thank You

