Icmmse slides

Improved Cepstra Minimum-Mean-
Square-Error Noise Reduction Algorithm
for Robust Speech Recognition
Jinyu Li, Yan Huang, Yifan Gong
Microsoft

Robust Front-End
• Conventional noise-robust front-ends work very well with Gaussian
mixture models (GMMs).
• Single-channel robust front-ends were reported to be unhelpful for multi-style
deep neural network (DNN) training.
• The DNN's layer-by-layer structure provides a feature extraction strategy that automatically
derives powerful noise-resistant features.

t-SNE Plot for Paired Clean and Noisy
Utterances in Training Set (LFB)

t-SNE Plot for Paired Clean and Noisy
Utterances in Training Set (Layer 1)

t-SNE Plot for Paired Clean and Noisy
Utterances in Testing Set (Layer 1)

Robust Front-End
• Conventional noise-robust front-ends work very well with Gaussian
mixture models (GMMs).
• Single-channel robust front-ends were reported to be unhelpful for multi-style
deep neural network (DNN) training, although multi-channel signal
processing still helps.
• In this study, we show that a single-channel robust front-end is still
beneficial to deep learning models as long as it is well designed.

Cepstra Minimum Mean Square Error
(CMMSE)
• Reported to be very effective against noise when used with GMM-based
acoustic models.
• The CMMSE solution for each MFCC dimension k:
$\hat{c}_x(t,k) = E[c_x(t,k) \mid \mathbf{m}_y(t)] = \sum_b a_{k,b}\, E[\log m_x(t,b) \mid \mathbf{m}_y(t)]$
• The problem is reduced to finding the log-MMSE estimator of the Mel
filter-bank output, $\hat{m}_x(t,b) \approx \exp\left(E[\log m_x(t,b) \mid m_y(t,b)]\right)$, under a
weak independence assumption between Mel filter-bank channels.
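The mapping from an enhanced filter-bank vector to cepstra can be sketched as follows. This is a minimal illustration, not the authors' code: the coefficients a_{k,b} are assumed here to be an unnormalized DCT-II basis (the slides only name them), and `dct_matrix` and `cepstra_from_filterbank` are hypothetical helper names.

```python
import math


def dct_matrix(num_ceps, num_bands):
    # Assumption: a_{k,b} is an unnormalized DCT-II basis.
    return [
        [math.cos(math.pi * k * (b + 0.5) / num_bands) for b in range(num_bands)]
        for k in range(num_ceps)
    ]


def cepstra_from_filterbank(m_hat, num_ceps):
    """Compute c(k) = sum_b a_{k,b} * log m_hat(b) for one frame.

    m_hat holds strictly positive Mel filter-bank outputs.
    """
    log_m = [math.log(v) for v in m_hat]
    a = dct_matrix(num_ceps, len(m_hat))
    return [sum(a_kb * l for a_kb, l in zip(row, log_m)) for row in a]


# A spectrally flat frame has energy only in the zeroth cepstral coefficient.
flat_frame = [math.e] * 8
ceps = cepstra_from_filterbank(flat_frame, num_ceps=4)
```

Because the cepstra are a linear function of the log filter-bank outputs, estimating each log output separately is what lets the problem reduce to a per-channel log-MMSE estimator.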

4-Step Processing of CMMSE
• Voice activity detection (VAD): detects the speech probability at every time-
filterbank bin;
• Noise spectrum estimation: uses the estimated speech probability to update
the estimation of noise spectrum;
• Gain estimation: uses the noisy speech spectrum and the estimated noise
spectrum to calculate the gain of every time-filterbank bin;
• Noise reduction: applies the estimated gain to the noisy speech spectrum to
generate the clean spectrum.

CMMSE
[Block diagram. Blocks: input signal, spectrum calculation, Mel filtering, VAD (MCRA), noise spectrum estimation (MCRA), filter-bank gain estimation, noise reduction, cleaned filter-bank spectrum.]

VAD
• CMMSE uses a minima controlled recursive averaging (MCRA)
noise tracker (Cohen and Berdugo, 2002) to estimate the speech probability
$p(t,b)$ in each filter-bank bin b at time t.
• I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging
for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12-15, 2002.

Noise Spectrum Estimation
• The noise power spectrum $m_n(t,b)$ is estimated using MCRA (Cohen and
Berdugo, 2002) as
$m_n(t,b) = \alpha\, m_n(t-1,b) + (1.0 - \alpha)\, m_y(t,b)$
with
$\alpha = \alpha_D + (1.0 - \alpha_D)\, p(t,b)$
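The recursion above can be sketched for a single time-filterbank bin as follows. The default `alpha_d=0.95` is a placeholder chosen for illustration; the slides do not give a value for the constant.

```python
def update_noise(m_n_prev, m_y, p, alpha_d=0.95):
    """One MCRA-style noise update for a single filter-bank bin.

    m_n_prev: previous noise estimate m_n(t-1, b)
    m_y:      noisy filter-bank output m_y(t, b)
    p:        speech probability p(t, b) in [0, 1]
    alpha_d:  smoothing constant (assumed value, not from the slides)
    """
    alpha = alpha_d + (1.0 - alpha_d) * p
    return alpha * m_n_prev + (1.0 - alpha) * m_y


# Speech certainly present: the noise estimate is held.
held = update_noise(2.0, 10.0, p=1.0)
# Speech certainly absent: the estimate moves toward the observation.
tracked = update_noise(2.0, 10.0, p=0.0)
```

When p(t,b) is close to 1 the smoothing factor approaches 1, so speech energy does not leak into the noise estimate.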

Gain Estimation
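• The gain of each time-filterbank bin: $G(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)} \exp\left(\frac{1}{2}\int_{v(t,b)}^{\infty} \frac{e^{-\tau}}{\tau}\, d\tau\right)$.
• Posterior SNR: $\gamma(t,b) = \frac{m_y(t,b)}{m_n(t,b)}$
• Prior SNR is calculated using a decision-directed approach (DDA): $\xi(t,b) = \beta\, G(t-1,b)\, \gamma(t-1,b) + (1.0-\beta)\, \xi'(t,b)$
• where $\xi'(t,b) = \max(\gamma(t,b) - 1, 0.0)$ is the instantaneous estimate
• $v(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)}\, \gamma(t,b)$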
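The gain computation for one bin can be sketched in plain Python as follows. The integral in the gain formula is the exponential integral E1(v), evaluated here with a power series for small arguments and a continued fraction otherwise. The value `beta=0.98` and the two floors are assumptions for illustration, and all function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp1(x, tol=1e-12, max_iter=500):
    """E1(x) = integral from x to infinity of exp(-tau)/tau dtau, for x > 0."""
    if x <= 0.0:
        raise ValueError("exp1 requires x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for k in range(1, max_iter):
            term *= -x / k  # term == (-x)^k / k!
            delta = -term / k
            total += delta
            if abs(delta) < tol:
                break
        return total
    # Continued fraction, evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1.0e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h * math.exp(-x)


def dda_prior_snr(g_prev, gamma_prev, gamma, beta=0.98):
    """Decision-directed prior SNR; beta is an assumed value."""
    instantaneous = max(gamma - 1.0, 0.0)
    return beta * g_prev * gamma_prev + (1.0 - beta) * instantaneous


def log_mmse_gain(xi, gamma, xi_floor=1e-6, v_floor=1e-12):
    """Gain G = xi/(1+xi) * exp(0.5 * E1(v)), with v = xi/(1+xi) * gamma.

    The floors (assumptions) keep E1 away from its singularity at 0.
    """
    xi = max(xi, xi_floor)
    v = max(xi / (1.0 + xi) * gamma, v_floor)
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))


# One bin over two frames: posterior SNR is m_y / m_n in each frame.
gamma_prev, gamma_now = 8.0 / 2.0, 6.0 / 2.0
xi = dda_prior_snr(g_prev=0.5, gamma_prev=gamma_prev, gamma=gamma_now)
gain = log_mmse_gain(xi, gamma_now)
```

At high SNR the exponential term tends to 1 and the gain approaches the Wiener-like factor ξ/(1+ξ).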

Noise Reduction
• $\hat{m}_x(t,b) \approx G(t,b)\, m_y(t,b)$

Speech Probability Estimation
• Reliable speech probability estimation is critical to the estimation of the noise
spectrum, because the smoothing factor $\alpha$ depends on $p(t,b)$: $m_n(t,b) = \alpha\, m_n(t-1,b) + (1.0 - \alpha)\, m_y(t,b)$.
• We use IMCRA (Cohen, 2003) to estimate the speech probability $p(t,b)$ in
each time-filterbank bin.
• I. Cohen, "Noise spectrum estimation in adverse environments: improved minima
controlled recursive averaging," IEEE Trans. Speech and Audio Processing, vol. 11, no. 5,
pp. 466-475, 2003.

Improving CMMSE
[Block diagram. Blocks: input signal, spectrum calculation, VAD (IMCRA), noise spectrum estimation (IMCRA), filter-bank gain estimation.]

Refined Prior SNR Estimation
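• We perform a further round of gain estimation to obtain a converged gain:
• $\xi(t,b) = G(t,b)\, \gamma(t,b)$
• $v(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)}\, \gamma(t,b)$
• $G(t,b) = \frac{\xi(t,b)}{1+\xi(t,b)} \exp\left(\frac{1}{2}\int_{v(t,b)}^{\infty} \frac{e^{-\tau}}{\tau}\, d\tau\right)$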
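The refinement loop can be sketched with the gain rule passed in as a function. Two things here are assumptions: the number of passes (`num_iters`, which the slides do not state) and the demo gain rule, which is a plain Wiener gain ξ/(1+ξ) used as a stand-in so the example stays short. The slides use the log-MMSE gain with v(t,b) in that position.

```python
def refine_gain(gain_fn, g_init, gamma, num_iters=1):
    """Re-estimate the gain from a prior SNR derived from the current gain.

    gain_fn:   callable (xi, gamma) -> gain; the log-MMSE rule in the slides
    g_init:    gain from the first estimation pass
    gamma:     posterior SNR of the current bin
    num_iters: number of refinement passes (assumed, not from the slides)
    """
    g = g_init
    for _ in range(num_iters):
        xi = g * gamma          # refined prior SNR: xi = G * gamma
        g = gain_fn(xi, gamma)  # gain recomputed from the refined prior SNR
    return g


def wiener_gain(xi, gamma):
    # Stand-in gain rule for this demo only; gamma is unused here.
    return xi / (1.0 + xi)


one_pass = refine_gain(wiener_gain, g_init=0.5, gamma=5.0)
many_passes = refine_gain(wiener_gain, g_init=0.5, gamma=5.0, num_iters=50)
```

With the stand-in rule, repeated passes settle at the fixed point where ξ = γ − 1, which illustrates what a converged gain means in this setting.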

OMLSA
• The optimally-modified log-spectral amplitude (OMLSA) speech estimator
(Cohen and Berdugo, 2001) is used to modify the time-filterbank gain:
$G(t,b) = G(t,b)^{p(t,b)}\, G_0^{1-p(t,b)}$
• I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise
environments," Signal Processing, vol. 81, no. 11, pp. 2403-2418, 2001.
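The OMLSA modification is a geometric interpolation between the estimated gain and the constant G_0, weighted by the speech probability. A minimal sketch for one bin follows; the default `g0=0.1` is an assumed value, since the slides do not specify G_0.

```python
def omlsa_gain(g, p, g0=0.1):
    """OMLSA-modified gain: G^p * G0^(1-p).

    g:  gain estimated for the bin
    p:  speech probability p(t, b) in [0, 1]
    g0: gain applied when speech is absent (assumed value)
    """
    return (g ** p) * (g0 ** (1.0 - p))


speech_present = omlsa_gain(0.64, p=1.0)           # gain is kept as is
speech_absent = omlsa_gain(0.64, p=0.0)            # falls back to g0
uncertain = omlsa_gain(0.64, p=0.5, g0=0.01)       # geometric mean of the two
```

Because the result depends directly on p(t,b), an unreliable speech probability can suppress speech bins or leave noise bins untouched.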

Gain Smoothing
• It is better to use OMLSA when the speech probability estimation is reliable.
• We apply cross-filterbank gain smoothing in this stage to partially address the
weak independence assumption between Mel filter-bank channels:
$G(t,b) = \left(G(t,b-1) + G(t,b) + G(t,b+1)\right)/3$
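The three-point average can be sketched for one frame as follows. The slides do not say how the first and last filter-bank channels are handled; averaging over the neighbors that exist is an assumption made here.

```python
def smooth_gains(gains):
    """Average each gain with its adjacent filter-bank channels.

    gains: list of gains G(t, b) for one frame, indexed by channel b.
    Edge channels average over two values (assumed edge handling).
    """
    n = len(gains)
    smoothed = []
    for b in range(n):
        lo = max(b - 1, 0)
        hi = min(b + 1, n - 1)
        window = gains[lo:hi + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


frame_gains = [0.0, 0.3, 0.9]
smoothed = smooth_gains(frame_gains)
```

All outputs are computed from the unsmoothed input list, so the result does not depend on the order in which channels are visited.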

Improving CMMSE
[Block diagram. Blocks: input signal, spectrum calculation, VAD (IMCRA), noise spectrum estimation (IMCRA), filter-bank gain estimation, filter-bank gain smoothing.]

Two-stage Processing
• The noise reduction process is not perfect due to factors such as
imperfect noise estimation.
• A second stage of noise reduction can be used to further reduce the noise.
• We use OMLSA together with gain smoothing for the gain modification in
the second stage because the residual noise has less impact on speech
probability estimation after the first-stage noise reduction.

Improved Cepstra Minimum Mean Square Error
(ICMMSE)
[Block diagram, two stages.
First stage blocks: input signal, spectrum calculation, VAD (IMCRA), noise spectrum estimation (IMCRA), filter-bank gain estimation, filter-bank gain smoothing, noise reduction, 1st stage cleaned filter-bank spectrum.
Second stage blocks: VAD (IMCRA), noise spectrum estimation (IMCRA), filter-bank gain estimation, filter-bank gain OMLSA+smoothing, final cleaned filter-bank spectrum.]

Experiments
Task      Training Data                                     Acoustic Model
Aurora 2  8440 utterances                                   GMM
Chime 3   1600 real utterances + 7138 simulated utterances  feed-forward DNN
Cortana   3400hr Live data                                  LSTM-RNN

Aurora 2
WER (%)
Condition       Baseline  CMMSE  1-stage ICMMSE  2-stage ICMMSE
Clean           1.39      1.48   1.20            1.10
20 dB           2.69      2.21   1.99            1.85
15 dB           3.6       3.21   2.91            2.71
10 dB           6.04      5.65   5.30            4.97
5 dB            14.38     13.01  12.59           11.57
0 dB            43.41     38.36  34.02           32.06
-5 dB           75.93     72.8   68.47           67.45
Avg. (0-20 dB)  13.67     12.17  10.87           10.19
[Figure: Relative WER reduction from baseline for CMMSE, 1-stage ICMMSE, and 2-stage ICMMSE, by condition (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB, Avg. (0-20 dB)).]

Improvement Breakdown
Method                                    Avg. WER
Baseline                                  13.67
CMMSE                                     12.17
+ IMCRA                                   11.66
+ OMLSA                                   10.99
+ refined prior SNR                       10.87
- OMLSA + gain smoothing (1-stage ICMMSE) 10.74
+ 2nd stage processing (2-stage ICMMSE)   10.19

Chime 3
WER (%)
Model FE  Test   Real   Simulated
Baseline  Clean  7.56   N/A
CMMSE     Clean  7.64   N/A
ICMMSE    Clean  7.41   N/A
Baseline  Noisy  18.95  18.18
CMMSE     Noisy  17.32  16.76
ICMMSE    Noisy  16.73  15.79
[Figure: Relative WER reduction from baseline for CMMSE and ICMMSE on the Clean, Real Noisy, and Simulated Noisy test sets.]

Chime 3 WER Breakdown with Real and Simu.
Noisy Test Set

Cortana
WER (%)
SNR          Baseline  CMMSE  ICMMSE
Above 20 dB  13.17     12.64  12.71
10-20 dB     20.8      19.83  18.51
0-10 dB      27.03     26     24.71
[Figure: Relative WER reduction from baseline for CMMSE and ICMMSE in the above 20 dB, 10-20 dB, and 0-10 dB ranges.]

Conclusion
• A new robust front-end, ICMMSE, is proposed to improve the previous
CMMSE front-end with several advanced components:
• The IMCRA algorithm helps to generate more accurate speech probabilities.
• The refined prior SNR estimation helps to get a converged gain.
• Either cross-filterbank gain smoothing or OMLSA is helpful to further modify the gain
function.
• The two-stage processing helps to reduce the residual noise after the first-stage processing.
• ICMMSE improves on the baseline across all three tasks and acoustic models (GMM, feed-forward DNN, and LSTM-RNN).

Result Summary
Task      Training Data                                     Acoustic Model    Relative WER reduction
Aurora 2  8440 utterances                                   GMM               25.46%
Chime 3   1600 real utterances + 7138 simulated utterances  feed-forward DNN  11.98%
Cortana   3400hr Live data                                  LSTM-RNN          11.01%
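Relative WER reduction is the baseline WER minus the system WER, divided by the baseline WER. As a quick check, the sketch below applies this to the Aurora 2 averages (13.67 baseline, 10.19 for 2-stage ICMMSE) and to the Cortana 10-20 dB row (20.8 baseline, 18.51 for ICMMSE), which give the 25.46% and 11.01% figures respectively. The function name is hypothetical.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


aurora2 = relative_wer_reduction(13.67, 10.19)   # Avg. (0-20 dB), 2-stage ICMMSE
cortana = relative_wer_reduction(20.8, 18.51)    # 10-20 dB, ICMMSE

print(f"Aurora 2: {aurora2:.2f}%")
print(f"Cortana:  {cortana:.2f}%")
```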
