Copyright©2017 NTT corp. All Rights Reserved.
Black-box Optimization of
DNN-based Source Enhancement
for Increasing Objective Sound Quality
Oct. 26, 2017 @ MERL visit
Yuma Koizumi
NTT Media Intelligence Laboratories

Self introduction
• Name: Yuma Koizumi
• Age: 27 (born in 1990 in Tokyo, Japan)
• Background
  • 2008-2014: Hosei University
    B.S. and M.S. (received in 2012 and 2014)
  • 2016-2017: University of Electro-Communications
    Ph.D. degree (received in 2017)
  • 2014-Now: Researcher at NTT
• Research topics at NTT
  • Source enhancement, machine learning, anomaly detection, etc.

Table of contents
• Introduction
• Conventional method
• Proposed method
  1. Basic idea
     Black-box optimization to increase OSQA score
  2. T-F mask selection based approach
     presented in [Koizumi+, ICASSP-2017]
  3. T-F mask estimation based approach
     in peer review, [IEEE Trans. ASLP]

Introduction
Purpose: improve perceptual sound quality of DNN source enhancement output
[Figure: source enhancement takes a mixture of the target $S_{\omega,\tau}$ and noise $N_{\omega,\tau}$ and outputs $\hat{S}_{\omega,\tau}$; applications shown: automatic speech recognition, immersive audio, telecommunication, anomaly detection, …]
DNN source enhancement has been widely used in various practical applications
How to train DNN source enhancement to improve “perceptual” quality?
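
Time-frequency (T-F) masking
• A T-F mask $0 \le G_{\omega,\tau} \le 1$ is multiplied with the observation $X_{\omega,\tau}$: $\hat{S}_{\omega,\tau} = G_{\omega,\tau} X_{\omega,\tau}$
[Figure: the target $S_{\omega,\tau}$ and noise $N_{\omega,\tau}$ form the observation $X_{\omega,\tau}$; T-F mask estimation (e.g. $S_{\omega,\tau} / X_{\omega,\tau}$) gives $G_{\omega,\tau}$, and filtering gives the output $\hat{S}_{\omega,\tau}$.]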
How to estimate $G_{\omega,\tau}$?

DNN-based source enhancement
• Deep neural networks are used as the T-F mask estimator
[Figure: the observation $\mathbf{X}$ is fed to the DNN $\mathcal{M}(\mathbf{x} \mid \Theta)$, which outputs the T-F mask $\mathbf{G}$; a sigmoid output activation ensures $0 \le G_{\omega,\tau} \le 1$. Filtering the observation $X_{\omega,\tau}$ with $G_{\omega,\tau}$ gives the output $\hat{S}_{\omega,\tau}$.]
[Figure (cont’d): the training data consist of target $\mathbf{S}$ and noise $\mathbf{N}$, which form the observation; the DNN parameters $\Theta$ are trained through a cost function $\mathcal{I}(\Theta)$.]
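
Conventional DNN training
• Gradient descent/ascent based training
  $\Theta \leftarrow \Theta - \lambda \nabla_\Theta \mathcal{I}(\Theta)$
• Mean-squared error (MSE) is widely used
  • Phase-sensitive spectrum approximation (PSA) [Erdogan+, 2015]
    • minimizes the MSE between $S_{\omega,\tau}$ and $G_{\omega,\tau} X_{\omega,\tau}$ on the complex plane
      $\mathcal{I}(\Theta) = \mathbb{E}_{\omega,\tau}\left[ \left| S_{\omega,\tau} - G_{\omega,\tau} X_{\omega,\tau} \right|^2 \right]$
  • Complex ideal ratio mask (cIRM), $G_{\omega,\tau} \in \mathbb{C}$ [Williamson+, 2016]
    • predicts the real and imaginary parts of the cIRM
      $\mathcal{I}(\Theta) = \mathbb{E}_{\omega,\tau}\left[ \left( \Re(\mathrm{cIRM}) - \Re(G_{\omega,\tau}) \right)^2 + \left( \Im(\mathrm{cIRM}) - \Im(G_{\omega,\tau}) \right)^2 \right]$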
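
The masking and complex-plane MSE above can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the speaker's code: the function names `apply_mask` and `psa_style_mse` are hypothetical, the signals are random stand-ins for STFT coefficients, and the example mask uses magnitudes with clipping so that $0 \le G_{\omega,\tau} \le 1$ holds.

```python
import numpy as np

def apply_mask(G, X):
    """Filtering: multiply the T-F mask with the observation, bin by bin."""
    return G * X

def psa_style_mse(S, G, X):
    """Mean over T-F bins of |S - G X|^2, i.e. the MSE on the complex plane."""
    return np.mean(np.abs(S - apply_mask(G, X)) ** 2)

rng = np.random.default_rng(0)
shape = (5, 3)  # (frequency bins, frames), toy size
S = rng.normal(size=shape) + 1j * rng.normal(size=shape)  # stand-in target
N = rng.normal(size=shape) + 1j * rng.normal(size=shape)  # stand-in noise
X = S + N                                                 # observation

# Example mask in the spirit of "e.g. S / X": magnitude ratio, clipped to [0, 1]
# (using magnitudes and clipping are assumptions of this sketch).
G = np.clip(np.abs(S) / np.abs(X), 0.0, 1.0)
S_hat = apply_mask(G, X)

cost_masked = psa_style_mse(S, G, X)
cost_unmasked = psa_style_mse(S, np.ones(shape), X)  # G = 1 leaves X untouched
```

With this mask the cost cannot exceed the unmasked cost, because shrinking each observation bin to the target's magnitude never moves it further from the target.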

Pros and cons of MMSE cost
• Pros
  • Simple implementation
  • Easy to apply backpropagation (analytical gradient)
  • A lot of “know-how” for parameter tuning
• Cons
  • Minimizing MSE ≠ maximizing perceptual quality
Perceptually motivated cost

Perceptually motivated cost function
• What measurement should be used as $\mathcal{I}(\Theta)$?
• Subjective evaluation would be best, however…
  • Listening tests without sleep or rest, night and day...
That is not realistic!!
[Figure: the target source $S_{\omega,\tau}$ and noise $N_{\omega,\tau}$ form the observation $X_{\omega,\tau}$; filtering with $G_{\omega,\tau}$ gives the output $\hat{S}_{\omega,\tau}$, which listeners rate with a mean opinion score (MOS).]

Perceptually motivated cost function (cont’d)
• Use of Objective Sound Quality Assessment (OSQA)
  • PESQ [ITU-T P.862]: quality
  • STOI [Taal+, 2011]: intelligibility
[Figure (cont’d): the same block diagram, with an OSQA scoring the output signal.]

Black-box optimization for DNN source enhancement
• DNN training: gradient ascent
  $\Theta \leftarrow \Theta + \lambda \nabla_\Theta \mathcal{I}(\Theta)$
  • Need to calculate the gradient $\nabla_\Theta \mathcal{I}(\Theta)$
• Most OSQAs are black-box functions
  • The gradient $\nabla_\Theta \mathcal{I}(\Theta)$ cannot be calculated analytically
Apply black-box optimization to train DNN source enhancement,
to increase a perceptually motivated (black-box) cost

Reinforcement learning
• A technique of black-box optimization
  • e.g. Atari [Mnih+, 2015] & AlphaGo [Silver+, 2016]
• DNN is used as the action “selector”
  • A good/bad reward increases/decreases the selection probability
[Figure: the action selector (DNN) picks an action from the action candidates; the reward function returns the game score.]

Reinforcement learning (cont’d)
• DNN is used as the T-F mask “selector”
  • A good/bad OSQA score increases/decreases the selection probability
[Figure: the mask selector (DNN) picks a T-F mask template as its action; T-F masking is applied and the reward function returns the OSQA score.]
Investigate whether DNN source enhancement can be trained by a black-box optimization scheme
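
Implementation
• DNN $\mathcal{M}(\boldsymbol{x}_\tau, a \mid \Theta)$ selects a T-F mask template
  $a_\tau \leftarrow \arg\max_a \mathcal{M}(\boldsymbol{x}_\tau, a \mid \Theta)$, where $\sum_a \mathcal{M}(\boldsymbol{x}_\tau, a \mid \Theta) = 1$
  $\boldsymbol{G}_\tau \leftarrow \boldsymbol{\mathcal{G}}_{a_\tau}$
• T-F mask templates are calculated by the K-means algorithm
  $\boldsymbol{\mathcal{G}}_a := (G_{1,a}, G_{2,a}, \ldots, G_{\Omega,a})^\top$
[Figure: the T-F mask selector (DNN) chooses an action, i.e. one of the T-F mask templates; T-F masking is applied and the reward is the OSQA score (e.g. PESQ) of the result. Signal labels in the figure: $N_{\omega,k}$, $S_{\omega,k}$.]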
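
The selection rule above can be sketched in a few lines of NumPy. This is only a sketch under assumptions: `select_masks` and `softmax` are hypothetical helper names, the templates are random numbers standing in for K-means centroids, the scores are random numbers standing in for the DNN output layer, and 64 frequency bins per template is assumed for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, so each frame's selection probabilities sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def select_masks(scores, templates):
    """scores: (frames, actions) selector outputs; templates: (actions, bins)."""
    probs = softmax(scores)
    actions = probs.argmax(axis=-1)       # a_tau <- argmax_a M(x_tau, a | Theta)
    return actions, templates[actions]    # G_tau <- template with index a_tau

rng = np.random.default_rng(0)
num_actions, num_bins, num_frames = 32, 64, 10
# Hypothetical stand-ins: real templates would come from K-means, and real
# scores from the DNN's output layer before the softmax.
templates = rng.uniform(0.0, 1.0, size=(num_actions, num_bins))
scores = rng.normal(size=(num_frames, num_actions))

actions, masks = select_masks(scores, templates)
```

Each row of `masks` is one of the 32 templates, which is the restriction that the later regression-based approach removes.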

Evaluation: setting
• Training data
  • Target source: ATR Japanese speech database (3316 utterances)
  • Noise source: CHiME-3 noise dataset (cafes, street junctions, public transport (buses), and pedestrian areas) [Barker+, 2015]
  • SNR: 0, 3 and 6 dB
• OSQA
  • PESQ (P.862.2) and PEASS [Emiya+, 2011]
• DNN architecture
  • Input: 64 × 11 units (mel-FB × frame concat.)
  • Output: 32 actions (softmax activation)
  • Hidden: 2 hidden layers, 64 units, and sigmoid activation
• Comparison method
  • Fully-connected DNN that estimates the log-amplitude spectrum:
    (64 × 11 units) → 128 sigmoid → 128 sigmoid → 64 (linear)

Verification experiment
• Both OSQA scores increased as the number of updates increased
⇒ DNN could be trained by the black-box optimization scheme!
• OSQA scores were still significantly lower than SOTA…
  • PSA’s score: 2.36 at 0 dB, 2.72 at 6 dB

Motivation and summary
• Problem: low flexibility
  • A finite number (only 32) of T-F mask templates is not enough
  • Selection ⇒ Regression

  Method        OSQA-based cost   Flexibility
  MMSE-based                      ✓
  RL-based      ✓
  This study    ✓                 ✓

• Key points
  1. Gradient is calculated using a sampling algorithm
     • Policy gradient method [Williams+, 1992] is applied
  2. DNN outputs PDF parameters of the T-F-masked output signal
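
Cost function and DNN architecture
• Cost: expectation of the OSQA score $\mathcal{B}(\hat{\mathbf{S}}, \mathbf{X})$
  $\mathcal{I}(\Theta) = \mathbb{E}_{\hat{\mathbf{S}},\mathbf{X}}\left[ \mathcal{B}(\hat{\mathbf{S}}, \mathbf{X}) \right] = \mathbb{E}_{\mathbf{X}}\left[ \mathbb{E}_{\hat{\mathbf{S}} \mid \mathbf{X}}\left[ \mathcal{B}(\hat{\mathbf{S}}, \mathbf{X}) \right] \right]$
  $= \int p(\mathbf{X}) \int p(\hat{\mathbf{S}} \mid \mathbf{X}) \, \mathcal{B}(\hat{\mathbf{S}}, \mathbf{X}) \, d\hat{\mathbf{S}} \, d\mathbf{X}$
  $= \int p(\mathbf{X}) \int p(\hat{\mathbf{S}} \mid \mathbf{X}, \Theta) \, \mathcal{B}(\hat{\mathbf{S}}, \mathbf{X}) \, d\hat{\mathbf{S}} \, d\mathbf{X}$
• DNN estimates the PDF of the output signal, $p(\hat{\mathbf{S}} \mid \mathbf{X}, \Theta)$
[Figure: from the observation $X_{\omega,\tau}$, the DNN outputs a T-F mask $\boldsymbol{G}(\boldsymbol{x}_\tau)$ and a variance $\boldsymbol{\sigma}(\boldsymbol{x}_\tau)$; $p(\hat{S}_{\omega,\tau} \mid X_{\omega,\tau}, \Theta)$ is drawn as a complex Gaussian distribution on the real-imaginary plane around $G_{\omega,\tau} X_{\omega,\tau}$.]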
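
Gradient estimation
• To calculate the gradient, the policy gradient method [Williams+, 1992] is applied
  $\nabla_\Theta \mathcal{I}(\Theta) = \int p(\mathbf{X}) \int \nabla_\Theta \, p(\hat{\mathbf{S}} \mid \mathbf{X}, \Theta) \, \mathcal{B}(\hat{\mathbf{S}}, \mathbf{X}) \, d\hat{\mathbf{S}} \, d\mathbf{X}$
  $= \int p(\mathbf{X}) \int p(\hat{\mathbf{S}} \mid \mathbf{X}, \Theta) \, \mathcal{B}(\hat{\mathbf{S}}, \mathbf{X}) \, \nabla_\Theta \ln p(\hat{\mathbf{S}} \mid \mathbf{X}, \Theta) \, d\hat{\mathbf{S}} \, d\mathbf{X}$
  $= \mathbb{E}_{\mathbf{X}}\left[ \mathbb{E}_{\hat{\mathbf{S}} \mid \mathbf{X}}\left[ \mathcal{B}(\hat{\mathbf{S}}, \mathbf{X}) \, \nabla_\Theta \ln p(\hat{\mathbf{S}} \mid \mathbf{X}, \Theta) \right] \right]$
  ($\mathcal{B}(\hat{\mathbf{S}}, \mathbf{X})$: OSQA score; $p(\hat{\mathbf{S}} \mid \mathbf{X}, \Theta)$: likelihood of the output signal)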
• The expectation is approximately calculated by a sampling algorithm
  $\nabla_\Theta \mathcal{I}(\Theta) \approx \frac{1}{I} \sum_{i} \frac{1}{K T^{(i)}} \sum_{k,\tau} \mathcal{B}(\hat{\mathbf{S}}^{(i,k)}, \mathbf{X}^{(i)}) \, \nabla_\Theta \ln p(\hat{\mathbf{S}}_\tau^{(i,k)} \mid \mathbf{X}_\tau^{(i)}, \Theta)$
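
The log-derivative identity behind this estimator can be checked numerically on a one-dimensional toy problem. The sketch below is an illustration only: the score function `black_box_score` is a hypothetical stand-in for an OSQA (it is only evaluated, never differentiated), and the distribution is a real Gaussian with mean `theta` rather than the complex Gaussian of the talk.

```python
import numpy as np

def black_box_score(s):
    """Hypothetical stand-in for an OSQA: evaluated only, never differentiated."""
    return -(s - 2.0) ** 2

def score_function_gradient(theta, sigma, num_samples, rng):
    """Monte Carlo estimate of d/dtheta E[B(s)] with s ~ N(theta, sigma^2)."""
    s = rng.normal(loc=theta, scale=sigma, size=num_samples)  # sample outputs
    dlogp = (s - theta) / sigma**2        # d/dtheta of ln N(s; theta, sigma^2)
    return np.mean(black_box_score(s) * dlogp)

rng = np.random.default_rng(0)
theta, sigma = 0.0, 1.0
estimate = score_function_gradient(theta, sigma, num_samples=200_000, rng=rng)

# For this particular stand-in the expectation is known in closed form:
# E[-(s - 2)^2] = -((theta - 2)^2 + sigma^2), so the exact gradient is:
exact = -2.0 * (theta - 2.0)
```

The estimate approaches `exact` as the number of samples grows, even though the code never differentiates the score itself.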
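
Training algorithm
STEP-1: Estimate the T-F mask and variance from the $i$-th observation $\mathbf{X}^{(i)}$
[Figure: training loop. A target $\mathbf{S}^{(i)}$ and a noise $\mathbf{N}^{(i)}$ are randomly selected from the training data and added to form $\mathbf{X}^{(i)}$; the DNN outputs $\boldsymbol{G}(\boldsymbol{x}_\tau^{(i)})$ and $\boldsymbol{\sigma}(\boldsymbol{x}_\tau^{(i)})$, which define $p(\hat{S}_{\omega,\tau} \mid X_{\omega,\tau}, \Theta)$ on the real-imaginary plane. Monte Carlo sampling ($K$ times): T-F mask sampling gives $\mathbf{G}^{(i,k)}$, T-F masking gives $\hat{\mathbf{S}}^{(i,k)}$, and $\mathcal{B}(\hat{\mathbf{S}}^{(i,k)}, \mathbf{X}^{(i)})$ is calculated. The gradient $\nabla_\Theta \mathcal{I}(\Theta)$ is then calculated and back-propagated, and the loop is repeated.]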
STEP-2: Sample the output signal $K$ times
[Figure: on the real-imaginary plane, a sample $\hat{S}'_{\omega,\tau}$ is drawn from $p(\hat{S}_{\omega,\tau} \mid X_{\omega,\tau}, \Theta)$, which is placed around $\hat{S}_{\omega,\tau} = G_{\omega,\tau} X_{\omega,\tau}$.]
STEP-3: Calculate the OSQA score of the $k$-th sample, $\mathcal{B}(\hat{\mathbf{S}}^{(i,k)}, \mathbf{X}^{(i)})$
STEP-4: Calculate the gradient as follows:
  $\nabla_\Theta \mathcal{I}(\Theta) \approx \frac{1}{I} \sum_{i} \frac{1}{K T^{(i)}} \sum_{k,\tau} \mathcal{B}(\hat{\mathbf{S}}^{(i,k)}, \mathbf{X}^{(i)}) \, \nabla_\Theta \ln p(\hat{\mathbf{S}}_\tau^{(i,k)} \mid \mathbf{X}_\tau^{(i)}, \Theta)$
  The OSQA score $\mathcal{B}(\hat{\mathbf{S}}^{(i,k)}, \mathbf{X}^{(i)})$ acts as a weight on the gradient of the log-likelihood of the simulated output signal.
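
STEP-1 to STEP-4 can be sketched for a single frame as follows. This is a heavily simplified illustration, not the speaker's implementation, and every name in it is hypothetical: the mask is parameterized directly by logits `theta` instead of a DNN, the variance `sigma**2` is fixed instead of estimated, and `standin_osqa` (a negative squared error) replaces a real OSQA such as PESQ or STOI so that the result can be checked against a known gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def standin_osqa(S_hat, S_ref):
    """Hypothetical stand-in for a black-box OSQA; only its score is used."""
    return -np.mean(np.abs(S_hat - S_ref) ** 2, axis=-1)

def estimate_gradient(theta, X, S_ref, sigma, K, rng, osqa=standin_osqa):
    # STEP-1: mask from the current parameters (sigmoid keeps 0 <= G <= 1).
    G = sigmoid(theta)
    mean = G * X                          # center of p(S_hat | X, theta)
    # STEP-2: draw K output signals from a circular complex Gaussian
    # with variance sigma^2 around the masked observation.
    noise = rng.normal(size=(K, X.size)) + 1j * rng.normal(size=(K, X.size))
    S_hat = mean + sigma * noise / np.sqrt(2.0)
    # STEP-3: black-box score of every sampled output signal.
    scores = osqa(S_hat, S_ref)           # shape (K,)
    # STEP-4: score-weighted gradient of the log-likelihood.
    # ln p = -|S_hat - G X|^2 / sigma^2 + const, hence
    # d ln p / dG = 2 Re[(S_hat - G X) conj(X)] / sigma^2.
    dlogp_dG = 2.0 * np.real((S_hat - mean) * np.conj(X)) / sigma**2
    dlogp_dtheta = dlogp_dG * G * (1.0 - G)   # chain rule through the sigmoid
    return np.mean(scores[:, None] * dlogp_dtheta, axis=0)
```

A gradient-ascent update would then be `theta = theta + lr * estimate_gradient(...)` with an assumed step size `lr`; the averages over frames and utterances in the formula above are omitted here because the sketch handles one frame.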

Experimental conditions
• Training data
  • Target: Japanese speech from the ATR speech database (6632 utterances)
  • Noise: noise data of CHiME-3
  • SNR: random (-6 to 12 dB)
• Test data
  • Target: Japanese speech from the NEW-Japan speech database (300 utterances)
  • Noise: environmental noise database (4 types of ambient noise)
  • SNR: -6, 0, 6, 12 dB
• DNN architecture (fully-connected DNN)
  • Input: log-mel filterbank with 64 mel filters × 11 frames (5 previous & 5 future)
  • Hidden: 3 hidden layers, 1024 hidden units, and ReLU activation
  • Output: T-F mask with 64 mel filters
• Comparison method
  • PSA [Erdogan+, 2015] estimated by a fully-connected DNN
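
Mixing a target with noise at a chosen SNR, as in the training condition above, could look like the sketch below. It is an assumption-laden illustration: `mix_at_snr` is a hypothetical helper, the waveforms are random stand-ins, and drawing the SNR uniformly from -6 to 12 dB is an assumption, since the slide only says "random".

```python
import numpy as np

def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so the target-to-noise power ratio is `snr_db`, then add."""
    target_power = np.mean(target ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(target_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return target + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
target = rng.normal(size=16000)        # stand-in for a speech waveform
noise = 3.0 * rng.normal(size=16000)   # stand-in for a noise recording
snr_db = rng.uniform(-6.0, 12.0)       # assumed: uniform draw over the range
mixture, scaled_noise = mix_at_snr(target, noise, snr_db)

# Check: the realized SNR of the mixture components matches the request.
measured_snr_db = 10.0 * np.log10(np.mean(target ** 2) / np.mean(scaled_noise ** 2))
```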

Results (PESQ & STOI)
[Figure: STOI improvement [%] and PESQ improvement versus number of updates, one panel per input SNR: -6 dB, 0 dB, 6 dB, 12 dB.]
• OSQA scores increased as the number of updates increased
⇒ the proposed method could increase the black-box score

Experimental results

Input SNR: 0 dB
Cost func.   SDR (dB)        STOI (%)        PESQ
PSA          10.7 ± 1.86 *   86.3 ± 3.81     2.36 ± 0.19
STOI         10.4 ± 2.01     89.1 ± 3.52 *   2.36 ± 0.19
PESQ         10.7 ± 1.84 *   86.7 ± 3.92     2.48 ± 0.19 *

Input SNR: 6 dB
Cost func.   SDR (dB)        STOI (%)        PESQ
PSA          15.1 ± 1.67     92.8 ± 2.67     2.72 ± 0.17
STOI         15.5 ± 1.83 *   94.6 ± 2.51 *   2.65 ± 0.16
PESQ         14.8 ± 1.57     92.5 ± 2.91     2.84 ± 0.17 *

* = best (orange in the slides)
• PSA obtained high scores on average across all measurements
• The proposed method obtained the highest score for the target OSQA
⇒ specialized to the target OSQA (black-box score)!!

Result examples (T-F mask)
[Figure: spectrograms (frequency [kHz] vs. time [s]) of the observed signal $X_{\omega,\tau}$ (SNR: 0 dB) and the target source $S_{\omega,\tau}$, followed by the T-F mask $G_{\omega,\tau}$ and the output signal $\hat{S}_{\omega,\tau}$ for each cost function.]
Cost func.    SDR (dB)   PESQ   STOI (%)
MMSE (PSA)    11.9       2.44   88.4
PESQ          11.3       2.62   87.4
STOI          10.2       2.25   89.4

Result examples (sound)
• Male (SNR: 6 dB): Observation, PSA, STOI, PESQ
• Female (SNR: 0 dB): Observation, PSA, STOI, PESQ
Headphones are probably needed to hear the differences clearly.
After this talk, please listen to these files using headphones.

Conclusion
• Improve the perceptual quality of DNN source enhancement
  ⇒ Black-box optimization of DNN source enhancement
• Two approaches were introduced
  • T-F mask selection based on an RL-like approach
  • T-F mask estimation based on the policy gradient method
• OSQA scores increased as the number of updates increased
• DNN could be specialized for the target OSQA
  ⇒ Succeeded in training a DNN using a black-box cost!
Future work
• Other black-box costs
  • Human-in-the-loop audio system [Niwa+, 2017]
  • Fixed ASR system (word accuracy) [Watanabe+, 2014]
Q & A
