DNN-based frequency component
prediction for frequency-domain
audio source separation
Rui Watanabe, Daichi Kitamura (National Institute of Technology, Japan)
Hiroshi Saruwatari (The University of Tokyo, Japan)
Yu Takahashi, Kazunobu Kondo (Yamaha Corporation, Japan)
28th European Signal Processing Conference (EUSIPCO) SS-2.4
Background
 Audio source separation
– aims to separate audio sources such as speech, singing
voice, and musical instruments.
 Products with audio source separation
– Intelligent speakers
– Hearing-aid systems
– Music editing by users, etc.
Background
 Multichannel audio source separation (MASS)
– estimates the separation system from a multichannel
observation without knowing the mixing system
 Popular methods for each condition
– Underdetermined (number of mics < number of sources)
• Multichannel nonnegative matrix factorization (MNMF) [Sawada+, 2013]
• Approaches based on deep neural networks (DNNs)
– Overdetermined (number of mics ≥ number of sources)
• Frequency-domain independent component analysis [Smaragdis, 1998]
• Independent vector analysis [Kim+, 2007]
• Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
Background
 Frequency-domain MASS
– applies a short-time Fourier transform to the
observed time-domain signal to obtain the
spectrograms
– estimates a frequency-wise separation filter
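As a rough sketch of this pipeline, the following NumPy snippet computes spectrograms of a two-channel signal and applies an independent demixing matrix per frequency bin. The frame and hop sizes are illustrative, and the identity matrices are placeholders for the filters a real method would estimate:

```python
import numpy as np

def stft(x, frame_len=1024, hop=512):
    """Naive STFT: returns a (freq_bins, n_frames) complex spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop : t * hop + frame_len] * window
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

# Two-channel observation: M[c, f, j] = channel c, frequency f, frame j
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16000))
M = np.stack([stft(x[0]), stft(x[1])])

# Frequency-wise separation: an independent 2x2 demixing matrix per bin
n_ch, n_freq, n_frames = M.shape
W = np.tile(np.eye(n_ch, dtype=complex), (n_freq, 1, 1))  # placeholder filters
Y = np.einsum('fcd,dfj->cfj', W, M)  # separated spectrograms, same shape as M
print(Y.shape)
```

With identity filters the "separation" is a no-op; the point is only the data layout: each frequency bin gets its own small demixing matrix.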
Conventional frequency-domain MASS
 Multichannel nonnegative matrix factorization
(MNMF) [Sawada+, 2013]
– Unsupervised source separation algorithm that requires no
prior information or training
– Achieves high-quality MASS
– Requires a huge computational cost for estimating the
parameters
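MNMF couples a low-rank spectral model with per-source spatial covariances, which is where the heavy cost comes from. As a minimal, hedged illustration of just the low-rank spectral side, here is plain NMF on a toy power spectrogram with Euclidean multiplicative updates (this is not the full MNMF algorithm; the 13 bases match the experimental setting later in the talk, everything else is made-up toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
F, J, K = 64, 100, 13           # frequency bins, time frames, bases
X = rng.random((F, J)) + 1e-3   # toy nonnegative power spectrogram

T = rng.random((F, K))          # basis spectra
V = rng.random((K, J))          # activations
err0 = np.linalg.norm(X - T @ V)

for _ in range(100):            # multiplicative updates keep T, V nonnegative
    T *= (X @ V.T) / (T @ V @ V.T + 1e-12)
    V *= (T.T @ X) / (T.T @ T @ V + 1e-12)

err = np.linalg.norm(X - T @ V)
print(err < err0)  # reconstruction error decreases
```

MNMF repeats updates of this flavor over every frequency bin and additionally over channel-covariance parameters, which is why its cost grows so quickly with the number of frequencies.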
Proposed method: motivation
 High-quality MASS with low computational cost
 A new framework combining frequency-domain MASS and
DNN
– Separate specific frequencies via MNMF to obtain the separated
source components
– The source components of the remaining frequencies are
predicted by a DNN
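The band split itself can be sketched as below. The 16 kHz sampling rate and 256-point FFT are assumptions for illustration; the 4 kHz boundary matches the experiments later in the talk:

```python
import numpy as np

fs, n_fft = 16000, 256
freqs = np.fft.rfftfreq(n_fft, d=1 / fs)   # bin center frequencies (129 bins)
boundary = 4000.0                          # 4 kHz boundary, as in the talk
low = freqs <= boundary
high = ~low

rng = np.random.default_rng(0)
M = rng.standard_normal((len(freqs), 50))  # toy spectrogram (freq x time)
M_low, M_high = M[low], M[high]            # MNMF sees only M_low;
print(M_low.shape, M_high.shape)           # the DNN later fills in M_high
```

Only `M_low` enters the expensive MNMF iterations, so the per-iteration cost drops roughly in proportion to the discarded bins.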
Proposed method: interpretation of DNN
 DNN in the proposed framework can be interpreted in two
ways
1. Audio source separation of specific frequencies (the
high-frequency band)
• The low-frequency bands can be used to predict the high-frequency separated
components
2. Audio bandwidth expansion of each source
• The high-frequency band of the mixture is a strong cue for expanding the bandwidth
Proposed method: details of framework
 The observed multichannel spectrograms M1 and M2 are
divided into low- and high-frequency bands
 Apply MNMF to the low-frequency bands M1(L) and M2(L) to
obtain the separated source components Y1(L) and Y2(L)
– The high-frequency bands M1(H) and M2(H) are not separated in
this step
Proposed method: details of framework
 Input M1(H), Y1(L), and Y2(L) to the DNN
– The DNN outputs softmasks W1 and W2 such that the high-
frequency bands Y1(H) and Y2(H) are estimated from M1(H)
by applying the softmasks
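The masking step itself is simple; here is a sketch with random softmax logits standing in for the DNN output (the shapes are illustrative). Because the two masks sum to one in every bin, the two estimates add back to the mixture:

```python
import numpy as np

rng = np.random.default_rng(0)
F_hi, J = 64, 50
M1_high = rng.standard_normal((F_hi, J)) + 1j * rng.standard_normal((F_hi, J))

# Two soft masks that sum to one in every time-frequency bin
# (in the talk these come from the DNN; random logits stand in here)
logits = rng.standard_normal((2, F_hi, J))
masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

Y1_high = masks[0] * M1_high   # estimated high band of source 1
Y2_high = masks[1] * M1_high   # estimated high band of source 2
print(np.allclose(Y1_high + Y2_high, M1_high))  # True: masking is conservative
```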
Proposed method: input vector of DNN
 DNN prediction is performed for each time frame
(each column of the spectrograms)
– The input vector is a concatenation of several time frames
around the j-th frame in M1(H), Y1(L), and Y2(L)
– Normalize the concatenated vector
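A sketch of the per-frame input construction follows. The context width of ±2 frames is our assumption (the slide says only "several time frames"), and the talk appends the normalization coefficient to preserve volume information; here it is returned alongside the vector:

```python
import numpy as np

def make_input_vector(specs, j, context=2):
    """Concatenate 2*context+1 frames around frame j from each spectrogram,
    zero-padding outside the signal, then L2-normalize."""
    cols = []
    for S in specs:                  # e.g. [M1_high, Y1_low, Y2_low]
        F = S.shape[0]
        for t in range(j - context, j + context + 1):
            cols.append(S[:, t] if 0 <= t < S.shape[1] else np.zeros(F))
    v = np.concatenate(cols)
    norm = np.linalg.norm(v) + 1e-12  # volume coefficient, kept for the DNN
    return v / norm, norm

rng = np.random.default_rng(0)
specs = [rng.random((64, 50)), rng.random((65, 50)), rng.random((65, 50))]
v, norm = make_input_vector(specs, j=10)
print(v.shape)  # (970,) = (64 + 65 + 65) bins * 5 frames
```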
Proposed method: DNN architecture
 Simple fully connected network
– Four hidden layers with the Swish function and a
frequency-wise Softmax output
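A forward-pass sketch of such a network is below. All sizes and the random weights are placeholders (the slide does not specify the hidden width), biases and training are omitted; the point is the Swish hidden layers and the frequency-wise softmax that makes the two masks sum to one per bin:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

rng = np.random.default_rng(0)
in_dim, hidden, F_hi = 970, 512, 64      # assumed sizes for the sketch

# Four fully connected hidden layers with Swish, as on the slide
Ws = [rng.standard_normal((in_dim, hidden)) * 0.01]
Ws += [rng.standard_normal((hidden, hidden)) * 0.01 for _ in range(3)]
W_out = rng.standard_normal((hidden, 2 * F_hi)) * 0.01

h = rng.standard_normal(in_dim)          # one normalized input vector
for W in Ws:
    h = swish(h @ W)
logits = (h @ W_out).reshape(2, F_hi)    # two mask logits per frequency bin

# Frequency-wise softmax: the two masks sum to one at each frequency
e = np.exp(logits - logits.max(axis=0))
masks = e / e.sum(axis=0)
print(np.allclose(masks.sum(axis=0), 1.0))  # True
```

The speaker notes add that training minimizes the mean squared error between the masked estimate and the true separated source.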
Experiment 1: bandwidth expansion
 Validation of the proposed framework
– Evaluate bandwidth-expansion performance from the low-
frequency band of the true sources, with and without the mixture
– Confirm the validity of the proposed framework, which utilizes
mixture components to predict the separated sources
– Use the sources-to-artifact ratio (SAR) [Vincent+, 2006]
Experiment 1: bandwidth expansion
 Training conditions of DNN
– Training dataset: 100 drums (Dr.) and vocals (Vo.) songs in the SiSEC2016 database [Liutkus+, 2016]
– FFT length / shift length: 128 ms / 64 ms
– Boundary frequency: 4 kHz (half of the Nyquist frequency)
– Epochs / batch size: 1000 / 128
– Optimizer: Adam (learning rate = 0.001)
 Test dataset (SiSEC2011) [Araki+, 2012] for evaluation
– Song ID 1: dev1__bearlin-roads (Dr. & Vo.), 14.0 s
– Song ID 2: dev2__another_dreamer-the_ones_we_love (Dr. & Vo.), 25.0 s
– Song ID 3: dev2__fort_minor-remember_the_name (Dr. & Vo.), 24.0 s
– Song ID 4: dev2_ultimate_nz_tour (Dr. & Vo.), 18.0 s
Experiment 1: bandwidth expansion
 Mixture components help to predict the high-
frequency band of the separated sources
– SAR [dB] (DNN w/o mixture → DNN w/ mixture):
– Song ID 1: Dr. 21.1 → 28.0, Vo. 21.8 → 31.5
– Song ID 2: Dr. 22.0 → 21.8, Vo. 12.7 → 19.6
– Song ID 3: Dr. 15.0 → 20.4, Vo. 11.2 → 18.5
– Song ID 4: Dr. 11.0 → 18.2, Vo. 10.4 → 15.3
Experiment 2: evaluate proposed MASS framework
 Compare conventional fullband MNMF with the
proposed framework
– In terms of separation accuracy (source-to-distortion
ratio, SDR [Vincent+, 2006]) and computational efficiency
Experiment 2: evaluate proposed MASS framework
 Experimental conditions of MNMF
– Multichannel observed signal: two-channel mixtures produced by convolving E2A impulse responses with the sources of the test dataset
– Boundary frequency: 4 kHz
– Number of bases in MNMF: 13
Experiment 2: evaluate proposed MASS framework
 Song ID 4
– Since the number of frequencies is halved, the
proposed method is about twice as fast
– Fullband MNMF reached 13 dB
in 120 s
– The proposed method
reached 13 dB in less
than 50 s
Conclusion
 In this paper
– We proposed a computationally efficient audio source
separation framework combining frequency-domain
MASS and DNN-based frequency component prediction
– In the proposed framework, MASS is applied to only
the limited frequencies, and the DNN predicts the other
frequency components of the sources
– Compared with fullband MNMF, the proposed method
achieves almost the same quality at roughly half the
computational cost
Thank you for your attention!


Editor's Notes

  • #2: Hi everyone, I'm Rui Watanabe from the National Institute of Technology, Kagawa College, Japan. I'm going to talk about DNN-based frequency component prediction for frequency-domain audio source separation.
  • #3: Audio source separation is a technique for separating audio sources such as speech, singing voice, musical instruments, and so on. This technology can be used in many products, including intelligent speakers, hearing-aid systems, and music editing by users.
  • #4: In particular, multichannel audio source separation, MASS in short, estimates a separation system W using a multichannel observation without knowing the mixing system A. This technique can be divided into two categories, for underdetermined and overdetermined situations. The underdetermined situation is that the number of microphones is less than the number of sources in the mixture. For this case, multichannel nonnegative matrix factorization, MNMF in short, is a popular algorithm, and many DNN-based approaches have also been proposed. On the other hand, in the overdetermined situation, the number of microphones is equal to or larger than the number of sources. In this case, frequency-domain independent component analysis and independent low-rank matrix analysis are the most reliable approaches.
  • #5: In this presentation, we only treat frequency-domain MASS. In this algorithm, we perform a short-time Fourier transform on the observed time-domain signal and obtain the multichannel spectrograms. Then, we estimate a frequency-wise separation filter, which is applied to each frequency to estimate the separated source signals.
  • #6: Let me introduce the conventional frequency-domain MASS called multichannel nonnegative matrix factorization, MNMF in short. This is an unsupervised source separation algorithm and does not require any prior information or training. As an unsupervised technique, MNMF tends to provide high-quality separation performance. In MNMF, the observed multichannel signal is represented by the time-frequency-wise channel correlation matrices denoted by X. Since X is a frequency-by-time matrix whose elements are channel-by-channel matrices, it is a matrix of matrices, which is a fourth-order tensor. MNMF decomposes X into the source-wise spatial model and the low-rank spectral model of all the sources. Thus, by clustering the spectral model into each source using the estimated spatial model, source separation is achieved. However, it requires a huge computational cost for estimating the parameters because there are so many parameters in this model.
  • #7: Our motivation is that we want to achieve high-quality MASS with a low computational cost, and we propose a new source separation framework combining frequency-domain MASS and deep neural networks. In this framework, as an initial process, the mixture signal in specific frequencies is separated by MNMF, and we obtain the separated source components at those frequencies. In this figure, since only the low-frequency band of the mixture is input to MNMF, we get the separated components in the low-frequency band. Of course, the high-frequency bands of the separated sources are missing. As a post-process, we apply DNN-based frequency component prediction: the missing high-frequency bands of the separated sources are predicted by a DNN, where we input not only the separated low-frequency bands but also the high-frequency band of the mixture. Since the DNN prediction process is much faster than the MNMF process, we can reduce the total computational cost of this framework. For example, if we divide the frequency bands in half as in the figure, we can cut the computational time almost in half.
  • #8: In our framework, the post DNN process can be interpreted in two ways. First, the DNN is an audio source separation of specific frequencies, the high-frequency band in this figure. Please note that the low-frequency bands can be used for predicting the high-frequency separated components in our DNN model. Second, the DNN can be seen as a bandwidth expansion of each source because the high-frequency bands are predicted. In general, bandwidth expansion is a hard task even for a DNN. However, in our model, the high-frequency band of the mixture becomes a strong cue for achieving the bandwidth expansion.
  • #9: The details of the proposed method are as follows. First, the observed multichannel spectrograms M1 and M2 are divided into low- and high-frequency bands. Then, we apply MNMF to only the low-frequency bands M1(L) and M2(L) to obtain the separated source components Y1(L) and Y2(L). The high-frequency bands M1(H) and M2(H) are not separated in this step.
  • #10: Next, we input the high-frequency band of the mixture and the low-frequency bands of the separated sources as in this figure. The DNN outputs two soft masks W1 and W2 such that the high-frequency bands of the separated sources are calculated from M1(H) by multiplying them. The masks are matrices whose elements lie between zero and one, and the sum of corresponding elements in W1 and W2 is always unity.
  • #11: The DNN prediction is performed for each time frame j, which is each column of the spectrograms. To utilize the information along time in the prediction, the input vector for the DNN is a concatenation of several time frames around j in the mixture and the separated sources. Also, before we input the vector to the DNN, we normalize it to stabilize the model training, where the normalization coefficient is appended to keep the information of the signal volume.
  • #12: The DNN model in the proposed method is very simple. We have four fully connected hidden layers, and we apply the Swish function to each hidden layer. Just before the output, we apply a frequency-wise Softmax function to ensure that the sum of the masks equals unity at each frequency. The mean squared error between the separated source vector and the label vector is used as the loss function for DNN training.
  • #13: To confirm the validity of the proposed method, we conducted two experiments. In the first experiment, we evaluate the performance of the DNN model as bandwidth expansion. That is, the DNN restores the high-frequency band from the low-frequency band of the completely separated sources, where we confirm whether the high-frequency band of the mixture is effective by comparing these two models. Therefore, we can confirm the validity of the proposed framework, which utilizes mixture components for predicting the separated sources. As an evaluation score, we use the sources-to-artifact ratio, SAR, which shows the absence of artificial distortions in the estimated audio signals.
  • #14: This slide shows the experimental conditions. For the training of the DNN, we used 100 songs with drums and vocals in the SiSEC2016 database. The boundary frequency between the low- and high-frequency bands was set to 4 kHz, which is half of the Nyquist frequency. As the test dataset, we used four songs included in the SiSEC2011 database, where these songs are mixtures of drums and vocals.
  • #15: This is the result of bandwidth expansion. For each song, we show the SAR values of drums and vocals; a higher SAR indicates better audio quality. The two columns show the results of the DNN without the mixture and the DNN with the mixture. In almost all results, the DNN with the mixture outperforms the DNN without the mixture. From this result, we can confirm that the mixture components help to predict the high-frequency band of the separated sources. Thus, we can expect the proposed framework to perform effectively in a source separation task.
  • #16: Next, we conducted the MASS experiment, comparing conventional MNMF and the proposed framework. The conventional method separates the fullband mixture by MNMF, whereas the proposed framework separates only the low-frequency band by MNMF, and the high-frequency band is predicted by the DNN post-process. We expect that the computational time is reduced by skipping half of the frequencies in the MNMF process while the separation performance stays almost the same. As a source separation score, we used the source-to-distortion ratio, SDR, which represents the total performance of source separation, including both the degree of separation and the quality of the separated signals.
  • #17: The other conditions are shown in this slide. The DNN is trained using the same dataset as in the previous experiment. For the MASS test data, we produced two-channel observed mixtures by convolving E2A impulse responses with the drums and vocals sources of the test dataset, where the recording condition of the E2A impulse response is depicted here. The reverberation time of E2A is 300 ms. The number of bases in MNMF was set to 13, which provides the best result for both the conventional and proposed methods.
  • #18: This is the result for each song. The vertical axis indicates the SDR score averaged over 10 random initial values, and the horizontal axis shows the average elapsed time. The black line is the conventional method, fullband MNMF, and the red circles are the results of the proposed framework. Since the elapsed time depends on the number of iterations of the parameter updates in MNMF, for the proposed framework we plot the results at every 10 iterations of the MNMF process. The computational time for the DNN prediction process is included in each red circle, although the DNN process requires less than 0.1 s. In all the results, we can confirm the efficacy of the proposed method. In particular, Song ID 4 shows the result just as we expected, so let me explain the result of Song ID 4.
  • #19: In the case of Song ID 4, the proposed method achieves 13 dB in less than 50 s, whereas fullband MNMF converged to 13 dB in 120 s. This is because the number of frequencies in MNMF is reduced by half.
  • #20: In addition, the proposed method outperforms fullband MNMF in Song IDs 1, 2, and 4. In particular, the improvement in Song ID 1 is very large. The reason for these improvements might be that the proposed method performed a more accurate estimation of the high-frequency band sources based on the training with 100 songs. Also, in the case of Song ID 1, fullband MNMF might have been trapped in a bad local minimum during the iterative optimization.