DNN-based frequency component
prediction for frequency-domain
audio source separation
Rui Watanabe, Daichi Kitamura (National Institute of Technology, Japan)
Hiroshi Saruwatari (The University of Tokyo, Japan)
Yu Takahashi, Kazunobu Kondo (Yamaha Corporation, Japan)
28th European Signal Processing Conference (EUSIPCO) SS-2.4
Background
 Audio source separation
– aims to separate audio sources such as speech, singing
voice, and musical instruments.
 Products with audio source separation
– Intelligent speakers
– Hearing-aid systems
– Music editing by users, etc.
Background
 Multichannel audio source separation (MASS)
– estimates the separation system from a multichannel
observation without knowing the mixing system
 Popular methods for each condition
– Underdetermined (number of mics < number of sources)
• Multichannel nonnegative matrix factorization (MNMF) [Sawada+, 2013]
• Approaches based on deep neural networks (DNNs)
– Overdetermined (number of mics ≥ number of sources)
• Frequency-domain independent component analysis [Smaragdis, 1998]
• Independent vector analysis [Kim+, 2007]
• Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
Background
 Frequency-domain MASS
– applies a short-time Fourier transform to the
observed time-domain signal to obtain the
spectrograms
– estimates a frequency-wise separation filter
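As a rough sketch of this pipeline, the following NumPy snippet computes spectrograms of a two-channel signal and applies an independent demixing matrix per frequency bin. The frame and hop sizes are illustrative, and the identity matrices are placeholders for the filters a real method would estimate:

```python
import numpy as np

def stft(x, frame_len=1024, hop=512):
    """Naive STFT: returns a (freq_bins, n_frames) complex spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop : t * hop + frame_len] * window
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

# Two-channel observation: M[c, f, j] = channel c, frequency f, frame j
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16000))
M = np.stack([stft(x[0]), stft(x[1])])

# Frequency-wise separation: an independent 2x2 demixing matrix per bin
n_ch, n_freq, n_frames = M.shape
W = np.tile(np.eye(n_ch, dtype=complex), (n_freq, 1, 1))  # placeholder filters
Y = np.einsum('fcd,dfj->cfj', W, M)  # separated spectrograms, same shape as M
print(Y.shape)
```

With identity filters the "separation" is a no-op; the point is only the data layout: each frequency bin gets its own small demixing matrix.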
Conventional frequency-domain MASS
 Multichannel nonnegative matrix factorization
(MNMF) [Sawada+, 2013]
– Unsupervised source separation algorithm that requires no
prior information or training
– Achieves high-quality MASS
– Requires a huge computational cost for estimating the
parameters
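MNMF couples a low-rank spectral model with per-source spatial covariances, which is where the heavy cost comes from. As a minimal, hedged illustration of just the low-rank spectral side, here is plain NMF on a toy power spectrogram with Euclidean multiplicative updates (this is not the full MNMF algorithm; the 13 bases match the experimental setting later in the talk, everything else is made-up toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
F, J, K = 64, 100, 13           # frequency bins, time frames, bases
X = rng.random((F, J)) + 1e-3   # toy nonnegative power spectrogram

T = rng.random((F, K))          # basis spectra
V = rng.random((K, J))          # activations
err0 = np.linalg.norm(X - T @ V)

for _ in range(100):            # multiplicative updates keep T, V nonnegative
    T *= (X @ V.T) / (T @ V @ V.T + 1e-12)
    V *= (T.T @ X) / (T.T @ T @ V + 1e-12)

err = np.linalg.norm(X - T @ V)
print(err < err0)  # reconstruction error decreases
```

MNMF repeats updates of this flavor over every frequency bin and additionally over channel-covariance parameters, which is why its cost grows so quickly with the number of frequencies.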
Proposed method: motivation
 High-quality MASS with low computational cost
 A new framework combining frequency-domain MASS and
DNN
– Separate specific frequencies via MNMF to obtain the separated
source components
– The source components of the remaining frequencies are
predicted by a DNN
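The band split itself can be sketched as below. The 16 kHz sampling rate and 256-point FFT are assumptions for illustration; the 4 kHz boundary matches the experiments later in the talk:

```python
import numpy as np

fs, n_fft = 16000, 256
freqs = np.fft.rfftfreq(n_fft, d=1 / fs)   # bin center frequencies (129 bins)
boundary = 4000.0                          # 4 kHz boundary, as in the talk
low = freqs <= boundary
high = ~low

rng = np.random.default_rng(0)
M = rng.standard_normal((len(freqs), 50))  # toy spectrogram (freq x time)
M_low, M_high = M[low], M[high]            # MNMF sees only M_low;
print(M_low.shape, M_high.shape)           # the DNN later fills in M_high
```

Only `M_low` enters the expensive MNMF iterations, so the per-iteration cost drops roughly in proportion to the discarded bins.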
Proposed method: interpretation of DNN
 DNN in the proposed framework can be interpreted in two
ways
1. Audio source separation of specific frequencies (the
high-frequency band)
• The low-frequency bands can be used to predict the high-frequency separated
components
2. Audio bandwidth expansion of each source
• The high-frequency band of the mixture is a strong cue for expanding the bandwidth
Proposed method: details of framework
 The observed multichannel spectrograms M1 and M2 are
divided into low- and high-frequency bands
 Apply MNMF to the low-frequency bands M1(L) and M2(L) to
obtain the separated source components Y1(L) and Y2(L)
– The high-frequency bands M1(H) and M2(H) are not separated in
this step
Proposed method: details of framework
 Input M1(H), Y1(L), and Y2(L) to the DNN
– The DNN outputs softmasks W1 and W2 such that the high-
frequency bands Y1(H) and Y2(H) are estimated from M1(H)
by applying the softmasks
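The masking step itself is simple; here is a sketch with random softmax logits standing in for the DNN output (the shapes are illustrative). Because the two masks sum to one in every bin, the two estimates add back to the mixture:

```python
import numpy as np

rng = np.random.default_rng(0)
F_hi, J = 64, 50
M1_high = rng.standard_normal((F_hi, J)) + 1j * rng.standard_normal((F_hi, J))

# Two soft masks that sum to one in every time-frequency bin
# (in the talk these come from the DNN; random logits stand in here)
logits = rng.standard_normal((2, F_hi, J))
masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

Y1_high = masks[0] * M1_high   # estimated high band of source 1
Y2_high = masks[1] * M1_high   # estimated high band of source 2
print(np.allclose(Y1_high + Y2_high, M1_high))  # True: masking is conservative
```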
Proposed method: input vector of DNN
 DNN prediction is performed for each time frame
(each column of the spectrograms)
– The input vector is a concatenation of several time frames
around the j-th frame in M1(H), Y1(L), and Y2(L)
– Normalize the concatenated vector
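A sketch of the per-frame input construction follows. The context width of ±2 frames is our assumption (the slide says only "several time frames"), and the talk appends the normalization coefficient to preserve volume information; here it is returned alongside the vector:

```python
import numpy as np

def make_input_vector(specs, j, context=2):
    """Concatenate 2*context+1 frames around frame j from each spectrogram,
    zero-padding outside the signal, then L2-normalize."""
    cols = []
    for S in specs:                  # e.g. [M1_high, Y1_low, Y2_low]
        F = S.shape[0]
        for t in range(j - context, j + context + 1):
            cols.append(S[:, t] if 0 <= t < S.shape[1] else np.zeros(F))
    v = np.concatenate(cols)
    norm = np.linalg.norm(v) + 1e-12  # volume coefficient, kept for the DNN
    return v / norm, norm

rng = np.random.default_rng(0)
specs = [rng.random((64, 50)), rng.random((65, 50)), rng.random((65, 50))]
v, norm = make_input_vector(specs, j=10)
print(v.shape)  # (970,) = (64 + 65 + 65) bins * 5 frames
```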
Proposed method: DNN architecture
 Simple fully connected network
– Four hidden layers with the Swish function and a
frequency-wise Softmax output
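A forward-pass sketch of such a network is below. All sizes and the random weights are placeholders (the slide does not specify the hidden width), biases and training are omitted; the point is the Swish hidden layers and the frequency-wise softmax that makes the two masks sum to one per bin:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

rng = np.random.default_rng(0)
in_dim, hidden, F_hi = 970, 512, 64      # assumed sizes for the sketch

# Four fully connected hidden layers with Swish, as on the slide
Ws = [rng.standard_normal((in_dim, hidden)) * 0.01]
Ws += [rng.standard_normal((hidden, hidden)) * 0.01 for _ in range(3)]
W_out = rng.standard_normal((hidden, 2 * F_hi)) * 0.01

h = rng.standard_normal(in_dim)          # one normalized input vector
for W in Ws:
    h = swish(h @ W)
logits = (h @ W_out).reshape(2, F_hi)    # two mask logits per frequency bin

# Frequency-wise softmax: the two masks sum to one at each frequency
e = np.exp(logits - logits.max(axis=0))
masks = e / e.sum(axis=0)
print(np.allclose(masks.sum(axis=0), 1.0))  # True
```

The speaker notes add that training minimizes the mean squared error between the masked estimate and the true separated source.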
Experiment 1: bandwidth expansion
 Validation of the proposed framework
– Evaluate bandwidth-expansion performance from the low-
frequency band of the true sources, with and without the mixture
– Confirm the validity of the proposed framework, which utilizes
mixture components to predict the separated sources
– Use the sources-to-artifact ratio (SAR) [Vincent+, 2006]
Experiment 1: bandwidth expansion
 Training conditions of DNN
– Training dataset: 100 drums (Dr.) and vocals (Vo.) songs in the SiSEC2016 database [Liutkus+, 2016]
– FFT length / shift length: 128 ms / 64 ms
– Boundary frequency: 4 kHz (half of the Nyquist frequency)
– Epochs / batch size: 1000 / 128
– Optimizer: Adam (learning rate = 0.001)
 Test dataset (SiSEC2011) [Araki+, 2012] for evaluation
– Song ID 1: dev1__bearlin-roads (Dr. & Vo.), 14.0 s
– Song ID 2: dev2__another_dreamer-the_ones_we_love (Dr. & Vo.), 25.0 s
– Song ID 3: dev2__fort_minor-remember_the_name (Dr. & Vo.), 24.0 s
– Song ID 4: dev2_ultimate_nz_tour (Dr. & Vo.), 18.0 s
Experiment 1: bandwidth expansion
 Mixture components help to predict the high-
frequency band of the separated sources
– SAR [dB] (DNN w/o mixture → DNN w/ mixture):
– Song ID 1: Dr. 21.1 → 28.0, Vo. 21.8 → 31.5
– Song ID 2: Dr. 22.0 → 21.8, Vo. 12.7 → 19.6
– Song ID 3: Dr. 15.0 → 20.4, Vo. 11.2 → 18.5
– Song ID 4: Dr. 11.0 → 18.2, Vo. 10.4 → 15.3
Experiment 2: evaluate proposed MASS framework
 Compare conventional fullband MNMF with the
proposed framework
– In terms of separation accuracy (source-to-distortion
ratio, SDR [Vincent+, 2006]) and computational efficiency
Experiment 2: evaluate proposed MASS framework
 Experimental conditions of MNMF
– Multichannel observed signal: two-channel mixtures produced by convolving E2A impulse responses with the sources of the test dataset
– Boundary frequency: 4 kHz
– Number of bases in MNMF: 13
Experiment 2: evaluate proposed MASS framework
 Song ID 4
– Since the number of frequencies is halved, the
proposed method is about twice as fast
– Fullband MNMF reached 13 dB
in 120 s
– The proposed method
reached 13 dB in less
than 50 s
Conclusion
 In this paper
– We proposed a computationally efficient audio source
separation framework combining frequency-domain
MASS and DNN-based frequency component prediction
– In the proposed framework, MASS is applied to only
the limited frequencies, and the DNN predicts the other
frequency components of the sources
– Compared with fullband MNMF, the proposed method
achieves almost the same quality at roughly half the
computational cost
Thank you for your attention!


Editor's Notes

  • #2: Hi everyone, I'm Rui Watanabe from the National Institute of Technology, Kagawa College, Japan. I'm going to talk about DNN-based frequency component prediction for frequency-domain audio source separation.
  • #3: Audio source separation is a technique for separating audio sources such as speech, singing voice, musical instruments, and so on. This technology can be used in many products, including intelligent speakers, hearing-aid systems, and music editing by users.
  • #4: In particular, multichannel audio source separation, MASS in short, estimates a separation system W using a multichannel observation without knowing the mixing system A. This technique can be divided into two categories, for underdetermined and overdetermined situations. The underdetermined situation is that the number of microphones is less than the number of sources in the mixture. For this case, multichannel nonnegative matrix factorization, MNMF in short, is a popular algorithm, and many DNN-based approaches have also been proposed. On the other hand, in the overdetermined situation, the number of microphones is equal to or larger than the number of sources. In this case, frequency-domain independent component analysis and independent low-rank matrix analysis are the most reliable approaches.
  • #5: In this presentation, we only treat frequency-domain MASS. In this algorithm, we perform a short-time Fourier transform on the observed time-domain signal and obtain the multichannel spectrograms. Then, we estimate a frequency-wise separation filter, which is applied to each frequency to estimate the separated source signals.
  • #6: Let me introduce the conventional frequency-domain MASS called multichannel nonnegative matrix factorization, MNMF in short. This is an unsupervised source separation algorithm and does not require any prior information or training. As an unsupervised technique, MNMF tends to provide high-quality separation performance. In MNMF, the observed multichannel signal is represented by the time-frequency-wise channel correlation matrices denoted by X. Since X is a frequency-by-time matrix whose elements are channel-by-channel matrices, it is a matrix of matrices, which is a fourth-order tensor. MNMF decomposes X into the source-wise spatial model and the low-rank spectral model of all the sources. Thus, by clustering the spectral model into each source using the estimated spatial model, source separation is achieved. However, it requires a huge computational cost for estimating the parameters because there are so many parameters in this model.
  • #7: Our motivation is that we want to achieve high-quality MASS with a low computational cost, and we propose a new source separation framework combining frequency-domain MASS and deep neural networks. In this framework, as an initial process, the mixture signal in specific frequencies is separated by MNMF, and we obtain the separated source components at those frequencies. In this figure, since only the low-frequency band of the mixture is input to MNMF, we get the separated components in the low-frequency band. Of course, the high-frequency bands of the separated sources are missing. As a post-process, we apply DNN-based frequency component prediction: the missing high-frequency bands of the separated sources are predicted by a DNN, where we input not only the separated low-frequency bands but also the high-frequency band of the mixture. Since the DNN prediction process is much faster than the MNMF process, we can reduce the total computational cost of this framework. For example, if we divide the frequency bands in half as in the figure, we can cut the computational time almost in half.
  • #8: In our framework, the post DNN process can be interpreted in two ways. First, the DNN is an audio source separation of specific frequencies, the high-frequency band in this figure. Please note that the low-frequency bands can be used for predicting the high-frequency separated components in our DNN model. Second, the DNN can be seen as a bandwidth expansion of each source because the high-frequency bands are predicted. In general, bandwidth expansion is a hard task even for a DNN. However, in our model, the high-frequency band of the mixture becomes a strong cue for achieving the bandwidth expansion.
  • #9: The details of the proposed method are as follows. First, the observed multichannel spectrograms M1 and M2 are divided into low- and high-frequency bands. Then, we apply MNMF to only the low-frequency bands M1(L) and M2(L) to obtain the separated source components Y1(L) and Y2(L). The high-frequency bands M1(H) and M2(H) are not separated in this step.
  • #10: Next, we input the high-frequency band of the mixture and the low-frequency bands of the separated sources as in this figure. The DNN outputs two soft masks W1 and W2 such that the high-frequency bands of the separated sources are calculated from M1(H) by multiplying them. The masks are matrices whose elements lie between zero and one, and the sum of corresponding elements in W1 and W2 is always unity.
  • #11: The DNN prediction is performed for each time frame j, which is each column of the spectrograms. To utilize the information along time in the prediction, the input vector for the DNN is a concatenation of several time frames around j in the mixture and the separated sources. Also, before we input the vector to the DNN, we normalize it to stabilize the model training, where the normalization coefficient is appended to keep the information of the signal volume.
  • #12: The DNN model in the proposed method is very simple. We have four fully connected hidden layers, and we apply the Swish function to each hidden layer. Just before the output, we apply a frequency-wise Softmax function to ensure that the sum of the masks equals unity at each frequency. The mean squared error between the separated source vector and the label vector is used as the loss function for DNN training.
  • #13: To confirm the validity of the proposed method, we conducted two experiments. In the first experiment, we evaluate the performance of the DNN model as bandwidth expansion. That is, the DNN restores the high-frequency band from the low-frequency band of the completely separated sources, where we confirm whether the high-frequency band of the mixture is effective by comparing these two models. Therefore, we can confirm the validity of the proposed framework, which utilizes mixture components for predicting the separated sources. As an evaluation score, we use the sources-to-artifact ratio, SAR, which shows the absence of artificial distortions in the estimated audio signals.
  • #14: This slide shows the experimental conditions. For the training of the DNN, we used 100 songs with drums and vocals in the SiSEC2016 database. The boundary frequency between the low- and high-frequency bands was set to 4 kHz, which is half of the Nyquist frequency. As the test dataset, we used four songs included in the SiSEC2011 database, where these songs are mixtures of drums and vocals.
  • #15: This is the result of bandwidth expansion. For each song, we show the SAR values of drums and vocals; a higher SAR indicates better audio quality. The two columns show the results of the DNN without the mixture and the DNN with the mixture. In almost all results, the DNN with the mixture outperforms the DNN without the mixture. From this result, we can confirm that the mixture components help to predict the high-frequency band of the separated sources. Thus, we can expect the proposed framework to perform effectively in a source separation task.
  • #16: Next, we conducted the MASS experiment, comparing conventional MNMF and the proposed framework. The conventional method separates the fullband mixture by MNMF, whereas the proposed framework separates only the low-frequency band by MNMF, and the high-frequency band is predicted by the DNN post-process. We expect that the computational time is reduced by skipping half of the frequencies in the MNMF process while the separation performance stays almost the same. As a source separation score, we used the source-to-distortion ratio, SDR, which represents the total performance of source separation, including both the degree of separation and the quality of the separated signals.
  • #17: The other conditions are shown in this slide. The DNN is trained using the same dataset as in the previous experiment. For the MASS test data, we produced two-channel observed mixtures by convolving E2A impulse responses with the drums and vocals sources of the test dataset, where the recording condition of the E2A impulse response is depicted here. The reverberation time of E2A is 300 ms. The number of bases in MNMF was set to 13, which provides the best result for both the conventional and proposed methods.
  • #18: This is the result for each song. The vertical axis indicates the SDR score averaged over 10 random initial values, and the horizontal axis shows the average elapsed time. The black line is the conventional method, fullband MNMF, and the red circles are the results of the proposed framework. Since the elapsed time depends on the number of iterations of the parameter updates in MNMF, for the proposed framework we plot the results at every 10 iterations of the MNMF process. The computational time for the DNN prediction process is included in each red circle, although the DNN process requires less than 0.1 s. In all the results, we can confirm the efficacy of the proposed method. In particular, Song ID 4 shows the result just as we expected, so let me explain the result of Song ID 4.
  • #19: In the case of Song ID 4, the proposed method achieves 13 dB in less than 50 s, whereas fullband MNMF converged to 13 dB in 120 s. This is because the number of frequencies in MNMF is reduced by half.
  • #20: In addition, the proposed method outperforms fullband MNMF in Song IDs 1, 2, and 4. In particular, the improvement in Song ID 1 is very large. The reason for these improvements might be that the proposed method performed a more accurate estimation of the high-frequency band sources based on the training with 100 songs. Also, in the case of Song ID 1, fullband MNMF might have been trapped in a bad local minimum during the iterative optimization.