Sparse Approximation of Gram Matrices for GMMN-based Speech Synthesis
Tomoki Koriyama (1), Shinnosuke Takamichi (1), Takao Kobayashi (2)
(1) The University of Tokyo, (2) Tokyo Institute of Technology
Block diagonal (BLOCK; conventional)
•Calculate CMMD for each minibatch
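One way to read the block diagonal approximation is that evaluating the cost per minibatch keeps only the diagonal blocks of the full Gram matrix and treats all cross-minibatch kernel values as zero. The sketch below illustrates that reading with an RBF kernel; the helper names (`rbf`, `gram`, `block_diagonal_gram`) and the toy data are assumptions made for illustration, not taken from the poster.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram(xs, ys, gamma=1.0):
    """Dense Gram matrix between two lists of vectors."""
    return [[rbf(x, y, gamma) for y in ys] for x in xs]


def block_diagonal_gram(xs, batch_size, gamma=1.0):
    """Full-size Gram matrix in which entries outside the minibatch blocks are zero."""
    n = len(xs)
    out = [[0.0] * n for _ in range(n)]
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Only kernel values inside one minibatch are ever computed.
        block = gram(xs[start:stop], xs[start:stop], gamma)
        for i in range(start, stop):
            for j in range(start, stop):
                out[i][j] = block[i - start][j - start]
    return out


# Hypothetical one-dimensional "frames", split into minibatches of two.
frames = [[0.0], [0.1], [2.0], [2.1]]
approx = block_diagonal_gram(frames, batch_size=2)
```

In practice the zero entries would never be stored; each block is used on its own, so the cost grows with the minibatch size rather than with the total number of frames.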
Random Fourier feature (RFF) [Rahimi & Recht, 2008]
•Replace the kernel function with an inner product of M-dimensional vectors
•The radial basis function (RBF) kernel can be approximated in this way
•Use the RFF-based low-rank matrix for the input Gram matrix
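The RFF idea can be sketched as follows, using the standard construction from Rahimi & Recht for the kernel exp(-gamma * ||x - y||^2): draw frequencies from a Gaussian with variance 2 * gamma, draw phases uniformly from [0, 2*pi], and take scaled cosines. The function names, the value of `gamma`, and the test vectors are assumptions for illustration; the poster does not state its exact parameterization.

```python
import math
import random


def rbf(x, y, gamma=1.0):
    """Exact RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_rff(dim, num_features, gamma=1.0, seed=0):
    """Return a feature map phi such that phi(x) . phi(y) approximates rbf(x, y)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # spectral density of this RBF kernel is N(0, 2*gamma*I)
    ws = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def phi(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(ws, bs)
        ]

    return phi


# Hypothetical two-dimensional inputs; 4000 features keeps the Monte Carlo error small.
phi = make_rff(dim=2, num_features=4000, gamma=0.5, seed=0)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf(x, y, gamma=0.5)
```

Stacking phi(x) for N inputs gives an N x M matrix Z, and Z Z^T is a Gram matrix approximation of rank at most M, which is where the low-rank structure comes from.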
Background
Conditional maximum mean discrepancy (CMMD) [Ren et al., 2016]
•Distance between conditional distributions, given by the linear operators of the training and generated data
•CMMD can be estimated using the kernel trick
•Problem: computationally infeasible for speech synthesis
•Purpose: examine the effect of approximations that reduce computation
[Figure label: feature map for infinite-dimensional Hilbert spaces]
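For reference, the kernel-trick estimate of CMMD in Ren et al. (2016) is commonly written in the form below. The symbols are introduced here for illustration and do not appear on the poster: K denotes Gram matrices on the input (conditioning) variables, L denotes Gram matrices on the output variables, the subscripts d and s mark training data and generated samples, and lambda is a regularization constant.

```latex
\widehat{\mathrm{CMMD}}^2
  = \mathrm{Tr}\!\left(K_d \tilde{K}_d^{-1} L_d \tilde{K}_d^{-1}\right)
  + \mathrm{Tr}\!\left(K_s \tilde{K}_s^{-1} L_s \tilde{K}_s^{-1}\right)
  - 2\,\mathrm{Tr}\!\left(K_{sd} \tilde{K}_d^{-1} L_{ds} \tilde{K}_s^{-1}\right),
\qquad \tilde{K} = K + \lambda I
```

Under this form, each N x N inverse costs on the order of N^3 operations, which is presumably why the full estimate is impractical when N is the number of speech frames.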
Abstract
•Investigate the training method of sampling-based speech synthesis based on the generative moment matching network (GMMN)
•GMMN's cost function, CMMD, is computationally infeasible
•Examine approximation methods for GMMN
– Gram matrix approximation: block diagonal / random Fourier feature (RFF)
– Minibatch selection: random / clustering of bottleneck features
Conclusions
•Performed subjective tests not only on naturalness but also on inter-utterance variation
•RFF and clustering-based minibatch selection gave higher subjective scores in inter-utterance variation
•Future work:
– Evaluate the trade-off between variation and naturalness
– Compare with other methods, e.g., simply adding noise, GAN, VAE
– Sequence-level modeling
Experiments
[Figure: Naturalness. Mean opinion score with 95% confidence intervals on a scale from 1 (too bad) to 5 (very good) for MSE, BLOCK-RAND, BLOCK-CLST, RFF-RAND, RFF-CLST, and Vocoded; significance marked at p<0.01]
[Figure: Inter-utterance variation. Mean opinion score with 95% confidence intervals for the question "Is a pair of randomly generated samples different?" on a scale from 1 (completely equivalent) to 5 (very different) for the same six conditions; significance marked at p<0.05 and p<0.001]
Method        0th mel-cep   1st mel-cep   log F0 [cent]   Phone duration [ms]
LOCAL-RAND    0.023         0.012         15.8            2.46
LOCAL-CLST    0.053         0.022         18.2            3.50
RFF-RAND      0.021         0.007         1.5             3.77
RFF-CLST      0.049         0.027         14.0            5.47
[Figure: Gram matrix of the RBF kernel (rank = N = 1000) and its RFF approximation (rank = M = 100); tick labels at -1, 0, and 1]
[Figure: GMMN-based speech synthesis. Labels: context, DNN trained by MSE criterion, bottleneck feature, acoustic feature, random value, DNN for sampling trained by CMMD criterion, perturbation, CMMD]
Generative moment matching network (GMMN) [Ren et al., 2016]-based speech synthesis [Takamichi et al., 2017]
•Generate a perturbation using a DNN trained by the CMMD criterion
Approximation methods
Gram matrix approximation
[Figure: Gram matrix for output and Gram matrix for input, each labeled "low rank"; size label: minibatch size]
Minibatch selection methods
Random selection (RAND; conventional)
•Problem: Gram matrices tend to be redundant and sparse
Use clustering results as minibatches (CLST)
•Perform K-means clustering on the bottleneck features
•Similar to Gaussian-process-based VC [Pilkington et al., 2011]
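A minimal sketch of clustering-based minibatch selection, assuming plain Lloyd-style K-means on the bottleneck features and using each resulting cluster as one minibatch. The function name `kmeans_minibatches`, the initialization scheme, and the toy features are illustrative assumptions; the poster does not describe how clusters are initialized or how cluster sizes are matched to the minibatch size.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_minibatches(features, k, iters=20, seed=0):
    """Cluster feature vectors with K-means and return one index list per cluster."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct data points.
    centroids = [list(features[i]) for i in rng.sample(range(len(features)), k)]
    assign = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid for every frame.
        assign = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [features[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    batches = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
    return [b for b in batches if b]


# Hypothetical 2-dimensional bottleneck features forming two obvious groups.
bottleneck = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
batches = kmeans_minibatches(bottleneck, k=2)
```

Grouping similar frames puts the large kernel values inside the same minibatch, so fewer of them are discarded than with random selection. Cluster sizes vary, unlike a fixed minibatch size.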
Experimental conditions
Database              1 female speaker, 203 sentences; each sentence was repeated 5 times
Train/valid data      5 x 150 / 5 x 26 utterances
Test data             27 utterances; 5 samples are generated
Acoustic features     0th-39th mel-cepstrum, log F0, and 5-band aperiodicity with their delta and delta-delta, and V/UV
Model configurations  Random value: 3 dim; bottleneck feature: 32 dim; minibatch size: 10000; RFF: 1024 dim
Result panels: Naturalness; Inter-utterance variation; Average standard deviation of sampled synthetic speech parameters
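The poster does not spell out how the average standard deviation of sampled parameters is computed. One plausible reading, sketched below, takes the standard deviation across the generated samples at each frame and then averages over frames. The function name, the use of the population standard deviation, and the assumption that all samples have the same number of frames are illustrative choices, not details from the source.

```python
import statistics


def average_sample_std(samples):
    """Average over frames of the across-sample standard deviation.

    samples: list of equal-length trajectories of one parameter,
    one trajectory per generated sample of the same utterance.
    """
    per_frame = [statistics.pstdev(values) for values in zip(*samples)]
    return statistics.fmean(per_frame)


# Hypothetical two-frame trajectories from two generated samples.
toy_samples = [[1.0, 2.0], [3.0, 2.0]]
result = average_sample_std(toy_samples)
```

A larger value means the generated samples of the same utterance differ more from one another.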
