Sparse Approximation of Gram Matrices for GMMN-based Speech Synthesis

Tomoki Koriyama¹, Shinnosuke Takamichi¹, Takao Kobayashi²
¹The University of Tokyo, ²Tokyo Institute of Technology
Block diagonal (BLOCK; conventional)
•Calculate CMMD for each minibatch
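One way to read the block-diagonal approximation: computing CMMD separately for each minibatch ignores kernel values between samples in different minibatches, so the full Gram matrix is in effect treated as block diagonal. The NumPy sketch below illustrates this on toy data. The function names, the kernel width `sigma`, and the data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    """RBF Gram matrix between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def block_diagonal_gram(x, batch_size, sigma=1.0):
    """Keep only within-minibatch kernel values; cross-batch entries stay zero."""
    n = len(x)
    gram = np.zeros((n, n))
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        gram[start:stop, start:stop] = rbf_gram(x[start:stop], x[start:stop], sigma)
    return gram

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 3))            # toy data: 12 samples, 3 dimensions
full = rbf_gram(x, x)                   # N x N entries
block = block_diagonal_gram(x, batch_size=4)
kept_fraction = np.count_nonzero(block) / block.size
```

With 12 samples and minibatches of 4, only the three 4 x 4 diagonal blocks are kept, so the cost per block no longer grows with the total number of samples.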
Random Fourier feature (RFF) [Rahimi & Recht, 2008]
•Replace the kernel function by an inner product of M-dimensional vectors
•Radial basis function (RBF) kernel can be approximated:
•Use RFF-based low-rank matrix for input Gram matrix
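The RFF idea can be sketched with the standard construction from Rahimi & Recht: draw random frequencies and phases, map each input to M cosine features, and use the inner product of those features in place of the RBF kernel. The exact formula on the poster is not recoverable, so the form below (Gaussian frequencies with scale 1/sigma, uniform phases) is the textbook version and should be read as an assumption. The names and toy sizes are also illustrative.

```python
import numpy as np

def rff_features(x, n_features, sigma, rng):
    """Map x (N x D) to N x M random Fourier features for an RBF kernel."""
    dim = x.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(dim, n_features))  # random frequencies
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)         # random phases
    return np.sqrt(2.0 / n_features) * np.cos(x @ omega + phase)

def rbf_gram(x, sigma):
    """Exact RBF Gram matrix, exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = (np.sum(x ** 2, axis=1)[:, None]
          + np.sum(x ** 2, axis=1)[None, :]
          - 2.0 * x @ x.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(600, 5))                 # N = 600 toy inputs
z = rff_features(x, n_features=256, sigma=2.0, rng=rng)
approx = z @ z.T                              # low-rank Gram matrix, rank <= M
exact = rbf_gram(x, sigma=2.0)
mean_err = np.mean(np.abs(approx - exact))
rank = np.linalg.matrix_rank(z)               # equals the rank of z @ z.T
```

Because the approximate Gram matrix is a product of an N x M matrix with its transpose, its rank is at most M, which is what makes the regularized inverse cheaper than for the full N x N matrix.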
Conditional maximum mean discrepancy (CMMD) [Ren et al., 2016]
•Distance between conditional distributions, given by the linear operators of training and generated data
•The feature maps are into infinite-dimensional Hilbert spaces
•CMMD can be estimated using the kernel trick
•Problem: computationally infeasible for speech synthesis
•Purpose: examine the effect of approximation to reduce computation
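To make the cost concrete, here is a minimal NumPy sketch of a kernel-based CMMD estimate. It uses the regularized trace form obtained by expanding the squared Hilbert-Schmidt norm of the difference between two empirical conditional embedding operators, which is the form I understand Ren et al. (2016) to use. The regularizer `lam`, the shared kernel width `sigma`, and all names are assumptions for illustration.

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    """RBF Gram matrix between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def cmmd_squared(x_d, y_d, x_s, y_s, lam=1e-2, sigma=1.0):
    """Squared CMMD between training pairs (x_d, y_d) and generated pairs (x_s, y_s)."""
    K_d = rbf_gram(x_d, x_d, sigma)     # input Gram matrix, training data
    K_s = rbf_gram(x_s, x_s, sigma)     # input Gram matrix, generated data
    K_ds = rbf_gram(x_d, x_s, sigma)    # input cross Gram matrix
    L_d = rbf_gram(y_d, y_d, sigma)     # output Gram matrix, training data
    L_s = rbf_gram(y_s, y_s, sigma)     # output Gram matrix, generated data
    L_sd = rbf_gram(y_s, y_d, sigma)    # output cross Gram matrix
    # Regularized inverses: each costs O(N^3) in the number of samples.
    Kd_inv = np.linalg.inv(K_d + lam * np.eye(len(x_d)))
    Ks_inv = np.linalg.inv(K_s + lam * np.eye(len(x_s)))
    term_d = np.trace(K_d @ Kd_inv @ L_d @ Kd_inv)
    term_s = np.trace(K_s @ Ks_inv @ L_s @ Ks_inv)
    cross = np.trace(K_ds @ Ks_inv @ L_sd @ Kd_inv)
    return float(term_d + term_s - 2.0 * cross)

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 3))            # toy conditioning inputs
y = rng.normal(size=(40, 2))            # toy outputs
same = cmmd_squared(x, y, x, y)         # identical data sets
shifted = cmmd_squared(x, y, x, y + 3.0)  # outputs moved away from the training data
```

The N x N inverses are the reason the exact estimate does not scale to the frame counts typical of speech synthesis, which motivates the approximations examined here.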
Poster sections: Abstract, Background, Approximation methods, Experiments, Conclusions
Abstract
•Investigate the training method of sampling-based speech synthesis based on generative moment matching network (GMMN)
•GMMN's cost function, CMMD, is computationally infeasible
•Examine approximation methods for GMMN
– Gram matrix approximation: block diagonal / random Fourier feature (RFF)
– Minibatch selection: random / clustering of bottleneck feature
Conclusions
•Performed subjective tests not only on naturalness but also on inter-utterance variation
•RFF and clustering-based minibatch selection gave higher subjective scores in inter-utterance variation
•Future work:
– Evaluate trade-off between variation and naturalness
– Compare with other methods, e.g., simply adding noise, GAN, VAE
– Sequence-level modeling
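[Figure: Naturalness. Mean opinion scores from 1 (too bad) to 5 (very good) for MSE, BLOCK-RAND, BLOCK-CLST, RFF-RAND, RFF-CLST, and Vocoded, with 95% confidence intervals; significance marked at p<0.01]
[Figure: Inter-utterance variation. Mean opinion scores for the question "Is a pair of randomly-generated samples different?" from 1 (completely equivalent) to 5 (very different) for MSE, BLOCK-RAND, BLOCK-CLST, RFF-RAND, RFF-CLST, and Vocoded, with 95% confidence intervals; significance marked at p<0.05 and p<0.001]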
Table: Standard deviation average of sampled synthetic speech parameters
Method       0th mel-cep   1st mel-cep   log F0 [cent]   phone duration [ms]
LOCAL-RAND   0.023         0.012         15.8            2.46
LOCAL-CLST   0.053         0.022         18.2            3.50
RFF-RAND     0.021         0.007         1.5             3.77
RFF-CLST     0.049         0.027         14.0            5.47
[Figure: Gram matrix of the RBF kernel (rank = N = 1000) and Gram matrix of its RFF approximation (rank = M = 100); color scale from -1 to 1]
Generative moment matching network (GMMN) [Ren et al., 2016]-based speech synthesis [Takamichi et al., 2017]
•Generate perturbation using DNN trained by CMMD criterion
[Figure: block diagram of GMMN-based speech synthesis. Components: DNN trained by MSE criterion; DNN for sampling trained by CMMD criterion. Signals: context, acoustic feature, bottleneck feature, random value, perturbation, CMMD]
Gram matrices approximation
Minibatch selection methods
Random selection (RAND; conventional)
•Problem: Gram matrices tend to be sparse and therefore redundant
Use clustering results as minibatch (CLST)
•Perform K-means clustering for bottleneck feature
•Similar to Gaussian-process-VC [Pilkington et al., 2011]
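The clustering-based selection above can be sketched as follows: run K-means on the bottleneck features and use each cluster as one minibatch, so that samples inside a minibatch are close to each other and their kernel values are less likely to be near zero. This is a plain NumPy K-means written for illustration. The number of clusters, the iteration count, and the toy features are assumptions; only the 32-dimensional bottleneck size is taken from the experimental conditions.

```python
import numpy as np

def kmeans_labels(features, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per row of features."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(features), size=n_clusters, replace=False)
    centers = features[init]                      # fancy indexing returns a copy
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iters):
        # Squared distance from every sample to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:                  # leave empty clusters unchanged
                centers[k] = members.mean(axis=0)
    return labels

def clustered_minibatches(bottleneck, n_clusters, seed=0):
    """Return a list of index arrays, one per non-empty cluster."""
    labels = kmeans_labels(bottleneck, n_clusters, seed=seed)
    batches = [np.flatnonzero(labels == k) for k in range(n_clusters)]
    return [b for b in batches if len(b) > 0]

rng = np.random.default_rng(0)
bottleneck = rng.normal(size=(300, 32))           # toy 32-dim bottleneck features
batches = clustered_minibatches(bottleneck, n_clusters=6)
all_indices = np.sort(np.concatenate(batches))
```

Unlike random selection, the resulting minibatches have unequal sizes, so any per-minibatch cost would need to handle variable-length batches.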
[Figure: Gram matrix for output and Gram matrix for input, each annotated as low rank; axis label: minibatch size]
Experimental conditions
Acoustic features: 0-39th mel-cepstrum, log F0, and 5-band aperiodicity with their delta and delta-delta, and V/UV
Model configurations: random value: 3 dim; bottleneck feature: 32 dim; minibatch size: 10000; RFF: 1024 dim
Database: 1 female speaker, 203 sentences; each sentence was repeated 5 times
Train/valid data: 5 x 150 / 5 x 26 utterances
Test data: 27 utterances; 5 samples are generated
Result panels: Naturalness; Inter-utterance variation; Standard deviation average of sampled synthetic speech parameters
