41. References: 1
[van den Oord+ 2016] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. W. Senior, K. Kavukcuoglu, “WaveNet: a generative model for raw audio,” arXiv preprint,
arXiv:1609.03499, 15 pages, 2016.
[Tamamori+ 2017] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, T. Toda, “Speaker-dependent
WaveNet vocoder,” Proc. INTERSPEECH, pp. 1118‒1122, 2017.
[Itakura+ 1968] F. Itakura, S. Saito, “Analysis synthesis telephony based upon the maximum likelihood
method,” Proc. ICA, C-5-5, pp. C17‒20, 1968.
[van den Oord+ 2017] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van
den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen,
N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, D. Hassabis, “Parallel WaveNet: fast high-
fidelity speech synthesis,” arXiv preprint, arXiv:1711.10433, 11 pages, 2017.
[Yamamoto+ 2020] R. Yamamoto, E. Song, J.-M. Kim, “Parallel WaveGAN: a fast waveform generation
model based on generative adversarial networks with multi-resolution spectrogram,” Proc. IEEE ICASSP, pp.
6199‒6203, 2020.
[Wang+ 2019] X. Wang, S. Takaki, J. Yamagishi, “Neural source-filter waveform models for statistical
parametric speech synthesis,” IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 28, pp. 402‒415, 2019.
[Schroeder+ 1985] M. Schroeder, B. Atal, “Code-excited linear prediction (CELP): high-quality speech at
very low bit rates,” Proc. IEEE ICASSP, pp. 937‒940, 1985.
[Wu+ 2021a] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda, “Quasi-periodic parallel WaveGAN: a
non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural
network,” IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 792‒806, 2021.
[Wu+ 2021b] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda, “Quasi-periodic WaveNet: an
autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network,”
IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 1134‒1148, 2021.
42. References: 2
[Yoneyama+ 2023a] R. Yoneyama, Y.-C. Wu, T. Toda, “High-fidelity and pitch-controllable neural vocoder
based on unified source-filter networks,” IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 31, pp.
3717‒3729, 2023.
[Yoneyama+ 2023b] R. Yoneyama, Y.-C. Wu, T. Toda, “Source-Filter HiFi-GAN: fast and pitch controllable
high-fidelity neural vocoder,” Proc. IEEE ICASSP, 5 pages, 2023.
[Yamagishi+ 2019] J. Yamagishi, C. Veaux, K. MacDonald, “CSTR VCTK Corpus: English multi-speaker
corpus for CSTR voice cloning toolkit (version 0.92),” https://doi.org/10.7488/ds/2645, The Centre for
Speech Technology Research (CSTR), University of Edinburgh, 2019.
[Kong+ 2020] J. Kong, J. Kim, J. Bae, “HiFi-GAN: generative adversarial networks for efficient and high
fidelity speech synthesis,” Proc. NeurIPS, pp. 17022‒17033, 2020.
[Matsubara+ 2023] K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, H. Kawai, “Harmonic-
Net: fundamental frequency and speech rate controllable fast neural vocoder,” IEEE/ACM Trans. Audio,
Speech & Lang. Process., Vol. 31, pp. 1902‒1915, 2023.
[Ogita+ 2025] K. Ogita, R. Yoneyama, W.-C. Huang, T. Toda, “VAE-SiFiGAN: source-filter HiFi-GAN based
on variational autoencoder representations with enhanced pitch controllability,” Proc. EUSIPCO, 5 pages,
2025 (in press).
[Canon 2021] Canon, “NamineRitsu Singing DB Ver.2,”
https://drive.google.com/drive/folders/1XA2cm3UyRpAk_BJb1LTytOWrhjsZKbSN, 2021.
[Kaneko+ 2022] T. Kaneko, K. Tanaka, H. Kameoka, S. Seki, “iSTFTNet: fast and lightweight mel-
spectrogram vocoder incorporating inverse short-time Fourier transform,” Proc. IEEE ICASSP, pp. 6207‒6211,
2022.
[Yoneyama+ 2024] R. Yoneyama, A. Miyashita, R. Yamamoto, T. Toda, “Wavehax: aliasing-free neural
waveform synthesis based on 2D convolution and harmonic prior for reliable complex spectrogram
estimation,” arXiv preprint, arXiv:2411.06807, 13 pages, 2024.
[Pantazis+ 2011] Y. Pantazis, O. Rosec, Y. Stylianou, “Adaptive AM-FM signal decomposition with
application to speech analysis,” IEEE Trans. Audio, Speech & Lang. Process., Vol. 19, No. 2, pp. 290‒300,
2011.
43. References: 3
[Chen+ 2025a] S. Chen, T. Toda, “Sequence-wise speech waveform modeling via gradient descent
optimization of quasi-harmonic parameters,” IEEE Trans. Audio, Speech & Lang. Process., Vol. 33, pp. 319‒
332, 2025.
[Chen+ 2025b] S. Chen, T. Toda, “QHARMA-GAN: quasi-harmonic neural vocoder based on autoregressive
moving average model,” arXiv preprint, arXiv:2507.01611, 16 pages, 2025.
[Takamichi+ 2019] S. Takamichi, K. Mitsui, Y. Saito, T. Koriyama, N. Tanji, H. Saruwatari, “JVS corpus: free
Japanese multi-speaker voice corpus,” arXiv preprint, arXiv:1908.06248, 4 pages, 2019.
[Huang+ 2021] R. Huang, F. Chen, Y. Ren, J. Liu, C. Cui, Z. Zhao, “Multi-Singer: fast multi-singer singing
voice vocoder with a large-scale corpus,” Proc. ACM Multimedia, pp. 3945‒3954, 2021.