Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable Model
Tomoki Koriyama 1,2, Takao Kobayashi 1
1 Tokyo Institute of Technology, Japan  2 Currently with The University of Tokyo, Japan
Abstract
•Prosody labeling is important for TTS but laborious
•Use deep Gaussian process (DGP), a Bayesian deep model, to represent prosodic context labels as latent variables
•Propose semi-supervised modeling for partially-annotated data, in which the latent variables are used in place of annotated prosody
•Perform experiments in which only around 10% of the training data is fully annotated
Conclusions & Future Work
•The proposed semi-supervised modeling with DGP
 – Achieved scores comparable to the case where all training data was fully annotated
 – Outperformed the case using the data without accent information
•Future work
 – Use diverse speech data including low-resource languages
 – Compare other generative models, e.g., Bayesian NN, VAE, normalizing flow
Background
•To construct TTS, we require manual annotation of prosody labels, which takes considerable time and effort
End-to-end approach [Wang et al., 2017][Sotelo et al., 2017]
•End-to-end TTS is language-dependent
•Japanese TTS still requires prosodic context labels [Yasuda et al., 2019]
[Figure: model structure for (a) fully-annotated data and (b) partially-annotated data. A common function maps contexts to acoustic features for both data types. For fully-annotated data, manually annotated accent-dependent contexts pass through an encode function; for partially-annotated data, a latent variable serves as an accent information representation in their place. Accent-independent contexts are used in both cases.]
[Figure: example of generated F0 contours (F0 [Hz], 150–400, vs. time [s], 0–3) for (a) FULL, (b) LABELED, (c) W/O ACCENT, and (d) PROPOSED.]
[Figure: model illustrations of (a) GP regression, (b) GPLVM, (c) DGP regression, and (d) DGP-LVM, each inferring outputs for the utterance /ha-shi-ga/.]
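Panel (a) above is plain GP regression. As background only (not the paper's implementation, which uses the ArcCos kernel and sparse variational inference), a textbook computation of the GP regression posterior with an RBF kernel can be sketched in NumPy:

import numpy as np

def gp_posterior(X_train, y_train, X_test, lengthscale=1.0, noise=0.1):
    # GP regression posterior mean/variance with an RBF kernel.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)
    K = k(X_train, X_train) + noise**2 * np.eye(len(X_train))
    Ks = k(X_test, X_train)
    Kss = k(X_test, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha                       # posterior mean at test points
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(Kss) - (v**2).sum(0)      # posterior marginal variances
    return mean, var

Panels (b)–(d) extend this: GPLVM treats the inputs themselves as unknowns, DGP stacks such regressions in layers, and DGP-LVM combines both.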
Purpose
– Incorporate DGP with LVM into prosody modeling
– Apply latent representation to semi-supervised learning
Problems in Japanese pitch accent
•Word meanings depend on accent
•Accent is not purely lexical; it varies with speakers and contexts
GP, GPLVM, Deep Gaussian process
•Infer the posteriors of functions and latent variables simultaneously [Damianou & Lawrence, 2013][Titsias & Lawrence, 2009] (the underlying variational bound is sketched below)
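Written schematically, the variational bound of [Titsias & Lawrence, 2009] that this joint inference builds on has the following form, where Y are observations, X latent variables, F function values, and U inducing outputs (our shorthand, not the paper's notation):

\log p(Y) \;\ge\; \mathbb{E}_{q(X)\,q(F,U)}\big[\log p(Y \mid F)\big]
\;-\; \mathrm{KL}\big(q(X)\,\|\,p(X)\big)
\;-\; \mathrm{KL}\big(q(U)\,\|\,p(U)\big)

Maximizing this bound with Gaussian q(X) and q(U) yields posteriors over both the functions and the latent variables at once; stacking layers gives the DGP bound of [Damianou & Lawrence, 2013].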
Latent variable approach
•Gaussian process latent variable model (GPLVM) can represent unannotated prosody information as latent variables [Moungsri et al., 2016] (a minimal sketch follows this list)
•Single-layer GP lacks expressiveness in modeling
•Deep Gaussian process (DGP) [Damianou & Lawrence, 2013]
 – Deep model with stacked Bayesian kernel regressions
 – Outperformed 1-layer GP and DNN in TTS [Koriyama & Kobayashi, 2019]
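To make the GPLVM idea concrete, here is a minimal, self-contained sketch of MAP training of a GPLVM with PyTorch autograd. It is illustrative only and not the authors' implementation: it uses an RBF kernel and point-estimate latents, whereas the paper uses variational inference and the ArcCos kernel.

import torch

def rbf_kernel(X, lengthscale=1.0, variance=1.0, jitter=1e-6):
    # Gram matrix K[i, j] = v * exp(-|x_i - x_j|^2 / (2 l^2))
    sq = torch.cdist(X, X).pow(2)
    K = variance * torch.exp(-0.5 * sq / lengthscale**2)
    return K + jitter * torch.eye(X.shape[0])

def gplvm_neg_log_marginal(X, Y):
    # Negative GP log marginal likelihood of Y given latents X,
    # plus a standard-normal prior over X (MAP objective).
    N, D = Y.shape
    K = rbf_kernel(X)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(Y, L)                 # K^{-1} Y
    nll = 0.5 * (Y * alpha).sum() + D * torch.log(torch.diagonal(L)).sum()
    return nll + 0.5 * X.pow(2).sum()

# Toy data: 50 observations of a 5-dim signal driven by a 2-dim latent.
torch.manual_seed(0)
Z = torch.randn(50, 2)
Y = torch.sin(Z @ torch.randn(2, 5)) + 0.1 * torch.randn(50, 5)

X = torch.randn(50, 2, requires_grad=True)  # latent variables to be learned
opt = torch.optim.Adam([X], lr=0.01)
for step in range(500):
    opt.zero_grad()
    loss = gplvm_neg_log_marginal(X, Y)
    loss.backward()
    opt.step()

In the paper, such latents stand in for missing accent contexts, a variational posterior q(X) replaces the point estimate, and the GP layers are stacked into a DGP.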
[Figure: Japanese pitch-accent examples with low–high pitch patterns. "ha-shi ga" can mean "edge is", "bridge is", or "chopsticks are" depending on the accent; "a-ta-ma" ("head") is accented differently by Speaker 1 and Speaker 2.]
Semi-supervised learning of prosody using DGP-LVM
•Use both fully-annotated and partially-annotated data
•The partially-annotated data does not include accent information
•Infer the posteriors of functions and latent variables by variational Bayes (a schematic data-handling sketch follows below)
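The following PyTorch sketch shows, under our reading of the poster, how the two data types can share one model: fully-annotated utterances feed their annotated accent-dependent contexts through an encode function, while partially-annotated utterances use per-utterance trainable latent vectors in that slot. The encoder, regressor, and output dimension are hypothetical placeholders; the paper uses a DGP trained by variational Bayes, not the MLP used here to keep the sketch runnable.

import torch
import torch.nn as nn

LATENT_DIM, ACCENT_DIM, INDEP_DIM, OUT_DIM = 3, 137, 477, 127  # OUT_DIM is hypothetical

class SemiSupervisedProsodyModel(nn.Module):
    def __init__(self, num_unlabeled_utts):
        super().__init__()
        # Encode function: annotated accent-dependent context -> latent space.
        self.encode = nn.Linear(ACCENT_DIM, LATENT_DIM)
        # One trainable latent vector per partially-annotated utterance,
        # used in place of the missing accent annotation.
        self.latents = nn.Embedding(num_unlabeled_utts, LATENT_DIM)
        # Common function shared by both data types (a DGP in the paper).
        self.common = nn.Sequential(
            nn.Linear(LATENT_DIM + INDEP_DIM, 32), nn.ReLU(),
            nn.Linear(32, OUT_DIM),
        )

    def forward(self, indep_ctx, accent_ctx=None, utt_id=None):
        if accent_ctx is not None:          # fully-annotated path
            z = self.encode(accent_ctx)
        else:                               # partially-annotated path
            z = self.latents(utt_id)
        return self.common(torch.cat([z, indep_ctx], dim=-1))

model = SemiSupervisedProsodyModel(num_unlabeled_utts=1434)
# Labeled frames: the accent context is available.
y1 = model(torch.randn(8, INDEP_DIM), accent_ctx=torch.randn(8, ACCENT_DIM))
# Unlabeled frames: the latent vector of utterance 42 is used instead.
y2 = model(torch.randn(8, INDEP_DIM), utt_id=torch.full((8,), 42, dtype=torch.long))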
Experiments

Acoustic feature distortions

Method      MCD [dB]  RMSE of log F0 [cent]
FULL        4.79      167
LABELED     5.54      228
W/O ACCENT  4.75      207
PROPOSED    4.76      178
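Both distortion measures in the table are standard; a hedged NumPy sketch of the usual definitions (mel-cepstral distortion in dB and log-F0 RMSE in cents) is given below. The paper's exact frame-selection and dimension conventions may differ.

import numpy as np

def mcd_db(c_ref, c_syn):
    # Mel-cepstral distortion in dB between two (frames x dim) cepstra,
    # conventionally excluding the 0th (energy) coefficient.
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff**2, axis=1)))

def logf0_rmse_cents(f0_ref, f0_syn):
    # RMSE of log F0 in cents over frames voiced in both contours.
    voiced = (f0_ref > 0) & (f0_syn > 0)
    cents = 1200.0 * np.log2(f0_ref[voiced] / f0_syn[voiced])
    return np.sqrt(np.mean(cents**2))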
Experimental conditions

Database              Japanese speech data of a female speaker in the XIMERA corpus [Kawai et al., 2004]
Train / Valid / Test  1533 (119 min) / 60 / 60 utterances
                      – 99 fully-annotated utterances
                      – 1434 partially-annotated utterances
Input features        accent-dependent/independent contexts: 137/477 dims
Acoustic features     40-dim mel-cepstrum, log F0, 5-band aperiodicity, and their Δ + Δ²
Model architecture    latent space: 3 dims, hidden layers: 32 dims, 1024 inducing points, 5 layers, ArcCos kernel [Cho & Saul, 2009]
Model training        optimizer: Adam, learning rate: 0.01
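Since the conditions list the ArcCos kernel of [Cho & Saul, 2009], a small NumPy sketch of the order-1 arc-cosine kernel may help; that order 1 is the variant used in the paper is our assumption.

import numpy as np

def arccos_kernel_order1(X, Y):
    # Order-1 arc-cosine kernel (Cho & Saul, 2009):
    # k(x, y) = (1/pi) * |x| * |y| * (sin t + (pi - t) * cos t),
    # where t is the angle between x and y.
    nx = np.linalg.norm(X, axis=1, keepdims=True)      # (N, 1)
    ny = np.linalg.norm(Y, axis=1, keepdims=True).T    # (1, M)
    cos_t = np.clip((X @ Y.T) / (nx * ny), -1.0, 1.0)
    t = np.arccos(cos_t)
    return (nx * ny / np.pi) * (np.sin(t) + (np.pi - t) * np.cos(t))

K = arccos_kernel_order1(np.random.randn(5, 3), np.random.randn(4, 3))

This kernel corresponds to an infinitely wide layer of ReLU-like units, which is one reason it pairs naturally with deep GP architectures.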
Methods (number of utterances for each method)

Method      Fully-annotated data   Partially-annotated data
            (w/ accent info.)      (w/o accent info.)
FULL        1533                   ‒
LABELED     99                     ‒
W/O ACCENT  ‒                      1533
PROPOSED    99                     1434

Subjective evaluation
[Figure: subjective evaluation results (not recoverable from the extracted text).]