Conditional Generative Model for Audio
Presenters: Hyeong-Seok Choi & Juheon Lee
2019/11/30 (Sat.)
Hyeong-Seok Choi (kekepa15@snu.ac.kr)
• Affiliation: Seoul National University, Music & Audio Research Group
• Research interests: Audio Source Separation, Speech Enhancement, Self-supervised representation learning & generation, Singing Voice Synthesis

Juheon Lee (juheon2@snu.ac.kr)
• Affiliation: Seoul National University, Music & Audio Research Group
• Research interests: Singing Voice Synthesis, Lyric-to-audio Alignment, Cover Song Identification, Abnormal Sound Detection, Choreography Generation
Generative models
Dataset: Examples drawn from 𝑝(𝑿)
𝒙~𝑝(𝑿)
Generative models
Explicit models: infer the parameters of 𝑝(𝑿; 𝜽). (i.e., how likely is this cat?)
VAE, Autoregressive models, …
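To make "inferring the parameters of 𝑝(𝑿; 𝜽)" concrete, here is a toy sketch (not from the slides): fit a 1-D Gaussian by maximum likelihood, then ask how likely a new sample is, which is exactly the question an explicit model can answer.

```python
import numpy as np

# Toy "explicit" generative model: fit theta = (mu, sigma) of a 1-D
# Gaussian p(x; theta) by maximum likelihood, then score new samples.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # examples drawn from p(X)

mu, sigma = data.mean(), data.std()               # MLE of theta

def log_likelihood(x, mu, sigma):
    """log p(x; theta) for a Gaussian."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# A sample near the mean is far more likely than an outlier.
print(log_likelihood(2.0, mu, sigma) > log_likelihood(10.0, mu, sigma))  # True
```

VAEs and autoregressive models play the same game with far more expressive parameterizations of the density.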
Generative models
Implicit models: I don’t care about the parameters, just give me some nice cats when I
roll the dice! (sampling)
GANs…
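By contrast, a minimal sketch of an implicit model (the generator below is a made-up stand-in for a trained GAN generator): all we get is a sampler g(z), never a written-down density.

```python
import numpy as np

# Toy "implicit" model: a generator g(z) that maps dice rolls (noise z)
# to samples. We never evaluate p(x; theta); we can only sample from it.
rng = np.random.default_rng(0)

def generator(z):
    # Fixed nonlinear map standing in for a trained GAN generator.
    return np.tanh(2.0 * z) + 0.1 * z

z = rng.standard_normal(5)   # roll the dice
samples = generator(z)       # "nice cats", with no density attached
print(samples.shape)         # (5,)
```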
Conditional generative models
Application dependent modeling
1. Given a piano roll, I want to generate an expressive piano performance
2. Given a mel-spectrogram, I want to generate a raw audio signal
3. Given a linguistic feature, I want to generate a speech signal
…
(Diagram: Condition (controllability) → Generative Model → Output (signal))
Conditional generative models
What does a conditional generative model do?
• Reconstruct a signal from given information (i.e., fill in the missing information)

Level of "missing information"? (from a music & audio point of view)

Condition abstraction level, from abstract (sparse) to realistic (dense):
• Instrument class / sound class
• Non-expressive score → MIDI score w/ velocity etc.
• Linguistic feature → linguistic feature w/ pitch
• Audio features (mel-spectrogram)
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Representative applications
• TTS
• Next-generation codec
• Speech enhancement
• Some representative models
• Autoregressive generation
• WaveNet
• WaveRNN
• Parallel generation
• Parallel WaveNet
• WaveGlow/FloWaveNet
• MelGAN
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: training)
Training: the upsample net expands the mel-spectrogram (Input 1) to sample rate; the GRUs take the previous waveform samples wave[0:N-1] (Input 2) and are trained to predict wave[1:N] (ground truth), classifying each sample into one of 2^bits classes.
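The training setup above can be sketched as follows; the linear quantization, hop size, and nearest-neighbour "upsample net" are simplifying assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

# Sketch of WaveRNN-style training data preparation (assumption: plain
# linear quantization and nearest-neighbour upsampling of the mel).
bits = 8
hop = 256                      # waveform samples per mel frame

def make_training_pair(wave, mel):
    """wave: float in [-1, 1], shape (N,); mel: shape (T, n_mels), N = T*hop."""
    q = np.clip(((wave + 1.0) / 2.0 * (2**bits - 1)).round(), 0, 2**bits - 1)
    q = q.astype(np.int64)                 # class labels in [0, 2**bits)
    cond = np.repeat(mel, hop, axis=0)     # crude "upsample net": frame -> sample rate
    x_in = q[:-1]                          # Input 2: wave[0:N-1]
    target = q[1:]                         # ground truth: wave[1:N]
    return x_in, target, cond[1:]

wave = np.sin(np.linspace(0, 40, 4 * hop))
mel = np.zeros((4, 80))
x_in, target, cond = make_training_pair(wave, mel)
print(x_in.shape, target.shape, cond.shape)   # (1023,) (1023,) (1023, 80)
```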
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: inference)
Inference: starting from a zero state and a zero input sample, the network, conditioned on the upsampled mel-spectrogram, samples x[1], feeds it back as the next input, samples x[2], and so on up to x[N], which forms the output waveform.
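The inference loop above in code, with a placeholder `step` standing in for the trained GRUs and output softmax (hypothetical, for illustration only):

```python
import numpy as np

# Sketch of the autoregressive inference loop: start from a zero state and a
# zero sample, then feed each sampled output back in as the next input.
rng = np.random.default_rng(0)
num_classes = 256

def step(prev_sample, cond_frame, state):
    # Placeholder for the network: returns a categorical distribution.
    logits = rng.standard_normal(num_classes)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state

def generate(cond, state=None):
    x, out = 0, []                             # zero sample, zero state
    for frame in cond:                         # one conditioning row per sample
        probs, state = step(x, frame, state)
        x = rng.choice(num_classes, p=probs)   # sample, then feed back
        out.append(x)
    return np.array(out)

audio = generate(np.zeros((100, 80)))
print(audio.shape)    # (100,)
```

This sample-by-sample feedback is exactly why autoregressive vocoders are slow at inference, and why the parallel models on the previous slide exist.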
Conditional generative models: applications
Example of densely conditioned models: Vocoders
Practical/interesting application of vocoders: Generative speech enhancement
1. Parametric Resynthesis with Neural Vocoders (WASPAA 2019)
2. Generative Speech Enhancement Based on Cloned Networks (WASPAA 2019)
3. Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement (arXiv, 2019)
4. A Speech Synthesis Approach for High Quality Speech Separation and Generation (IEEE Signal Processing Letters, 2019)
Key idea: combine the power of the discriminative & generative approaches.
Pros: almost no artifacts. Cons: inaccurate pronunciation in low-SNR conditions.
Pipeline: noisy mel-spectrogram → Separator (discriminative) → estimated clean mel-spectrogram → Synthesizer/vocoder (generative) → synthesized clean raw wave
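The two-stage pipeline can be sketched as below; both stages are no-op placeholders for trained networks, and only the data flow is meant to match the slide.

```python
import numpy as np

# Sketch of the generative enhancement pipeline: a discriminative separator
# cleans the mel-spectrogram, then a vocoder resynthesizes the waveform.
def separator(noisy_mel):
    # Discriminative stage: e.g. a masking network; identity mask here.
    mask = np.ones_like(noisy_mel)
    return noisy_mel * mask                 # estimated clean mel-spectrogram

def vocoder(mel, hop=256):
    # Generative stage: a neural vocoder would synthesize audio from the mel;
    # here a silent placeholder of the right length.
    return np.zeros(mel.shape[0] * hop)

noisy_mel = np.abs(np.random.randn(100, 80))
clean_wave = vocoder(separator(noisy_mel))
print(clean_wave.shape)   # (25600,)
```

Because the vocoder only ever sees a mel-spectrogram, any residual noise the separator removes simply never reaches the waveform, which is where the "almost no artifacts" property comes from.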
Some of my preliminary results… (audio samples: noisy input vs. generated output)
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Some other practical/interesting application: Next-generation codec
1. Wavenet based low rate speech coding (ICASSP 2018)
2. Low bit-rate speech coding with VQ-VAE and a WaveNet decoder (ICASSP 2019)
3. Improving Opus low bit rate quality with neural speech synthesis (arXiv, 2019)
Key idea:
1. Deep learning is good at learning a compressed representation (Encoder).
2. Deep learning is good at synthesizing (Decoder).
Pros: good bit rate (bps). Cons: ???
Pipeline: speech → Encoder (server 1) → compressed representation → Decoder (server 2) → reconstructed signal (speech)
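A minimal sketch of the encoder/decoder split, assuming a VQ-style discrete codebook as in the VQ-VAE paper above; the codebook here is random and purely illustrative, not a real codec.

```python
import numpy as np

# Sketch of the neural codec idea: the encoder (server 1) compresses speech
# frames into discrete code indices, which are transmitted and turned back
# into decoder features (server 2); a vocoder would then synthesize audio.
codebook = np.random.default_rng(0).standard_normal((64, 16))  # 64 codes -> 6 bits/frame

def encode(frames):
    """Map each 16-dim frame to the index of its nearest codebook entry."""
    d = ((frames[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.argmin(axis=1)                 # the compressed representation

def decode(indices):
    return codebook[indices]                # decoder-side features

frames = np.random.default_rng(1).standard_normal((50, 16))
codes = encode(frames)
print(codes.shape, codes.max() < 64)   # (50,) True
```

With 64 codes per frame, each frame costs only 6 bits on the wire, which is why such codecs reach very low bit rates.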
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Training stage
Generation stage: TEXT + MIDI → conditioned waveform
Main Idea : Disentangling Formant mask & Pitch skeleton
• We wanted pitch and text information to be modelled as independent
acoustic features, and we designed the network to reflect that
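One way to read the disentangling idea as code (a simplifying assumption for illustration, not the paper's exact formulation): the generated mel-spectrogram is the elementwise product of a text-driven formant mask and a MIDI-driven pitch skeleton.

```python
import numpy as np

# Sketch of the disentangling idea: text conditions a "formant mask"
# (pronunciation/timbre envelope), MIDI conditions a "pitch skeleton"
# (harmonic structure), and the mel-spectrogram is their product.
# Both inputs below are synthetic stand-ins for network outputs.
T, n_mels = 200, 80
formant_mask = np.random.default_rng(0).random((T, n_mels))   # from the text branch

pitch_skeleton = np.zeros((T, n_mels))
pitch_skeleton[:, ::10] = 1.0         # crude harmonics of a held note (MIDI branch)

mel = formant_mask * pitch_skeleton   # pitch decides WHERE energy sits,
print(mel.shape)                      # text decides HOW MUCH -> (200, 80)
```

Because the two factors are combined only at the output, changing the text input cannot move the harmonics, and changing the pitch input cannot change the pronunciation, which is exactly what the generation results on the next slides demonstrate.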
Generation Result
Input text : “do re mi fa sol ra ti do”
Input pitch : [C D E F G A B C]
Generated audio :
(Figures: formant mask, pitch skeleton, generated mel-spectrogram)
Generation Result
Input text : “do do do do do do do do”
Input pitch : [C D E F G A B C]
Generated audio :
(Figures: formant mask, pitch skeleton, generated mel-spectrogram)
Generation Result
Input text : “do re mi fa sol ra ti do”
Input pitch : [C C C C C C C C]
Generated audio :
(Figures: formant mask, pitch skeleton, generated mel-spectrogram)
Generation Result
Input text “아리랑 아리랑 아라리오 아리랑 고개로 넘어간다 나를 버리고 가시는 님은 십리도 못 가서 발병 난다”
“arirang arirang arario arirang go gae ro neom eo gan da na reul beo ri go ga shi neun nim eun sib ri do mot ga seo bal byung nan da”
Input pitch
Generated result
Generated singing
Audio samples
Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, with a Singer Identity Encoder added.
• Disentangles singer identity into Timbre and Singing Style.
Generation Result
Singer A Singer B
Generation Result
Singer A Singer B
Timbre A + Style B Timbre B + Style A
Conditional generative models: applications
What is lacking?
1. Deterministic: Condition (controllability) → Generative Model → Output (signal)
2. Some stochasticity: Condition (controllability; possibly itself a signal, image/audio) + Randomness (uncertainty, creativity) → Generative Model → Output (signal, audio/image): a multi-modal transform

This can be seen as a supervised way of disentangling representation.
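The difference between the deterministic and stochastic settings in one toy sketch (g below is a made-up function, not a trained model):

```python
import numpy as np

# Contrast a deterministic conditional model, out = g(c), with a stochastic
# one, out = g(c, z): the same condition with different dice rolls yields
# different, but condition-consistent, outputs.
rng = np.random.default_rng(0)

def g(c, z=None):
    base = np.tanh(c)                 # the part pinned down by the condition
    if z is None:
        return base                   # deterministic: one output per condition
    return base + 0.1 * z             # stochastic: a distribution of outputs

c = np.ones(4)
a = g(c, rng.standard_normal(4))
b = g(c, rng.standard_normal(4))
print(np.allclose(g(c), g(c)), np.allclose(a, b))   # True False
```

The noise z carries whatever the condition does not determine: uncertainty in the mapping, or creativity in the output.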
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
Input: the music sequence (steps 1-8) concatenated with the pose sequence (steps 1-8); output: the estimated pose sequence shifted one step ahead (steps 2-9).
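The input/target layout above can be sketched as follows (the feature dimensions are assumptions for illustration):

```python
import numpy as np

# Sketch of the listen-to-dance training layout: concatenate the pose
# sequence for steps 1..T-1 with the frame-aligned music features, and
# train the encoder-decoder to output the poses shifted one step ahead.
def make_io(poses, music):
    """poses: (T, pose_dim); music: (T, feat_dim), frame-aligned."""
    x = np.concatenate([poses[:-1], music[:-1]], axis=1)   # steps 1..T-1
    y = poses[1:]                                          # steps 2..T
    return x, y

poses = np.random.default_rng(0).standard_normal((9, 34))   # e.g. 17 joints x 2
music = np.random.default_rng(1).standard_normal((9, 40))
x, y = make_io(poses, music)
print(x.shape, y.shape)   # (8, 74) (8, 34)
```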
Blackpink - 불장난 (Playing with Fire)
Red velvet - Rookie
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
• Using an autoencoder, obtain reduced acoustic features
• With a temporal-index mask, transform the frame-indexed acoustic features into beat-indexed acoustic features
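A minimal sketch of the frame-to-beat re-indexing, assuming beat positions are given and using average pooling per inter-beat segment (the paper's exact pooling may differ):

```python
import numpy as np

# Turn frame-indexed features into beat-indexed features: pool (here,
# average) the feature frames inside each inter-beat segment.
def beat_index(features, beats):
    """features: (T, d); beats: increasing frame indices with beats[-1] <= T."""
    return np.stack([features[a:b].mean(axis=0)
                     for a, b in zip(beats[:-1], beats[1:])])

feats = np.arange(20, dtype=float).reshape(10, 2)   # 10 frames, 2 feature dims
beats = [0, 4, 7, 10]                               # 3 inter-beat segments
print(beat_index(feats, beats).shape)   # (3, 2)
```

Re-indexing by beat aligns the music features with the natural rhythmic unit of dance motion, instead of the arbitrary analysis frame rate.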
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
Learning How to Compose
Generation
Learning How to Move
• Decompose the dance sequence using kinematic beats
• With a VAE, disentangle dance into an initial pose + movement
Learning How to Compose
• Learns how to meaningfully compose a sequence of basic
movements into a dance conditioned on the input music.
• Conditional adversarial training for music & dance correspondence
(Model diagram; markers show where conditioning is applied)
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR2020 openreview)
Stochastic part
(Uncertainty)
(Figure: faces generated for four speakers, Spk1-Spk4)
Fix z & Change c (speech embedding)
Fix c & Change z (random sampling)
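The two qualitative probes (fix z & change c, fix c & change z) in toy form; G below is a random linear stand-in for the trained face generator, purely for illustration:

```python
import numpy as np

# Two probes on a conditional generator G(c, z): fix the noise z and sweep
# the speech embedding c (identity changes), or fix c and sweep z (the
# stochastic residual -- e.g. pose, expression -- changes).
rng = np.random.default_rng(0)
Wc, Wz = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))

def G(c, z):
    return np.tanh(Wc @ c + Wz @ z)    # toy "face" of 8 features

z0 = rng.standard_normal(3)
faces_per_speaker = [G(c, z0) for c in rng.standard_normal((4, 3))]  # fix z, vary c

c0 = rng.standard_normal(3)
variants = [G(c0, z) for z in rng.standard_normal((4, 3))]           # fix c, vary z
print(len(faces_per_speaker), len(variants))   # 4 4
```

The second sweep makes the uncertainty explicit: the voice pins down only part of the face, and z fills in the rest.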
Thank You!
Questions?
Editor's Notes

• #10: The application changes depending on what is being filled in.
• #15:
  • Generator architecture: a stack of transposed convolutional layers upsamples the input sequence; each transposed convolutional layer is followed by a stack of residual blocks.
  • Induced receptive field: residual blocks use dilations so that temporally far-apart output activations of each layer have significantly overlapping inputs; the receptive field of a stack of dilated convolution layers increases exponentially with the number of layers.
  • Discriminator, multiscale architecture: 3 discriminators (identical structure) operate on different audio scales: the original scale, and 2x and 4x downsampled. Each discriminator is biased to learn features for a different frequency range of the audio.
  • Window-based objective: each individual discriminator is a Markovian window-based discriminator (analogous to image patches, Isola et al. (2017)). The discriminator learns to classify between distributions of small audio chunks; overlapping large windows maintain coherence across patches.
• #29: 1. Dance; 2. Audio signal generation; 3. Aumon (reflecting stochasticity); 4. Future work, with the example of image generation with stochasticity.
• #42: A face cannot be 100% inferred from the voice alone.