Conditional Generative Model for Audio
Presenters: Hyeong-Seok Choi & Juheon Lee
2019/11/30 (Sat.)
Hyeong-Seok Choi (kekepa15@snu.ac.kr)
• Affiliation: Seoul National University, Music & Audio Research Group
• Research interests: Audio Source Separation, Speech Enhancement, Self-supervised representation learning & generation, Singing Voice Synthesis

Juheon Lee (juheon2@snu.ac.kr)
• Affiliation: Seoul National University, Music & Audio Research Group
• Research interests: Singing Voice Synthesis, Lyric-to-audio Alignment, Cover Song Identification, Abnormal Sound Detection, Choreography Generation
Generative models
Dataset: Examples drawn from 𝑝(𝑿)
𝒙~𝑝(𝑿)
Generative models
Explicit models: infer the parameters of 𝑝(𝑿; 𝜽). (i.e., how likely is this cat?)
VAE, Autoregressive models, …
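To make "inferring the parameters of 𝑝(𝑿; 𝜽)" concrete, here is a toy sketch (not from the slides): fit a 1-D Gaussian by maximum likelihood, then ask how likely a new sample is, which is exactly the question an explicit model can answer.

```python
import numpy as np

# Toy "explicit" generative model: fit theta = (mu, sigma) of a 1-D
# Gaussian p(x; theta) by maximum likelihood, then score new samples.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # examples drawn from p(X)

mu, sigma = data.mean(), data.std()               # MLE of theta

def log_likelihood(x, mu, sigma):
    """log p(x; theta) for a Gaussian."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# A sample near the mean is far more likely than an outlier.
print(log_likelihood(2.0, mu, sigma) > log_likelihood(10.0, mu, sigma))  # True
```

VAEs and autoregressive models play the same game with far more expressive parameterizations of the density.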
Generative models
Implicit models: I don’t care about the parameters, just give me some nice cats when I
roll the dice! (sampling)
GANs…
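By contrast, a minimal sketch of an implicit model (the generator below is a made-up stand-in for a trained GAN generator): all we get is a sampler g(z), never a written-down density.

```python
import numpy as np

# Toy "implicit" model: a generator g(z) that maps dice rolls (noise z)
# to samples. We never evaluate p(x; theta); we can only sample from it.
rng = np.random.default_rng(0)

def generator(z):
    # Fixed nonlinear map standing in for a trained GAN generator.
    return np.tanh(2.0 * z) + 0.1 * z

z = rng.standard_normal(5)   # roll the dice
samples = generator(z)       # "nice cats", with no density attached
print(samples.shape)         # (5,)
```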
Conditional generative models
Application dependent modeling
1. Given a piano roll, I want to generate an expressive piano performance
2. Given a mel-spectrogram, I want to generate a raw audio signal
3. Given a linguistic feature, I want to generate a speech signal
…
(Diagram: Condition (controllability) → Generative Model → Output (signal))
Conditional generative models
What does a conditional generative model do?
• Reconstruct a signal from given information (i.e., fill in the missing information)

Level of "missing information"? (from a music & audio point of view)

Condition abstraction level, from abstract (sparse) to realistic (dense):
• Instrument class / sound class
• Non-expressive score → MIDI score w/ velocity etc.
• Linguistic feature → linguistic feature w/ pitch
• Audio features (mel-spectrogram)
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Representative applications
• TTS
• Next-generation codec
• Speech enhancement
• Some representative models
• Autoregressive generation
• WaveNet
• WaveRNN
• Parallel generation
• Parallel WaveNet
• WaveGlow/FloWaveNet
• MelGAN
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: training)
Training: the upsample net expands the mel-spectrogram (Input 1) to sample rate; the GRUs take the previous waveform samples wave[0:N-1] (Input 2) and are trained to predict wave[1:N] (ground truth), classifying each sample into one of 2^bits classes.
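The training setup above can be sketched as follows; the linear quantization, hop size, and nearest-neighbour "upsample net" are simplifying assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

# Sketch of WaveRNN-style training data preparation (assumption: plain
# linear quantization and nearest-neighbour upsampling of the mel).
bits = 8
hop = 256                      # waveform samples per mel frame

def make_training_pair(wave, mel):
    """wave: float in [-1, 1], shape (N,); mel: shape (T, n_mels), N = T*hop."""
    q = np.clip(((wave + 1.0) / 2.0 * (2**bits - 1)).round(), 0, 2**bits - 1)
    q = q.astype(np.int64)                 # class labels in [0, 2**bits)
    cond = np.repeat(mel, hop, axis=0)     # crude "upsample net": frame -> sample rate
    x_in = q[:-1]                          # Input 2: wave[0:N-1]
    target = q[1:]                         # ground truth: wave[1:N]
    return x_in, target, cond[1:]

wave = np.sin(np.linspace(0, 40, 4 * hop))
mel = np.zeros((4, 80))
x_in, target, cond = make_training_pair(wave, mel)
print(x_in.shape, target.shape, cond.shape)   # (1023,) (1023,) (1023, 80)
```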
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: inference)
Inference: starting from a zero state and a zero input sample, the network, conditioned on the upsampled mel-spectrogram, samples x[1], feeds it back as the next input, samples x[2], and so on up to x[N], which forms the output waveform.
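The inference loop above in code, with a placeholder `step` standing in for the trained GRUs and output softmax (hypothetical, for illustration only):

```python
import numpy as np

# Sketch of the autoregressive inference loop: start from a zero state and a
# zero sample, then feed each sampled output back in as the next input.
rng = np.random.default_rng(0)
num_classes = 256

def step(prev_sample, cond_frame, state):
    # Placeholder for the network: returns a categorical distribution.
    logits = rng.standard_normal(num_classes)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state

def generate(cond, state=None):
    x, out = 0, []                             # zero sample, zero state
    for frame in cond:                         # one conditioning row per sample
        probs, state = step(x, frame, state)
        x = rng.choice(num_classes, p=probs)   # sample, then feed back
        out.append(x)
    return np.array(out)

audio = generate(np.zeros((100, 80)))
print(audio.shape)    # (100,)
```

This sample-by-sample feedback is exactly why autoregressive vocoders are slow at inference, and why the parallel models on the previous slide exist.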
Conditional generative models: applications
Example of densely conditioned models: Vocoders
Practical/interesting application of vocoders: Generative speech enhancement
1. Parametric Resynthesis with Neural Vocoders (WASPAA 2019)
2. Generative Speech Enhancement Based on Cloned Networks (WASPAA 2019)
3. Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement (arXiv, 2019)
4. A Speech Synthesis Approach for High Quality Speech Separation and Generation (IEEE Signal Processing Letters, 2019)
Key idea: combine the power of the discriminative & generative approaches.
Pros: almost no artifacts. Cons: inaccurate pronunciation in low-SNR conditions.
Pipeline: noisy mel-spectrogram → Separator (discriminative) → estimated clean mel-spectrogram → Synthesizer/vocoder (generative) → synthesized clean raw wave
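The two-stage pipeline can be sketched as below; both stages are no-op placeholders for trained networks, and only the data flow is meant to match the slide.

```python
import numpy as np

# Sketch of the generative enhancement pipeline: a discriminative separator
# cleans the mel-spectrogram, then a vocoder resynthesizes the waveform.
def separator(noisy_mel):
    # Discriminative stage: e.g. a masking network; identity mask here.
    mask = np.ones_like(noisy_mel)
    return noisy_mel * mask                 # estimated clean mel-spectrogram

def vocoder(mel, hop=256):
    # Generative stage: a neural vocoder would synthesize audio from the mel;
    # here a silent placeholder of the right length.
    return np.zeros(mel.shape[0] * hop)

noisy_mel = np.abs(np.random.randn(100, 80))
clean_wave = vocoder(separator(noisy_mel))
print(clean_wave.shape)   # (25600,)
```

Because the vocoder only ever sees a mel-spectrogram, any residual noise the separator removes simply never reaches the waveform, which is where the "almost no artifacts" property comes from.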
Some of my preliminary results… (audio samples: noisy input vs. generated output)
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Some other practical/interesting application: Next-generation codec
1. Wavenet based low rate speech coding (ICASSP 2018)
2. Low bit-rate speech coding with VQ-VAE and a WaveNet decoder (ICASSP 2019)
3. Improving Opus low bit rate quality with neural speech synthesis (arXiv, 2019)
Key idea:
1. Deep learning is good at learning a compressed representation (Encoder).
2. Deep learning is good at synthesizing (Decoder).
Pros: good bit rate (bps). Cons: ???
Pipeline: speech → Encoder (server 1) → compressed representation → Decoder (server 2) → reconstructed signal (speech)
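A minimal sketch of the encoder/decoder split, assuming a VQ-style discrete codebook as in the VQ-VAE paper above; the codebook here is random and purely illustrative, not a real codec.

```python
import numpy as np

# Sketch of the neural codec idea: the encoder (server 1) compresses speech
# frames into discrete code indices, which are transmitted and turned back
# into decoder features (server 2); a vocoder would then synthesize audio.
codebook = np.random.default_rng(0).standard_normal((64, 16))  # 64 codes -> 6 bits/frame

def encode(frames):
    """Map each 16-dim frame to the index of its nearest codebook entry."""
    d = ((frames[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.argmin(axis=1)                 # the compressed representation

def decode(indices):
    return codebook[indices]                # decoder-side features

frames = np.random.default_rng(1).standard_normal((50, 16))
codes = encode(frames)
print(codes.shape, codes.max() < 64)   # (50,) True
```

With 64 codes per frame, each frame costs only 6 bits on the wire, which is why such codecs reach very low bit rates.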
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Training stage
Generation stage: TEXT + MIDI → conditioned waveform
Main Idea : Disentangling Formant mask & Pitch skeleton
• We wanted pitch and text information to be modelled as independent
acoustic features, and we designed the network to reflect that
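One way to read the disentangling idea as code (a simplifying assumption for illustration, not the paper's exact formulation): the generated mel-spectrogram is the elementwise product of a text-driven formant mask and a MIDI-driven pitch skeleton.

```python
import numpy as np

# Sketch of the disentangling idea: text conditions a "formant mask"
# (pronunciation/timbre envelope), MIDI conditions a "pitch skeleton"
# (harmonic structure), and the mel-spectrogram is their product.
# Both inputs below are synthetic stand-ins for network outputs.
T, n_mels = 200, 80
formant_mask = np.random.default_rng(0).random((T, n_mels))   # from the text branch

pitch_skeleton = np.zeros((T, n_mels))
pitch_skeleton[:, ::10] = 1.0         # crude harmonics of a held note (MIDI branch)

mel = formant_mask * pitch_skeleton   # pitch decides WHERE energy sits,
print(mel.shape)                      # text decides HOW MUCH -> (200, 80)
```

Because the two factors are combined only at the output, changing the text input cannot move the harmonics, and changing the pitch input cannot change the pronunciation, which is exactly what the generation results on the next slides demonstrate.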
Generation Result
Input text : “do re mi fa sol ra ti do”
Input pitch : [C D E F G A B C]
Generated audio :
(Figures: formant mask, pitch skeleton, generated mel-spectrogram)
Generation Result
Input text : “do do do do do do do do”
Input pitch : [C D E F G A B C]
Generated audio :
(Figures: formant mask, pitch skeleton, generated mel-spectrogram)
Generation Result
Input text : “do re mi fa sol ra ti do”
Input pitch : [C C C C C C C C]
Generated audio :
(Figures: formant mask, pitch skeleton, generated mel-spectrogram)
Generation Result
Input text “아리랑 아리랑 아라리오 아리랑 고개로 넘어간다 나를 버리고 가시는 님은 십리도 못 가서 발병 난다”
“arirang arirang arario arirang go gae ro neom eo gan da na reul beo ri go ga shi neun nim eun sib ri do mot ga seo bal byung nan da”
Input pitch
Generated result
Generated singing
Audio samples
Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, with a Singer Identity Encoder added.
• Disentangles singer identity into Timbre and Singing Style.
Generation Result
Singer A Singer B
Generation Result
Singer A Singer B
Timbre A + Style B Timbre B + Style A
Conditional generative models: applications
What is lacking?
1. Deterministic: Condition (controllability) → Generative Model → Output (signal)
2. Some stochasticity: Condition (controllability; possibly itself a signal, image/audio) + Randomness (uncertainty, creativity) → Generative Model → Output (signal, audio/image): a multi-modal transform

This can be seen as a supervised way of disentangling representation.
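The difference between the deterministic and stochastic settings in one toy sketch (g below is a made-up function, not a trained model):

```python
import numpy as np

# Contrast a deterministic conditional model, out = g(c), with a stochastic
# one, out = g(c, z): the same condition with different dice rolls yields
# different, but condition-consistent, outputs.
rng = np.random.default_rng(0)

def g(c, z=None):
    base = np.tanh(c)                 # the part pinned down by the condition
    if z is None:
        return base                   # deterministic: one output per condition
    return base + 0.1 * z             # stochastic: a distribution of outputs

c = np.ones(4)
a = g(c, rng.standard_normal(4))
b = g(c, rng.standard_normal(4))
print(np.allclose(g(c), g(c)), np.allclose(a, b))   # True False
```

The noise z carries whatever the condition does not determine: uncertainty in the mapping, or creativity in the output.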
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
Input: the music sequence (steps 1-8) concatenated with the pose sequence (steps 1-8); output: the estimated pose sequence shifted one step ahead (steps 2-9).
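The input/target layout above can be sketched as follows (the feature dimensions are assumptions for illustration):

```python
import numpy as np

# Sketch of the listen-to-dance training layout: concatenate the pose
# sequence for steps 1..T-1 with the frame-aligned music features, and
# train the encoder-decoder to output the poses shifted one step ahead.
def make_io(poses, music):
    """poses: (T, pose_dim); music: (T, feat_dim), frame-aligned."""
    x = np.concatenate([poses[:-1], music[:-1]], axis=1)   # steps 1..T-1
    y = poses[1:]                                          # steps 2..T
    return x, y

poses = np.random.default_rng(0).standard_normal((9, 34))   # e.g. 17 joints x 2
music = np.random.default_rng(1).standard_normal((9, 40))
x, y = make_io(poses, music)
print(x.shape, y.shape)   # (8, 74) (8, 34)
```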
Blackpink - 불장난 (Playing with Fire)
Red velvet - Rookie
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
• Using an autoencoder, obtain reduced acoustic features
• With a temporal-index mask, transform the frame-indexed acoustic features into beat-indexed acoustic features
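A minimal sketch of the frame-to-beat re-indexing, assuming beat positions are given and using average pooling per inter-beat segment (the paper's exact pooling may differ):

```python
import numpy as np

# Turn frame-indexed features into beat-indexed features: pool (here,
# average) the feature frames inside each inter-beat segment.
def beat_index(features, beats):
    """features: (T, d); beats: increasing frame indices with beats[-1] <= T."""
    return np.stack([features[a:b].mean(axis=0)
                     for a, b in zip(beats[:-1], beats[1:])])

feats = np.arange(20, dtype=float).reshape(10, 2)   # 10 frames, 2 feature dims
beats = [0, 4, 7, 10]                               # 3 inter-beat segments
print(beat_index(feats, beats).shape)   # (3, 2)
```

Re-indexing by beat aligns the music features with the natural rhythmic unit of dance motion, instead of the arbitrary analysis frame rate.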
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
Learning How to Compose
Generation
Learning How to Move
• Decompose the dance sequence using kinematic beats
• With a VAE, disentangle dance into an initial pose + movement
Learning How to Compose
• Learns how to meaningfully compose a sequence of basic
movements into a dance conditioned on the input music.
• Conditional adversarial training for music & dance correspondence
(Model diagram; markers show where conditioning is applied)
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR2020 openreview)
Stochastic part
(Uncertainty)
(Figure: faces generated for four speakers, Spk1-Spk4)
Fix z & Change c (speech embedding)
Fix c & Change z (random sampling)
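The two qualitative probes (fix z & change c, fix c & change z) in toy form; G below is a random linear stand-in for the trained face generator, purely for illustration:

```python
import numpy as np

# Two probes on a conditional generator G(c, z): fix the noise z and sweep
# the speech embedding c (identity changes), or fix c and sweep z (the
# stochastic residual -- e.g. pose, expression -- changes).
rng = np.random.default_rng(0)
Wc, Wz = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))

def G(c, z):
    return np.tanh(Wc @ c + Wz @ z)    # toy "face" of 8 features

z0 = rng.standard_normal(3)
faces_per_speaker = [G(c, z0) for c in rng.standard_normal((4, 3))]  # fix z, vary c

c0 = rng.standard_normal(3)
variants = [G(c0, z) for z in rng.standard_normal((4, 3))]           # fix c, vary z
print(len(faces_per_speaker), len(variants))   # 4 4
```

The second sweep makes the uncertainty explicit: the voice pins down only part of the face, and z fills in the rest.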
Thank You!
Questions?
Editor's Notes

• #10: The application changes depending on what is being filled in.
• #15:
  • Generator architecture: a stack of transposed convolutional layers upsamples the input sequence; each transposed convolutional layer is followed by a stack of residual blocks.
  • Induced receptive field: residual blocks use dilations so that temporally far-apart output activations of each layer have significantly overlapping inputs; the receptive field of a stack of dilated convolution layers increases exponentially with the number of layers.
  • Discriminator, multiscale architecture: 3 discriminators (identical structure) operate on different audio scales: the original scale, and 2x and 4x downsampled. Each discriminator is biased to learn features for a different frequency range of the audio.
  • Window-based objective: each individual discriminator is a Markovian window-based discriminator (analogous to image patches, Isola et al. (2017)). The discriminator learns to classify between distributions of small audio chunks; overlapping large windows maintain coherence across patches.
• #29: 1. Dance; 2. Audio signal generation; 3. Aumon (reflecting stochasticity); 4. Future work, with the example of image generation with stochasticity.
• #42: A face cannot be 100% inferred from the voice alone.