Generative Adversarial Network
and its Applications to Speech Processing
and Natural Language Processing
Hung-yi Lee and Yu Tsao
Outline
Part I: Basic Idea of Generative Adversarial
Network (GAN)
Part II: A little bit theory
Part III: Applications to Speech Processing
Part IV: Applications to Natural Language
Processing
Take a break
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
(not updated since 2018.09)
More than 500 species
in the zoo
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
GAN
ACGAN
BGAN
DCGAN
EBGAN
fGAN
GoGAN
CGAN
……
Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding
Generative Adversarial Networks”, arXiv, 2017
How many papers have "adversarial" in their titles? (INTERSPEECH & ICASSP)
Year:         2012 2013 2014 2015 2016 2017 2018 2019
ICASSP:          0    0    0    0    1    2   42   62
INTERSPEECH:     0    0    0    0    2   11   14   32
It is a wise choice to
attend this tutorial.
Part I: Basic Idea
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Generation
image
2. Conditional Generation
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional Generation
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Anime Face Generation
Draw
Generator
Examples
Basic Idea of GAN
Generator: vector → image. It is a neural network (NN), or a function.
Example: input vector (0.1, −3, ⋯, 2.4, 0.9) → output image.
Changing the input vector changes the output image, e.g. (3, −3, ⋯, 2.4, 0.9), (0.1, 2.1, ⋯, 5.4, 0.9), (0.1, −3, ⋯, 2.4, 3.5); the input is a high-dimensional vector.
Powered by: http://mattya.github.io/chainer-DCGAN/
Each dimension of the input vector represents some characteristics, e.g. longer hair, blue hair, open mouth.
Basic Idea of GAN
Discriminator: image → scalar. It is a neural network (NN), or a function.
Larger value means real, smaller value means fake.
Example: the discriminator outputs 1.0 for realistic (real) images and 0.1 for poorly generated images.
Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D
  Sample vectors and feed them through the fixed G to obtain generated objects (label 0).
  Randomly sample real objects from the database (label 1).
  Discriminator learns to assign high scores to real objects and low scores to generated objects.
Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 2: Fix discriminator D, and update generator G
  vector → Generator (NN) → image → Discriminator → scalar (e.g. 0.13)
  Treat generator + discriminator as one large network; update only the generator (fix the discriminator) by gradient ascent on the discriminator's output.
  Generator learns to "fool" the discriminator.
Algorithm (putting the two steps together)
• Initialize generator and discriminator
• In each training iteration:
  Learning D: sample some real objects from the database (label 1); generate some fake objects from sampled vectors with G (label 0); update D while G is fixed.
  Learning G: feed sampled vectors through G into D; update G (with D fixed) so that D's output for the generated images approaches 1.
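The two alternating steps can be written as a short training loop. Below is a minimal sketch in PyTorch (not taken from the slides); `generator`, `discriminator`, and `real_loader` are assumed to exist, and all hyper-parameters are illustrative.

```python
# Minimal GAN training loop (sketch). Assumes `generator`, `discriminator`, and
# `real_loader` are defined elsewhere; sizes and learning rates are illustrative.
import torch
import torch.nn.functional as F

z_dim = 100
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

for real_images in real_loader:                       # one training iteration
    n = real_images.size(0)

    # Step 1: fix G, update D (real -> 1, generated -> 0)
    fake_images = generator(torch.randn(n, z_dim)).detach()   # detach: G stays fixed
    d_loss = F.binary_cross_entropy_with_logits(discriminator(real_images),
                                                torch.ones(n, 1)) \
           + F.binary_cross_entropy_with_logits(discriminator(fake_images),
                                                torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 2: fix D, update G so that D scores its output as "real"
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(generator(torch.randn(n, z_dim))), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```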
Anime Face Generation
100 updates
Source of training data: https://zhuanlan.zhihu.com/p/24767059
Anime Face Generation
1000 updates
Anime Face Generation
2000 updates
Anime Face Generation
5000 updates
Anime Face Generation
10,000 updates
Anime Face Generation
20,000 updates
Anime Face Generation
50,000 updates
In 2019, with StyleGAN ……
Source of video:
https://www.gwern.net/Faces
(Figure: outputs of G as the interpolation coefficient of the input vector varies from 0.0 to 0.9 in steps of 0.1 — the generated faces change smoothly.)
Progressive GAN
[Tero Karras, et al., ICLR, 2018]
The first GAN
[Ian J. Goodfellow, et al., NIPS, 2014]
Today ……
[Andrew Brock, et al., arXiv, 2018]
[David Bau, et al., ICLR 2019]
Does the generator have the concept
of objects?
Some neurons correspond to specific
objects, for example, tree
Remove the neurons for tree
[David Bau, et al., ICLR 2019]
Activate the neurons for tree
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Generation
image
2. Conditional Generation
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional Generation
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Text-to-Image
• Traditional supervised approach
  NN: text (e.g. "a dog is running", "a bird is flying") → image, trained so the output is as close as possible to the target image.
  The same text (e.g. c1: "a dog is running") is paired with many different images, so the network learns to output their average → a blurry image!
Conditional GAN
G: condition c (e.g. "train") + z (sampled from a normal distribution) → image x = G(c,z)
D (original): x → scalar (x is a real image or not)
  Real images: 1; Generated images: 0
Generator will learn to generate realistic images ….
But completely ignore the input conditions.
[Scott Reed, et al., ICML, 2016]
Conditional GAN
G: condition c (e.g. "train") + z (sampled from a normal distribution) → image x = G(c,z)
D (better): (c, x) → scalar (x is realistic or not, AND c and x are matched or not)
  True text-image pairs, e.g. (train, real train image): 1
  Generated images, e.g. (train, generated image): 0; mismatched pairs, e.g. (cat, train image): 0
[Scott Reed, et al., ICML, 2016]
Conditional GAN - Discriminator
[Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena et al., ICML, 2017]
Design used in almost every paper: condition c and object x are fed into one network that outputs a single score.
Alternative design: one network scores whether x is realistic, a second network scores whether c and x are matched, and the two scores are combined.
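To make the first design concrete, here is a hedged PyTorch sketch of a conditional discriminator that embeds c, encodes x, and maps their concatenation to one score; all sizes and names are made up for illustration.

```python
# Sketch of a conditional discriminator (condition c + object x -> one score).
# Layer widths, the embedding size, and the flattened image size are illustrative.
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    def __init__(self, num_conditions=10, img_dim=64 * 64):
        super().__init__()
        self.cond_embed = nn.Embedding(num_conditions, 128)          # embed condition c
        self.img_encode = nn.Sequential(nn.Linear(img_dim, 512), nn.LeakyReLU(0.2))
        self.score = nn.Sequential(nn.Linear(512 + 128, 256), nn.LeakyReLU(0.2),
                                   nn.Linear(256, 1))                # single scalar score

    def forward(self, c, x):
        h = torch.cat([self.img_encode(x.flatten(1)), self.cond_embed(c)], dim=1)
        return self.score(h)

# Training targets: (c, matching real x) -> 1; (c, generated x) -> 0;
# (c, real but mismatched x) -> 0, so the generator cannot ignore the condition.
```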
Conditional GAN
paired data
blue eyes
red hair
short hair
Collecting anime faces
and the description of its
characteristics
red hair,
green eyes
blue hair,
red eyes
The images are generated by
Yen-Hao Chen, Po-Chun Chien,
Jun-Chen Xie, Tsung-Han Wu.
Conditional GAN - Image-to-image
G
𝑧
x = G(c,z)
𝑐
[Phillip Isola, et al., CVPR, 2017]
Image translation, or pix2pix
as close as
possible
Conditional GAN - Image-to-image
• Traditional supervised approach: NN → image, trained with a regression loss (e.g. L1); the result is blurry.
  (Testing figure: input vs. L1 output)
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Image-to-image
• GAN approach: G(c, z) → image; D(image) → scalar; GAN + L1 keeps the output sharp and close to the input.
  (Testing figure: input vs. L1 vs. GAN vs. GAN + L1 outputs)
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Sound-to-image
G: c (a sound, e.g. "a dog barking sound") → image
Training data collection: from videos (frames paired with their audio)
[Wan, et al., ICASSP 2019]
Conditional GAN - Sound-to-image
• Audio-to-image demo: https://wjohn1483.github.io/audio_to_scene/index.html
The images are generated by Chia-Hung Wan and Shun-Po Chuang.
(Figure: generated images as the input sound becomes louder.)
Conditional GAN - Image-to-label
Multi-label Image Classifier = Conditional Generator
Input condition
Generated output
Conditional GAN - Image-to-label
F1 MS-COCO NUS-WIDE
VGG-16 56.0 33.9
+ GAN 60.4 41.2
Inception 62.4 53.5
+GAN 63.8 55.8
Resnet-101 62.8 53.1
+GAN 64.0 55.4
Resnet-152 63.3 52.1
+GAN 63.9 54.1
Att-RNN 62.1 54.7
RLSD 62.0 46.9
The classifiers can have
different architectures.
The classifiers are
trained as conditional
GAN.
[Tsai, et al., ICASSP 2019]
Conditional GAN - Image-to-label
F1 MS-COCO NUS-WIDE
VGG-16 56.0 33.9
+ GAN 60.4 41.2
Inception 62.4 53.5
+GAN 63.8 55.8
Resnet-101 62.8 53.1
+GAN 64.0 55.4
Resnet-152 63.3 52.1
+GAN 63.9 54.1
Att-RNN 62.1 54.7
RLSD 62.0 46.9
The classifiers can have
different architectures.
The classifiers are
trained as conditional
GAN.
Conditional GAN
outperforms other
models designed for
multi-label.
Conditional GAN
- Video Generation
Generator
Discrimi
nator
Last frame is real
or generated
Discriminator thinks it is real
[Michael Mathieu, et al., arXiv, 2015]
https://github.com/dyelax/Adversarial_Video_Generation
More about Video Generation
https://arxiv.org/abs/1905.08233
[Egor Zakharov, et al., arXiv, 2019]
Domain Adversarial Training
• Training and testing data are in different domains
  Training data → Generator → feature
  Testing data → Generator → feature
  Goal: the features of the two domains follow the same distribution.
  (Take digit classification as example; blue points / red points are the features of the two domains.)
Domain Adversarial Training
Feature extractor (Generator) → feature; Discriminator (Domain classifier): which domain does the feature come from?
If we only ask the feature extractor to fool the domain classifier, it can simply always output zero vectors — the domain classifier fails, but the features carry no information.
Domain Adversarial Training
Feature extractor (Generator) → feature
Discriminator (Domain classifier): which domain?  Label predictor: which digit?
The feature extractor must not only cheat the domain classifier, but also satisfy the label predictor at the same time.
Successfully applied on image classification [Ganin et al., ICML, 2015][Ajakan et al., JMLR, 2016]
More speech-related applications in Part III.
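Domain adversarial training is usually implemented with a gradient reversal layer (GRL): identity in the forward pass, gradient multiplied by −λ in the backward pass, so the feature extractor is pushed to confuse the domain classifier while still serving the label predictor. A minimal PyTorch sketch (the three sub-networks are assumed to exist):

```python
# Gradient reversal layer (GRL) sketch for domain adversarial training.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None      # reversed (scaled) gradient

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage inside a model (feature_extractor, label_predictor, domain_classifier
# are assumed to be defined elsewhere):
#   feat          = feature_extractor(image)
#   class_logits  = label_predictor(feat)                   # trained normally
#   domain_logits = domain_classifier(grad_reverse(feat))   # adversarial branch
# total_loss = class_loss + domain_loss; backprop handles the sign flip.
```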
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Generation
image
2. Conditional Generation
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional Generation
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Unsupervised
Conditional Generation
G
Object in Domain X Object in Domain Y
Transform an object from one domain to another
without paired data
Domain X Domain Y
photos
Condition Generated Object
Vincent van Gogh’s
paintings
Not Paired
More Applications in Parts III and IV
Use image style transfer as example here
Unsupervised Conditional Generation
• Approach 1: Cycle-GAN and its variants — directly learn a generator 𝐺 𝑋→𝑌 from domain X to domain Y.
• Approach 2: Shared latent space — an encoder 𝐸𝑁 𝑋 of domain X extracts a latent code (e.g. face attributes), and a decoder 𝐷𝐸 𝑌 of domain Y generates the output.
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain Y
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain Y
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Not what we want!
ignore input
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Not what we want!
ignore input
[Tomer Galanti, et al. ICLR, 2018]
The issue can be avoided by network design.
Simpler generator makes the input and
output more closely related.
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Encoder
Network
Encoder
Network
pre-trained
as close as
possible
Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
Cycle GAN
𝐺 𝑋→𝑌
𝐷 𝑌
Domain Y
scalar
Input image
belongs to
domain Y or not
𝐺Y→X
as close as possible
Lack of information
for reconstruction
[Jun-Yan Zhu, et al., ICCV, 2017]
Cycle consistency
Cycle GAN
  X → 𝐺 𝑋→𝑌 → Y → 𝐺 𝑌→𝑋 → back to X: as close as possible to the input.
  Y → 𝐺 𝑌→𝑋 → X → 𝐺 𝑋→𝑌 → back to Y: as close as possible to the input.
  𝐷 𝑌: scalar, belongs to domain Y or not; 𝐷 𝑋: scalar, belongs to domain X or not.
The same idea appeared as Cycle GAN [Jun-Yan Zhu, et al., ICCV, 2017], Dual GAN [Zili Yi, et al., ICCV, 2017], and Disco GAN [Taeksoo Kim, et al., ICML, 2017].
For multiple domains, consider StarGAN [Yunjey Choi, arXiv, 2017].
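A compact view of how the adversarial terms and the two cycle-consistency terms fit together, as a hedged PyTorch sketch; `G_XY`, `G_YX`, `D_X`, `D_Y` are assumed networks, and the least-squares GAN form is used only to keep the example short.

```python
# CycleGAN loss sketch. G_XY: X -> Y, G_YX: Y -> X; D_X, D_Y score domain membership.
# The generators/discriminators and the batches x, y are assumed to exist.
import torch
import torch.nn.functional as F

lambda_cyc = 10.0

def generator_loss(x, y):
    fake_y, fake_x = G_XY(x), G_YX(y)
    # adversarial terms: make D_Y / D_X believe the translations are real
    adv = F.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) \
        + F.mse_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # cycle consistency: X -> Y -> X and Y -> X -> Y should reconstruct the inputs
    cyc = F.l1_loss(G_YX(fake_y), x) + F.l1_loss(G_XY(fake_x), y)
    return adv + lambda_cyc * cyc

def discriminator_loss(x, y):
    fake_y, fake_x = G_XY(x).detach(), G_YX(y).detach()
    d_y = F.mse_loss(D_Y(y), torch.ones_like(D_Y(y))) \
        + F.mse_loss(D_Y(fake_y), torch.zeros_like(D_Y(fake_y)))
    d_x = F.mse_loss(D_X(x), torch.ones_like(D_X(x))) \
        + F.mse_loss(D_X(fake_x), torch.zeros_like(D_X(fake_x)))
    return d_y + d_x
```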
Issue of Cycle Consistency
• CycleGAN: a Master of Steganography
[Casey Chu, et al., NIPS workshop, 2017]
𝐺Y→X𝐺 𝑋→𝑌
The information is hidden.
Unsupervised Conditional Generation
• Approach 1: Cycle-GAN and its variants
• Approach 2: Shared latent space — encoder 𝐸𝑁 𝑋 of domain X extracts a latent code (e.g. face attributes), and decoder 𝐷𝐸 𝑌 of domain Y generates the output.
Shared latent space
Target: domain-X image → 𝐸𝑁 𝑋 → latent code (the face attributes), − domain-x information + domain-y information → 𝐷𝐸 𝑌 → domain-Y image.
Shared latent space — Training
Train one auto-encoder per domain, minimizing reconstruction error:
  domain-X image → 𝐸𝑁 𝑋 → 𝐷𝐸 𝑋 → reconstructed domain-X image
  domain-Y image → 𝐸𝑁 𝑌 → 𝐷𝐸 𝑌 → reconstructed domain-Y image
Because we train two auto-encoders separately …
The images with the same attribute may not project to the same position in the latent space.
Shared latent space — Training
Add a discriminator for each domain (𝐷 𝑋 for domain X, 𝐷 𝑌 for domain Y) on top of minimizing reconstruction error.
Add a domain discriminator that guesses whether a latent code comes from 𝐸𝑁 𝑋 or 𝐸𝑁 𝑌; 𝐸𝑁 𝑋 and 𝐸𝑁 𝑌 learn to fool it.
The domain discriminator forces the outputs of 𝐸𝑁 𝑋 and 𝐸𝑁 𝑌 to have the same distribution.
[Guillaume Lample, et al., NIPS, 2017]
Shared latent space — Training
Cycle consistency: domain-X image → 𝐸𝑁 𝑋 → 𝐷𝐸 𝑌 → 𝐸𝑁 𝑌 → 𝐷𝐸 𝑋 → back to domain X, minimizing the reconstruction error (together with 𝐷 𝑋 and 𝐷 𝑌).
Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
Shared latent space — Training
Semantic consistency: the original image and its translated version should map to the same point in the latent space.
Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]
Shared latent space
Sharing the parameters of the encoders (𝐸𝑁 𝑋, 𝐸𝑁 𝑌) and of the decoders (𝐷𝐸 𝑋, 𝐷𝐸 𝑌):
Couple GAN [Ming-Yu Liu, et al., NIPS, 2016]; UNIT [Ming-Yu Liu, et al., NIPS, 2017]
In the extreme case, use one encoder to extract domain-independent information and input an extra indicator (x or y) to control the decoder.
Widely used in Voice Conversion (Part III)
https://selfie2anime.com/
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Typical GAN
image
2. Conditional GAN
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional GAN
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Reference
• Generation
• Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial
Nets, NIPS, 2014
• Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing
of GANs for Improved Quality, Stability, and Variation, ICLR, 2018
• Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for
High Fidelity Natural Image Synthesis, arXiv, 2018
• David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B.
Tenenbaum, William T. Freeman, Antonio Torralba, GAN Dissection:
Visualizing and Understanding Generative Adversarial Networks, ICLR 2019
Reference
• Conditional Generation
• Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt
Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML,
2016
• Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image
Translation with Conditional Adversarial Networks, CVPR, 2017
• Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video
prediction beyond mean square error, arXiv, 2015
• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets,
arXiv, 2014
• Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator,
ICLR, 2018
• Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei
Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with
Stacked Generative Adversarial Networks, arXiv, 2017
• Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image
Synthesis With Auxiliary Classifier GANs, ICML, 2017
Reference
• Conditional Generation
• Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by
Backpropagation, ICML, 2015
• Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario
Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016
• Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A
Novel Framework for Multi-class Classification, submitted to ICASSP 2019
• Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky, Few-
Shot Adversarial Learning of Realistic Neural Talking Head Models, arXiv
2019
• Chia-Hung Wan, Shun-Po Chuang, Hung-Yi Lee, "Towards Audio to Scene
Image Synthesis using Generative Adversarial Network", ICASSP, 2019
Reference
• Unsupervised Conditional Generation
• Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-
Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017
• Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual
Learning for Image-to-Image Translation, ICCV, 2017
• Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity
Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018
• Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image
Generation, ICLR, 2017
• Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN:
Unrestrained Scalability for Image Domain Translation, arXiv, 2017
• Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar
Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to-
Image Translation for Many-to-Many Mappings, arXiv, 2017
Reference
• Unsupervised Conditional Generation
• Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic
Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by
Sliding Attributes, NIPS, 2017
• Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim,
Learning to Discover Cross-Domain Relations with Generative Adversarial
Networks, ICML, 2017
• Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS,
2016
• Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image
Translation Networks, NIPS, 2017
• Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim,
Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi-
Domain Image-to-Image Translation, arXiv, 2017
Part II: A little bit Theory
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Generator
• A generator G is a network. The network defines a probability distribution $P_G$.
  $z$ (sampled from a normal distribution) → generator G → $x = G(z)$
  $x$: an image (a high-dimensional vector)
We want $P_G(x)$ to be as close as possible to $P_{data}(x)$:
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
where Div is the divergence between the distributions $P_G$ and $P_{data}$.
How to compute the divergence?
Discriminator
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
Although we do not know the distributions $P_G$ and $P_{data}$, we can sample from them.
  Sampling from $P_G$: sample vectors from the normal distribution and feed them through G.
  Sampling from $P_{data}$: sample images from the database.
Discriminator
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
Train the discriminator on data sampled from $P_{data}$ (real) and data sampled from $P_G$ (generated).
Example objective function for D (G is fixed):
$V(G, D) = E_{x \sim P_{data}}[\log D(x)] + E_{x \sim P_G}[\log(1 - D(x))]$
Training: $D^* = \arg\max_D V(D, G)$
Using this example objective function is exactly the same as training a binary classifier.
The maximum objective value is related to JS divergence. [Goodfellow, et al., NIPS, 2014]
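The relation to JS divergence can be stated precisely (this is the standard result from Goodfellow et al., 2014, restated here for reference): for a fixed G, the optimal discriminator is
$D^*(x) = \dfrac{P_{data}(x)}{P_{data}(x) + P_G(x)}$
and substituting it back gives
$\max_D V(G, D) = -2\log 2 + 2\,\mathrm{JSD}(P_{data} \parallel P_G)$
so minimizing $\max_D V(G, D)$ over G is minimizing the JS divergence between $P_{data}$ and $P_G$.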
Discriminator
Small divergence between $P_G$ and $P_{data}$ → the two sets of samples are hard to discriminate → small $\max_D V(D, G)$.
Large divergence → easy to discriminate → large $\max_D V(D, G)$.
Training: $D^* = \arg\max_D V(D, G)$; the maximum objective value is related to JS divergence.
So the generator objective becomes
$G^* = \arg\min_G \max_D V(G, D)$
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D: $D^* = \arg\max_D V(D, G)$
Step 2: Fix discriminator D, and update generator G
[Goodfellow, et al., NIPS, 2014]
Can we use other divergences? Yes — f-GAN lets you use the divergence you like ☺ [Sebastian Nowozin, et al., NIPS, 2016]
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
GAN is difficult to train ……
• There is a saying ……
(I found this joke from 陳柏文’s facebook.)
Too many tips ……
• I do a little survey among 12 students …..
Q: What is the most helpful tip for
training GAN?
WGAN (33.3%)
Spectral Norm (16.7%)
JS divergence is not suitable
• In most cases, $P_G$ and $P_{data}$ do not overlap.
• 1. The nature of data: both $P_{data}$ and $P_G$ are low-dimensional manifolds in a high-dimensional space, so their overlap can be ignored.
• 2. Sampling: even if $P_{data}$ and $P_G$ overlap, with limited samples the two point sets rarely intersect.
What is the problem of JS divergence?
Consider generators $P_{G_0}, P_{G_1}, \ldots, P_{G_{100}}$ whose distributions move progressively closer to $P_{data}$:
$JS(P_{G_0}, P_{data}) = \log 2$, $JS(P_{G_1}, P_{data}) = \log 2$, ……, $JS(P_{G_{100}}, P_{data}) = 0$
JS divergence is log2 whenever two distributions do not overlap, no matter how far apart they are.
Intuition: if two distributions do not overlap, a binary classifier achieves 100% accuracy, so $P_{G_0}$ and $P_{G_1}$ look equally bad; the same maximum objective value (same divergence) is obtained, and there is no signal pushing $P_G$ toward $P_{data}$.
Wasserstein distance
• Consider one distribution P as a pile of earth, and another distribution Q as the target.
• The Wasserstein distance is the average distance the earth mover has to move the earth: if all the mass of P must be moved a distance d to become Q, then $W(P, Q) = d$.
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
There are many possible "moving plans" (some give a smaller average distance, some a larger one).
Use the moving plan with the smallest average distance to define the Wasserstein distance.
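For one-dimensional samples the optimal moving plan can be computed exactly; a small SciPy sketch with made-up point clouds:

```python
# Wasserstein-1 (earth mover's) distance between two 1-D sample sets (SciPy).
import numpy as np
from scipy.stats import wasserstein_distance

p_samples = np.array([0.0, 0.0, 1.0, 2.0])   # "pile of earth" P (toy data)
q_samples = np.array([5.0, 5.0, 6.0, 7.0])   # target distribution Q (toy data)

print(wasserstein_distance(p_samples, q_samples))   # 5.0: mass moves 5 units on average
```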
What is the problem of JS divergence?
As $P_G$ moves toward $P_{data}$ (distance $d_0 > d_1 > \ldots > 0$):
$JS(P_{G_0}, P_{data}) = \log 2$, $JS(P_{G_1}, P_{data}) = \log 2$, ……, $JS(P_{G_{100}}, P_{data}) = 0$ — no useful signal until the distributions overlap.
$W(P_{G_0}, P_{data}) = d_0$, $W(P_{G_1}, P_{data}) = d_1$, ……, $W(P_{G_{100}}, P_{data}) = 0$ — decreases smoothly. Better!
WGAN
Evaluate the Wasserstein distance between $P_{data}$ and $P_G$:
$\max_{D \in 1\text{-Lipschitz}} \left\{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] \right\}$
[Martin Arjovsky, et al., arXiv, 2017]
D has to be smooth enough. How to fulfill this constraint?
Without the constraint, D would push its outputs toward +∞ on real samples and −∞ on generated samples, and the training of D would not converge; keeping D smooth prevents this.
• Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017]: force each parameter w to stay between c and −c (after a parameter update, if w > c, set w = c; if w < −c, set w = −c).
• Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017]: keep the gradient norm of D close to 1 (e.g. on samples between or near the real and generated data) [Kodali, et al., arXiv, 2017][Wei, et al., ICLR, 2018].
• Spectral Normalization → keep the gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018].
$\max_{D \in 1\text{-Lipschitz}} \left\{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] \right\}$
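A hedged sketch of the gradient-penalty version of the critic loss, computed on random interpolations between real and generated batches (PyTorch; λ = 10 follows the Improved WGAN paper):

```python
# WGAN-GP critic loss sketch. `D` is the critic; `real` and `fake` have the same shape.
import torch

def critic_loss_wgan_gp(D, real, fake, lambda_gp=10.0):
    # Wasserstein term (minimized by the critic): E_fake[D] - E_real[D]
    loss = D(fake).mean() - D(real).mean()

    # Gradient penalty on random interpolations between real and fake samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + lambda_gp * penalty
```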
More Tips
• Improved techniques for training GANs
• Tips in DCGAN [Alec Radford, et al., ICLR 2016]
• Guideline for network architecture design for
image generation
• Tips from Soumith
• https://github.com/soumith/ganhacks
• Tips from BigGAN [Andrew Brock, et al., arXiv, 2018]
[Tim Salimans, et al., NIPS, 2016]
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Inception Score
Feed each generated image $x$ into an off-the-shelf image classifier (e.g. Inception net, VGG) to obtain $P(y|x)$, where $y$ is the class.
  A concentrated $P(y|x)$ for a single image means higher visual quality.
  Averaging over all images, $P(y) = \frac{1}{N} \sum_n P(y^n | x^n)$; a uniform $P(y)$ means higher variety.
Inception Score $= \sum_x \sum_y P(y|x) \log P(y|x) - \sum_y P(y) \log P(y)$
  (the first term is the negative entropy of P(y|x), the second the entropy of P(y))
[Tim Salimans, et al., NIPS, 2016]
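Given the matrix of class posteriors for N generated images, the score above can be computed in a few lines; a sketch (the pretrained classifier producing `p_y_given_x` is assumed to be available):

```python
# Inception-Score computation from class posteriors (NumPy).
# p_y_given_x: array of shape (N, num_classes); each row is P(y|x) for one image.
import numpy as np

def slide_score(p_y_given_x, eps=1e-12):
    # The slide's form: sum_x sum_y P(y|x) log P(y|x) - sum_y P(y) log P(y)
    p_y = p_y_given_x.mean(axis=0)
    return np.sum(p_y_given_x * np.log(p_y_given_x + eps)) \
         - np.sum(p_y * np.log(p_y + eps))

def inception_score(p_y_given_x, eps=1e-12):
    # The commonly reported form: exp of the mean KL(P(y|x_n) || P(y)) over images.
    p_y = p_y_given_x.mean(axis=0, keepdims=True)
    kl = np.sum(p_y_given_x * (np.log(p_y_given_x + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```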
Fréchet Inception Distance (FID)
Blue points: latent representations (from an Inception net) of the generated images; red points: latent representations of the real images.
Fit a Gaussian to each set of points; FID = Fréchet distance between the two Gaussians (smaller is better).
[Martin Heusel, et al., NIPS, 2017]
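With Inception-net activations for real and generated images in hand, the Fréchet distance between the two fitted Gaussians is a short computation; a NumPy/SciPy sketch (feature extraction is assumed done elsewhere):

```python
# FID sketch: Frechet distance between Gaussians fitted to two feature sets.
# feats_real, feats_fake: arrays of shape (N, d) of Inception-net activations.
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)        # matrix square root
    if np.iscomplexobj(covmean):                   # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```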
To learn more about evaluation …
Pros and cons of GAN evaluation measures
https://arxiv.org/abs/1802.03446
[Ali Borji, 2019]
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Basic Components
Actor, Environment (Env), Reward Function
  e.g. Video game: the reward function is "get 20 points when killing a monster"; Go: the reward function is given by the rules of Go.
You cannot control the environment or the reward function.
Neural network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g. the pixels).
• Output of the neural network: each action corresponds to a neuron in the output layer, giving the score of each action (e.g. left 0.7, right 0.2, fire 0.1).
• Take the action based on the probability.
Actor, Environment, Reward
The actor sees $s_1$ and takes action $a_1$ (e.g. "right"); the environment returns $s_2$; the actor takes $a_2$ (e.g. "fire"); the environment returns $s_3$; ……
Trajectory: $\tau = \{s_1, a_1, s_2, a_2, \cdots, s_T, a_T\}$
Reinforcement Learning v.s. GAN
Actor → Generator: both are updated. Environment + Reward Function → Discriminator: the reward function is fixed, whereas the discriminator is updated.
At each step the reward function gives a reward $r_t$; the total reward of a trajectory is $R(\tau) = \sum_{t=1}^{T} r_t$.
The environment and reward function are a "black box": you cannot use backpropagation through them.
Inverse Reinforcement Learning
The reward function is not available (in many cases, it is difficult to define a reward function), but we have demonstrations of the expert: $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$, where each $\hat{\tau}$ is a trajectory of the expert.
  Self-driving: record human drivers.  Robot: physically guide the robot's arm to demonstrate the task.
Inverse Reinforcement Learning
Reinforcement learning: reward function + environment → optimal actor.
Inverse reinforcement learning: expert demonstrations $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$ + environment → reward function.
➢ Then use the learned reward function to find the optimal actor.
➢ Modeling the reward can be easier: a simple reward function can lead to a complex policy.
Framework of IRL
The expert $\hat{\pi}$ provides demonstrations $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$; the actor $\pi$ generates trajectories $\tau_1, \tau_2, \cdots, \tau_N$.
Obtain a reward function R under which the expert is always the best: $\sum_{n=1}^{N} R(\hat{\tau}_n) > \sum_{n=1}^{N} R(\tau_n)$
Find an actor that maximizes R (by reinforcement learning), then update R, and iterate.
Reward function → Discriminator; Actor → Generator
GAN: D gives a high score to real examples and a low score to generated ones; find a G whose output obtains a large score from D.
IRL: the reward function gives a larger reward to the expert trajectories $\hat{\tau}_n$ and a lower reward to the actor's trajectories $\tau$; find an actor that obtains a large reward.
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Reference
• Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative
Neural Samplers using Variational Divergence Minimization”, NIPS, 2016
• Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017
• Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron
Courville, Improved Training of Wasserstein GANs, NIPS, 2017
• Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial
Network, arXiv, 2016
• Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are
GANs Created Equal? A Large-Scale Study”, arXiv, 2017
• Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi
Chen Improved Techniques for Training GANs, NIPS, 2016
• Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp
Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local
Nash Equilibrium, NIPS, 2017
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Part III: Speech Signal
Processing
Yu Tsao, Ph.D., Academia Sinica
yu.tsao@citi.sinica.edu.tw
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech Signal Generation (Regression Task)
  Paired input and output speech; a generator G produces the output, trained with an objective function.
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
  An encoder E (embedding function 𝑔(∙)) followed by a classifier G (ℎ(∙)) outputs the label 𝒚.
  Acoustic mismatch: the model is trained on clean data 𝒙 but tested on noisy data ෥𝒙 (with embedding ෤𝒛 = 𝑔(෥𝒙)), accented speech, or channel-distorted speech.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech Enhancement
• Neural network models for spectral mapping: noisy speech → G (enhancing) → enhanced output, trained with an objective function.
➢ Model structures of G: DNN [Wang et al., NIPS 2012; Xu et al., SPL 2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al., Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al., Interspeech 2016].
• Typical objective functions: mean square error (MSE) [Xu et al., TASLP 2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al., MLSP 2017], STOI [Fu et al., TASLP 2018].
➢ GAN is used as a new objective function to estimate the parameters in G.
Speech Enhancement
• Speech enhancement GAN (SEGAN) [Pascual et al., Interspeech 2017]
Speech Enhancement (SEGAN)
• Experimental results (Table 1: objective evaluation; Table 2: subjective evaluation; Fig. 1: preference test)
SEGAN yields better speech enhancement results than Noisy and Wiener.
Speech Enhancement
• Pix2Pix [Michelsanti et al., Interspeech 2017]
  G: noisy spectrogram → output (enhanced) spectrogram, trained toward the clean spectrogram.
  D: scalar; (noisy, clean) pairs are real, (noisy, output) pairs are fake.
Speech Enhancement (Pix2Pix)
• Spectrogram analysis (Fig. 2: spectrogram comparison of Pix2Pix with baseline methods — Noisy, Clean, NG-DNN, STAT-MMSE, NG-Pix2Pix)
Pix2Pix outperforms STAT-MMSE and is competitive to DNN SE.
Table 3: Objective evaluation results.
Speech Enhancement (Pix2Pix)
• Objective evaluation and speaker verification test
Table 4: Speaker verification results.
1. From the PESQ and STOI evaluations, Pix2Pix outperforms Noisy
and MMSE and is competitive to DNN SE.
2. From the speaker verification results, Pix2Pix outperforms the
baseline models when the clean training data is used.
Speech Enhancement
• Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018]
  G: noisy spectrogram → output (enhanced) spectrogram, trained toward the clean spectrogram.
  D: scalar; (noisy, clean) pairs are real, (noisy, output) pairs are fake.
Fig. 3: Spectrogram comparison of FSEGAN with L1-trained method.
Speech Enhancement (FSEGAN)
• Spectrogram analysis
FSEGAN reduces both additive noise and reverberant smearing.
Table 5: WER (%) of SEGAN and FSEGAN. Table 6: WER (%) of FSEGAN with retrain.
Speech Enhancement (FSEGAN)
• ASR results
1. From Table 5, (1) FSEGAN improves recognition results for ASR-Clean.
(2) FSEGAN outperforms SEGAN as front-ends.
2. From Table 6, (1) Hybrid Retraining with FSEGAN outperforms Baseline;
(2) FSEGAN retraining slightly underperforms L1–based retraining.
Speech Enhancement
• Speech enhancement through a mask function
  G: noisy spectrogram → output mask; enhanced spectrogram = point-wise multiplication of the mask and the noisy spectrogram.
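The point-wise multiplication is applied on the magnitude spectrogram and the result is re-synthesized with the noisy phase; a small sketch (the trained mask estimator, here called `predict_mask`, and the input file name are hypothetical):

```python
# Applying a predicted magnitude mask to a noisy spectrogram (sketch).
# `predict_mask` stands in for the trained generator G; "noisy.wav" is a placeholder.
import numpy as np
import librosa

noisy, sr = librosa.load("noisy.wav", sr=16000)
spec = librosa.stft(noisy, n_fft=512, hop_length=256)
mag, phase = np.abs(spec), np.angle(spec)

mask = predict_mask(mag)                 # G: noisy magnitude -> mask (same shape)
enhanced_mag = mask * mag                # point-wise multiplication

# Re-synthesize using the noisy phase (a common simplification)
enhanced = librosa.istft(enhanced_mag * np.exp(1j * phase), hop_length=256)
```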
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Ashutosh Pandey and Deliang Wang, ICASSP 2018]
  G: noisy spectrogram → output mask, trained toward the reference mask.
  D: scalar; (noisy, reference mask) pairs are real, (noisy, output mask) pairs are fake.
We don't know exactly what function D serves in this setting.
Our ICML 2019 paper sheds some light on a potential future direction.
Speech Enhancement (AFT)
• Cycle-GAN-based acoustic feature transformation (AFT) [Mimura et al., ASRU 2017]
  Cycle 1: noisy → enhanced → reconstructed noisy, as close as possible to the input.
  Cycle 2: clean → synthesized noisy → reconstructed clean, as close as possible to the input.
  $D_S$ and $D_T$ output scalars indicating whether an utterance belongs to domain S (clean) or domain T (noisy).
$V_{Full} = V_{GAN}(G_{X \to Y}, D_Y) + V_{GAN}(G_{Y \to X}, D_X) + \lambda\, V_{Cyc}(G_{X \to Y}, G_{Y \to X})$
• ASR results on noise robustness and style adaptation
Table 7: Noise robust ASR. Table 8: Speaker style adaptation.
1. 𝐺 𝑇→𝑆 can transform acoustic features and effectively improve
ASR results for both noisy and accented speech.
2. 𝐺𝑆→𝑇 can be used for model adaptation and effectively improve
ASR results for noisy speech.
S: Clean; 𝑇: Noisy JNAS: Read; CSJ-SPS: Spontaneous (relax);
CSJ-APS: Spontaneous (formal);
Speech Enhancement (AFT)
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Postfilter
• Postfilter for synthesized or transformed speech (the output of a speech synthesizer, voice conversion, or speech enhancement system)
  G: synthesized spectral texture → output, trained toward the natural spectral texture with an objective function.
➢ Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil'en et al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014], and DNN with MSE criterion [Chen et al., Interspeech 2014; Chen et al., TASLP 2015].
➢ GAN is used as a new objective function to estimate the parameters in G.
Postfilter
• GAN postfilter [Kaneko et al., ICASSP 2017]
➢ The traditional MMSE criterion results in statistical averaging.
➢ GAN is used as a new objective function to estimate the parameters in G.
➢ The proposed work intends to further improve the naturalness of synthesized speech or of the parameters from a synthesizer.
  G: synthesized Mel-cepstral coefficients → generated Mel-cepstral coefficients; D: natural or generated?
Fig. 4: Spectrograms of: (a) NAT (nature); (b) SYN (synthesized); (c) VS (variance
scaling); (d) MS (modulation spectrum); (e) MSE; (f) GAN postfilters.
Postfilter (GAN-based Postfilter)
• Spectrogram analysis
GAN postfilter reconstructs spectral texture similar to the natural one.
Fig. 5: Mel-cepstral trajectories (GANv:
GAN was applied in voiced part).
Fig. 6: Averaging difference in
modulation spectrum per Mel-
cepstral coefficient.
Postfilter (GAN-based Postfilter)
• Objective evaluations
GAN postfilter reconstructs spectral texture similar to the natural one.
Table 9: Preference score (%). Bold font indicates the numbers over 30%.
Postfilter (GAN-based Postfilter)
• Subjective evaluations
1. GAN postfilter significantly improves the synthesized speech.
2. GAN postfilter is effective particularly in voiced segments.
3. GANv outperforms GAN and is comparable to NAT.
Speech Synthesis
• Input: linguistic features; Output: speech parameters (sp)
  Generator $G_{SS}$: linguistic features → generated speech parameters $\hat{\boldsymbol{c}}$, compared against the natural speech parameters $\boldsymbol{c}$.
  Objective function: minimum generation error (MGE), MSE.
• Speech synthesis with anti-spoofing verification (ASV) [Saito et al., ICASSP 2017]
  $G_{SS}$: linguistic features → generated speech parameters $\hat{\boldsymbol{c}}$; an anti-spoofing verifier $D_{ASV}$ (on features $\phi(\cdot)$) labels frames as natural or generated.
  Minimum generation error (MGE) training with an adversarial loss:
$L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$
$L_{D,1}(\boldsymbol{c}) = -\frac{1}{T} \sum_{t=1}^{T} \log D(\boldsymbol{c}_t)$ … NAT
$L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T} \sum_{t=1}^{T} \log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ … SYN
$L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$
Fig. 7: Averaged GVs of MCCs.
Speech Synthesis (ASV)
• Objective and subjective evaluations
1. The proposed algorithm generates MCCs similar to the natural ones.
Fig. 8: Scores of speech quality.
2. The proposed algorithm outperforms conventional MGE training.
Speech Synthesis
• Speech synthesis with GAN (SS-GAN) [Saito et al., TASLP 2018]
  The same MGE-plus-adversarial-loss formulation as above, with a GAN discriminator $D$ (in place of $D_{ASV}$) and with the generator producing sp, F0, and duration:
$L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$,  $L_{D,1}(\boldsymbol{c}) = -\frac{1}{T} \sum_{t=1}^{T} \log D(\boldsymbol{c}_t)$ … NAT,  $L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T} \sum_{t=1}^{T} \log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ … SYN
$L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$
Speech Synthesis (SS-GAN)
• Subjective evaluations (Fig. 9: scores of speech quality (sp); Fig. 10: scores of speech quality (sp and F0))
The proposed algorithm works for both spectral parameters and F0.
• Convert (transform) speech from source to target
➢ Conventional VC approaches include Gaussian mixture model (GMM) [Toda
et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP
2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al.,
Interspeech 2016], variational autoencoder (VAE) [Hsu et al., APSIPA
2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP
2014], feed forward NN [Desai et al., TASLP 2010], recurrent NN (RNN)
[Nakashika et al., Interspeech 2014].
Voice Conversion
G Output
Objective function
Target
speaker
Source
speaker
Voice Conversion
• VAW-GAN [Hsu et al., Interspeech 2017]
➢ Conventional MMSE approaches often encounter the "over-smoothing" issue.
➢ GAN is used as a new objective function to estimate G.
➢ The goal is to increase the naturalness, clarity, and similarity of converted speech.
  G: source-speaker speech → converted speech; D: real (target speaker) or fake (converted)?
$V(G, D) = V_{GAN}(G, D) + \lambda\, V_{VAE}(\boldsymbol{x} \mid \boldsymbol{y})$
• Objective and subjective evaluations
Fig. 12: MOS on naturalness.Fig. 11: The spectral envelopes.
Voice Conversion (VAW-GAN)
VAW-GAN outperforms VAE in terms of objective and subjective
evaluations with generating more structured speech.
Voice Conversion
• CycleGAN-VC [Kaneko et al., Eusipco 2018]
• GAN is used as a new objective function to estimate G:
$V_{Full} = V_{GAN}(G_{X \to Y}, D_Y) + V_{GAN}(G_{Y \to X}, D_X) + \lambda\, V_{Cyc}(G_{X \to Y}, G_{Y \to X})$
  Source → $G_{S \to T}$ → synthesized target → $G_{T \to S}$ → reconstructed source, as close as possible; $D_T$: scalar, belongs to domain T or not.
  Target → $G_{T \to S}$ → synthesized source → $G_{S \to T}$ → reconstructed target, as close as possible; $D_S$: scalar, belongs to domain S or not.
• Subjective evaluations
Fig. 13: MOS for naturalness.
Fig. 14: Similarity to the source and to the target speakers. S: Source; T: Target; P: Proposed; B: Baseline.
Voice Conversion (CycleGAN-VC)
1. The proposed method uses non-parallel data.
2. For naturalness, the proposed method outperforms baseline.
3. For similarity, the proposed method is comparable to the baseline.
Target
speaker
Source
speaker
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
  An encoder E (embedding function 𝑔(∙)) followed by a classifier G (ℎ(∙)) outputs the label 𝒚.
  Acoustic mismatch: the model is trained on clean data 𝒙 but tested on noisy data ෥𝒙 (with embedding ෤𝒛 = 𝑔(෥𝒙)), accented speech, or channel-distorted speech.
Speech Recognition
• Adversarial multi-task learning (AMT) [Shinohara, Interspeech 2016]
  Input acoustic feature 𝒙 → encoder E → embedding; G predicts Output 1 (senone 𝒚), D predicts Output 2 (domain 𝒛), with a gradient reversal layer (GRL) between E and D.
Objective functions:
$V_y = -\sum_i \log P(y_i \mid x_i; \theta_E, \theta_G)$,  $V_z = -\sum_i \log P(z_i \mid x_i; \theta_E, \theta_D)$
Model update:
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy)
$\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy)
$\theta_E \leftarrow \theta_E - \epsilon \left( \frac{\partial V_y}{\partial \theta_E} - \alpha \frac{\partial V_z}{\partial \theta_E} \right)$ (max classification accuracy and min domain accuracy; the reversed sign on the domain term is what the GRL implements)
• ASR results in known (k) and unknown (unk)
noisy conditions
Speech Recognition (AMT)
Table 10: WER of DNNs with single-task learning (ST) and AMT.
The AMT-DNN outperforms ST-DNN with yielding lower WERs.
Speech Recognition
• Domain adversarial training for accented ASR (DAT) [Sun et al., ICASSP 2018]
  Same adversarial multi-task structure and update rules as AMT above: acoustic feature 𝒙 → E → embedding; G predicts Output 1 (senone 𝒚), D predicts Output 2 (domain 𝒛) through a GRL; E maximizes senone classification accuracy while minimizing domain accuracy.
• ASR results on accented speech
Speech Recognition (DAT)
1. With labeled transcriptions, ASR performance notably improves.
Table 11: WER of the baseline and adapted model.
2. DAT is effective in learning features invariant to domain differences
with and without labeled transcriptions.
STD: standard speech
Speech Recognition
• Unsupervised Adaptation with Domain Separation Networks (DSN) [Meng et al., ASRU 2017]
  A shared encoder E (𝑔(∙)) embeds clean data 𝒙 (𝒛 = 𝑔(𝒙)) and noisy data ෥𝒙 (෤𝒛 = 𝑔(෥𝒙)); G (ℎ(∙)) predicts Output 1 (senone 𝒚) and D predicts Output 2 (domain 𝒅); private component extractors (PEs, PEt) and reconstructors (R) separate domain-private information from the shared embedding.
• Results on ASR in noise (CHiME3):
Speech Recognition (DSN)
1. DSN outperforms GRL consistently over different noise types.
2. The results confirmed the additional gains provided by private
component extractors.
Table 12: WER (in %) of Robust ASR on the CHiME3 task.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speaker Recognition
• Domain adversarial neural network (DANN) [Wang et al., ICASSP 2018]
  Enroll and test i-vectors are pre-processed, passed through the DANN, and scored.
  The DANN uses the same adversarial structure as above: input acoustic feature 𝒙 → E → embedding 𝒛; G predicts Output 1 (speaker ID 𝒚) with loss 𝑉𝑦, D predicts Output 2 (domain) with loss 𝑉𝑧 through a GRL.
• Recognition results of domain mismatched conditions
Table 13: Performance of DAT and the state-of-the-art methods.
Speaker Recognition (DANN)
The DAT approach outperforms other methods with
achieving lowest EER and DCF scores.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Emotion Recognition
• Adversarial AE for emotion recognition (AAE-ER) [Sahu et al., Interspeech 2017]
  Auto-encoder: input 𝒙 → encoder E (𝑔(∙)) → embedding 𝒛 = 𝑔(𝒙) → decoder G (ℎ(∙)) → reconstruction; a discriminator D matches the distribution of the code vectors to a prior 𝒒 and enables synthesizing new embeddings.
  AE with GAN objective: $H(h(\boldsymbol{z}), \boldsymbol{x}) + \lambda\, V_{GAN}(\boldsymbol{q}, g(\boldsymbol{x}))$
Emotion Recognition (AAE-ER)
• Recognition results of domain mismatched conditions (Table 14: classification results of different systems; Table 15: classification results on real and synthesized features)
1. AAE alone could not yield performance improvements.
2. Using synthetic data from AAE (in addition to the original training data) can yield higher UAR.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Lip-reading
• Domain adversarial training for lip-reading (DAT-LR) [Wand et al., Interspeech 2017]
  Same adversarial multi-task structure and update rules as above: input 𝒙 → E → embedding 𝒛; G predicts Output 1 (words 𝒚, ~80% WAC), D predicts Output 2 (speaker) through a GRL; E maximizes word classification accuracy while minimizing speaker classification accuracy.
• Recognition results of speaker mismatched conditions
Lip-reading (DAT-LR)
Table 16: Performance of DAT and the baseline.
The DAT approach notably enhances the recognition
accuracies in different conditions.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech Signal Generation (Regression Task)
  Paired input and output speech; a generator G produces the output, trained with an objective function.
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
  An encoder E (embedding function 𝑔(∙)) followed by a classifier G (ℎ(∙)) outputs the label 𝒚.
  Acoustic mismatch: trained on clean data 𝒙 but tested on noisy data ෥𝒙 (෤𝒛 = 𝑔(෥𝒙)), accented speech, or channel-distorted speech.
References
Speech enhancement (conventional methods)
• Y.-X. Wang and D.-L. Wang, Cocktail party processing via structured prediction, NIPS 2012.
• Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, An experimental study on speech enhancement based on deep neural
networks, IEEE SPL, 2014.
• Y. Xu, J. Du, L.-R. Dai, and Chin-Hui Lee, A regression approach to speech enhancement based on deep neural
networks, IEEE/ACM TASLP, 2015.
• X. Lu, Y. Tsao, S. Matsuda, and C. Hori, Speech enhancement based on deep denoising autoencoder, Interspeech 2013.
• Z. Chen, S. Watanabe, H. Erdogan, J. R. Hershey, Integration of speech enhancement and recognition using long-
short term memory recurrent neural network, Interspeech 2015.
• F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, Speech enhancement
with LSTM recurrent neural networks and Its application to noise-robust ASR, LVA/ICA, 2015.
• S.-W. Fu, Y. Tsao, and X.-G. Lu, SNR-aware convolutional neural network modeling for speech enhancement,
Interspeech, 2016.
• S.-W. Fu, Y. Tsao, X.-G. Lu, and Hisashi Kawai, End-to-end waveform utterance enhancement for direct evaluation
metrics optimization by fully convolutional neural networks, IEEE/ACM TASLP, 2018.
Speech enhancement (GAN-based methods)
• P. Santiago, B. Antonio, and S. Joan, SEGAN: Speech enhancement generative adversarial network, Interspeech,
2017.
• D. Michelsanti, and Z.-H. Tan, Conditional generative adversarial networks for speech enhancement and noise-
robust speaker verification, Interspeech, 2017.
• C. Donahue, B. Li, and P. Rohit, Exploring speech enhancement with generative adversarial networks for robust
speech recognition, ICASSP, 2018.
• T. Higuchi, K. Kinoshita, M. Delcroix, and T. Nakatani, Adversarial training for data-driven speech enhancement without parallel corpus, ASRU, 2017.
• S. Pascual, M. Park, J. Serrà, A. Bonafonte, K.-H. Ahn, Language and noise transfer in speech enhancement
generative adversarial network, ICASSP 2018.
References
Speech enhancement (GAN-based methods)
• A. Pandey and D. Wang, On adversarial training and loss functions for speech enhancement, ICASSP 2018.
• M. H. Soni, Neil Shah, and H. A. Patil, Time-frequency masking-based speech enhancement using generative
adversarial network, ICASSP 2018.
• Z. Meng, J.-Y. Li, Y.-G. Gong, B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech, 2018.
• L.-W. Chen, M.Yu, Y.-M. Qian, D. Su, D. Yu, Permutation invariant training of generative adversarial network for
monaural speech separation, Interspeech 2018.
• D. Baby and S. Verhulst, Sergan: Speech enhancement using relativistic generative adversarial networks with
gradient penalty, ICASSP 2019.
Postfilter (conventional methods)
• T. Toda, and K. Tokuda, A speech parameter generation algorithm considering global variance for HMM-based
speech synthesis, IEICE Trans. Inf. Syst., 2007.
• H. Sil’en, E. Helander, J. Nurminen, and M. Gabbouj, Ways to implement global variance in statistical speech
synthesis, Interspeech, 2012.
• S. Takamichi, T. Toda, N. Graham, S. Sakriani, and S. Nakamura, A postfilter to modify the modulation spectrum
in HMM-based speech synthesis, ICASSP, 2014.
• L.-H. Chen, T. Raitio, C. V. Botinhao, J. Yamagishi, and Z.-H. Ling, DNN-based stochastic postfilter for HMM-
based speech synthesis, Interspeech, 2014.
• L.-H. Chen, T. Raitio, C. V. Botinhao, Z.-H. Ling, and J. Yamagishi, A deep generative architecture for postfiltering
in statistical parametric speech synthesis, IEEE/ACM TASLP, 2015.
Postfilter (GAN-based methods)
• K. Takuhiro, K. Hirokazu, H. Nobukatsu, Y. Ijima, K. Hiramatsu, and K. Kashino, Generative adversarial network-
based postfilter for statistical parametric speech synthesis, ICASSP, 2017.
• K. Takuhiro, T. Shinji, K. Hirokazu, and J. Yamagishi, Generative adversarial network-based postfilter for STFT
spectrograms, Interspeech, 2017.
• Y. Saito, S. Takamichi, and H. Saruwatari, Training algorithm to deceive anti-spoofing verification for DNN-based
speech synthesis, ICASSP, 2017.
• Y. Saito, S. Takamichi, H. Saruwatari, Statistical parametric speech synthesis incorporating generative
adversarial networks, IEEE/ACM TASLP, 2018.
• B. Bollepalli, L. Juvela, and A. Paavo, Generative adversarial network-based glottal waveform model for
statistical parametric speech synthesis, Interspeech, 2017.
• S. Yang, L. Xie, X. Chen, X.-Y. Lou, X. Zhu, D.-Y. Huang, and H.-Z. Li, Statistical parametric speech synthesis using
generative adversarial networks under a multi-task learning framework, ASRU, 2017.
References
VC (conventional methods)
• T. Toda, A. W. Black, and K. Tokuda, Voice conversion based on maximum likelihood estimation of spectral
parameter trajectory, IEEE/ACM TASLP, 2007.
• L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, Voice conversion using deep neural networks with layer-wise
generative training, IEEE/ACM TASLP, 2014.
• S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, Spectral mapping using artificial neural networks for
voice conversion, IEEE/ACM TASLP, 2010.
• T. Nakashika, T. Takiguchi, Y. Ariki, High-order sequence modeling using speaker-dependent recurrent temporal
restricted boltzmann machines for voice conversion, Interspeech, 2014.
• K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric
learned using generative adversarial networks, Interspeech, 2017.
• Z.-Z. Wu, T. Virtanen, E.-S. Chng, and H.-Z. Li, Exemplar-based sparse representation with residual compensation
for voice conversion, IEEE/ACM TASLP, 2014.
• S.-W. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, Joint dictionary learning-based non-negative matrix
factorization for voice conversion to improve speech intelligibility after oral surgery, IEEE TBME, 2017.
• Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, Locally linear embedding for exemplar-based spectral
conversion, Interspeech, 2016.
• C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, Y., and H.-M. Wang, Voice conversion from non-parallel corpora using
variational auto-encoder. APSIPA 2016.
VC (GAN-based methods)
• C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang Voice conversion from unaligned corpora using
variational autoencoding wasserstein generative adversarial networks, Interspeech 2017.
• K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric
learned using generative adversarial networks, Interspeech, 2017.
References
VC (GAN-based methods)
• K. Takuhiro, and K. Hirokazu. Parallel-data-free voice conversion using cycle-consistent adversarial networks,
arXiv, 2017.
• N. Shah, N. J. Shah, and H. A. Patil, Effectiveness of generative adversarial network for non-audible murmur-to-
whisper speech conversion, Interspeech, 2018.
• J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by adversarially
learning disentangled audio representations, Interspeech, 2018.
• G. Degottex, and M. Gales, A spectrally weighted mixture of least square error and wasserstein discriminator
loss for generative SPSS, SLT, 2018.
• B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, Adaptive wavenet vocoder for residual compensation in
GAN-based voice conversion, SLT, 2018.
• C.-C. Yeh, P.-C. Hsu, J.-C. Chou, H.-Y. Lee, and L.-S. Lee, Rhythm-flexible voice conversion without parallel data
using cycle-GAN over phoneme posteriorgram sequences, SLT, 2018.
• H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, STARGAN-VC: Non-parallel many-to-many voice conversion with
star generative adversarial networks, SLT, 2018.
• K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, Synthetic-to-natural speech waveform conversion using cycle-
consistent adversarial networks, SLT, 2018.
• O. Ocal, O. H. Elibol, G. Keskin, C. Stephenson, A. Thomas, and K. Ramchandran, Adversarially trained
autoencoders for parallel-data-free voice conversion, ICASSP, 2019.
• F. Fang, X. Wang, J. Yamagishi, and I. Echizen, Audiovisual speaker conversion: Jointly and simultaneously
transforming facial expression and acoustic characteristics, ICASSP, 2019.
• S. Seshadri, L. Juvela, J. Yamagishi, Okko Räsänen, and P. Alku, Cycle-consistent adversarial networks for non-
parallel vocal effort based speaking style conversion, ICASSP, 2019.
• T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, CYCLEGAN-VC2: Improved cyclegan-based non-parallel voice
conversion, ICASSP, 2019.
• L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, Waveform generation for text-to-speech synthesis using pitch-
synchronous multi-scale generative adversarial networks, ICASSP, 2019.
References
Speaker recognition
• Q. Wang, W. Rao, S.-I. Sun, L. Xie, E.-S. Chng, and H.-Z. Li, Unsupervised domain adaptation via domain
adversarial training for speaker recognition, ICASSP, 2018.
• H. Yu, Z.-H. Tan, Z.-Y. Ma, and J. Guo, Adversarial network bottleneck features for noise robust speaker
verification, arXiv, 2017.
• G. Bhattacharya, J. Alam, & P. Kenny, Adapting end-to-end neural speaker verification to new languages and
recording conditions with adversarial training, ICASSP, 2019.
• Z. Peng, S. Feng, & T. Lee, Adversarial multi-task deep features and unsupervised back-end adaptation for
language recognition, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, & Y. Gong, Adversarial speaker verification, ICASSP, 2019.
• X. Fang, L. Zou, J. Li, L. Sun, & Z.-H. Ling, Channel adversarial training for cross-channel text-independent
speaker recognition, ICASSP, 2019.
• W. Xia, J. Huang, & J. H. Hansen, Cross-lingual text-independent speaker verification using unsupervised
adversarial discriminative domain adaptation, ICASSP, 2019.
• P. S. Nidadavolu, J. Villalba, & N. Dehak, Cycle-GANs for domain adaptation of acoustic features for speaker
recognition, ICASSP, 2019.
• G. Bhattacharya, J. Monteiro, J. Alam, & P. Kenny, Generative adversarial speaker embedding networks for
domain robust end-to-end speaker verification, ICASSP, 2019.
• J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, & O. Plchot, Speaker verification using end-to-end
adversarial language adaptation, ICASSP, 2019.
• Zhou, J., Jiang, T., Li, L., Hong, Q., Wang, Z., & Xia, B., Training multi-task adversarial network for extracting
noise-robust speaker embedding, ICASSP, 2019.
• J. Zhang, N. Inoue, & K. Shinoda, I-vector transformation using conditional generative adversarial networks for
short utterance speaker verification, arXiv, 2018.
• W. Ding, & L. He, Mtgan: Speaker verification through multitasking triplet generative adversarial networks, arXiv,
2018.
• X. Miao, I. McLoughlin, S. Yao, & Y. Yan, Improved conditional generative adversarial net classification for
spoken language recognition, SLT, 2018.
References
Automatic Speech Recognition
• Yusuke Shinohara, Adversarial multi-task learning of deep neural networks for robust speech recognition,
Interspeech, 2016.
• D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, Invariant Representations for
Noisy Speech Recognition, arXiv, 2016.
• Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, Cross-domain speech recognition using nonparallel
corpora with cycle-consistent adversarial networks, ASRU, 2017.
• A. Sriram, H.-W Jun, Y. Gaur, and S. Satheesh, Robust speech recognition using generative adversarial networks,
arXiv, 2017.
• Z. Meng, Z. Chen, V. Mazalov, J. Li, J., and Y. Gong, Unsupervised adaptation with domain separation networks
for robust speech recognition, ASRU, 2017.
• Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. Juang, Speaker-invariant training via adversarial
learning, ICASSP, 2018.
• Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial teacher-student learning for unsupervised domain
adaptation, ICASSP, 2018.
• Y. Zhang, P. Zhang, and Y. Yan, Improving language modeling with an adversarial critic for automatic speech
recognition, Interspeech, 2018.
• S. Sun, C. Yeh, M. Ostendorf, M. Hwang, and L. Xie, Training augmentation with adversarial examples for robust
speech recognition, Interspeech, 2018.
• Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech
2018.
• K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, and L. Xie, Investigating generative adversarial networks based
speech dereverberation for robust speech recognition, Interspeech 2018.
• Z. Meng, J. Li, Y. Gong, B.-H. Juang, Cycle-consistent speech enhancement, Interspeech 2018.
• J. Drexler and J. Glass, Combining end-to-end and adversarial training for low-resource speech recognition, SLT,
2018.
• A. H. Liu, H. Lee and L. Lee, Adversarial training of end-to-end speech recognition using a criticizing language
model, ICASSP, 2019.
References
Automatic Speech Recognition
• J. Yi, J. Tao and Y. Bai, Language-invariant bottleneck features from adversarial end-to-end acoustic models for low resource speech recognition, ICASSP, 2019.
• D. Haws and X. Cui, Cyclegan bandwidth extension acoustic modeling for automatic speech recognition, ICASSP,
2019.
• Z. Meng, J. Li, and Y. Gong, Attentive adversarial learning for domain-invariant training, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, and Y. Gong, Adversarial speaker verification, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, and Y. Gong., Adversarial speaker adaptation, ICASSP, 2019.
Emotion recognition
• J. Chang, and S. Scherer, Learning representations of emotional speech with deep convolutional generative
adversarial networks, ICASSP, 2017.
• S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, Adversarial auto-encoders for speech
based emotion recognition. Interspeech, 2017.
• S. Sahu, R. Gupta, and C. E.-Wilson, On enhancing speech emotion recognition using generative adversarial
networks, Interspeech 2018.
• C.-M. Chang, and C.-C. Lee, Adversarially-enriched acoustic code vector learned from out-of-context affective
corpus for robust emotion recognition, ICASSP 2019.
• J. Liang, S. Chen, J. Zhao, Q. Jin, H. Liu, and L. Lu, Cross-culture multimodal emotion recognition with adversarial
learning, ICASSP 2019.
Lipreading
• M. Wand, and J. Schmidhuber, Improving speaker-independent lipreading with domain-adversarial training,
arXiv, 2017.
References
Outline of Part III
Our Recent Works
• Noise adaptive speech enhancement [Interspeech 2019]
• MetricGAN for speech enhancement [ICML 2019]
• Multi-Target voice conversion [Interspeech 2018]
• Impaired speech conversion [Interspeech 2019]
• Pathological voice detection [NeurIPS workshop 2018]
[Mon-P-2-A]
[Wed-P-6-E]
Speech Enhancement
• Noise Adaptive Speech Enhancement (NA-SE) [Liao et al., Interspeech 2019] [Wed-P-6-E]
  The training noise types (e.g. N4, N5, N7, N9, N10, N12) differ from the unseen test noise type (e.g. N11).
  Baseline: encoder E and generator G minimize the reconstruction error $V_y$:
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$,  $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E}$
Speech Enhancement (NA-SE)
• Domain adversarial training for NA-SE
  A noise-type discriminator D (through a GRL) is added on the embedding 𝒛:
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (min reconstruction error)
$\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy)
$\theta_E \leftarrow \theta_E - \epsilon \left( \frac{\partial V_y}{\partial \theta_E} - \alpha \frac{\partial V_z}{\partial \theta_E} \right)$ (min reconstruction error and min domain accuracy; the reversed sign is what the GRL implements)
Speech Enhancement (NA-SE)
• Objective evaluations
The DAT-based unsupervised adaptation can notably overcome
the mismatch issue of training and testing noise types.
Fig. 15: PESQ at different SNR levels.
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Pandey et al., ICASSP 2018]
  G: noisy spectrogram → output mask, trained toward the reference mask.
  D: scalar; (noisy, reference mask) pairs are real, (noisy, output mask) pairs are fake.
• MetricGAN for Speech Enhancement [Fu et al., ICML 2019]
[Diagram: G maps the noisy spectrogram to an output mask; point-wise multiplication with the noisy spectrogram gives the enhanced spectrogram; D assigns a metric score (0~1), e.g. 0.4 to the enhanced spectrogram and 1.0 to the clean spectrogram.]
Speech Enhancement (MetricGAN)
With MetricGAN, we are free to specify the target metric score
(PESQ or STOI) for the generated speech.
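As a rough illustration of this idea (not the released implementation): the discriminator is trained to regress the true metric score of an enhanced utterance, and the generator is trained to push the predicted score toward a chosen target. `G`, `D`, and `compute_metric` below are assumed placeholders (a mask-estimation network, a score-prediction network, and a stand-in for a normalized PESQ/STOI routine).

```python
import torch
import torch.nn.functional as F

def compute_metric(enhanced, clean):
    # Dummy stand-in for a normalized PESQ/STOI routine returning a score in [0, 1].
    return torch.clamp(1.0 - F.l1_loss(enhanced, clean), 0.0, 1.0).unsqueeze(0)

def metricgan_step(noisy, clean, G, D, opt_G, opt_D, target_score=1.0):
    # --- Discriminator: learn to predict the true (normalized) metric score ---
    with torch.no_grad():
        enhanced = G(noisy) * noisy                     # point-wise mask multiplication
    true_score = compute_metric(enhanced, clean)
    d_loss = F.mse_loss(D(enhanced, clean), true_score) \
           + F.mse_loss(D(clean, clean), torch.ones_like(true_score))  # clean pair -> 1.0
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Generator: make D's predicted score reach the specified target ---
    enhanced = G(noisy) * noisy
    g_loss = F.mse_loss(D(enhanced, clean),
                        torch.full_like(true_score, target_score))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```

Because the target score is just a number passed to the generator loss, the same sketch covers aiming at PESQ, STOI, or any other black-box metric the discriminator learns to mimic.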
Voice Conversion
• Multi-target VC [Chou et al., Interspeech 2018]
[Diagram, Stage 1: an encoder Enc maps speech x to enc(x); a speaker classifier C is applied to enc(x); a decoder Dec produces dec(enc(x), y) for the source speaker code y and dec(enc(x), y′) for a target speaker code y′.]
[Diagram, Stage 2: a generator G refines the converted output, and a discriminator-plus-classifier D+C scores it as fake/real (F/R) and predicts the speaker identity (ID) against real data.]
• Subjective evaluations
Voice Conversion (Multi-target VC)
Fig. 16: Preference test results
1. The proposed method uses non-parallel data.
2. The multi-target VC approach outperforms the one-stage-only variant.
3. The multi-target VC approach is comparable to Cycle-GAN-VC in
terms of naturalness and similarity.
• Controller-generator-discriminator VC on Impaired
Speech [Chen et al., Interspeech 2019]
Voice Conversion
Previous applications: hearing aids; murmur to normal speech; bone-
conductive microphone to air-conductive microphone.
Proposed: improving the speech intelligibility of surgical patients.
Target: oral cancer (among the top five cancers for males in Taiwan).
[Audio demo: before/after conversion samples]
[Mon-P-2-A]
• Controller-generator-discriminator VC (CGD VC) on
impaired speech [Chen et al., Interspeech 2019]
Voice Conversion
[Diagram: a controller, a generator G, and a discriminator D form the CGD pipeline.]
Voice Conversion (CGD VC)
• Spectrogram analysis
Fig. 17: Spectrogram comparison of CGD with CycleGAN.
• Subjective evaluations
Voice Conversion (CGD VC)
The proposed method outperforms conditional GAN and CycleGAN
in terms of content similarity, speaker similarity, and articulation.
Fig. 18: MOS for content similarity, speaker similarity, and articulation.
Pathological Voice Detection
• Detection of Pathological Voice Using Cepstrum Vectors:
A Deep Learning Approach [Fang et al., Journal of Voice 2018]
Database GMM SVM DNN
MEEI 98.28 98.26 99.14
FEMH (M) 90.24 93.04 94.26
FEMH (F) 90.20 87.40 90.52
Table 17: Detection performance based on voice.
Pathological Voice Detection
• Robustness Against Channel [Hsu et al., NeurIPS Workshop 2018]
[Diagram: an encoder E maps the input x to a latent code z; G predicts the detection output y (loss V_y); a domain discriminator D predicts the recording channel (domain) from z (loss V_z), in the same DAT setup as above.]
DNN (S) DNN (T) DNN (FT) Unsup. DAT Sup. DAT
PR-AUC 0.8848 0.8509 0.9021 0.9455 0.9522
Unsupervised DAT notably improves robustness against channel effects
and yields results comparable to supervised DAT.
Table 18: Detection results of sup. and unsup. DAT under channel mismatches.
• C.-F. Liao, Y. Tsao, H.-Y. Lee and H.-M. Wang, Noise adaptive speech enhancement using domain adversarial
training, Interspeech 2019.
• J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by
adversarially learning disentangled audio representations, Interspeech 2018.
• L.-W. Chen, H.-Y. Lee, and Y. Tsao, Generative adversarial networks for unpaired voice transformation on
impaired speech, Interspeech 2019.
• S.-W. Fu, C.-F. Liao, Y. Tsao, S.-D. Lin, MetricGAN: Generative adversarial networks based black-box metric
scores optimization for speech enhancement, ICML, 2019.
• C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, Y. Tsao, Detection of pathological voice using
cepstrum vectors: a deep learning approach, Journal of Voice, 2018.
• S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, Demographic and symptomatic features of voice
disorders and their potential application in classification using machine learning algorithms, Folia Phoniatrica et
Logopaedica, 2018.
• S.-H. Fang, C.-T. Wang, J.-Y. Chen, Y. Tsao and F.-C. Lin, Combining acoustic signals and medical records to
improve pathological voice classification, APSIPA, 2019.
• Y.-T. Hsu, Z. Zhu, C.-T. Wang, S.-H. Fang, F. Rudzicz, and Y. Tsao, Robustness against the channel effect in
pathological voice detection, NeurIPS 2018 Machine Learning for Health (ML4H) Workshop, 2018.
References
Thank You Very Much
Tsao, Yu Ph.D., Academia Sinica
yu.tsao@citi.sinica.edu.tw
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Part III: Speech Signal Processing
Part IV: Natural Language Processing
NLP tasks usually involve Sequence Generation
How to use GAN to improve sequence generation?
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
Why do we need GAN?
• Chat-bot as example
[Diagram: a seq2seq chatbot; the encoder takes the input sentence c and the decoder produces the output sentence x.]
Training data:
A: How are you ?
B: I’m good.
…………
Training criterion: maximize likelihood of the reference response
(input: How are you ? → reference: I’m good.)
Output: Not bad I’m John.   (humans would judge other responses to be better)
Reinforcement Learning [Li, et al., EMNLP, 2016]
[Diagram: the chatbot (encoder-decoder) takes an input sentence c and produces a response sentence x; a human gives a reward R(c, x).]
Learn to maximize the expected reward, e.g. by policy gradient.
Example: for “How are you?”, the human rewards “Not bad” with +1 and “I’m John” with −1.
Policy Gradient
Using the current policy θ^t, sample pairs (c^1, x^1), (c^2, x^2), …, (c^N, x^N) and collect rewards R(c^1, x^1), R(c^2, x^2), …, R(c^N, x^N).
Gradient estimate: ∇R̄(θ^t) ≈ (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇ log P_{θ^t}(x^i | c^i)
Update: θ^{t+1} ← θ^t + η ∇R̄(θ^t)
If R(c^i, x^i) is positive, update θ to increase P_θ(x^i | c^i);
if R(c^i, x^i) is negative, update θ to decrease P_θ(x^i | c^i).
Policy Gradient
Maximum Likelihood vs. Reinforcement Learning (Policy Gradient):

Maximum Likelihood
Training data: {(c^1, x̂^1), …, (c^N, x̂^N)}   (x̂^i: reference responses)
Objective function: (1/N) Σ_{i=1}^{N} log P_θ(x̂^i | c^i)
Gradient: (1/N) Σ_{i=1}^{N} ∇ log P_θ(x̂^i | c^i)
(equivalent to policy gradient with every R(c^i, x̂^i) = 1)

Reinforcement Learning - Policy Gradient
Training data: {(c^1, x^1), …, (c^N, x^N)}   (obtained from interaction)
Objective function: (1/N) Σ_{i=1}^{N} R(c^i, x^i) log P_θ(x^i | c^i)
Gradient: (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇ log P_θ(x^i | c^i)
(each term weighted by R(c^i, x^i))
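A minimal REINFORCE-style sketch of the policy-gradient update above is given below. It assumes a hypothetical `chatbot` object with a `sample(c)` method and a differentiable `log_prob(x, c)`, plus a `reward(c, x)` function (a human score or, later, the discriminator); none of these names come from the tutorial's code.

```python
import torch

def policy_gradient_step(chatbot, optimizer, conditions, reward):
    """One update of theta^{t+1} <- theta^t + eta * grad(R_bar)."""
    losses = []
    for c in conditions:
        x = chatbot.sample(c)                          # interact with the "environment"
        r = reward(c, x)                               # scalar R(c, x)
        losses.append(-r * chatbot.log_prob(x, c))     # minimize -R(c,x) * log P_theta(x|c)
    loss = torch.stack(losses).mean()                  # (1/N) * sum_i R(c^i, x^i) log P
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Setting r = 1 for every sample and drawing x from the reference responses instead of the policy recovers the maximum-likelihood update in the left column above.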
Conditional GAN
[Diagram: the chatbot (encoder-decoder) maps the input sentence c to a response sentence x, e.g. “I am busy.”; a discriminator takes (c, x) and provides the reward R(c, x).]
Replace human evaluation with machine evaluation [Li, et al., EMNLP, 2017]
However, there is an issue when you train your generator.
[Diagram: starting from <BOS>, the generator outputs a token (A or B) at each step by sampling from its output distribution (with some inputs obtained by attention); the generated sequence is fed to the discriminator, which outputs a scalar.]
Can we use gradient ascent to update the generator's parameters? NO!
The sampling of discrete tokens is a non-differentiable part of the computation.
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
Gumbel-softmax
Source of image: https://blog.evjang.com/2016/11/tutorial-categorical-variational.html
Using the reparameterization trick, as is done for training VAEs.
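A minimal sketch of how the relaxation is used in practice, assuming PyTorch's built-in `gumbel_softmax`; the vocabulary size, embedding table, and generator logits below are dummy values, not taken from any of the cited papers.

```python
import torch
import torch.nn.functional as F

vocab_size, emb_dim = 10_000, 256
logits = torch.randn(4, vocab_size)            # generator's word scores for a batch of 4

# Differentiable "sampling": soft one-hot vectors (tau controls the sharpness)
soft = F.gumbel_softmax(logits, tau=0.5, hard=False)
# Straight-through variant: one-hot in the forward pass, soft gradient in the backward pass
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)

# Feed the relaxed samples to the discriminator via an embedding lookup,
# so gradients can flow from D back through the "sampling" step.
word_embeddings = torch.randn(vocab_size, emb_dim, requires_grad=True)
disc_input = soft @ word_embeddings            # expected embedding of the sampled word
```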
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
[Diagram: the generator again decodes from <BOS>, but the per-step output distributions (instead of sampled tokens) are fed to the discriminator, which outputs a scalar.]
Use the distribution as the input of the discriminator to avoid the sampling process.
We can do backpropagation now.
What is the problem?
• Real sentence (one-hot word vectors):
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
• Generated (softmax word distributions):
0.9 0.1 0   0   0
0.1 0.9 0   0   0
0.1 0.1 0.7 0.1 0
0   0   0.1 0.8 0.1
0   0   0   0.1 0.9
The generated distributions can never be exactly one-hot, so the
discriminator can immediately find the difference.
A discriminator with a constraint (e.g. WGAN) can be helpful.
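One common way to impose such a constraint is a WGAN-style gradient penalty computed on interpolations between real one-hot sequences and generated word distributions. The sketch below is generic and not tied to any of the cited papers; `D` is an assumed discriminator mapping a (batch, seq_len, vocab) tensor to one score per example.

```python
import torch

def gradient_penalty(D, real_onehot, fake_probs, lam=10.0):
    """WGAN-GP term that keeps the discriminator's gradient norm close to 1."""
    batch = real_onehot.size(0)
    alpha = torch.rand(batch, 1, 1, device=real_onehot.device)
    interp = (alpha * real_onehot + (1 - alpha) * fake_probs).requires_grad_(True)
    scores = D(interp)                                        # shape (batch,)
    grads = torch.autograd.grad(outputs=scores.sum(),
                                inputs=interp,
                                create_graph=True)[0]         # same shape as interp
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()

# Discriminator loss sketch: -(D(real).mean() - D(fake).mean()) + gradient_penalty(...)
```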
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
[Diagram: the generator decodes tokens (A or B) step by step from <BOS>; the generated sequence is scored by the discriminator, which outputs a scalar.]
Generator = agent in RL; the generated tokens are the actions taken; the discriminator plays the role of the environment, and its score is the reward.
Trained by an RL algorithm (e.g. policy gradient).
The reward function may change during training → different from typical RL.
Tips for Sequence Generation GAN
RL is difficult to train; GAN is difficult to train;
a sequence generation GAN (RL+GAN) combines both.
Tips for Sequence Generation GAN
• Usually the generator is fine-tuned from a model learned
by maximum likelihood.
• However, with enough hyperparameter tuning and tips,
ScratchGAN can train from scratch.
[Cyprien de Masson d'Autume, et al., arXiv 2019]
Tips for Sequence Generation GAN
• Typical: the discriminator only scores the complete response
("You is good"), so the generator does not know which part is wrong …
• Reward for every generation step: the discriminator scores every partial
sequence ("You" → 0.9, "You is" → 0.1, "You is good" → 0.1).
Tips for Sequence Generation GAN
• Reward for Every Generation Step
[Diagram: the discriminator scores each partial sequence: "You" → 0.9, "You is" → 0.1, "You is good" → 0.1.]
Method 1. Monte Carlo (MC) search [Yu, et al., AAAI, 2017]
Method 2. Discriminator for partially decoded sequences [Li, et al., EMNLP, 2017]
Method 3. Step-wise evaluation [Tuan, Lee, TASLP, 2019][Xu, et al., EMNLP, 2018][William Fedus, et al., ICLR, 2018]
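For example, Monte Carlo search estimates a step-wise reward by completing each partial sequence many times and averaging the discriminator's scores over the rollouts. A minimal sketch, with `generator.rollout` and `D` as hypothetical helpers (a sampling-based completion routine and a sequence-level discriminator):

```python
def step_reward(prefix, generator, D, n_rollouts=16):
    """Average discriminator score over n_rollouts completions of `prefix`."""
    scores = [D(generator.rollout(prefix)) for _ in range(n_rollouts)]
    return sum(scores) / n_rollouts

# The reward for generating the t-th token is step_reward(x[:t], ...),
# which is then plugged into the policy-gradient update shown earlier.
```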
Empirical Performance
• MLE frequently generates “I’m sorry”, “I don’t
know”, etc. (corresponding to fuzzy images?)
• GAN generates longer and more complex responses.
• Find more comparisons in the survey papers.
• [Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018]
• However, no strong evidence shows that GANs are
better than MLE.
• [Stanislau Semeniuta, et al., arXiv, 2018] [Guy Tevet, et al., arXiv, 2018]
[Massimo Caccia, et al., arXiv, 2018]
More Applications
• Supervised machine translation [Wu, et al., arXiv
2017][Yang, et al., arXiv 2017]
• Supervised abstractive summarization [Liu, et al., AAAI
2018]
• Image/video caption generation [Rakshith Shetty, et al., ICCV
2017][Liang, et al., arXiv 2017]
• Data augmentation for code-switching ASR [Mon-P-1-D] [Chang, et al., INTERSPEECH 2019]
If you are trying to generate some sequences,
you can consider GAN.
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
[Overview diagram of unsupervised conditional generation: male ↔ female faces (cf. Part I), positive ↔ negative sentences (text style transfer), document → summary (unsupervised abstractive summarization), Language 1 ↔ Language 2 (unsupervised translation), audio ↔ text (unsupervised ASR, cf. Part III).]
Cycle-GAN
[Diagram: G_X→Y maps domain X to domain Y and G_Y→X maps back, keeping the reconstruction as close as possible to the input (and likewise for the Y→X→Y cycle); D_Y and D_X output scalars indicating whether an input belongs to domain Y or domain X.]
Cycle-GAN
[Diagram: the same cycle applied to sentences, where D_X asks "negative sentence?" and D_Y asks "positive sentence?".]
Example cycles:
"It is bad." (negative) → "It is good." (positive) → "It is bad." (negative)
"I love you." (positive) → "I hate you." (negative) → "I love you." (positive)
Non-differentiable Issue?
You already know how to deal with it.
✘ Negative sentence to positive sentence:
it's a crappy day -> it's a great day
i wish you could be here -> you could be here
it's not a good idea -> it's good idea
i miss you -> i love you
i don't love you -> i love you
i can't do that -> i can do that
i feel so sad -> i happy
it's a bad day -> it's a good day
it's a dummy day -> it's a great day
sorry for doing such a horrible thing -> thanks for doing a
great thing
my doggy is sick -> my doggy is my doggy
my little doggy is sick -> my little doggy is my little doggy
Cycle GAN
Thanks to Yau-Shian Wang (王耀賢) for providing the experimental results.
[Lee, et al., ICASSP, 2018]
Shared Latent Space
[Diagram: encoders EN_X and EN_Y map positive and negative sentences into a shared latent space; decoders DE_X and DE_Y generate sentences for each domain; D_X and D_Y are the discriminators of the X and Y domains.]
Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017]
Domain discriminator: it predicts whether a latent code comes from EN_X or EN_Y, and EN_X and EN_Y are trained to fool it.
[Zhao, et al., arXiv, 2017] [Fu, et al., AAAI, 2018]
[Overview diagram repeated as a section divider; next topic: Unsupervised Abstractive Summarization.]
Abstractive Summarization
• Machines can now do abstractive summarization by
seq2seq (writing summaries in their own words).
[Diagram: documents paired with reference summaries (summary 1, 2, 3, …) serve as training data for the seq2seq model.]
Supervised: we need lots of labelled training data.
Unsupervised Abstractive Summarization [Wang, et al., EMNLP, 2018]
[Diagram: human-written summaries (summary 1, 2, 3, …) form domain Y and documents form domain X; a seq2seq model maps between the two domains without paired data.]
Unsupervised Abstractive Summarization
[Diagram: a seq2seq generator G maps a document to a word sequence (the candidate summary); a discriminator D compares it with human-written summaries and judges whether it is real or not.]
[Diagram, extended: a second seq2seq model R takes the word sequence and reconstructs the document; training minimizes the reconstruction error.]
Unsupervised Abstractive Summarization
[Diagram: G (seq2seq) maps the document to a word sequence, the candidate summary, and R (seq2seq) maps it back to the document.]
Only a large collection of documents is needed to train the model.
This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation.
Problem: the latent word sequence is not readable …
Unsupervised Abstractive Summarization
[Diagram: the seq2seq2seq auto-encoder (G and R) is combined with a discriminator D trained on human-written summaries; G must make D consider its output real, which forces the latent word sequence to become a readable summary.]
Experimental results
                               ROUGE-1  ROUGE-2  ROUGE-L
Supervised                       33.2     14.2     30.5
Trivial                          21.9      7.7     20.5
Unsupervised (matched data)      28.1     10.0     25.4
Unsupervised (no matched data)   27.2      9.1     24.1
English Gigaword (document title as summary)
• Matched data: using the titles of English Gigaword to train the Discriminator
• No matched data: using the titles of CNN/Daily Mail to train the Discriminator
[Wang, Lee, EMNLP 2018]
Semi-supervised Learning
[Figure: ROUGE-1 (y-axis from 25 to 34) vs. the number of document-summary pairs used (0, 10k, 500k) for the WGAN and Reinforce variants (the two approaches to deal with the discrete issue), moving from unsupervised to semi-supervised; the fully supervised baseline uses 3.8M pairs.]
[Wang, Lee, EMNLP 2018]
More Unsupervised Summarization
• Unsupervised summarization with language prior [Christos Baziotis, et al., NAACL 2019]
• Unsupervised multi-document summarization [Eric Chu, Peter Liu, ICML 2019]
Dialogue Response Generation [Su, et al., INTERSPEECH, 2019] (Thu-P-9-C)
[Diagram: a chatbot G, trained on general dialogues, generates a response to an input sentence; a discriminator D, trained on what Trump has said ("Make the US great again", "I would build a great wall", "you are fired"), judges whether the response was "said by Trump"; a reconstruction module R recovers the input sentence from the generated response to minimize the reconstruction error.]
[Overview diagram repeated as a section divider; next topic: Unsupervised Translation.]
Unsupervised learning with 10M sentences performs on par with supervised learning with 100K sentence pairs.
[Figure: learning curves of the supervised and unsupervised translation models.]
[Alexis Conneau, et al., ICLR, 2018]
[Guillaume Lample, et al., ICLR, 2018]
[Overview diagram repeated as a section divider; next topic: Unsupervised Speech Recognition.]
Towards Unsupervised ASR - Cycle GAN
[Diagram: G (ASR) maps audio to text and R (TTS) maps the text back to audio, minimizing the reconstruction error (speech chain); a discriminator D judges whether the ASR output looks like real text (e.g. "how are you", "good morning", "i am fine").]
[Andros Tjandra, et al., ASRU 2017]
[Liu, et al., INTERSPEECH 2018]
[Yeh, et al., ICLR 2019]
[Chen, et al., INTERSPEECH 2019]
Towards Unsupervised ASR - Cycle GAN
• Unsupervised setting on TIMIT (text and audio are unpaired;
the text is not the transcription of the audio)
• 63.6% PER (oracle boundaries) [Liu, et al., INTERSPEECH 2018]
• 41.6% PER (automatic segmentation) [Yeh, et al., ICLR 2019]
• 33.1% PER (automatic segmentation) [Chen, et al., INTERSPEECH 2019] (Tue-P-4-B)
• Semi-supervised setting on Librispeech
[Liu, et al., ICASSP 2019] [Tomoki Hayashi, et al., SLT 2018]
[Takaaki Hori, et al., ICASSP 2019] [Murali Karthick Baskar, et al., INTERSPEECH 2019]
Towards Unsupervised ASR - Shared Latent Space
[Diagram: a text encoder and an audio encoder map into a shared latent space; an audio decoder and a text decoder map back from it, so "this is text" is reconstructed through either path.]
Unsupervised setting on Librispeech: 76.3% WER
WSJ with 2.5 hours paired data: 64.6% WER
LJ speech with 20 mins paired data: 11.7% PER
[Chen, et al., SLT 2018]
Unsupervised speech translation is also possible!
[Chung, et al., NIPS 2018]
[Jennifer Drexler, et al., SLT 2018]
[Ren, et al., ICML 2019]
[Chung, et al., ICASSP 2019]
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
To Learn More …
https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf
You can learn more from the YouTube Channel
(in Mandarin)
Reference
• Sequence Generation
• Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan
Jurafsky, Deep Reinforcement Learning for Dialogue Generation, EMNLP,
2016
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky,
Adversarial Learning for Neural Dialogue Generation, EMNLP, 2017
• Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of
Discrete Elements with the Gumbel-softmax Distribution, arXiv 2016
• Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu
Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative
Adversarial Networks, arXiv 2017
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence
Generative Adversarial Nets with Policy Gradient, AAAI 2017
• Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron
Courville, Adversarial Generation of Natural Language, arXiv, 2017
• Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf, Language
Generation with Recurrent Generative Adversarial Networks without Pre-
training, ICML workshop, 2017
Reference
• Sequence Generation
• Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang,
Zhuoran Wang, Chao Qi , Neural Response Generation via GAN with an
Approximate Embedding Layer, EMNLP, 2017
• Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron
Courville, Yoshua Bengio, Professor Forcing: A New Algorithm for Training
Recurrent Networks, NIPS, 2016
• Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan
Shen, Lawrence Carin, Adversarial Feature Matching for Text Generation,
ICML, 2017
• Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang, Long Text
Generation via Adversarial Training with Leaked Information, AAAI, 2018
• Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun,
Adversarial Ranking for Language Generation, NIPS, 2017
• William Fedus, Ian Goodfellow, Andrew M. Dai, MaskGAN: Better Text
Generation via Filling in the______, ICLR, 2018
Reference
• Sequence Generation
• Yi-Lin Tuan, Hung-Yi Lee, Improving Conditional Sequence Generative
Adversarial Networks by Stepwise Evaluation, TASLP, 2019
• Jingjing Xu, Xuancheng Ren, Junyang Lin, Xu Sun, Diversity-Promoting GAN:
A Cross-Entropy Based Generative Adversarial Network for Diversified Text
Generation, EMNLP, 2018
• Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu, Neural Text
Generation: Past, Present and Beyond, arXiv, 2018
• Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun
Wang, Yong Yu, Texygen: A Benchmarking Platform for Text Generation
Models, arXiv, 2018
• Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly, On Accurate Evaluation
of GANs for Language Generation, arXiv, 2018
• Guy Tevet, Gavriel Habib, Vered Shwartz, Jonathan Berant, Evaluating Text
GANs as Language Models, arXiv, 2018
• Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle
Pineau, Laurent Charlin, Language GANs Falling Short, arXiv, 2018
Reference
• Sequence Generation
• Zhen Yang, Wei Chen, Feng Wang, Bo Xu, Improving Neural Machine
Translation with Conditional Sequence Generative Adversarial Nets, NAACL,
2018
• Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu,
Adversarial Neural Machine Translation, arXiv 2017
• Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li, Generative
Adversarial Network for Abstractive Text Summarization, AAAI 2018
• Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt
Schiele, Speaking the Same Language: Matching Machine to Human
Captions by Adversarial Training, ICCV 2017
• Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing, Recurrent
Topic-Transition GAN for Visual Paragraph Generation, arXiv 2017
• Weili Nie, Nina Narodytska, Ankit Patel, RelGAN: Relational Generative
Adversarial Networks for Text Generation, ICLR 2019
Reference
• Sequence Generation
• Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee, "Code-switching Sentence
Generation by Generative Adversarial Networks and its Application to Data
Augmentation", INTERSPEECH 2019
• Cyprien de Masson d'Autume, Mihaela Rosca, Jack Rae, Shakir Mohamed,
Training language GANs from Scratch, arXiv 2019
Reference
• Text Style Transfer
• Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Rui Yan, Style
Transfer in Text: Exploration and Evaluation, AAAI, 2018
• Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola, Style Transfer
from Non-Parallel Text by Cross-Alignment, NIPS 2017
• Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee,
Lin-shan Lee, Scalable Sentiment for Sequence-to-sequence Chatbot
Response with Performance Analysis, ICASSP, 2018
• Junbo (Jake) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun,
Adversarially Regularized Autoencoders, arxiv, 2017
• Feng-Guang Su, Aliyah Hsu, Yi-Lin Tuan and Hung-yi Lee, "Personalized
Dialogue Response Generation Learned from Monologues", INTERSPEECH,
2019
Reference
• Unsupervised Abstractive Summarization
• Yau-Shian Wang, Hung-Yi Lee, "Learning to Encode Text as Human-
Readable Summaries using Generative Adversarial Networks", EMNLP, 2018
• Eric Chu, Peter Liu, “MeanSum: A Neural Model for Unsupervised Multi-
Document Abstractive Summarization”, ICML, 2019
• Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros
Potamianos, “SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence
Autoencoder for Unsupervised Abstractive Sentence Compression”, NAACL
2019
Reference
• Unsupervised Machine Translation
• Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic
Denoyer, Hervé Jégou, Word Translation Without Parallel Data, ICLR 2018
• Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato, Unsupervised
Machine Translation Using Monolingual Corpora Only, ICLR 2018
Reference
• Unsupervised Speech Recognition
• Alexander H. Liu, Hung-yi Lee, Lin-shan Lee, Adversarial Training of End-to-
end Speech Recognition Using a Criticizing Language Model, ICASSP 2019
• Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Completely
Unsupervised Phoneme Recognition by Adversarially Learning Mapping
Relationships from Audio Embeddings, INTERSPEECH, 2018
• Kuan-yu Chen, Che-ping Tsai, Da-Rong Liu, Hung-yi Lee and Lin-shan Lee,
"Completely Unsupervised Phoneme Recognition By A Generative
Adversarial Network Harmonized With Iteratively Refined Hidden Markov
Models", INTERSPEECH, 2019
• Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-yi Lee, Lin-shan Lee,
"Phonetic-and-Semantic Embedding of Spoken Words with Applications in
Spoken Content Retrieval", SLT, 2018
• Chih-Kuan Yeh, Jianshu Chen, Chengzhu Yu, Dong Yu, Unsupervised Speech
Recognition via Segmental Empirical Output Distribution Matching, ICLR,
2019
Reference
• Unsupervised Speech Recognition
• Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji
Watanabe, Jonathan Le Roux, Cycle-consistency training for end-to-end
speech recognition, ICASSP 2019
• Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki
Hori, Lukáš Burget, Jan Černocký, Semi-supervised Sequence-to-sequence
ASR using Unpaired Speech and Text, INTERSPEECH 2019
• Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, Listening while Speaking:
Speech Chain by Deep Learning, ASRU 2017
• Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Unsupervised
Cross-Modal Alignment of Speech and Text Embedding Spaces, NIPS, 2018
• Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Towards
Unsupervised Speech-to-Text Translation, ICASSP 2019
• Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu, Almost
Unsupervised Text to Speech and Automatic Speech Recognition, ICML
2019
Reference
• Unsupervised Speech Recognition
• Shigeki Karita , Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, Marc
Delcroix, Semi-Supervised End-to-End Speech Recognition, INTERSPEECH,
2018
• Jennifer Drexler, James R. Glass, “Combining End-to-End and Adversarial
Training for Low-Resource Speech Recognition”, SLT 2018
• Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki
Hori, Ramon Astudillo, Kazuya Takeda, Back-Translation-Style Data
Augmentation for End-to-End ASR, SLT, 2018
Please download the latest slides here:
http://speech.ee.ntu.edu.tw/~tlkagk/GAN_3hour.pdf
Generative Adversarial Network and its Applications to Speech Processing and Natural Language Processing (INTERSPEECH 2019 Tutorial)

  • 1. Generative Adversarial Network and its Applications to Speech Processing and Natural Language Processing Hung-yi Lee and Yu Tsao
  • 2. Outline Part I: Basic Idea of Generative Adversarial Network (GAN) Part II: A little bit theory Part III: Applications to Speech Processing Part IV: Applications to Natural Language Processing Take a break
  • 3. All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo (not updated since 2018.09) More than 500 species in the zoo
  • 4. All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo GAN ACGAN BGAN DCGAN EBGAN fGAN GoGAN CGAN …… Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017
  • 5. 0 0 0 0 1 2 42 62 0 0 0 0 2 11 14 32 2012 2013 2014 2015 2016 2017 2018 2019 ICASSP INTERSPEECH INTERSPEECH & ICASSP How many papers have “adversarial” in their titles? It is a wise choice to attend this tutorial.
  • 7. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Generation image 2. Conditional Generation Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional Generation Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 9. Basic Idea of GAN Generator It is a neural network (NN), or a function. Generator 0.1 −3 ⋮ 2.4 0.9 imagevector Generator 3 −3 ⋮ 2.4 0.9 Generator 0.1 2.1 ⋮ 5.4 0.9 Generator 0.1 −3 ⋮ 2.4 3.5 high dimensional vector Powered by: http://mattya.github.io/chainer-DCGAN/ Each dimension of input vector represents some characteristics. Longer hair blue hair Open mouth
  • 10. Discri- minator scalar image Basic Idea of GAN It is a neural network (NN), or a function. Larger value means real, smaller value means fake. Discri- minator Discri- minator Discri- minator1.0 1.0 0.1 Discri- minator 0.1
  • 11. • Initialize generator and discriminator • In each training iteration: DG sample generated objects G Algorithm D Update vector vector vector vector 0000 1111 randomly sampled Database Step 1: Fix generator G, and update discriminator D Discriminator learns to assign high scores to real objects and low scores to generated objects. Fix
  • 12. • Initialize generator and discriminator • In each training iteration: DG Algorithm Step 2: Fix discriminator D, and update generator G Discri- minator NN Generator vector 0.13 hidden layer update fix Gradient Ascent large network Generator learns to “fool” the discriminator
  • 13. • Initialize generator and discriminator • In each training iteration: DG Learning D Sample some real objects: Generate some fake objects: G Algorithm D Update Learning G G D image 1111 image image image 1 update fix 0000vector vector vector vector vector vector vector vector fix
  • 14. Anime Face Generation 100 updates Source of training data: https://zhuanlan.zhihu.com/p/24767059
  • 21. In 2019, with StyleGAN …… Source of video: https://www.gwern.net/Faces
  • 23. Progressive GAN [Tero Karras, et al., ICLR, 2018]
  • 24. The first GAN [Ian J. Goodfellow, et al., NIPS, 2014]
  • 25. Today …… [Andrew Brock, et al., arXiv, 2018]
  • 26. [David Bau, et al., ICLR 2019] Does the generator have the concept of objects? Some neurons correspond to specific objects, for example, tree
  • 27. Remove the neurons for tree [David Bau, et al., ICLR 2019] Activate the neurons for tree
  • 28. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Generation image 2. Conditional Generation Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional Generation Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 29. Target of NN output Text-to-Image • Traditional supervised approach NN Image Text: “train” a dog is running a bird is flying A blurry image! c1: a dog is running as close as possible
  • 30. Conditional GAN D (original) scalar𝑥 G 𝑧Normal distribution x = G(c,z) c: train x is real image or not Image Real images: Generated images: 1 0 Generator will learn to generate realistic images …. But completely ignore the input conditions. [Scott Reed, et al, ICML, 2016]
  • 31. Conditional GAN D (better) scalar 𝑐 𝑥 True text-image pairs: G 𝑧Normal distribution x = G(c,z) c: train Image x is realistic or not + c and x are matched or not (train , ) (train , )(cat , ) [Scott Reed, et al, ICML, 2016] 1 00
  • 32. x is realistic or not + c and x are matched or not Conditional GAN - Discriminator [Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena et al., ICML, 2017] condition c object x Network Network Network score Network Network (almost every paper) condition c object x c and x are matched or not x is realistic or not +
  • 33. Conditional GAN paired data blue eyes red hair short hair Collecting anime faces and the description of its characteristics red hair, green eyes blue hair, red eyes The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
  • 34. Conditional GAN - Image-to-image G 𝑧 x = G(c,z) 𝑐 [Phillip Isola, et al., CVPR, 2017] Image translation, or pix2pix
  • 35. as close as possible Conditional GAN - Image-to-image • Traditional supervised approach NN Image It is blurry. Testing: input L1 e.g. L1 [Phillip Isola, et al., CVPR, 2017]
  • 36. Conditional GAN - Image-to-image Testing: input L1 GAN G 𝑧 Image D scalar GAN + L1 L1 [Phillip Isola, et al., CVPR, 2017]
  • 37. Conditional GAN - Sound-to-image Gc: sound Image "a dog barking sound" Training Data Collection video [Wan, et al., ICASSP 2019]
  • 38. Conditional GAN - Sound-to-image • Audio-to-image https://wjohn1483.github.io/ audio_to_scene/index.html The images are generated by Chia- Hung Wan and Shun-Po Chuang. Louder
  • 39. Conditional GAN - Image-to-label Multi-label Image Classifier = Conditional Generator Input condition Generated output
  • 40. Conditional GAN - Image-to-label F1 MS-COCO NUS-WIDE VGG-16 56.0 33.9 + GAN 60.4 41.2 Inception 62.4 53.5 +GAN 63.8 55.8 Resnet-101 62.8 53.1 +GAN 64.0 55.4 Resnet-152 63.3 52.1 +GAN 63.9 54.1 Att-RNN 62.1 54.7 RLSD 62.0 46.9 The classifiers can have different architectures. The classifiers are trained as conditional GAN. [Tsai, et al., ICASSP 2019]
  • 41. Conditional GAN - Image-to-label F1 MS-COCO NUS-WIDE VGG-16 56.0 33.9 + GAN 60.4 41.2 Inception 62.4 53.5 +GAN 63.8 55.8 Resnet-101 62.8 53.1 +GAN 64.0 55.4 Resnet-152 63.3 52.1 +GAN 63.9 54.1 Att-RNN 62.1 54.7 RLSD 62.0 46.9 The classifiers can have different architectures. The classifiers are trained as conditional GAN. Conditional GAN outperforms other models designed for multi-label.
  • 42. Conditional GAN - Video Generation Generator Discrimi nator Last frame is real or generated Discriminator thinks it is real [Michael Mathieu, et al., arXiv, 2015]
  • 44. More about Video Generation https://arxiv.org/abs/1905.08233 [Egor Zakharov, et al., arXiv, 2019]
  • 45. Domain Adversarial Training • Training and testing data are in different domains Training data: Testing data: Generator Generator The same distribution feature feature Take digit classification as example
  • 46. blue points red points Domain Adversarial Training feature extractor (Generator) Discriminator (Domain classifier) image Which domain? Always output zero vectors Domain Classifier Fails
  • 47. Domain Adversarial Training feature extractor (Generator) Discriminator (Domain classifier) image Label predictor Which digits? Not only cheat the domain classifier, but satisfying label predictor at the same time More speech-related applications in Part III. Successfully applied on image classification [Ganin et al, ICML, 2015][Ajakan et al. JMLR, 2016 ] Which domain?
  • 48. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Generation image 2. Conditional Generation Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional Generation Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 49. Unsupervised Conditional Generation G Object in Domain X Object in Domain Y Transform an object from one domain to another without paired data Domain X Domain Y photos Condition Generated Object Vincent van Gogh’s paintings Not Paired More Applications in Parts III and IV Use image style transfer as example here
  • 50. Unsupervised Conditional Generation • Approach 1: Cycle-GAN and its variants • Approach 2: Shared latent space ?𝐺 𝑋→𝑌 Domain X Domain Y 𝐸𝑁 𝑋 𝐷𝐸 𝑌 Encoder of domain X Decoder of domain Y Domain YDomain X Face Attribute
  • 51. ? Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain Y Domain X scalar Input image belongs to domain Y or not Become similar to domain Y
  • 52. Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain Y Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Not what we want! ignore input
  • 53. Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Not what we want! ignore input [Tomer Galanti, et al. ICLR, 2018] The issue can be avoided by network design. Simpler generator makes the input and output more closely related.
  • 54. Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Encoder Network Encoder Network pre-trained as close as possible Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
  • 55. Cycle GAN 𝐺 𝑋→𝑌 𝐷 𝑌 Domain Y scalar Input image belongs to domain Y or not 𝐺Y→X as close as possible Lack of information for reconstruction [Jun-Yan Zhu, et al., ICCV, 2017] Cycle consistency
  • 56. Cycle GAN 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 scalar: belongs to domain Y or not scalar: belongs to domain X or not
  • 57. Cycle GAN Dual GAN Disco GAN [Jun-Yan Zhu, et al., ICCV, 2017] [Zili Yi, et al., ICCV, 2017] [Taeksoo Kim, et al., ICML, 2017] For multiple domains, considering starGAN [Yunjey Choi, arXiv, 2017]
  • 58. Issue of Cycle Consistency • CycleGAN: a Master of Steganography [Casey Chu, et al., NIPS workshop, 2017] 𝐺Y→X𝐺 𝑋→𝑌 The information is hidden.
  • 59. Unsupervised Conditional Generation • Approach 1: Cycle-GAN and its variants • Approach 2: Shared latent space ?𝐺 𝑋→𝑌 Domain X Domain Y 𝐸𝑁 𝑋 𝐷𝐸 𝑌 Encoder of domain X Decoder of domain Y Domain YDomain X Face Attribute
  • 60. Domain X Domain Y 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image imageFace Attribute Shared latent space Target - domain-x information + domain-y information
  • 61. Domain X Domain Y 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error Shared latent space Training
  • 62. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error Because we train two auto-encoders separately … The images with the same attribute may not project to the same position in the latent space. 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Minimizing reconstruction error Shared latent space Training
  • 63. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error The domain discriminator forces the output of 𝐸𝑁𝑋 and 𝐸𝑁𝑌 have the same distribution. From 𝐸𝑁𝑋 or 𝐸𝑁𝑌 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared latent space Training Domain Discriminator 𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator [Guillaume Lample, et al., NIPS, 2017]
  • 64. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared latent space Training Cycle Consistency: Used in ComboGAN [Asha Anoosheh, et al., arXiv, 017] Minimizing reconstruction error
  • 65. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared latent space Training Semantic Consistency: Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017] To the same latent space
  • 66. Sharing the parameters of encoders and decoders Shared latent space 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑋 𝐷𝐸 𝑌 Couple GAN[Ming-Yu Liu, et al., NIPS, 2016] UNIT[Ming-Yu Liu, et al., NIPS, 2017]
  • 67. Shared latent space 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑋 𝐷𝐸 𝑌 One encoder to extract domain- independent information Input an extra indicator to control the decoder x or y Widely used in Voice Conversion (Part III)
  • 69. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Typical GAN image 2. Conditional GAN Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional GAN Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 70. Reference • Generation • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Nets, NIPS, 2014 • Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing of GANs for Improved Quality, Stability, and Variation, ICLR, 2018 • Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for High Fidelity Natural Image Synthesis, arXiv, 2018 • David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba, GAN Dissection: Visualizing and Understanding Generative Adversarial Networks, ICLR 2019
  • 71. Reference • Conditional Generation • Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML, 2016 • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, CVPR, 2017 • Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video prediction beyond mean square error, arXiv, 2015 • Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv, 2014 • Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator, ICLR, 2018 • Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv, 2017 • Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image Synthesis With Auxiliary Classifier GANs, ICML, 2017
  • 72. Reference • Conditional Generation • Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by Backpropagation, ICML, 2015 • Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016 • Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A Novel Framework for Multi-class Classification, submitted to ICASSP 2019 • Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky, Few- Shot Adversarial Learning of Realistic Neural Talking Head Models, arXiv 2019 • Chia-Hung Wan, Shun-Po Chuang, Hung-Yi Lee, "Towards Audio to Scene Image Synthesis using Generative Adversarial Network", ICASSP, 2019
  • 73. Reference • Unsupervised Conditional Generation • Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to- Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017 • Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual Learning for Image-to-Image Translation, ICCV, 2017 • Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018 • Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image Generation, ICLR, 2017 • Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN: Unrestrained Scalability for Image Domain Translation, arXiv, 2017 • Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to- Image Translation for Many-to-Many Mappings, arXiv, 2017
  • 74. Reference • Unsupervised Conditional Generation • Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by Sliding Attributes, NIPS, 2017 • Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim, Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, ICML, 2017 • Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS, 2016 • Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image Translation Networks, NIPS, 2017 • Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi- Domain Image-to-Image Translation, arXiv, 2017
  • 75. Part II: A little bit Theory
  • 76. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 77. Generator • A generator G is a network. The network defines a probability distribution 𝑃𝐺 generator G𝑧 𝑥 = 𝐺 𝑧 Normal Distribution 𝑃𝐺(𝑥) 𝑃𝑑𝑎𝑡𝑎 𝑥 as close as possible How to compute the divergence? 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Divergence between distributions 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 𝑥: an image (a high- dimensional vector)
  • 78. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Although we do not know the distributions of 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎, we can sample from them. sample G vector vector vector vector sample from normal Database Sampling from 𝑷 𝑮 Sampling from 𝑷 𝒅𝒂𝒕𝒂
  • 79. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Discriminator : data sampled from 𝑃𝑑𝑎𝑡𝑎 : data sampled from 𝑃𝐺 train 𝑉 𝐺, 𝐷 = 𝐸 𝑥∼𝑃 𝑑𝑎𝑡𝑎 𝑙𝑜𝑔𝐷 𝑥 + 𝐸 𝑥∼𝑃 𝐺 𝑙𝑜𝑔 1 − 𝐷 𝑥 Example Objective Function for D (G is fixed) 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺Training: Using the example objective function is exactly the same as training a binary classifier. [Goodfellow, et al., NIPS, 2014] The maximum objective value is related to JS divergence.
  • 80. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Discriminator : data sampled from 𝑃𝑑𝑎𝑡𝑎 : data sampled from 𝑃𝐺 train hard to discriminatesmall divergence Discriminator train easy to discriminatelarge divergence 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺 Training: Small max 𝐷 𝑉 𝐷, 𝐺
  • 81. 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎max 𝐷 𝑉 𝐺, 𝐷 The maximum objective value is related to JS divergence. • Initialize generator and discriminator • In each training iteration: Step 1: Fix generator G, and update discriminator D Step 2: Fix discriminator D, and update generator G 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺 [Goodfellow, et al., NIPS, 2014]
  • 82. Using the divergence you like ☺ [Sebastian Nowozin, et al., NIPS, 2016] Can we use other divergence?
  • 83. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 84. GAN is difficult to train …… • There is a saying …… (I found this joke from 陳柏文’s facebook.)
  • 85. Too many tips …… • I do a little survey among 12 students ….. Q: What is the most helpful tip for training GAN? WGAN (33.3%) Spectral Norm (16.7%)
  • 86. JS divergence is not suitable • In most cases, 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 are not overlapped. • 1. The nature of data • 2. Sampling Both 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 are low-dim manifold in high-dim space. 𝑃𝑑𝑎𝑡𝑎 𝑃𝐺 The overlap can be ignored. Even though 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 have overlap. If you do not have enough sampling ……
  • 87. 𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1 𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝑃𝑑𝑎𝑡𝑎𝑃𝐺100 …… 𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 What is the problem of JS divergence? …… JS divergence is log2 if two distributions do not overlap. Intuition: If two distributions do not overlap, binary classifier achieves 100% accuracy Equally bad The same max objective value is obtained. Same divergence
  • 88. Wasserstein distance • Considering one distribution P as a pile of earth, and another distribution Q as the target • The average distance the earth mover has to move the earth. 𝑃 𝑄 d 𝑊 𝑃, 𝑄 = 𝑑
  • 89. Wasserstein distance Source of image: https://vincentherrmann.github.io/blog/wasserstein/ 𝑃 𝑄 Using the “moving plan” with the smallest average distance to define the Wasserstein distance. There are many possible “moving plans”. Smaller distance? Larger distance?
  • 90. 𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1 𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝑃𝑑𝑎𝑡𝑎𝑃𝐺100 …… 𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 What is the problem of JS divergence? 𝑊 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑑0 𝑊 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑑1 𝑊 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 𝑑0 𝑑1 …… …… Better!
  • 91. WGAN max 𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸 𝑥~𝑃 𝐺 𝐷 𝑥 Evaluate Wasserstein distance between 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 [Martin Arjovsky, et al., arXiv, 2017] How to fulfill this constraint?D has to be smooth enough. real −∞ generated D ∞ Without the constraint, the training of D will not converge. Keeping the D smooth forces D(x) become ∞ and −∞
  • 92. • Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017] • Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017] • Spectral Normalization → Keep gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018] Force the parameters w between c and -c After parameter update, if w > c, w = c; if w < -c, w = -c Keep the gradient close to 1 max 𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸 𝑥~𝑃 𝐺 𝐷 𝑥 real samples Keep the gradient close to 1 [Kodali, et al., arXiv, 2017] [Wei, et al., ICLR, 2018]
  • 93. More Tips • Improved techniques for training GANs • Tips in DCGAN [Alec Radford, et al., ICLR 2016] • Guideline for network architecture design for image generation • Tips from Soumith • https://github.com/soumith/ganhacks • Tips from BigGAN [Andrew Brock, et al., arXiv, 2018] [Tim Salimans, et al., NIPS, 2016]
  • 94. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 95. Inception Score Off-the-shelf Image Classifier 𝑥 𝑃 𝑦|𝑥 Concentrated distribution means higher visual quality CNN𝑥1 𝑃 𝑦1|𝑥1 Uniform distribution means higher variety CNN𝑥2 𝑃 𝑦2|𝑥2 CNN𝑥3 𝑃 𝑦3|𝑥3 … 𝑃 𝑦 = 1 𝑁 ෍ 𝑛 𝑃 𝑦 𝑛|𝑥 𝑛 [Tim Salimans, et al., NIPS, 2016] 𝑥: image 𝑦: class (output of CNN) e.g. Inception net, VGG, etc. class 1 class 2 class 3
  • 96. Inception Score = ෍ 𝑥 ෍ 𝑦 𝑃 𝑦|𝑥 𝑙𝑜𝑔𝑃 𝑦|𝑥 − ෍ 𝑦 𝑃 𝑦 𝑙𝑜𝑔𝑃 𝑦 Negative entropy of P(y|x) Entropy of P(y) Inception Score 𝑃 𝑦 = 1 𝑁 ෍ 𝑛 𝑃 𝑦 𝑛|𝑥 𝑛 𝑃 𝑦|𝑥 class 1 class 2 class 3 [Tim Salimans, et al., NIPS 2016]
  • 97. Fréchet Inception Distance (FID) blue points: latent representation of Inception net for the generated images red points: latent representation of Inception net for the read images FID = Fréchet distance between the two Gaussians [Martin Heusel, et al., NIPS, 2017]
  • 98. To learn more about evaluation … Pros and cons of GAN evaluation measures https://arxiv.org/abs/1802.03446 [Ali Borji, 2019]
  • 99. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 100. Basic Components EnvActor Reward Function Video Game Go Get 20 scores when killing a monster The rule of GO You cannot control
  • 101. • Input of neural network: the observation of machine represented as a vector or a matrix • Output neural network : each action corresponds to a neuron in output layer … … NN as actor pixels fire right left Score of an action 0.7 0.2 0.1 Take the action based on the probability. Neural network as Actor
  • 102. Actor, Environment, Reward 𝜏 = 𝑠1, 𝑎1, 𝑠2, 𝑎2, ⋯ , 𝑠 𝑇, 𝑎 𝑇 Trajectory Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… “right” “fire”
  • 103. Reward Function → Discriminator Reinforcement Learning v.s. GAN Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… 𝑅 𝜏 = ෍ 𝑡=1 𝑇 𝑟𝑡 Reward 𝑟1 Reward 𝑟2 “Black box” You cannot use backpropagation. Actor → Generator Fixed updatedupdated
  • 104. Inverse Reinforcement Learning We have demonstration of the expert. Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… reward function is not available (in many cases, it is difficult to define reward function) Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏 𝑁 Each Ƹ𝜏 is a trajectory of the expert. Self driving: record human drivers Robot: grab the arm of robot
  • 105. Inverse Reinforcement Learning Reward Function Environment Optimal Actor Inverse Reinforcement Learning ➢Using the reward function to find the optimal actor. ➢Modeling reward can be easier. Simple reward function can lead to complex policy. Reinforcement Learning Expert demonstration of the expert Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏 𝑁
  • 106. Framework of IRL Expert $\hat{\pi}$ provides demonstrations $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$; actor $\pi$ produces trajectories $\tau_1, \tau_2, \cdots, \tau_N$. Obtain a reward function R such that the expert is always the best: $\sum_{n=1}^{N} R(\hat{\tau}_n) > \sum_{n=1}^{N} R(\tau_n)$ Then find an actor based on reward function R by reinforcement learning. Reward function → Discriminator Actor → Generator
  • 107. GAN vs. IRL GAN: G generates $\tau_1, \tau_2, \cdots, \tau_N$; D gives high scores to real examples and low scores to generated ones; find a G whose output obtains a large score from D. IRL: the expert provides $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$; the reward function gives larger reward to $\hat{\tau}_n$ and lower reward to $\tau_n$; find an actor that obtains large reward.
  • 108. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 109. Reference • Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”, NIPS, 2016 • Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017 • Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved Training of Wasserstein GANs, NIPS, 2017 • Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial Network, arXiv, 2016 • Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017 • Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen Improved Techniques for Training GANs, NIPS, 2016 • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS, 2017
  • 110. Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing Part III: Speech Signal Processing Tsao, Yu Ph.D., Academia Sinica yu.tsao@citi.sinica.edu.tw
  • 111. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 112. Speech Signal Generation (Regression Task) G Output Objective function Paired
  • 113. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Encoder E, $g(\cdot)$, maps the input to an embedding (Emb.) $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$, and classifier G, $h(\cdot)$, produces the output label $\boldsymbol{y}$. Acoustic mismatch: the model is trained on clean data $\boldsymbol{x}$ but deployed on noisy data $\tilde{\boldsymbol{x}}$, accented speech, or speech with channel distortion.
  • 114. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 115. Speech Enhancement • Neural network models for spectral mapping • Typical objective function ➢ Mean square error (MSE) [Xu et al., TASLP 2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al., MLSP 2017], STOI [Fu et al., TASLP 2018]. Enhancing ➢ GAN is used as a new objective function to estimate the parameters in G. ➢Model structures of G: DNN [Wang et al. NIPS 2012; Xu et al., SPL 2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al., Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al., Interspeech 2016]. G Output Objective function
  • 116. Speech Enhancement • Speech enhancement GAN (SEGAN) [Pascual et al., Interspeech 2017]
  • 117. Table 1: Objective evaluation results. Table 2: Subjective evaluation results. Fig. 1: Preference test results. Speech Enhancement (SEGAN) SEGAN yields better speech enhancement results than Noisy and Wiener. • Experimental results
  • 118. • Pix2Pix [Michelsanti et al., Interspeech 2017] D Scalar Clean Noisy (Fake/Real) Output Noisy G Noisy Output Clean Speech Enhancement
  • 119. Fig. 2: Spectrogram comparison of Pix2Pix with baseline methods. Speech Enhancement (Pix2Pix) • Spectrogram analysis Pix2Pix outperforms STAT-MMSE and is competitive to DNN SE. NG-DNN STAT-MMSE Noisy Clean NG-Pix2Pix
  • 120. Table 3: Objective evaluation results. Speech Enhancement (Pix2Pix) • Objective evaluation and speaker verification test Table 4: Speaker verification results. 1. From the PESQ and STOI evaluations, Pix2Pix outperforms Noisy and MMSE and is competitive to DNN SE. 2. From the speaker verification results, Pix2Pix outperforms the baseline models when the clean training data is used.
  • 121. • Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018] D Scalar Clean Noisy (Fake/Real) Output Noisy G Noisy Output Clean Speech Enhancement
  • 122. Fig. 3: Spectrogram comparison of FSEGAN with L1-trained method. Speech Enhancement (FSEGAN) • Spectrogram analysis FSEGAN reduces both additive noise and reverberant smearing.
  • 123. Table 5: WER (%) of SEGAN and FSEGAN. Table 6: WER (%) of FSEGAN with retrain. Speech Enhancement (FSEGAN) • ASR results 1. From Table 5, (1) FSEGAN improves recognition results for ASR-Clean. (2) FSEGAN outperforms SEGAN as front-ends. 2. From Table 6, (1) Hybrid Retraining with FSEGAN outperforms Baseline; (2) FSEGAN retraining slightly underperforms L1–based retraining.
  • 124. Speech Enhancement • Speech enhancement through a mask function: G takes the noisy input and outputs a mask; the enhanced speech is obtained by point-wise multiplication of the mask with the noisy input.
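A minimal sketch of this masking scheme, assuming magnitude spectrograms as input; the network sizes and layer choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):                      # (batch, time, freq) magnitude spectrogram
        h, _ = self.rnn(noisy_mag)
        mask = torch.sigmoid(self.out(h))              # mask values in [0, 1]
        return mask * noisy_mag                        # point-wise multiplication → enhanced magnitude
```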
  • 125. • GAN for spectral magnitude mask estimation (MMS-GAN) [Ashutosh Pandey and Deliang Wang, ICASSP 2018] D Scalar Ref. mask Noisy (Fake/Real) Output mask Noisy G Noisy Output mask Ref. mask Speech Enhancement We do not know exactly what function D learns. Our ICML 2019 paper sheds some light on a potential future direction.
  • 126. Speech Enhancement (AFT) • Cycle-GAN-based acoustic feature transformation (AFT) [Mimura et al., ASRU 2017] $V_{Full} = V_{GAN}(G_{X\to Y}, D_Y) + V_{GAN}(G_{Y\to X}, D_X) + \lambda V_{Cyc}(G_{X\to Y}, G_{Y\to X})$ $G_{S\to T}$ and $G_{T\to S}$ map between the two domains, and the cycle output should be as close as possible to the input. $D_T$: scalar, belongs to domain T or not; $D_S$: scalar, belongs to domain S or not. (Noisy → Enhanced → Noisy; Clean → Syn. Noisy → Clean)
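A minimal sketch of this objective; the names G_xy, G_yx, D_x, D_y and the least-squares adversarial form are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def cyclegan_g_loss(G_xy, G_yx, D_x, D_y, x, y, lam=10.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    # Adversarial terms: make the target-domain discriminator call the converted features real.
    adv = F.mse_loss(D_y(fake_y), torch.ones_like(D_y(fake_y))) + \
          F.mse_loss(D_x(fake_x), torch.ones_like(D_x(fake_x)))
    # Cycle-consistency: X -> Y -> X and Y -> X -> Y should reconstruct the input.
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    return adv + lam * cyc
```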
  • 127. • ASR results on noise robustness and style adaptation Table 7: Noise robust ASR. Table 8: Speaker style adaptation. 1. 𝐺 𝑇→𝑆 can transform acoustic features and effectively improve ASR results for both noisy and accented speech. 2. 𝐺𝑆→𝑇 can be used for model adaptation and effectively improve ASR results for noisy speech. S: Clean; 𝑇: Noisy JNAS: Read; CSJ-SPS: Spontaneous (relax); CSJ-APS: Spontaneous (formal); Speech Enhancement (AFT)
  • 128. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 129. • Postfilter for synthesized or transformed speech ➢ Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil’en et al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014], DNN with MSE criterion [Chen et al., Interspeech 2014; Chen et al., TASLP 2015]. ➢ GAN is used as a new objective function to estimate the parameters in G. Postfilter Synthesized spectral texture Natural spectral texture G Output Objective function Speech synthesizer Voice conversion Speech enhancement
  • 130. • GAN postfilter [Kaneko et al., ICASSP 2017] ➢ Traditional MMSE criterion results in statistical averaging. ➢ GAN is used as a new objective function to estimate the parameters in G. ➢ The proposed work intends to further improve the naturalness of synthesized speech or parameters from a synthesizer. Postfilter Synthesized Mel cepst. coef. Natural Mel cepst. coef. D Nature or Generated Generated Mel cepst. coef. G
  • 131. Fig. 4: Spectrograms of: (a) NAT (nature); (b) SYN (synthesized); (c) VS (variance scaling); (d) MS (modulation spectrum); (e) MSE; (f) GAN postfilters. Postfilter (GAN-based Postfilter) • Spectrogram analysis GAN postfilter reconstructs spectral texture similar to the natural one.
  • 132. Fig. 5: Mel-cepstral trajectories (GANv: GAN was applied in voiced part). Fig. 6: Averaging difference in modulation spectrum per Mel- cepstral coefficient. Postfilter (GAN-based Postfilter) • Objective evaluations GAN postfilter reconstructs spectral texture similar to the natural one.
  • 133. Table 9: Preference score (%). Bold font indicates the numbers over 30%. Postfilter (GAN-based Postfilter) • Subjective evaluations 1. GAN postfilter significantly improves the synthesized speech. 2. GAN postfilter is effective particularly in voiced segments. 3. GANv outperforms GAN and is comparable to NAT.
  • 134. Speech Synthesis • Input: linguistic features; Output: speech parameters. The synthesis model $G_{SS}$ maps linguistic features to generated speech parameters $\hat{\boldsymbol{c}}$ (sp), trained against the natural speech parameters $\boldsymbol{c}$ with an objective such as minimum generation error (MGE) / MSE. • Speech synthesis with anti-spoofing verification (ASV) [Saito et al., ICASSP 2017] Discriminator loss: $L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$, with $L_{D,1}(\boldsymbol{c}) = -\frac{1}{T}\sum_{t=1}^{T}\log D(\boldsymbol{c}_t)$ (NAT) and $L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T}\sum_{t=1}^{T}\log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ (SYN). Generator loss: $L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$, i.e. minimum generation error (MGE) with an adversarial loss, where the anti-spoofing verifier $D_{ASV}$ operates on features $\phi(\cdot)$ of the generated parameters.
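A minimal sketch of this MGE-plus-adversarial generator loss, assuming c_nat and c_gen are (T, dim) tensors of natural and generated speech parameters and D outputs a per-frame probability; the scale factor stands in for the $E_{L_G}/E_{L_D}$ balancing ratio:

```python
import torch
import torch.nn.functional as F

def generator_loss(c_nat, c_gen, D, omega_d=1.0, scale=1.0):
    mge = F.mse_loss(c_gen, c_nat)                    # minimum generation error term
    adv = -torch.log(D(c_gen) + 1e-8).mean()          # L_{D,1}(c_gen): make D judge the generated frames as natural
    return mge + omega_d * scale * adv
```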
  • 135. Fig. 7: Averaged GVs of MCCs. Speech Synthesis (ASV) • Objective and subjective evaluations 1. The proposed algorithm generates MCCs similar to the natural ones. Fig. 8: Scores of speech quality. 2. The proposed algorithm outperforms conventional MGE training.
  • 136. Speech Synthesis • Speech synthesis with GAN (SS-GAN) [Saito et al., TASLP 2018] Same MGE-plus-adversarial formulation as the ASV case: $L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$ with $L_{D,1}(\boldsymbol{c}) = -\frac{1}{T}\sum_{t=1}^{T}\log D(\boldsymbol{c}_t)$ (NAT), $L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T}\sum_{t=1}^{T}\log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ (SYN), and $L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$, but here the discriminator $D$ (with features $\phi(\cdot)$) is applied to sp, f0, and duration rather than sp only.
  • 137. Fig. 10: Scores of speech quality (sp and F0). . Speech Synthesis (SS-GAN) • Subjective evaluations Fig. 9: Scores of speech quality (sp). The proposed algorithm works for both spectral parameters and F0.
  • 138. • Convert (transform) speech from source to target ➢ Conventional VC approaches include Gaussian mixture model (GMM) [Toda et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP 2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al., Interspeech 2016], variational autoencoder (VAE) [Hsu et al., APSIPA 2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP 2014], feed forward NN [Desai et al., TASLP 2010], recurrent NN (RNN) [Nakashika et al., Interspeech 2014]. Voice Conversion G Output Objective function Target speaker Source speaker
  • 139. • VAW-GAN [Hsu et al., Interspeech 2017] ➢Conventional MMSE approaches often encounter the “over-smoothing” issue. ➢ GAN is used as a new objective function to estimate G. ➢ The goal is to increase the naturalness, clarity, and similarity of the converted speech. Voice Conversion D Real or Fake G Target speaker Source speaker $V(G, D) = V_{GAN}(G, D) + \lambda V_{VAE}(\boldsymbol{x}|\boldsymbol{y})$
  • 140. • Objective and subjective evaluations Fig. 11: The spectral envelopes. Fig. 12: MOS on naturalness. Voice Conversion (VAW-GAN) VAW-GAN outperforms VAE in terms of objective and subjective evaluations, generating more structured speech.
  • 141. • CycleGAN-VC [Kaneko et al., Eusipco 2018] • GAN is used as a new objective function to estimate G: $V_{Full} = V_{GAN}(G_{X\to Y}, D_Y) + V_{GAN}(G_{Y\to X}, D_X) + \lambda V_{Cyc}(G_{X\to Y}, G_{Y\to X})$ Voice Conversion $G_{S\to T}$ and $G_{T\to S}$ map between source and target speakers, and the cycle output should be as close as possible to the input. $D_T$: scalar, belongs to domain T or not; $D_S$: scalar, belongs to domain S or not. (Source → Syn. Target → Source; Target → Syn. Source → Target)
  • 142. • Subjective evaluations Fig. 13: MOS for naturalness. Fig. 14: Similarity to the source and target speakers. S: Source; T: Target; P: Proposed; B: Baseline Voice Conversion (CycleGAN-VC) 1. The proposed method uses non-parallel data. 2. For naturalness, the proposed method outperforms the baseline. 3. For similarity, the proposed method is comparable to the baseline. Target speaker Source speaker
  • 143. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 144. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Encoder E, $g(\cdot)$, maps the input to an embedding (Emb.) $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$, and classifier G, $h(\cdot)$, produces the output label $\boldsymbol{y}$. Acoustic mismatch: the model is trained on clean data $\boldsymbol{x}$ but deployed on noisy data $\tilde{\boldsymbol{x}}$, accented speech, or speech with channel distortion.
  • 145. Speech Recognition • Adversarial multi-task learning (AMT) [Shinohara Interspeech 2016] Input: acoustic feature $\boldsymbol{x}$; encoder E produces embedding $\boldsymbol{z}$; classifier G predicts Output 1 (senone $\boldsymbol{y}$); discriminator D, connected through a gradient reversal layer (GRL), predicts Output 2 (domain). Objective functions: $V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G)$, $V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$. Model update: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (max classification accuracy and min domain accuracy).
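A minimal sketch of the gradient reversal layer (GRL) in PyTorch; the usage line at the end (encoder, domain_classifier) is an illustrative assumption:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)                    # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None  # reversed (and scaled) gradient in the backward pass

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)

# Usage sketch: domain_logits = domain_classifier(grad_reverse(encoder(x), alpha))
# so the encoder is pushed to produce features that fool the domain classifier.
```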
  • 146. • ASR results in known (k) and unknown (unk) noisy conditions Speech Recognition (AMT) Table 10: WER of DNNs with single-task learning (ST) and AMT. The AMT-DNN outperforms ST-DNN with yielding lower WERs.
  • 147. Speech Recognition • Domain adversarial training for accented ASR (DAT) [Sun et al., ICASSP2018] Same architecture and training rule as AMT: input acoustic feature $\boldsymbol{x}$, encoder E, classifier G for Output 1 (senone $\boldsymbol{y}$), and domain discriminator D for Output 2 (domain) connected through a GRL, with $V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G)$ and $V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$, and updates $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (max classification accuracy and min domain accuracy).
  • 148. • ASR results on accented speech Speech Recognition (DAT) 1. With labeled transcriptions, ASR performance notably improves. Table 11: WER of the baseline and adapted model. 2. DAT is effective in learning features invariant to domain differences with and without labeled transcriptions. STD: standard speech
  • 149. Speech Recognition • Unsupervised Adaptation with Domain Separation Networks (DSN) [Meng et al., ASRU 2017] A shared encoder E, $g(\cdot)$, extracts embeddings $\boldsymbol{z} = g(\boldsymbol{x})$ from clean data and $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$ from noisy data; private encoders (PEs, PEt) and reconstructors R separate the domain-private components; G, $h(\cdot)$, predicts Output 1 (senone $\boldsymbol{y}$) and D predicts Output 2 (domain $\boldsymbol{d}$).
  • 150. • Results on ASR in noise (CHiME3): Speech Recognition (DSN) 1. DSN outperforms GRL consistently over different noise types. 2. The results confirmed the additional gains provided by private component extractors. Table 12: WER (in %) of Robust ASR on the CHiME3 task.
  • 151. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 152. Speaker Recognition • Domain adversarial neural network (DANN) [Wang et al., ICASSP 2018] Enroll and test i-vectors are pre-processed, passed through the DANN, and scored. The DANN takes the input acoustic feature $\boldsymbol{x}$; encoder E produces $\boldsymbol{z}$; G predicts Output 1 (speaker ID $\boldsymbol{y}$, loss $V_y$) and D, through a GRL, predicts Output 2 (domain, loss $V_z$).
  • 153. • Recognition results of domain mismatched conditions Table 13: Performance of DAT and the state-of-the-art methods. Speaker Recognition (DANN) The DAT approach outperforms other methods with achieving lowest EER and DCF scores.
  • 154. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 155. Emotion Recognition • Adversarial AE for emotion recognition (AAE-ER) [Sahu et al., Interspeech 2017] AE with GAN: $H(h(\boldsymbol{z}), \boldsymbol{x}) + \lambda V_{GAN}(\boldsymbol{q}, g(\boldsymbol{x}))$, where $\boldsymbol{z} = g(\boldsymbol{x})$ is the embedding from encoder E, $h(\cdot)$ is the decoder reconstructing $\boldsymbol{x}$, G synthesizes samples, and D matches $g(\boldsymbol{x})$ to $\boldsymbol{q}$, the desired distribution of code vectors.
  • 156. • Recognition results of domain mismatched conditions: Table 15: Classification results on real and synthesized features. Emotion Recognition (AAE-ER) Table 14: Classification results on different systems. 1. AAE alone could not yield performance improvements. 2. Using synthetic data from AAE can yield higher UAR. Original Training data
  • 157. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 158. Lip-reading • Domain adversarial training for lip-reading (DAT-LR) [Wand et al., Interspeech 2017] (~80% WAC) Encoder E produces $\boldsymbol{z}$ from input $\boldsymbol{x}$; G predicts Output 1 (words $\boldsymbol{y}$) and D, through a GRL, predicts Output 2 (speaker). $V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G)$, $V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$. Model update: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (max classification accuracy and min domain accuracy).
  • 159. • Recognition results of speaker mismatched conditions Lip-reading (DAT-LR) Table 16: Performance of DAT and the baseline. The DAT approach notably enhances the recognition accuracies in different conditions.
  • 160. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 161. Speech Signal Generation (Regression Task) G Output Objective function Paired
  • 162. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Encoder E, $g(\cdot)$, maps the input to an embedding (Emb.) $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$, and classifier G, $h(\cdot)$, produces the output label $\boldsymbol{y}$. Acoustic mismatch: the model is trained on clean data $\boldsymbol{x}$ but deployed on noisy data $\tilde{\boldsymbol{x}}$, accented speech, or speech with channel distortion.
  • 163. References Speech enhancement (conventional methods) • Y.-X. Wang and D.-L. Wang, Cocktail party processing via structured prediction, NIPS 2012. • Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, An experimental study on speech enhancement based on deep neural networks, IEEE SPL, 2014. • Y. Xu, J. Du, L.-R. Dai, and Chin-Hui Lee, A regression approach to speech enhancement based on deep neural networks, IEEE/ACM TASLP, 2015. • X. Lu, Y. Tsao, S. Matsuda, H. Chiroi, Speech enhancement based on deep denoising autoencoder, Interspeech 2012. • Z. Chen, S. Watanabe, H. Erdogan, J. R. Hershey, Integration of speech enhancement and recognition using long- short term memory recurrent neural network, Interspeech 2015. • F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, Speech enhancement with LSTM recurrent neural networks and Its application to noise-robust ASR, LVA/ICA, 2015. • S.-W. Fu, Y. Tsao, and X.-G. Lu, SNR-aware convolutional neural network modeling for speech enhancement, Interspeech, 2016. • S.-W. Fu, Y. Tsao, X.-G. Lu, and Hisashi Kawai, End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks, IEEE/ACM TASLP, 2018. Speech enhancement (GAN-based methods) • P. Santiago, B. Antonio, and S. Joan, SEGAN: Speech enhancement generative adversarial network, Interspeech, 2017. • D. Michelsanti, and Z.-H. Tan, Conditional generative adversarial networks for speech enhancement and noise- robust speaker verification, Interspeech, 2017. • C. Donahue, B. Li, and P. Rohit, Exploring speech enhancement with generative adversarial networks for robust speech recognition, ICASSP, 2018. • T. Higuchi Takuya, K. Kinoshita, D. Marc, and T. Nakatani. Adversarial training for data-driven speech enhancement without parallel Corpus, ASRU, 2017. • S. Pascual, M. Park, J. Serrà, A. Bonafonte, K.-H. Ahn, Language and noise transfer in speech enhancement generative adversarial network, ICASSP 2018.
  • 164. References Speech enhancement (GAN-based methods) • A. Pandey and D. Wang, On adversarial training and loss functions for speech enhancement, ICASSP 2018. • M. H. Soni, Neil Shah, and H. A. Patil, Time-frequency masking-based speech enhancement using generative adversarial network, ICASSP 2018. • Z. Meng, J.-Y. Li, Y.-G. Gong, B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech, 2018. • L.-W. Chen, M. Yu, Y.-M. Qian, D. Su, D. Yu, Permutation invariant training of generative adversarial network for monaural speech separation, Interspeech 2018. • D. Baby and S. Verhulst, SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty, ICASSP 2019.
  • 165. Postfilter (conventional methods) • T. Tod, and K. Tokuda, A speech parameter generation algorithm considering global variance for HMM-based speech synthesis, IEICE Trans. Inf. Syst., 2007. • H. Sil’en, E. Helander, J. Nurminen, and M. Gabbouj, Ways to implement global variance in statistical speech synthesis, Interspeech, 2012. • S. Takamichi, T. Toda, N. Graham, S. Sakriani, and S. Nakamura, A postfilter to modify the modulation spectrum in HMM-based speech synthesis, ICASSP, 2014. • L.-H. Chen, T. Raitio, C. V. Botinhao, J. Yamagishi, and Z.-H. Ling, DNN-based stochastic postfilter for HMM- based speech synthesis, Interspeech, 2014. • L.-H. Chen, T. Raitio, C. V. Botinhao, Z.-H. Ling, and J. Yamagishi, A deep generative architecture for postfiltering in statistical parametric speech synthesis, IEEE/ACM TASLP, 2015. Postfilter (GAN-based methods) • K. Takuhiro, K. Hirokazu, H. Nobukatsu, Y. Ijima, K. Hiramatsu, and K. Kashino, Generative adversarial network- based postfilter for statistical parametric speech synthesis, ICASSP, 2017. • K. Takuhiro, T. Shinji, K. Hirokazu, and J. Yamagishi, Generative adversarial network-based postfilter for STFT spectrograms, Interspeech, 2017. • Y. Saito, S. Takamichi, and H. Saruwatari, Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis, ICASSP, 2017. • Y. Saito, S. Takamichi, H. Saruwatari, Statistical parametric speech synthesis incorporating generative adversarial networks, IEEE/ACM TASLP, 2018. • B. Bollepalli, L. Juvela, and A. Paavo, Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis, Interspeech, 2017. • S. Yang, L. Xie, X. Chen, X.-Y. Lou, X. Zhu, D.-Y. Huang, and H.-Z. Li, Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework, ASRU, 2017. References
  • 166. VC (conventional methods) • T. Toda, A. W. Black, and K. Tokuda, Voice conversion based on maximum likelihood estimation of spectral parameter trajectory, IEEE/ACM TASLP, 2007. • L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, Voice conversion using deep neural networks with layer-wise generative training, IEEE/ACM TASLP, 2014. • S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, Spectral mapping using artificial neural networks for voice conversion, IEEE/ACM TASLP, 2010. • T. Nakashika, T. Takiguchi, Y. Ariki, High-order sequence modeling using speaker-dependent recurrent temporal restricted boltzmann machines for voice conversion, Interspeech, 2014. • K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks, Interspeech, 2017. • Z.-Z. Wu, T. Virtanen, E.-S. Chng, and H.-Z. Li, Exemplar-based sparse representation with residual compensation for voice conversion, IEEE/ACM TASLP, 2014. • S.-. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, Joint dictionary learning-based non-negative matrix factorization for voice conversion to improve speech intelligibility after oral surgery, IEEE TBME, 2017. • Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, Locally linear embedding for exemplar-based spectral conversion, Interspeech, 2016. • C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, Y., and H.-M. Wang, Voice conversion from non-parallel corpora using variational auto-encoder. APSIPA 2016. VC (GAN-based methods) • C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks, Interspeech 2017. • K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks, Interspeech, 2017. References
  • 167. VC (GAN-based methods) • K. Takuhiro, and K. Hirokazu. Parallel-data-free voice conversion using cycle-consistent adversarial networks, arXiv, 2017. • N. Shah, N. J. Shah, and H. A. Patil, Effectiveness of generative adversarial network for non-audible murmur-to- whisper speech conversion, Interspeech, 2018. • J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations, Interspeech, 2018. • G. Degottex, and M. Gales, A spectrally weighted mixture of least square error and wasserstein discriminator loss for generative SPSS, SLT, 2018. • B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, Adaptive wavenet vocoder for residual compensation in GAN-based voice conversion, SLT, 2018. • C.-C. Yeh, P.-C. Hsu, J.-C. Chou, H.-Y. Lee, and L.-S. Lee, Rhythm-flexible voice conversion without parallel data using cycle-GAN over phoneme posteriorgram sequences, SLT, 2018. • H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, STARGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks, SLT, 2018. • K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, Synthetic-to-natural speech waveform conversion using cycle- consistent adversarial networks, SLT, 2018. • O. Ocal, O. H. Elibol, G. Keskin, C. Stephenson, A. Thomas, and K. Ramchandran, Adversarially trained autoencoders for parallel-data-free voice conversion, ICASSP, 2019. • F. Fang, X. Wang, J. Yamagishi, and I. Echizen, Audiovisual speaker conversion: Jointly and simultaneously transforming facial expression and acoustic characteristics, ICASSP, 2019. • S. Seshadri, L. Juvela, J. Yamagishi, Okko Räsänen, and P. Alku, Cycle-consistent adversarial networks for non- parallel vocal effort based speaking style conversion, ICASSP, 2019. • T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, CYCLEGAN-VC2: Improved cyclegan-based non-parallel voice conversion, ICASSP, 2019. • L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, Waveform generation for text-to-speech synthesis using pitch- synchronous multi-scale generative adversarial networks, ICASSP, 2019. References
  • 168. Speaker recognition • Q. Wang, W. Rao, S.-I. Sun, L. Xie, E.-S. Chng, and H.-Z. Li, Unsupervised domain adaptation via domain adversarial training for speaker recognition, ICASSP, 2018. • H. Yu, Z.-H. Tan, Z.-Y. Ma, and J. Guo, Adversarial network bottleneck features for noise robust speaker verification, arXiv, 2017. • G. Bhattacharya, J. Alam, & P. Kenny, Adapting end-to-end neural speaker verification to new languages and recording conditions with adversarial training, ICASSP, 2019. • Z. Peng, S. Feng, & T. Lee, Adversarial multi-task deep features and unsupervised back-end adaptation for language recognition, ICASSP, 2019. • Z. Meng, Y. Zhao, J. Li, & Y. Gong, Adversarial speaker verification, ICASSP, 2019. • X. Fang, L. Zou, J. Li, L. Sun, & Z.-H. Ling, Channel adversarial training for cross-channel text-independent speaker recognition, ICASSP, 2019. • W. Xia, J. Huang, & J. H. Hansen, Cross-lingual text-independent speaker verification using unsupervised adversarial discriminative domain adaptation, ICASSP, 2019. • P. S. Nidadavolu, J. Villalba, & N. Dehak, Cycle-GANs for domain adaptation of acoustic features for speaker recognition, ICASSP, 2019. • G. Bhattacharya, J. Monteiro, J. Alam, & P. Kenny, Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification, ICASSP, 2019. • J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, & O. Plchot, Speaker verification using end-to-end adversarial language adaptation, ICASSP, 2019. • Zhou, J., Jiang, T., Li, L., Hong, Q., Wang, Z., & Xia, B., Training multi-task adversarial network for extracting noise-robust speaker embedding, ICASSP, 2019. • J. Zhang, N. Inoue, & K. Shinoda, I-vector transformation using conditional generative adversarial networks for short utterance speaker verification, arXiv, 2018. • W. Ding, & L. He, Mtgan: Speaker verification through multitasking triplet generative adversarial networks, arXiv, 2018. • X. Miao, I. McLoughlin, S. Yao, & Y. Yan, Improved conditional generative adversarial net classification for spoken language recognition, SLT, 2018. References
  • 169. Automatic Speech Recognition • Yusuke Shinohara, Adversarial multi-task learning of deep neural networks for robust speech recognition, Interspeech, 2016. • D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, Invariant Representations for Noisy Speech Recognition, arXiv, 2016. • Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks, ASRU, 2017. • A. Sriram, H.-W Jun, Y. Gaur, and S. Satheesh, Robust speech recognition using generative adversarial networks, arXiv, 2017. • Z. Meng, Z. Chen, V. Mazalov, J. Li, J., and Y. Gong, Unsupervised adaptation with domain separation networks for robust speech recognition, ASRU, 2017. • Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. Juang, Speaker-invariant training via adversarial learning, ICASSP, 2018. • Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial teacher-student learning for unsupervised domain adaptation, ICASSP, 2018. • Y. Zhang, P. Zhang, and Y. Yan, Improving language modeling with an adversarial critic for automatic speech recognition, Interspeech, 2018. • S. Sun, C. Yeh, M. Ostendorf, M. Hwang, and L. Xie, Training augmentation with adversarial examples for robust speech recognition, Interspeech, 2018. • Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech 2018. • K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, and L. Xie, Investigating generative adversarial networks based speech dereverberation for robust speech recognition, Interspeech 2018. • Z. Meng, J. Li, Y. Gong, B.-H. Juang, Cycle-consistent speech enhancement, Interspeech 2018. • J. Drexler and J. Glass, Combining end-to-end and adversarial training for low-resource speech recognition, SLT, 2018. • A. H. Liu, H. Lee and L. Lee, Adversarial training of end-to-end speech recognition using a criticizing language model, ICASSP, 2019. References
  • 170. Automatic Speech Recognition • J. Yi, J. Tao and Y. Bai, Language-invariant bottleneck features from adversarial end-to-end acoustic models for • low resource speech recognition, ICASSP, 2019. • D. Haws and X. Cui, Cyclegan bandwidth extension acoustic modeling for automatic speech recognition, ICASSP, 2019. • Z. Meng, J. Li, J. and Y. Gong, Attentive adversarial learning for domain-Invariant training, ICASSP, 2019. • Z. Meng, Y. Zhao, J. Li, and Y. Gong, Adversarial speaker verification, ICASSP, 2019. • Z. Meng, Y. Zhao, J. Li, and Y. Gong., Adversarial speaker adaptation, ICASSP, 2019. Emotion recognition • J. Chang, and S. Scherer, Learning representations of emotional speech with deep convolutional generative adversarial networks, ICASSP, 2017. • S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, Adversarial auto-encoders for speech based emotion recognition. Interspeech, 2017. • S. Sahu, R. Gupta, and C. E.-Wilson, On enhancing speech emotion recognition using generative adversarial networks, Interspeech 2018. • C.-M. Chang, and C.-C. Lee, Adversarially-enriched acoustic code vector learned from out-of-context affective corpus for robust emotion recognition, ICASSP 2019. • J. Liang, S. Chen, J. Zhao, Q. Jin, H. Liu, and L. Lu, Cross-culture multimodal emotion recognition with adversarial learning, ICASSP 2019. Lipreading • M. Wand, and J. Schmidhuber, Improving speaker-independent lipreading with domain-adversarial training, arXiv, 2017. References
  • 171. Outline of Part III Our Recent Works • Noise adaptive speech enhancement [Interspeech 2019] • MetricGAN for speech enhancement [ICML 2019] • Multi-Target voice conversion [Interspeech 2018] • Impaired speech conversion [Interspeech 2019] • Pathological voice detection [NeurIPS workshop 2018] [Mon-P-2-A] [Wed-P-6-E]
  • 172. Speech Enhancement • Noise Adaptive Speech Enhancement (NA-SE) [Liao et al., Interspeech 2019] [Wed-P-6-E] Training covers several noise types (N4, N5, N7, N9, N10, N12) while the test noise (e.g. N11) is unseen. Encoder E and generator G are trained to minimize the reconstruction error $V_y$: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$, $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E}$.
  • 173. Speech Enhancement (NA-SE) • Domain adversarial training for NA-SE: in addition to the reconstruction loss $V_y$, a domain discriminator D predicts the noise type from the embedding $\boldsymbol{z}$ with loss $V_z$, connected to E through a GRL. Updates: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (min reconstruction error), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (min reconstruction error and min domain accuracy).
  • 174. Speech Enhancement (NA-SE) • Objective evaluations The DAT-based unsupervised adaptation can notably overcome the mismatch issue of training and testing noise types. Fig. 15: PESQ at different SNR levels.
  • 175. • GAN for spectral magnitude mask estimation (MMS-GAN) [Pandey et al., ICASSP 2018] D Scalar Ref. mask Noisy (Fake/Real) Output mask Noisy G Noisy Output mask Ref. mask Speech Enhancement
  • 176. Speech Enhancement • MetricGAN for Speech Enhancement [Fu et al., ICML 2019] G takes the noisy spectrogram and outputs a mask; point-wise multiplication gives the enhanced spectrogram. D is trained to predict the evaluation metric score (0~1) of its input against the clean spectrogram, e.g. 0.4 for an enhanced spectrogram and 1.0 for the clean spectrogram itself.
  • 177. Speech Enhancement (MetricGAN) With MetricGAN, we have the freedom to specify the target metric scores (PESQ or STOI) for the generated speech.
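A minimal sketch of this idea (an assumption of the general scheme, not the authors' code): the discriminator regresses a normalized metric score computed outside the graph, and the generator is trained so that its masked output is assigned a chosen target score:

```python
import torch
import torch.nn.functional as F

def d_step(D, clean, enhanced, q_enhanced):
    # q_enhanced: normalized metric score (e.g. PESQ/STOI in [0, 1]) of (enhanced, clean).
    return F.mse_loss(D(enhanced, clean), q_enhanced) + \
           F.mse_loss(D(clean, clean), torch.ones_like(q_enhanced))   # clean vs. clean scores 1.0

def g_step(D, G, noisy, clean, target_score=1.0):
    enhanced = G(noisy) * noisy                         # mask applied point-wise to the noisy input
    pred = D(enhanced, clean)
    return F.mse_loss(pred, torch.full_like(pred, target_score))
```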
  • 178. Voice Conversion • Multi-target VC [Chou et al., Interspeech 2018] ➢ Stage-1: an encoder Enc extracts enc(x), adversarially trained against a speaker classifier C, and a decoder Dec reconstructs dec(enc(x), y) for the original speaker y or converts with another speaker y′, i.e. dec(enc(x), y′). ➢ Stage-2: a generator G refines the converted output y″, trained against a discriminator D+C on real data x that judges real/fake (F/R) and speaker ID.
  • 179. • Subjective evaluations Voice Conversion (Multi-target VC) Fig. 16: Preference test results 1. The proposed method uses non-parallel data. 2. The multi-target VC approach outperforms one-stage only. 3. The multi-target VC approach is comparable to Cycle-GAN-VC in terms of the naturalness and the similarity.
  • 180. • Controller-generator-discriminator VC on Impaired Speech [Chen et al., Interspeech 2019] Voice Conversion Previous applications: hearing aids; murmur to normal speech; bone-conductive microphone to air-conductive microphone. Proposed: improving the speech intelligibility of surgical patients. Target: oral cancer (one of the top five cancers for males in Taiwan). Before / After examples. [Mon-P-2-A]
  • 181. • Controller-generator-discriminator VC (CGD VC) on impaired speech [Chen et al., Interspeech 2019] Voice Conversion GD Controller
  • 182. Voice Conversion (CGD VC) • Spectrogram analysis Fig. 17: Spectrogram comparison of CGD with CycleGAN.
  • 183. • Subjective evaluations Voice Conversion (CGD VC) The proposed method outperforms conditional GAN and CycleGAN in terms of content similarity, speaker similarity, and articulation. Fig. 18: MOS for content similarity, speaker similarity, and articulation.
  • 184. Pathological Voice Detection • Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach [Fang et al., Journal of Voice 2018] Table 17: Detection performance based on voice.
              GMM     SVM     DNN
    MEEI      98.28   98.26   99.14
    FEMH (M)  90.24   93.04   94.26
    FEMH (F)  90.20   87.40   90.52
  • 185. Pathological Voice Detection • Robustness Against Channel [Hsu et al., NeurIPS Workshop 2018] The detector (encoder E, classifier G with loss $V_y$ on label $\boldsymbol{y}$, domain discriminator D with loss $V_z$) is trained with domain adversarial training (DAT). Table 18: Detection results of sup. and unsup. DAT under channel mismatches.
              DNN (S)   DNN (T)   DNN (FT)   Unsup. DAT   Sup. DAT
    PR-AUC    0.8848    0.8509    0.9021     0.9455       0.9522
  The unsupervised DAT notably increased the robustness against channel effects and produced results comparable to supervised DAT.
  • 186. • C.-F. Liao, Y. Tsao, H.-Y. Lee and H.-M. Wang, Noise adaptive speech enhancement using domain adversarial training, Interspeech 2019. • J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee. "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. Interspeech 2018. • L.-W. Chen, H.-Y. Lee, and Y. Tsao, Generative adversarial networks for unpaired voice transformation on impaired speech, Interspeech 2019. • S.-W. Fu, C.-F. Liao, Y. Tsao, S.-D. Lin, MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement, ICML, 2019. • C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, Y. Tsao, Detection of pathological voice using cepstrum vectors: a deep learning approach, Journal of Voice, 2018. • S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, Demographic and symptomatic features of voice disorders and their potential application in classification using machine learning algorithms, Folia Phoniatrica et Logopaedica, 2018. • S.-H. Fang, C.-T. Wang, J.-Y. Chen, Y. Tsao and F.-C. Lin, Combining acoustic signals and medical records to improve pathological voice classification, APSIPA, 2019. • Y.-T. Hsu, Z. Zhu, C.-T. Wang, S.-H. Fang, F. Rudzicz, and Y. Tsao, Robustness against the channel effect in pathological voice detection, NeurIPS 2018 Machine Learning for Health (ML4H) Workshop, 2018. References
  • 187. Thank You Very Much Tsao, Yu Ph.D., Academia Sinica yu.tsao@citi.sinica.edu.tw Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing Part III: Speech Signal Processing
  • 189. NLP tasks usually involve Sequence Generation How to use GAN to improve sequence generation?
  • 190. Outline of Part IV Sequence Generation by GAN Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation • Unsupervised Speech Recognition
  • 191. Why we need GAN? • Chat-bot as example Encoder Decoder Input sentence c output sentence x Training data: A: How are you ? B: I’m good. ………… Input: How are you ? Seq2seq outputs: “Not bad” / “I’m John.” Training criterion: maximize likelihood. To a human, “Not bad” is the better response, but maximum likelihood can prefer “I’m John.”
  • 192. Reinforcement Learning Human Input sentence c response sentence x Chatbot En De response sentence x Input sentence c [Li, et al., EMNLP, 2016] reward 𝑅 𝑐, 𝑥 Learn to maximize expected reward E.g. Policy Gradient human “How are you?” “Not bad” “I’m John” -1+1
  • 193. Policy Gradient At iteration t with parameters $\theta^t$, sample $(c^1, x^1), (c^2, x^2), \ldots, (c^N, x^N)$ and obtain rewards $R(c^1, x^1), \ldots, R(c^N, x^N)$. Update: $\theta^{t+1} \leftarrow \theta^t + \eta \nabla \bar{R}_{\theta^t}$, where $\nabla \bar{R}_{\theta^t} = \frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\, \nabla \log P_{\theta^t}(x^i|c^i)$. If $R(c^i, x^i)$ is positive, update $\theta$ to increase $P_\theta(x^i|c^i)$; if $R(c^i, x^i)$ is negative, update $\theta$ to decrease $P_\theta(x^i|c^i)$.
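A minimal sketch of this update for a seq2seq chatbot; model.log_prob is a hypothetical helper returning the summed log-probability of a sampled response given its input sentence:

```python
import torch

def policy_gradient_step(model, optimizer, contexts, sampled_responses, rewards):
    # log_prob(c, x): sum of log P(x_t | x_<t, c) over the tokens of the sampled response x.
    log_probs = torch.stack([model.log_prob(c, x) for c, x in zip(contexts, sampled_responses)])
    rewards = torch.as_tensor(rewards, dtype=log_probs.dtype, device=log_probs.device)
    loss = -(rewards * log_probs).mean()      # gradient ascent on (1/N) sum_i R(c_i, x_i) log P(x_i|c_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```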
  • 194. Policy Gradient Maximum Likelihood vs. Reinforcement Learning (Policy Gradient). Training data: MLE uses labelled pairs $(c^1, \hat{x}^1), \ldots, (c^N, \hat{x}^N)$; policy gradient uses pairs $(c^1, x^1), \ldots, (c^N, x^N)$ obtained from interaction. Objective function: MLE maximizes $\frac{1}{N}\sum_{i=1}^{N} \log P_\theta(\hat{x}^i|c^i)$; policy gradient maximizes $\frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\log P_\theta(x^i|c^i)$. Gradient: $\frac{1}{N}\sum_{i=1}^{N} \nabla\log P_\theta(\hat{x}^i|c^i)$ vs. $\frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\nabla\log P_\theta(x^i|c^i)$, i.e. each sample is weighted by $R(c^i, x^i)$; MLE is the special case where $R(c^i, \hat{x}^i) = 1$.
  • 195. Conditional GAN Discriminator Input sentence c response sentence x Chatbot En De response sentence x Input sentence c reward 𝑅 𝑐, 𝑥 I am busy. Replace human evaluation with machine evaluation [Li, et al., EMNLP, 2017] However, there is an issue when you train your generator.
  • 196. The generator samples a word sequence (A A B A B …) starting from <BOS>, with inputs obtained by attention, and the discriminator outputs a scalar. Can we use gradient ascent to update the generator parameters? NO!
  • 197. The generator samples a word sequence (A A B A B …) starting from <BOS>, with inputs obtained by attention, and the discriminator outputs a scalar. Can we use gradient ascent to update the generator parameters? NO! The sampling step is a non-differentiable part.
  • 198. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] Reinforcement Learning • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 200. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] Reinforcement Learning • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 201. A A A B A B A A B B B <BOS> Use the distribution as the input of discriminator Avoid the sampling process Discriminator scalar Update Parameters We can do backpropagation now.
  • 202. What is the problem? • A real sentence is a sequence of one-hot vectors: 1 0 0 0 0 / 0 1 0 0 0 / 0 0 1 0 0 / 0 0 0 1 0 / 0 0 0 0 1 • The generated word distributions are soft: 0.9 0.1 0 0 0 / 0.1 0.9 0 0 0 / 0.1 0.1 0.7 0.1 0 / 0 0 0.1 0.8 0.1 / 0 0 0 0.1 0.9 They can never be exactly 1-hot, so the discriminator can immediately find the difference. A discriminator with a constraint (e.g. WGAN) can be helpful.
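A minimal sketch of the Gumbel-softmax workaround from the first category listed earlier, using PyTorch's built-in gumbel_softmax; the shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)          # e.g. 4 time steps, vocabulary of 10 words
y_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)   # continuous relaxation (soft distributions)
y_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)    # straight-through: one-hot forward, soft backward
# y_hard can be fed to the discriminator while gradients still reach `logits`.
```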
  • 203. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] Reinforcement Learning • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 204. A A A B A B A A B B B <BOS> Discriminator scalar Generator = Agent in RL Actions taken Environment Reward Trained by RL algorithm (e.g. Policy Gradient) The reward function may change → Different from typical RL
  • 205. Tips for Sequence Generation GAN RL is difficult to train. GAN is difficult to train. Sequence Generation GAN (RL+GAN) combines both.
  • 206. Tips for Sequence Generation GAN • Usually the generator is fine-tuned from a model learned by maximum likelihood. • However, with enough hyperparameter tuning and tips, ScratchGAN can train from scratch. [Cyprien de Masson d'Autume, et al., arXiv 2019]
  • 207. Tips for Sequence Generation GAN • Typical • Reward for Every Generation Step Discriminator Chatbot En De You is good Discriminator Chatbot En De 0.9 0.1 0.1 0.1 You You is You is good I don’t know which part is wrong …
  • 208. Tips for Sequence Generation GAN • Reward for Every Generation Step Discriminator Chatbot En De 0.9 0.1 0.1 You You is You is good Method 1. Monte Carlo (MC) Search [Yu, et al., AAAI, 2017] Method 2. Discriminator For Partially Decoded Sequences [Li, et al., EMNLP, 2017] Method 3. Step-wise evaluation [Tuan, Lee, TASLP, 2019][Xu, et al., EMNLP, 2018][William Fedus, et al., ICLR, 2018]
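A minimal sketch of the Monte Carlo search idea (an assumption of the general scheme, not the SeqGAN code; generator.rollout is a hypothetical sampling helper): a partial sequence is scored by rolling it out to completion several times and averaging the discriminator scores:

```python
import torch

def mc_step_reward(generator, discriminator, context, partial_seq, n_rollouts=8):
    scores = []
    for _ in range(n_rollouts):
        full_seq = generator.rollout(context, partial_seq)   # complete the partial sequence by sampling
        scores.append(discriminator(context, full_seq))      # score the completed sequence
    return torch.stack(scores).mean()                        # reward assigned to the last generated token
```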
  • 209. Empirical Performance • MLE frequently generates “I’m sorry”, “I don’t know”, etc. (corresponding to fuzzy images?) • GAN generates longer and more complex responses. • Find more comparison in the survey papers. • [Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018] • However, no strong evidence shows that GANs are better than MLE. • [Stanislau Semeniuta, et al., arXiv, 2018] [Guy Tevet, et al., arXiv, 2018] [Massimo Caccia, et al., arXiv, 2018]
  • 210. More Applications • Supervised machine translation [Wu, et al., arXiv 2017][Yang, et al., arXiv 2017] • Supervised abstractive summarization [Liu, et al., AAAI 2018] • Image/video caption generation [Rakshith Shetty, et al., ICCV 2017][Liang, et al., arXiv 2017] • Data augmentation for code-switching ASR [Mon-P- 1-D] [Chang, et al., INTERSPEECH 2019] If you are trying to generate some sequences, you can consider GAN.
  • 211. Outline of Part IV Sequence Generation by GAN Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation • Unsupervised Speech Recognition
  • 212. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 213. Cycle-GAN 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 scalar: belongs to domain Y or not scalar: belongs to domain X or not
  • 214. Cycle-GAN 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 negative sentence? positive sentence? It is bad. It is good. It is bad. I love you. I hate you. I love you. positive positive positivenegative negative negative Non-differentiable Issue? You already know how to deal with it.
  • 215. ✘ Negative sentence to positive sentence: it's a crappy day -> it's a great day i wish you could be here -> you could be here it's not a good idea -> it's good idea i miss you -> i love you i don't love you -> i love you i can't do that -> i can do that i feel so sad -> i happy it's a bad day -> it's a good day it's a dummy day -> it's a great day sorry for doing such a horrible thing -> thanks for doing a great thing my doggy is sick -> my doggy is my doggy my little doggy is sick -> my little doggy is my little doggy Cycle GAN (Thanks to 王耀賢 for providing the experimental results.) [Lee, et al., ICASSP, 2018]
  • 216. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared Latent Space Positive Sentence Positive Sentence Negative Sentence Negative Sentence Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017] From 𝐸𝑁𝑋 or 𝐸𝑁𝑌 Domain Discriminator 𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator [Zhao, et al., arXiv, 2017] [Fu, et al., AAAI, 2018]
  • 217. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 218. Abstractive Summarization • Now machine can do abstractive summary by seq2seq (write summaries in its own words) summary 1 summary 2 summary 3 Training Data summary seq2seq (in its own words) Supervised: We need lots of labelled training data.
  • 219. Unsupervised Abstractive Summarization • Now machine can do abstractive summary by seq2seq (write summaries in its own words) summary 1 summary 2 summary 3 seq2seq document Domain Y Domain X[Wang, et al., EMNLP, 2018]
  • 220. G Seq2seq document word sequence D Human written summaries Real or not Discriminator Unsupervised Abstractive Summarization Summary?
  • 221. G Seq2seq document word sequence D Human written summaries Real or not Discriminator R Seq2seq document Unsupervised Abstractive Summarization minimize the reconstruction error
  • 222. Unsupervised Abstractive Summarization G R Summary? Seq2seq Seq2seq document document word sequence Only need a lot of documents to train the model This is a seq2seq2seq auto-encoder. Using a sequence of words as latent representation. not readable …
  • 223. Unsupervised Abstractive Summarization G R Seq2seq Seq2seq word sequence D Human written summaries Real or not Discriminator Let Discriminator considers my output as real document document Summary? Readable
  • 224. Experimental results English Gigaword (Document title as summary)
                                     ROUGE-1   ROUGE-2   ROUGE-L
    Supervised                       33.2      14.2      30.5
    Trivial                          21.9      7.7       20.5
    Unsupervised (matched data)      28.1      10.0      25.4
    Unsupervised (no matched data)   27.2      9.1       24.1
  • Matched data: using the titles of English Gigaword to train the Discriminator • No matched data: using the titles of CNN/Daily Mail to train the Discriminator [Wang, Lee, EMNLP 2018]
  • 225. Semi-supervised Learning Fig.: ROUGE-1 versus the number of document-summary pairs used (0, 10k, 500k) for the WGAN and Reinforce variants (the two approaches to deal with the discrete issue), moving from unsupervised to semi-supervised; the fully supervised model uses 3.8M pairs. [Wang, Lee, EMNLP 2018]
  • 226. More Unsupervised Summarization • Unsupervised summarization with language prior • Unsupervised multi-document summarization [Eric Chu, Peter Liu, ICML 2019] [Christos Baziotis, etc al., NAACL 2019]
  • 227. G Input Sentence D Said by Trump? Discriminator R Dialogue Response Generation minimize the reconstruction error Make the US great again I would build a great wall you are fired What Trump has said Chat Bot Generated Response Input Sentence (Reconstruct) [Su, et al., INTERSPEECH, 2019] (Thu-P-9-C) General Dialogues
  • 228. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 229. Unsupervised learning with 10M sentences ≈ supervised learning with 100K sentence pairs. [Alexis Conneau, et al., ICLR, 2018] [Guillaume Lample, et al., ICLR, 2018]
  • 230. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 231. Towards Unsupervised ASR - Cycle GAN G ASR Text R TTS D Real Text? Discriminator minimize the reconstruction error (speech chain) how are you good morning i am fine Real Text [Andros Tjandra, et al., ASRU 2017] [Liu, et al., INTERSPEECH 2018] [Yeh, et al., ICLR 2019] [Chen, et al., INTERSPEECH 2019]
  • 232. Towards Unsupervised ASR - Cycle GAN • Unsupervised setting on TIMIT (text and audio are unpair, text is not the transcription of audio) • 63.6% PER (oracle boundaries) • 41.6% PER (automatic segmentation) • 33.1% PER (automatic segmentation) • Semi-supervised setting on Librispeech [Liu, et al., INTERSPEECH 2018] [Yeh, et al., ICLR 2019] (Tue-P-4-B)[Chen, et al., INTERSPEECH 2019] [Liu, et al., ICASSP 2019] [Tomoki Hayashi, et al., SLT 2018] [Takaaki Hori, et al., ICASSP 2019] [Murali Karthick Baskar, et al., INTERSPEECH 2019]
  • 233. Towards Unsupervised ASR - Shared Latent Space Text Encoder Audio Encoder Audio Decoder Text Decoder this is text this is text Unsupervised setting on Librispeech: 76.3% WER WSJ with 2.5 hours paired data: 64.6% WER LJ speech with 20 mins paired data: 11.7% PER [Chen, et al., SLT 2018] Unsupervised speech translation is also possible! [Chung, et al., NIPS 2018] [Jennifer Drexler, et al., SLT 2018] [Ren, et al., ICML 2019] [Chung, et al., ICASSP 2019]
  • 234. Outline of Part IV Sequence Generation by GAN Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation • Unsupervised Speech Recognition
  • 235. To Learn More … https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf You can learn more from the YouTube Channel (in Mandarin)
  • 236. Reference • Sequence Generation • Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, Deep Reinforcement Learning for Dialogue Generation, EMNLP, 2016 • Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, Adversarial Learning for Neural Dialogue Generation, EMNLP, 2017 • Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution, arXiv 2016 • Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative Adversarial Networks, arXiv 2017 • Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, AAAI 2017 • Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron Courville, Adversarial Generation of Natural Language, arXiv, 2017 • Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf, Language Generation with Recurrent Generative Adversarial Networks without Pre- training, ICML workshop, 2017
  • 237. Reference • Sequence Generation • Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang, Zhuoran Wang, Chao Qi , Neural Response Generation via GAN with an Approximate Embedding Layer, EMNLP, 2017 • Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio, Professor Forcing: A New Algorithm for Training Recurrent Networks, NIPS, 2016 • Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin, Adversarial Feature Matching for Text Generation, ICML, 2017 • Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang, Long Text Generation via Adversarial Training with Leaked Information, AAAI, 2018 • Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun, Adversarial Ranking for Language Generation, NIPS, 2017 • William Fedus, Ian Goodfellow, Andrew M. Dai, MaskGAN: Better Text Generation via Filling in the______, ICLR, 2018
  • 238. Reference • Sequence Generation • Yi-Lin Tuan, Hung-Yi Lee, Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation, TASLP, 2019 • Jingjing Xu, Xuancheng Ren, Junyang Lin, Xu Sun, Diversity-Promoting GAN: A Cross-Entropy Based Generative Adversarial Network for Diversified Text Generation, EMNLP, 2018 • Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu, Neural Text Generation: Past, Present and Beyond, arXiv, 2018 • Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, Yong Yu, Texygen: A Benchmarking Platform for Text Generation Models, arXiv, 2018 • Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly, On Accurate Evaluation of GANs for Language Generation, arXiv, 2018 • Guy Tevet, Gavriel Habib, Vered Shwartz, Jonathan Berant, Evaluating Text GANs as Language Models, arXiv, 2018 • Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, Laurent Charlin, Language GANs Falling Short, arXiv, 2018
  • 239. Reference • Sequence Generation • Zhen Yang, Wei Chen, Feng Wang, Bo Xu, Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets, NAACL, 2018 • Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu, Adversarial Neural Machine Translation, arXiv 2017 • Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li, Generative Adversarial Network for Abstractive Text Summarization, AAAI 2018 • Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele, Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training, ICCV 2017 • Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing, Recurrent Topic-Transition GAN for Visual Paragraph Generation, arXiv 2017 • Weili Nie, Nina Narodytska, Ankit Patel, RelGAN: Relational Generative Adversarial Networks for Text Generation, ICLR 2019
  • 240. Reference • Sequence Generation • Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee, "Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation", INTERSPEECH 2019 • Cyprien de Masson d'Autume, Mihaela Rosca, Jack Rae, Shakir Mohamed, Training language GANs from Scratch, arXiv 2019
  • 241. Reference • Text Style Transfer • Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Rui Yan, Style Transfer in Text: Exploration and Evaluation, AAAI, 2018 • Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola, Style Transfer from Non-Parallel Text by Cross-Alignment, NIPS 2017 • Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis, ICASSP, 2018 • Junbo (Jake) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun, Adversarially Regularized Autoencoders, arxiv, 2017 • Feng-Guang Su, Aliyah Hsu, Yi-Lin Tuan and Hung-yi Lee, "Personalized Dialogue Response Generation Learned from Monologues", INTERSPEECH, 2019
  • 242. Reference • Unsupervised Abstractive Summarization • Yau-Shian Wang, Hung-Yi Lee, "Learning to Encode Text as Human- Readable Summaries using Generative Adversarial Networks", EMNLP, 2018 • Eric Chu, Peter Liu, “MeanSum: A Neural Model for Unsupervised Multi- Document Abstractive Summarization”, ICML, 2019 • Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros Potamianos, “SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression”, NAACL 2019
  • 243. Reference • Unsupervised Machine Translation • Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou, Word Translation Without Parallel Data, ICRL 2018 • Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only, ICRL 2018
  • 244. Reference • Unsupervised Speech Recognition • Alexander H. Liu, Hung-yi Lee, Lin-shan Lee, Adversarial Training of End-to- end Speech Recognition Using a Criticizing Language Model, ICASSP 2018 • Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings, INTERSPEECH, 2018 • Kuan-yu Chen, Che-ping Tsai, Da-Rong Liu, Hung-yi Lee and Lin-shan Lee, "Completely Unsupervised Phoneme Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models", INTERSPEECH, 2019 • Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-yi Lee, Lin-shan Lee, "Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval", SLT, 2018 • Chih-Kuan Yeh, Jianshu Chen, Chengzhu Yu, Dong Yu, Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching, ICLR, 2019
  • 245. Reference • Unsupervised Speech Recognition • Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux, Cycle-consistency training for end-to-end speech recognition, ICASSP 2019 • Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký, Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text, INTERSPEECH 2019 • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, Listening while Speaking: Speech Chain by Deep Learning, ASRU 2017 • Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces, NIPS, 2018 • Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Towards Unsupervised Speech-to-Text Translation, ICASSP 2019 • Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu, Almost Unsupervised Text to Speech and Automatic Speech Recognition, ICML 2019
  • 246. Reference • Unsupervised Speech Recognition • Shigeki Karita , Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, Marc Delcroix, Semi-Supervised End-to-End Speech Recognition, INTERSPEECH, 2018 • Jennifer Drexler, James R. Glass, “Combining End-to-End and Adversarial Training for Low-Resource Speech Recognition”, SLT 2018 • Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda, Back-Translation-Style Data Augmentation for End-to-End ASR, SLT, 2018
  • 247. Please download the latest slides here: http://speech.ee.ntu.edu.tw/~tlkagk/GAN_3hour.pdf