Generative Adversarial Network
and its Applications to Speech Processing
and Natural Language Processing
Hung-yi Lee and Yu Tsao
Outline
Part I: Basic Idea of Generative Adversarial
Network (GAN)
Part II: A little bit theory
Part III: Applications to Speech Processing
Part IV: Applications to Natural Language
Processing
Take a break
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
(not updated since 2018.09)
More than 500 species
in the zoo
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
GAN
ACGAN
BGAN
DCGAN
EBGAN
fGAN
GoGAN
CGAN
……
Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding
Generative Adversarial Networks”, arXiv, 2017
How many papers have "adversarial" in their titles? (INTERSPEECH & ICASSP)
Year:         2012 2013 2014 2015 2016 2017 2018 2019
ICASSP:          0    0    0    0    1    2   42   62
INTERSPEECH:     0    0    0    0    2   11   14   32
It is a wise choice to
attend this tutorial.
Part I: Basic Idea
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Generation
image
2. Conditional Generation
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional Generation
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Anime Face Generation
Draw
Generator
Examples
Basic Idea of GAN
Generator: vector → image. It is a neural network (NN), or a function.
Example: input vector (0.1, −3, ⋯, 2.4, 0.9) → output image.
Changing the input vector changes the output image, e.g. (3, −3, ⋯, 2.4, 0.9), (0.1, 2.1, ⋯, 5.4, 0.9), (0.1, −3, ⋯, 2.4, 3.5); the input is a high-dimensional vector.
Powered by: http://mattya.github.io/chainer-DCGAN/
Each dimension of the input vector represents some characteristics, e.g. longer hair, blue hair, open mouth.
Basic Idea of GAN
Discriminator: image → scalar. It is a neural network (NN), or a function.
Larger value means real, smaller value means fake.
Example: the discriminator outputs 1.0 for realistic (real) images and 0.1 for poorly generated images.
Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D
  Sample vectors and feed them through the fixed G to obtain generated objects (label 0).
  Randomly sample real objects from the database (label 1).
  Discriminator learns to assign high scores to real objects and low scores to generated objects.
Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 2: Fix discriminator D, and update generator G
  vector → Generator (NN) → image → Discriminator → scalar (e.g. 0.13)
  Treat generator + discriminator as one large network; update only the generator (fix the discriminator) by gradient ascent on the discriminator's output.
  Generator learns to "fool" the discriminator.
Algorithm (putting the two steps together)
• Initialize generator and discriminator
• In each training iteration:
  Learning D: sample some real objects from the database (label 1); generate some fake objects from sampled vectors with G (label 0); update D while G is fixed.
  Learning G: feed sampled vectors through G into D; update G (with D fixed) so that D's output for the generated images approaches 1.
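The two alternating steps can be written as a short training loop. Below is a minimal sketch in PyTorch (not taken from the slides); `generator`, `discriminator`, and `real_loader` are assumed to exist, and all hyper-parameters are illustrative.

```python
# Minimal GAN training loop (sketch). Assumes `generator`, `discriminator`, and
# `real_loader` are defined elsewhere; sizes and learning rates are illustrative.
import torch
import torch.nn.functional as F

z_dim = 100
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

for real_images in real_loader:                       # one training iteration
    n = real_images.size(0)

    # Step 1: fix G, update D (real -> 1, generated -> 0)
    fake_images = generator(torch.randn(n, z_dim)).detach()   # detach: G stays fixed
    d_loss = F.binary_cross_entropy_with_logits(discriminator(real_images),
                                                torch.ones(n, 1)) \
           + F.binary_cross_entropy_with_logits(discriminator(fake_images),
                                                torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 2: fix D, update G so that D scores its output as "real"
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(generator(torch.randn(n, z_dim))), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```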
Anime Face Generation
100 updates
Source of training data: https://zhuanlan.zhihu.com/p/24767059
Anime Face Generation
1000 updates
Anime Face Generation
2000 updates
Anime Face Generation
5000 updates
Anime Face Generation
10,000 updates
Anime Face Generation
20,000 updates
Anime Face Generation
50,000 updates
In 2019, with StyleGAN ……
Source of video:
https://www.gwern.net/Faces
(Figure: outputs of G as the interpolation coefficient of the input vector varies from 0.0 to 0.9 in steps of 0.1 — the generated faces change smoothly.)
Progressive GAN
[Tero Karras, et al., ICLR, 2018]
The first GAN
[Ian J. Goodfellow, et al., NIPS, 2014]
Today ……
[Andrew Brock, et al., arXiv, 2018]
[David Bau, et al., ICLR 2019]
Does the generator have the concept
of objects?
Some neurons correspond to specific
objects, for example, tree
Remove the neurons for tree
[David Bau, et al., ICLR 2019]
Activate the neurons for tree
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Generation
image
2. Conditional Generation
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional Generation
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Text-to-Image
• Traditional supervised approach
  NN: text (e.g. "a dog is running", "a bird is flying") → image, trained so the output is as close as possible to the target image.
  The same text (e.g. c1: "a dog is running") is paired with many different images, so the network learns to output their average → a blurry image!
Conditional GAN
G: condition c (e.g. "train") + z (sampled from a normal distribution) → image x = G(c,z)
D (original): x → scalar (x is a real image or not)
  Real images: 1; Generated images: 0
Generator will learn to generate realistic images ….
But completely ignore the input conditions.
[Scott Reed, et al., ICML, 2016]
Conditional GAN
G: condition c (e.g. "train") + z (sampled from a normal distribution) → image x = G(c,z)
D (better): (c, x) → scalar (x is realistic or not, AND c and x are matched or not)
  True text-image pairs, e.g. (train, real train image): 1
  Generated images, e.g. (train, generated image): 0; mismatched pairs, e.g. (cat, train image): 0
[Scott Reed, et al., ICML, 2016]
Conditional GAN - Discriminator
[Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena et al., ICML, 2017]
Design used in almost every paper: condition c and object x are fed into one network that outputs a single score.
Alternative design: one network scores whether x is realistic, a second network scores whether c and x are matched, and the two scores are combined.
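To make the first design concrete, here is a hedged PyTorch sketch of a conditional discriminator that embeds c, encodes x, and maps their concatenation to one score; all sizes and names are made up for illustration.

```python
# Sketch of a conditional discriminator (condition c + object x -> one score).
# Layer widths, the embedding size, and the flattened image size are illustrative.
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    def __init__(self, num_conditions=10, img_dim=64 * 64):
        super().__init__()
        self.cond_embed = nn.Embedding(num_conditions, 128)          # embed condition c
        self.img_encode = nn.Sequential(nn.Linear(img_dim, 512), nn.LeakyReLU(0.2))
        self.score = nn.Sequential(nn.Linear(512 + 128, 256), nn.LeakyReLU(0.2),
                                   nn.Linear(256, 1))                # single scalar score

    def forward(self, c, x):
        h = torch.cat([self.img_encode(x.flatten(1)), self.cond_embed(c)], dim=1)
        return self.score(h)

# Training targets: (c, matching real x) -> 1; (c, generated x) -> 0;
# (c, real but mismatched x) -> 0, so the generator cannot ignore the condition.
```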
Conditional GAN
paired data
blue eyes
red hair
short hair
Collecting anime faces
and the description of its
characteristics
red hair,
green eyes
blue hair,
red eyes
The images are generated by
Yen-Hao Chen, Po-Chun Chien,
Jun-Chen Xie, Tsung-Han Wu.
Conditional GAN - Image-to-image
G
𝑧
x = G(c,z)
𝑐
[Phillip Isola, et al., CVPR, 2017]
Image translation, or pix2pix
as close as
possible
Conditional GAN - Image-to-image
• Traditional supervised approach: NN → image, trained with a regression loss (e.g. L1); the result is blurry.
  (Testing figure: input vs. L1 output)
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Image-to-image
• GAN approach: G(c, z) → image; D(image) → scalar; GAN + L1 keeps the output sharp and close to the input.
  (Testing figure: input vs. L1 vs. GAN vs. GAN + L1 outputs)
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Sound-to-image
G: c (a sound, e.g. "a dog barking sound") → image
Training data collection: from videos (frames paired with their audio)
[Wan, et al., ICASSP 2019]
Conditional GAN - Sound-to-image
• Audio-to-image demo: https://wjohn1483.github.io/audio_to_scene/index.html
The images are generated by Chia-Hung Wan and Shun-Po Chuang.
(Figure: generated images as the input sound becomes louder.)
Conditional GAN - Image-to-label
Multi-label Image Classifier = Conditional Generator
Input condition
Generated output
Conditional GAN - Image-to-label
F1 MS-COCO NUS-WIDE
VGG-16 56.0 33.9
+ GAN 60.4 41.2
Inception 62.4 53.5
+GAN 63.8 55.8
Resnet-101 62.8 53.1
+GAN 64.0 55.4
Resnet-152 63.3 52.1
+GAN 63.9 54.1
Att-RNN 62.1 54.7
RLSD 62.0 46.9
The classifiers can have
different architectures.
The classifiers are
trained as conditional
GAN.
[Tsai, et al., ICASSP 2019]
Conditional GAN - Image-to-label
F1 MS-COCO NUS-WIDE
VGG-16 56.0 33.9
+ GAN 60.4 41.2
Inception 62.4 53.5
+GAN 63.8 55.8
Resnet-101 62.8 53.1
+GAN 64.0 55.4
Resnet-152 63.3 52.1
+GAN 63.9 54.1
Att-RNN 62.1 54.7
RLSD 62.0 46.9
The classifiers can have
different architectures.
The classifiers are
trained as conditional
GAN.
Conditional GAN
outperforms other
models designed for
multi-label.
Conditional GAN
- Video Generation
Generator
Discrimi
nator
Last frame is real
or generated
Discriminator thinks it is real
[Michael Mathieu, et al., arXiv, 2015]
https://github.com/dyelax/Adversarial_Video_Generation
More about Video Generation
https://arxiv.org/abs/1905.08233
[Egor Zakharov, et al., arXiv, 2019]
Domain Adversarial Training
• Training and testing data are in different domains
  Training data → Generator → feature
  Testing data → Generator → feature
  Goal: the features of the two domains follow the same distribution.
  (Take digit classification as example; blue points / red points are the features of the two domains.)
Domain Adversarial Training
Feature extractor (Generator) → feature; Discriminator (Domain classifier): which domain does the feature come from?
If we only ask the feature extractor to fool the domain classifier, it can simply always output zero vectors — the domain classifier fails, but the features carry no information.
Domain Adversarial Training
Feature extractor (Generator) → feature
Discriminator (Domain classifier): which domain?  Label predictor: which digit?
The feature extractor must not only cheat the domain classifier, but also satisfy the label predictor at the same time.
Successfully applied on image classification [Ganin et al., ICML, 2015][Ajakan et al., JMLR, 2016]
More speech-related applications in Part III.
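Domain adversarial training is usually implemented with a gradient reversal layer (GRL): identity in the forward pass, gradient multiplied by −λ in the backward pass, so the feature extractor is pushed to confuse the domain classifier while still serving the label predictor. A minimal PyTorch sketch (the three sub-networks are assumed to exist):

```python
# Gradient reversal layer (GRL) sketch for domain adversarial training.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None      # reversed (scaled) gradient

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage inside a model (feature_extractor, label_predictor, domain_classifier
# are assumed to be defined elsewhere):
#   feat          = feature_extractor(image)
#   class_logits  = label_predictor(feat)                   # trained normally
#   domain_logits = domain_classifier(grad_reverse(feat))   # adversarial branch
# total_loss = class_loss + domain_loss; backprop handles the sign flip.
```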
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Generation
image
2. Conditional Generation
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional Generation
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Unsupervised
Conditional Generation
G
Object in Domain X Object in Domain Y
Transform an object from one domain to another
without paired data
Domain X Domain Y
photos
Condition Generated Object
Vincent van Gogh’s
paintings
Not Paired
More Applications in Parts III and IV
Use image style transfer as example here
Unsupervised Conditional Generation
• Approach 1: Cycle-GAN and its variants — directly learn a generator 𝐺 𝑋→𝑌 from domain X to domain Y.
• Approach 2: Shared latent space — an encoder 𝐸𝑁 𝑋 of domain X extracts a latent code (e.g. face attributes), and a decoder 𝐷𝐸 𝑌 of domain Y generates the output.
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain Y
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain Y
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Not what we want!
ignore input
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Not what we want!
ignore input
[Tomer Galanti, et al. ICLR, 2018]
The issue can be avoided by network design.
Simpler generator makes the input and
output more closely related.
Cycle GAN
𝐺 𝑋→𝑌
Domain X
Domain Y
𝐷 𝑌
Domain X
scalar
Input image
belongs to
domain Y or not
Become similar
to domain Y
Encoder
Network
Encoder
Network
pre-trained
as close as
possible
Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
Cycle GAN
𝐺 𝑋→𝑌
𝐷 𝑌
Domain Y
scalar
Input image
belongs to
domain Y or not
𝐺Y→X
as close as possible
Lack of information
for reconstruction
[Jun-Yan Zhu, et al., ICCV, 2017]
Cycle consistency
Cycle GAN
  X → 𝐺 𝑋→𝑌 → Y → 𝐺 𝑌→𝑋 → back to X: as close as possible to the input.
  Y → 𝐺 𝑌→𝑋 → X → 𝐺 𝑋→𝑌 → back to Y: as close as possible to the input.
  𝐷 𝑌: scalar, belongs to domain Y or not; 𝐷 𝑋: scalar, belongs to domain X or not.
The same idea appeared as Cycle GAN [Jun-Yan Zhu, et al., ICCV, 2017], Dual GAN [Zili Yi, et al., ICCV, 2017], and Disco GAN [Taeksoo Kim, et al., ICML, 2017].
For multiple domains, consider StarGAN [Yunjey Choi, arXiv, 2017].
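A compact view of how the adversarial terms and the two cycle-consistency terms fit together, as a hedged PyTorch sketch; `G_XY`, `G_YX`, `D_X`, `D_Y` are assumed networks, and the least-squares GAN form is used only to keep the example short.

```python
# CycleGAN loss sketch. G_XY: X -> Y, G_YX: Y -> X; D_X, D_Y score domain membership.
# The generators/discriminators and the batches x, y are assumed to exist.
import torch
import torch.nn.functional as F

lambda_cyc = 10.0

def generator_loss(x, y):
    fake_y, fake_x = G_XY(x), G_YX(y)
    # adversarial terms: make D_Y / D_X believe the translations are real
    adv = F.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) \
        + F.mse_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # cycle consistency: X -> Y -> X and Y -> X -> Y should reconstruct the inputs
    cyc = F.l1_loss(G_YX(fake_y), x) + F.l1_loss(G_XY(fake_x), y)
    return adv + lambda_cyc * cyc

def discriminator_loss(x, y):
    fake_y, fake_x = G_XY(x).detach(), G_YX(y).detach()
    d_y = F.mse_loss(D_Y(y), torch.ones_like(D_Y(y))) \
        + F.mse_loss(D_Y(fake_y), torch.zeros_like(D_Y(fake_y)))
    d_x = F.mse_loss(D_X(x), torch.ones_like(D_X(x))) \
        + F.mse_loss(D_X(fake_x), torch.zeros_like(D_X(fake_x)))
    return d_y + d_x
```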
Issue of Cycle Consistency
• CycleGAN: a Master of Steganography
[Casey Chu, et al., NIPS workshop, 2017]
𝐺Y→X𝐺 𝑋→𝑌
The information is hidden.
Unsupervised Conditional Generation
• Approach 1: Cycle-GAN and its variants
• Approach 2: Shared latent space — encoder 𝐸𝑁 𝑋 of domain X extracts a latent code (e.g. face attributes), and decoder 𝐷𝐸 𝑌 of domain Y generates the output.
Shared latent space
Target: domain-X image → 𝐸𝑁 𝑋 → latent code (the face attributes), − domain-x information + domain-y information → 𝐷𝐸 𝑌 → domain-Y image.
Shared latent space — Training
Train one auto-encoder per domain, minimizing reconstruction error:
  domain-X image → 𝐸𝑁 𝑋 → 𝐷𝐸 𝑋 → reconstructed domain-X image
  domain-Y image → 𝐸𝑁 𝑌 → 𝐷𝐸 𝑌 → reconstructed domain-Y image
Because we train two auto-encoders separately …
The images with the same attribute may not project to the same position in the latent space.
Shared latent space — Training
Add a discriminator for each domain (𝐷 𝑋 for domain X, 𝐷 𝑌 for domain Y) on top of minimizing reconstruction error.
Add a domain discriminator that guesses whether a latent code comes from 𝐸𝑁 𝑋 or 𝐸𝑁 𝑌; 𝐸𝑁 𝑋 and 𝐸𝑁 𝑌 learn to fool it.
The domain discriminator forces the outputs of 𝐸𝑁 𝑋 and 𝐸𝑁 𝑌 to have the same distribution.
[Guillaume Lample, et al., NIPS, 2017]
Shared latent space — Training
Cycle consistency: domain-X image → 𝐸𝑁 𝑋 → 𝐷𝐸 𝑌 → 𝐸𝑁 𝑌 → 𝐷𝐸 𝑋 → back to domain X, minimizing the reconstruction error (together with 𝐷 𝑋 and 𝐷 𝑌).
Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
Shared latent space — Training
Semantic consistency: the original image and its translated version should map to the same point in the latent space.
Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]
Shared latent space
Sharing the parameters of the encoders (𝐸𝑁 𝑋, 𝐸𝑁 𝑌) and of the decoders (𝐷𝐸 𝑋, 𝐷𝐸 𝑌):
Couple GAN [Ming-Yu Liu, et al., NIPS, 2016]; UNIT [Ming-Yu Liu, et al., NIPS, 2017]
In the extreme case, use one encoder to extract domain-independent information and input an extra indicator (x or y) to control the decoder.
Widely used in Voice Conversion (Part III)
https://selfie2anime.com/
Generator
“Girl with
red hair”
Generator
−0.3
0.1
⋮
0.9
random vector
Three Categories of GAN
1. Typical GAN
image
2. Conditional GAN
Generator
text
image (paired data)
blue eyes,
red hair,
short hair
3. Unsupervised Conditional GAN
Photo (domain x) → Vincent van Gogh's style (domain y), unpaired data
Reference
• Generation
• Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial
Nets, NIPS, 2014
• Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing
of GANs for Improved Quality, Stability, and Variation, ICLR, 2018
• Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for
High Fidelity Natural Image Synthesis, arXiv, 2018
• David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B.
Tenenbaum, William T. Freeman, Antonio Torralba, GAN Dissection:
Visualizing and Understanding Generative Adversarial Networks, ICLR 2019
Reference
• Conditional Generation
• Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt
Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML,
2016
• Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image
Translation with Conditional Adversarial Networks, CVPR, 2017
• Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video
prediction beyond mean square error, arXiv, 2015
• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets,
arXiv, 2014
• Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator,
ICLR, 2018
• Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei
Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with
Stacked Generative Adversarial Networks, arXiv, 2017
• Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image
Synthesis With Auxiliary Classifier GANs, ICML, 2017
Reference
• Conditional Generation
• Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by
Backpropagation, ICML, 2015
• Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario
Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016
• Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A
Novel Framework for Multi-class Classification, submitted to ICASSP 2019
• Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky, Few-
Shot Adversarial Learning of Realistic Neural Talking Head Models, arXiv
2019
• Chia-Hung Wan, Shun-Po Chuang, Hung-Yi Lee, "Towards Audio to Scene
Image Synthesis using Generative Adversarial Network", ICASSP, 2019
Reference
• Unsupervised Conditional Generation
• Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-
Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017
• Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual
Learning for Image-to-Image Translation, ICCV, 2017
• Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity
Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018
• Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image
Generation, ICLR, 2017
• Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN:
Unrestrained Scalability for Image Domain Translation, arXiv, 2017
• Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar
Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to-
Image Translation for Many-to-Many Mappings, arXiv, 2017
Reference
• Unsupervised Conditional Generation
• Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic
Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by
Sliding Attributes, NIPS, 2017
• Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim,
Learning to Discover Cross-Domain Relations with Generative Adversarial
Networks, ICML, 2017
• Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS,
2016
• Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image
Translation Networks, NIPS, 2017
• Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim,
Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi-
Domain Image-to-Image Translation, arXiv, 2017
Part II: A little bit Theory
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Generator
• A generator G is a network. The network defines a probability distribution $P_G$.
  $z$ (sampled from a normal distribution) → generator G → $x = G(z)$
  $x$: an image (a high-dimensional vector)
We want $P_G(x)$ to be as close as possible to $P_{data}(x)$:
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
where Div is the divergence between the distributions $P_G$ and $P_{data}$.
How to compute the divergence?
Discriminator
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
Although we do not know the distributions $P_G$ and $P_{data}$, we can sample from them.
  Sampling from $P_G$: sample vectors from the normal distribution and feed them through G.
  Sampling from $P_{data}$: sample images from the database.
Discriminator
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
Train the discriminator on data sampled from $P_{data}$ (real) and data sampled from $P_G$ (generated).
Example objective function for D (G is fixed):
$V(G, D) = E_{x \sim P_{data}}[\log D(x)] + E_{x \sim P_G}[\log(1 - D(x))]$
Training: $D^* = \arg\max_D V(D, G)$
Using this example objective function is exactly the same as training a binary classifier.
The maximum objective value is related to JS divergence. [Goodfellow, et al., NIPS, 2014]
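The relation to JS divergence can be stated precisely (this is the standard result from Goodfellow et al., 2014, restated here for reference): for a fixed G, the optimal discriminator is
$D^*(x) = \dfrac{P_{data}(x)}{P_{data}(x) + P_G(x)}$
and substituting it back gives
$\max_D V(G, D) = -2\log 2 + 2\,\mathrm{JSD}(P_{data} \parallel P_G)$
so minimizing $\max_D V(G, D)$ over G is minimizing the JS divergence between $P_{data}$ and $P_G$.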
Discriminator
Small divergence between $P_G$ and $P_{data}$ → the two sets of samples are hard to discriminate → small $\max_D V(D, G)$.
Large divergence → easy to discriminate → large $\max_D V(D, G)$.
Training: $D^* = \arg\max_D V(D, G)$; the maximum objective value is related to JS divergence.
So the generator objective becomes
$G^* = \arg\min_G \max_D V(G, D)$
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D: $D^* = \arg\max_D V(D, G)$
Step 2: Fix discriminator D, and update generator G
[Goodfellow, et al., NIPS, 2014]
Can we use other divergences? Yes — f-GAN lets you use the divergence you like ☺ [Sebastian Nowozin, et al., NIPS, 2016]
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
GAN is difficult to train ……
• There is a saying ……
(I found this joke from 陳柏文’s facebook.)
Too many tips ……
• I do a little survey among 12 students …..
Q: What is the most helpful tip for
training GAN?
WGAN (33.3%)
Spectral Norm (16.7%)
JS divergence is not suitable
• In most cases, $P_G$ and $P_{data}$ do not overlap.
• 1. The nature of data: both $P_{data}$ and $P_G$ are low-dimensional manifolds in a high-dimensional space, so their overlap can be ignored.
• 2. Sampling: even if $P_{data}$ and $P_G$ overlap, with limited samples the two point sets rarely intersect.
What is the problem of JS divergence?
Consider generators $P_{G_0}, P_{G_1}, \ldots, P_{G_{100}}$ whose distributions move progressively closer to $P_{data}$:
$JS(P_{G_0}, P_{data}) = \log 2$, $JS(P_{G_1}, P_{data}) = \log 2$, ……, $JS(P_{G_{100}}, P_{data}) = 0$
JS divergence is log2 whenever two distributions do not overlap, no matter how far apart they are.
Intuition: if two distributions do not overlap, a binary classifier achieves 100% accuracy, so $P_{G_0}$ and $P_{G_1}$ look equally bad; the same maximum objective value (same divergence) is obtained, and there is no signal pushing $P_G$ toward $P_{data}$.
Wasserstein distance
• Consider one distribution P as a pile of earth, and another distribution Q as the target.
• The Wasserstein distance is the average distance the earth mover has to move the earth: if all the mass of P must be moved a distance d to become Q, then $W(P, Q) = d$.
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
There are many possible "moving plans" (some give a smaller average distance, some a larger one).
Use the moving plan with the smallest average distance to define the Wasserstein distance.
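For one-dimensional samples the optimal moving plan can be computed exactly; a small SciPy sketch with made-up point clouds:

```python
# Wasserstein-1 (earth mover's) distance between two 1-D sample sets (SciPy).
import numpy as np
from scipy.stats import wasserstein_distance

p_samples = np.array([0.0, 0.0, 1.0, 2.0])   # "pile of earth" P (toy data)
q_samples = np.array([5.0, 5.0, 6.0, 7.0])   # target distribution Q (toy data)

print(wasserstein_distance(p_samples, q_samples))   # 5.0: mass moves 5 units on average
```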
What is the problem of JS divergence?
As $P_G$ moves toward $P_{data}$ (distance $d_0 > d_1 > \ldots > 0$):
$JS(P_{G_0}, P_{data}) = \log 2$, $JS(P_{G_1}, P_{data}) = \log 2$, ……, $JS(P_{G_{100}}, P_{data}) = 0$ — no useful signal until the distributions overlap.
$W(P_{G_0}, P_{data}) = d_0$, $W(P_{G_1}, P_{data}) = d_1$, ……, $W(P_{G_{100}}, P_{data}) = 0$ — decreases smoothly. Better!
WGAN
Evaluate the Wasserstein distance between $P_{data}$ and $P_G$:
$\max_{D \in 1\text{-Lipschitz}} \left\{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] \right\}$
[Martin Arjovsky, et al., arXiv, 2017]
D has to be smooth enough. How to fulfill this constraint?
Without the constraint, D would push its outputs toward +∞ on real samples and −∞ on generated samples, and the training of D would not converge; keeping D smooth prevents this.
• Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017]: force each parameter w to stay between c and −c (after a parameter update, if w > c, set w = c; if w < −c, set w = −c).
• Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017]: keep the gradient norm of D close to 1 (e.g. on samples between or near the real and generated data) [Kodali, et al., arXiv, 2017][Wei, et al., ICLR, 2018].
• Spectral Normalization → keep the gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018].
$\max_{D \in 1\text{-Lipschitz}} \left\{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] \right\}$
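A hedged sketch of the gradient-penalty version of the critic loss, computed on random interpolations between real and generated batches (PyTorch; λ = 10 follows the Improved WGAN paper):

```python
# WGAN-GP critic loss sketch. `D` is the critic; `real` and `fake` have the same shape.
import torch

def critic_loss_wgan_gp(D, real, fake, lambda_gp=10.0):
    # Wasserstein term (minimized by the critic): E_fake[D] - E_real[D]
    loss = D(fake).mean() - D(real).mean()

    # Gradient penalty on random interpolations between real and fake samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + lambda_gp * penalty
```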
More Tips
• Improved techniques for training GANs
• Tips in DCGAN [Alec Radford, et al., ICLR 2016]
• Guideline for network architecture design for
image generation
• Tips from Soumith
• https://github.com/soumith/ganhacks
• Tips from BigGAN [Andrew Brock, et al., arXiv, 2018]
[Tim Salimans, et al., NIPS, 2016]
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Inception Score
Feed each generated image $x$ into an off-the-shelf image classifier (e.g. Inception net, VGG) to obtain $P(y|x)$, where $y$ is the class.
  A concentrated $P(y|x)$ for a single image means higher visual quality.
  Averaging over all images, $P(y) = \frac{1}{N} \sum_n P(y^n | x^n)$; a uniform $P(y)$ means higher variety.
Inception Score $= \sum_x \sum_y P(y|x) \log P(y|x) - \sum_y P(y) \log P(y)$
  (the first term is the negative entropy of P(y|x), the second the entropy of P(y))
[Tim Salimans, et al., NIPS, 2016]
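Given the matrix of class posteriors for N generated images, the score above can be computed in a few lines; a sketch (the pretrained classifier producing `p_y_given_x` is assumed to be available):

```python
# Inception-Score computation from class posteriors (NumPy).
# p_y_given_x: array of shape (N, num_classes); each row is P(y|x) for one image.
import numpy as np

def slide_score(p_y_given_x, eps=1e-12):
    # The slide's form: sum_x sum_y P(y|x) log P(y|x) - sum_y P(y) log P(y)
    p_y = p_y_given_x.mean(axis=0)
    return np.sum(p_y_given_x * np.log(p_y_given_x + eps)) \
         - np.sum(p_y * np.log(p_y + eps))

def inception_score(p_y_given_x, eps=1e-12):
    # The commonly reported form: exp of the mean KL(P(y|x_n) || P(y)) over images.
    p_y = p_y_given_x.mean(axis=0, keepdims=True)
    kl = np.sum(p_y_given_x * (np.log(p_y_given_x + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```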
Fréchet Inception Distance (FID)
Blue points: latent representations (from an Inception net) of the generated images; red points: latent representations of the real images.
Fit a Gaussian to each set of points; FID = Fréchet distance between the two Gaussians (smaller is better).
[Martin Heusel, et al., NIPS, 2017]
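With Inception-net activations for real and generated images in hand, the Fréchet distance between the two fitted Gaussians is a short computation; a NumPy/SciPy sketch (feature extraction is assumed done elsewhere):

```python
# FID sketch: Frechet distance between Gaussians fitted to two feature sets.
# feats_real, feats_fake: arrays of shape (N, d) of Inception-net activations.
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)        # matrix square root
    if np.iscomplexobj(covmean):                   # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```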
To learn more about evaluation …
Pros and cons of GAN evaluation measures
https://arxiv.org/abs/1802.03446
[Ali Borji, 2019]
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Basic Components
Actor, Environment (Env), Reward Function
  e.g. Video game: the reward function is "get 20 points when killing a monster"; Go: the reward function is given by the rules of Go.
You cannot control the environment or the reward function.
Neural network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g. the pixels).
• Output of the neural network: each action corresponds to a neuron in the output layer, giving the score of each action (e.g. left 0.7, right 0.2, fire 0.1).
• Take the action based on the probability.
Actor, Environment, Reward
The actor sees $s_1$ and takes action $a_1$ (e.g. "right"); the environment returns $s_2$; the actor takes $a_2$ (e.g. "fire"); the environment returns $s_3$; ……
Trajectory: $\tau = \{s_1, a_1, s_2, a_2, \cdots, s_T, a_T\}$
Reinforcement Learning v.s. GAN
Actor → Generator: both are updated. Environment + Reward Function → Discriminator: the reward function is fixed, whereas the discriminator is updated.
At each step the reward function gives a reward $r_t$; the total reward of a trajectory is $R(\tau) = \sum_{t=1}^{T} r_t$.
The environment and reward function are a "black box": you cannot use backpropagation through them.
Inverse Reinforcement Learning
The reward function is not available (in many cases, it is difficult to define a reward function), but we have demonstrations of the expert: $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$, where each $\hat{\tau}$ is a trajectory of the expert.
  Self-driving: record human drivers.  Robot: physically guide the robot's arm to demonstrate the task.
Inverse Reinforcement Learning
Reinforcement learning: reward function + environment → optimal actor.
Inverse reinforcement learning: expert demonstrations $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$ + environment → reward function.
➢ Then use the learned reward function to find the optimal actor.
➢ Modeling the reward can be easier: a simple reward function can lead to a complex policy.
Framework of IRL
The expert $\hat{\pi}$ provides demonstrations $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$; the actor $\pi$ generates trajectories $\tau_1, \tau_2, \cdots, \tau_N$.
Obtain a reward function R under which the expert is always the best: $\sum_{n=1}^{N} R(\hat{\tau}_n) > \sum_{n=1}^{N} R(\tau_n)$
Find an actor that maximizes R (by reinforcement learning), then update R, and iterate.
Reward function → Discriminator; Actor → Generator
GAN: D gives a high score to real examples and a low score to generated ones; find a G whose output obtains a large score from D.
IRL: the reward function gives a larger reward to the expert trajectories $\hat{\tau}_n$ and a lower reward to the actor's trajectories $\tau$; find an actor that obtains a large reward.
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Reference
• Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative
Neural Samplers using Variational Divergence Minimization”, NIPS, 2016
• Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017
• Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron
Courville, Improved Training of Wasserstein GANs, NIPS, 2017
• Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial
Network, arXiv, 2016
• Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are
GANs Created Equal? A Large-Scale Study”, arXiv, 2017
• Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi
Chen Improved Techniques for Training GANs, NIPS, 2016
• Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp
Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local
Nash Equilibrium, NIPS, 2017
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Part III: Speech Signal
Processing
Yu Tsao, Ph.D., Academia Sinica
yu.tsao@citi.sinica.edu.tw
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech Signal Generation (Regression Task)
  Paired input and output speech; a generator G produces the output, trained with an objective function.
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
  An encoder E (embedding function 𝑔(∙)) followed by a classifier G (ℎ(∙)) outputs the label 𝒚.
  Acoustic mismatch: the model is trained on clean data 𝒙 but tested on noisy data ෥𝒙 (with embedding ෤𝒛 = 𝑔(෥𝒙)), accented speech, or channel-distorted speech.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech Enhancement
• Neural network models for spectral mapping: noisy speech → G (enhancing) → enhanced output, trained with an objective function.
➢ Model structures of G: DNN [Wang et al., NIPS 2012; Xu et al., SPL 2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al., Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al., Interspeech 2016].
• Typical objective functions: mean square error (MSE) [Xu et al., TASLP 2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al., MLSP 2017], STOI [Fu et al., TASLP 2018].
➢ GAN is used as a new objective function to estimate the parameters in G.
Speech Enhancement
• Speech enhancement GAN (SEGAN) [Pascual et al., Interspeech 2017]
Speech Enhancement (SEGAN)
• Experimental results (Table 1: objective evaluation; Table 2: subjective evaluation; Fig. 1: preference test)
SEGAN yields better speech enhancement results than Noisy and Wiener.
Speech Enhancement
• Pix2Pix [Michelsanti et al., Interspeech 2017]
  G: noisy spectrogram → output (enhanced) spectrogram, trained toward the clean spectrogram.
  D: scalar; (noisy, clean) pairs are real, (noisy, output) pairs are fake.
Speech Enhancement (Pix2Pix)
• Spectrogram analysis (Fig. 2: spectrogram comparison of Pix2Pix with baseline methods — Noisy, Clean, NG-DNN, STAT-MMSE, NG-Pix2Pix)
Pix2Pix outperforms STAT-MMSE and is competitive to DNN SE.
Table 3: Objective evaluation results.
Speech Enhancement (Pix2Pix)
• Objective evaluation and speaker verification test
Table 4: Speaker verification results.
1. From the PESQ and STOI evaluations, Pix2Pix outperforms Noisy
and MMSE and is competitive to DNN SE.
2. From the speaker verification results, Pix2Pix outperforms the
baseline models when the clean training data is used.
Speech Enhancement
• Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018]
  G: noisy spectrogram → output (enhanced) spectrogram, trained toward the clean spectrogram.
  D: scalar; (noisy, clean) pairs are real, (noisy, output) pairs are fake.
Fig. 3: Spectrogram comparison of FSEGAN with L1-trained method.
Speech Enhancement (FSEGAN)
• Spectrogram analysis
FSEGAN reduces both additive noise and reverberant smearing.
Table 5: WER (%) of SEGAN and FSEGAN. Table 6: WER (%) of FSEGAN with retrain.
Speech Enhancement (FSEGAN)
• ASR results
1. From Table 5, (1) FSEGAN improves recognition results for ASR-Clean.
(2) FSEGAN outperforms SEGAN as front-ends.
2. From Table 6, (1) Hybrid Retraining with FSEGAN outperforms Baseline;
(2) FSEGAN retraining slightly underperforms L1–based retraining.
Speech Enhancement
• Speech enhancement through a mask function
  G: noisy spectrogram → output mask; enhanced spectrogram = point-wise multiplication of the mask and the noisy spectrogram.
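The point-wise multiplication is applied on the magnitude spectrogram and the result is re-synthesized with the noisy phase; a small sketch (the trained mask estimator, here called `predict_mask`, and the input file name are hypothetical):

```python
# Applying a predicted magnitude mask to a noisy spectrogram (sketch).
# `predict_mask` stands in for the trained generator G; "noisy.wav" is a placeholder.
import numpy as np
import librosa

noisy, sr = librosa.load("noisy.wav", sr=16000)
spec = librosa.stft(noisy, n_fft=512, hop_length=256)
mag, phase = np.abs(spec), np.angle(spec)

mask = predict_mask(mag)                 # G: noisy magnitude -> mask (same shape)
enhanced_mag = mask * mag                # point-wise multiplication

# Re-synthesize using the noisy phase (a common simplification)
enhanced = librosa.istft(enhanced_mag * np.exp(1j * phase), hop_length=256)
```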
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Ashutosh Pandey and Deliang Wang, ICASSP 2018]
  G: noisy spectrogram → output mask, trained toward the reference mask.
  D: scalar; (noisy, reference mask) pairs are real, (noisy, output mask) pairs are fake.
We don't know exactly what function D serves in this setting.
Our ICML 2019 paper sheds some light on a potential future direction.
Speech Enhancement (AFT)
• Cycle-GAN-based acoustic feature transformation (AFT) [Mimura et al., ASRU 2017]
  Cycle 1: noisy → enhanced → reconstructed noisy, as close as possible to the input.
  Cycle 2: clean → synthesized noisy → reconstructed clean, as close as possible to the input.
  $D_S$ and $D_T$ output scalars indicating whether an utterance belongs to domain S (clean) or domain T (noisy).
$V_{Full} = V_{GAN}(G_{X \to Y}, D_Y) + V_{GAN}(G_{Y \to X}, D_X) + \lambda\, V_{Cyc}(G_{X \to Y}, G_{Y \to X})$
• ASR results on noise robustness and style adaptation
Table 7: Noise robust ASR. Table 8: Speaker style adaptation.
1. 𝐺 𝑇→𝑆 can transform acoustic features and effectively improve
ASR results for both noisy and accented speech.
2. 𝐺𝑆→𝑇 can be used for model adaptation and effectively improve
ASR results for noisy speech.
S: Clean; 𝑇: Noisy JNAS: Read; CSJ-SPS: Spontaneous (relax);
CSJ-APS: Spontaneous (formal);
Speech Enhancement (AFT)
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Postfilter
• Postfilter for synthesized or transformed speech (the output of a speech synthesizer, voice conversion, or speech enhancement system)
  G: synthesized spectral texture → output, trained toward the natural spectral texture with an objective function.
➢ Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil'en et al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014], and DNN with MSE criterion [Chen et al., Interspeech 2014; Chen et al., TASLP 2015].
➢ GAN is used as a new objective function to estimate the parameters in G.
Postfilter
• GAN postfilter [Kaneko et al., ICASSP 2017]
➢ The traditional MMSE criterion results in statistical averaging.
➢ GAN is used as a new objective function to estimate the parameters in G.
➢ The proposed work intends to further improve the naturalness of synthesized speech or of the parameters from a synthesizer.
  G: synthesized Mel-cepstral coefficients → generated Mel-cepstral coefficients; D: natural or generated?
Fig. 4: Spectrograms of: (a) NAT (nature); (b) SYN (synthesized); (c) VS (variance
scaling); (d) MS (modulation spectrum); (e) MSE; (f) GAN postfilters.
Postfilter (GAN-based Postfilter)
• Spectrogram analysis
GAN postfilter reconstructs spectral texture similar to the natural one.
Fig. 5: Mel-cepstral trajectories (GANv:
GAN was applied in voiced part).
Fig. 6: Averaging difference in
modulation spectrum per Mel-
cepstral coefficient.
Postfilter (GAN-based Postfilter)
• Objective evaluations
GAN postfilter reconstructs spectral texture similar to the natural one.
Table 9: Preference score (%). Bold font indicates the numbers over 30%.
Postfilter (GAN-based Postfilter)
• Subjective evaluations
1. GAN postfilter significantly improves the synthesized speech.
2. GAN postfilter is effective particularly in voiced segments.
3. GANv outperforms GAN and is comparable to NAT.
Speech Synthesis
• Input: linguistic features; Output: speech parameters (sp)
  Generator $G_{SS}$: linguistic features → generated speech parameters $\hat{\boldsymbol{c}}$, compared against the natural speech parameters $\boldsymbol{c}$.
  Objective function: minimum generation error (MGE), MSE.
• Speech synthesis with anti-spoofing verification (ASV) [Saito et al., ICASSP 2017]
  $G_{SS}$: linguistic features → generated speech parameters $\hat{\boldsymbol{c}}$; an anti-spoofing verifier $D_{ASV}$ (on features $\phi(\cdot)$) labels frames as natural or generated.
  Minimum generation error (MGE) training with an adversarial loss:
$L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$
$L_{D,1}(\boldsymbol{c}) = -\frac{1}{T} \sum_{t=1}^{T} \log D(\boldsymbol{c}_t)$ … NAT
$L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T} \sum_{t=1}^{T} \log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ … SYN
$L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$
Fig. 7: Averaged GVs of MCCs.
Speech Synthesis (ASV)
• Objective and subjective evaluations
1. The proposed algorithm generates MCCs similar to the natural ones.
Fig. 8: Scores of speech quality.
2. The proposed algorithm outperforms conventional MGE training.
Speech Synthesis
• Speech synthesis with GAN (SS-GAN) [Saito et al., TASLP 2018]
  The same MGE-plus-adversarial-loss formulation as above, with a GAN discriminator $D$ (in place of $D_{ASV}$) and with the generator producing sp, F0, and duration:
$L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$,  $L_{D,1}(\boldsymbol{c}) = -\frac{1}{T} \sum_{t=1}^{T} \log D(\boldsymbol{c}_t)$ … NAT,  $L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T} \sum_{t=1}^{T} \log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ … SYN
$L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$
Speech Synthesis (SS-GAN)
• Subjective evaluations (Fig. 9: scores of speech quality (sp); Fig. 10: scores of speech quality (sp and F0))
The proposed algorithm works for both spectral parameters and F0.
• Convert (transform) speech from source to target
➢ Conventional VC approaches include Gaussian mixture model (GMM) [Toda
et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP
2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al.,
Interspeech 2016], variational autoencoder (VAE) [Hsu et al., APSIPA
2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP
2014], feed forward NN [Desai et al., TASLP 2010], recurrent NN (RNN)
[Nakashika et al., Interspeech 2014].
Voice Conversion
G Output
Objective function
Target
speaker
Source
speaker
Voice Conversion
• VAW-GAN [Hsu et al., Interspeech 2017]
➢ Conventional MMSE approaches often encounter the "over-smoothing" issue.
➢ GAN is used as a new objective function to estimate G.
➢ The goal is to increase the naturalness, clarity, and similarity of converted speech.
  G: source-speaker speech → converted speech; D: real (target speaker) or fake (converted)?
$V(G, D) = V_{GAN}(G, D) + \lambda\, V_{VAE}(\boldsymbol{x} \mid \boldsymbol{y})$
• Objective and subjective evaluations
Fig. 12: MOS on naturalness.Fig. 11: The spectral envelopes.
Voice Conversion (VAW-GAN)
VAW-GAN outperforms VAE in terms of objective and subjective
evaluations with generating more structured speech.
Voice Conversion
• CycleGAN-VC [Kaneko et al., Eusipco 2018]
• GAN is used as a new objective function to estimate G:
$V_{Full} = V_{GAN}(G_{X \to Y}, D_Y) + V_{GAN}(G_{Y \to X}, D_X) + \lambda\, V_{Cyc}(G_{X \to Y}, G_{Y \to X})$
  Source → $G_{S \to T}$ → synthesized target → $G_{T \to S}$ → reconstructed source, as close as possible; $D_T$: scalar, belongs to domain T or not.
  Target → $G_{T \to S}$ → synthesized source → $G_{S \to T}$ → reconstructed target, as close as possible; $D_S$: scalar, belongs to domain S or not.
• Subjective evaluations
Fig. 13: MOS for naturalness.
Fig. 14: Similarity to the source and to the target speakers. S: Source; T: Target; P: Proposed; B: Baseline.
Voice Conversion (CycleGAN-VC)
1. The proposed method uses non-parallel data.
2. For naturalness, the proposed method outperforms baseline.
3. For similarity, the proposed method is comparable to the baseline.
Target
speaker
Source
speaker
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
  An encoder E (embedding function 𝑔(∙)) followed by a classifier G (ℎ(∙)) outputs the label 𝒚.
  Acoustic mismatch: the model is trained on clean data 𝒙 but tested on noisy data ෥𝒙 (with embedding ෤𝒛 = 𝑔(෥𝒙)), accented speech, or channel-distorted speech.
Speech Recognition
• Adversarial multi-task learning (AMT) [Shinohara, Interspeech 2016]
  Input acoustic feature 𝒙 → encoder E → embedding; G predicts Output 1 (senone 𝒚), D predicts Output 2 (domain 𝒛), with a gradient reversal layer (GRL) between E and D.
Objective functions:
$V_y = -\sum_i \log P(y_i \mid x_i; \theta_E, \theta_G)$,  $V_z = -\sum_i \log P(z_i \mid x_i; \theta_E, \theta_D)$
Model update:
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy)
$\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy)
$\theta_E \leftarrow \theta_E - \epsilon \left( \frac{\partial V_y}{\partial \theta_E} - \alpha \frac{\partial V_z}{\partial \theta_E} \right)$ (max classification accuracy and min domain accuracy; the reversed sign on the domain term is what the GRL implements)
• ASR results in known (k) and unknown (unk)
noisy conditions
Speech Recognition (AMT)
Table 10: WER of DNNs with single-task learning (ST) and AMT.
The AMT-DNN outperforms ST-DNN with yielding lower WERs.
Speech Recognition
• Domain adversarial training for accented ASR (DAT) [Sun et al., ICASSP 2018]
  Same adversarial multi-task structure and update rules as AMT above: acoustic feature 𝒙 → E → embedding; G predicts Output 1 (senone 𝒚), D predicts Output 2 (domain 𝒛) through a GRL; E maximizes senone classification accuracy while minimizing domain accuracy.
• ASR results on accented speech
Speech Recognition (DAT)
1. With labeled transcriptions, ASR performance notably improves.
Table 11: WER of the baseline and adapted model.
2. DAT is effective in learning features invariant to domain differences
with and without labeled transcriptions.
STD: standard speech
Speech Recognition
• Unsupervised Adaptation with Domain Separation Networks (DSN) [Meng et al., ASRU 2017]
  A shared encoder E (𝑔(∙)) embeds clean data 𝒙 (𝒛 = 𝑔(𝒙)) and noisy data ෥𝒙 (෤𝒛 = 𝑔(෥𝒙)); G (ℎ(∙)) predicts Output 1 (senone 𝒚) and D predicts Output 2 (domain 𝒅); private component extractors (PEs, PEt) and reconstructors (R) separate domain-private information from the shared embedding.
• Results on ASR in noise (CHiME3):
Speech Recognition (DSN)
1. DSN outperforms GRL consistently over different noise types.
2. The results confirmed the additional gains provided by private
component extractors.
Table 12: WER (in %) of Robust ASR on the CHiME3 task.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speaker Recognition
• Domain adversarial neural network (DANN) [Wang et al., ICASSP 2018]
  Enroll and test i-vectors are pre-processed, passed through the DANN, and scored.
  The DANN uses the same adversarial structure as above: input acoustic feature 𝒙 → E → embedding 𝒛; G predicts Output 1 (speaker ID 𝒚) with loss 𝑉𝑦, D predicts Output 2 (domain) with loss 𝑉𝑧 through a GRL.
• Recognition results of domain mismatched conditions
Table 13: Performance of DAT and the state-of-the-art methods.
Speaker Recognition (DANN)
The DAT approach outperforms other methods with
achieving lowest EER and DCF scores.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Emotion Recognition
• Adversarial AE for emotion recognition (AAE-ER) [Sahu et al., Interspeech 2017]
  Auto-encoder: input 𝒙 → encoder E (𝑔(∙)) → embedding 𝒛 = 𝑔(𝒙) → decoder G (ℎ(∙)) → reconstruction; a discriminator D matches the distribution of the code vectors to a prior 𝒒 and enables synthesizing new embeddings.
  AE with GAN objective: $H(h(\boldsymbol{z}), \boldsymbol{x}) + \lambda\, V_{GAN}(\boldsymbol{q}, g(\boldsymbol{x}))$
Emotion Recognition (AAE-ER)
• Recognition results of domain mismatched conditions (Table 14: classification results of different systems; Table 15: classification results on real and synthesized features)
1. AAE alone could not yield performance improvements.
2. Using synthetic data from AAE (in addition to the original training data) can yield higher UAR.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Lip-reading
• Domain adversarial training for lip-reading (DAT-LR) [Wand et al., Interspeech 2017]
  Same adversarial multi-task structure and update rules as above: input 𝒙 → E → embedding 𝒛; G predicts Output 1 (words 𝒚, ~80% WAC), D predicts Output 2 (speaker) through a GRL; E maximizes word classification accuracy while minimizing speaker classification accuracy.
• Recognition results of speaker mismatched conditions
Lip-reading (DAT-LR)
Table 16: Performance of DAT and the baseline.
The DAT approach notably enhances the recognition
accuracies in different conditions.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech Signal Generation (Regression Task)
  Paired input and output speech; a generator G produces the output, trained with an objective function.
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
  An encoder E (embedding function 𝑔(∙)) followed by a classifier G (ℎ(∙)) outputs the label 𝒚.
  Acoustic mismatch: trained on clean data 𝒙 but tested on noisy data ෥𝒙 (෤𝒛 = 𝑔(෥𝒙)), accented speech, or channel-distorted speech.
References
Speech enhancement (conventional methods)
• Y.-X. Wang and D.-L. Wang, Cocktail party processing via structured prediction, NIPS 2012.
• Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, An experimental study on speech enhancement based on deep neural
networks, IEEE SPL, 2014.
• Y. Xu, J. Du, L.-R. Dai, and Chin-Hui Lee, A regression approach to speech enhancement based on deep neural
networks, IEEE/ACM TASLP, 2015.
• X. Lu, Y. Tsao, S. Matsuda, and C. Hori, Speech enhancement based on deep denoising autoencoder, Interspeech 2013.
• Z. Chen, S. Watanabe, H. Erdogan, J. R. Hershey, Integration of speech enhancement and recognition using long-
short term memory recurrent neural network, Interspeech 2015.
• F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, Speech enhancement
with LSTM recurrent neural networks and Its application to noise-robust ASR, LVA/ICA, 2015.
• S.-W. Fu, Y. Tsao, and X.-G. Lu, SNR-aware convolutional neural network modeling for speech enhancement,
Interspeech, 2016.
• S.-W. Fu, Y. Tsao, X.-G. Lu, and Hisashi Kawai, End-to-end waveform utterance enhancement for direct evaluation
metrics optimization by fully convolutional neural networks, IEEE/ACM TASLP, 2018.
Speech enhancement (GAN-based methods)
• P. Santiago, B. Antonio, and S. Joan, SEGAN: Speech enhancement generative adversarial network, Interspeech,
2017.
• D. Michelsanti, and Z.-H. Tan, Conditional generative adversarial networks for speech enhancement and noise-
robust speaker verification, Interspeech, 2017.
• C. Donahue, B. Li, and P. Rohit, Exploring speech enhancement with generative adversarial networks for robust
speech recognition, ICASSP, 2018.
• T. Higuchi, K. Kinoshita, M. Delcroix, and T. Nakatani, Adversarial training for data-driven speech enhancement without parallel corpus, ASRU, 2017.
• S. Pascual, M. Park, J. Serrà, A. Bonafonte, K.-H. Ahn, Language and noise transfer in speech enhancement
generative adversarial network, ICASSP 2018.
References
Speech enhancement (GAN-based methods)
• A. Pandey and D. Wang, On adversarial training and loss functions for speech enhancement, ICASSP 2018.
• M. H. Soni, Neil Shah, and H. A. Patil, Time-frequency masking-based speech enhancement using generative
adversarial network, ICASSP 2018.
• Z. Meng, J.-Y. Li, Y.-G. Gong, B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech, 2018.
• L.-W. Chen, M.Yu, Y.-M. Qian, D. Su, D. Yu, Permutation invariant training of generative adversarial network for
monaural speech separation, Interspeech 2018.
• D. Baby and S. Verhulst, Sergan: Speech enhancement using relativistic generative adversarial networks with
gradient penalty, ICASSP 2019.
Postfilter (conventional methods)
• T. Toda, and K. Tokuda, A speech parameter generation algorithm considering global variance for HMM-based
speech synthesis, IEICE Trans. Inf. Syst., 2007.
• H. Sil’en, E. Helander, J. Nurminen, and M. Gabbouj, Ways to implement global variance in statistical speech
synthesis, Interspeech, 2012.
• S. Takamichi, T. Toda, N. Graham, S. Sakriani, and S. Nakamura, A postfilter to modify the modulation spectrum
in HMM-based speech synthesis, ICASSP, 2014.
• L.-H. Chen, T. Raitio, C. V. Botinhao, J. Yamagishi, and Z.-H. Ling, DNN-based stochastic postfilter for HMM-
based speech synthesis, Interspeech, 2014.
• L.-H. Chen, T. Raitio, C. V. Botinhao, Z.-H. Ling, and J. Yamagishi, A deep generative architecture for postfiltering
in statistical parametric speech synthesis, IEEE/ACM TASLP, 2015.
Postfilter (GAN-based methods)
• K. Takuhiro, K. Hirokazu, H. Nobukatsu, Y. Ijima, K. Hiramatsu, and K. Kashino, Generative adversarial network-
based postfilter for statistical parametric speech synthesis, ICASSP, 2017.
• K. Takuhiro, T. Shinji, K. Hirokazu, and J. Yamagishi, Generative adversarial network-based postfilter for STFT
spectrograms, Interspeech, 2017.
• Y. Saito, S. Takamichi, and H. Saruwatari, Training algorithm to deceive anti-spoofing verification for DNN-based
speech synthesis, ICASSP, 2017.
• Y. Saito, S. Takamichi, H. Saruwatari, Statistical parametric speech synthesis incorporating generative
adversarial networks, IEEE/ACM TASLP, 2018.
• B. Bollepalli, L. Juvela, and A. Paavo, Generative adversarial network-based glottal waveform model for
statistical parametric speech synthesis, Interspeech, 2017.
• S. Yang, L. Xie, X. Chen, X.-Y. Lou, X. Zhu, D.-Y. Huang, and H.-Z. Li, Statistical parametric speech synthesis using
generative adversarial networks under a multi-task learning framework, ASRU, 2017.
References
VC (conventional methods)
• T. Toda, A. W. Black, and K. Tokuda, Voice conversion based on maximum likelihood estimation of spectral
parameter trajectory, IEEE/ACM TASLP, 2007.
• L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, Voice conversion using deep neural networks with layer-wise
generative training, IEEE/ACM TASLP, 2014.
• S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, Spectral mapping using artificial neural networks for
voice conversion, IEEE/ACM TASLP, 2010.
• T. Nakashika, T. Takiguchi, Y. Ariki, High-order sequence modeling using speaker-dependent recurrent temporal
restricted boltzmann machines for voice conversion, Interspeech, 2014.
• K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric
learned using generative adversarial networks, Interspeech, 2017.
• Z.-Z. Wu, T. Virtanen, E.-S. Chng, and H.-Z. Li, Exemplar-based sparse representation with residual compensation
for voice conversion, IEEE/ACM TASLP, 2014.
• S.-W. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, Joint dictionary learning-based non-negative matrix
factorization for voice conversion to improve speech intelligibility after oral surgery, IEEE TBME, 2017.
• Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, Locally linear embedding for exemplar-based spectral
conversion, Interspeech, 2016.
• C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, Y., and H.-M. Wang, Voice conversion from non-parallel corpora using
variational auto-encoder. APSIPA 2016.
VC (GAN-based methods)
• C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang Voice conversion from unaligned corpora using
variational autoencoding wasserstein generative adversarial networks, Interspeech 2017.
• K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric
learned using generative adversarial networks, Interspeech, 2017.
References
VC (GAN-based methods)
• K. Takuhiro, and K. Hirokazu. Parallel-data-free voice conversion using cycle-consistent adversarial networks,
arXiv, 2017.
• N. Shah, N. J. Shah, and H. A. Patil, Effectiveness of generative adversarial network for non-audible murmur-to-
whisper speech conversion, Interspeech, 2018.
• J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by adversarially
learning disentangled audio representations, Interspeech, 2018.
• G. Degottex, and M. Gales, A spectrally weighted mixture of least square error and wasserstein discriminator
loss for generative SPSS, SLT, 2018.
• B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, Adaptive wavenet vocoder for residual compensation in
GAN-based voice conversion, SLT, 2018.
• C.-C. Yeh, P.-C. Hsu, J.-C. Chou, H.-Y. Lee, and L.-S. Lee, Rhythm-flexible voice conversion without parallel data
using cycle-GAN over phoneme posteriorgram sequences, SLT, 2018.
• H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, STARGAN-VC: Non-parallel many-to-many voice conversion with
star generative adversarial networks, SLT, 2018.
• K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, Synthetic-to-natural speech waveform conversion using cycle-
consistent adversarial networks, SLT, 2018.
• O. Ocal, O. H. Elibol, G. Keskin, C. Stephenson, A. Thomas, and K. Ramchandran, Adversarially trained
autoencoders for parallel-data-free voice conversion, ICASSP, 2019.
• F. Fang, X. Wang, J. Yamagishi, and I. Echizen, Audiovisual speaker conversion: Jointly and simultaneously
transforming facial expression and acoustic characteristics, ICASSP, 2019.
• S. Seshadri, L. Juvela, J. Yamagishi, Okko Räsänen, and P. Alku, Cycle-consistent adversarial networks for non-
parallel vocal effort based speaking style conversion, ICASSP, 2019.
• T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, CYCLEGAN-VC2: Improved cyclegan-based non-parallel voice
conversion, ICASSP, 2019.
• L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, Waveform generation for text-to-speech synthesis using pitch-
synchronous multi-scale generative adversarial networks, ICASSP, 2019.
References
Speaker recognition
• Q. Wang, W. Rao, S.-I. Sun, L. Xie, E.-S. Chng, and H.-Z. Li, Unsupervised domain adaptation via domain
adversarial training for speaker recognition, ICASSP, 2018.
• H. Yu, Z.-H. Tan, Z.-Y. Ma, and J. Guo, Adversarial network bottleneck features for noise robust speaker
verification, arXiv, 2017.
• G. Bhattacharya, J. Alam, & P. Kenny, Adapting end-to-end neural speaker verification to new languages and
recording conditions with adversarial training, ICASSP, 2019.
• Z. Peng, S. Feng, & T. Lee, Adversarial multi-task deep features and unsupervised back-end adaptation for
language recognition, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, & Y. Gong, Adversarial speaker verification, ICASSP, 2019.
• X. Fang, L. Zou, J. Li, L. Sun, & Z.-H. Ling, Channel adversarial training for cross-channel text-independent
speaker recognition, ICASSP, 2019.
• W. Xia, J. Huang, & J. H. Hansen, Cross-lingual text-independent speaker verification using unsupervised
adversarial discriminative domain adaptation, ICASSP, 2019.
• P. S. Nidadavolu, J. Villalba, & N. Dehak, Cycle-GANs for domain adaptation of acoustic features for speaker
recognition, ICASSP, 2019.
• G. Bhattacharya, J. Monteiro, J. Alam, & P. Kenny, Generative adversarial speaker embedding networks for
domain robust end-to-end speaker verification, ICASSP, 2019.
• J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, & O. Plchot, Speaker verification using end-to-end
adversarial language adaptation, ICASSP, 2019.
• Zhou, J., Jiang, T., Li, L., Hong, Q., Wang, Z., & Xia, B., Training multi-task adversarial network for extracting
noise-robust speaker embedding, ICASSP, 2019.
• J. Zhang, N. Inoue, & K. Shinoda, I-vector transformation using conditional generative adversarial networks for
short utterance speaker verification, arXiv, 2018.
• W. Ding, & L. He, Mtgan: Speaker verification through multitasking triplet generative adversarial networks, arXiv,
2018.
• X. Miao, I. McLoughlin, S. Yao, & Y. Yan, Improved conditional generative adversarial net classification for
spoken language recognition, SLT, 2018.
References
Automatic Speech Recognition
• Yusuke Shinohara, Adversarial multi-task learning of deep neural networks for robust speech recognition,
Interspeech, 2016.
• D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, Invariant Representations for
Noisy Speech Recognition, arXiv, 2016.
• Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, Cross-domain speech recognition using nonparallel
corpora with cycle-consistent adversarial networks, ASRU, 2017.
• A. Sriram, H.-W Jun, Y. Gaur, and S. Satheesh, Robust speech recognition using generative adversarial networks,
arXiv, 2017.
• Z. Meng, Z. Chen, V. Mazalov, J. Li, J., and Y. Gong, Unsupervised adaptation with domain separation networks
for robust speech recognition, ASRU, 2017.
• Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. Juang, Speaker-invariant training via adversarial
learning, ICASSP, 2018.
• Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial teacher-student learning for unsupervised domain
adaptation, ICASSP, 2018.
• Y. Zhang, P. Zhang, and Y. Yan, Improving language modeling with an adversarial critic for automatic speech
recognition, Interspeech, 2018.
• S. Sun, C. Yeh, M. Ostendorf, M. Hwang, and L. Xie, Training augmentation with adversarial examples for robust
speech recognition, Interspeech, 2018.
• Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech
2018.
• K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, and L. Xie, Investigating generative adversarial networks based
speech dereverberation for robust speech recognition, Interspeech 2018.
• Z. Meng, J. Li, Y. Gong, B.-H. Juang, Cycle-consistent speech enhancement, Interspeech 2018.
• J. Drexler and J. Glass, Combining end-to-end and adversarial training for low-resource speech recognition, SLT,
2018.
• A. H. Liu, H. Lee and L. Lee, Adversarial training of end-to-end speech recognition using a criticizing language
model, ICASSP, 2019.
References
Automatic Speech Recognition
• J. Yi, J. Tao and Y. Bai, Language-invariant bottleneck features from adversarial end-to-end acoustic models for low resource speech recognition, ICASSP, 2019.
• D. Haws and X. Cui, Cyclegan bandwidth extension acoustic modeling for automatic speech recognition, ICASSP,
2019.
• Z. Meng, J. Li, and Y. Gong, Attentive adversarial learning for domain-invariant training, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, and Y. Gong, Adversarial speaker verification, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, and Y. Gong., Adversarial speaker adaptation, ICASSP, 2019.
Emotion recognition
• J. Chang, and S. Scherer, Learning representations of emotional speech with deep convolutional generative
adversarial networks, ICASSP, 2017.
• S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, Adversarial auto-encoders for speech
based emotion recognition. Interspeech, 2017.
• S. Sahu, R. Gupta, and C. E.-Wilson, On enhancing speech emotion recognition using generative adversarial
networks, Interspeech 2018.
• C.-M. Chang, and C.-C. Lee, Adversarially-enriched acoustic code vector learned from out-of-context affective
corpus for robust emotion recognition, ICASSP 2019.
• J. Liang, S. Chen, J. Zhao, Q. Jin, H. Liu, and L. Lu, Cross-culture multimodal emotion recognition with adversarial
learning, ICASSP 2019.
Lipreading
• M. Wand, and J. Schmidhuber, Improving speaker-independent lipreading with domain-adversarial training,
arXiv, 2017.
References
Outline of Part III
Our Recent Works
• Noise adaptive speech enhancement [Interspeech 2019]
• MetricGAN for speech enhancement [ICML 2019]
• Multi-Target voice conversion [Interspeech 2018]
• Impaired speech conversion [Interspeech 2019]
• Pathological voice detection [NeurIPS workshop 2018]
[Mon-P-2-A]
[Wed-P-6-E]
Speech Enhancement
• Noise Adaptive Speech Enhancement (NA-SE) [Liao et al., Interspeech 2019] [Wed-P-6-E]
  The training noise types (e.g. N4, N5, N7, N9, N10, N12) differ from the unseen test noise type (e.g. N11).
  Baseline: encoder E and generator G minimize the reconstruction error $V_y$:
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$,  $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E}$
Speech Enhancement (NA-SE)
• Domain adversarial training for NA-SE
  A noise-type discriminator D (through a GRL) is added on the embedding 𝒛:
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (min reconstruction error)
$\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy)
$\theta_E \leftarrow \theta_E - \epsilon \left( \frac{\partial V_y}{\partial \theta_E} - \alpha \frac{\partial V_z}{\partial \theta_E} \right)$ (min reconstruction error and min domain accuracy; the reversed sign is what the GRL implements)
Speech Enhancement (NA-SE)
• Objective evaluations
The DAT-based unsupervised adaptation can notably overcome
the mismatch issue of training and testing noise types.
Fig. 15: PESQ at different SNR levels.
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Pandey et al., ICASSP 2018]
  G: noisy spectrogram → output mask, trained toward the reference mask.
  D: scalar; (noisy, reference mask) pairs are real, (noisy, output mask) pairs are fake.
• MetricGAN for Speech Enhancement [Fu et al., ICML 2019]
[Diagram: G maps the noisy spectrogram to an output mask; point-wise multiplication with the noisy spectrogram gives the enhanced spectrogram; D assigns a metric score (0~1), e.g. 0.4 to the enhanced spectrogram and 1.0 to the clean spectrogram.]
Speech Enhancement (MetricGAN)
With MetricGAN, we are free to specify the target metric score
(PESQ or STOI) for the generated speech.
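As a rough illustration of this idea (not the released implementation): the discriminator is trained to regress the true metric score of an enhanced utterance, and the generator is trained to push the predicted score toward a chosen target. `G`, `D`, and `compute_metric` below are assumed placeholders (a mask-estimation network, a score-prediction network, and a stand-in for a normalized PESQ/STOI routine).

```python
import torch
import torch.nn.functional as F

def compute_metric(enhanced, clean):
    # Dummy stand-in for a normalized PESQ/STOI routine returning a score in [0, 1].
    return torch.clamp(1.0 - F.l1_loss(enhanced, clean), 0.0, 1.0).unsqueeze(0)

def metricgan_step(noisy, clean, G, D, opt_G, opt_D, target_score=1.0):
    # --- Discriminator: learn to predict the true (normalized) metric score ---
    with torch.no_grad():
        enhanced = G(noisy) * noisy                     # point-wise mask multiplication
    true_score = compute_metric(enhanced, clean)
    d_loss = F.mse_loss(D(enhanced, clean), true_score) \
           + F.mse_loss(D(clean, clean), torch.ones_like(true_score))  # clean pair -> 1.0
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Generator: make D's predicted score reach the specified target ---
    enhanced = G(noisy) * noisy
    g_loss = F.mse_loss(D(enhanced, clean),
                        torch.full_like(true_score, target_score))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```

Because the target score is just a number passed to the generator loss, the same sketch covers aiming at PESQ, STOI, or any other black-box metric the discriminator learns to mimic.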
Voice Conversion
• Multi-target VC [Chou et al., Interspeech 2018]
[Diagram, Stage 1: an encoder Enc maps speech x to enc(x); a speaker classifier C is applied to enc(x); a decoder Dec produces dec(enc(x), y) for the source speaker code y and dec(enc(x), y′) for a target speaker code y′.]
[Diagram, Stage 2: a generator G refines the converted output, and a discriminator-plus-classifier D+C scores it as fake/real (F/R) and predicts the speaker identity (ID) against real data.]
• Subjective evaluations
Voice Conversion (Multi-target VC)
Fig. 16: Preference test results
1. The proposed method uses non-parallel data.
2. The multi-target VC approach outperforms the one-stage-only variant.
3. The multi-target VC approach is comparable to Cycle-GAN-VC in
terms of naturalness and similarity.
• Controller-generator-discriminator VC on Impaired
Speech [Chen et al., Interspeech 2019]
Voice Conversion
Previous applications: hearing aids; murmur to normal speech; bone-
conductive microphone to air-conductive microphone.
Proposed: improving the speech intelligibility of surgical patients.
Target: oral cancer (among the top five cancers for males in Taiwan).
[Audio demo: before/after conversion samples]
[Mon-P-2-A]
• Controller-generator-discriminator VC (CGD VC) on
impaired speech [Chen et al., Interspeech 2019]
Voice Conversion
[Diagram: a controller, a generator G, and a discriminator D form the CGD pipeline.]
Voice Conversion (CGD VC)
• Spectrogram analysis
Fig. 17: Spectrogram comparison of CGD with CycleGAN.
• Subjective evaluations
Voice Conversion (CGD VC)
The proposed method outperforms conditional GAN and CycleGAN
in terms of content similarity, speaker similarity, and articulation.
Fig. 18: MOS for content similarity, speaker similarity, and articulation.
Pathological Voice Detection
• Detection of Pathological Voice Using Cepstrum Vectors:
A Deep Learning Approach [Fang et al., Journal of Voice 2018]
Database GMM SVM DNN
MEEI 98.28 98.26 99.14
FEMH (M) 90.24 93.04 94.26
FEMH (F) 90.20 87.40 90.52
Table 17: Detection performance based on voice.
Pathological Voice Detection
• Robustness Against Channel [Hsu et al., NeurIPS Workshop 2018]
[Diagram: an encoder E maps the input x to a latent code z; G predicts the detection output y (loss V_y); a domain discriminator D predicts the recording channel (domain) from z (loss V_z), in the same DAT setup as above.]
DNN (S) DNN (T) DNN (FT) Unsup. DAT Sup. DAT
PR-AUC 0.8848 0.8509 0.9021 0.9455 0.9522
Unsupervised DAT notably improves robustness against channel effects
and yields results comparable to supervised DAT.
Table 18: Detection results of sup. and unsup. DAT under channel mismatches.
• C.-F. Liao, Y. Tsao, H.-Y. Lee and H.-M. Wang, Noise adaptive speech enhancement using domain adversarial
training, Interspeech 2019.
• J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by
adversarially learning disentangled audio representations, Interspeech 2018.
• L.-W. Chen, H.-Y. Lee, and Y. Tsao, Generative adversarial networks for unpaired voice transformation on
impaired speech, Interspeech 2019.
• S.-W. Fu, C.-F. Liao, Y. Tsao, S.-D. Lin, MetricGAN: Generative adversarial networks based black-box metric
scores optimization for speech enhancement, ICML, 2019.
• C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, Y. Tsao, Detection of pathological voice using
cepstrum vectors: a deep learning approach, Journal of Voice, 2018.
• S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, Demographic and symptomatic features of voice
disorders and their potential application in classification using machine learning algorithms, Folia Phoniatrica et
Logopaedica, 2018.
• S.-H. Fang, C.-T. Wang, J.-Y. Chen, Y. Tsao and F.-C. Lin, Combining acoustic signals and medical records to
improve pathological voice classification, APSIPA, 2019.
• Y.-T. Hsu, Z. Zhu, C.-T. Wang, S.-H. Fang, F. Rudzicz, and Y. Tsao, Robustness against the channel effect in
pathological voice detection, NeurIPS 2018 Machine Learning for Health (ML4H) Workshop, 2018.
References
Thank You Very Much
Tsao, Yu Ph.D., Academia Sinica
yu.tsao@citi.sinica.edu.tw
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Part III: Speech Signal Processing
Part IV: Natural Language Processing
NLP tasks usually involve Sequence Generation
How to use GAN to improve sequence generation?
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
Why do we need GAN?
• Chat-bot as example
[Diagram: a seq2seq chatbot; the encoder takes the input sentence c and the decoder produces the output sentence x.]
Training data:
A: How are you ?
B: I’m good.
…………
Training criterion: maximize likelihood of the reference response
(input: How are you ? → reference: I’m good.)
Output: Not bad I’m John.   (humans would judge other responses to be better)
Reinforcement Learning [Li, et al., EMNLP, 2016]
[Diagram: the chatbot (encoder-decoder) takes an input sentence c and produces a response sentence x; a human gives a reward R(c, x).]
Learn to maximize the expected reward, e.g. by policy gradient.
Example: for “How are you?”, the human rewards “Not bad” with +1 and “I’m John” with −1.
Policy Gradient
Using the current policy θ^t, sample pairs (c^1, x^1), (c^2, x^2), …, (c^N, x^N) and collect rewards R(c^1, x^1), R(c^2, x^2), …, R(c^N, x^N).
Gradient estimate: ∇R̄(θ^t) ≈ (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇ log P_{θ^t}(x^i | c^i)
Update: θ^{t+1} ← θ^t + η ∇R̄(θ^t)
If R(c^i, x^i) is positive, update θ to increase P_θ(x^i | c^i);
if R(c^i, x^i) is negative, update θ to decrease P_θ(x^i | c^i).
Policy Gradient
Maximum Likelihood vs. Reinforcement Learning (Policy Gradient):

Maximum Likelihood
Training data: {(c^1, x̂^1), …, (c^N, x̂^N)}   (x̂^i: reference responses)
Objective function: (1/N) Σ_{i=1}^{N} log P_θ(x̂^i | c^i)
Gradient: (1/N) Σ_{i=1}^{N} ∇ log P_θ(x̂^i | c^i)
(equivalent to policy gradient with every R(c^i, x̂^i) = 1)

Reinforcement Learning - Policy Gradient
Training data: {(c^1, x^1), …, (c^N, x^N)}   (obtained from interaction)
Objective function: (1/N) Σ_{i=1}^{N} R(c^i, x^i) log P_θ(x^i | c^i)
Gradient: (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇ log P_θ(x^i | c^i)
(each term weighted by R(c^i, x^i))
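A minimal REINFORCE-style sketch of the policy-gradient update above is given below. It assumes a hypothetical `chatbot` object with a `sample(c)` method and a differentiable `log_prob(x, c)`, plus a `reward(c, x)` function (a human score or, later, the discriminator); none of these names come from the tutorial's code.

```python
import torch

def policy_gradient_step(chatbot, optimizer, conditions, reward):
    """One update of theta^{t+1} <- theta^t + eta * grad(R_bar)."""
    losses = []
    for c in conditions:
        x = chatbot.sample(c)                          # interact with the "environment"
        r = reward(c, x)                               # scalar R(c, x)
        losses.append(-r * chatbot.log_prob(x, c))     # minimize -R(c,x) * log P_theta(x|c)
    loss = torch.stack(losses).mean()                  # (1/N) * sum_i R(c^i, x^i) log P
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Setting r = 1 for every sample and drawing x from the reference responses instead of the policy recovers the maximum-likelihood update in the left column above.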
Conditional GAN
[Diagram: the chatbot (encoder-decoder) maps the input sentence c to a response sentence x, e.g. “I am busy.”; a discriminator takes (c, x) and provides the reward R(c, x).]
Replace human evaluation with machine evaluation [Li, et al., EMNLP, 2017]
However, there is an issue when you train your generator.
[Diagram: starting from <BOS>, the generator outputs a token (A or B) at each step by sampling from its output distribution (with some inputs obtained by attention); the generated sequence is fed to the discriminator, which outputs a scalar.]
Can we use gradient ascent to update the generator's parameters? NO!
The sampling of discrete tokens is a non-differentiable part of the computation.
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
Gumbel-softmax
Source of image: https://blog.evjang.com/2016/11/tutorial-categorical-variational.html
Using the reparameterization trick, as is done for training VAEs.
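A minimal sketch of how the relaxation is used in practice, assuming PyTorch's built-in `gumbel_softmax`; the vocabulary size, embedding table, and generator logits below are dummy values, not taken from any of the cited papers.

```python
import torch
import torch.nn.functional as F

vocab_size, emb_dim = 10_000, 256
logits = torch.randn(4, vocab_size)            # generator's word scores for a batch of 4

# Differentiable "sampling": soft one-hot vectors (tau controls the sharpness)
soft = F.gumbel_softmax(logits, tau=0.5, hard=False)
# Straight-through variant: one-hot in the forward pass, soft gradient in the backward pass
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)

# Feed the relaxed samples to the discriminator via an embedding lookup,
# so gradients can flow from D back through the "sampling" step.
word_embeddings = torch.randn(vocab_size, emb_dim, requires_grad=True)
disc_input = soft @ word_embeddings            # expected embedding of the sampled word
```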
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
[Diagram: the generator again decodes from <BOS>, but the per-step output distributions (instead of sampled tokens) are fed to the discriminator, which outputs a scalar.]
Use the distribution as the input of the discriminator to avoid the sampling process.
We can do backpropagation now.
What is the problem?
• Real sentence (one-hot word vectors):
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
• Generated (softmax word distributions):
0.9 0.1 0   0   0
0.1 0.9 0   0   0
0.1 0.1 0.7 0.1 0
0   0   0.1 0.8 0.1
0   0   0   0.1 0.9
The generated distributions can never be exactly one-hot, so the
discriminator can immediately find the difference.
A discriminator with a constraint (e.g. WGAN) can be helpful.
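One common way to impose such a constraint is a WGAN-style gradient penalty computed on interpolations between real one-hot sequences and generated word distributions. The sketch below is generic and not tied to any of the cited papers; `D` is an assumed discriminator mapping a (batch, seq_len, vocab) tensor to one score per example.

```python
import torch

def gradient_penalty(D, real_onehot, fake_probs, lam=10.0):
    """WGAN-GP term that keeps the discriminator's gradient norm close to 1."""
    batch = real_onehot.size(0)
    alpha = torch.rand(batch, 1, 1, device=real_onehot.device)
    interp = (alpha * real_onehot + (1 - alpha) * fake_probs).requires_grad_(True)
    scores = D(interp)                                        # shape (batch,)
    grads = torch.autograd.grad(outputs=scores.sum(),
                                inputs=interp,
                                create_graph=True)[0]         # same shape as interp
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()

# Discriminator loss sketch: -(D(real).mean() - D(fake).mean()) + gradient_penalty(...)
```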
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
[Diagram: the generator decodes tokens (A or B) step by step from <BOS>; the generated sequence is scored by the discriminator, which outputs a scalar.]
Generator = agent in RL; the generated tokens are the actions taken; the discriminator plays the role of the environment, and its score is the reward.
Trained by an RL algorithm (e.g. policy gradient).
The reward function may change during training → different from typical RL.
Tips for Sequence Generation GAN
RL is difficult to train; GAN is difficult to train;
a sequence generation GAN (RL+GAN) combines both.
Tips for Sequence Generation GAN
• Usually the generator is fine-tuned from a model learned
by maximum likelihood.
• However, with enough hyperparameter tuning and tips,
ScratchGAN can train from scratch.
[Cyprien de Masson d'Autume, et al., arXiv 2019]
Tips for Sequence Generation GAN
• Typical: the discriminator only scores the complete response
("You is good"), so the generator does not know which part is wrong …
• Reward for every generation step: the discriminator scores every partial
sequence ("You" → 0.9, "You is" → 0.1, "You is good" → 0.1).
Tips for Sequence Generation GAN
• Reward for Every Generation Step
[Diagram: the discriminator scores each partial sequence: "You" → 0.9, "You is" → 0.1, "You is good" → 0.1.]
Method 1. Monte Carlo (MC) search [Yu, et al., AAAI, 2017]
Method 2. Discriminator for partially decoded sequences [Li, et al., EMNLP, 2017]
Method 3. Step-wise evaluation [Tuan, Lee, TASLP, 2019][Xu, et al., EMNLP, 2018][William Fedus, et al., ICLR, 2018]
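For example, Monte Carlo search estimates a step-wise reward by completing each partial sequence many times and averaging the discriminator's scores over the rollouts. A minimal sketch, with `generator.rollout` and `D` as hypothetical helpers (a sampling-based completion routine and a sequence-level discriminator):

```python
def step_reward(prefix, generator, D, n_rollouts=16):
    """Average discriminator score over n_rollouts completions of `prefix`."""
    scores = [D(generator.rollout(prefix)) for _ in range(n_rollouts)]
    return sum(scores) / n_rollouts

# The reward for generating the t-th token is step_reward(x[:t], ...),
# which is then plugged into the policy-gradient update shown earlier.
```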
Empirical Performance
• MLE frequently generates “I’m sorry”, “I don’t
know”, etc. (corresponding to fuzzy images?)
• GAN generates longer and more complex responses.
• Find more comparisons in the survey papers.
• [Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018]
• However, no strong evidence shows that GANs are
better than MLE.
• [Stanislau Semeniuta, et al., arXiv, 2018] [Guy Tevet, et al., arXiv, 2018]
[Massimo Caccia, et al., arXiv, 2018]
More Applications
• Supervised machine translation [Wu, et al., arXiv
2017][Yang, et al., arXiv 2017]
• Supervised abstractive summarization [Liu, et al., AAAI
2018]
• Image/video caption generation [Rakshith Shetty, et al., ICCV
2017][Liang, et al., arXiv 2017]
• Data augmentation for code-switching ASR [Mon-P-1-D] [Chang, et al., INTERSPEECH 2019]
If you are trying to generate some sequences,
you can consider GAN.
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
[Overview diagram of unsupervised conditional generation: male ↔ female faces (cf. Part I), positive ↔ negative sentences (text style transfer), document → summary (unsupervised abstractive summarization), Language 1 ↔ Language 2 (unsupervised translation), audio ↔ text (unsupervised ASR, cf. Part III).]
Cycle-GAN
[Diagram: G_X→Y maps domain X to domain Y and G_Y→X maps back, keeping the reconstruction as close as possible to the input (and likewise for the Y→X→Y cycle); D_Y and D_X output scalars indicating whether an input belongs to domain Y or domain X.]
Cycle-GAN
[Diagram: the same cycle applied to sentences, where D_X asks "negative sentence?" and D_Y asks "positive sentence?".]
Example cycles:
"It is bad." (negative) → "It is good." (positive) → "It is bad." (negative)
"I love you." (positive) → "I hate you." (negative) → "I love you." (positive)
Non-differentiable Issue?
You already know how to deal with it.
✘ Negative sentence to positive sentence:
it's a crappy day -> it's a great day
i wish you could be here -> you could be here
it's not a good idea -> it's good idea
i miss you -> i love you
i don't love you -> i love you
i can't do that -> i can do that
i feel so sad -> i happy
it's a bad day -> it's a good day
it's a dummy day -> it's a great day
sorry for doing such a horrible thing -> thanks for doing a
great thing
my doggy is sick -> my doggy is my doggy
my little doggy is sick -> my little doggy is my little doggy
Cycle GAN
Thanks to Yau-Shian Wang (王耀賢) for providing the experimental results.
[Lee, et al., ICASSP, 2018]
Shared Latent Space
[Diagram: encoders EN_X and EN_Y map positive and negative sentences into a shared latent space; decoders DE_X and DE_Y generate sentences for each domain; D_X and D_Y are the discriminators of the X and Y domains.]
Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017]
Domain discriminator: it predicts whether a latent code comes from EN_X or EN_Y, and EN_X and EN_Y are trained to fool it.
[Zhao, et al., arXiv, 2017] [Fu, et al., AAAI, 2018]
[Overview diagram repeated as a section divider; next topic: Unsupervised Abstractive Summarization.]
Abstractive Summarization
• Machines can now do abstractive summarization by
seq2seq (writing summaries in their own words).
[Diagram: documents paired with reference summaries (summary 1, 2, 3, …) serve as training data for the seq2seq model.]
Supervised: we need lots of labelled training data.
Unsupervised Abstractive Summarization [Wang, et al., EMNLP, 2018]
[Diagram: human-written summaries (summary 1, 2, 3, …) form domain Y and documents form domain X; a seq2seq model maps between the two domains without paired data.]
Unsupervised Abstractive Summarization
[Diagram: a seq2seq generator G maps a document to a word sequence (the candidate summary); a discriminator D compares it with human-written summaries and judges whether it is real or not.]
[Diagram, extended: a second seq2seq model R takes the word sequence and reconstructs the document; training minimizes the reconstruction error.]
Unsupervised Abstractive Summarization
[Diagram: G (seq2seq) maps the document to a word sequence, the candidate summary, and R (seq2seq) maps it back to the document.]
Only a large collection of documents is needed to train the model.
This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation.
Problem: the latent word sequence is not readable …
Unsupervised Abstractive Summarization
[Diagram: the seq2seq2seq auto-encoder (G and R) is combined with a discriminator D trained on human-written summaries; G must make D consider its output real, which forces the latent word sequence to become a readable summary.]
Experimental results
                               ROUGE-1  ROUGE-2  ROUGE-L
Supervised                       33.2     14.2     30.5
Trivial                          21.9      7.7     20.5
Unsupervised (matched data)      28.1     10.0     25.4
Unsupervised (no matched data)   27.2      9.1     24.1
English Gigaword (document title as summary)
• Matched data: using the titles of English Gigaword to train the Discriminator
• No matched data: using the titles of CNN/Daily Mail to train the Discriminator
[Wang, Lee, EMNLP 2018]
Semi-supervised Learning
[Figure: ROUGE-1 (y-axis from 25 to 34) vs. the number of document-summary pairs used (0, 10k, 500k) for the WGAN and Reinforce variants (the two approaches to deal with the discrete issue), moving from unsupervised to semi-supervised; the fully supervised baseline uses 3.8M pairs.]
[Wang, Lee, EMNLP 2018]
More Unsupervised Summarization
• Unsupervised summarization with language prior [Christos Baziotis, et al., NAACL 2019]
• Unsupervised multi-document summarization [Eric Chu, Peter Liu, ICML 2019]
Dialogue Response Generation [Su, et al., INTERSPEECH, 2019] (Thu-P-9-C)
[Diagram: a chatbot G, trained on general dialogues, generates a response to an input sentence; a discriminator D, trained on what Trump has said ("Make the US great again", "I would build a great wall", "you are fired"), judges whether the response was "said by Trump"; a reconstruction module R recovers the input sentence from the generated response to minimize the reconstruction error.]
[Overview diagram repeated as a section divider; next topic: Unsupervised Translation.]
Unsupervised learning with 10M sentences performs on par with supervised learning with 100K sentence pairs.
[Figure: learning curves of the supervised and unsupervised translation models.]
[Alexis Conneau, et al., ICLR, 2018]
[Guillaume Lample, et al., ICLR, 2018]
[Overview diagram repeated as a section divider; next topic: Unsupervised Speech Recognition.]
Towards Unsupervised ASR - Cycle GAN
[Diagram: G (ASR) maps audio to text and R (TTS) maps the text back to audio, minimizing the reconstruction error (speech chain); a discriminator D judges whether the ASR output looks like real text (e.g. "how are you", "good morning", "i am fine").]
[Andros Tjandra, et al., ASRU 2017]
[Liu, et al., INTERSPEECH 2018]
[Yeh, et al., ICLR 2019]
[Chen, et al., INTERSPEECH 2019]
Towards Unsupervised ASR - Cycle GAN
• Unsupervised setting on TIMIT (text and audio are unpaired;
the text is not the transcription of the audio)
• 63.6% PER (oracle boundaries) [Liu, et al., INTERSPEECH 2018]
• 41.6% PER (automatic segmentation) [Yeh, et al., ICLR 2019]
• 33.1% PER (automatic segmentation) [Chen, et al., INTERSPEECH 2019] (Tue-P-4-B)
• Semi-supervised setting on Librispeech
[Liu, et al., ICASSP 2019] [Tomoki Hayashi, et al., SLT 2018]
[Takaaki Hori, et al., ICASSP 2019] [Murali Karthick Baskar, et al., INTERSPEECH 2019]
Towards Unsupervised ASR - Shared Latent Space
[Diagram: a text encoder and an audio encoder map into a shared latent space; an audio decoder and a text decoder map back from it, so "this is text" is reconstructed through either path.]
Unsupervised setting on Librispeech: 76.3% WER
WSJ with 2.5 hours paired data: 64.6% WER
LJ speech with 20 mins paired data: 11.7% PER
[Chen, et al., SLT 2018]
Unsupervised speech translation is also possible!
[Chung, et al., NIPS 2018]
[Jennifer Drexler, et al., SLT 2018]
[Ren, et al., ICML 2019]
[Chung, et al., ICASSP 2019]
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
To Learn More …
https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf
You can learn more from the YouTube Channel
(in Mandarin)
Reference
• Sequence Generation
• Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan
Jurafsky, Deep Reinforcement Learning for Dialogue Generation, EMNLP,
2016
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky,
Adversarial Learning for Neural Dialogue Generation, EMNLP, 2017
• Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of
Discrete Elements with the Gumbel-softmax Distribution, arXiv 2016
• Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu
Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative
Adversarial Networks, arXiv 2017
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence
Generative Adversarial Nets with Policy Gradient, AAAI 2017
• Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron
Courville, Adversarial Generation of Natural Language, arXiv, 2017
• Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf, Language
Generation with Recurrent Generative Adversarial Networks without Pre-
training, ICML workshop, 2017
Reference
• Sequence Generation
• Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang,
Zhuoran Wang, Chao Qi , Neural Response Generation via GAN with an
Approximate Embedding Layer, EMNLP, 2017
• Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron
Courville, Yoshua Bengio, Professor Forcing: A New Algorithm for Training
Recurrent Networks, NIPS, 2016
• Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan
Shen, Lawrence Carin, Adversarial Feature Matching for Text Generation,
ICML, 2017
• Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang, Long Text
Generation via Adversarial Training with Leaked Information, AAAI, 2018
• Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun,
Adversarial Ranking for Language Generation, NIPS, 2017
• William Fedus, Ian Goodfellow, Andrew M. Dai, MaskGAN: Better Text
Generation via Filling in the______, ICLR, 2018
Reference
• Sequence Generation
• Yi-Lin Tuan, Hung-Yi Lee, Improving Conditional Sequence Generative
Adversarial Networks by Stepwise Evaluation, TASLP, 2019
• Jingjing Xu, Xuancheng Ren, Junyang Lin, Xu Sun, Diversity-Promoting GAN:
A Cross-Entropy Based Generative Adversarial Network for Diversified Text
Generation, EMNLP, 2018
• Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu, Neural Text
Generation: Past, Present and Beyond, arXiv, 2018
• Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun
Wang, Yong Yu, Texygen: A Benchmarking Platform for Text Generation
Models, arXiv, 2018
• Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly, On Accurate Evaluation
of GANs for Language Generation, arXiv, 2018
• Guy Tevet, Gavriel Habib, Vered Shwartz, Jonathan Berant, Evaluating Text
GANs as Language Models, arXiv, 2018
• Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle
Pineau, Laurent Charlin, Language GANs Falling Short, arXiv, 2018
Reference
• Sequence Generation
• Zhen Yang, Wei Chen, Feng Wang, Bo Xu, Improving Neural Machine
Translation with Conditional Sequence Generative Adversarial Nets, NAACL,
2018
• Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu,
Adversarial Neural Machine Translation, arXiv 2017
• Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li, Generative
Adversarial Network for Abstractive Text Summarization, AAAI 2018
• Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt
Schiele, Speaking the Same Language: Matching Machine to Human
Captions by Adversarial Training, ICCV 2017
• Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing, Recurrent
Topic-Transition GAN for Visual Paragraph Generation, arXiv 2017
• Weili Nie, Nina Narodytska, Ankit Patel, RelGAN: Relational Generative
Adversarial Networks for Text Generation, ICLR 2019
Reference
• Sequence Generation
• Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee, "Code-switching Sentence
Generation by Generative Adversarial Networks and its Application to Data
Augmentation", INTERSPEECH 2019
• Cyprien de Masson d'Autume, Mihaela Rosca, Jack Rae, Shakir Mohamed,
Training language GANs from Scratch, arXiv 2019
Reference
• Text Style Transfer
• Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Rui Yan, Style
Transfer in Text: Exploration and Evaluation, AAAI, 2018
• Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola, Style Transfer
from Non-Parallel Text by Cross-Alignment, NIPS 2017
• Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee,
Lin-shan Lee, Scalable Sentiment for Sequence-to-sequence Chatbot
Response with Performance Analysis, ICASSP, 2018
• Junbo (Jake) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun,
Adversarially Regularized Autoencoders, arxiv, 2017
• Feng-Guang Su, Aliyah Hsu, Yi-Lin Tuan and Hung-yi Lee, "Personalized
Dialogue Response Generation Learned from Monologues", INTERSPEECH,
2019
Reference
• Unsupervised Abstractive Summarization
• Yau-Shian Wang, Hung-Yi Lee, "Learning to Encode Text as Human-
Readable Summaries using Generative Adversarial Networks", EMNLP, 2018
• Eric Chu, Peter Liu, “MeanSum: A Neural Model for Unsupervised Multi-
Document Abstractive Summarization”, ICML, 2019
• Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros
Potamianos, “SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence
Autoencoder for Unsupervised Abstractive Sentence Compression”, NAACL
2019
Reference
• Unsupervised Machine Translation
• Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic
Denoyer, Hervé Jégou, Word Translation Without Parallel Data, ICLR 2018
• Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato, Unsupervised
Machine Translation Using Monolingual Corpora Only, ICLR 2018
Reference
• Unsupervised Speech Recognition
• Alexander H. Liu, Hung-yi Lee, Lin-shan Lee, Adversarial Training of End-to-
end Speech Recognition Using a Criticizing Language Model, ICASSP 2019
• Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Completely
Unsupervised Phoneme Recognition by Adversarially Learning Mapping
Relationships from Audio Embeddings, INTERSPEECH, 2018
• Kuan-yu Chen, Che-ping Tsai, Da-Rong Liu, Hung-yi Lee and Lin-shan Lee,
"Completely Unsupervised Phoneme Recognition By A Generative
Adversarial Network Harmonized With Iteratively Refined Hidden Markov
Models", INTERSPEECH, 2019
• Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-yi Lee, Lin-shan Lee,
"Phonetic-and-Semantic Embedding of Spoken Words with Applications in
Spoken Content Retrieval", SLT, 2018
• Chih-Kuan Yeh, Jianshu Chen, Chengzhu Yu, Dong Yu, Unsupervised Speech
Recognition via Segmental Empirical Output Distribution Matching, ICLR,
2019
Reference
• Unsupervised Speech Recognition
• Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji
Watanabe, Jonathan Le Roux, Cycle-consistency training for end-to-end
speech recognition, ICASSP 2019
• Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki
Hori, Lukáš Burget, Jan Černocký, Semi-supervised Sequence-to-sequence
ASR using Unpaired Speech and Text, INTERSPEECH 2019
• Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, Listening while Speaking:
Speech Chain by Deep Learning, ASRU 2017
• Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Unsupervised
Cross-Modal Alignment of Speech and Text Embedding Spaces, NIPS, 2018
• Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Towards
Unsupervised Speech-to-Text Translation, ICASSP 2019
• Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu, Almost
Unsupervised Text to Speech and Automatic Speech Recognition, ICML
2019
Reference
• Unsupervised Speech Recognition
• Shigeki Karita , Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, Marc
Delcroix, Semi-Supervised End-to-End Speech Recognition, INTERSPEECH,
2018
• Jennifer Drexler, James R. Glass, “Combining End-to-End and Adversarial
Training for Low-Resource Speech Recognition”, SLT 2018
• Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki
Hori, Ramon Astudillo, Kazuya Takeda, Back-Translation-Style Data
Augmentation for End-to-End ASR, SLT, 2018
Please download the latest slides here:
http://speech.ee.ntu.edu.tw/~tlkagk/GAN_3hour.pdf
Generative Adversarial Network and its Applications to Speech Processing and Natural Language Processing (INTERSPEECH 2019 Tutorial)

  • 1. Generative Adversarial Network and its Applications to Speech Processing and Natural Language Processing Hung-yi Lee and Yu Tsao
  • 2. Outline Part I: Basic Idea of Generative Adversarial Network (GAN) Part II: A little bit theory Part III: Applications to Speech Processing Part IV: Applications to Natural Language Processing Take a break
  • 3. All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo (not updated since 2018.09) More than 500 species in the zoo
  • 4. All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo GAN ACGAN BGAN DCGAN EBGAN fGAN GoGAN CGAN …… Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017
  • 5. 0 0 0 0 1 2 42 62 0 0 0 0 2 11 14 32 2012 2013 2014 2015 2016 2017 2018 2019 ICASSP INTERSPEECH INTERSPEECH & ICASSP How many papers have “adversarial” in their titles? It is a wise choice to attend this tutorial.
  • 7. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Generation image 2. Conditional Generation Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional Generation Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 9. Basic Idea of GAN Generator It is a neural network (NN), or a function. Generator 0.1 −3 ⋮ 2.4 0.9 imagevector Generator 3 −3 ⋮ 2.4 0.9 Generator 0.1 2.1 ⋮ 5.4 0.9 Generator 0.1 −3 ⋮ 2.4 3.5 high dimensional vector Powered by: http://mattya.github.io/chainer-DCGAN/ Each dimension of input vector represents some characteristics. Longer hair blue hair Open mouth
  • 10. Discri- minator scalar image Basic Idea of GAN It is a neural network (NN), or a function. Larger value means real, smaller value means fake. Discri- minator Discri- minator Discri- minator1.0 1.0 0.1 Discri- minator 0.1
  • 11. • Initialize generator and discriminator • In each training iteration: DG sample generated objects G Algorithm D Update vector vector vector vector 0000 1111 randomly sampled Database Step 1: Fix generator G, and update discriminator D Discriminator learns to assign high scores to real objects and low scores to generated objects. Fix
  • 12. • Initialize generator and discriminator • In each training iteration: DG Algorithm Step 2: Fix discriminator D, and update generator G Discri- minator NN Generator vector 0.13 hidden layer update fix Gradient Ascent large network Generator learns to “fool” the discriminator
  • 13. • Initialize generator and discriminator • In each training iteration: DG Learning D Sample some real objects: Generate some fake objects: G Algorithm D Update Learning G G D image 1111 image image image 1 update fix 0000vector vector vector vector vector vector vector vector fix
  • 14. Anime Face Generation 100 updates Source of training data: https://zhuanlan.zhihu.com/p/24767059
  • 21. In 2019, with StyleGAN …… Source of video: https://www.gwern.net/Faces
  • 23. Progressive GAN [Tero Karras, et al., ICLR, 2018]
  • 24. The first GAN [Ian J. Goodfellow, et al., NIPS, 2014]
  • 25. Today …… [Andrew Brock, et al., arXiv, 2018]
  • 26. [David Bau, et al., ICLR 2019] Does the generator have the concept of objects? Some neurons correspond to specific objects, for example, tree
  • 27. Remove the neurons for tree [David Bau, et al., ICLR 2019] Activate the neurons for tree
  • 28. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Generation image 2. Conditional Generation Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional Generation Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 29. Target of NN output Text-to-Image • Traditional supervised approach NN Image Text: “train” a dog is running a bird is flying A blurry image! c1: a dog is running as close as possible
  • 30. Conditional GAN D (original) scalar𝑥 G 𝑧Normal distribution x = G(c,z) c: train x is real image or not Image Real images: Generated images: 1 0 Generator will learn to generate realistic images …. But completely ignore the input conditions. [Scott Reed, et al, ICML, 2016]
  • 31. Conditional GAN D (better) scalar 𝑐 𝑥 True text-image pairs: G 𝑧Normal distribution x = G(c,z) c: train Image x is realistic or not + c and x are matched or not (train , ) (train , )(cat , ) [Scott Reed, et al, ICML, 2016] 1 00
  • 32. x is realistic or not + c and x are matched or not Conditional GAN - Discriminator [Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena et al., ICML, 2017] condition c object x Network Network Network score Network Network (almost every paper) condition c object x c and x are matched or not x is realistic or not +
  • 33. Conditional GAN paired data blue eyes red hair short hair Collecting anime faces and the description of its characteristics red hair, green eyes blue hair, red eyes The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
  • 34. Conditional GAN - Image-to-image G 𝑧 x = G(c,z) 𝑐 [Phillip Isola, et al., CVPR, 2017] Image translation, or pix2pix
  • 35. as close as possible Conditional GAN - Image-to-image • Traditional supervised approach NN Image It is blurry. Testing: input L1 e.g. L1 [Phillip Isola, et al., CVPR, 2017]
  • 36. Conditional GAN - Image-to-image Testing: input L1 GAN G 𝑧 Image D scalar GAN + L1 L1 [Phillip Isola, et al., CVPR, 2017]
  • 37. Conditional GAN - Sound-to-image Gc: sound Image "a dog barking sound" Training Data Collection video [Wan, et al., ICASSP 2019]
  • 38. Conditional GAN - Sound-to-image • Audio-to-image https://wjohn1483.github.io/ audio_to_scene/index.html The images are generated by Chia- Hung Wan and Shun-Po Chuang. Louder
  • 39. Conditional GAN - Image-to-label Multi-label Image Classifier = Conditional Generator Input condition Generated output
  • 40. Conditional GAN - Image-to-label F1 MS-COCO NUS-WIDE VGG-16 56.0 33.9 + GAN 60.4 41.2 Inception 62.4 53.5 +GAN 63.8 55.8 Resnet-101 62.8 53.1 +GAN 64.0 55.4 Resnet-152 63.3 52.1 +GAN 63.9 54.1 Att-RNN 62.1 54.7 RLSD 62.0 46.9 The classifiers can have different architectures. The classifiers are trained as conditional GAN. [Tsai, et al., ICASSP 2019]
  • 41. Conditional GAN - Image-to-label F1 MS-COCO NUS-WIDE VGG-16 56.0 33.9 + GAN 60.4 41.2 Inception 62.4 53.5 +GAN 63.8 55.8 Resnet-101 62.8 53.1 +GAN 64.0 55.4 Resnet-152 63.3 52.1 +GAN 63.9 54.1 Att-RNN 62.1 54.7 RLSD 62.0 46.9 The classifiers can have different architectures. The classifiers are trained as conditional GAN. Conditional GAN outperforms other models designed for multi-label.
  • 42. Conditional GAN - Video Generation Generator Discrimi nator Last frame is real or generated Discriminator thinks it is real [Michael Mathieu, et al., arXiv, 2015]
  • 44. More about Video Generation https://arxiv.org/abs/1905.08233 [Egor Zakharov, et al., arXiv, 2019]
  • 45. Domain Adversarial Training • Training and testing data are in different domains Training data: Testing data: Generator Generator The same distribution feature feature Take digit classification as example
  • 46. blue points red points Domain Adversarial Training feature extractor (Generator) Discriminator (Domain classifier) image Which domain? Always output zero vectors Domain Classifier Fails
  • 47. Domain Adversarial Training feature extractor (Generator) Discriminator (Domain classifier) image Label predictor Which digits? Not only cheat the domain classifier, but satisfying label predictor at the same time More speech-related applications in Part III. Successfully applied on image classification [Ganin et al, ICML, 2015][Ajakan et al. JMLR, 2016 ] Which domain?
  • 48. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Generation image 2. Conditional Generation Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional Generation Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 49. Unsupervised Conditional Generation G Object in Domain X Object in Domain Y Transform an object from one domain to another without paired data Domain X Domain Y photos Condition Generated Object Vincent van Gogh’s paintings Not Paired More Applications in Parts III and IV Use image style transfer as example here
  • 50. Unsupervised Conditional Generation • Approach 1: Cycle-GAN and its variants • Approach 2: Shared latent space ?𝐺 𝑋→𝑌 Domain X Domain Y 𝐸𝑁 𝑋 𝐷𝐸 𝑌 Encoder of domain X Decoder of domain Y Domain YDomain X Face Attribute
  • 51. ? Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain Y Domain X scalar Input image belongs to domain Y or not Become similar to domain Y
  • 52. Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain Y Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Not what we want! ignore input
  • 53. Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Not what we want! ignore input [Tomer Galanti, et al. ICLR, 2018] The issue can be avoided by network design. Simpler generator makes the input and output more closely related.
  • 54. Cycle GAN 𝐺 𝑋→𝑌 Domain X Domain Y 𝐷 𝑌 Domain X scalar Input image belongs to domain Y or not Become similar to domain Y Encoder Network Encoder Network pre-trained as close as possible Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
  • 55. Cycle GAN 𝐺 𝑋→𝑌 𝐷 𝑌 Domain Y scalar Input image belongs to domain Y or not 𝐺Y→X as close as possible Lack of information for reconstruction [Jun-Yan Zhu, et al., ICCV, 2017] Cycle consistency
  • 56. Cycle GAN 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 scalar: belongs to domain Y or not scalar: belongs to domain X or not
  • 57. Cycle GAN Dual GAN Disco GAN [Jun-Yan Zhu, et al., ICCV, 2017] [Zili Yi, et al., ICCV, 2017] [Taeksoo Kim, et al., ICML, 2017] For multiple domains, considering starGAN [Yunjey Choi, arXiv, 2017]
  • 58. Issue of Cycle Consistency • CycleGAN: a Master of Steganography [Casey Chu, et al., NIPS workshop, 2017] 𝐺Y→X𝐺 𝑋→𝑌 The information is hidden.
  • 59. Unsupervised Conditional Generation • Approach 1: Cycle-GAN and its variants • Approach 2: Shared latent space ?𝐺 𝑋→𝑌 Domain X Domain Y 𝐸𝑁 𝑋 𝐷𝐸 𝑌 Encoder of domain X Decoder of domain Y Domain YDomain X Face Attribute
  • 60. Domain X Domain Y 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image imageFace Attribute Shared latent space Target - domain-x information + domain-y information
  • 61. Domain X Domain Y 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error Shared latent space Training
  • 62. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error Because we train two auto-encoders separately … The images with the same attribute may not project to the same position in the latent space. 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Minimizing reconstruction error Shared latent space Training
  • 63. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image Minimizing reconstruction error The domain discriminator forces the output of 𝐸𝑁𝑋 and 𝐸𝑁𝑌 have the same distribution. From 𝐸𝑁𝑋 or 𝐸𝑁𝑌 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared latent space Training Domain Discriminator 𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator [Guillaume Lample, et al., NIPS, 2017]
  • 64. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared latent space Training Cycle Consistency: Used in ComboGAN [Asha Anoosheh, et al., arXiv, 017] Minimizing reconstruction error
  • 65. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋image image image image 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared latent space Training Semantic Consistency: Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017] To the same latent space
  • 66. Sharing the parameters of encoders and decoders Shared latent space 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑋 𝐷𝐸 𝑌 Couple GAN[Ming-Yu Liu, et al., NIPS, 2016] UNIT[Ming-Yu Liu, et al., NIPS, 2017]
  • 67. Shared latent space 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑋 𝐷𝐸 𝑌 One encoder to extract domain- independent information Input an extra indicator to control the decoder x or y Widely used in Voice Conversion (Part III)
  • 69. Generator “Girl with red hair” Generator −0.3 0.1 ⋮ 0.9 random vector Three Categories of GAN 1. Typical GAN image 2. Conditional GAN Generator text imagepaired data blue eyes, red hair, short hair 3. Unsupervised Conditional GAN Photo Vincent van Gogh’s styleunpaired data x ydomain x domain y
  • 70. Reference • Generation • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Nets, NIPS, 2014 • Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing of GANs for Improved Quality, Stability, and Variation, ICLR, 2018 • Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for High Fidelity Natural Image Synthesis, arXiv, 2018 • David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba, GAN Dissection: Visualizing and Understanding Generative Adversarial Networks, ICLR 2019
  • 71. Reference • Conditional Generation • Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML, 2016 • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, CVPR, 2017 • Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video prediction beyond mean square error, arXiv, 2015 • Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv, 2014 • Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator, ICLR, 2018 • Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv, 2017 • Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image Synthesis With Auxiliary Classifier GANs, ICML, 2017
  • 72. Reference • Conditional Generation • Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by Backpropagation, ICML, 2015 • Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016 • Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A Novel Framework for Multi-class Classification, submitted to ICASSP 2019 • Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky, Few- Shot Adversarial Learning of Realistic Neural Talking Head Models, arXiv 2019 • Chia-Hung Wan, Shun-Po Chuang, Hung-Yi Lee, "Towards Audio to Scene Image Synthesis using Generative Adversarial Network", ICASSP, 2019
  • 73. Reference • Unsupervised Conditional Generation • Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to- Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017 • Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual Learning for Image-to-Image Translation, ICCV, 2017 • Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018 • Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image Generation, ICLR, 2017 • Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN: Unrestrained Scalability for Image Domain Translation, arXiv, 2017 • Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to- Image Translation for Many-to-Many Mappings, arXiv, 2017
  • 74. Reference • Unsupervised Conditional Generation • Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by Sliding Attributes, NIPS, 2017 • Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim, Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, ICML, 2017 • Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS, 2016 • Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image Translation Networks, NIPS, 2017 • Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi- Domain Image-to-Image Translation, arXiv, 2017
  • 75. Part II: A little bit Theory
  • 76. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 77. Generator • A generator G is a network. The network defines a probability distribution 𝑃𝐺 generator G𝑧 𝑥 = 𝐺 𝑧 Normal Distribution 𝑃𝐺(𝑥) 𝑃𝑑𝑎𝑡𝑎 𝑥 as close as possible How to compute the divergence? 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Divergence between distributions 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 𝑥: an image (a high- dimensional vector)
  • 78. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Although we do not know the distributions of 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎, we can sample from them. sample G vector vector vector vector sample from normal Database Sampling from 𝑷 𝑮 Sampling from 𝑷 𝒅𝒂𝒕𝒂
  • 79. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Discriminator : data sampled from 𝑃𝑑𝑎𝑡𝑎 : data sampled from 𝑃𝐺 train 𝑉 𝐺, 𝐷 = 𝐸 𝑥∼𝑃 𝑑𝑎𝑡𝑎 𝑙𝑜𝑔𝐷 𝑥 + 𝐸 𝑥∼𝑃 𝐺 𝑙𝑜𝑔 1 − 𝐷 𝑥 Example Objective Function for D (G is fixed) 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺Training: Using the example objective function is exactly the same as training a binary classifier. [Goodfellow, et al., NIPS, 2014] The maximum objective value is related to JS divergence.
  • 80. Discriminator 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎 Discriminator : data sampled from 𝑃𝑑𝑎𝑡𝑎 : data sampled from 𝑃𝐺 train hard to discriminatesmall divergence Discriminator train easy to discriminatelarge divergence 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺 Training: Small max 𝐷 𝑉 𝐷, 𝐺
  • 81. 𝐺∗ = 𝑎𝑟𝑔 min 𝐺 𝐷𝑖𝑣 𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎max 𝐷 𝑉 𝐺, 𝐷 The maximum objective value is related to JS divergence. • Initialize generator and discriminator • In each training iteration: Step 1: Fix generator G, and update discriminator D Step 2: Fix discriminator D, and update generator G 𝐷∗ = 𝑎𝑟𝑔 max 𝐷 𝑉 𝐷, 𝐺 [Goodfellow, et al., NIPS, 2014]
  • 82. Using the divergence you like ☺ [Sebastian Nowozin, et al., NIPS, 2016] Can we use other divergence?
  • 83. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 84. GAN is difficult to train …… • There is a saying …… (I found this joke from 陳柏文’s facebook.)
  • 85. Too many tips …… • I do a little survey among 12 students ….. Q: What is the most helpful tip for training GAN? WGAN (33.3%) Spectral Norm (16.7%)
  • 86. JS divergence is not suitable • In most cases, 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 are not overlapped. • 1. The nature of data • 2. Sampling Both 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 are low-dim manifold in high-dim space. 𝑃𝑑𝑎𝑡𝑎 𝑃𝐺 The overlap can be ignored. Even though 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 have overlap. If you do not have enough sampling ……
  • 87. 𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1 𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝑃𝑑𝑎𝑡𝑎𝑃𝐺100 …… 𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 What is the problem of JS divergence? …… JS divergence is log2 if two distributions do not overlap. Intuition: If two distributions do not overlap, binary classifier achieves 100% accuracy Equally bad The same max objective value is obtained. Same divergence
  • 88. Wasserstein distance • Considering one distribution P as a pile of earth, and another distribution Q as the target • The average distance the earth mover has to move the earth. 𝑃 𝑄 d 𝑊 𝑃, 𝑄 = 𝑑
  • 89. Wasserstein distance Source of image: https://vincentherrmann.github.io/blog/wasserstein/ 𝑃 𝑄 Using the “moving plan” with the smallest average distance to define the Wasserstein distance. There are many possible “moving plans”. Smaller distance? Larger distance?
  • 90. 𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1 𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝑃𝑑𝑎𝑡𝑎𝑃𝐺100 …… 𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑙𝑜𝑔2 𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 What is the problem of JS divergence? 𝑊 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎 = 𝑑0 𝑊 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎 = 𝑑1 𝑊 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎 = 0 𝑑0 𝑑1 …… …… Better!
  • 91. WGAN max 𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸 𝑥~𝑃 𝐺 𝐷 𝑥 Evaluate Wasserstein distance between 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 [Martin Arjovsky, et al., arXiv, 2017] How to fulfill this constraint?D has to be smooth enough. real −∞ generated D ∞ Without the constraint, the training of D will not converge. Keeping the D smooth forces D(x) become ∞ and −∞
  • 92. • Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017] • Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017] • Spectral Normalization → Keep gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018] Force the parameters w between c and -c After parameter update, if w > c, w = c; if w < -c, w = -c Keep the gradient close to 1 max 𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸 𝑥~𝑃 𝐺 𝐷 𝑥 real samples Keep the gradient close to 1 [Kodali, et al., arXiv, 2017] [Wei, et al., ICLR, 2018]
  • 93. More Tips • Improved techniques for training GANs • Tips in DCGAN [Alec Radford, et al., ICLR 2016] • Guideline for network architecture design for image generation • Tips from Soumith • https://github.com/soumith/ganhacks • Tips from BigGAN [Andrew Brock, et al., arXiv, 2018] [Tim Salimans, et al., NIPS, 2016]
  • 94. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 95. Inception Score Off-the-shelf Image Classifier 𝑥 𝑃 𝑦|𝑥 Concentrated distribution means higher visual quality CNN𝑥1 𝑃 𝑦1|𝑥1 Uniform distribution means higher variety CNN𝑥2 𝑃 𝑦2|𝑥2 CNN𝑥3 𝑃 𝑦3|𝑥3 … 𝑃 𝑦 = 1 𝑁 ෍ 𝑛 𝑃 𝑦 𝑛|𝑥 𝑛 [Tim Salimans, et al., NIPS, 2016] 𝑥: image 𝑦: class (output of CNN) e.g. Inception net, VGG, etc. class 1 class 2 class 3
  • 96. Inception Score = ෍ 𝑥 ෍ 𝑦 𝑃 𝑦|𝑥 𝑙𝑜𝑔𝑃 𝑦|𝑥 − ෍ 𝑦 𝑃 𝑦 𝑙𝑜𝑔𝑃 𝑦 Negative entropy of P(y|x) Entropy of P(y) Inception Score 𝑃 𝑦 = 1 𝑁 ෍ 𝑛 𝑃 𝑦 𝑛|𝑥 𝑛 𝑃 𝑦|𝑥 class 1 class 2 class 3 [Tim Salimans, et al., NIPS 2016]
  • 97. Fréchet Inception Distance (FID) blue points: latent representation of Inception net for the generated images red points: latent representation of Inception net for the read images FID = Fréchet distance between the two Gaussians [Martin Heusel, et al., NIPS, 2017]
  • 98. To learn more about evaluation … Pros and cons of GAN evaluation measures https://arxiv.org/abs/1802.03446 [Ali Borji, 2019]
  • 99. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 100. Basic Components EnvActor Reward Function Video Game Go Get 20 scores when killing a monster The rule of GO You cannot control
  • 101. • Input of neural network: the observation of machine represented as a vector or a matrix • Output neural network : each action corresponds to a neuron in output layer … … NN as actor pixels fire right left Score of an action 0.7 0.2 0.1 Take the action based on the probability. Neural network as Actor
  • 102. Actor, Environment, Reward 𝜏 = 𝑠1, 𝑎1, 𝑠2, 𝑎2, ⋯ , 𝑠 𝑇, 𝑎 𝑇 Trajectory Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… “right” “fire”
  • 103. Reward Function → Discriminator Reinforcement Learning v.s. GAN Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… 𝑅 𝜏 = ෍ 𝑡=1 𝑇 𝑟𝑡 Reward 𝑟1 Reward 𝑟2 “Black box” You cannot use backpropagation. Actor → Generator Fixed updatedupdated
  • 104. Inverse Reinforcement Learning We have demonstration of the expert. Actor 𝑠1 𝑎1 Env 𝑠2 Env 𝑠1 𝑎1 Actor 𝑠2 𝑎2 Env 𝑠3 𝑎2 …… reward function is not available (in many cases, it is difficult to define reward function) Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏 𝑁 Each Ƹ𝜏 is a trajectory of the expert. Self driving: record human drivers Robot: grab the arm of robot
  • 105. Inverse Reinforcement Learning Reward Function Environment Optimal Actor Inverse Reinforcement Learning ➢Using the reward function to find the optimal actor. ➢Modeling reward can be easier. Simple reward function can lead to complex policy. Reinforcement Learning Expert demonstration of the expert Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏 𝑁
  • 106. Framework of IRL Expert $\hat{\pi}$ provides demonstrations $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$; actor $\pi$ produces trajectories $\tau_1, \tau_2, \cdots, \tau_N$. Obtain a reward function R such that the expert is always the best: $\sum_{n=1}^{N} R(\hat{\tau}_n) > \sum_{n=1}^{N} R(\tau_n)$ Then find an actor based on reward function R by reinforcement learning. Reward function → Discriminator Actor → Generator
  • 107. GAN vs. IRL GAN: G generates $\tau_1, \tau_2, \cdots, \tau_N$; D gives high scores to real examples and low scores to generated ones; find a G whose output obtains a large score from D. IRL: the expert provides $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$; the reward function gives larger reward to $\hat{\tau}_n$ and lower reward to $\tau_n$; find an actor that obtains large reward.
  • 108. Outline of Part II Basic Theory of GAN Helpful Tips How to evaluate GAN Relation to Reinforcement Learning
  • 109. Reference • Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”, NIPS, 2016 • Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017 • Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved Training of Wasserstein GANs, NIPS, 2017 • Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial Network, arXiv, 2016 • Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017 • Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen Improved Techniques for Training GANs, NIPS, 2016 • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS, 2017
  • 110. Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing Part III: Speech Signal Processing Tsao, Yu Ph.D., Academia Sinica yu.tsao@citi.sinica.edu.tw
  • 111. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 112. Speech Signal Generation (Regression Task) G Output Objective function Paired
  • 113. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Encoder E, $g(\cdot)$, maps the input to an embedding (Emb.) $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$, and classifier G, $h(\cdot)$, produces the output label $\boldsymbol{y}$. Acoustic mismatch: the model is trained on clean data $\boldsymbol{x}$ but deployed on noisy data $\tilde{\boldsymbol{x}}$, accented speech, or speech with channel distortion.
  • 114. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 115. Speech Enhancement • Neural network models for spectral mapping • Typical objective function ➢ Mean square error (MSE) [Xu et al., TASLP 2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al., MLSP 2017], STOI [Fu et al., TASLP 2018]. Enhancing ➢ GAN is used as a new objective function to estimate the parameters in G. ➢Model structures of G: DNN [Wang et al. NIPS 2012; Xu et al., SPL 2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al., Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al., Interspeech 2016]. G Output Objective function
  • 116. Speech Enhancement • Speech enhancement GAN (SEGAN) [Pascual et al., Interspeech 2017]
  • 117. Table 1: Objective evaluation results. Table 2: Subjective evaluation results. Fig. 1: Preference test results. Speech Enhancement (SEGAN) SEGAN yields better speech enhancement results than Noisy and Wiener. • Experimental results
  • 118. • Pix2Pix [Michelsanti et al., Interspeech 2017] D Scalar Clean Noisy (Fake/Real) Output Noisy G Noisy Output Clean Speech Enhancement
  • 119. Fig. 2: Spectrogram comparison of Pix2Pix with baseline methods. Speech Enhancement (Pix2Pix) • Spectrogram analysis Pix2Pix outperforms STAT-MMSE and is competitive to DNN SE. NG-DNN STAT-MMSE Noisy Clean NG-Pix2Pix
  • 120. Table 3: Objective evaluation results. Speech Enhancement (Pix2Pix) • Objective evaluation and speaker verification test Table 4: Speaker verification results. 1. From the PESQ and STOI evaluations, Pix2Pix outperforms Noisy and MMSE and is competitive to DNN SE. 2. From the speaker verification results, Pix2Pix outperforms the baseline models when the clean training data is used.
  • 121. • Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018] D Scalar Clean Noisy (Fake/Real) Output Noisy G Noisy Output Clean Speech Enhancement
  • 122. Fig. 3: Spectrogram comparison of FSEGAN with L1-trained method. Speech Enhancement (FSEGAN) • Spectrogram analysis FSEGAN reduces both additive noise and reverberant smearing.
  • 123. Table 5: WER (%) of SEGAN and FSEGAN. Table 6: WER (%) of FSEGAN with retrain. Speech Enhancement (FSEGAN) • ASR results 1. From Table 5, (1) FSEGAN improves recognition results for ASR-Clean. (2) FSEGAN outperforms SEGAN as front-ends. 2. From Table 6, (1) Hybrid Retraining with FSEGAN outperforms Baseline; (2) FSEGAN retraining slightly underperforms L1–based retraining.
  • 124. Speech Enhancement • Speech enhancement through a mask function: G takes the noisy input and outputs a mask; the enhanced speech is obtained by point-wise multiplication of the mask with the noisy input.
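A minimal sketch of this masking scheme, assuming magnitude spectrograms as input; the network sizes and layer choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):                      # (batch, time, freq) magnitude spectrogram
        h, _ = self.rnn(noisy_mag)
        mask = torch.sigmoid(self.out(h))              # mask values in [0, 1]
        return mask * noisy_mag                        # point-wise multiplication → enhanced magnitude
```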
  • 125. • GAN for spectral magnitude mask estimation (MMS-GAN) [Ashutosh Pandey and Deliang Wang, ICASSP 2018] D Scalar Ref. mask Noisy (Fake/Real) Output mask Noisy G Noisy Output mask Ref. mask Speech Enhancement We do not know exactly what function D learns. Our ICML 2019 paper sheds some light on a potential future direction.
  • 126. Speech Enhancement (AFT) • Cycle-GAN-based acoustic feature transformation (AFT) [Mimura et al., ASRU 2017] $V_{Full} = V_{GAN}(G_{X\to Y}, D_Y) + V_{GAN}(G_{Y\to X}, D_X) + \lambda V_{Cyc}(G_{X\to Y}, G_{Y\to X})$ $G_{S\to T}$ and $G_{T\to S}$ map between the two domains, and the cycle output should be as close as possible to the input. $D_T$: scalar, belongs to domain T or not; $D_S$: scalar, belongs to domain S or not. (Noisy → Enhanced → Noisy; Clean → Syn. Noisy → Clean)
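A minimal sketch of this objective; the names G_xy, G_yx, D_x, D_y and the least-squares adversarial form are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def cyclegan_g_loss(G_xy, G_yx, D_x, D_y, x, y, lam=10.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    # Adversarial terms: make the target-domain discriminator call the converted features real.
    adv = F.mse_loss(D_y(fake_y), torch.ones_like(D_y(fake_y))) + \
          F.mse_loss(D_x(fake_x), torch.ones_like(D_x(fake_x)))
    # Cycle-consistency: X -> Y -> X and Y -> X -> Y should reconstruct the input.
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    return adv + lam * cyc
```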
  • 127. • ASR results on noise robustness and style adaptation Table 7: Noise robust ASR. Table 8: Speaker style adaptation. 1. 𝐺 𝑇→𝑆 can transform acoustic features and effectively improve ASR results for both noisy and accented speech. 2. 𝐺𝑆→𝑇 can be used for model adaptation and effectively improve ASR results for noisy speech. S: Clean; 𝑇: Noisy JNAS: Read; CSJ-SPS: Spontaneous (relax); CSJ-APS: Spontaneous (formal); Speech Enhancement (AFT)
  • 128. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 129. • Postfilter for synthesized or transformed speech ➢ Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil’en et al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014], DNN with MSE criterion [Chen et al., Interspeech 2014; Chen et al., TASLP 2015]. ➢ GAN is used as a new objective function to estimate the parameters in G. Postfilter Synthesized spectral texture Natural spectral texture G Output Objective function Speech synthesizer Voice conversion Speech enhancement
  • 130. • GAN postfilter [Kaneko et al., ICASSP 2017] ➢ Traditional MMSE criterion results in statistical averaging. ➢ GAN is used as a new objective function to estimate the parameters in G. ➢ The proposed work intends to further improve the naturalness of synthesized speech or parameters from a synthesizer. Postfilter Synthesized Mel cepst. coef. Natural Mel cepst. coef. D Nature or Generated Generated Mel cepst. coef. G
  • 131. Fig. 4: Spectrograms of: (a) NAT (nature); (b) SYN (synthesized); (c) VS (variance scaling); (d) MS (modulation spectrum); (e) MSE; (f) GAN postfilters. Postfilter (GAN-based Postfilter) • Spectrogram analysis GAN postfilter reconstructs spectral texture similar to the natural one.
  • 132. Fig. 5: Mel-cepstral trajectories (GANv: GAN was applied in voiced part). Fig. 6: Averaging difference in modulation spectrum per Mel- cepstral coefficient. Postfilter (GAN-based Postfilter) • Objective evaluations GAN postfilter reconstructs spectral texture similar to the natural one.
  • 133. Table 9: Preference score (%). Bold font indicates the numbers over 30%. Postfilter (GAN-based Postfilter) • Subjective evaluations 1. GAN postfilter significantly improves the synthesized speech. 2. GAN postfilter is effective particularly in voiced segments. 3. GANv outperforms GAN and is comparable to NAT.
  • 134. Speech Synthesis • Input: linguistic features; Output: speech parameters. The synthesis model $G_{SS}$ maps linguistic features to generated speech parameters $\hat{\boldsymbol{c}}$ (sp), trained against the natural speech parameters $\boldsymbol{c}$ with an objective such as minimum generation error (MGE) / MSE. • Speech synthesis with anti-spoofing verification (ASV) [Saito et al., ICASSP 2017] Discriminator loss: $L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$, with $L_{D,1}(\boldsymbol{c}) = -\frac{1}{T}\sum_{t=1}^{T}\log D(\boldsymbol{c}_t)$ (NAT) and $L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T}\sum_{t=1}^{T}\log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ (SYN). Generator loss: $L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$, i.e. minimum generation error (MGE) with an adversarial loss, where the anti-spoofing verifier $D_{ASV}$ operates on features $\phi(\cdot)$ of the generated parameters.
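A minimal sketch of this MGE-plus-adversarial generator loss, assuming c_nat and c_gen are (T, dim) tensors of natural and generated speech parameters and D outputs a per-frame probability; the scale factor stands in for the $E_{L_G}/E_{L_D}$ balancing ratio:

```python
import torch
import torch.nn.functional as F

def generator_loss(c_nat, c_gen, D, omega_d=1.0, scale=1.0):
    mge = F.mse_loss(c_gen, c_nat)                    # minimum generation error term
    adv = -torch.log(D(c_gen) + 1e-8).mean()          # L_{D,1}(c_gen): make D judge the generated frames as natural
    return mge + omega_d * scale * adv
```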
  • 135. Fig. 7: Averaged GVs of MCCs. Speech Synthesis (ASV) • Objective and subjective evaluations 1. The proposed algorithm generates MCCs similar to the natural ones. Fig. 8: Scores of speech quality. 2. The proposed algorithm outperforms conventional MGE training.
  • 136. Speech Synthesis • Speech synthesis with GAN (SS-GAN) [Saito et al., TASLP 2018] Same MGE-plus-adversarial formulation as the ASV case: $L_D(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_{D,1}(\boldsymbol{c}) + L_{D,0}(\hat{\boldsymbol{c}})$ with $L_{D,1}(\boldsymbol{c}) = -\frac{1}{T}\sum_{t=1}^{T}\log D(\boldsymbol{c}_t)$ (NAT), $L_{D,0}(\hat{\boldsymbol{c}}) = -\frac{1}{T}\sum_{t=1}^{T}\log\left(1 - D(\hat{\boldsymbol{c}}_t)\right)$ (SYN), and $L(\boldsymbol{c}, \hat{\boldsymbol{c}}) = L_G(\boldsymbol{c}, \hat{\boldsymbol{c}}) + \omega_D \frac{E_{L_G}}{E_{L_D}} L_{D,1}(\hat{\boldsymbol{c}})$, but here the discriminator $D$ (with features $\phi(\cdot)$) is applied to sp, f0, and duration rather than sp only.
  • 137. Fig. 10: Scores of speech quality (sp and F0). . Speech Synthesis (SS-GAN) • Subjective evaluations Fig. 9: Scores of speech quality (sp). The proposed algorithm works for both spectral parameters and F0.
  • 138. • Convert (transform) speech from source to target ➢ Conventional VC approaches include Gaussian mixture model (GMM) [Toda et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP 2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al., Interspeech 2016], variational autoencoder (VAE) [Hsu et al., APSIPA 2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP 2014], feed forward NN [Desai et al., TASLP 2010], recurrent NN (RNN) [Nakashika et al., Interspeech 2014]. Voice Conversion G Output Objective function Target speaker Source speaker
  • 139. • VAW-GAN [Hsu et al., Interspeech 2017] ➢Conventional MMSE approaches often encounter the “over-smoothing” issue. ➢ GAN is used as a new objective function to estimate G. ➢ The goal is to increase the naturalness, clarity, and similarity of the converted speech. Voice Conversion D Real or Fake G Target speaker Source speaker $V(G, D) = V_{GAN}(G, D) + \lambda V_{VAE}(\boldsymbol{x}|\boldsymbol{y})$
  • 140. • Objective and subjective evaluations Fig. 11: The spectral envelopes. Fig. 12: MOS on naturalness. Voice Conversion (VAW-GAN) VAW-GAN outperforms VAE in terms of objective and subjective evaluations, generating more structured speech.
  • 141. • CycleGAN-VC [Kaneko et al., Eusipco 2018] • GAN is used as a new objective function to estimate G: $V_{Full} = V_{GAN}(G_{X\to Y}, D_Y) + V_{GAN}(G_{Y\to X}, D_X) + \lambda V_{Cyc}(G_{X\to Y}, G_{Y\to X})$ Voice Conversion $G_{S\to T}$ and $G_{T\to S}$ map between source and target speakers, and the cycle output should be as close as possible to the input. $D_T$: scalar, belongs to domain T or not; $D_S$: scalar, belongs to domain S or not. (Source → Syn. Target → Source; Target → Syn. Source → Target)
  • 142. • Subjective evaluations Fig. 13: MOS for naturalness. Fig. 14: Similarity to the source and target speakers. S: Source; T: Target; P: Proposed; B: Baseline Voice Conversion (CycleGAN-VC) 1. The proposed method uses non-parallel data. 2. For naturalness, the proposed method outperforms the baseline. 3. For similarity, the proposed method is comparable to the baseline. Target speaker Source speaker
  • 143. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 144. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Encoder E, $g(\cdot)$, maps the input to an embedding (Emb.) $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$, and classifier G, $h(\cdot)$, produces the output label $\boldsymbol{y}$. Acoustic mismatch: the model is trained on clean data $\boldsymbol{x}$ but deployed on noisy data $\tilde{\boldsymbol{x}}$, accented speech, or speech with channel distortion.
  • 145. Speech Recognition • Adversarial multi-task learning (AMT) [Shinohara Interspeech 2016] Input: acoustic feature $\boldsymbol{x}$; encoder E produces embedding $\boldsymbol{z}$; classifier G predicts Output 1 (senone $\boldsymbol{y}$); discriminator D, connected through a gradient reversal layer (GRL), predicts Output 2 (domain). Objective functions: $V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G)$, $V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$. Model update: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (max classification accuracy and min domain accuracy).
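A minimal sketch of the gradient reversal layer (GRL) in PyTorch; the usage line at the end (encoder, domain_classifier) is an illustrative assumption:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)                    # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None  # reversed (and scaled) gradient in the backward pass

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)

# Usage sketch: domain_logits = domain_classifier(grad_reverse(encoder(x), alpha))
# so the encoder is pushed to produce features that fool the domain classifier.
```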
  • 146. • ASR results in known (k) and unknown (unk) noisy conditions Speech Recognition (AMT) Table 10: WER of DNNs with single-task learning (ST) and AMT. The AMT-DNN outperforms ST-DNN with yielding lower WERs.
  • 147. Speech Recognition • Domain adversarial training for accented ASR (DAT) [Sun et al., ICASSP2018] Same architecture and training rule as AMT: input acoustic feature $\boldsymbol{x}$, encoder E, classifier G for Output 1 (senone $\boldsymbol{y}$), and domain discriminator D for Output 2 (domain) connected through a GRL, with $V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G)$ and $V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$, and updates $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (max classification accuracy and min domain accuracy).
  • 148. • ASR results on accented speech Speech Recognition (DAT) 1. With labeled transcriptions, ASR performance notably improves. Table 11: WER of the baseline and adapted model. 2. DAT is effective in learning features invariant to domain differences with and without labeled transcriptions. STD: standard speech
  • 149. Speech Recognition • Unsupervised Adaptation with Domain Separation Networks (DSN) [Meng et al., ASRU 2017] A shared encoder E, $g(\cdot)$, extracts embeddings $\boldsymbol{z} = g(\boldsymbol{x})$ from clean data and $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$ from noisy data; private encoders (PEs, PEt) and reconstructors R separate the domain-private components; G, $h(\cdot)$, predicts Output 1 (senone $\boldsymbol{y}$) and D predicts Output 2 (domain $\boldsymbol{d}$).
  • 150. • Results on ASR in noise (CHiME3): Speech Recognition (DSN) 1. DSN outperforms GRL consistently over different noise types. 2. The results confirmed the additional gains provided by private component extractors. Table 12: WER (in %) of Robust ASR on the CHiME3 task.
  • 151. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 152. Speaker Recognition • Domain adversarial neural network (DANN) [Wang et al., ICASSP 2018] Enroll and test i-vectors are pre-processed, passed through the DANN, and scored. The DANN takes the input acoustic feature $\boldsymbol{x}$; encoder E produces $\boldsymbol{z}$; G predicts Output 1 (speaker ID $\boldsymbol{y}$, loss $V_y$) and D, through a GRL, predicts Output 2 (domain, loss $V_z$).
  • 153. • Recognition results of domain mismatched conditions Table 13: Performance of DAT and the state-of-the-art methods. Speaker Recognition (DANN) The DAT approach outperforms other methods with achieving lowest EER and DCF scores.
  • 154. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 155. Emotion Recognition • Adversarial AE for emotion recognition (AAE-ER) [Sahu et al., Interspeech 2017] AE with GAN: $H(h(\boldsymbol{z}), \boldsymbol{x}) + \lambda V_{GAN}(\boldsymbol{q}, g(\boldsymbol{x}))$, where $\boldsymbol{z} = g(\boldsymbol{x})$ is the embedding from encoder E, $h(\cdot)$ is the decoder reconstructing $\boldsymbol{x}$, G synthesizes samples, and D matches $g(\boldsymbol{x})$ to $\boldsymbol{q}$, the desired distribution of code vectors.
  • 156. • Recognition results of domain mismatched conditions: Table 15: Classification results on real and synthesized features. Emotion Recognition (AAE-ER) Table 14: Classification results on different systems. 1. AAE alone could not yield performance improvements. 2. Using synthetic data from AAE can yield higher UAR. Original Training data
  • 157. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 158. Lip-reading • Domain adversarial training for lip-reading (DAT-LR) [Wand et al., Interspeech 2017] (~80% WAC) Encoder E produces $\boldsymbol{z}$ from input $\boldsymbol{x}$; G predicts Output 1 (words $\boldsymbol{y}$) and D, through a GRL, predicts Output 2 (speaker). $V_y = -\sum_i \log P(y_i|x_i; \theta_E, \theta_G)$, $V_z = -\sum_i \log P(z_i|x_i; \theta_E, \theta_D)$. Model update: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (max classification accuracy and min domain accuracy).
  • 159. • Recognition results of speaker mismatched conditions Lip-reading (DAT-LR) Table 16: Performance of DAT and the baseline. The DAT approach notably enhances the recognition accuracies in different conditions.
  • 160. Outline of Part III Speech Signal Generation • Speech enhancement • Postfilter, speech synthesis, voice conversion Speech Signal Recognition • Speech recognition • Speaker recognition • Speech emotion recognition • Lip reading Conclusion Our Recent Works
  • 161. Speech Signal Generation (Regression Task) G Output Objective function Paired
  • 162. Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task) Encoder E, $g(\cdot)$, maps the input to an embedding (Emb.) $\tilde{\boldsymbol{z}} = g(\tilde{\boldsymbol{x}})$, and classifier G, $h(\cdot)$, produces the output label $\boldsymbol{y}$. Acoustic mismatch: the model is trained on clean data $\boldsymbol{x}$ but deployed on noisy data $\tilde{\boldsymbol{x}}$, accented speech, or speech with channel distortion.
  • 163. References Speech enhancement (conventional methods) • Y.-X. Wang and D.-L. Wang, Cocktail party processing via structured prediction, NIPS 2012. • Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, An experimental study on speech enhancement based on deep neural networks, IEEE SPL, 2014. • Y. Xu, J. Du, L.-R. Dai, and Chin-Hui Lee, A regression approach to speech enhancement based on deep neural networks, IEEE/ACM TASLP, 2015. • X. Lu, Y. Tsao, S. Matsuda, H. Chiroi, Speech enhancement based on deep denoising autoencoder, Interspeech 2012. • Z. Chen, S. Watanabe, H. Erdogan, J. R. Hershey, Integration of speech enhancement and recognition using long- short term memory recurrent neural network, Interspeech 2015. • F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, Speech enhancement with LSTM recurrent neural networks and Its application to noise-robust ASR, LVA/ICA, 2015. • S.-W. Fu, Y. Tsao, and X.-G. Lu, SNR-aware convolutional neural network modeling for speech enhancement, Interspeech, 2016. • S.-W. Fu, Y. Tsao, X.-G. Lu, and Hisashi Kawai, End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks, IEEE/ACM TASLP, 2018. Speech enhancement (GAN-based methods) • P. Santiago, B. Antonio, and S. Joan, SEGAN: Speech enhancement generative adversarial network, Interspeech, 2017. • D. Michelsanti, and Z.-H. Tan, Conditional generative adversarial networks for speech enhancement and noise- robust speaker verification, Interspeech, 2017. • C. Donahue, B. Li, and P. Rohit, Exploring speech enhancement with generative adversarial networks for robust speech recognition, ICASSP, 2018. • T. Higuchi Takuya, K. Kinoshita, D. Marc, and T. Nakatani. Adversarial training for data-driven speech enhancement without parallel Corpus, ASRU, 2017. • S. Pascual, M. Park, J. Serrà, A. Bonafonte, K.-H. Ahn, Language and noise transfer in speech enhancement generative adversarial network, ICASSP 2018.
  • 164. References Speech enhancement (GAN-based methods) • A. Pandey and D. Wang, On adversarial training and loss functions for speech enhancement, ICASSP 2018. • M. H. Soni, Neil Shah, and H. A. Patil, Time-frequency masking-based speech enhancement using generative adversarial network, ICASSP 2018. • Z. Meng, J.-Y. Li, Y.-G. Gong, B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech, 2018. • L.-W. Chen, M. Yu, Y.-M. Qian, D. Su, D. Yu, Permutation invariant training of generative adversarial network for monaural speech separation, Interspeech 2018. • D. Baby and S. Verhulst, SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty, ICASSP 2019.
  • 165. Postfilter (conventional methods) • T. Tod, and K. Tokuda, A speech parameter generation algorithm considering global variance for HMM-based speech synthesis, IEICE Trans. Inf. Syst., 2007. • H. Sil’en, E. Helander, J. Nurminen, and M. Gabbouj, Ways to implement global variance in statistical speech synthesis, Interspeech, 2012. • S. Takamichi, T. Toda, N. Graham, S. Sakriani, and S. Nakamura, A postfilter to modify the modulation spectrum in HMM-based speech synthesis, ICASSP, 2014. • L.-H. Chen, T. Raitio, C. V. Botinhao, J. Yamagishi, and Z.-H. Ling, DNN-based stochastic postfilter for HMM- based speech synthesis, Interspeech, 2014. • L.-H. Chen, T. Raitio, C. V. Botinhao, Z.-H. Ling, and J. Yamagishi, A deep generative architecture for postfiltering in statistical parametric speech synthesis, IEEE/ACM TASLP, 2015. Postfilter (GAN-based methods) • K. Takuhiro, K. Hirokazu, H. Nobukatsu, Y. Ijima, K. Hiramatsu, and K. Kashino, Generative adversarial network- based postfilter for statistical parametric speech synthesis, ICASSP, 2017. • K. Takuhiro, T. Shinji, K. Hirokazu, and J. Yamagishi, Generative adversarial network-based postfilter for STFT spectrograms, Interspeech, 2017. • Y. Saito, S. Takamichi, and H. Saruwatari, Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis, ICASSP, 2017. • Y. Saito, S. Takamichi, H. Saruwatari, Statistical parametric speech synthesis incorporating generative adversarial networks, IEEE/ACM TASLP, 2018. • B. Bollepalli, L. Juvela, and A. Paavo, Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis, Interspeech, 2017. • S. Yang, L. Xie, X. Chen, X.-Y. Lou, X. Zhu, D.-Y. Huang, and H.-Z. Li, Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework, ASRU, 2017. References
  • 166. VC (conventional methods) • T. Toda, A. W. Black, and K. Tokuda, Voice conversion based on maximum likelihood estimation of spectral parameter trajectory, IEEE/ACM TASLP, 2007. • L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, Voice conversion using deep neural networks with layer-wise generative training, IEEE/ACM TASLP, 2014. • S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, Spectral mapping using artificial neural networks for voice conversion, IEEE/ACM TASLP, 2010. • T. Nakashika, T. Takiguchi, Y. Ariki, High-order sequence modeling using speaker-dependent recurrent temporal restricted boltzmann machines for voice conversion, Interspeech, 2014. • K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks, Interspeech, 2017. • Z.-Z. Wu, T. Virtanen, E.-S. Chng, and H.-Z. Li, Exemplar-based sparse representation with residual compensation for voice conversion, IEEE/ACM TASLP, 2014. • S.-. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, Joint dictionary learning-based non-negative matrix factorization for voice conversion to improve speech intelligibility after oral surgery, IEEE TBME, 2017. • Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, Locally linear embedding for exemplar-based spectral conversion, Interspeech, 2016. • C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, Y., and H.-M. Wang, Voice conversion from non-parallel corpora using variational auto-encoder. APSIPA 2016. VC (GAN-based methods) • C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks, Interspeech 2017. • K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks, Interspeech, 2017. References
  • 167. VC (GAN-based methods) • K. Takuhiro, and K. Hirokazu. Parallel-data-free voice conversion using cycle-consistent adversarial networks, arXiv, 2017. • N. Shah, N. J. Shah, and H. A. Patil, Effectiveness of generative adversarial network for non-audible murmur-to- whisper speech conversion, Interspeech, 2018. • J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations, Interspeech, 2018. • G. Degottex, and M. Gales, A spectrally weighted mixture of least square error and wasserstein discriminator loss for generative SPSS, SLT, 2018. • B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, Adaptive wavenet vocoder for residual compensation in GAN-based voice conversion, SLT, 2018. • C.-C. Yeh, P.-C. Hsu, J.-C. Chou, H.-Y. Lee, and L.-S. Lee, Rhythm-flexible voice conversion without parallel data using cycle-GAN over phoneme posteriorgram sequences, SLT, 2018. • H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, STARGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks, SLT, 2018. • K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, Synthetic-to-natural speech waveform conversion using cycle- consistent adversarial networks, SLT, 2018. • O. Ocal, O. H. Elibol, G. Keskin, C. Stephenson, A. Thomas, and K. Ramchandran, Adversarially trained autoencoders for parallel-data-free voice conversion, ICASSP, 2019. • F. Fang, X. Wang, J. Yamagishi, and I. Echizen, Audiovisual speaker conversion: Jointly and simultaneously transforming facial expression and acoustic characteristics, ICASSP, 2019. • S. Seshadri, L. Juvela, J. Yamagishi, Okko Räsänen, and P. Alku, Cycle-consistent adversarial networks for non- parallel vocal effort based speaking style conversion, ICASSP, 2019. • T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, CYCLEGAN-VC2: Improved cyclegan-based non-parallel voice conversion, ICASSP, 2019. • L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, Waveform generation for text-to-speech synthesis using pitch- synchronous multi-scale generative adversarial networks, ICASSP, 2019. References
  • 168. Speaker recognition • Q. Wang, W. Rao, S.-I. Sun, L. Xie, E.-S. Chng, and H.-Z. Li, Unsupervised domain adaptation via domain adversarial training for speaker recognition, ICASSP, 2018. • H. Yu, Z.-H. Tan, Z.-Y. Ma, and J. Guo, Adversarial network bottleneck features for noise robust speaker verification, arXiv, 2017. • G. Bhattacharya, J. Alam, & P. Kenny, Adapting end-to-end neural speaker verification to new languages and recording conditions with adversarial training, ICASSP, 2019. • Z. Peng, S. Feng, & T. Lee, Adversarial multi-task deep features and unsupervised back-end adaptation for language recognition, ICASSP, 2019. • Z. Meng, Y. Zhao, J. Li, & Y. Gong, Adversarial speaker verification, ICASSP, 2019. • X. Fang, L. Zou, J. Li, L. Sun, & Z.-H. Ling, Channel adversarial training for cross-channel text-independent speaker recognition, ICASSP, 2019. • W. Xia, J. Huang, & J. H. Hansen, Cross-lingual text-independent speaker verification using unsupervised adversarial discriminative domain adaptation, ICASSP, 2019. • P. S. Nidadavolu, J. Villalba, & N. Dehak, Cycle-GANs for domain adaptation of acoustic features for speaker recognition, ICASSP, 2019. • G. Bhattacharya, J. Monteiro, J. Alam, & P. Kenny, Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification, ICASSP, 2019. • J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, & O. Plchot, Speaker verification using end-to-end adversarial language adaptation, ICASSP, 2019. • Zhou, J., Jiang, T., Li, L., Hong, Q., Wang, Z., & Xia, B., Training multi-task adversarial network for extracting noise-robust speaker embedding, ICASSP, 2019. • J. Zhang, N. Inoue, & K. Shinoda, I-vector transformation using conditional generative adversarial networks for short utterance speaker verification, arXiv, 2018. • W. Ding, & L. He, Mtgan: Speaker verification through multitasking triplet generative adversarial networks, arXiv, 2018. • X. Miao, I. McLoughlin, S. Yao, & Y. Yan, Improved conditional generative adversarial net classification for spoken language recognition, SLT, 2018. References
  • 169. Automatic Speech Recognition • Yusuke Shinohara, Adversarial multi-task learning of deep neural networks for robust speech recognition, Interspeech, 2016. • D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, Invariant Representations for Noisy Speech Recognition, arXiv, 2016. • Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks, ASRU, 2017. • A. Sriram, H.-W Jun, Y. Gaur, and S. Satheesh, Robust speech recognition using generative adversarial networks, arXiv, 2017. • Z. Meng, Z. Chen, V. Mazalov, J. Li, J., and Y. Gong, Unsupervised adaptation with domain separation networks for robust speech recognition, ASRU, 2017. • Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. Juang, Speaker-invariant training via adversarial learning, ICASSP, 2018. • Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial teacher-student learning for unsupervised domain adaptation, ICASSP, 2018. • Y. Zhang, P. Zhang, and Y. Yan, Improving language modeling with an adversarial critic for automatic speech recognition, Interspeech, 2018. • S. Sun, C. Yeh, M. Ostendorf, M. Hwang, and L. Xie, Training augmentation with adversarial examples for robust speech recognition, Interspeech, 2018. • Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech 2018. • K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, and L. Xie, Investigating generative adversarial networks based speech dereverberation for robust speech recognition, Interspeech 2018. • Z. Meng, J. Li, Y. Gong, B.-H. Juang, Cycle-consistent speech enhancement, Interspeech 2018. • J. Drexler and J. Glass, Combining end-to-end and adversarial training for low-resource speech recognition, SLT, 2018. • A. H. Liu, H. Lee and L. Lee, Adversarial training of end-to-end speech recognition using a criticizing language model, ICASSP, 2019. References
  • 170. Automatic Speech Recognition • J. Yi, J. Tao and Y. Bai, Language-invariant bottleneck features from adversarial end-to-end acoustic models for • low resource speech recognition, ICASSP, 2019. • D. Haws and X. Cui, Cyclegan bandwidth extension acoustic modeling for automatic speech recognition, ICASSP, 2019. • Z. Meng, J. Li, J. and Y. Gong, Attentive adversarial learning for domain-Invariant training, ICASSP, 2019. • Z. Meng, Y. Zhao, J. Li, and Y. Gong, Adversarial speaker verification, ICASSP, 2019. • Z. Meng, Y. Zhao, J. Li, and Y. Gong., Adversarial speaker adaptation, ICASSP, 2019. Emotion recognition • J. Chang, and S. Scherer, Learning representations of emotional speech with deep convolutional generative adversarial networks, ICASSP, 2017. • S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, Adversarial auto-encoders for speech based emotion recognition. Interspeech, 2017. • S. Sahu, R. Gupta, and C. E.-Wilson, On enhancing speech emotion recognition using generative adversarial networks, Interspeech 2018. • C.-M. Chang, and C.-C. Lee, Adversarially-enriched acoustic code vector learned from out-of-context affective corpus for robust emotion recognition, ICASSP 2019. • J. Liang, S. Chen, J. Zhao, Q. Jin, H. Liu, and L. Lu, Cross-culture multimodal emotion recognition with adversarial learning, ICASSP 2019. Lipreading • M. Wand, and J. Schmidhuber, Improving speaker-independent lipreading with domain-adversarial training, arXiv, 2017. References
  • 171. Outline of Part III Our Recent Works • Noise adaptive speech enhancement [Interspeech 2019] • MetricGAN for speech enhancement [ICML 2019] • Multi-Target voice conversion [Interspeech 2018] • Impaired speech conversion [Interspeech 2019] • Pathological voice detection [NeurIPS workshop 2018] [Mon-P-2-A] [Wed-P-6-E]
  • 172. Speech Enhancement • Noise Adaptive Speech Enhancement (NA-SE) [Liao et al., Interspeech 2019] [Wed-P-6-E] Training covers several noise types (N4, N5, N7, N9, N10, N12) while the test noise (e.g. N11) is unseen. Encoder E and generator G are trained to minimize the reconstruction error $V_y$: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$, $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E}$.
  • 173. Speech Enhancement (NA-SE) • Domain adversarial training for NA-SE: in addition to the reconstruction loss $V_y$, a domain discriminator D predicts the noise type from the embedding $\boldsymbol{z}$ with loss $V_z$, connected to E through a GRL. Updates: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (min reconstruction error), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}$ (min reconstruction error and min domain accuracy).
  • 174. Speech Enhancement (NA-SE) • Objective evaluations The DAT-based unsupervised adaptation can notably overcome the mismatch issue of training and testing noise types. Fig. 15: PESQ at different SNR levels.
  • 175. • GAN for spectral magnitude mask estimation (MMS-GAN) [Pandey et al., ICASSP 2018] D Scalar Ref. mask Noisy (Fake/Real) Output mask Noisy G Noisy Output mask Ref. mask Speech Enhancement
  • 176. Speech Enhancement • MetricGAN for Speech Enhancement [Fu et al., ICML 2019] G takes the noisy spectrogram and outputs a mask; point-wise multiplication gives the enhanced spectrogram. D is trained to predict the evaluation metric score (0~1) of its input against the clean spectrogram, e.g. 0.4 for an enhanced spectrogram and 1.0 for the clean spectrogram itself.
  • 177. Speech Enhancement (MetricGAN) With MetricGAN, we have the freedom to specify the target metric scores (PESQ or STOI) for the generated speech.
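A minimal sketch of this idea (an assumption of the general scheme, not the authors' code): the discriminator regresses a normalized metric score computed outside the graph, and the generator is trained so that its masked output is assigned a chosen target score:

```python
import torch
import torch.nn.functional as F

def d_step(D, clean, enhanced, q_enhanced):
    # q_enhanced: normalized metric score (e.g. PESQ/STOI in [0, 1]) of (enhanced, clean).
    return F.mse_loss(D(enhanced, clean), q_enhanced) + \
           F.mse_loss(D(clean, clean), torch.ones_like(q_enhanced))   # clean vs. clean scores 1.0

def g_step(D, G, noisy, clean, target_score=1.0):
    enhanced = G(noisy) * noisy                         # mask applied point-wise to the noisy input
    pred = D(enhanced, clean)
    return F.mse_loss(pred, torch.full_like(pred, target_score))
```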
  • 178. Voice Conversion • Multi-target VC [Chou et al., Interspeech 2018] ➢ Stage-1: an encoder Enc extracts enc(x), adversarially trained against a speaker classifier C, and a decoder Dec reconstructs dec(enc(x), y) for the original speaker y or converts with another speaker y′, i.e. dec(enc(x), y′). ➢ Stage-2: a generator G refines the converted output y″, trained against a discriminator D+C on real data x that judges real/fake (F/R) and speaker ID.
  • 179. • Subjective evaluations Voice Conversion (Multi-target VC) Fig. 16: Preference test results 1. The proposed method uses non-parallel data. 2. The multi-target VC approach outperforms one-stage only. 3. The multi-target VC approach is comparable to Cycle-GAN-VC in terms of the naturalness and the similarity.
  • 180. • Controller-generator-discriminator VC on Impaired Speech [Chen et al., Interspeech 2019] Voice Conversion Previous applications: hearing aids; murmur to normal speech; bone-conductive microphone to air-conductive microphone. Proposed: improving the speech intelligibility of surgical patients. Target: oral cancer (one of the top five cancers for males in Taiwan). Before / After examples. [Mon-P-2-A]
  • 181. • Controller-generator-discriminator VC (CGD VC) on impaired speech [Chen et al., Interspeech 2019] Voice Conversion GD Controller
  • 182. Voice Conversion (CGD VC) • Spectrogram analysis Fig. 17: Spectrogram comparison of CGD with CycleGAN.
  • 183. • Subjective evaluations Voice Conversion (CGD VC) The proposed method outperforms conditional GAN and CycleGAN in terms of content similarity, speaker similarity, and articulation. Fig. 18: MOS for content similarity, speaker similarity, and articulation.
  • 184. Pathological Voice Detection • Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach [Fang et al., Journal of Voice 2018] Table 17: Detection performance based on voice.
              GMM     SVM     DNN
    MEEI      98.28   98.26   99.14
    FEMH (M)  90.24   93.04   94.26
    FEMH (F)  90.20   87.40   90.52
  • 185. Pathological Voice Detection • Robustness Against Channel [Hsu et al., NeurIPS Workshop 2018] The detector (encoder E, classifier G with loss $V_y$ on label $\boldsymbol{y}$, domain discriminator D with loss $V_z$) is trained with domain adversarial training (DAT). Table 18: Detection results of sup. and unsup. DAT under channel mismatches.
              DNN (S)   DNN (T)   DNN (FT)   Unsup. DAT   Sup. DAT
    PR-AUC    0.8848    0.8509    0.9021     0.9455       0.9522
  The unsupervised DAT notably increased the robustness against channel effects and produced results comparable to supervised DAT.
  • 186. • C.-F. Liao, Y. Tsao, H.-Y. Lee and H.-M. Wang, Noise adaptive speech enhancement using domain adversarial training, Interspeech 2019. • J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee. "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. Interspeech 2018. • L.-W. Chen, H.-Y. Lee, and Y. Tsao, Generative adversarial networks for unpaired voice transformation on impaired speech, Interspeech 2019. • S.-W. Fu, C.-F. Liao, Y. Tsao, S.-D. Lin, MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement, ICML, 2019. • C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, Y. Tsao, Detection of pathological voice using cepstrum vectors: a deep learning approach, Journal of Voice, 2018. • S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, Demographic and symptomatic features of voice disorders and their potential application in classification using machine learning algorithms, Folia Phoniatrica et Logopaedica, 2018. • S.-H. Fang, C.-T. Wang, J.-Y. Chen, Y. Tsao and F.-C. Lin, Combining acoustic signals and medical records to improve pathological voice classification, APSIPA, 2019. • Y.-T. Hsu, Z. Zhu, C.-T. Wang, S.-H. Fang, F. Rudzicz, and Y. Tsao, Robustness against the channel effect in pathological voice detection, NeurIPS 2018 Machine Learning for Health (ML4H) Workshop, 2018. References
  • 187. Thank You Very Much Tsao, Yu Ph.D., Academia Sinica yu.tsao@citi.sinica.edu.tw Generative Adversarial Network and its Applications to Signal Processing and Natural Language Processing Part III: Speech Signal Processing
  • 189. NLP tasks usually involve Sequence Generation How to use GAN to improve sequence generation?
  • 190. Outline of Part IV Sequence Generation by GAN Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation • Unsupervised Speech Recognition
  • 191. Why we need GAN? • Chat-bot as example Encoder Decoder Input sentence c output sentence x Training data: A: How are you ? B: I’m good. ………… Input: How are you ? Seq2seq outputs: “Not bad” / “I’m John.” Training criterion: maximize likelihood. To a human, “Not bad” is the better response, but maximum likelihood can prefer “I’m John.”
  • 192. Reinforcement Learning Human Input sentence c response sentence x Chatbot En De response sentence x Input sentence c [Li, et al., EMNLP, 2016] reward 𝑅 𝑐, 𝑥 Learn to maximize expected reward E.g. Policy Gradient human “How are you?” “Not bad” “I’m John” -1+1
  • 193. Policy Gradient At iteration t with parameters $\theta^t$, sample $(c^1, x^1), (c^2, x^2), \ldots, (c^N, x^N)$ and obtain rewards $R(c^1, x^1), \ldots, R(c^N, x^N)$. Update: $\theta^{t+1} \leftarrow \theta^t + \eta \nabla \bar{R}_{\theta^t}$, where $\nabla \bar{R}_{\theta^t} = \frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\, \nabla \log P_{\theta^t}(x^i|c^i)$. If $R(c^i, x^i)$ is positive, update $\theta$ to increase $P_\theta(x^i|c^i)$; if $R(c^i, x^i)$ is negative, update $\theta$ to decrease $P_\theta(x^i|c^i)$.
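A minimal sketch of this update for a seq2seq chatbot; model.log_prob is a hypothetical helper returning the summed log-probability of a sampled response given its input sentence:

```python
import torch

def policy_gradient_step(model, optimizer, contexts, sampled_responses, rewards):
    # log_prob(c, x): sum of log P(x_t | x_<t, c) over the tokens of the sampled response x.
    log_probs = torch.stack([model.log_prob(c, x) for c, x in zip(contexts, sampled_responses)])
    rewards = torch.as_tensor(rewards, dtype=log_probs.dtype, device=log_probs.device)
    loss = -(rewards * log_probs).mean()      # gradient ascent on (1/N) sum_i R(c_i, x_i) log P(x_i|c_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```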
  • 194. Policy Gradient Maximum Likelihood vs. Reinforcement Learning (Policy Gradient). Training data: MLE uses labelled pairs $(c^1, \hat{x}^1), \ldots, (c^N, \hat{x}^N)$; policy gradient uses pairs $(c^1, x^1), \ldots, (c^N, x^N)$ obtained from interaction. Objective function: MLE maximizes $\frac{1}{N}\sum_{i=1}^{N} \log P_\theta(\hat{x}^i|c^i)$; policy gradient maximizes $\frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\log P_\theta(x^i|c^i)$. Gradient: $\frac{1}{N}\sum_{i=1}^{N} \nabla\log P_\theta(\hat{x}^i|c^i)$ vs. $\frac{1}{N}\sum_{i=1}^{N} R(c^i, x^i)\nabla\log P_\theta(x^i|c^i)$, i.e. each sample is weighted by $R(c^i, x^i)$; MLE is the special case where $R(c^i, \hat{x}^i) = 1$.
  • 195. Conditional GAN Discriminator Input sentence c response sentence x Chatbot En De response sentence x Input sentence c reward 𝑅 𝑐, 𝑥 I am busy. Replace human evaluation with machine evaluation [Li, et al., EMNLP, 2017] However, there is an issue when you train your generator.
  • 196. The generator samples a word sequence (A A B A B …) starting from <BOS>, with inputs obtained by attention, and the discriminator outputs a scalar. Can we use gradient ascent to update the generator parameters? NO!
  • 197. The generator samples a word sequence (A A B A B …) starting from <BOS>, with inputs obtained by attention, and the discriminator outputs a scalar. Can we use gradient ascent to update the generator parameters? NO! The sampling step is a non-differentiable part.
  • 198. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] Reinforcement Learning • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 200. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] Reinforcement Learning • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 201. A A A B A B A A B B B <BOS> Use the distribution as the input of discriminator Avoid the sampling process Discriminator scalar Update Parameters We can do backpropagation now.
  • 202. What is the problem? • A real sentence is a sequence of one-hot vectors: 1 0 0 0 0 / 0 1 0 0 0 / 0 0 1 0 0 / 0 0 0 1 0 / 0 0 0 0 1 • The generated word distributions are soft: 0.9 0.1 0 0 0 / 0.1 0.9 0 0 0 / 0.1 0.1 0.7 0.1 0 / 0 0 0.1 0.8 0.1 / 0 0 0 0.1 0.9 They can never be exactly 1-hot, so the discriminator can immediately find the difference. A discriminator with a constraint (e.g. WGAN) can be helpful.
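A minimal sketch of the Gumbel-softmax workaround from the first category listed earlier, using PyTorch's built-in gumbel_softmax; the shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)          # e.g. 4 time steps, vocabulary of 10 words
y_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)   # continuous relaxation (soft distributions)
y_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)    # straight-through: one-hot forward, soft backward
# y_hard can be fed to the discriminator while gradients still reach `logits`.
```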
  • 203. Three Categories of Solutions Gumbel-softmax • [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019] Continuous Input for Discriminator • [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017] Reinforcement Learning • [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
  • 204. A A A B A B A A B B B <BOS> Discriminator scalar Generator = Agent in RL Actions taken Environment Reward Trained by RL algorithm (e.g. Policy Gradient) The reward function may change → Different from typical RL
  • 205. Tips for Sequence Generation GAN RL is difficult to train. GAN is difficult to train. Sequence Generation GAN (RL+GAN) combines both.
  • 206. Tips for Sequence Generation GAN • Usually the generator is fine-tuned from a model learned by maximum likelihood. • However, with enough hyperparameter tuning and tips, ScratchGAN can train from scratch. [Cyprien de Masson d'Autume, et al., arXiv 2019]
  • 207. Tips for Sequence Generation GAN • Typical • Reward for Every Generation Step Discriminator Chatbot En De You is good Discriminator Chatbot En De 0.9 0.1 0.1 0.1 You You is You is good I don’t know which part is wrong …
  • 208. Tips for Sequence Generation GAN • Reward for Every Generation Step Discriminator Chatbot En De 0.9 0.1 0.1 You You is You is good Method 1. Monte Carlo (MC) Search [Yu, et al., AAAI, 2017] Method 2. Discriminator For Partially Decoded Sequences [Li, et al., EMNLP, 2017] Method 3. Step-wise evaluation [Tuan, Lee, TASLP, 2019][Xu, et al., EMNLP, 2018][William Fedus, et al., ICLR, 2018]
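A minimal sketch of the Monte Carlo search idea (an assumption of the general scheme, not the SeqGAN code; generator.rollout is a hypothetical sampling helper): a partial sequence is scored by rolling it out to completion several times and averaging the discriminator scores:

```python
import torch

def mc_step_reward(generator, discriminator, context, partial_seq, n_rollouts=8):
    scores = []
    for _ in range(n_rollouts):
        full_seq = generator.rollout(context, partial_seq)   # complete the partial sequence by sampling
        scores.append(discriminator(context, full_seq))      # score the completed sequence
    return torch.stack(scores).mean()                        # reward assigned to the last generated token
```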
  • 209. Empirical Performance • MLE frequently generates “I’m sorry”, “I don’t know”, etc. (corresponding to fuzzy images?) • GAN generates longer and more complex responses. • Find more comparison in the survey papers. • [Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018] • However, no strong evidence shows that GANs are better than MLE. • [Stanislau Semeniuta, et al., arXiv, 2018] [Guy Tevet, et al., arXiv, 2018] [Massimo Caccia, et al., arXiv, 2018]
  • 210. More Applications • Supervised machine translation [Wu, et al., arXiv 2017][Yang, et al., arXiv 2017] • Supervised abstractive summarization [Liu, et al., AAAI 2018] • Image/video caption generation [Rakshith Shetty, et al., ICCV 2017][Liang, et al., arXiv 2017] • Data augmentation for code-switching ASR [Mon-P- 1-D] [Chang, et al., INTERSPEECH 2019] If you are trying to generate some sequences, you can consider GAN.
  • 211. Outline of Part IV Sequence Generation by GAN Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation • Unsupervised Speech Recognition
  • 212. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 213. Cycle-GAN 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 scalar: belongs to domain Y or not scalar: belongs to domain X or not
  • 214. Cycle-GAN 𝐺 𝑋→𝑌 𝐺Y→X as close as possible 𝐺Y→X 𝐺 𝑋→𝑌 as close as possible 𝐷 𝑌𝐷 𝑋 negative sentence? positive sentence? It is bad. It is good. It is bad. I love you. I hate you. I love you. positive positive positivenegative negative negative Non-differentiable Issue? You already know how to deal with it.
  • 215. ✘ Negative sentence to positive sentence: it's a crappy day -> it's a great day i wish you could be here -> you could be here it's not a good idea -> it's good idea i miss you -> i love you i don't love you -> i love you i can't do that -> i can do that i feel so sad -> i happy it's a bad day -> it's a good day it's a dummy day -> it's a great day sorry for doing such a horrible thing -> thanks for doing a great thing my doggy is sick -> my doggy is my doggy my little doggy is sick -> my little doggy is my little doggy Cycle GAN (Thanks to 王耀賢 for providing the experimental results.) [Lee, et al., ICASSP, 2018]
  • 216. 𝐸𝑁 𝑋 𝐸𝑁𝑌 𝐷𝐸 𝑌 𝐷𝐸 𝑋 𝐷 𝑋 𝐷 𝑌 Discriminator of X domain Discriminator of Y domain Shared Latent Space Positive Sentence Positive Sentence Negative Sentence Negative Sentence Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017] From 𝐸𝑁𝑋 or 𝐸𝑁𝑌 Domain Discriminator 𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator [Zhao, et al., arXiv, 2017] [Fu, et al., AAAI, 2018]
  • 217. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 218. Abstractive Summarization • Now machine can do abstractive summary by seq2seq (write summaries in its own words) summary 1 summary 2 summary 3 Training Data summary seq2seq (in its own words) Supervised: We need lots of labelled training data.
  • 219. Unsupervised Abstractive Summarization • Now machine can do abstractive summary by seq2seq (write summaries in its own words) summary 1 summary 2 summary 3 seq2seq document Domain Y Domain X[Wang, et al., EMNLP, 2018]
  • 220. G Seq2seq document word sequence D Human written summaries Real or not Discriminator Unsupervised Abstractive Summarization Summary?
  • 221. G Seq2seq document word sequence D Human written summaries Real or not Discriminator R Seq2seq document Unsupervised Abstractive Summarization minimize the reconstruction error
  • 222. Unsupervised Abstractive Summarization G R Summary? Seq2seq Seq2seq document document word sequence Only need a lot of documents to train the model This is a seq2seq2seq auto-encoder. Using a sequence of words as latent representation. not readable …
  • 223. Unsupervised Abstractive Summarization G R Seq2seq Seq2seq word sequence D Human written summaries Real or not Discriminator Let Discriminator considers my output as real document document Summary? Readable
  • 224. Experimental results English Gigaword (Document title as summary)
                                     ROUGE-1   ROUGE-2   ROUGE-L
    Supervised                       33.2      14.2      30.5
    Trivial                          21.9      7.7       20.5
    Unsupervised (matched data)      28.1      10.0      25.4
    Unsupervised (no matched data)   27.2      9.1       24.1
  • Matched data: using the titles of English Gigaword to train the Discriminator • No matched data: using the titles of CNN/Daily Mail to train the Discriminator [Wang, Lee, EMNLP 2018]
  • 225. Semi-supervised Learning Fig.: ROUGE-1 versus the number of document-summary pairs used (0, 10k, 500k) for the WGAN and Reinforce variants (the two approaches to deal with the discrete issue), moving from unsupervised to semi-supervised; the fully supervised model uses 3.8M pairs. [Wang, Lee, EMNLP 2018]
  • 226. More Unsupervised Summarization • Unsupervised summarization with language prior • Unsupervised multi-document summarization [Eric Chu, Peter Liu, ICML 2019] [Christos Baziotis, etc al., NAACL 2019]
  • 227. G Input Sentence D Said by Trump? Discriminator R Dialogue Response Generation minimize the reconstruction error Make the US great again I would build a great wall you are fired What Trump has said Chat Bot Generated Response Input Sentence (Reconstruct) [Su, et al., INTERSPEECH, 2019] (Thu-P-9-C) General Dialogues
  • 228. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 229. Unsupervised learning with 10M sentences ≈ supervised learning with 100K sentence pairs. [Alexis Conneau, et al., ICLR, 2018] [Guillaume Lample, et al., ICLR, 2018]
  • 230. male female positive sentences negative sentences Language 1 Audio Text summarydocument Part I Part III Language 2 Text Style Transfer Unsupervised Abstractive Summarization Unsupervised ASRUnsupervised Translation
  • 231. Towards Unsupervised ASR - Cycle GAN G ASR Text R TTS D Real Text? Discriminator minimize the reconstruction error (speech chain) how are you good morning i am fine Real Text [Andros Tjandra, et al., ASRU 2017] [Liu, et al., INTERSPEECH 2018] [Yeh, et al., ICLR 2019] [Chen, et al., INTERSPEECH 2019]
  • 232. Towards Unsupervised ASR - Cycle GAN • Unsupervised setting on TIMIT (text and audio are unpair, text is not the transcription of audio) • 63.6% PER (oracle boundaries) • 41.6% PER (automatic segmentation) • 33.1% PER (automatic segmentation) • Semi-supervised setting on Librispeech [Liu, et al., INTERSPEECH 2018] [Yeh, et al., ICLR 2019] (Tue-P-4-B)[Chen, et al., INTERSPEECH 2019] [Liu, et al., ICASSP 2019] [Tomoki Hayashi, et al., SLT 2018] [Takaaki Hori, et al., ICASSP 2019] [Murali Karthick Baskar, et al., INTERSPEECH 2019]
  • 233. Towards Unsupervised ASR - Shared Latent Space Text Encoder Audio Encoder Audio Decoder Text Decoder this is text this is text Unsupervised setting on Librispeech: 76.3% WER WSJ with 2.5 hours paired data: 64.6% WER LJ speech with 20 mins paired data: 11.7% PER [Chen, et al., SLT 2018] Unsupervised speech translation is also possible! [Chung, et al., NIPS 2018] [Jennifer Drexler, et al., SLT 2018] [Ren, et al., ICML 2019] [Chung, et al., ICASSP 2019]
  • 234. Outline of Part IV Sequence Generation by GAN Unsupervised Conditional Sequence Generation • Text Style Transfer • Unsupervised Abstractive Summarization • Unsupervised Translation • Unsupervised Speech Recognition
  • 235. To Learn More … https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf You can learn more from the YouTube Channel (in Mandarin)
  • 236. Reference • Sequence Generation • Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, Deep Reinforcement Learning for Dialogue Generation, EMNLP, 2016 • Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, Adversarial Learning for Neural Dialogue Generation, EMNLP, 2017 • Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution, arXiv 2016 • Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative Adversarial Networks, arXiv 2017 • Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, AAAI 2017 • Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron Courville, Adversarial Generation of Natural Language, arXiv, 2017 • Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf, Language Generation with Recurrent Generative Adversarial Networks without Pre- training, ICML workshop, 2017
  • 237. Reference • Sequence Generation • Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang, Zhuoran Wang, Chao Qi , Neural Response Generation via GAN with an Approximate Embedding Layer, EMNLP, 2017 • Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio, Professor Forcing: A New Algorithm for Training Recurrent Networks, NIPS, 2016 • Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin, Adversarial Feature Matching for Text Generation, ICML, 2017 • Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang, Long Text Generation via Adversarial Training with Leaked Information, AAAI, 2018 • Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun, Adversarial Ranking for Language Generation, NIPS, 2017 • William Fedus, Ian Goodfellow, Andrew M. Dai, MaskGAN: Better Text Generation via Filling in the______, ICLR, 2018
  • 238. Reference • Sequence Generation • Yi-Lin Tuan, Hung-Yi Lee, Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation, TASLP, 2019 • Jingjing Xu, Xuancheng Ren, Junyang Lin, Xu Sun, Diversity-Promoting GAN: A Cross-Entropy Based Generative Adversarial Network for Diversified Text Generation, EMNLP, 2018 • Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu, Neural Text Generation: Past, Present and Beyond, arXiv, 2018 • Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, Yong Yu, Texygen: A Benchmarking Platform for Text Generation Models, arXiv, 2018 • Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly, On Accurate Evaluation of GANs for Language Generation, arXiv, 2018 • Guy Tevet, Gavriel Habib, Vered Shwartz, Jonathan Berant, Evaluating Text GANs as Language Models, arXiv, 2018 • Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, Laurent Charlin, Language GANs Falling Short, arXiv, 2018
  • 239. Reference • Sequence Generation • Zhen Yang, Wei Chen, Feng Wang, Bo Xu, Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets, NAACL, 2018 • Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu, Adversarial Neural Machine Translation, arXiv 2017 • Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li, Generative Adversarial Network for Abstractive Text Summarization, AAAI 2018 • Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele, Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training, ICCV 2017 • Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing, Recurrent Topic-Transition GAN for Visual Paragraph Generation, arXiv 2017 • Weili Nie, Nina Narodytska, Ankit Patel, RelGAN: Relational Generative Adversarial Networks for Text Generation, ICLR 2019
  • 240. Reference • Sequence Generation • Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee, "Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation", INTERSPEECH 2019 • Cyprien de Masson d'Autume, Mihaela Rosca, Jack Rae, Shakir Mohamed, Training language GANs from Scratch, arXiv 2019
  • 241. Reference • Text Style Transfer • Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Rui Yan, Style Transfer in Text: Exploration and Evaluation, AAAI, 2018 • Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola, Style Transfer from Non-Parallel Text by Cross-Alignment, NIPS 2017 • Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis, ICASSP, 2018 • Junbo (Jake) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun, Adversarially Regularized Autoencoders, arxiv, 2017 • Feng-Guang Su, Aliyah Hsu, Yi-Lin Tuan and Hung-yi Lee, "Personalized Dialogue Response Generation Learned from Monologues", INTERSPEECH, 2019
  • 242. Reference • Unsupervised Abstractive Summarization • Yau-Shian Wang, Hung-Yi Lee, "Learning to Encode Text as Human- Readable Summaries using Generative Adversarial Networks", EMNLP, 2018 • Eric Chu, Peter Liu, “MeanSum: A Neural Model for Unsupervised Multi- Document Abstractive Summarization”, ICML, 2019 • Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros Potamianos, “SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression”, NAACL 2019
  • 243. Reference • Unsupervised Machine Translation • Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou, Word Translation Without Parallel Data, ICRL 2018 • Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only, ICRL 2018
  • 244. Reference • Unsupervised Speech Recognition • Alexander H. Liu, Hung-yi Lee, Lin-shan Lee, Adversarial Training of End-to- end Speech Recognition Using a Criticizing Language Model, ICASSP 2018 • Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings, INTERSPEECH, 2018 • Kuan-yu Chen, Che-ping Tsai, Da-Rong Liu, Hung-yi Lee and Lin-shan Lee, "Completely Unsupervised Phoneme Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models", INTERSPEECH, 2019 • Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-yi Lee, Lin-shan Lee, "Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval", SLT, 2018 • Chih-Kuan Yeh, Jianshu Chen, Chengzhu Yu, Dong Yu, Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching, ICLR, 2019
  • 245. Reference • Unsupervised Speech Recognition • Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux, Cycle-consistency training for end-to-end speech recognition, ICASSP 2019 • Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký, Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text, INTERSPEECH 2019 • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, Listening while Speaking: Speech Chain by Deep Learning, ASRU 2017 • Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces, NIPS, 2018 • Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Towards Unsupervised Speech-to-Text Translation, ICASSP 2019 • Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu, Almost Unsupervised Text to Speech and Automatic Speech Recognition, ICML 2019
  • 246. Reference • Unsupervised Speech Recognition • Shigeki Karita , Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, Marc Delcroix, Semi-Supervised End-to-End Speech Recognition, INTERSPEECH, 2018 • Jennifer Drexler, James R. Glass, “Combining End-to-End and Adversarial Training for Low-Resource Speech Recognition”, SLT 2018 • Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda, Back-Translation-Style Data Augmentation for End-to-End ASR, SLT, 2018
  • 247. Please download the latest slides here: http://speech.ee.ntu.edu.tw/~tlkagk/GAN_3hour.pdf