Improved Training of Wasserstein GANs
Sangwoo Mo
KAIST ALIN Lab.
July 04, 2018
1
Table of Contents
Review for GANs
Improved Training of WGANs
2
Table of Contents
Review for GANs
Improved Training of WGANs
3
Generative Adversarial Networks (GANs)
A generative model aims to learn a model distribution pθ(x) that
matches the target distribution p(x)
Usually we assume x ∼ pθ(x) is given by a deterministic mapping
x = Gθ(z) of a simple noise z ∼ p(z)
* Figure from OpenAI blog.
4
Generative Adversarial Networks (GANs)
Q. How to train a generative model?
Explicit model: directly optimize the objective (e.g. MLE)
For example, PixelCNN maximizes
log pθ(x) = ∑_{i=1}^n log pθ(xi | x1:i−1)
5
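To make the factorization concrete, here is a minimal autoregressive-likelihood sketch in PyTorch; the toy LSTM over binary "pixels", the layer sizes, and the data shapes are illustrative assumptions, not the actual PixelCNN architecture.

```python
import torch
import torch.nn as nn

# Toy autoregressive model: maximizes log p_theta(x) = sum_i log p_theta(x_i | x_{1:i-1}).
class ToyAR(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def log_prob(self, x):                          # x: (batch, n) of 0/1 values
        # Condition pixel i on pixels 1..i-1 by shifting the sequence right.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.rnn(prev.unsqueeze(-1))
        logits = self.out(h).squeeze(-1)            # logit of p_theta(x_i = 1 | x_{1:i-1})
        return torch.distributions.Bernoulli(logits=logits).log_prob(x).sum(dim=1)

x = torch.randint(0, 2, (8, 16)).float()            # 8 toy images of 16 binary pixels
model = ToyAR()
loss = -model.log_prob(x).mean()                    # MLE: minimize the negative log-likelihood
loss.backward()
```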
Generative Adversarial Networks (GANs)
Q. How to train a generative model?
Explicit model: directly optimize the objective (e.g. MLE)
Implicit model1: learning by comparison
Idea of GAN
Train a discriminator D which compares p(x) and pθ(x)
Train a generator G using the signal from D
1
We do not know pθ(x) explicitly; we can only sample from it.
6
Generative Adversarial Networks (GANs)
What is happening in GAN?
GAN plays a minimax game between G and D:
min_G max_D V(G, D) where
V(G, D) = Ex∼p(x)[log D(x)] + Ez∼p(z)[log(1 − D(G(z)))]
For a given G, the optimal D∗ is
D∗(x) = p(x) / (p(x) + pθ(x))
7
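A minimal sketch of one training step for the minimax objective above, assuming a toy one-hidden-layer G and D on synthetic data; the network sizes, learning rates, and use of Adam are illustrative choices, not prescribed by the slides.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 8, 2
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.randn(128, x_dim)     # stand-in for samples x ~ p(x)
z = torch.randn(128, z_dim)          # z ~ p(z)

# Discriminator ascends V(G, D): maximize E[log D(x)] + E[log(1 - D(G(z)))].
d_loss = -(torch.log(D(x_real)) + torch.log(1 - D(G(z).detach()))).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator descends V(G, D): minimize E[log(1 - D(G(z)))].
g_loss = torch.log(1 - D(G(z))).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```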
Generative Adversarial Networks (GANs)
What is happening in GAN?
Plugging D∗ into the objective, we have
C(G) = max_D V(G, D)
     = KL(p ∥ (p + pθ)/2) + KL(pθ ∥ (p + pθ)/2) + const
     = 2 · JSD(p ∥ pθ) + const
Hence, GAN minimizes a lower bound of the JSD
8
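A quick numerical check of the identity above on toy discrete distributions (the specific p and pθ below are arbitrary assumptions); it verifies that plugging D∗ into V(G, D) gives 2 · JSD(p ∥ pθ) − log 4.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # toy target distribution p(x)
p_theta = np.array([0.2, 0.3, 0.5])  # toy model distribution p_theta(x)
m = (p + p_theta) / 2

kl = lambda a, b: np.sum(a * np.log(a / b))
jsd = 0.5 * kl(p, m) + 0.5 * kl(p_theta, m)

d_star = p / (p + p_theta)           # optimal discriminator D*(x)
v_opt = np.sum(p * np.log(d_star)) + np.sum(p_theta * np.log(1 - d_star))

print(np.isclose(v_opt, 2 * jsd - np.log(4)))   # True: the constant is -log 4
```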
Generative Adversarial Networks (GANs)
What is happening in GAN?
In practice, GAN suffers from vanishing gradients
To avoid this problem, we minimize − log D(G(z)) instead
Plugging in D∗, we have
C(G) = KL(pθ ∥ p) − 2 · JSD(p ∥ pθ) + const
Hence, it minimizes a lower bound of the reverse KL
9
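In code, the swap is a one-line change to the generator update in the earlier sketch (same toy G, D, and z; this is the standard non-saturating heuristic).

```python
# Non-saturating generator loss: minimize -E[log D(G(z))]
# instead of E[log(1 - D(G(z)))]; the discriminator step is unchanged.
g_loss = -torch.log(D(G(z))).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```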
Wasserstein GANs (WGANs)
Why is GAN unstable?
Supports of p(x) and pθ(x) are disjoint1 a.s.
Then
JSD(p ∥ pθ) = log 2
KL(p ∥ pθ) = KL(pθ ∥ p) = +∞
The loss does not provide useful information
Solution
1. Add noise to overlap supports
2. Use better divergence
1
They lie on low-dimensional manifolds.
10
Wasserstein GANs (WGANs)
Toy example
Let z ∼ U[0, 1] and x = (0, z) ∼ p(x)
Let Gθ(z) = (θ, z), so pθ = p exactly when θ = 0
* Figure from Lilian Weng’s blog.
11
Wasserstein GANs (WGANs)
Toy example
Here, the Wasserstein distance is
W(p, pθ) = |θ|
Unlike JSD and KL, it tells us how close pθ is to p
* Figure from WGAN paper.
12
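A quick numerical check of the toy example, assuming SciPy is available; since the second coordinate is identical under p and pθ, the 2-D Wasserstein distance reduces to the 1-D distance between the first coordinates, while JSD(p ∥ pθ) stays at log 2 for every θ ≠ 0.

```python
import numpy as np
from scipy.stats import wasserstein_distance

for theta in [1.0, 0.5, 0.1, 0.0]:
    first_p = np.zeros(1000)            # first coordinate under p: always 0
    first_q = np.full(1000, theta)      # first coordinate under p_theta: always theta
    print(theta, wasserstein_distance(first_p, first_q))   # prints |theta|
```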
Wasserstein GANs (WGANs)
Wasserstein distance
The Wasserstein-1 distance is
W(p, q) = inf_{γ∈Π(p,q)} E(x,y)∼γ[‖x − y‖]
Relation between divergences (ordered from weakest to strongest):
W (= convergence in distribution) < JSD = TV < KL
13
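The primal definition can be evaluated directly on small discrete distributions by solving for the coupling with a linear program; the points, distributions, and use of scipy.optimize.linprog below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

xs = np.arange(4)                        # support points {0, 1, 2, 3}
p = np.array([0.5, 0.5, 0.0, 0.0])       # toy p
q = np.array([0.0, 0.0, 0.5, 0.5])       # toy q
C = np.abs(xs[:, None] - xs[None, :]).astype(float)   # cost c(x, y) = |x - y|

# Minimize <C, gamma> over couplings gamma with marginals p and q, gamma >= 0.
n = len(xs)
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0     # row sums = p (first marginal)
    A_eq[n + i, i::n] = 1.0              # column sums = q (second marginal)
b_eq = np.concatenate([p, q])
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("W1(p, q) =", res.fun)             # 2.0: each unit of mass moves a distance of 2
```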
Wasserstein GANs (WGANs)
How to minimize Wasserstein distance?
The Wasserstein-1 distance has a dual form:
W(p, q) = sup_{f∈F} Ex∼p(x)[f(x)] − Ex∼q(x)[f(x)]
where F is the set of 1-Lipschitz functions
Hence, the objective of WGAN is
min_G max_{D∈D} Ex∼p(x)[D(x)] − Ez∼p(z)[D(G(z))]
To enforce the Lipschitz constraint, WGAN uses weight clipping
14
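A minimal sketch of one WGAN step with weight clipping, reusing the toy setup from the earlier GAN sketch; the critic has no sigmoid, the clip value 0.01 and RMSProp learning rate 5e-5 follow the WGAN paper's defaults, and everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

z_dim, x_dim, clip = 8, 2, 0.01
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))   # critic: no sigmoid
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)

x_real = torch.randn(128, x_dim)
z = torch.randn(128, z_dim)

# Critic: maximize E_p[D(x)] - E_pz[D(G(z))], i.e. minimize its negation.
d_loss = -(D(x_real).mean() - D(G(z).detach()).mean())
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
for w in D.parameters():                 # weight clipping to keep D (roughly) Lipschitz
    w.data.clamp_(-clip, clip)

# Generator: minimize -E_pz[D(G(z))].
g_loss = -D(G(z)).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```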
Table of Contents
Review for GANs
Improved Training of WGANs
15
Motivation
Motivation: weight clipping leads to optimization difficulties
1 It restricts the critic to an overly simple function class
2 It causes exploding or vanishing gradients
16
Observation
Theorem 1
Let (x, y) ∼ γ∗ where γ∗ is the optimal coupling and f∗ is the optimal
1-Lipschitz critic. Let xt = ty + (1 − t)x with 0 ≤ t ≤ 1. Then
P(x,y)∼γ∗ [ ∇f∗(xt) = (y − xt) / ‖y − xt‖ ] = 1
Corollary 2
f∗ has gradient norm 1 a.e. on the line segments between x and y
17
Observation
Proof.
For (x, y) ∼ γ∗, f∗(y) − f∗(x) = ‖y − x‖ a.s.
Let ψ(t) = f∗(xt) − f∗(x). Then
|ψ(t) − ψ(t′)| = |f∗(xt) − f∗(xt′)| ≤ ‖xt − xt′‖ = ‖x − y‖ |t − t′|,
hence ψ(t) is ‖x − y‖-Lipschitz. Using this,
ψ(1) − ψ(0) = (ψ(1) − ψ(t)) + (ψ(t) − ψ(0))
≤ (1 − t)‖x − y‖ + t‖x − y‖ = ‖x − y‖,
and equality holds since
|ψ(1) − ψ(0)| = |f∗(y) − f∗(x)| = ‖y − x‖
18
Observation
Proof.
Thus, ψ(t) − ψ(0) = t‖x − y‖, and so ψ(t) = t‖x − y‖.
Hence, f∗(xt) = f∗(x) + t‖y − x‖.
Let v = (y − x)/‖y − x‖. Then
∂f∗(xt)/∂v = lim_{h→0} [f∗(xt + hv) − f∗(xt)] / h = 1
Since ‖∇f∗(xt)‖ ≤ 1 (f∗ is 1-Lipschitz), we conclude that ∇f∗(xt) = v.
19
Gradient Penalty (WGAN-GP)
From this observation, we define the gradient penalty
λ Ex̂∼p̂(x)[(‖∇x̂ D(x̂)‖₂ − 1)²]
where x̂ ∼ p̂(x) is sampled uniformly from the line segment between x ∼ p(x) and y ∼ pθ(x)
No critic batch norm: the gradient norm is penalized for each sample independently
Two-sided penalty: the one-sided penalty max(0, ‖∇x̂ D(x̂)‖₂ − 1)² was also tried,
but empirically there was not much difference
20
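A sketch of the penalty term in PyTorch, matching the formula above; λ = 10 is the paper's default, while D, x_real, and x_fake are assumed to come from a setup like the earlier WGAN sketch (2-D toy data, so one interpolation weight per sample suffices).

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    eps = torch.rand(x_real.size(0), 1)                          # uniform t, one per sample
    x_hat = (eps * x_real + (1 - eps) * x_fake.detach()).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()         # (||grad_xhat D(xhat)||_2 - 1)^2

# Critic loss with the penalty (note: no batch norm inside D):
# d_loss = -(D(x_real).mean() - D(x_fake).mean()) + gradient_penalty(D, x_real, x_fake)
```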
Possible Improvement
WGAN-GP does not sample (x, y) from the optimal coupling γ∗
Instead, it samples x ∼ p(x) and y ∼ pθ(y) independently
This does not match the theory (Theorem 1)
Idea: (x, G(E(x))) would be a better approximation of γ∗
E is an additionally trained encoder x → z
G(E(x)) is the projection of x onto the G manifold
21
Experiments
WGAN-GP improves training stability
Number of successes1 for GAN & WGAN-GP
1
Inception score > threshold. Experiments on 32×32 ImageNet.
22
Experiments
23
Experiments
WGAN-GP improves the performance
Inception score on CIFAR-10.
24
Reference
Goodfellow et al. Generative Adversarial Nets. NIPS 2014.
Arjovsky et al. Towards Principled Methods for Training
GANs. ICLR 2017.
Arjovsky et al. Wasserstein GAN. ICML 2017.
Gulrajani et al. Improved Training of Wasserstein GANs.
NIPS 2017.
25
