Improved Training of Wasserstein GANs
Sangwoo Mo
KAIST ALIN Lab.
July 04, 2018
1
Table of Contents
Review for GANs
Improved Training of WGANs
2
Table of Contents
Review for GANs
Improved Training of WGANs
3
Generative Adversarial Networks (GANs)
A generative model aims to learn a model distribution pθ(x) that
matches the target distribution p(x)
Usually we assume x ∼ pθ(x) is given by a deterministic mapping
x = Gθ(z) of a simple noise z ∼ p(z)
* Figure from OpenAI blog.
4
Generative Adversarial Networks (GANs)
Q. How to train a generative model?
Explicit model: directly optimize the objective (e.g. MLE)
For example, PixelCNN maximizes
log pθ(x) = ∑_{i=1}^n log pθ(xi | x1:i−1)
5
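To make the factorization concrete, here is a minimal autoregressive-likelihood sketch in PyTorch; the toy LSTM over binary "pixels", the layer sizes, and the data shapes are illustrative assumptions, not the actual PixelCNN architecture.

```python
import torch
import torch.nn as nn

# Toy autoregressive model: maximizes log p_theta(x) = sum_i log p_theta(x_i | x_{1:i-1}).
class ToyAR(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def log_prob(self, x):                          # x: (batch, n) of 0/1 values
        # Condition pixel i on pixels 1..i-1 by shifting the sequence right.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.rnn(prev.unsqueeze(-1))
        logits = self.out(h).squeeze(-1)            # logit of p_theta(x_i = 1 | x_{1:i-1})
        return torch.distributions.Bernoulli(logits=logits).log_prob(x).sum(dim=1)

x = torch.randint(0, 2, (8, 16)).float()            # 8 toy images of 16 binary pixels
model = ToyAR()
loss = -model.log_prob(x).mean()                    # MLE: minimize the negative log-likelihood
loss.backward()
```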
Generative Adversarial Networks (GANs)
Q. How to train a generative model?
Explicit model: directly optimize the objective (e.g. MLE)
Implicit model1: learning by comparison
Idea of GAN
Train a discriminator D which compares p(x) and pθ(x)
Train a generator G using the signal from D
1
We do not know pθ(x) explicitly; we can only sample from it.
6
Generative Adversarial Networks (GANs)
What is happening in GAN?
GAN plays a minimax game between G and D:
min_G max_D V(G, D) where
V(G, D) = Ex∼p(x)[log D(x)] + Ez∼p(z)[log(1 − D(G(z)))]
For a given G, the optimal D∗ is
D∗(x) = p(x) / (p(x) + pθ(x))
7
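A minimal sketch of one training step for the minimax objective above, assuming a toy one-hidden-layer G and D on synthetic data; the network sizes, learning rates, and use of Adam are illustrative choices, not prescribed by the slides.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 8, 2
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.randn(128, x_dim)     # stand-in for samples x ~ p(x)
z = torch.randn(128, z_dim)          # z ~ p(z)

# Discriminator ascends V(G, D): maximize E[log D(x)] + E[log(1 - D(G(z)))].
d_loss = -(torch.log(D(x_real)) + torch.log(1 - D(G(z).detach()))).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator descends V(G, D): minimize E[log(1 - D(G(z)))].
g_loss = torch.log(1 - D(G(z))).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```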
Generative Adversarial Networks (GANs)
What is happening in GAN?
Plugging D∗ into the objective, we have
C(G) = max_D V(G, D)
     = KL(p ∥ (p + pθ)/2) + KL(pθ ∥ (p + pθ)/2) + const
     = 2 · JSD(p ∥ pθ) + const
Hence, GAN minimizes a lower bound of the JSD
8
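A quick numerical check of the identity above on toy discrete distributions (the specific p and pθ below are arbitrary assumptions); it verifies that plugging D∗ into V(G, D) gives 2 · JSD(p ∥ pθ) − log 4.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # toy target distribution p(x)
p_theta = np.array([0.2, 0.3, 0.5])  # toy model distribution p_theta(x)
m = (p + p_theta) / 2

kl = lambda a, b: np.sum(a * np.log(a / b))
jsd = 0.5 * kl(p, m) + 0.5 * kl(p_theta, m)

d_star = p / (p + p_theta)           # optimal discriminator D*(x)
v_opt = np.sum(p * np.log(d_star)) + np.sum(p_theta * np.log(1 - d_star))

print(np.isclose(v_opt, 2 * jsd - np.log(4)))   # True: the constant is -log 4
```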
Generative Adversarial Networks (GANs)
What is happening in GAN?
In practice, GAN suffers from vanishing gradients
To avoid this problem, we minimize − log D(G(z)) instead
Plugging in D∗, we have
C(G) = KL(pθ ∥ p) − 2 · JSD(p ∥ pθ) + const
Hence, it minimizes a lower bound of the reverse KL
9
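In code, the swap is a one-line change to the generator update in the earlier sketch (same toy G, D, and z; this is the standard non-saturating heuristic).

```python
# Non-saturating generator loss: minimize -E[log D(G(z))]
# instead of E[log(1 - D(G(z)))]; the discriminator step is unchanged.
g_loss = -torch.log(D(G(z))).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```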
Wasserstein GANs (WGANs)
Why is GAN unstable?
Supports of p(x) and pθ(x) are disjoint1 a.s.
Then
JSD(p ∥ pθ) = log 2
KL(p ∥ pθ) = KL(pθ ∥ p) = +∞
The loss does not provide useful information
Solution
1. Add noise to overlap supports
2. Use better divergence
1
They lie on low-dimensional manifolds.
10
Wasserstein GANs (WGANs)
Toy example
Let z ∼ U[0, 1] and x = (0, z) ∼ p(x)
Let Gθ(z) = (θ, z), so pθ = p exactly when θ = 0
* Figure from Lilian Weng’s blog.
11
Wasserstein GANs (WGANs)
Toy example
Here, the Wasserstein distance is
W(p, pθ) = |θ|
Unlike JSD and KL, it tells us how close pθ is to p
* Figure from WGAN paper.
12
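A quick numerical check of the toy example, assuming SciPy is available; since the second coordinate is identical under p and pθ, the 2-D Wasserstein distance reduces to the 1-D distance between the first coordinates, while JSD(p ∥ pθ) stays at log 2 for every θ ≠ 0.

```python
import numpy as np
from scipy.stats import wasserstein_distance

for theta in [1.0, 0.5, 0.1, 0.0]:
    first_p = np.zeros(1000)            # first coordinate under p: always 0
    first_q = np.full(1000, theta)      # first coordinate under p_theta: always theta
    print(theta, wasserstein_distance(first_p, first_q))   # prints |theta|
```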
Wasserstein GANs (WGANs)
Wasserstein distance
The Wasserstein-1 distance is
W(p, q) = inf_{γ∈Π(p,q)} E(x,y)∼γ[‖x − y‖]
Relation between divergences (ordered from weakest to strongest):
W (= convergence in distribution) < JSD = TV < KL
13
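The primal definition can be evaluated directly on small discrete distributions by solving for the coupling with a linear program; the points, distributions, and use of scipy.optimize.linprog below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

xs = np.arange(4)                        # support points {0, 1, 2, 3}
p = np.array([0.5, 0.5, 0.0, 0.0])       # toy p
q = np.array([0.0, 0.0, 0.5, 0.5])       # toy q
C = np.abs(xs[:, None] - xs[None, :]).astype(float)   # cost c(x, y) = |x - y|

# Minimize <C, gamma> over couplings gamma with marginals p and q, gamma >= 0.
n = len(xs)
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0     # row sums = p (first marginal)
    A_eq[n + i, i::n] = 1.0              # column sums = q (second marginal)
b_eq = np.concatenate([p, q])
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("W1(p, q) =", res.fun)             # 2.0: each unit of mass moves a distance of 2
```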
Wasserstein GANs (WGANs)
How to minimize Wasserstein distance?
The Wasserstein-1 distance has a dual form:
W(p, q) = sup_{f∈F} Ex∼p(x)[f(x)] − Ex∼q(x)[f(x)]
where F is the set of 1-Lipschitz functions
Hence, the objective of WGAN is
min_G max_{D∈D} Ex∼p(x)[D(x)] − Ez∼p(z)[D(G(z))]
To enforce the Lipschitz constraint, WGAN uses weight clipping
14
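A minimal sketch of one WGAN step with weight clipping, reusing the toy setup from the earlier GAN sketch; the critic has no sigmoid, the clip value 0.01 and RMSProp learning rate 5e-5 follow the WGAN paper's defaults, and everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

z_dim, x_dim, clip = 8, 2, 0.01
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))   # critic: no sigmoid
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)

x_real = torch.randn(128, x_dim)
z = torch.randn(128, z_dim)

# Critic: maximize E_p[D(x)] - E_pz[D(G(z))], i.e. minimize its negation.
d_loss = -(D(x_real).mean() - D(G(z).detach()).mean())
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
for w in D.parameters():                 # weight clipping to keep D (roughly) Lipschitz
    w.data.clamp_(-clip, clip)

# Generator: minimize -E_pz[D(G(z))].
g_loss = -D(G(z)).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```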
Table of Contents
Review for GANs
Improved Training of WGANs
15
Motivation
Motivation: weight clipping leads to optimization difficulties
1 It restricts the critic to an overly simple function class
2 It causes exploding or vanishing gradients
16
Observation
Theorem 1
Let (x, y) ∼ γ∗ where γ∗ is the optimal coupling and f∗ is the optimal
1-Lipschitz critic. Let xt = ty + (1 − t)x with 0 ≤ t ≤ 1. Then
P(x,y)∼γ∗ [ ∇f∗(xt) = (y − xt) / ‖y − xt‖ ] = 1
Corollary 2
f∗ has gradient norm 1 a.e. on the line segments between x and y
17
Observation
Proof.
For (x, y) ∼ γ∗, f∗(y) − f∗(x) = ‖y − x‖ a.s.
Let ψ(t) = f∗(xt) − f∗(x). Then
|ψ(t) − ψ(t′)| = |f∗(xt) − f∗(xt′)| ≤ ‖xt − xt′‖ = ‖x − y‖ |t − t′|,
hence ψ(t) is ‖x − y‖-Lipschitz. Using this,
ψ(1) − ψ(0) = (ψ(1) − ψ(t)) + (ψ(t) − ψ(0))
≤ (1 − t)‖x − y‖ + t‖x − y‖ = ‖x − y‖,
and equality holds since
|ψ(1) − ψ(0)| = |f∗(y) − f∗(x)| = ‖y − x‖
18
Observation
Proof.
Thus, ψ(t) − ψ(0) = t‖x − y‖, and so ψ(t) = t‖x − y‖.
Hence, f∗(xt) = f∗(x) + t‖y − x‖.
Let v = (y − x)/‖y − x‖. Then
∂f∗(xt)/∂v = lim_{h→0} [f∗(xt + hv) − f∗(xt)] / h = 1
Since ‖∇f∗(xt)‖ ≤ 1 (f∗ is 1-Lipschitz), we conclude that ∇f∗(xt) = v.
19
Gradient Penalty (WGAN-GP)
From this observation, we define the gradient penalty
λ Ex̂∼p̂(x)[(‖∇x̂ D(x̂)‖₂ − 1)²]
where x̂ ∼ p̂(x) is sampled uniformly from the line segment between x ∼ p(x) and y ∼ pθ(x)
No critic batch norm: the gradient norm is penalized for each sample independently
Two-sided penalty: the one-sided penalty max(0, ‖∇x̂ D(x̂)‖₂ − 1)² was also tried,
but empirically there was not much difference
20
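A sketch of the penalty term in PyTorch, matching the formula above; λ = 10 is the paper's default, while D, x_real, and x_fake are assumed to come from a setup like the earlier WGAN sketch (2-D toy data, so one interpolation weight per sample suffices).

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    eps = torch.rand(x_real.size(0), 1)                          # uniform t, one per sample
    x_hat = (eps * x_real + (1 - eps) * x_fake.detach()).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()         # (||grad_xhat D(xhat)||_2 - 1)^2

# Critic loss with the penalty (note: no batch norm inside D):
# d_loss = -(D(x_real).mean() - D(x_fake).mean()) + gradient_penalty(D, x_real, x_fake)
```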
Possible Improvement
WGAN-GP does not sample (x, y) from the optimal coupling γ∗
Instead, it samples x ∼ p(x) and y ∼ pθ(y) independently
This does not match the theory (Theorem 1)
Idea: (x, G(E(x))) would be a better approximation of γ∗
E is an additionally trained encoder x → z
G(E(x)) is the projection of x onto the G manifold
21
Experiments
WGAN-GP improves training stability
Number of successes1 for GAN & WGAN-GP
1
Inception score > threshold. Experiments on 32×32 ImageNet.
22
Experiments
23
Experiments
WGAN-GP improves the performance
Inception score on CIFAR-10.
24
Reference
Goodfellow et al. Generative Adversarial Nets. NIPS 2014.
Arjovsky et al. Towards Principled Methods for Training
GANs. ICLR 2017.
Arjovsky et al. Wasserstein GAN. ICML 2017.
Gulrajani et al. Improved Training of Wasserstein GANs.
NIPS 2017.
25
