Metrics for distributions and their applications for generative models (part 1)
Dai Hai Nguyen
Kyoto University
Learning generative models?

Model distribution $P_\theta$ and data distribution $Q$ in the space of probability distributions: Distance($P_\theta$, $Q$) = ?

$$Q = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$$
Learning generative models?
• Maximum Likelihood Estimation (MLE):
Given training samples $x_1, x_2, \dots, x_n$, how to learn $p_{model}(x; \theta)$ from which the training samples are likely to be generated:

$$\theta^{*} = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} \log p_{model}(x_i; \theta)$$
Learning generative models?
• Likelihood-free model:

Random input $z \sim \mathrm{Uniform}$ → NEURAL NETWORK (Generator) → Output
Learning generative models?

Model distribution $P_\theta$ and data distribution $Q$ in the space of probability distributions: Distance($P_\theta$, $Q$) = ?

$$Q = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$$
How to measure similarity between $p$ and $q$?
§ Kullback-Leibler (KL) divergence: asymmetric, i.e., $D_{KL}(p\|q) \neq D_{KL}(q\|p)$

$$D_{KL}(p\|q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx$$

§ Jensen-Shannon (JS) divergence: symmetric

$$D_{JS}(p\|q) = \frac{1}{2}\,D_{KL}\Big(p\,\Big\|\,\frac{p+q}{2}\Big) + \frac{1}{2}\,D_{KL}\Big(q\,\Big\|\,\frac{p+q}{2}\Big)$$

§ Optimal transport (OT):

$$\mathcal{W}(p, q) = \inf_{\gamma\in\Pi(p,q)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]$$

where $\Pi(p, q)$ is the set of all joint distributions of $(X, Y)$ with marginals $p$ and $q$
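These definitions are easy to check numerically for discrete distributions. A minimal sketch (my own illustration, not from the slides), computing KL and JS between two probability vectors with NumPy:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint (p + q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])   # toy distributions
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), kl(q, p))       # asymmetric: the two values differ
print(js(p, q), js(q, p))       # symmetric: the two values agree
```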
Many fundamental problems can be cast as quantifying similarity between two distributions
§ Maximum likelihood estimation (MLE) is equivalent to minimizing KL divergence

Suppose we draw $N$ samples $x \sim p(x\,|\,\theta^{*})$. The MLE of $\theta$ is

$$\hat{\theta} = \operatorname{argmin}_{\theta}\; -\frac{1}{N}\sum_{i=1}^{N}\log p(x_i\,|\,\theta) \;\approx\; -\,\mathbb{E}_{x\sim p(x|\theta^{*})}\big[\log p(x\,|\,\theta)\big]$$

By the definition of KL divergence:

$$D_{KL}\big(p(x|\theta^{*})\,\|\,p(x|\theta)\big) = \mathbb{E}_{x\sim p(x|\theta^{*})}\Big[\log\frac{p(x|\theta^{*})}{p(x|\theta)}\Big] = \mathbb{E}_{x\sim p(x|\theta^{*})}\big[\log p(x|\theta^{*})\big] - \mathbb{E}_{x\sim p(x|\theta^{*})}\big[\log p(x|\theta)\big]$$

The first term does not depend on $\theta$, so minimizing the negative log-likelihood is the same as minimizing $D_{KL}(p(x|\theta^{*})\,\|\,p(x|\theta))$.
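A small numerical illustration of this equivalence (my own toy example, not from the slides): for Bernoulli data, the θ that minimizes the negative average log-likelihood also minimizes the KL divergence from the true distribution, since the two objectives differ only by a constant (the entropy of the data distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=10_000)        # N samples from p(x | theta*)

thetas = np.linspace(0.01, 0.99, 99)

# Negative average log-likelihood: -(1/N) sum_i log p(x_i | theta)
nll = -(x.mean() * np.log(thetas) + (1 - x.mean()) * np.log(1 - thetas))

# D_KL(p(x | theta*) || p(x | theta)) for Bernoulli distributions
kl = (theta_true * np.log(theta_true / thetas)
      + (1 - theta_true) * np.log((1 - theta_true) / (1 - thetas)))

print("argmin NLL:", thetas[np.argmin(nll)])
print("argmin KL :", thetas[np.argmin(kl)])          # both close to theta_true = 0.7
```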
Training GAN is equivalent to minimizing JS divergence
§ GAN has two networks, D and G, which play a minimax game

$$\min_{G}\max_{D}\; L(D, G) = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim r(z)}\big[\log(1 - D(G(z)))\big] = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{x\sim p(x)}\big[\log(1 - D(x))\big]$$

where $p(x)$ and $q(x)$ are the distributions of fake images and real images, respectively
§ Fixing G, the optimal D is easily obtained:

$$D^{*}(x) = \frac{q(x)}{p(x) + q(x)}$$
Training GAN is equivalent to minimizing JS divergence
§ GAN has two networks, D and G, which play a minimax game

$$\min_{G}\max_{D}\; L(D, G) = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim r(z)}\big[\log(1 - D(G(z)))\big] = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{x\sim p(x)}\big[\log(1 - D(x))\big]$$

where $p(x)$ and $q(x)$ are the distributions of fake and real images, respectively
§ Fixing G, the optimal D is easily obtained:

$$D^{*}(x) = \frac{q(x)}{p(x) + q(x)}$$

and

$$L(D^{*}, G) = \int q(x)\,\log\frac{q(x)}{p(x) + q(x)}\,dx + \int p(x)\,\log\frac{p(x)}{p(x) + q(x)}\,dx = 2\,D_{JS}(p\|q) - \log 4$$
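A quick sanity check of the last identity on a toy pair of discrete distributions (my own example): plugging D*(x) = q(x) / (p(x) + q(x)) into L gives exactly 2 D_JS(p||q) − log 4.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # "fake" distribution
q = np.array([0.4, 0.1, 0.5])   # "real" distribution

d_star = q / (p + q)            # optimal discriminator on this support

# L(D*, G) = E_q[log D*] + E_p[log(1 - D*)]
L = np.sum(q * np.log(d_star)) + np.sum(p * np.log(1 - d_star))

kl = lambda a, b: np.sum(a * np.log(a / b))
m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(L, 2 * js - np.log(4))    # the two numbers coincide
```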
f-divergences
• Divergence between two distributions:

$$D_f(q\|p) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx$$

• f: generator function, convex and f(1) = 0
• Every function f has a convex conjugate f* such that:

$$f(x) = \sup_{y\in \mathrm{dom}(f^{*})}\{xy - f^{*}(y)\}$$
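As a hand-picked illustration of the conjugate pair (not from the slides): for the KL generator f(t) = t log t, the conjugate is f*(y) = exp(y − 1), and the supremum formula above recovers f from f* numerically.

```python
import numpy as np

f      = lambda t: t * np.log(t)       # generator of the KL divergence
f_star = lambda y: np.exp(y - 1.0)     # its convex conjugate

# Recover f(x) = sup_y { x*y - f*(y) } on a fine grid of y values
ys = np.linspace(-5.0, 5.0, 20001)
for x in [0.5, 1.0, 2.0, 3.0]:
    sup_val = np.max(x * ys - f_star(ys))
    print(x, sup_val, f(x))            # last two columns agree (note f(1) = 0)
```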
f-divergences
• Different generator functions f give different divergences
Estimating f-divergences from samples

$$D_f(q\|p) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx = \int p(x) \sup_{t\in\mathrm{dom}(f^{*})}\Big\{t\,\frac{q(x)}{p(x)} - f^{*}(t)\Big\}\, dx$$

$$\geq \sup_{T\in\mathcal{T}}\Big\{\int q(x)\,T(x)\,dx - \int p(x)\,f^{*}\big(T(x)\big)\,dx\Big\} = \sup_{T\in\mathcal{T}}\Big\{\mathbb{E}_{x\sim Q}\big[T(x)\big] - \mathbb{E}_{x\sim P}\big[f^{*}(T(x))\big]\Big\}$$

The first expectation is estimated from samples from Q, the second from samples from P.

Conjugate function of f(x):

$$f^{*}(x) = \sup_{t\in\mathrm{dom}(f)}\{tx - f(t)\}$$

Some properties:
• $f(x) = \sup_{t\in\mathrm{dom}(f^{*})}\{tx - f^{*}(t)\}$
• $f^{**}(x) = f(x)$
• $f^{*}(x)$ is always convex
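A minimal sketch of the sample-based lower bound (my own illustration, with made-up toy distributions): take f(t) = t log t (so D_f(q||p) = KL(q||p) and f*(y) = exp(y − 1)), restrict the witness T to affine functions T(x) = a·x + b, and maximize the estimated bound over a grid of (a, b). For Q = N(1, 1) and P = N(0, 1) the true value is 0.5, and the optimal witness happens to be affine, so the bound is tight.

```python
import numpy as np

rng = np.random.default_rng(0)
xq = rng.normal(1.0, 1.0, 20_000)   # samples from Q = N(1, 1)
xp = rng.normal(0.0, 1.0, 20_000)   # samples from P = N(0, 1)

f_star = lambda y: np.exp(y - 1.0)  # conjugate of f(t) = t * log(t)

def bound(a, b):
    """Estimate E_{x~Q}[T(x)] - E_{x~P}[f*(T(x))] for the affine witness T(x) = a*x + b."""
    return np.mean(a * xq + b) - np.mean(f_star(a * xp + b))

grid = np.linspace(-2.0, 2.0, 81)
best = max(bound(a, b) for a in grid for b in grid)
print(best)   # close to KL(Q||P) = 0.5, and always a lower bound in expectation
```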
Training f-divergence GAN
• f-GAN:

$$\min_{\theta}\max_{w}\; F(\theta, w) = \mathbb{E}_{x\sim Q}\big[T_w(x)\big] - \mathbb{E}_{x\sim P_\theta}\big[f^{*}(T_w(x))\big]$$

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization, NIPS 2016
Turns out: GAN is a specific case of f-divergence
• GAN:

$$\min_{\theta}\max_{w}\; \mathbb{E}_{x\sim Q}\big[\log D_w(x)\big] + \mathbb{E}_{x\sim P_\theta}\big[\log(1 - D_w(x))\big]$$

• f-GAN:

$$\min_{\theta}\max_{w}\; \mathbb{E}_{x\sim Q}\big[T_w(x)\big] - \mathbb{E}_{x\sim P_\theta}\big[f^{*}(T_w(x))\big]$$

By choosing a suitable T and f, f-GAN turns into the original GAN (^^)
1-Wasserstein distance (another option)
§ It seeks a probabilistic coupling $\gamma$:

$$W_1 = \min_{\gamma\in\mathbb{P}} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\,\gamma(x, y)\,dx\,dy = \mathbb{E}_{(x,y)\sim\gamma}\big[c(x, y)\big]$$

where $\mathbb{P} = \{\gamma \geq 0,\ \int \gamma(x, y)\,dy = p(x),\ \int \gamma(x, y)\,dx = q(y)\}$

$c(x, y)$ is the displacement cost from x to y (e.g. Euclidean distance)
§ a.k.a. Earth mover's distance
§ Can be formulated as a Linear Program (convex)
Kantorovich's formulation of OT
§ In case of discrete inputs

$$p = \sum_{i=1}^{m} a_i\,\delta_{x_i}, \qquad q = \sum_{j=1}^{n} b_j\,\delta_{y_j}$$

§ Couplings:

$$\mathbb{P} = \{P \geq 0,\ P \in \mathbb{R}^{m\times n},\ P\mathbf{1}_n = a,\ P^{\top}\mathbf{1}_m = b\}$$

§ LP problem: find P

$$P = \operatorname{argmin}_{P\in\mathbb{P}} \langle P, C\rangle$$

where C is the cost matrix, i.e. $C_{ij} = c(x_i, y_j)$
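The discrete problem above can be handed directly to an off-the-shelf LP solver. A minimal sketch using scipy.optimize.linprog (my own toy example; dedicated OT libraries such as POT are preferable for large problems):

```python
import numpy as np
from scipy.optimize import linprog

# Toy discrete distributions and a squared-Euclidean cost
x = np.array([0.0, 1.0, 2.0])        # support of p (m = 3)
y = np.array([0.5, 1.5])             # support of q (n = 2)
a = np.array([0.3, 0.3, 0.4])        # weights a_i (sum to 1)
b = np.array([0.6, 0.4])             # weights b_j (sum to 1)

m, n = len(a), len(b)
C = (x[:, None] - y[None, :]) ** 2   # cost matrix C_ij = c(x_i, y_j)

# Equality constraints on vec(P) (row-major): row sums equal a, column sums equal b
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0         # sum_j P_ij = a_i
for j in range(n):
    A_eq[m + j, j::n] = 1.0                  # sum_i P_ij = b_j
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
P = res.x.reshape(m, n)                      # optimal coupling
print("OT cost <P, C> =", res.fun)
print(P)
```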
Why is OT better than KL and JS divergences?
§ OT provides a smooth measure and is more useful than KL and JS
§ Example (see the worked example below):
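A standard worked example (in the spirit of the illustration used in the WGAN paper; stated here as my own summary). Take two point masses on the real line, $P = \delta_0$ and $Q_\theta = \delta_\theta$. For $\theta \neq 0$:

• $D_{KL}(P\|Q_\theta) = +\infty$ (the supports are disjoint)
• $D_{JS}(P\|Q_\theta) = \log 2$, a constant that gives no gradient in $\theta$
• $\mathcal{W}(P, Q_\theta) = |\theta|$, which decreases smoothly as $Q_\theta$ moves toward $P$

Only the OT distance tells the learner how far the model is and in which direction to move.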
How to apply the 1-Wasserstein distance to GAN?

$$\mathcal{W}(p, q) = \inf_{\gamma\in\Pi(p,q)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big] = \inf_{\gamma} \langle C, \gamma\rangle
\quad \text{s.t.}\quad \sum_{j=1}^{n}\gamma_{ij} = p_i,\ i = 1,\dots,m; \qquad \sum_{i=1}^{m}\gamma_{ij} = q_j,\ j = 1,\dots,n$$

Primal LP: $\min\ c^{\top}x$ s.t. $Ax = b,\ x \geq 0$
Dual LP: $\max\ b^{\top}y$ s.t. $A^{\top}y \leq c$

with $c = \mathrm{vec}(C) \in \mathbb{R}^{mn}$, $x = \mathrm{vec}(\gamma) \in \mathbb{R}^{mn}$, $b = [p^{\top}, q^{\top}]^{\top} \in \mathbb{R}^{m+n}$

Writing the dual variable as $y = [f^{\top}, g^{\top}]^{\top}$, the dual becomes

$$\max\ f^{\top}p + g^{\top}q \quad \text{s.t.}\quad f_i + g_j \leq C_{ij},\ i = 1,\dots,m;\ j = 1,\dots,n$$

It is easy to see that $f_i = -g_i$, so $|f_i - f_j| \leq 1\cdot|x_i - y_j|$, i.e. f is 1-Lipschitz:

$$\mathcal{W}(p, q) = \sup_{\|f\|_{L}\leq 1} \mathbb{E}_{x\sim p}\big[f(x)\big] - \mathbb{E}_{x\sim q}\big[f(x)\big]$$

(Kantorovich-Rubinstein duality)
Training WGAN
In WGAN, replace the discriminator with $f$ and minimize the 1-Wasserstein distance:

$$\min_{\theta}\; \mathcal{W}(p, q_\theta) = \sup_{\|f_w\|_{L}\leq 1} \mathbb{E}_{x\sim p}\big[f_w(x)\big] - \mathbb{E}_{z\sim r(z)}\big[f_w(g_\theta(z))\big]$$

Ref: Wasserstein GAN, ICML 2017

Training alternates two steps: find $w$ (critic), then update $\theta$ (generator).
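A minimal, hedged sketch of this training loop in PyTorch (my own illustration of the procedure above, on 1-D toy data with made-up hyperparameters; as in the paper, the critic takes several steps per generator step, uses RMSprop, and enforces the Lipschitz constraint by weight clipping):

```python
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, dout))

critic = mlp(1, 1)   # f_w : x -> R (no sigmoid, unlike a GAN discriminator)
gen    = mlp(2, 1)   # g_theta : z -> x, with z ~ Uniform as on the earlier slide

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)

def real_batch(n=64):                        # toy "data" distribution: N(3, 0.5)
    return 3.0 + 0.5 * torch.randn(n, 1)

clip, n_critic = 0.01, 5
for step in range(2000):
    # Critic step (find w): maximize E_p[f_w(x)] - E_z[f_w(g_theta(z))]
    for _ in range(n_critic):
        x, z = real_batch(), torch.rand(64, 2)
        loss_c = -(critic(x).mean() - critic(gen(z).detach()).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():        # weight clipping ~ 1-Lipschitz constraint
            p.data.clamp_(-clip, clip)
    # Generator step (update theta): minimize the estimated Wasserstein distance
    z = torch.rand(64, 2)
    loss_g = -critic(gen(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```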
Thank you for listening