Metrics for distributions and their applications for generative models (part 1)
Dai Hai Nguyen
Kyoto University
Learning generative models?

Model distribution $P_\theta$ and data distribution $Q$ in the space of probability distributions: Distance($P_\theta$, $Q$) = ?

$$Q = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$$
Learning generative models?
• Maximum Likelihood Estimation (MLE):
Given training samples $x_1, x_2, \dots, x_n$, how to learn $p_{model}(x; \theta)$ from which the training samples are likely to be generated:

$$\theta^{*} = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} \log p_{model}(x_i; \theta)$$
Learning generative models?
• Likelihood-free model:

Random input $z \sim \mathrm{Uniform}$ → NEURAL NETWORK (Generator) → Output
Learning generative models?

Model distribution $P_\theta$ and data distribution $Q$ in the space of probability distributions: Distance($P_\theta$, $Q$) = ?

$$Q = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$$
How to measure similarity between $p$ and $q$?
§ Kullback-Leibler (KL) divergence: asymmetric, i.e., $D_{KL}(p\|q) \neq D_{KL}(q\|p)$

$$D_{KL}(p\|q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx$$

§ Jensen-Shannon (JS) divergence: symmetric

$$D_{JS}(p\|q) = \frac{1}{2}\,D_{KL}\Big(p\,\Big\|\,\frac{p+q}{2}\Big) + \frac{1}{2}\,D_{KL}\Big(q\,\Big\|\,\frac{p+q}{2}\Big)$$

§ Optimal transport (OT):

$$\mathcal{W}(p, q) = \inf_{\gamma\in\Pi(p,q)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]$$

where $\Pi(p, q)$ is the set of all joint distributions of $(X, Y)$ with marginals $p$ and $q$
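These definitions are easy to check numerically for discrete distributions. A minimal sketch (my own illustration, not from the slides), computing KL and JS between two probability vectors with NumPy:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint (p + q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])   # toy distributions
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), kl(q, p))       # asymmetric: the two values differ
print(js(p, q), js(q, p))       # symmetric: the two values agree
```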
Many fundamental problems can be cast as quantifying similarity between two distributions
§ Maximum likelihood estimation (MLE) is equivalent to minimizing KL divergence

Suppose we draw $N$ samples $x \sim p(x\,|\,\theta^{*})$. The MLE of $\theta$ is

$$\hat{\theta} = \operatorname{argmin}_{\theta}\; -\frac{1}{N}\sum_{i=1}^{N}\log p(x_i\,|\,\theta) \;\approx\; -\,\mathbb{E}_{x\sim p(x|\theta^{*})}\big[\log p(x\,|\,\theta)\big]$$

By the definition of KL divergence:

$$D_{KL}\big(p(x|\theta^{*})\,\|\,p(x|\theta)\big) = \mathbb{E}_{x\sim p(x|\theta^{*})}\Big[\log\frac{p(x|\theta^{*})}{p(x|\theta)}\Big] = \mathbb{E}_{x\sim p(x|\theta^{*})}\big[\log p(x|\theta^{*})\big] - \mathbb{E}_{x\sim p(x|\theta^{*})}\big[\log p(x|\theta)\big]$$

The first term does not depend on $\theta$, so minimizing the negative log-likelihood is the same as minimizing $D_{KL}(p(x|\theta^{*})\,\|\,p(x|\theta))$.
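A small numerical illustration of this equivalence (my own toy example, not from the slides): for Bernoulli data, the θ that minimizes the negative average log-likelihood also minimizes the KL divergence from the true distribution, since the two objectives differ only by a constant (the entropy of the data distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=10_000)        # N samples from p(x | theta*)

thetas = np.linspace(0.01, 0.99, 99)

# Negative average log-likelihood: -(1/N) sum_i log p(x_i | theta)
nll = -(x.mean() * np.log(thetas) + (1 - x.mean()) * np.log(1 - thetas))

# D_KL(p(x | theta*) || p(x | theta)) for Bernoulli distributions
kl = (theta_true * np.log(theta_true / thetas)
      + (1 - theta_true) * np.log((1 - theta_true) / (1 - thetas)))

print("argmin NLL:", thetas[np.argmin(nll)])
print("argmin KL :", thetas[np.argmin(kl)])          # both close to theta_true = 0.7
```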
Training GAN is equivalent to minimizing JS divergence
§ GAN has two networks, D and G, which play a minimax game

$$\min_{G}\max_{D}\; L(D, G) = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim r(z)}\big[\log(1 - D(G(z)))\big] = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{x\sim p(x)}\big[\log(1 - D(x))\big]$$

where $p(x)$ and $q(x)$ are the distributions of fake images and real images, respectively
§ Fixing G, the optimal D is easily obtained:

$$D^{*}(x) = \frac{q(x)}{p(x) + q(x)}$$
Training GAN is equivalent to minimizing JS divergence
§ GAN has two networks, D and G, which play a minimax game

$$\min_{G}\max_{D}\; L(D, G) = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim r(z)}\big[\log(1 - D(G(z)))\big] = \mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + \mathbb{E}_{x\sim p(x)}\big[\log(1 - D(x))\big]$$

where $p(x)$ and $q(x)$ are the distributions of fake and real images, respectively
§ Fixing G, the optimal D is easily obtained:

$$D^{*}(x) = \frac{q(x)}{p(x) + q(x)}$$

and

$$L(D^{*}, G) = \int q(x)\,\log\frac{q(x)}{p(x) + q(x)}\,dx + \int p(x)\,\log\frac{p(x)}{p(x) + q(x)}\,dx = 2\,D_{JS}(p\|q) - \log 4$$
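A quick sanity check of the last identity on a toy pair of discrete distributions (my own example): plugging D*(x) = q(x) / (p(x) + q(x)) into L gives exactly 2 D_JS(p||q) − log 4.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # "fake" distribution
q = np.array([0.4, 0.1, 0.5])   # "real" distribution

d_star = q / (p + q)            # optimal discriminator on this support

# L(D*, G) = E_q[log D*] + E_p[log(1 - D*)]
L = np.sum(q * np.log(d_star)) + np.sum(p * np.log(1 - d_star))

kl = lambda a, b: np.sum(a * np.log(a / b))
m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(L, 2 * js - np.log(4))    # the two numbers coincide
```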
f-divergences
• Divergence between two distributions:

$$D_f(q\|p) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx$$

• f: generator function, convex and f(1) = 0
• Every function f has a convex conjugate f* such that:

$$f(x) = \sup_{y\in \mathrm{dom}(f^{*})}\{xy - f^{*}(y)\}$$
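As a hand-picked illustration of the conjugate pair (not from the slides): for the KL generator f(t) = t log t, the conjugate is f*(y) = exp(y − 1), and the supremum formula above recovers f from f* numerically.

```python
import numpy as np

f      = lambda t: t * np.log(t)       # generator of the KL divergence
f_star = lambda y: np.exp(y - 1.0)     # its convex conjugate

# Recover f(x) = sup_y { x*y - f*(y) } on a fine grid of y values
ys = np.linspace(-5.0, 5.0, 20001)
for x in [0.5, 1.0, 2.0, 3.0]:
    sup_val = np.max(x * ys - f_star(ys))
    print(x, sup_val, f(x))            # last two columns agree (note f(1) = 0)
```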
f-divergences
• Different generator functions f give different divergences
Estimating f-divergences from samples

$$D_f(q\|p) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx = \int p(x) \sup_{t\in\mathrm{dom}(f^{*})}\Big\{t\,\frac{q(x)}{p(x)} - f^{*}(t)\Big\}\, dx$$

$$\geq \sup_{T\in\mathcal{T}}\Big\{\int q(x)\,T(x)\,dx - \int p(x)\,f^{*}\big(T(x)\big)\,dx\Big\} = \sup_{T\in\mathcal{T}}\Big\{\mathbb{E}_{x\sim Q}\big[T(x)\big] - \mathbb{E}_{x\sim P}\big[f^{*}(T(x))\big]\Big\}$$

The first expectation is estimated from samples from Q, the second from samples from P.

Conjugate function of f(x):

$$f^{*}(x) = \sup_{t\in\mathrm{dom}(f)}\{tx - f(t)\}$$

Some properties:
• $f(x) = \sup_{t\in\mathrm{dom}(f^{*})}\{tx - f^{*}(t)\}$
• $f^{**}(x) = f(x)$
• $f^{*}(x)$ is always convex
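A minimal sketch of the sample-based lower bound (my own illustration, with made-up toy distributions): take f(t) = t log t (so D_f(q||p) = KL(q||p) and f*(y) = exp(y − 1)), restrict the witness T to affine functions T(x) = a·x + b, and maximize the estimated bound over a grid of (a, b). For Q = N(1, 1) and P = N(0, 1) the true value is 0.5, and the optimal witness happens to be affine, so the bound is tight.

```python
import numpy as np

rng = np.random.default_rng(0)
xq = rng.normal(1.0, 1.0, 20_000)   # samples from Q = N(1, 1)
xp = rng.normal(0.0, 1.0, 20_000)   # samples from P = N(0, 1)

f_star = lambda y: np.exp(y - 1.0)  # conjugate of f(t) = t * log(t)

def bound(a, b):
    """Estimate E_{x~Q}[T(x)] - E_{x~P}[f*(T(x))] for the affine witness T(x) = a*x + b."""
    return np.mean(a * xq + b) - np.mean(f_star(a * xp + b))

grid = np.linspace(-2.0, 2.0, 81)
best = max(bound(a, b) for a in grid for b in grid)
print(best)   # close to KL(Q||P) = 0.5, and always a lower bound in expectation
```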
Training f-divergence GAN
• f-GAN:

$$\min_{\theta}\max_{w}\; F(\theta, w) = \mathbb{E}_{x\sim Q}\big[T_w(x)\big] - \mathbb{E}_{x\sim P_\theta}\big[f^{*}(T_w(x))\big]$$

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization, NIPS 2016
Turns out: GAN is a specific case of f-divergence
• GAN:

$$\min_{\theta}\max_{w}\; \mathbb{E}_{x\sim Q}\big[\log D_w(x)\big] + \mathbb{E}_{x\sim P_\theta}\big[\log(1 - D_w(x))\big]$$

• f-GAN:

$$\min_{\theta}\max_{w}\; \mathbb{E}_{x\sim Q}\big[T_w(x)\big] - \mathbb{E}_{x\sim P_\theta}\big[f^{*}(T_w(x))\big]$$

By choosing a suitable T and f, f-GAN turns into the original GAN (^^)
1-Wasserstein distance (another option)
§ It seeks a probabilistic coupling $\gamma$:

$$W_1 = \min_{\gamma\in\mathbb{P}} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\,\gamma(x, y)\,dx\,dy = \mathbb{E}_{(x,y)\sim\gamma}\big[c(x, y)\big]$$

where $\mathbb{P} = \{\gamma \geq 0,\ \int \gamma(x, y)\,dy = p(x),\ \int \gamma(x, y)\,dx = q(y)\}$

$c(x, y)$ is the displacement cost from x to y (e.g. Euclidean distance)
§ a.k.a. Earth mover's distance
§ Can be formulated as a Linear Program (convex)
Kantorovich's formulation of OT
§ In case of discrete inputs

$$p = \sum_{i=1}^{m} a_i\,\delta_{x_i}, \qquad q = \sum_{j=1}^{n} b_j\,\delta_{y_j}$$

§ Couplings:

$$\mathbb{P} = \{P \geq 0,\ P \in \mathbb{R}^{m\times n},\ P\mathbf{1}_n = a,\ P^{\top}\mathbf{1}_m = b\}$$

§ LP problem: find P

$$P = \operatorname{argmin}_{P\in\mathbb{P}} \langle P, C\rangle$$

where C is the cost matrix, i.e. $C_{ij} = c(x_i, y_j)$
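The discrete problem above can be handed directly to an off-the-shelf LP solver. A minimal sketch using scipy.optimize.linprog (my own toy example; dedicated OT libraries such as POT are preferable for large problems):

```python
import numpy as np
from scipy.optimize import linprog

# Toy discrete distributions and a squared-Euclidean cost
x = np.array([0.0, 1.0, 2.0])        # support of p (m = 3)
y = np.array([0.5, 1.5])             # support of q (n = 2)
a = np.array([0.3, 0.3, 0.4])        # weights a_i (sum to 1)
b = np.array([0.6, 0.4])             # weights b_j (sum to 1)

m, n = len(a), len(b)
C = (x[:, None] - y[None, :]) ** 2   # cost matrix C_ij = c(x_i, y_j)

# Equality constraints on vec(P) (row-major): row sums equal a, column sums equal b
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0         # sum_j P_ij = a_i
for j in range(n):
    A_eq[m + j, j::n] = 1.0                  # sum_i P_ij = b_j
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
P = res.x.reshape(m, n)                      # optimal coupling
print("OT cost <P, C> =", res.fun)
print(P)
```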
Why is OT better than KL and JS divergences?
§ OT provides a smooth measure and is more useful than KL and JS
§ Example (see the worked example below):
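A standard worked example (in the spirit of the illustration used in the WGAN paper; stated here as my own summary). Take two point masses on the real line, $P = \delta_0$ and $Q_\theta = \delta_\theta$. For $\theta \neq 0$:

• $D_{KL}(P\|Q_\theta) = +\infty$ (the supports are disjoint)
• $D_{JS}(P\|Q_\theta) = \log 2$, a constant that gives no gradient in $\theta$
• $\mathcal{W}(P, Q_\theta) = |\theta|$, which decreases smoothly as $Q_\theta$ moves toward $P$

Only the OT distance tells the learner how far the model is and in which direction to move.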
How to apply the 1-Wasserstein distance to GAN?

$$\mathcal{W}(p, q) = \inf_{\gamma\in\Pi(p,q)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big] = \inf_{\gamma} \langle C, \gamma\rangle
\quad \text{s.t.}\quad \sum_{j=1}^{n}\gamma_{ij} = p_i,\ i = 1,\dots,m; \qquad \sum_{i=1}^{m}\gamma_{ij} = q_j,\ j = 1,\dots,n$$

Primal LP: $\min\ c^{\top}x$ s.t. $Ax = b,\ x \geq 0$
Dual LP: $\max\ b^{\top}y$ s.t. $A^{\top}y \leq c$

with $c = \mathrm{vec}(C) \in \mathbb{R}^{mn}$, $x = \mathrm{vec}(\gamma) \in \mathbb{R}^{mn}$, $b = [p^{\top}, q^{\top}]^{\top} \in \mathbb{R}^{m+n}$

Writing the dual variable as $y = [f^{\top}, g^{\top}]^{\top}$, the dual becomes

$$\max\ f^{\top}p + g^{\top}q \quad \text{s.t.}\quad f_i + g_j \leq C_{ij},\ i = 1,\dots,m;\ j = 1,\dots,n$$

It is easy to see that $f_i = -g_i$, so $|f_i - f_j| \leq 1\cdot|x_i - y_j|$, i.e. f is 1-Lipschitz:

$$\mathcal{W}(p, q) = \sup_{\|f\|_{L}\leq 1} \mathbb{E}_{x\sim p}\big[f(x)\big] - \mathbb{E}_{x\sim q}\big[f(x)\big]$$

(Kantorovich-Rubinstein duality)
Training WGAN
In WGAN, replace the discriminator with $f$ and minimize the 1-Wasserstein distance:

$$\min_{\theta}\; \mathcal{W}(p, q_\theta) = \sup_{\|f_w\|_{L}\leq 1} \mathbb{E}_{x\sim p}\big[f_w(x)\big] - \mathbb{E}_{z\sim r(z)}\big[f_w(g_\theta(z))\big]$$

Ref: Wasserstein GAN, ICML 2017

Training alternates two steps: find $w$ (critic), then update $\theta$ (generator).
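A minimal, hedged sketch of this training loop in PyTorch (my own illustration of the procedure above, on 1-D toy data with made-up hyperparameters; as in the paper, the critic takes several steps per generator step, uses RMSprop, and enforces the Lipschitz constraint by weight clipping):

```python
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, dout))

critic = mlp(1, 1)   # f_w : x -> R (no sigmoid, unlike a GAN discriminator)
gen    = mlp(2, 1)   # g_theta : z -> x, with z ~ Uniform as on the earlier slide

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)

def real_batch(n=64):                        # toy "data" distribution: N(3, 0.5)
    return 3.0 + 0.5 * torch.randn(n, 1)

clip, n_critic = 0.01, 5
for step in range(2000):
    # Critic step (find w): maximize E_p[f_w(x)] - E_z[f_w(g_theta(z))]
    for _ in range(n_critic):
        x, z = real_batch(), torch.rand(64, 2)
        loss_c = -(critic(x).mean() - critic(gen(z).detach()).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():        # weight clipping ~ 1-Lipschitz constraint
            p.data.clamp_(-clip, clip)
    # Generator step (update theta): minimize the estimated Wasserstein distance
    z = torch.rand(64, 2)
    loss_g = -critic(gen(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```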
Thank you for listening