Jeffreys centroids:
A closed-form expression for positive histograms
and a guaranteed tight approximation for frequency histograms

Frank Nielsen
Frank.Nielsen@acm.org
Sony Computer Science Laboratories, Inc.

April 2013

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
Why histogram clustering?

Task: Classify documents into categories, using the Bag-of-Words (BoW) modeling paradigm [3, 6]:
◮ Define a word dictionary, and
◮ Represent each document by a word-count histogram.

Centroid-based k-means clustering [1]:
◮ Cluster document histograms to learn categories,
◮ Build visual vocabularies by quantizing image features:
  Compressed Histogram of Gradients descriptors [4].

→ histogram centroids

Notation: w_h = \sum_{i=1}^d h^i is the cumulative sum of the bin values, and \tilde{\cdot} is the normalization operator (\tilde{h} = h / w_h).
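To make the BoW step concrete, here is a minimal Python sketch (purely illustrative: the toy dictionary, documents, and add-one smoothing are assumptions, not part of the original slides) that builds word-count histograms and normalizes them to frequency histograms:

```python
import numpy as np
from collections import Counter

# Toy dictionary and documents (illustrative only).
dictionary = ["divergence", "centroid", "histogram", "cluster"]
docs = ["histogram centroid of a histogram cluster",
        "divergence between a histogram and a centroid"]

def word_count_histogram(doc, dictionary):
    """Word-count histogram over the dictionary, with add-one smoothing
    so that every bin stays positive (the log terms used later require it)."""
    counts = Counter(doc.split())
    return np.array([counts[w] + 1.0 for w in dictionary])

H = np.array([word_count_histogram(d, dictionary) for d in docs])   # positive histograms h_j
H_tilde = H / H.sum(axis=1, keepdims=True)                          # frequency histograms (tilde)
```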
Why Jeffreys divergence?

Distance between two frequency histograms \tilde{p} and \tilde{q}: the Kullback-Leibler divergence, or relative entropy,

    KL(\tilde{p} : \tilde{q}) = H^\times(\tilde{p} : \tilde{q}) - H(\tilde{p}),

    H^\times(\tilde{p} : \tilde{q}) = \sum_{i=1}^d \tilde{p}^i \log \frac{1}{\tilde{q}^i}   (cross-entropy),

    H(\tilde{p}) = H^\times(\tilde{p} : \tilde{p}) = \sum_{i=1}^d \tilde{p}^i \log \frac{1}{\tilde{p}^i}   (Shannon entropy).

→ expected extra number of bits per datum that must be transmitted when using the "wrong" distribution \tilde{q} instead of the true distribution \tilde{p}.
\tilde{p} is hidden by nature (and hypothesized), \tilde{q} is estimated.
Why Jeffreys divergence?

When clustering histograms, all histograms play the same role → Jeffreys [8] divergence:

    J(p, q) = KL(p : q) + KL(q : p) = \sum_{i=1}^d (p^i - q^i) \log \frac{p^i}{q^i} = J(q, p).

→ symmetrizes the KL divergence.
(Also called the J-divergence, symmetrical Kullback-Leibler divergence, etc.)
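As a quick reference, a minimal NumPy sketch of these quantities (function names are illustrative; histograms are assumed to have strictly positive bins):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p : q) = sum_i p^i log(p^i / q^i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p, q):
    """Jeffreys divergence J(p, q) = KL(p : q) + KL(q : p) = sum_i (p^i - q^i) log(p^i / q^i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p - q) * np.log(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
assert np.isclose(jeffreys(p, q), kl(p, q) + kl(q, p))   # symmetrized KL
assert np.isclose(jeffreys(p, q), jeffreys(q, p))         # symmetric in its arguments
```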
Jeffreys centroids: frequency and positive centroids

A set H = {h_1, ..., h_n} of weighted histograms, with positive histogram weights \pi_j > 0 normalized so that \sum_{j=1}^n \pi_j = 1.

◮ Jeffreys positive centroid c:

    c = \arg\min_{x \in \mathbb{R}^d_+} \sum_{j=1}^n \pi_j J(h_j, x).

◮ Jeffreys frequency centroid \tilde{c}:

    \tilde{c} = \arg\min_{x \in \Delta_d} \sum_{j=1}^n \pi_j J(\tilde{h}_j, x).

\Delta_d: the probability (d-1)-dimensional simplex.
Prior work

◮ Histogram clustering wrt. the χ² distance [10]
◮ Histogram clustering wrt. the Bhattacharyya distance [11, 13]
◮ Histogram clustering wrt. the Kullback-Leibler distance as Bregman k-means clustering [1]
◮ Jeffreys frequency centroid [16] (Newton numerical optimization)
◮ Jeffreys frequency centroid as an equivalent symmetrized Bregman centroid [14]
◮ Mixed Bregman clustering [15]
◮ Smooth family of symmetrized KL centroids including Jensen-Shannon centroids and Jeffreys centroids as limit cases [12]
Jeffreys positive centroid

    c = \arg\min_{x \in \mathbb{R}^d_+} J(H, x) = \arg\min_{x \in \mathbb{R}^d_+} \sum_{j=1}^n \pi_j J(h_j, x).

Theorem (Theorem 1)
The Jeffreys positive centroid c = (c^1, ..., c^d) of a set {h_1, ..., h_n} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

    c^i = \frac{a^i}{W(\frac{a^i}{g^i} e)},

where a^i = \sum_{j=1}^n \pi_j h_j^i denotes the coordinate-wise weighted arithmetic mean and g^i = \prod_{j=1}^n (h_j^i)^{\pi_j} the coordinate-wise weighted geometric mean.

Lambert analytic function [2]: W(x) e^{W(x)} = x for x ≥ 0.
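A minimal sketch of Theorem 1 with NumPy/SciPy (the helper name is illustrative; scipy.special.lambertw evaluates the principal branch and returns a complex value, so we keep its real part):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, pi):
    """Closed-form Jeffreys positive centroid (Theorem 1).

    H: (n, d) array of positive histograms h_j; pi: (n,) weights summing to 1.
    Returns c with c^i = a^i / W((a^i / g^i) e)."""
    H = np.asarray(H, dtype=float)
    pi = np.asarray(pi, dtype=float)
    a = pi @ H                               # weighted arithmetic means a^i
    g = np.exp(pi @ np.log(H))               # weighted geometric means g^i
    return a / lambertw(a / g * np.e).real   # principal branch W_0

# Example with three frequency histograms and uniform weights.
H = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])
pi = np.ones(3) / 3
c = jeffreys_positive_centroid(H, pi)
print(c, c.sum())    # for frequency inputs, sum(c) <= 1 (see Lemma 1 below)
```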
Jeffreys positive centroid (proof)

    \min_x \sum_{j=1}^n \pi_j J(h_j, x)
    = \min_x \sum_{j=1}^n \pi_j \sum_{i=1}^d (h_j^i - x^i)(\log h_j^i - \log x^i)
    ≡ \min_x \sum_{i=1}^d \sum_{j=1}^n \pi_j (x^i \log x^i - x^i \log h_j^i - h_j^i \log x^i)

(dropping the terms h_j^i \log h_j^i that do not depend on x)

    = \min_x \sum_{i=1}^d \Big( x^i \log x^i - x^i \log \underbrace{\prod_{j=1}^n (h_j^i)^{\pi_j}}_{g^i} - \underbrace{\sum_{j=1}^n \pi_j h_j^i}_{a^i} \log x^i \Big)

    = \min_x \sum_{i=1}^d \Big( x^i \log \frac{x^i}{g^i} - a^i \log x^i \Big).
Jeffreys positive centroid (proof)

Coordinate-wise, minimize

    \min_x \ x \log \frac{x}{g} - a \log x.

Setting the derivative to zero, we solve

    \log \frac{x}{g} + 1 - \frac{a}{x} = 0,

i.e. \frac{a}{x} e^{a/x} = \frac{a}{g} e, so that \frac{a}{x} = W(\frac{a}{g} e) and

    x = \frac{a}{W(\frac{a}{g} e)}.
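A quick numerical sanity check of this coordinate-wise solution (a sketch, assuming SciPy and illustrative values of a and g with a ≥ g):

```python
import numpy as np
from scipy.special import lambertw
from scipy.optimize import minimize_scalar

a, g = 0.4, 0.25                                 # illustrative arithmetic/geometric means
f = lambda x: x * np.log(x / g) - a * np.log(x)  # coordinate-wise objective (convex)

x_closed = a / lambertw(a / g * np.e).real       # closed-form minimizer
x_num = minimize_scalar(f, bounds=(1e-9, 10.0), method="bounded").x
assert np.isclose(x_closed, x_num, atol=1e-4)
```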
Jeffreys frequency centroid: A guaranteed approximation

    \tilde{c} = \arg\min_{x \in \Delta_d} \sum_{j=1}^n \pi_j J(\tilde{h}_j, x).

Relaxing x from the probability simplex \Delta_d to \mathbb{R}^d_+, we get

    \tilde{c}' = \frac{c}{w_c},   with   w_c = \sum_i c^i   and   c^i = \frac{a^i}{W(\frac{a^i}{g^i} e)}.

Lemma (Lemma 1)
The cumulative sum w_c of the bin values of the Jeffreys positive centroid c of a set of frequency histograms is less than or equal to one: 0 < w_c ≤ 1.
Proof of Lemma 1

From Theorem 1:

    w_c = \sum_{i=1}^d c^i = \sum_{i=1}^d \frac{a^i}{W(\frac{a^i}{g^i} e)}.

Arithmetic-geometric mean inequality: a^i ≥ g^i.
Therefore W(\frac{a^i}{g^i} e) ≥ W(e) = 1 (W is increasing), so c^i ≤ a^i. Thus

    w_c = \sum_{i=1}^d c^i ≤ \sum_{i=1}^d a^i = 1.
Lemma 2

Lemma (Lemma 2)
For any positive histogram x and frequency histogram \tilde{h}, we have

    J(x, \tilde{h}) = J(\tilde{x}, \tilde{h}) + (w_x - 1)(KL(\tilde{x} : \tilde{h}) + \log w_x),

where w_x denotes the normalization factor (w_x = \sum_{i=1}^d x^i).

For a set \tilde{H} of frequency histograms, likewise

    J(x, \tilde{H}) = J(\tilde{x}, \tilde{H}) + (w_x - 1)(KL(\tilde{x} : \tilde{H}) + \log w_x),

where J(x, \tilde{H}) = \sum_{j=1}^n \pi_j J(x, \tilde{h}_j) and KL(\tilde{x} : \tilde{H}) = \sum_{j=1}^n \pi_j KL(\tilde{x} : \tilde{h}_j) (with \sum_{j=1}^n \pi_j = 1).
Proof of Lemma 2

Write x^i = w_x \tilde{x}^i. Then

    J(x, \tilde{h}) = \sum_{i=1}^d (w_x \tilde{x}^i - \tilde{h}^i) \log \frac{w_x \tilde{x}^i}{\tilde{h}^i}

    = \sum_{i=1}^d \Big( w_x \tilde{x}^i \log \frac{\tilde{x}^i}{\tilde{h}^i} + w_x \tilde{x}^i \log w_x + \tilde{h}^i \log \frac{\tilde{h}^i}{\tilde{x}^i} - \tilde{h}^i \log w_x \Big)

    = (w_x - 1) \log w_x + J(\tilde{x}, \tilde{h}) + (w_x - 1) \sum_{i=1}^d \tilde{x}^i \log \frac{\tilde{x}^i}{\tilde{h}^i}

    = J(\tilde{x}, \tilde{h}) + (w_x - 1)(KL(\tilde{x} : \tilde{h}) + \log w_x),

since \sum_{i=1}^d \tilde{h}^i = \sum_{i=1}^d \tilde{x}^i = 1.
Guaranteed approximation of \tilde{c}

Theorem (Theorem 2)
Let \tilde{c} denote the Jeffreys frequency centroid and \tilde{c}' = \frac{c}{w_c} the normalized Jeffreys positive centroid. Then the approximation factor \alpha_{\tilde{c}'} = \frac{J(\tilde{c}', \tilde{H})}{J(\tilde{c}, \tilde{H})} is such that 1 ≤ \alpha_{\tilde{c}'} ≤ \frac{1}{w_c} (with w_c ≤ 1).
Proof of Theorem 2

    J(c, \tilde{H}) ≤ J(\tilde{c}, \tilde{H}) ≤ J(\tilde{c}', \tilde{H}).

From Lemma 2, since J(\tilde{c}', \tilde{H}) = J(c, \tilde{H}) + (1 - w_c)(KL(\tilde{c}' : \tilde{H}) + \log w_c) and J(c, \tilde{H}) ≤ J(\tilde{c}, \tilde{H}):

    1 ≤ \alpha_{\tilde{c}'} ≤ 1 + \frac{(1 - w_c)(KL(\tilde{c}' : \tilde{H}) + \log w_c)}{J(\tilde{c}, \tilde{H})}.

Using KL(\tilde{c}' : \tilde{H}) = \frac{1}{w_c} KL(c, \tilde{H}) - \log w_c:

    \alpha_{\tilde{c}'} ≤ 1 + \frac{(1 - w_c) KL(c, \tilde{H})}{w_c J(\tilde{c}, \tilde{H})}.

Since J(\tilde{c}, \tilde{H}) ≥ J(c, \tilde{H}) and KL(c, \tilde{H}) ≤ J(c, \tilde{H}), we get

    \alpha_{\tilde{c}'} ≤ \frac{1}{w_c}.

When w_c = 1 the bound is tight.
In practice...

c is available in closed form → compute w_c, KL(c, \tilde{H}), and J(c, \tilde{H}), and bound the approximation factor \alpha_{\tilde{c}'} as:

    \alpha_{\tilde{c}'} ≤ 1 + \Big(\frac{1}{w_c} - 1\Big) \frac{KL(c, \tilde{H})}{J(c, \tilde{H})} ≤ \frac{1}{w_c}.
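A sketch of this practical bound, reusing the jeffreys_positive_centroid, kl, and jeffreys helpers sketched earlier (the function name and returned pair are illustrative assumptions):

```python
import numpy as np

def normalized_centroid_with_bound(H_tilde, pi):
    """Normalized Jeffreys positive centroid and a data-dependent bound on its
    approximation factor: 1 + (1/w_c - 1) * KL(c, H~)/J(c, H~) <= 1/w_c."""
    c = jeffreys_positive_centroid(H_tilde, pi)   # closed form (Theorem 1)
    w_c = c.sum()
    kl_cH = sum(p * kl(c, h) for p, h in zip(pi, H_tilde))        # KL(c, H~)
    j_cH = sum(p * jeffreys(c, h) for p, h in zip(pi, H_tilde))   # J(c, H~)
    bound = 1.0 + (1.0 / w_c - 1.0) * kl_cH / j_cH
    return c / w_c, min(bound, 1.0 / w_c)

c_tilde_prime, alpha_bound = normalized_centroid_with_bound(H, pi)   # H, pi from the earlier sketch
print(c_tilde_prime.sum(), alpha_bound)   # the normalized centroid sums to 1
```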
Fine approximation

From [16, 14], minimizing the Jeffreys frequency centroid criterion is equivalent to:

    \tilde{c} = \arg\min_{\tilde{x} \in \Delta_d} KL(\tilde{a} : \tilde{x}) + KL(\tilde{x} : \tilde{g}).

The Lagrangian function enforcing \sum_i \tilde{c}^i = 1 yields

    \log \frac{\tilde{c}^i}{\tilde{g}^i} + 1 - \frac{\tilde{a}^i}{\tilde{c}^i} + \lambda = 0,

    \tilde{c}^i = \frac{\tilde{a}^i}{W(\frac{\tilde{a}^i}{\tilde{g}^i} e^{\lambda+1})},

    \lambda = -KL(\tilde{c} : \tilde{g}) ≤ 0.
Fine approximation: Bisection search

Requiring \tilde{c}^i = \frac{\tilde{a}^i}{W(\frac{\tilde{a}^i}{\tilde{g}^i} e^{\lambda+1})} ≤ 1 gives

    \lambda ≥ \log(e^{\tilde{a}^i} \tilde{g}^i) - 1  for all i,   i.e.   \lambda \in [\max_i \log(e^{\tilde{a}^i} \tilde{g}^i) - 1, 0].

Define

    s(\lambda) = \sum_{i=1}^d \tilde{c}^i(\lambda) = \sum_{i=1}^d \frac{\tilde{a}^i}{W(\frac{\tilde{a}^i}{\tilde{g}^i} e^{\lambda+1})}.

The function s is monotonically decreasing with s(0) ≤ 1.
→ Bisection search for s(\lambda^*) ≃ 1 to arbitrary precision.
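A minimal sketch of this bisection search (assuming SciPy; plain double precision here, whereas the experiments below use arbitrary-precision arithmetic; the function name is illustrative):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_frequency_centroid(H_tilde, pi, tol=1e-12):
    """Arbitrarily fine approximation of the Jeffreys frequency centroid by
    bisection on the Lagrange multiplier lambda."""
    H_tilde = np.asarray(H_tilde, dtype=float)
    pi = np.asarray(pi, dtype=float)
    a = pi @ H_tilde                        # weighted arithmetic mean a~ (already normalized)
    g = np.exp(pi @ np.log(H_tilde))        # weighted geometric mean
    g = g / g.sum()                         # normalized geometric mean g~

    def s(lam):                             # s(lambda) = sum_i c~^i(lambda), decreasing in lambda
        return float(np.sum(a / lambertw(a / g * np.exp(lam + 1.0)).real))

    lo = float(np.max(a + np.log(g))) - 1.0   # lambda >= max_i log(e^{a~^i} g~^i) - 1
    hi = 0.0                                  # s(0) <= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s(mid) > 1.0:
            lo = mid                          # s decreasing: need a larger lambda
        else:
            hi = mid
    c = a / lambertw(a / g * np.exp(hi + 1.0)).real
    return c / c.sum()                        # absorb the residual tolerance
```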
Experiments: Caltech-256

Caltech-256 [7]: 30607 images labeled into 256 categories (256 Jeffreys centroids).
Arbitrary floating-point precision: http://www.apfloat.org/
Veldhuis' approximation: \tilde{c}'' = \frac{\tilde{a} + \tilde{g}}{2}.

         α_c (optimal positive)   α_{c̃'} (normalized approx.)   w_c ≤ 1 (normalizing coeff.)   α_{c̃''} (Veldhuis' approx.)
    avg  0.9648680345638155       1.0002205080964255             0.9338228644308926             1.065590178484613
    min  0.906414219584823        1.0000005079528809             0.8342819488534723             1.0027707382095195
    max  0.9956399220678585       1.0000031489541772             0.9931975105809021             1.3582296675397754
Experiments: Synthetic data sets

Random binary histograms:

    \alpha = \frac{J(\tilde{c}')}{J(\tilde{c})} ≥ 1.

Performance: \bar{\alpha} ∼ 1.0000009, \alpha_{max} ∼ 1.00181506, \alpha_{min} = 1.000000.

Open question: can a better worst-case upper bound on this performance be expressed?
Summary and conclusion

◮ Jeffreys positive centroid c in closed form
◮ Normalized Jeffreys positive centroid \tilde{c}' within approximation factor \frac{1}{w_c}
◮ Bisection search for an arbitrarily fine approximation of \tilde{c}

→ Variational Jeffreys k-means clustering (a minimal sketch follows this slide)

Other Kullback-Leibler symmetrizations:
◮ Jensen-Shannon divergence [9]
◮ Chernoff divergence [5]
◮ Family of symmetrized centroids including Jensen-Shannon and Jeffreys centroids [12]
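A minimal sketch of such a variational Jeffreys k-means (an illustrative assumption-laden implementation, not the paper's algorithm: it reuses the jeffreys and jeffreys_positive_centroid helpers sketched earlier, with uniform in-cluster weights and a fixed iteration budget):

```python
import numpy as np

def jeffreys_kmeans(H, k, iters=50, seed=0):
    """Lloyd-type k-means on positive histograms under the Jeffreys divergence.

    H: (n, d) array of positive histograms. Returns (centroids, labels)."""
    H = np.asarray(H, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = H[rng.choice(len(H), size=k, replace=False)].copy()   # random initialization
    labels = np.zeros(len(H), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest centroid under J.
        labels = np.array([int(np.argmin([jeffreys(h, c) for c in centroids])) for h in H])
        # Update step: closed-form Jeffreys positive centroid of each cluster (Theorem 1).
        for j in range(k):
            members = H[labels == j]
            if len(members):
                weights = np.full(len(members), 1.0 / len(members))
                centroids[j] = jeffreys_positive_centroid(members, weights)
    return centroids, labels
```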
Thank you!

http://www.informationgeometry.org
@Article{JeffreysCentroid-2013,
author = {Frank Nielsen},
title = {Jeffreys centroids: {A} closed-form expression for positive histograms
and a guaranteed tight approximation for frequency histograms},
journal = {IEEE Signal Processing Letters (SPL)},
year = {2013}
}

Bibliographic references

[1] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749, 2005.
[2] D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry. Real values of the W-function. ACM Transactions on Mathematical Software, 21(2):161-171, June 1995.
[3] Brigitte Bigi. Using Kullback-Leibler distance for text categorization. In Proceedings of the 25th European Conference on IR Research (ECIR'03), pages 305-319, Berlin, Heidelberg, 2003. Springer-Verlag.
[4] Vijay Chandrasekhar, Gabriel Takacs, David M. Chen, Sam S. Tsai, Yuriy A. Reznik, Radek Grzeszczuk, and Bernd Girod. Compressed histogram of gradients: A low-bitrate descriptor. International Journal of Computer Vision, 96(3):384-399, 2012.
[5] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493-507, 1952.
[6] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. Workshop on Statistical Learning in Computer Vision (ECCV), pages 1-22, 2004.
[7] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[8] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, 186(1007):453-461, March 1946.
[9] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37:145-151, 1991.
[10] Huan Liu and Rudy Setiono. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the Seventh International Conference on Tools with Artificial Intelligence (TAI), pages 88-, Washington, DC, USA, 1995. IEEE Computer Society.
[11] Max Mignotte. Segmentation by fusion of histogram-based k-means clusters in different color spaces. IEEE Transactions on Image Processing (TIP), 17(5):780-787, 2008.
[12] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. CoRR, abs/1009.4004, 2010.
[13] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466, August 2011.
[14] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2048-2059, June 2009.
[15] Richard Nock, Panu Luosto, and Jyrki Kivinen. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 154-169, Berlin, Heidelberg, 2008. Springer-Verlag.
[16] Raymond N. J. Veldhuis. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters, 9(3):96-99, March 2002.
