Computational Information Geometry
on Matrix Manifolds
Frank Nielsen
Frank.Nielsen@acm.org
www.informationgeometry.org
Sony Computer Science Laboratories, Inc.

July 2013, ICTP, Trieste, IT

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
Geometry of matrix manifolds...
◮ Euclidean geometry, Frobenius norm → distance:
  $\|M\|_F^2 = \sum_{i,j} m_{ij}^2 = \sum_i \|M_{i*}\|_2^2 = \sum_j \|M_{*j}\|_2^2 = \mathrm{tr}(M^\top M)$
◮ Riemannian geometry of symmetric positive definite (SPD) matrices [9, 2]
◮ Riemannian geometry of rank-deficient symmetric positive semi-definite (SPSD) matrices: Stiefel/Grassmann manifolds [3]
◮ Quantum geometry: SPD matrices with unit trace

“One geometry cannot be more true than another; it can only be more convenient”
— Jules Henri Poincaré (1902)

Forthcoming conference (GSI)
28th-30th August, Paris.

What is Computational Information Geometry?
◮ What is Information? = essence of data (datum = “thing”)
  (make it tangible → e.g., parameters of generative models)
◮ Can we do intrinsic computing?
  (unbiased by any particular “data representation” → same results after recoding the data)
◮ Geometry −→ science of invariance
  (mother of Science; compass & ruler; Descartes’ analytic = coordinate/Cartesian geometry; imaginaries, ...)
... the open-ended poetic mathematics!

Rationale for Computational Information Geometry
◮ Information is ... never void! → lower bounds:
  ◮ Fisher information and Cramér-Rao lower bound (estimation)
  ◮ Bayes error and Chernoff information (classification)
  ◮ Coding and Shannon entropy (communication)
  ◮ Program and Kolmogorov complexity (compression; unfortunately not computable!)
◮ Geometry:
  ◮ Language (point, line, ball, dimension, orthogonal, projection, geodesic, immersion, etc.)
  ◮ Power of characterization (e.g., the intersection of two pseudo-segments need not admit a closed-form expression)
◮ Computing: information computing. Seeking mathematical convenience and mathematical tricks (RKHS in ML).
  How to manipulate “spaces of functions”?!

Example I: Matrix manifold
Pattern = Gaussian mixture models (universal class)
Statistical (dis)similarity/distance: total Bregman divergence
(tBD, tKL).
Invariance: for xi ∼ N(µi, Σi) and the affine map y = A(x) = Lx + t,
yi ∼ N(Lµi + t, LΣi L⊤) and D(X1 : X2) = D(Y1 : Y2)
(L: any invertible linear transformation, t: a translation).

Shape Retrieval using Hierarchical Total Bregman Soft Clustering [7], IEEE PAMI, 2012.

Example II: Matrix manifolds
DTI: diffusion ellipsoids, tensor interpolation.
Pattern = zero-centered “Gaussians”
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: D(A⊤ PA : A⊤ QA) = D(P : Q) for A ∈ SL(d)
(volume/orientation-preserving transformations), for the total Bregman divergence (tBD).

(3D rat corpus callosum)
Total Bregman Divergence and its Applications to DTI Analysis [20], IEEE TMI, 2011.

Example III: Gaussian manifolds
Consider 5D Gaussian Mixture Models (GMMs) of color images
(image=RGBxy point set)

A Gaussian mixture model Σi wi N(µi, Σi) is interpreted as a
weighted point set {θi = (µi, Σi)}.

Matrix center points & clustering
Aggregation (matrix quantization for codebooks):
Given a data-set of matrices M = {M1 , ..., Mn } ⊂ M, compute a
center matrix C .
Centering as a variational minimization problem:
  $(\mathrm{OPT}): \quad C_p = \arg\min_{C \in \mathbb{M}} \sum_i w_i\, \mathrm{distance}^p(C, M_i)$
Notion of centrality, robustness to outliers?
For diagonal matrices, with the “Euclidean” distance, the usual geometric center points:
◮ median (p = 1): robust to outliers (Fermat-Weber point, no closed form),
◮ centroid (p = 2): breakdown point of 1 (→ tBD),
◮ circumcenter (p → ∞): minimizes the farthest-point distance (minimax [1]).

Diffusion Tensor Magnetic Resonance Imaging
DT-MRI: Measures anisotropic diffusion of water molecules in a
3 × 3 tensor assigned to each voxel position (circa 1990).
Used to analyze in-vivo connectivity patterns of brain tissues:
gray matter, white matter (corpus callosum) and cerebrospinal
fluid (CSF)

© Image courtesy of Peter J. Basser
(Magnetic resonance imaging of the brain and spine, Chapter 31)

Gradiometry tensor: 3 × 3 SPSD matrices
Beyond the “constant” g ≃ 9.81 m/s²: measuring the anisotropy of the gravity field.

→ Oil & gas industry.
Courtesy of BellGeo.
http://www.bellgeo.com/tech/technology_theory_of_FTG.html
Structure tensors in computer vision
→ Pioneered in image processing: tensor descriptor of a region at
a pixel. (Harris-Stephens [6]).
Consider a kernel K, and compute the tensor descriptor
  $T(p=(x,y)) = K * \begin{pmatrix} I_x'^2 & I_x' I_y' \\ I_y' I_x' & I_y'^2 \end{pmatrix} = \sum_{u,v} w(u,v)\, \nabla I(u,v)\, (\nabla I(u,v))^\top$
K: uniform or Gaussian kernel (e.g., an s × s window W centered at the pixel p);
I′x, I′y: the image gradient (partial derivatives of the image).
Versatile method: corner detection, optical flow estimation, segmentation, stereo matching, etc.
→ Tensor image processing

Harris-Stephens structure tensor (1988)
Deformation tensor field

Harris-Stephens combined corner-edge detector:
  R = det T − k (tr T)²
→ Measures of tensor anisotropy.
The structure tensor represents local orientation (eigenvectors/eigenvalues).

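As a concrete illustration of the two slides above, here is a minimal NumPy sketch of the structure tensor and the Harris-Stephens response R = det T − k (tr T)²; the 3×3 box window standing in for the kernel K, the function name harris_response, and the test image are illustrative choices, not from the talk.

```python
import numpy as np

def harris_response(I, k=0.04):
    """Structure tensor + Harris-Stephens corner response on a grayscale image I."""
    Iy, Ix = np.gradient(I.astype(float))       # image derivatives I'(y), I'(x)
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy   # entries of grad I (grad I)^T
    def box(A):                                 # 3x3 box window standing in for K
        P = np.pad(A, 1, mode='edge')
        return sum(P[i:i + A.shape[0], j:j + A.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0
    Txx, Tyy, Txy = box(Ixx), box(Iyy), box(Ixy)
    return Txx * Tyy - Txy ** 2 - k * (Txx + Tyy) ** 2   # R = det T - k (tr T)^2

# A bright square on a dark background: responses peak near its four corners.
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
R = harris_response(img)
print(np.unravel_index(np.argmax(R), R.shape))
```
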
Matrices with the Frobenius metric distance
Matrix space M with vectorial structure:
  $d_E(P, Q) = \|P - Q\|_F = \sqrt{\mathrm{tr}\big((P-Q)^\top (P-Q)\big)}$   (1)
Centroid of tensors:
  $C_E = \frac{1}{n} \sum_{i=1}^n w_i T_i$   (2)
→ scalar average of each element of the tensor.
Tensor Field Segmentation Using Region Based Active Contour Model [21], ECCV, 2004.

Matrix vectorization & computational geometry
Computational geometry on spaces of w × h matrices with respect to the
Frobenius distance amounts to computational geometry on the Euclidean
vector space of dimension D = w × h.
→ Voronoi diagrams, smallest enclosing ball, minimum spanning tree, etc.
For symmetric matrices, we have D = d(d+1)/2 degrees of freedom, and
vectorize as follows:
  $\|M\|_F^2 = \sum_{i=1}^d \sum_{j=1}^d m_{ij}^2 = \sum_{i=1}^d m_{ii}^2 + 2 \sum_{i=1}^{d-1} \sum_{j=i+1}^d m_{ij}^2 = \|m\|_2^2$
with $m = [m_{11} \ldots m_{dd}\ \sqrt{2}\,m_{12} \ldots \sqrt{2}\,m_{1d} \ldots \sqrt{2}\,m_{d-1,d}]^\top$.

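A short sketch of this isometric vectorization in NumPy (the helper name svec is a common convention, used here illustratively), checking that the Frobenius norm is preserved:

```python
import numpy as np

def svec(M):
    """Vectorize a symmetric d x d matrix, scaling off-diagonal entries by
    sqrt(2) so that ||M||_F = ||svec(M)||_2."""
    iu = np.triu_indices(M.shape[0], k=1)   # strict upper-triangular indices
    return np.concatenate([np.diag(M), np.sqrt(2.0) * M[iu]])

A = np.random.randn(4, 4); M = (A + A.T) / 2   # random symmetric matrix
assert np.isclose(np.linalg.norm(M, 'fro'), np.linalg.norm(svec(M)))
```
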
Matrix functions

From the spectral decomposition M = UDU⊤, with D = λ(M) = diag(λ1, ..., λd)
the diagonal matrix of eigenvalues, a real-valued function x → f(x) extends to
matrices as
  $f(M) = U\, \mathrm{diag}(f(\lambda_1), \ldots, f(\lambda_d))\, U^\top$
Examples: log x, exp x, |x|, x², x^{1/2}, etc.
O(d³) spectral (SVD) factorization complexity.

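A sketch of this spectral extension with NumPy's symmetric eigendecomposition (the helper name matrix_function is illustrative):

```python
import numpy as np

def matrix_function(M, f):
    """f(M) = U diag(f(lambda_1), ..., f(lambda_d)) U^T for symmetric M."""
    lam, U = np.linalg.eigh(M)      # O(d^3) spectral decomposition
    return (U * f(lam)) @ U.T       # scales column j of U by f(lam_j)

# Consistency check on an SPD matrix: exp(log(P)) recovers P.
A = np.random.randn(3, 3); P = A @ A.T + 3 * np.eye(3)
assert np.allclose(matrix_function(matrix_function(P, np.log), np.exp), P)
```
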
Riemannian cone of SPD matrices
Exponential map from the tangent plane (the vector space Sym of symmetric matrices) to the manifold cone C:
  $\exp_P : T_P \mathcal{C} = \mathrm{Sym} \to \mathcal{C}$
Logarithmic map from the manifold cone C to the tangent planes:
  $\log_P : \mathcal{C} \to T_P \mathcal{C} = \mathrm{Sym}$
  $\log_P(Q) = P^{\frac{1}{2}} \log\big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big) P^{\frac{1}{2}}$
maps any point Q ∈ Sym⁺⁺ to the unique tangent vector at P such that γ(0) = P and γ(1) = Q.
Geodesic equation:
  $\gamma_t(P, Q) = P^{\frac{1}{2}} \big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)^t P^{\frac{1}{2}}$
Geodesic (metric length) distance:
  $d_R(P, Q) = \big\|\log\big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)\big\|_F$

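The geodesic and the length distance above translate directly into a few lines of NumPy; a minimal sketch (helper names are illustrative and reused by later snippets):

```python
import numpy as np

def spd_power(P, t):
    """P^t for a symmetric positive definite P, via eigendecomposition."""
    lam, U = np.linalg.eigh(P)
    return (U * lam ** t) @ U.T

def spd_geodesic(P, Q, t):
    """gamma_t(P, Q) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}."""
    Ph, Pmh = spd_power(P, 0.5), spd_power(P, -0.5)
    return Ph @ spd_power(Pmh @ Q @ Pmh, t) @ Ph

def spd_distance(P, Q):
    """d_R(P, Q) = ||log(P^{-1/2} Q P^{-1/2})||_F."""
    Pmh = spd_power(P, -0.5)
    lam = np.linalg.eigvalsh(Pmh @ Q @ Pmh)
    return np.sqrt(np.sum(np.log(lam) ** 2))

# The geodesic is parameterized proportionally to arc length:
A, B = np.random.randn(3, 3), np.random.randn(3, 3)
P, Q = A @ A.T + np.eye(3), B @ B.T + np.eye(3)
assert np.isclose(spd_distance(P, spd_geodesic(P, Q, 0.3)), 0.3 * spd_distance(P, Q))
```
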
Riemannian Karcher centroid
  $d_R(P, Q) = \sqrt{\mathrm{tr}\,\log^2(P^{-1} Q)} = \sqrt{\sum_{i=1}^d \log^2 \lambda_i} = \big\|\log\big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)\big\|_F$
where the λi’s are the eigenvalues of P⁻¹Q
(P⁻¹Q is similar to Q^{1/2} P⁻¹ Q^{1/2}, hence has the same positive spectrum).
The unique mean is characterized by $\sum_{i=1}^n \log(T_i^{-1} C_R) = 0$.
Closed-form solution only for n = 2 (the geodesic midpoint):
  $C_R(P, Q) = P^{\frac{1}{2}} \big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)^{\frac{1}{2}} P^{\frac{1}{2}}$
Otherwise, iterative approximation ($C_R = \lim_{t \to \infty} C_t$):
  $C_{t+1} = C_t \exp\Big(\frac{1}{n} \sum_{i=1}^n \log\big(C_t^{-1} T_i\big)\Big)$

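A sketch of this fixed-point iteration, rewritten in the algebraically equivalent symmetrized form C_{t+1} = C^{1/2} exp((1/n) Σᵢ log(C^{−1/2} Tᵢ C^{−1/2})) C^{1/2} so that every matrix passed to exp/log stays symmetric (a convenience assumption; no step-size control or stopping test):

```python
import numpy as np

def spd_fun(X, f):
    """Spectral extension of a scalar function f to a symmetric matrix X."""
    lam, U = np.linalg.eigh(X)
    return (U * f(lam)) @ U.T

def karcher_mean(Ts, iters=50):
    C = sum(Ts) / len(Ts)                       # start at the arithmetic mean
    for _ in range(iters):
        Ch = spd_fun(C, np.sqrt)
        Cmh = spd_fun(C, lambda x: 1.0 / np.sqrt(x))
        # Average of the log-maps of the T_i in the tangent space at C.
        V = sum(spd_fun(Cmh @ T @ Cmh, np.log) for T in Ts) / len(Ts)
        C = Ch @ spd_fun(V, np.exp) @ Ch        # exponential map back to the cone
    return C

rng = np.random.default_rng(0)
Ts = [A @ A.T + np.eye(3) for A in rng.standard_normal((5, 3, 3))]
print(karcher_mean(Ts))
```
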
Riemannian minimax SPD center (circumcenter [1])
Case of p = ∞, center that minimizes the maximum distance.
GEO-ALG: start with c₁ ∈ P and iteratively update the current circumcenter
as cᵢ₊₁ = Geodesic(cᵢ, fᵢ, 1/(i+1)), where fᵢ denotes the farthest point of P
from cᵢ, and Geodesic(p, q, t) denotes the intermediate point m on the
geodesic passing through p and q such that ρ(p, m) = t × ρ(p, q).
Geodesic:
  $\gamma_t(P, Q) = P^{\frac{1}{2}} \big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)^t P^{\frac{1}{2}}$
Find t such that $\sum_{i=1}^d \log^2 \lambda_i^t = t^2 \sum_{i=1}^d \log^2 \lambda_i = r^2$,
that is, $t = r / \sqrt{\sum_{i=1}^d \log^2 \lambda_i}$.
Core-set proof and guaranteed convergence [1].

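GEO-ALG itself is then a short loop; this sketch (name illustrative) reuses spd_geodesic and spd_distance from the SPD geodesic snippet above:

```python
def spd_minimax_center(Ps, iters=200):
    """Geodesic walk toward the current farthest point with step 1/(i+1).
    Assumes spd_geodesic and spd_distance as defined in the earlier sketch."""
    c = Ps[0]
    for i in range(1, iters + 1):
        far = max(Ps, key=lambda P: spd_distance(c, P))  # farthest matrix from c
        c = spd_geodesic(c, far, 1.0 / (i + 1))          # move along the geodesic
    return c
```
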
Matrices as parameters in probability distributions
Exponential families: Gaussian, Wishart, etc.:
  $p(x; \lambda) = p_F(x; \theta) = \exp\big(\langle t(x), \theta\rangle - F(\theta) + k(x)\big)$
Example: Poisson distribution
  $p(x; \lambda) = \frac{\lambda^x}{x!} \exp(-\lambda)$
◮ t(x) = x, the sufficient statistic,
◮ θ = log λ, the natural parameter,
◮ F(θ) = exp θ, the log-normalizer → CONVEX,
◮ k(x) = − log x!, the carrier measure (with respect to the counting measure).

Gaussians as an exponential family
  $p(x; \lambda) = p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \sqrt{\det \Sigma}} \exp\Big(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\Big)$
◮ $\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) \in \Theta = \mathbb{R}^d \times K_{d\times d}$, with $K_{d\times d}$ the cone of positive definite matrices,
◮ $F(\theta) = \frac{1}{4} \mathrm{tr}(\theta_2^{-1} \theta_1 \theta_1^\top) - \frac{1}{2} \log \det \theta_2 + \frac{d}{2} \log \pi$ → CONVEX,
◮ $t(x) = (x, -x x^\top)$,
◮ k(x) = 0.
Inner product: composite, the sum of a dot product and a matrix trace:
  $\langle \theta, \theta' \rangle = \theta_1^\top \theta_1' + \mathrm{tr}(\theta_2^\top \theta_2')$
The coordinate transformation τ : Λ → Θ is given for λ = (µ, Σ) by
  $\tau(\lambda) = \Big(\lambda_2^{-1}\lambda_1, \tfrac{1}{2}\lambda_2^{-1}\Big), \qquad \tau^{-1}(\theta) = \Big(\tfrac{1}{2}\theta_2^{-1}\theta_1, \tfrac{1}{2}\theta_2^{-1}\Big)$

Convex duality: Legendre transformation
◮ For a strictly convex and differentiable function F : X → R:
  $F^*(y) = \sup_{x \in \mathcal{X}} \{\langle y, x\rangle - F(x)\} =: \sup_x\, l_F(y; x)$
◮ Maximum obtained for y = ∇F(x):
  $\nabla_x l_F(y; x) = y - \nabla F(x) = 0 \Rightarrow y = \nabla F(x)$
◮ Maximum unique from the convexity of F (∇²F ≻ 0):
  $\nabla_x^2 l_F(y; x) = -\nabla^2 F(x) \prec 0$
◮ Convex conjugates:
  $(F, \mathcal{X}) \Leftrightarrow (F^*, \mathcal{Y}), \qquad \mathcal{Y} = \{\nabla F(x) \mid x \in \mathcal{X}\}$

Legendre duality: Geometric interpretation
Consider the epigraph of F as a convex object:
◮ convex hull (V -representation), versus
◮ half-space (H-representation).

Legendre transform also called “slope” transform.

Legendre duality & Canonical divergence
◮ Convex conjugates have functionally inverse gradients: ∇F⁻¹ = ∇F*.
  ∇F* may require numerical approximation (not always available in analytical closed form).
◮ Involution: (F*)* = F with ∇F* = (∇F)⁻¹.
◮ Convex conjugate F* expressed using (∇F)⁻¹:
  $F^*(y) = \langle x, y\rangle - F(x)$, with $x = \nabla_y F^*(y)$
  $\quad\;\;\; = \langle (\nabla F)^{-1}(y), y\rangle - F\big((\nabla F)^{-1}(y)\big)$
◮ Fenchel-Young inequality at the heart of the canonical divergence:
  $F(x) + F^*(y) \ge \langle x, y\rangle$
  $A_F(x : y) = A_{F^*}(y : x) = F(x) + F^*(y) - \langle x, y\rangle \ge 0$

Dual Bregman divergences & canonical divergence [14]
  $\mathrm{KL}(P : Q) = E_P\Big[\log \frac{p(x)}{q(x)}\Big] \ge 0$
  $\quad = B_F(\theta_Q : \theta_P) = B_{F^*}(\eta_P : \eta_Q)$
  $\quad = F(\theta_Q) + F^*(\eta_P) - \langle \theta_Q, \eta_P\rangle$
  $\quad = A_F(\theta_Q : \eta_P) = A_{F^*}(\eta_P : \theta_Q)$
with θ_Q the natural parameterization and η_P = E_P[t(X)] = ∇F(θ_P) the moment parameterization.
  $\mathrm{KL}(P : Q) = \underbrace{\int p(x) \log \frac{1}{q(x)}\, \mathrm{d}x}_{H^\times(P:Q)} - \underbrace{\int p(x) \log \frac{1}{p(x)}\, \mathrm{d}x}_{H(P) = H^\times(P:P)}$
Shannon cross-entropy and entropy of an exponential family [14]:
  $H^\times(P : Q) = F(\theta_Q) - \langle \theta_Q, \nabla F(\theta_P)\rangle - E_P[k(x)]$
  $H(P) = F(\theta_P) - \langle \theta_P, \nabla F(\theta_P)\rangle - E_P[k(x)]$
  $H(P) = -F^*(\eta_P) - E_P[k(x)]$

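A quick numerical check of the identity KL(P : Q) = B_F(θ_Q : θ_P) on the Poisson family (F(θ) = e^θ, θ = log λ; the closed-form Poisson KL is standard):

```python
import numpy as np

def kl_poisson(lp, lq):
    """Closed-form KL between Poisson(lp) and Poisson(lq)."""
    return lp * np.log(lp / lq) + lq - lp

def bregman_exp(tq, tp):
    """B_F(theta_q : theta_p) for the log-normalizer F(theta) = exp(theta)."""
    return np.exp(tq) - np.exp(tp) - (tq - tp) * np.exp(tp)

lp, lq = 3.0, 5.0
assert np.isclose(kl_poisson(lp, lq), bregman_exp(np.log(lq), np.log(lp)))
```
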
Bregman divergence: Geometric interpretation (I)
Potential function F, graph plot F: (x, F(x)).
  $D_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle$

Bregman divergence: Geometric interpretation (II)
Potential function f, graph plot F: (x, f(x)).
  $B_f(p \| q) = f(p) - f(q) - (p - q) f'(q)$
B_f(·‖q): vertical distance between the hyperplane H_q tangent to the graph of f
at the lifted point q̂, and the translated hyperplane at p̂.

Bregman divergence: Geometric interpretation (III)
Bregman divergence and path integrals
Bregman divergence and path integrals:
  $B(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2)\rangle$   (3)
  $\quad = \int_{\theta_2}^{\theta_1} \langle \nabla F(t) - \nabla F(\theta_2), \mathrm{d}t\rangle$   (4)
  $\quad = \int_{\eta_1}^{\eta_2} \langle \nabla F^*(t) - \nabla F^*(\eta_1), \mathrm{d}t\rangle$   (5)
  $\quad = B^*(\eta_2 : \eta_1)$   (6)

Matrix Bregman divergences [4, 16]
Choose a real-valued functional generator F and extend it to matrices:
  $F(X) = \mathrm{tr}(\Psi(X)), \qquad \Psi(X) = \sum_{k \ge 0} t_{F,k}\, X^k$
($t_{F,k}$ from the Taylor expansion of the real-valued F)
  $B_F(P : Q) = F(P) - F(Q) - \mathrm{tr}\big((P - Q)^\top \nabla F(Q)\big)$
  $\nabla F(X) = \sum_{k \ge 0} t'_{F,k}\, X^k$
($t'_{F,k}$ from the Taylor expansion of the real-valued F′)

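A sketch of a matrix Bregman divergence computed spectrally rather than through the Taylor series, here with the von Neumann generator F(X) = tr(X log X − X) (helper names illustrative):

```python
import numpy as np

def spd_fun(X, f):
    lam, U = np.linalg.eigh(X)     # spectral extension of f to symmetric X
    return (U * f(lam)) @ U.T

def matrix_bregman(P, Q, F, gradF):
    """B_F(P : Q) = F(P) - F(Q) - tr((P - Q)^T gradF(Q))."""
    return F(P) - F(Q) - np.trace((P - Q).T @ gradF(Q))

F_vn = lambda X: np.trace(spd_fun(X, lambda l: l * np.log(l)) - X)  # tr(X log X - X)
grad_vn = lambda X: spd_fun(X, np.log)                              # gradient: log X

A, B = np.random.randn(3, 3), np.random.randn(3, 3)
P, Q = A @ A.T + np.eye(3), B @ B.T + np.eye(3)
print(matrix_bregman(P, Q, F_vn, grad_vn))   # nonnegative; zero iff P == Q
```
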
Matrix Bregman divergences [16]

(figure omitted)

Particular case: Bregman Schatten p-divergences [5, 16]

Schatten p-norm of a real symmetric matrix X (a unitarily invariant matrix norm):
  $\|X\|_p = \|\lambda(X)\|_p$
Bregman generator:
  $F(X) = \frac{1}{2} \|X\|_p^2$
Used in regularized convex optimization [5] and matrix data mining [16].

Matrix Legendre transformation

Extends the classical Legendre-Fenchel transformation:
  $F^*(\eta) = \sup_{\mathrm{spec}(\theta) \subseteq \mathrm{dom}(F)} \mathrm{tr}(\theta \eta^\top) - F(\theta)$
  $D_F(\theta_P : \theta_Q) = D_{F^*}(\eta_Q : \eta_P) = F(\theta) + F^*(\eta) - \mathrm{tr}(\theta \eta^\top)$
θ and η are dual matrix coordinate systems on the matrix manifold.
Non-metric differential structure with dual coordinate systems.

Bregman matrix means
  $B_F(X, P) = F(X) - F(P) - \mathrm{tr}\big((X - P)^\top \nabla F(P)\big)$
F(·): strictly convex and differentiable function on an open convex space.
  $C = \nabla F^{-1}\Big(\sum_{i=1}^n w_i \nabla F(T_i)\Big)$
→ the quasi-arithmetic mean for ∇F.
Since B_F(X, P) ≠ B_F(P, X), one also defines a right-sided centroid M′:
find the center of mass [13] (independent of the generator F).
Generators:
◮ F(X) = tr(X⊤X): the quadratic matrix entropy,
◮ F(X) = − log det X: the matrix Burg entropy,
◮ F(X) = tr(X log X − X): the von Neumann entropy [19, 18, 15] (Umegaki quantum relative entropy).

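Each generator thus yields its own matrix mean through C = ∇F⁻¹((1/n) Σᵢ ∇F(Tᵢ)); a minimal sketch with equal weights (function names illustrative):

```python
import numpy as np

def spd_fun(X, f):
    lam, U = np.linalg.eigh(X)
    return (U * f(lam)) @ U.T

def quasi_arithmetic_mean(Ts, gradF, gradF_inv):
    """C = gradF^{-1}((1/n) sum_i gradF(T_i)), the quasi-arithmetic matrix mean."""
    return gradF_inv(sum(gradF(T) for T in Ts) / len(Ts))

# Burg entropy F = -log det: gradF(X) = -X^{-1}  ->  harmonic matrix mean.
harmonic = lambda Ts: quasi_arithmetic_mean(
    Ts, lambda X: -np.linalg.inv(X), lambda Y: -np.linalg.inv(Y))
# von Neumann entropy: gradF(X) = log X  ->  log-Euclidean matrix mean.
log_euclidean = lambda Ts: quasi_arithmetic_mean(
    Ts, lambda X: spd_fun(X, np.log), lambda Y: spd_fun(Y, np.exp))
```
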
Total Bregman divergences (tBD)
Instead of ”vertical” projection in Bregman divergence, consider
perpendicular projection.
(Analogy with least squares and total least squares regression.)

  $\mathrm{tB}_F(P, Q) = \frac{B_F(P, Q)}{\sqrt{1 + \|\nabla F(Q)\|^2}}$
→ proven statistically robust.
Applications to robust DT-MRI segmentation [8].

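Dividing by the conformal factor takes one extra line on top of the previous sketch (same F/gradF conventions as matrix_bregman above):

```python
import numpy as np

def total_bregman(P, Q, F, gradF):
    """tB_F(P, Q) = B_F(P, Q) / sqrt(1 + ||gradF(Q)||_F^2)."""
    b = F(P) - F(Q) - np.trace((P - Q).T @ gradF(Q))
    return b / np.sqrt(1.0 + np.linalg.norm(gradF(Q), 'fro') ** 2)
```
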
Matrix Jensen/Burbea-Rao divergences [10]

The convexity gap defines a divergence:
  $\mathrm{BR}_F(P, Q) = \frac{F(P) + F(Q)}{2} - F\Big(\frac{P + Q}{2}\Big) \ge 0$
◮ F(X) = tr(X⊤X): the quadratic matrix entropy,
◮ F(X) = − log det X: the matrix Burg entropy,
◮ F(X) = tr(X log X − X): the von Neumann entropy,
◮ etc.

Smooth family of convex generators [12, 17]
1-parameter family of generators:
  $F_\alpha(X) = \frac{1}{\alpha(1-\alpha)} \mathrm{tr}\big(\alpha X - X^\alpha + (1-\alpha) I\big), \quad \alpha \notin \{0, 1\}$
  $B_\alpha(P : Q) = \frac{1}{\alpha(1-\alpha)} \mathrm{tr}\big(Q^\alpha - P^\alpha + \alpha Q^{\alpha-1}(P - Q)\big)$
  $\nabla F_\alpha(X) = \frac{1}{1-\alpha}\big(I - X^{\alpha-1}\big), \qquad \nabla F_\alpha^{-1}(X) = \big(I - (1-\alpha) X\big)^{\frac{1}{\alpha-1}}$
When α → 1, ∇F_α(X) → ∇F₁(X) = log X. When α → 0, ∇F_α(X) → ∇F₀(X) = I − X⁻¹.
◮ α = 2: quadratic matrix information
◮ α → 1: von Neumann information
◮ α → 0: Burg log-det information

Jensen (Burbea-Rao) divergences
Based on Jensen’s inequality for a strictly convex function F(·):
  $\mathrm{BR}_F(X, P) \stackrel{\mathrm{def}}{=} \frac{F(X) + F(P)}{2} - F\Big(\frac{X + P}{2}\Big) \ge 0$
Includes the special case of the Jensen-Shannon divergence:
  $\mathrm{JS}(p, q) = H\Big(\frac{p + q}{2}\Big) - \frac{H(p) + H(q)}{2}$
for F(x) = −H(x), the negative Shannon entropy H(x) = −x log x.
→ generators are convex; entropies are concave (negative generators).

Visualizing Burbea-Rao divergences

Burbea-Rao divergences include the squared Mahalanobis distance. (figure omitted)

Burbea-Rao from Symmetrizing Bregman divergences [13]
◮ Jeffreys-Bregman divergences:
  $S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2} = \frac{1}{2} \langle p - q, \nabla F(p) - \nabla F(q)\rangle$
◮ Jensen-Bregman divergences (diversity index):
  $J_F(p; q) = \frac{B_F\big(p, \frac{p+q}{2}\big) + B_F\big(q, \frac{p+q}{2}\big)}{2} = \frac{F(p) + F(q)}{2} - F\Big(\frac{p+q}{2}\Big) = \mathrm{BR}_F(p, q)$

Skew Burbea-Rao divergences
  $\mathrm{BR}_F^{(\alpha)} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$
  $\mathrm{BR}_F^{(\alpha)}(p, q) = \alpha F(p) + (1-\alpha) F(q) - F(\alpha p + (1-\alpha) q) = \mathrm{BR}_F^{(1-\alpha)}(q, p)$
Skew symmetrization of Bregman divergences:
  $\alpha B_F(p, \alpha p + (1-\alpha) q) + (1-\alpha) B_F(q, \alpha p + (1-\alpha) q) \stackrel{\mathrm{def}}{=} \mathrm{BR}_F^{(\alpha)}(p, q)$
= skew Jensen-Bregman divergences.

Bregman divergences = asymptotic skewed Jensen
divergences

  $B_F(p, q) = \lim_{\alpha \to 1} \frac{1}{1-\alpha}\, \mathrm{BR}_F^{(\alpha)}(p, q)$
  $B_F(q, p) = \lim_{\alpha \to 0} \frac{1}{\alpha}\, \mathrm{BR}_F^{(\alpha)}(p, q)$

Burbea-Rao/Jensen centroids
(p = 1)
  $\mathrm{OPT}: \quad C_F = \arg\min_X \sum_{i=1}^n w_i\, \mathrm{BR}_F^{(\alpha_i)}(X, T_i) = \arg\min_X L(X)$
W.l.o.g., equivalent to minimizing
  $E(C) = \Big(\sum_{i=1}^n w_i \alpha_i\Big) F(C) - \sum_{i=1}^n w_i F(\alpha_i C + (1-\alpha_i) T_i)$
a sum E = F + G of a convex function F and a concave function G ⇒
Convex-ConCave Procedure (CCCP, NIPS*01).
Start from an arbitrary C₀, and iteratively update as:
  $\nabla F(C_{t+1}) = -\nabla G(C_t)$
⇒ guaranteed convergence to a (local) minimum.

ConCave Convex Procedure (CCCP)
  $\min_x E(x) = F(x) + G(x), \qquad \nabla F(c_{t+1}) = -\nabla G(c_t)$
The decomposition may not be unique...

Iterative algorithm for Burbea-Rao centroids
Apply the CCCP scheme:
  $\nabla F(C_{t+1}) = \frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F(\alpha_i C_t + (1-\alpha_i) T_i)$
  $C_{t+1} = \nabla F^{-1}\Big(\frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F(\alpha_i C_t + (1-\alpha_i) T_i)\Big)$
Get arbitrarily fine approximations of the (skew) Burbea-Rao matrix centroids and barycenters.

Special case: α-log det divergence [15, 11]
Cone of Hermitian positive definite matrices (self-adjoint matrices M^H = M̄^⊤ = M).
  $F(X) = -\log \det X, \qquad \nabla F(X) = \nabla F^{-1}(X) = -X^{-1}$
Burbea-Rao α-log det divergences:
  $D_{\mathrm{ld}}^{(\alpha)}(P, Q) = \begin{cases} \mathrm{tr}(Q^{-1} P - I) - \log \det(Q^{-1} P) & \alpha = 1 \\ \frac{4}{1-\alpha^2} \log \frac{\det\left(\frac{1-\alpha}{2} P + \frac{1+\alpha}{2} Q\right)}{(\det P)^{\frac{1-\alpha}{2}} (\det Q)^{\frac{1+\alpha}{2}}} & \alpha \in \mathbb{R} \setminus \{-1, 1\} \\ \mathrm{tr}(P^{-1} Q - I) - \log \det(P^{-1} Q) & \alpha = -1 \end{cases}$
Start with $C_1 = \frac{1}{n} \sum_{i=1}^n T_i$, and iterate
  $C_{t+1} = n \Big(\sum_{i=1}^n \Big(\frac{1-\alpha}{2} T_i + \frac{1+\alpha}{2} C_t\Big)^{-1}\Big)^{-1}$
→ unique global mean (obtained from CCCP).

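The CCCP fixed point for this generator is a matrix-harmonic update; a sketch with equal weights (function name illustrative):

```python
import numpy as np

def alpha_logdet_mean(Ts, alpha=0.0, iters=100):
    """C_{t+1} = n (sum_i ((1-a)/2 T_i + (1+a)/2 C_t)^{-1})^{-1},
    started at the arithmetic mean C_1 = (1/n) sum_i T_i."""
    n = len(Ts)
    C = sum(Ts) / n
    for _ in range(iters):
        S = sum(np.linalg.inv(0.5 * (1 - alpha) * T + 0.5 * (1 + alpha) * C)
                for T in Ts)
        C = n * np.linalg.inv(S)
    return C

rng = np.random.default_rng(1)
Ts = [A @ A.T + np.eye(3) for A in rng.standard_normal((4, 3, 3))]
print(alpha_logdet_mean(Ts))   # alpha = 0: symmetric Jensen log-det mean
```
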
Bhattacharyya coefficients/distances
Bhattacharyya coefficient and non-metric distance:
  $C(p, q) = \int \sqrt{p(x) q(x)}\, \mathrm{d}x, \quad 0 < C(p, q) \le 1, \qquad B(p, q) = -\ln C(p, q)$
(the coefficient is always strictly positive). Hellinger metric:
  $H(p, q) = \sqrt{\frac{1}{2} \int \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2\, \mathrm{d}x}$
such that 0 ≤ H(p, q) ≤ 1, with
  $H(p, q) = \sqrt{\frac{1}{2}\Big(\int p(x)\,\mathrm{d}x + \int q(x)\,\mathrm{d}x - 2\int \sqrt{p(x)}\sqrt{q(x)}\,\mathrm{d}x\Big)} = \sqrt{1 - C(p, q)}$

Chernoff coefficients/α-divergences
Skew Bhattacharyya divergences based on Chernoff α-coefficients:
  $B_\alpha(p, q) = -\ln \int_x p^\alpha(x)\, q^{1-\alpha}(x)\, \mathrm{d}x = -\ln C_\alpha(p, q)$
  $\quad = -\ln \int_x q(x) \Big(\frac{p(x)}{q(x)}\Big)^\alpha \mathrm{d}x = -\ln E_q[L^\alpha(x)]$
Amari α-divergence:
  $D_\alpha(p \| q) = \begin{cases} \frac{4}{1-\alpha^2}\Big(1 - \int p(x)^{\frac{1-\alpha}{2}} q(x)^{\frac{1+\alpha}{2}}\, \mathrm{d}x\Big) & \alpha \ne \pm 1 \\ \int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x = \mathrm{KL}(p, q) & \alpha = -1 \\ \int q(x) \log \frac{q(x)}{p(x)}\, \mathrm{d}x = \mathrm{KL}(q, p) & \alpha = 1 \end{cases}$
  $D_\alpha(p \| q) = D_{-\alpha}(q \| p)$
Remapping α′ = (1−α)/2 (i.e., α = 1 − 2α′) yields the Chernoff α′-divergences.

Bhattacharyya/Chernoff of exponential families [10]

Equivalence with skew Burbea-Rao divergences:
  $B_\alpha(p_F(x; \theta_p), p_F(x; \theta_q)) = \mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q) = \alpha F(\theta_p) + (1-\alpha) F(\theta_q) - F(\alpha \theta_p + (1-\alpha) \theta_q)$   (7)
The Bhattacharyya divergence between probability distributions amounts to
computing a Jensen divergence on their (natural) parameters.

Closed-form Bhattacharyya distances for exp. fam.

A generic formula that instantiates into the well-known formulas of statistical
pattern recognition, with BR_F(λ_p, λ_q) = BR_F(τ(λ_p), τ(λ_q)) and F(θ) given
up to a constant:

◮ Multinomial: $F(\theta) = \log(1 + \sum_{i=1}^{d-1} e^{\theta_i})$;
  $-\ln \sum_{i=1}^d \sqrt{p_i q_i}$
◮ Poisson: $F(\theta) = \exp\theta$;
  $\frac{1}{2}\big(\sqrt{\mu_p} - \sqrt{\mu_q}\big)^2$
◮ Univariate Gaussian: $F(\theta) = -\frac{\theta_1^2}{4\theta_2} + \frac{1}{2}\log\big(-\frac{\pi}{\theta_2}\big)$;
  $\frac{1}{4}\frac{(\mu_p-\mu_q)^2}{\sigma_p^2+\sigma_q^2} + \frac{1}{2}\ln\frac{\sigma_p^2+\sigma_q^2}{2\sigma_p\sigma_q}$
◮ Multivariate Gaussian: $F(\theta) = \frac{1}{4}\mathrm{tr}(\theta_2^{-1}\theta_1\theta_1^\top) - \frac{1}{2}\log\det\theta_2$;
  $\frac{1}{8}(\mu_p-\mu_q)^\top\Big(\frac{\Sigma_p+\Sigma_q}{2}\Big)^{-1}(\mu_p-\mu_q) + \frac{1}{2}\ln\frac{\det\frac{\Sigma_p+\Sigma_q}{2}}{\sqrt{\det\Sigma_p\det\Sigma_q}}$

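For instance, the multivariate-Gaussian row of the table transcribes directly (the function name is illustrative):

```python
import numpy as np

def bhattacharyya_gaussians(mu_p, S_p, mu_q, S_q):
    """B = (1/8) dmu^T Sbar^{-1} dmu + (1/2) ln(det Sbar / sqrt(det S_p det S_q)),
    with Sbar = (S_p + S_q)/2."""
    Sbar = 0.5 * (S_p + S_q)
    dmu = mu_p - mu_q
    quad = 0.125 * dmu @ np.linalg.solve(Sbar, dmu)
    return quad + 0.5 * np.log(np.linalg.det(Sbar) /
                               np.sqrt(np.linalg.det(S_p) * np.linalg.det(S_q)))

print(bhattacharyya_gaussians(np.zeros(2), np.eye(2),
                              np.array([1.0, 0.0]), 2 * np.eye(2)))
```
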
Wrapping up

◮ Besides the Euclidean, log-Euclidean and Riemannian metric-based means, proposed divergence-based matrix centroids,
◮ total Bregman divergences and robustness (conformal geometry),
◮ Riemannian minimax center,
◮ skew Burbea-Rao/Jensen divergences extending Bregman divergences,
◮ Bhattacharyya means of densities = Burbea-Rao means on the (matrix) parameters.
Which mean do you mean or need?

Non-metric matrix manifolds with dually affine connections

In a nutshell:
◮ asymmetric (Bregman) non-metric divergences,
◮ Legendre transform, convex conjugates & dual divergences,
◮ dual θ-, η-, or mixed coordinate systems,
◮ dual closed-form affine geodesics (computationally convenient),
◮ Pythagorean theorem.

Thank you.

www.informationgeometry.org

“One geometry cannot be more true than another; it can only be more convenient”
— Jules Henri Poincaré (1902)

Bibliographic references I
Marc Arnaudon and Frank Nielsen.
On approximating the Riemannian 1-center.
Comput. Geom., 46(1):93–104, 2013.
Rajendra Bhatia.
The Riemannian mean of positive matrices.
In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 35–51, 2012.
Silvere Bonnabel and Rodolphe Sepulchre.
Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank.
SIAM J. Matrix Analysis Applications, 31(3):1055–1070, 2009.
Inderjit S. Dhillon and Joel A. Tropp.
Matrix nearness problems with Bregman divergences.
SIAM J. Matrix Anal. Appl., 29(4):1120–1146, November 2007.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari.
Composite objective mirror descent.
In Adam Tauman Kalai and Mehryar Mohri, editors, COLT, pages 14–26. Omnipress, 2010.
C. Harris and M. Stephens.
A Combined Corner and Edge Detection.
In Proceedings of The Fourth Alvey Vision Conference, pages 147–151, 1988.
Bibliographic references II
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
Meizhu Liu, Baba C. Vemuri, Shun ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407–2419, 2012.
Maher Moakher.
A differential geometric approach to the geometric mean of symmetric positive-definite matrices.
SIAM Journal on Matrix Analysis and Applications, 26(3):735–747, 2005.
Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri.
Jensen divergence based SPD matrix means and applications.
In International Conference on Pattern Recognition (ICPR), 2012.
Frank Nielsen and Richard Nock.
Quantum Voronoi diagrams and Holevo channel capacity for 1-qubit quantum states.
In IEEE International Symposium on Information Theory (ISIT), pages 96–100, 2008.
Bibliographic references III
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
IEEE Trans. Inf. Theor., 55(6):2882–2904, June 2009.
Frank Nielsen and Richard Nock.
Entropies and cross-entropies of exponential families.
In International Conference on Image Processing (ICIP), pages 3621–3624, 2010.
R. Nock, B. Magdalou, E. Briys, and F. Nielsen.
On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the fool, the wise
and the adaptive.
In Thorsten Joachims, editor, International Conference on Machine Learning (ICML). Omnipress, 2011.
Richard Nock, Brice Magdalou, Eric Briys, and Frank Nielsen.
Mining matrix data with Bregman matrix divergences for portfolio selection.
In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 373–402, 2012.
Masanori Ohya and Dénes Petz.
Quantum Entropy and Its Use.
1st ed. 1993. Corr 2nd printing, 2004.
Koji Tsuda, Gunnar Rätsch, and Manfred K. Warmuth.
Matrix exponentiated gradient updates for on-line learning and Bregman projection.
J. Mach. Learn. Res., 6:995–1018, December 2005.
Bibliographic references IV

Hisaharu Umegaki.
Conditional expectation in an operator algebra. IV. Entropy and information.
Kodai Math. Sem. Rep., 14(2):59, 1962.
Baba Vemuri, Meizhu Liu, Shun ichi Amari, and Frank Nielsen.
Total Bregman divergence and its applications to DTI analysis.
IEEE Transactions on Medical Imaging, 2011.
Zhizhou Wang and Baba C. Vemuri.
An affine invariant tensor dissimilarity measure and its applications to tensor-valued image segmentation.
In CVPR (1), pages 228–233, 2004.

