Phase transition for statistical estimation:
algorithms and fundamental limits
Marc Lelarge
INRIA-ENS
APS - INFORMS 2023
A bit of history: 70’s
A bit of history: 2010’s
Applications to high-dimensional statistics
Approximate Message Passing
This tutorial: demystifying statistical physics!
- A simple version of the AMP algorithm
- Gap between information-theoretically optimal and computationally feasible estimators
- Running example: the matrix model
  ▶ connection to random matrix theory
  ▶ sparse PCA, community detection, Z₂ synchronization, submatrix localization, hidden clique...
AMP and its state evolution
Given a matrix W ∈ R^{n×n} and scalar functions f_t : R → R, let x^0 ∈ R^n and
x^{t+1} = W f_t(x^t) − b_t f_{t−1}(x^{t−1}) ∈ R^n,
where
b_t = (1/n) ∑_{i=1}^n f_t'(x^t_i) ∈ R.
If W ∼ GOE(n), the f_t are Lipschitz and the components of x^0 are i.i.d. ∼ X₀ with E[X₀²] = 1, then for any nice test function Ψ : R^t → R,
(1/n) ∑_{i=1}^n Ψ(x^1_i, ..., x^t_i) → E[Ψ(Z₁, ..., Z_t)],
where (Z₁, ..., Z_t) =_d (σ₁G₁, ..., σ_tG_t), with G_s ∼ N(0, 1) i.i.d.
(Bayati, Montanari ’11)
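A minimal NumPy sketch of this iteration (illustrative code, not from the tutorial: it fixes a single Lipschitz denoiser f_t = tanh for all t, and assumes the σ_t follow the standard state-evolution recursion σ²_{t+1} = E[f_t(σ_t G)²], consistent with the sanity check below):

```python
import numpy as np

rng = np.random.default_rng(0)

def goe(n, rng):
    """Symmetric W with off-diagonal entries of variance 1/n (GOE scaling)."""
    A = rng.normal(0.0, 1.0, (n, n))
    return (A + A.T) / np.sqrt(2 * n)

def amp(W, x0, f, f_prime, T):
    """AMP: x^{t+1} = W f(x^t) - b_t f(x^{t-1}), with b_t = mean of f'(x^t)."""
    iterates = [x0, W @ f(x0)]            # the t = 0 step has no Onsager term
    for _ in range(1, T):
        x_prev, x = iterates[-2], iterates[-1]
        b = f_prime(x).mean()             # Onsager coefficient b_t
        iterates.append(W @ f(x) - b * f(x_prev))
    return iterates

n, T = 4000, 6
W = goe(n, rng)
x0 = rng.standard_normal(n)               # X0 ~ N(0,1), so E[X0^2] = 1
f, f_prime = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2
iterates = amp(W, x0, f, f_prime, T)

# State evolution: sigma_{t+1}^2 = E[f(sigma_t G)^2], evaluated by Monte Carlo.
G = rng.standard_normal(200_000)
sigma2 = np.mean(f(x0) ** 2)
for t, x in enumerate(iterates[1:], start=1):
    print(t, round(x.var(), 3), round(sigma2, 3))   # empirical vs predicted variance
    sigma2 = np.mean(f(np.sqrt(sigma2) * G) ** 2)
```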
Sanity check
We have x^1 = W f₀(x^0), so that
x^1_i = ∑_j W_{ij} f₀(x^0_j),
where W_{ij} ∼ N(0, 1/n) i.i.d. (ignoring diagonal terms). Hence x^1 is a centred Gaussian vector whose entries have variance
(1/n) ∑_j f₀(x^0_j)² ≈ E[f₀(X₀)²] = σ₁².
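This is quick to confirm numerically (a self-contained sketch; the choice f₀ = tanh is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
A = rng.normal(0.0, 1.0, (n, n))
W = (A + A.T) / np.sqrt(2 * n)        # GOE(n): off-diagonal variance 1/n

x0 = rng.standard_normal(n)           # X0 ~ N(0,1)
f0 = np.tanh
x1 = W @ f0(x0)

print(x1.var())                       # empirical variance of the entries of x^1
print(np.mean(f0(x0) ** 2))           # ≈ E[f0(X0)^2] = sigma_1^2
```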
AMP proof of Wigner’s semicircle law
Consider AMP with linear functions f_t(x) = x, so that
x^1 = W x^0
x^2 = W x^1 − x^0 = (W² − Id) x^0
x^3 = W x^2 − x^1 = (W³ − 2W) x^0,
so x^t = P_t(W) x^0 with
P₀(x) = 1, P₁(x) = x,
P_{t+1}(x) = x P_t(x) − P_{t−1}(x).
{P_t} are Chebyshev polynomials, orthonormal w.r.t. the semicircle density µ_SC(x) = (1/2π) √((4 − x²)₊).
When (1/n) ‖x^0‖² = 1, we have (1/n) ⟨x^s, x^t⟩ ≈ (1/n) tr P_s(W) P_t(W).
AMP proof of Wigner’s semicircle law
x^{t+1} = W x^t − x^{t−1}
In this case, AMP state evolution gives
(1/n) ⟨x^s, x^t⟩ → E[Z_s Z_t] = 1(s = t).
Since (1/n) ⟨x^s, x^t⟩ ≈ (1/n) tr P_s(W) P_t(W), the polynomials P_t are orthonormal w.r.t. the limiting empirical spectral distribution of W, which must therefore be µ_SC.
Credit: Zhou Fan.
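A numerical check of this orthonormality (an illustrative sketch, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 4000, 5
A = rng.normal(0.0, 1.0, (n, n))
W = (A + A.T) / np.sqrt(2 * n)

x = rng.standard_normal(n)            # (1/n) ||x^0||^2 ≈ 1
iterates = [x, W @ x]
for _ in range(2, T + 1):
    iterates.append(W @ iterates[-1] - iterates[-2])   # x^{t+1} = W x^t - x^{t-1}

gram = np.array([[u @ v / n for v in iterates] for u in iterates])
print(np.round(gram, 2))              # ≈ identity: (1/n)<x^s, x^t> ≈ 1(s = t)
```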
Wigner’s semicircle law: experiments
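These slides show figures only; as a plausible reconstruction (my assumption: the experiment compares the empirical spectral distribution of a GOE matrix with µ_SC), a sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 2000
A = rng.normal(0.0, 1.0, (n, n))
W = (A + A.T) / np.sqrt(2 * n)

eigenvalues = np.linalg.eigvalsh(W)
grid = np.linspace(-2.2, 2.2, 400)
semicircle = np.sqrt(np.clip(4 - grid ** 2, 0, None)) / (2 * np.pi)

plt.hist(eigenvalues, bins=60, density=True, alpha=0.5, label="eigenvalues of W")
plt.plot(grid, semicircle, label="semicircle density")
plt.legend()
plt.show()
```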
Explaining the Onsager term
x^{t+1} = W x^t − x^{t−1}
The first iteration with an Onsager term appears for t = 2.
Then we have x^2 = W x^1 − x^0 = W² x^0 − x^0, so that
x^2_1 = ∑_i W²_{1i} x^0_1 + ∑_{i, j≠1} W_{1i} W_{ij} x^0_j − x^0_1,
where the double sum is approximately N(0, 1). Since ∑_i W²_{1i} ≈ 1, the first term is ≈ x^0_1 and is cancelled by the correction −x^0_1, leaving only the Gaussian part.
The Onsager term is very similar to the Itô correction in stochastic calculus.
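This cancellation is easy to see numerically (a sketch; the ≈ 0 and ≈ 1 values in the comments are the n → ∞ limits):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000
A = rng.normal(0.0, 1.0, (n, n))
W = (A + A.T) / np.sqrt(2 * n)

x0 = rng.standard_normal(n)
x1 = W @ x0
x2_amp = W @ x1 - x0          # with the Onsager correction
x2_naive = W @ x1             # plain power iteration, no correction

print(x0 @ x2_amp / n)        # ≈ 0: the feedback along x^0 is removed
print(x0 @ x2_naive / n)      # ≈ 1: sum_i W_{1i}^2 ≈ 1 feeds x^0 back in
```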
This tutorial: demystifying statistical physics!
- A simple version of the AMP algorithm
- Gap between information-theoretically optimal and computationally feasible estimators
- Running example: the matrix model
  ▶ connection to random matrix theory
  ▶ sparse PCA, community detection, Z₂ synchronization, submatrix localization, hidden clique...
Low-rank matrix estimation
“Spiked Wigner” model
Y = √(λ/n) X Xᵀ + Z
(observations = signal + noise)
▶ X: vector of dimension n with entries X_i i.i.d. ∼ P₀, E[X₁] = 0, E[X₁²] = 1.
▶ Z_{i,j} = Z_{j,i} i.i.d. ∼ N(0, 1).
▶ λ: signal-to-noise ratio.
▶ λ and P₀ are known by the statistician.
Goal: recover the low-rank matrix X Xᵀ from Y.
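A sampling sketch of this model (illustrative; the binary prior used in the figures below is one admissible choice of P₀):

```python
import numpy as np

def spiked_wigner(n, lam, rng):
    """Sample Y = sqrt(lam/n) X X^T + Z with binary prior P0 = Uniform{-1,+1}."""
    X = rng.choice([-1.0, 1.0], size=n)       # E[X_i] = 0, E[X_i^2] = 1
    Z = rng.normal(0.0, 1.0, (n, n))
    Z = (Z + Z.T) / np.sqrt(2)                # symmetric noise, off-diagonal N(0,1)
    return np.sqrt(lam / n) * np.outer(X, X) + Z, X

rng = np.random.default_rng(5)
Y, X = spiked_wigner(1000, 2.0, rng)
```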
Principal component analysis (PCA)
Spectral estimator:
Estimate X using the eigenvector x̂_n associated with the largest eigenvalue µ_n of Y/√n.
B.B.P. phase transition
▶ if λ ≤ 1:  µ_n → 2 and X · x̂_n → 0, almost surely as n → ∞;
▶ if λ > 1:  µ_n → √λ + 1/√λ > 2 and |X · x̂_n| → √(1 − 1/λ) > 0, almost surely as n → ∞.
(Baik, Ben Arous, Péché ’05)
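A numerical check of the transition (a sketch; the overlap is reported as (X · x̂_n)²/n so that it is normalized to [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3000
for lam in [0.5, 4.0]:
    X = rng.choice([-1.0, 1.0], size=n)
    Z = rng.normal(0.0, 1.0, (n, n))
    Z = (Z + Z.T) / np.sqrt(2)
    Y = np.sqrt(lam / n) * np.outer(X, X) + Z

    eigvals, eigvecs = np.linalg.eigh(Y / np.sqrt(n))
    mu_n = eigvals[-1]                        # largest eigenvalue
    overlap2 = (X @ eigvecs[:, -1]) ** 2 / n  # squared normalized overlap
    # Expect (2, ~0) for lam <= 1, and (sqrt(lam) + 1/sqrt(lam), 1 - 1/lam) above.
    print(lam, round(mu_n, 3), round(overlap2, 3))
```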
Questions
▶ PCA fails when λ ≤ 1, but is it still possible to recover the signal?
▶ When λ > 1, is PCA optimal?
▶ More generally, what is the best achievable estimation performance in both regimes?
Plot of MMSE
Figure: Spiked Wigner model, centred binary prior (unit variance).
We can certainly improve on the spectral algorithm!
A scalar denoising problem
For Y = √γ X₀ + Z, where X₀ ∼ P₀ and Z ∼ N(0, 1).
Bayes optimal AMP
We define mmse(γ) = E[(X₀ − E[X₀ | √γ X₀ + Z])²] and the recursion:
q₀ = 1 − λ⁻¹,
q_{t+1} = 1 − mmse(λ q_t).
With the optimal denoiser g_{P₀}(y, γ) = E[X₀ | √γ X₀ + Z = y], AMP is defined by:
x^{t+1} = √(λ/n) Y f_t(x^t) − λ b_t f_{t−1}(x^{t−1}),
where f_t(y) = g_{P₀}(y/√(λq_t), λq_t).
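A runnable sketch for the binary prior (my assumptions: for P₀ = Uniform{−1, +1} the denoiser f_t above reduces to f_t(y) = tanh(y), and for simplicity the run is initialized from artificial side information at overlap q₀ = 1 − 1/λ rather than from the spectral estimator):

```python
import numpy as np

rng = np.random.default_rng(7)
n, lam, T = 4000, 2.0, 12

X = rng.choice([-1.0, 1.0], size=n)
Z = rng.normal(0.0, 1.0, (n, n))
Z = (Z + Z.T) / np.sqrt(2)
Y = np.sqrt(lam / n) * np.outer(X, X) + Z
M = np.sqrt(lam / n) * Y

# Artificial side-information initialization at overlap q0 = 1 - 1/lam.
q0 = 1.0 - 1.0 / lam
x = lam * q0 * X + np.sqrt(lam * q0) * rng.standard_normal(n)

f_prev = np.zeros(n)                  # no f_{-1} term at the first step
for t in range(T):
    f = np.tanh(x)                    # optimal denoiser for the binary prior
    b = np.mean(1.0 - f ** 2)         # b_t = (1/n) sum_i f_t'(x^t_i)
    x, f_prev = M @ f - lam * b * f_prev, f
    print(t, round((f_prev @ X) / n, 3))    # empirical overlap with the signal

# State-evolution prediction: q_{t+1} = 1 - mmse(lam * q_t), by Monte Carlo.
G = rng.standard_normal(200_000)
def mmse(gamma):                      # binary prior: 1 - E[tanh(gamma + sqrt(gamma) G)]
    return 1.0 - np.mean(np.tanh(gamma + np.sqrt(gamma) * G))
q = q0
for _ in range(T):
    q = 1.0 - mmse(lam * q)
print("state-evolution fixed point:", round(q, 3))
```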
Bayes optimal AMP: experiment
Plot of MMSE
Figure: Spiked Wigner model, centred binary prior (unit variance).
Limiting formula for the MMSE
Theorem (L., Miolane ’19)
MMSE_n → 1 − q*(λ)²  as n → ∞,
where the leading 1 is the dummy MSE and q*(λ) is the minimizer of
q ≥ 0 ↦ −E_{X₀∼P₀, Z₀∼N} log ∫ dP₀(x₀) exp( √(λq) Z₀x₀ + λq X₀x₀ − (λq/2) x₀² ) + (λ/4) q².
A simplified “free energy landscape”:
Figure: −F(λ, q) as a function of q, in three regimes: (a) “Easy” phase (λ = 1.01); (b) “Hard” phase (λ = 0.625); (c) “Impossible” phase (λ = 0.5).
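The variational objective is easy to evaluate for the binary prior (a Monte Carlo sketch under my assumptions: for P₀ = Uniform{−1, +1} the inner integral equals e^{−λq/2} cosh(√(λq) Z₀ + λq X₀), and by symmetry one may fix X₀ = 1):

```python
import numpy as np

rng = np.random.default_rng(8)
Z0 = rng.standard_normal(100_000)       # Z0 ~ N(0,1); fix X0 = +1 by symmetry

def objective(lam, q):
    """q >= 0 -> -E log(inner integral) + (lam/4) q^2, binary-prior special case."""
    a = lam * q
    u = np.sqrt(a) * Z0 + a
    log_cosh = np.logaddexp(u, -u) - np.log(2.0)   # numerically stable log cosh(u)
    return a / 2.0 - np.mean(log_cosh) + lam * q ** 2 / 4.0

for lam in [1.01, 0.625, 0.5]:
    qs = np.linspace(0.0, 1.0, 201)
    vals = [objective(lam, q) for q in qs]
    q_star = qs[int(np.argmin(vals))]
    print(lam, round(q_star, 3), "limiting MMSE ≈", round(1 - q_star ** 2, 3))
```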
Phase diagram
Figure: Spiked Wigner model, centred binary prior (unit variance).
Proof ideas: a planted spin system
P(X = x | Y) = (1/Z_n) P₀(x) e^{H_n(x)},
where
H_n(x) = ∑_{ij} ( √(λ/n) Y_{i,j} x_i x_j − (λ/2n) x_i² x_j² ).
Two-step proof:
▶ Lower bound: Guerra’s interpolation technique, adapted in (Korada, Macris ’09), (Krzakala, Xu, Zdeborová ’16):
Y = √t √(λ/n) X Xᵀ + Z,
Y′ = √(1 − t) √λ X + Z′.
▶ Upper bound: cavity computations (Mézard, Parisi, Virasoro ’87); Aizenman–Sims–Starr scheme (Aizenman, Sims, Starr ’03), (Talagrand ’10).
Conclusion
AMP is an iterative denoising algorithm which is optimal when the energy landscape is simple.
Main references for this tutorial: (Montanari, Venkataramanan ’21), (L., Miolane ’19).
Many recent research directions: universality, structured matrices, composition... and new applications outside electrical engineering, such as in ecology.
Deep learning, the new kid on the block:
Thank you for your attention!