Social-sparsity brain decoders: faster spatial sparsity

G. Varoquaux, M. Kowalski, B. Thirion

Brain decoding with linear models
Design matrix × Coefficients = Target
Coefficients are brain maps
Minimize the error: $l(y - Xw)$
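A minimal sketch of the linear decoding setup, assuming a squared loss for $l$ and a hypothetical toy problem (random design matrix, 4x4 "brain" images, noise-free target). It only illustrates that the fitted coefficients, reshaped to the image grid, form the decoder map; the names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 50 scans, each a 4x4 image (16 voxels).
map_shape = (4, 4)
n_samples, n_voxels = 50, 16

w_true = np.zeros(map_shape)
w_true[1:3, 1:3] = 1.0                          # a small predictive region
X = rng.standard_normal((n_samples, n_voxels))  # design matrix
y = X @ w_true.ravel()                          # noise-free target

# Assumed squared loss l(y - Xw) = ||y - Xw||^2, minimized by least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# One coefficient per voxel: reshaping gives the decoder map.
w_map = w_hat.reshape(map_shape)
```

With more scans than voxels and no noise, the map is recovered exactly; real decoding problems have far more voxels than scans, which is why regularization is needed.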

Brain decoder maps and prediction accuracy
Face vs house visual recognition [Haxby... 2001]
SVM: error 26%
Ridge: error 15%
Sparse model: error 19%
Which decoder predicts best?
How to get good decoder maps?

Sparse models
Ill-posed inverse problem
$\min_w \; l(y - Xw) + \lambda \|w\|_1$
A priori: a small fraction of the voxels are predictive
Sparse models to select relevant regions?
[Yamashita... 2008, Carroll... 2009]

Sparse models
Ill-posed inverse problem ⇒ regularization
$\min_w \; l(y - Xw) + \lambda \|w\|_1$
Sparsity-inducing norm: $\ell_1$ norm, or elastic net
Elastic Net: can only select a subset of relevant voxels [Varoquaux... 2012]

Spatial sparse penalties
Spatial regularization, total variation
Penalize the image gradient:
$\min_w \; l(y - Xw) + \lambda \sum_i \|(\nabla w)_i\|_2$
$\ell_{21}$ norm: $\ell_1$ norm of the gradient magnitude
Shrinks jointly $\nabla_x$, $\nabla_y$, and $\nabla_z$
[Figure: decoder maps for Elastic Net and TV + $\ell_1$]
[Gramfort... 2013]
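To make the total-variation penalty concrete, here is a small sketch that evaluates $\sum_i \|(\nabla w)_i\|_2$ on an image of any dimension. The forward-difference discretization with a zero gradient at the far border is an assumption of this example, not something stated on the slide; `image_gradient` and `tv_penalty` are illustrative names.

```python
import numpy as np

def image_gradient(w):
    """Forward finite differences along each axis (zero at the far border)."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros((w.ndim,) + w.shape)
    for axis in range(w.ndim):
        index = [slice(None)] * w.ndim
        index[axis] = slice(0, -1)
        grad[(axis,) + tuple(index)] = np.diff(w, axis=axis)
    return grad  # shape: (ndim, *w.shape)

def tv_penalty(w):
    """l21 norm of the gradient: l2 across directions, l1 across voxels."""
    grad = image_gradient(w)
    magnitude = np.sqrt((grad ** 2).sum(axis=0))
    return magnitude.sum()

flat = np.ones((3, 3, 3))   # constant image: no gradient anywhere
point = np.zeros((3, 3))
point[1, 1] = 1.0           # single active voxel
```

A constant image has zero penalty, while isolated voxels are costly, which is why the penalty favors piecewise-constant maps.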

Spatial sparse penalties
$\min_w \; l(y - Xw) + \lambda \sum_i \|(\nabla w)_i\|_2$
More generally: analysis sparsity [Eickenberg... 2015]
Sparse in a transformation of the weights:
$\min_w \; l(y - Xw) + \lambda \|K w\|_{21}$
For instance, overlapping blocks:
$(Kw)_1 \leftrightarrow G_1$, $(Kw)_2 \leftrightarrow G_2$, ...
[Figure: overlapping groups G1 and G2]
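As an illustration of the overlapping-blocks case, the sketch below represents $K$ implicitly by a list of index sets $G_1, G_2, \ldots$ and evaluates $\|Kw\|_{21}$ as the sum of the per-block $\ell_2$ norms. It assumes a 1D weight vector and equally spaced blocks; the helper names are hypothetical.

```python
import numpy as np

def overlapping_blocks(n, size, step):
    """Index sets G_1, G_2, ... over range(n); they overlap when step < size."""
    return [np.arange(start, start + size)
            for start in range(0, n - size + 1, step)]

def l21_norm(w, groups):
    """||Kw||_21 where (Kw)_g is w restricted to group g."""
    return sum(np.linalg.norm(w[g]) for g in groups)

groups = overlapping_blocks(n=6, size=4, step=2)   # G1 = 0..3, G2 = 2..5

w_left = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 0.0])     # only inside G1
w_shared = np.array([0.0, 0.0, 3.0, 4.0, 0.0, 0.0])   # inside G1 and G2
```

Weights in the overlap are counted once per block that contains them, so `w_shared` is penalized twice as much as `w_left`. This coupling between blocks is what makes the corresponding proximal operator hard to compute.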

Good convergence of solvers is important
$\min_w \; l(y - Xw) + \lambda \sum_i \|(\nabla w)_i\|_2$
[Figure: decoder maps (x=17, z=-17, L/R) with stopping criterion $\Delta E < 10^{-1}$ and with $\Delta E < 10^{-5}$]
[Dohmatob... 2014]

Sparse solvers
Iterative Shrinkage-Thresholding Algorithm
$\min_w \; l(y - Xw) + \lambda \|Kw\|_1$
Settings: $\min \, l + p$; $l$ smooth, $p$ non-smooth
Minimize successively: (quadratic approx of $l$) + $p$
FISTA loop:
1. Gradient descent on smooth term
2. Proximal operator
$\mathrm{prox}_p(x) = \arg\min_y \; \tfrac{1}{2} \|x - y\|_2^2 + p(y)$
$\ell_1$ penalty, $\min_w \; l(y - Xw) + \lambda \|w\|_1$: “soft thresholding”:
$\mathrm{prox}_{\lambda \ell_1}: \; \forall i \;\; w_i \leftarrow w_i \left(1 - \frac{\lambda}{|w_i|}\right)_+$
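The two-step loop can be sketched as follows for the $\ell_1$ case. This is plain ISTA under an assumed squared loss $\tfrac{1}{2}\|y - Xw\|_2^2$ with a fixed step $1/L$; FISTA adds a momentum term that is omitted here for brevity. Function names are illustrative.

```python
import numpy as np

def soft_threshold(w, thresh):
    """prox of thresh * ||.||_1, i.e. w_i * (1 - thresh / |w_i|)_+."""
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

def ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 (assumed squared loss)."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)       # 1. gradient of the smooth term
        w = soft_threshold(w - grad / L, lam / L)   # 2. proximal operator
    return w

# With an identity design, the solution is soft thresholding of y itself.
y = np.array([3.0, -0.5, -2.0])
w_hat = ista(np.eye(3), y, lam=1.0)
```

The proximal step is cheap here because the $\ell_1$ norm is separable: each coefficient is thresholded independently of the others.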

Sparse solvers: proximals and co.
$\mathrm{prox}_{\lambda \ell_1}: \; \forall i \;\; w_i \leftarrow w_i \left(1 - \frac{\lambda}{|w_i|}\right)_+$
Group sparsity:
$\mathrm{prox}_{\lambda \ell_{21}}$ on $G$: $\forall i \in G \;\; w_i \leftarrow w_i \left(1 - \frac{\lambda}{\sqrt{\sum_{j \in G} w_j^2}}\right)_+$
[Figure: groups G1 and G2]
Overlapping groups, TV:
Inner loop iterative solver
[Figure: overlapping groups G1 and G2]
Social sparsity shrinkage:
$\forall i \;\; w_i \leftarrow w_i \left(1 - \frac{\lambda}{\sqrt{\sum_{j \in N(i)} w_j^2}}\right)_+$
[Figure: neighborhoods N1 and N2 around voxels x1 and x2]
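The group and social shrinkage formulas can be compared in a short sketch. It assumes non-overlapping groups given as a label array, and cubic neighborhoods $N(i)$ of a given `radius` with zero padding at the borders (each axis must be at least `2 * radius + 1` long). These choices and the function names are assumptions of the example, not taken from the slides.

```python
import numpy as np

def group_shrink(w, labels, lam):
    """Group soft thresholding for non-overlapping groups (exact prox)."""
    out = np.zeros_like(w, dtype=float)
    for g in np.unique(labels):
        mask = labels == g
        norm = np.linalg.norm(w[mask])
        if norm > lam:
            out[mask] = w[mask] * (1.0 - lam / norm)
    return out

def neighborhood_energy(w, radius=1):
    """Sum of w_j^2 over a cube of side 2*radius+1 centered on each voxel."""
    energy = np.asarray(w, dtype=float) ** 2
    kernel = np.ones(2 * radius + 1)
    for axis in range(energy.ndim):
        energy = np.apply_along_axis(np.convolve, axis, energy, kernel,
                                     mode="same")
    return energy

def social_shrink(w, lam, radius=1):
    """Scale each w_i by (1 - lam / ||w_N(i)||_2)_+, one pass, no inner loop."""
    norm = np.sqrt(np.maximum(neighborhood_energy(w, radius), 1e-24))
    return w * np.maximum(1.0 - lam / norm, 0.0)

w = np.array([0.5, 0.0, 0.0, 0.0, 3.0, 0.5, 0.0])
shrunk = social_shrink(w, lam=1.0)
```

In this example the isolated coefficient `w[0]` is set to zero, while the equally small `w[5]` survives because it sits next to a strong neighbor.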

Social sparsity: “soft-threshold” neighboring voxels
Sparsity must be combined with spatial structure
Convex solvers for non-local sparsity are expensive: the penalty is not separable
Social sparsity: forget the coupling between soft-thresholding operations [Kowalski... 2013]
[Figure: neighborhoods N1 and N2 around voxels x1 and x2]
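One way to read "forget the coupling" is to run the usual gradient-then-shrink iterations with the neighborhood shrinkage in place of an exact proximal operator. The sketch below does this on a hypothetical 1D problem with an assumed squared loss; it is a heuristic illustration and makes no claim about convergence or about the authors' actual implementation.

```python
import numpy as np

def social_shrink_1d(w, thresh, radius=1):
    """Scale each w_i by (1 - thresh / ||w_N(i)||_2)_+ on a sliding window."""
    energy = np.convolve(w ** 2, np.ones(2 * radius + 1), mode="same")
    norm = np.sqrt(np.maximum(energy, 1e-24))
    return w * np.maximum(1.0 - thresh / norm, 0.0)

def ista_social(X, y, lam, n_iter=200, radius=1):
    """ISTA-style loop on 0.5 * ||y - Xw||^2 with social shrinkage
    substituted for the proximal operator (heuristic, assumed loss)."""
    L = np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = w - X.T @ (X @ w - y) / L             # gradient step on the loss
        w = social_shrink_1d(w, lam / L, radius)  # neighborhood shrinkage
    return w

rng = np.random.default_rng(0)
w_true = np.zeros(40)
w_true[10:15] = 1.0                    # one contiguous predictive region
X = rng.standard_normal((30, 40))      # fewer samples than features
y = X @ w_true
w_hat = ista_social(X, y, lam=2.0)
```

Each iteration costs one pass over the voxels, with no inner iterative solver, which is where the speed-up over overlapping-group or TV penalties would come from.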

Empirical evaluation for decoding
[Figure: prediction accuracy and run time for TV-l1, graph-net, social sparsity, and SVM + anova. Tasks: bottle/scramble, bottle/shoe, cat/bottle, cat/chair, cat/face, cat/house, cat/scramble, cat/shoe, chair/scramble, chair/shoe, face/house, face/scissors, scissors/scramble, shoe/scramble; OASIS VBM male vs female]

Social sparsity maps
[Figure: face vs house decoder maps (y=34, z=16, L/R) for TV-$\ell_1$, Graph-net, and Social sparsity]

@GaelVaroquaux
Social-sparsity brain decoders: faster spatial sparsity
Spatial sparsity improves prediction and denoises maps
TV-$\ell_1$ “space-net” very successful, but slow
Social-sparsity: heuristic that forgoes couplings
10× faster than TV-$\ell_1$, almost as accurate
3× faster than graph-net, more accurate
Maps segment regions well

References I
M. K. Carroll, G. A. Cecchi, I. Rish, R. Garg, and A. R. Rao.
Prediction and interpretation of distributed neural activity with
sparse models. NeuroImage, 44(1):112 – 122, 2009.
E. Dohmatob, A. Gramfort, B. Thirion, and G. Varoquaux.
Benchmarking solvers for TV-l1 least-squares and logistic
regression in brain imaging. PRNI, 2014.
M. Eickenberg, E. Dohmatob, B. Thirion, and G. Varoquaux.
Total variation meets sparsity: statistical learning with
segmenting penalties. MICCAI, 2015.
A. Gramfort, B. Thirion, and G. Varoquaux. Identifying predictive
regions from fMRI with TV-L1 prior. In PRNI, pages 17–20,
2013.
J. Haxby, I. Gobbini, M. Furey, ... Distributed and overlapping
representations of faces and objects in ventral temporal cortex.
Science, 293:2425, 2001.

References II
M. Kowalski, K. Siedenburg, and M. Dörfler. Social sparsity!
Neighborhood systems enrich structured shrinkage operators.
IEEE Transactions on Signal Processing, 61:2498, 2013.
G. Varoquaux, A. Gramfort, and B. Thirion. Small-sample brain
mapping: sparse recovery on spatially correlated designs with
randomization and clustering. In ICML, page 1375, 2012.
O. Yamashita, M.-a. Sato, T. Yoshioka, F. Tong, and
Y. Kamitani. Sparse estimation automatically selects voxels
relevant for the decoding of fMRI activity patterns. NeuroImage,
42(4):1414 – 1429, 2008.
