Sparse Inverse Covariance Estimation using skggm

Manjari Narayan, Postdoctoral Scholar, Stanford University (School of Medicine)
(PI: Amit Etkin, M.D., Ph.D.)

Tutorial presented at the Junior Scientist Workshop at HHMI, Janelia Farms.
skggm: collaboration with Dr. Jason Laska, ML R&D at Clara Labs.
Explosion of Functional Imaging Tools

fMRI, fNIRS, EEG, MEG
Intracranial EEG, micro-ECoG
Molecular fMRI
Calcium imaging
Light sheet microscopy
Voltage-sensitive dye imaging
Light field microscopy

Credits: Marie Suver, Ph.D., and Ainul Huda, University of Washington, and Michael H. Dickinson, Ph.D., California Institute of Technology (http://newsroom.cumc.columbia.edu/blog/2014/11/11/researchers-receive-nih-brain-initiative-funding/); Misha Ahrens, Ph.D., Janelia Farms (https://www.simonsfoundation.org/features/foundation-news/how-do-different-brain-regions-interact-to-enhance-function/); Tang, 2015, Scientific Reports; Raju Tomer, Ph.D., & Deisseroth Lab, Stanford University (http://techfinder.stanford.edu/technology_detail.php?ID=36402).
Application: Functional Connectomics

Network as a unit of interest: unobserved stochastic dependence/interaction between neurons, circuits, regions, … (Ahrens et al., Nature, 2012). A shared goal across modalities & resolutions, from macroscale to mesoscale.
(Figure: data matrices of T or n observations by p variables.)
Probabilistic Graphical Models

Many probabilistic models are available, both directed and undirected.

Graph $G = (V, E)$, vertices $V = \{1, \dots, p\}$, edges $E \subset V \times V$
$X = (X_1, \dots, X_p) \sim P_X$
A probabilistic graphical model relates $P_X$ to $G$:
$(j, k) \notin E \iff$ independence or conditional independence between $X_j$ and $X_k$

Examples:
Directed acyclic graphs (DAGs/Bayes nets)
State-space models, including linear/nonlinear VAR
Undirected graphical models or Markov networks
Bivariate associations (correlation, Granger causality, transfer entropy)
Models for Connectivity: Conditional Dependence & Markov Networks

More informative than correlations: a measure of “direct” interactions that eliminates “indirect” interactions due to observed common causes.

Benefits:
Studying cognitive mechanisms
Designing interventional targets
Science-wide efficient use of data

Conditional dependence (“partial correlations”) vs. marginal dependence (“marginal correlations”).
Introduction to Markov Networks
Markov Properties
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \dots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \dots, x_T \overset{\mathrm{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$

Pairwise Markov Property (P): two variables are conditionally independent given all other nodes, e.g.
$X_5 \perp X_1 \mid X_{V \setminus \{1, 5\}}$
(Figure: example graph on nodes 1–5.)
Lauritzen (1996)
Markov Properties (same Gaussian setup as above)

Local Markov Property (L): a variable is conditionally independent of all others given its neighbors, e.g.
$X_5 \perp X_{V \setminus \operatorname{ne}(5)} \mid X_{\operatorname{ne}(5)}$, where $\operatorname{ne}(5)$ denotes the neighbors of node 5.
Lauritzen (1996)
Markov Properties (same Gaussian setup as above)

Global Markov Property (G): given three disjoint sets A, B, and C such that all paths between A and B go through C, A is conditionally independent of B given C:
$X_A \perp X_B \mid X_C$, where $X_A = \{X_a\}_{a \in A}$
(Figure: the 5-node graph partitioned into sets A, B, C.)
Lauritzen (1996)
Benefits of Global Markov Properties

Intersection property: holds for positive densities, e.g. Gaussian, and has been extended to some non-positive densities!
If $A \perp B \mid (C, D)$ and $A \perp C \mid (B, D)$, then $A \perp (B \cup C) \mid D$.

Factorizes the probability distribution:
$P(X) = P(X_A \mid X_C)\, P(X_B \mid X_C)\, P(X_C)$

Computational tractability & statistical power to identify all conditional independences.
(Figure: the 5-node graph with sets A, B, C.)
Lauritzen (1996)
Generality of Markov Networks
For many types of pairwise associations, there are Markov networks that satisfy the Global Markov Property.

| Pairwise Association | Markov Networks |
| --- | --- |
| Correlation | Zero partial correlation = conditional independence |
| Coherence or coherency | Zero partial coherence = conditional independence |
| Directed information (including transfer entropy, Sims/Granger prediction, …) | Dynamic extensions to standard Markov properties, local independence (Didelez 2008) |
| Pairwise ordering between variables | DAGs, CPDAGs, MAGs, PAGs, … |

This is not an exhaustive list!
Generality of Markov Networks
For many probability distributions, there are Markov networks that satisfy at least the Local, if not the Global, Markov Property.

| Distributional Assumptions | Markov Networks |
| --- | --- |
| Exponential families (binary, Poisson, circular, …) | Exponential MRFs, including binary Ising models and Poisson graphical models (P. Ravikumar, G.I. Allen, and others) |
| Nonparametric distributions | Nonparanormal (copula) graphical models, kernel graphical models (H. Liu, E. Xing, B. Scholkopf, and others) |
| Separable covariance structure (spatio-temporal) | Separable Markov networks (G.I. Allen, S. Zhou, A. Hero, P. Hoff, and many others) |
From now on: Gaussian Graphical Model
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \dots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \dots, x_T \overset{\mathrm{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$

Zero in the inverse covariance = conditional independence:
$X_k \perp X_l \mid X_{V \setminus \{k, l\}} \iff (\Sigma^{-1})_{kl} = 0 \iff (k, l) \notin E$
(Figure: 5-node graph alongside the sparsity pattern of its 5 × 5 inverse covariance.)
Lauritzen (1996)
This zero/edge equivalence is the key property; it is important for nonparametric distributions + exponential-family models as well.
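To make the zero/edge correspondence concrete, here is a minimal numpy sketch (the chain precision matrix and its values are illustrative choices of mine, not from the slides): the covariance $\Sigma$ is dense, yet inverting it recovers exactly the chain's sparsity pattern.

```python
import numpy as np

p = 5
# Tridiagonal precision matrix for the chain graph 1-2-3-4-5 (illustrative values)
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta)

# The covariance is dense: even non-adjacent nodes are marginally correlated...
print(np.round(Sigma, 2))
# ...but the zeros of the inverse covariance match the missing edges exactly.
print(np.abs(np.linalg.inv(Sigma)) > 1e-8)
```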
Estimation in High Dimensions
Gaussian Log-Likelihood: Likelihood for Inverse Covariance
The input to the log-likelihood is effectively the sample covariance:
$\hat{\Sigma} = \frac{1}{T} X^{\top} X$, where the data matrix $X \in \mathbb{R}^{T \times p}$ is centered
$L(\hat{\Sigma}; \Theta) \equiv \log \det \Theta - \langle \hat{\Sigma}, \Theta \rangle$
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006)
Gaussian Log-Likelihood: Variance-Correlation Decomposition
Put all variables on the same scale:
$R(\hat{\Sigma}) = D^{-\frac{1}{2}} \hat{\Sigma} D^{-\frac{1}{2}}, \quad D = \operatorname{diag}(\hat{\Sigma})$
$L(\hat{\Sigma}; \Theta) \equiv \log \det \Theta - \langle R(\hat{\Sigma}), \Theta \rangle$
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006)
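A minimal numpy sketch of both steps (variable names are mine): form the sample covariance from a centered data matrix, rescale it to a correlation matrix, and evaluate the log-likelihood at a candidate precision matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 200, 5
X = rng.standard_normal((T, p))
X = X - X.mean(axis=0)                 # center the data matrix

Sigma_hat = (X.T @ X) / T              # sample covariance (1/T) X^T X

d = np.sqrt(np.diag(Sigma_hat))
R = Sigma_hat / np.outer(d, d)         # D^{-1/2} Sigma_hat D^{-1/2}

def gaussian_loglik(S, Theta):
    """log det(Theta) - <S, Theta>, up to constants."""
    _, logdet = np.linalg.slogdet(Theta)
    return logdet - np.sum(S * Theta)

print(gaussian_loglik(R, np.linalg.inv(R)))   # likelihood at the saturated MLE
```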
Degeneracy of Likelihood in High Dimensions
Given $X \in \mathbb{R}^{T \times p}$ with $T \approx p$, the likelihood can be nearly flat: high curvature (easy) vs. low curvature (hard).
Credit: Negahban, Ravikumar, Wainwright & Yu, Statistical Science, 2012; “A Unified Framework for High-Dimensional Analysis of M-estimators with Decomposable Regularizers”
Sparse Inverse Covariance: Sparse Penalized Maximum Likelihood
Encourage sparsity with a Lasso penalty:
$\hat{\Theta}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}; \Theta) - \lambda \lVert \Theta \rVert_{1,\mathrm{off}}, \qquad \lVert \Theta \rVert_{1,\mathrm{off}} = \sum_{j \neq k} \lvert \theta_{j,k} \rvert$
Convex problem: many optimization solutions are available.
Popular alternative when (L) ⇒ (G): neighborhood selection (Meinshausen & Buhlmann, 2006).
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006); Friedman et al. (2008); “QUIC”, Hsieh et al. (2011 & 2013); Buhlmann & Van De Geer (2011)
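In skggm, this penalized MLE is solved with QUIC; below is a minimal sketch assuming the version 0.1 API, where the estimator is `QuicGraphLasso` in the `inverse_covariance` package and `lam` is the penalty λ (fixed by hand here, with model selection discussed later):

```python
import numpy as np
from inverse_covariance import QuicGraphLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))   # stand-in for a centered (T, p) data matrix

model = QuicGraphLasso(lam=0.5)      # sparse penalized MLE at a fixed lambda
model.fit(X)

print(model.precision_)              # sparse estimate of Theta = Sigma^{-1}
print(model.covariance_)             # corresponding covariance estimate
```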
Model Identifiability of Sparse MLE: When is perfect edge recovery possible?
The Fisher information (F) of the inverse covariance needs to be well conditioned and satisfy incoherence conditions.
The signal strength of edges needs to be sufficiently larger than the noise.
Caveat: these conditions might always hold at infinite sample size, but only probabilistically in finite samples.
Meinshausen et al. (2006); Ravikumar et al. (2010, 2011); Van De Geer & Buhlmann (2013); and others
Model Identifiability: Network Structure Matters
Theoretical assumptions are often violated for many networks at finite samples (Narayan et al., 2015a).
Do two unconnected nodes share “mutual friends”? This increases with degree and depends on structure (Meinshausen et al., 2006; Ravikumar et al., 2010, 2011; Cai & Zhou, 2015).
The more correlated the nodes, the more errors in distinguishing edges from non-edges.
Model Identifiability of Sparse MLE: When is perfect edge recovery possible?
We will only look at the Lasso and its improved variants.
Different estimators have slightly different limitations: pseudolikelihood, least squares, Dantzig-type, ….
Other regularizers behave differently as well.
See the review of graphical models by Drton & Maathuis (2016).
skggm: Inverse covariance estimation, by @jasonlaska and @mnarayan

Features:
scikit-learn interface
Comprehensive range of estimators, model selection procedures, metrics, Monte Carlo benchmarks of statistical error control, …
For the researcher: benchmark a new estimator/algorithm against others
For the data analyst: best practices for estimation & structure learning

GitHub repo: http://github.com/jasonlaska/skggm
Tutorial notebooks: http://neurostats.org/jf2016-skggm/
skggm: Tutorial Setup
Tutorial: http://neurostats.org/jf2016-skggm/

Binder instructions: http://mybinder.org/repo/neuroquant/jf2016-skggm

Alternative/backup: install in a local Anaconda environment.
Install skggm: pip install skggm
Download the notebooks: git clone git@github.com:neuroquant/jf2016-skggm.git
Ground Truth
Toy Example: Simple Banded or Chain Network Structure
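A sketch of how such a ground truth can be generated (illustrative only; the tutorial notebooks have their own generators): a banded precision matrix and i.i.d. Gaussian samples drawn from its inverse.

```python
import numpy as np

def banded_precision(p, value=0.4):
    """Chain/banded ground truth: nonzeros only on the first off-diagonal."""
    return np.eye(p) + value * (np.eye(p, k=1) + np.eye(p, k=-1))

rng = np.random.default_rng(0)
p, T = 10, 150
Theta_true = banded_precision(p)
Sigma_true = np.linalg.inv(Theta_true)
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=T)
X = X - X.mean(axis=0)               # center, as the likelihood assumes
```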
Saturated Precision Matrices
Saturation: estimate all entries of the inverse covariance (precision).
Recall: with high curvature of the likelihood, it is easy to distinguish different graphs.
Saturated Precision Matrices
Degeneracy at low sample sizes (using the pseudo-inverse for the degenerate sample covariance).
Recall: with low curvature of the likelihood, it is hard to distinguish different graphs.
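In code, the saturated estimate at low sample size looks like this minimal sketch: with T < p the sample covariance is rank deficient, so the pseudo-inverse stands in for its inverse, and the result is dense.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 30, 50                            # low sample size: T < p
X = rng.standard_normal((T, p))
X = X - X.mean(axis=0)

S = (X.T @ X) / T
print(np.linalg.matrix_rank(S))          # at most T - 1 < p: degenerate
Theta_sat = np.linalg.pinv(S)            # saturated (dense) precision estimate
print(np.isclose(Theta_sat, 0).mean())   # essentially no exact zeros
```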
Standard Graphical Lasso: Sparse Penalized Maximum Likelihood
Model selection: how do we choose the regularization/sparsity/non-zero support?
$\hat{\Theta}(\lambda) = \arg\min_{\Theta \succ 0} \; -L(\hat{\Sigma}; \Theta) + \operatorname{Pen}_{\lambda}(\Theta)$
Friedman et al. (2007); Meinshausen and Buhlmann (2006); Banerjee et al. (2006); Rothman (2008); Hsieh et al.; Cai et al. (2011); and many more.
Cross Validation: Minimizes Type II Error
Split the data $X \to (X^{*,\mathrm{train}}, X^{*,\mathrm{test}})$, fit $\{\hat{\Theta}^{*}(\lambda)\}^{\mathrm{train}}$ on the training folds, and score $\operatorname{Loss}(\{\hat{\Theta}^{*}(\lambda)\}^{\mathrm{train}}; \{\hat{\Sigma}^{*}\}^{\mathrm{test}})$ on the hold-out folds, e.g. with the Kullback-Leibler divergence or the log-likelihood.
Yuan and Lin (2007); Bickel and Levina (2008)
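A sketch of cross-validated selection with skggm, assuming the `QuicGraphLassoCV` estimator, which fits a grid of penalties and scores them on held-out folds:

```python
import numpy as np
from inverse_covariance import QuicGraphLassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))       # stand-in for a centered data matrix

model = QuicGraphLassoCV(cv=4)           # 4-fold cross validation over lambdas
model.fit(X)

print(model.lam_)                        # selected penalty
print((model.precision_ != 0).sum())     # CV errs toward denser models (Type II)
```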
Extended BIC: Minimizes Type I Error
Privileges sparser models than BIC:
$\min_{\lambda} \operatorname{BIC}(\hat{E}(\lambda)) = \min_{\lambda} \; -L_n(\hat{\Sigma}; \hat{\Theta}) + |\hat{E}| \log(n) + 4 \gamma |\hat{E}| \log(p)$, where $|\hat{E}|$ = number of non-zeros in $\hat{\Theta}(\lambda)$
Foygel & Drton (2010). Alternatives (StARS, Liu et al.) coming soon.
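The corresponding sketch for EBIC-based selection, assuming the `QuicGraphLassoEBIC` estimator with EBIC parameter `gamma` (gamma = 0 recovers ordinary BIC; larger values privilege sparser models):

```python
import numpy as np
from inverse_covariance import QuicGraphLassoEBIC

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))

model = QuicGraphLassoEBIC(gamma=0.1)    # gamma > 0: stronger sparsity preference
model.fit(X)

print((model.precision_ != 0).sum())     # typically sparser than the CV choice
```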
One-Stage vs. Two-Stage Estimators
Use initial estimates to reduce bias in estimation.

Stage I, standard Graphical Lasso:
$\hat{\Theta}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}; \Theta) - \lambda \lVert \Theta \rVert_{1,\mathrm{off}}, \qquad \lVert \Theta \rVert_{1,\mathrm{off}} = \sum_{j \neq k} \lvert \theta_{j,k} \rvert$

Stage II, weighted Graphical Lasso:
$\hat{\Theta}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}; \Theta) - \lambda \lVert W * \Theta \rVert_{1,\mathrm{off}}, \qquad \lVert W * \Theta \rVert_{1,\mathrm{off}} = \sum_{j \neq k} \lvert w_{j,k} \theta_{j,k} \rvert$

E.g., adaptive weights $w_{jk} = 1 / \lvert \hat{\theta}^{\mathrm{init}}_{jk} \rvert$; see the sketch after this slide.
Zou (2006); Zhou et al. (2011); Buhlmann & Van De Geer (2011); Cai & Zhou (2015)
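The adaptive weights are simple to form by hand; a small sketch (function name mine) that turns any Stage I estimate into the weight matrix W for the Stage II weighted graphical lasso:

```python
import numpy as np

def adaptive_weights(theta_init, eps=1e-8):
    """w_jk = 1 / |theta_init_jk|: strong edges get small weights, shrink less."""
    W = 1.0 / (np.abs(theta_init) + eps)   # eps guards exactly-zero entries
    np.fill_diagonal(W, 0.0)               # the diagonal is not penalized
    return W
```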
Shrinkage of Edges: Lasso vs. Adaptive
With the Lasso, all edges shrink by the same value; with adaptive weights, strong edges shrink less.
Performance is very dependent on the weights, i.e., one needs good separation between strong and weak edges.
(Figure: solution paths, coefficient (entry of inverse covariance) vs. regularization parameter (lambda).)
Zou (2006)
Variety of Two-Stage Estimators
Weights can be specified in many ways:
Weights can be data dependent/adaptive; Stage I can be any estimator, not just the MLE, while Stage II is the adaptive MLE.
Use them to create randomized model averaging.
Locally linear approximations to non-convex penalties (coming soon to skggm).
Adaptive estimation: Zhou, Van De Geer, Buhlmann (2009); Breheny and Huang (2011); Cai et al. (2011); and others
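skggm packages this two-stage recipe; a sketch assuming the `AdaptiveGraphLasso` class, where `method='inverse'` corresponds to the 1/|θ̂_init| weights above and the fitted Stage II model is exposed as `estimator_`:

```python
import numpy as np
from inverse_covariance import AdaptiveGraphLasso, QuicGraphLassoEBIC

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))

model = AdaptiveGraphLasso(
    estimator=QuicGraphLassoEBIC(gamma=0.1),  # Stage I: any initial estimator
    method='inverse',                          # Stage II weights: 1 / |theta_init|
)
model.fit(X)

print(model.estimator_.precision_)             # refit, less-biased estimate
```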
High Sparsity Case: True Parameters
$n/p = 75$, degree $= 0.15p$
High Sample Size, High Sparsity ($n/p = 75$, degree $= 0.15p$)
The adaptive estimator improves on the initial estimator:
Difference in sparsity: 69 vs. 77; support error: 4.0 (false pos: 4.0, false neg: 0.0)
Difference in sparsity: 69 vs. 141; support error: 36.0 (false pos: 36.0, false neg: 0.0)
Low Sample Size, High Sparsity ($n/p = 15$, degree $= 0.15p$)
Adaptivity is less useful without a good initial estimate:
Difference in sparsity: 69 vs. 85; support error: 8.0 (false pos: 8.0, false neg: 0.0)
Difference in sparsity: 69 vs. 149; support error: 40.0 (false pos: 40.0, false neg: 0.0)
Moderate Sparsity Case: True Parameters
$n/p = 75$, degree $= 0.4p$
High Sample Size, Moderate Sparsity ($n/p = 75$, degree $= 0.4p$)
Nodes are more correlated with each other, but adaptivity still does well:
Difference in sparsity: 115 vs. 129; support error: 7.0 (false pos: 7.0, false neg: 0.0)
Difference in sparsity: 115 vs. 169; support error: 27.0 (false pos: 27.0, false neg: 0.0)
Low Sample Size, Moderate Sparsity ($n/p = 15$, degree $= 0.4p$)
Nodes are more correlated with each other, so there are more false negatives:
Difference in sparsity: 115 vs. 135; support error: 22.0 (false pos: 16.0, false neg: 6.0)
Difference in sparsity: 115 vs. 111; support error: 18.0 (false pos: 8.0, false neg: 10.0)
Model Averaging & Stability Selection ($n/p = 15$)
For any initial estimator, build an ensemble of estimators and aggregate:
$\hat{\Theta}^{*b}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}^{*b}; \Theta) - \operatorname{Pen}(W^{*b}(\lambda) * \Theta),$
$w^{*b}_{jk} = w^{*b}_{kj} \in \{\lambda / a, \; a\lambda\}$, drawn with $\operatorname{Ber}(\rho)$, for $j \neq k$
Aggregate $\mathbb{I}\big(\hat{\Theta}^{*b}(\lambda) \neq 0\big)$ over the ensemble.
Thresholding the stability scores => familywise error control over edges.
Meinshausen & Buhlmann (2010)
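A sketch of the ensemble step in skggm, assuming the `ModelAverage` class with `penalization='random'` (random penalty perturbations over subsampled data) and per-edge stability scores exposed as `proportion_`:

```python
import numpy as np
from inverse_covariance import QuicGraphLasso, ModelAverage

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))

model = ModelAverage(
    estimator=QuicGraphLasso(lam=0.5),  # base estimator for each trial
    n_trials=100,                       # size of the ensemble
    penalization='random',              # randomly perturbed penalty weights
)
model.fit(X)

# proportion_ holds per-edge selection frequencies in [0, 1];
# thresholding them gives familywise error control over edges.
stable_edges = model.proportion_ > 0.8
```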
skggm: Inverse covariance estimation, Version 0.1
Future plans include:
Computational scalability (BigQUIC, support for Apache Spark)
Monte Carlo “unit testing” of statistical error control
Novel case studies and more examples
Other estimator classes (pseudo-likelihood, non-convex, …)
Regularizers beyond sparsity: mixtures of regularizers, …
Other Markov network models for time series
Directed graphical models
