Sparse Inverse Covariance Estimation using skggm

Manjari Narayan, Postdoctoral Scholar, Stanford University (School of Medicine)
(PI: Amit Etkin, M.D., Ph.D.)

Tutorial presented at the Junior Scientist Workshop at HHMI, Janelia Farms.
skggm: collaboration with Dr. Jason Laska, ML R&D at Clara Labs.
Explosion of Functional Imaging Tools

fMRI, fNIRS, EEG, MEG
Intracranial EEG, micro-ECoG
Molecular fMRI
Calcium imaging
Light sheet microscopy
Voltage-sensitive dye imaging
Light field microscopy

Credits: Marie Suver, Ph.D., and Ainul Huda, University of Washington, and Michael H. Dickinson, Ph.D., California Institute of Technology (http://newsroom.cumc.columbia.edu/blog/2014/11/11/researchers-receive-nih-brain-initiative-funding/); Misha Ahrens, Ph.D., Janelia Farms (https://www.simonsfoundation.org/features/foundation-news/how-do-different-brain-regions-interact-to-enhance-function/); Tang, 2015, Scientific Reports; Raju Tomer, Ph.D., & Deisseroth Lab, Stanford University (http://techfinder.stanford.edu/technology_detail.php?ID=36402).
Application: Functional Connectomics

Network as a unit of interest: unobserved stochastic dependence/interaction between neurons, circuits, regions, … (Ahrens et al., Nature, 2012). A shared goal across modalities & resolutions, from macroscale to mesoscale.
(Figure: data matrices of T or n observations by p variables.)
Probabilistic Graphical Models

Many probabilistic models are available, both directed and undirected.

Graph $G = (V, E)$, vertices $V = \{1, \dots, p\}$, edges $E \subset V \times V$
$X = (X_1, \dots, X_p) \sim P_X$
A probabilistic graphical model relates $P_X$ to $G$:
$(j, k) \notin E \iff$ independence or conditional independence between $X_j$ and $X_k$

Examples:
Directed acyclic graphs (DAGs/Bayes nets)
State-space models, including linear/nonlinear VAR
Undirected graphical models or Markov networks
Bivariate associations (correlation, Granger causality, transfer entropy)
Models for Connectivity: Conditional Dependence & Markov Networks

More informative than correlations: a measure of “direct” interactions that eliminates “indirect” interactions due to observed common causes.

Benefits:
Studying cognitive mechanisms
Designing interventional targets
Science-wide efficient use of data

Conditional dependence (“partial correlations”) vs. marginal dependence (“marginal correlations”).
Introduction to Markov Networks
Markov Properties
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \dots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \dots, x_T \overset{\mathrm{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$

Pairwise Markov Property (P): two variables are conditionally independent given all other nodes, e.g.
$X_5 \perp X_1 \mid X_{V \setminus \{1, 5\}}$
(Figure: example graph on nodes 1–5.)
Lauritzen (1996)
Markov Properties (same Gaussian setup as above)

Local Markov Property (L): a variable is conditionally independent of all others given its neighbors, e.g.
$X_5 \perp X_{V \setminus \operatorname{ne}(5)} \mid X_{\operatorname{ne}(5)}$, where $\operatorname{ne}(5)$ denotes the neighbors of node 5.
Lauritzen (1996)
Markov Properties (same Gaussian setup as above)

Global Markov Property (G): given three disjoint sets A, B, and C such that all paths between A and B go through C, A is conditionally independent of B given C:
$X_A \perp X_B \mid X_C$, where $X_A = \{X_a\}_{a \in A}$
(Figure: the 5-node graph partitioned into sets A, B, C.)
Lauritzen (1996)
Benefits of Global Markov Properties

Intersection property: holds for positive densities, e.g. Gaussian, and has been extended to some non-positive densities!
If $A \perp B \mid (C, D)$ and $A \perp C \mid (B, D)$, then $A \perp (B \cup C) \mid D$.

Factorizes the probability distribution:
$P(X) = P(X_A \mid X_C)\, P(X_B \mid X_C)\, P(X_C)$

Computational tractability & statistical power to identify all conditional independences.
(Figure: the 5-node graph with sets A, B, C.)
Lauritzen (1996)
Generality of Markov Networks
For many types of pairwise associations, there are Markov networks that satisfy the Global Markov Property.

| Pairwise Association | Markov Networks |
| --- | --- |
| Correlation | Zero partial correlation = conditional independence |
| Coherence or coherency | Zero partial coherence = conditional independence |
| Directed information (including transfer entropy, Sims/Granger prediction, …) | Dynamic extensions to standard Markov properties, local independence (Didelez 2008) |
| Pairwise ordering between variables | DAGs, CPDAGs, MAGs, PAGs, … |

This is not an exhaustive list!
Generality of Markov Networks
For many probability distributions, there are Markov networks that satisfy at least the Local, if not the Global, Markov Property.

| Distributional Assumptions | Markov Networks |
| --- | --- |
| Exponential families (binary, Poisson, circular, …) | Exponential MRFs, including binary Ising models and Poisson graphical models (P. Ravikumar, G.I. Allen, and others) |
| Nonparametric distributions | Nonparanormal (copula) graphical models, kernel graphical models (H. Liu, E. Xing, B. Scholkopf, and others) |
| Separable covariance structure (spatio-temporal) | Separable Markov networks (G.I. Allen, S. Zhou, A. Hero, P. Hoff, and many others) |
From now on: Gaussian Graphical Model
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \dots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \dots, x_T \overset{\mathrm{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$

Zero in the inverse covariance = conditional independence:
$X_k \perp X_l \mid X_{V \setminus \{k, l\}} \iff (\Sigma^{-1})_{kl} = 0 \iff (k, l) \notin E$
(Figure: 5-node graph alongside the sparsity pattern of its 5 × 5 inverse covariance.)
Lauritzen (1996)
This zero/edge equivalence is the key property; it is important for nonparametric distributions + exponential-family models as well.
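To make the zero/edge correspondence concrete, here is a minimal numpy sketch (the chain precision matrix and its values are illustrative choices of mine, not from the slides): the covariance $\Sigma$ is dense, yet inverting it recovers exactly the chain's sparsity pattern.

```python
import numpy as np

p = 5
# Tridiagonal precision matrix for the chain graph 1-2-3-4-5 (illustrative values)
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta)

# The covariance is dense: even non-adjacent nodes are marginally correlated...
print(np.round(Sigma, 2))
# ...but the zeros of the inverse covariance match the missing edges exactly.
print(np.abs(np.linalg.inv(Sigma)) > 1e-8)
```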
Estimation in High Dimensions
Gaussian Log-Likelihood: Likelihood for Inverse Covariance
The input to the log-likelihood is effectively the sample covariance:
$\hat{\Sigma} = \frac{1}{T} X^{\top} X$, where the data matrix $X \in \mathbb{R}^{T \times p}$ is centered
$L(\hat{\Sigma}; \Theta) \equiv \log \det \Theta - \langle \hat{\Sigma}, \Theta \rangle$
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006)
Gaussian Log-Likelihood: Variance-Correlation Decomposition
Put all variables on the same scale:
$R(\hat{\Sigma}) = D^{-\frac{1}{2}} \hat{\Sigma} D^{-\frac{1}{2}}, \quad D = \operatorname{diag}(\hat{\Sigma})$
$L(\hat{\Sigma}; \Theta) \equiv \log \det \Theta - \langle R(\hat{\Sigma}), \Theta \rangle$
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006)
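A minimal numpy sketch of both steps (variable names are mine): form the sample covariance from a centered data matrix, rescale it to a correlation matrix, and evaluate the log-likelihood at a candidate precision matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 200, 5
X = rng.standard_normal((T, p))
X = X - X.mean(axis=0)                 # center the data matrix

Sigma_hat = (X.T @ X) / T              # sample covariance (1/T) X^T X

d = np.sqrt(np.diag(Sigma_hat))
R = Sigma_hat / np.outer(d, d)         # D^{-1/2} Sigma_hat D^{-1/2}

def gaussian_loglik(S, Theta):
    """log det(Theta) - <S, Theta>, up to constants."""
    _, logdet = np.linalg.slogdet(Theta)
    return logdet - np.sum(S * Theta)

print(gaussian_loglik(R, np.linalg.inv(R)))   # likelihood at the saturated MLE
```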
Degeneracy of Likelihood in High Dimensions
Given $X \in \mathbb{R}^{T \times p}$ with $T \approx p$, the likelihood can be nearly flat: high curvature (easy) vs. low curvature (hard).
Credit: Negahban, Ravikumar, Wainwright & Yu, Statistical Science, 2012; “A Unified Framework for High-Dimensional Analysis of M-estimators with Decomposable Regularizers”
Sparse Inverse Covariance: Sparse Penalized Maximum Likelihood
Encourage sparsity with a Lasso penalty:
$\hat{\Theta}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}; \Theta) - \lambda \lVert \Theta \rVert_{1,\mathrm{off}}, \qquad \lVert \Theta \rVert_{1,\mathrm{off}} = \sum_{j \neq k} \lvert \theta_{j,k} \rvert$
Convex problem: many optimization solutions are available.
Popular alternative when (L) ⇒ (G): neighborhood selection (Meinshausen & Buhlmann, 2006).
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006); Friedman et al. (2008); “QUIC”, Hsieh et al. (2011 & 2013); Buhlmann & Van De Geer (2011)
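In skggm, this penalized MLE is solved with QUIC; below is a minimal sketch assuming the version 0.1 API, where the estimator is `QuicGraphLasso` in the `inverse_covariance` package and `lam` is the penalty λ (fixed by hand here, with model selection discussed later):

```python
import numpy as np
from inverse_covariance import QuicGraphLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))   # stand-in for a centered (T, p) data matrix

model = QuicGraphLasso(lam=0.5)      # sparse penalized MLE at a fixed lambda
model.fit(X)

print(model.precision_)              # sparse estimate of Theta = Sigma^{-1}
print(model.covariance_)             # corresponding covariance estimate
```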
Model Identifiability of Sparse MLE: When is perfect edge recovery possible?
The Fisher information (F) of the inverse covariance needs to be well conditioned and satisfy incoherence conditions.
The signal strength of edges needs to be sufficiently larger than the noise.
Caveat: these conditions might always hold at infinite sample size, but only probabilistically in finite samples.
Meinshausen et al. (2006); Ravikumar et al. (2010, 2011); Van De Geer & Buhlmann (2013); and others
Model Identifiability: Network Structure Matters
Theoretical assumptions are often violated for many networks at finite samples (Narayan et al., 2015a).
Do two unconnected nodes share “mutual friends”? This increases with degree and depends on structure (Meinshausen et al., 2006; Ravikumar et al., 2010, 2011; Cai & Zhou, 2015).
The more correlated the nodes, the more errors in distinguishing edges from non-edges.
Model Identifiability of Sparse MLE: When is perfect edge recovery possible?
We will only look at the Lasso and its improved variants.
Different estimators have slightly different limitations: pseudolikelihood, least squares, Dantzig-type, ….
Other regularizers behave differently as well.
See the review of graphical models by Drton & Maathuis (2016).
skggm: Inverse covariance estimation, by @jasonlaska and @mnarayan

Features:
scikit-learn interface
Comprehensive range of estimators, model selection procedures, metrics, Monte Carlo benchmarks of statistical error control, …
For the researcher: benchmark a new estimator/algorithm against others
For the data analyst: best practices for estimation & structure learning

GitHub repo: http://github.com/jasonlaska/skggm
Tutorial notebooks: http://neurostats.org/jf2016-skggm/
skggm: Tutorial Setup
Tutorial: http://neurostats.org/jf2016-skggm/

Binder instructions: http://mybinder.org/repo/neuroquant/jf2016-skggm

Alternative/backup: install in a local Anaconda environment.
Install skggm: pip install skggm
Download the notebooks: git clone git@github.com:neuroquant/jf2016-skggm.git
Ground Truth
Toy Example: Simple Banded or Chain Network Structure
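A sketch of how such a ground truth can be generated (illustrative only; the tutorial notebooks have their own generators): a banded precision matrix and i.i.d. Gaussian samples drawn from its inverse.

```python
import numpy as np

def banded_precision(p, value=0.4):
    """Chain/banded ground truth: nonzeros only on the first off-diagonal."""
    return np.eye(p) + value * (np.eye(p, k=1) + np.eye(p, k=-1))

rng = np.random.default_rng(0)
p, T = 10, 150
Theta_true = banded_precision(p)
Sigma_true = np.linalg.inv(Theta_true)
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=T)
X = X - X.mean(axis=0)               # center, as the likelihood assumes
```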
Saturated Precision Matrices
Saturation: estimate all entries of the inverse covariance (precision).
Recall: with high curvature of the likelihood, it is easy to distinguish different graphs.
Saturated Precision Matrices
Degeneracy at low sample sizes (using the pseudo-inverse for the degenerate sample covariance).
Recall: with low curvature of the likelihood, it is hard to distinguish different graphs.
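In code, the saturated estimate at low sample size looks like this minimal sketch: with T < p the sample covariance is rank deficient, so the pseudo-inverse stands in for its inverse, and the result is dense.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 30, 50                            # low sample size: T < p
X = rng.standard_normal((T, p))
X = X - X.mean(axis=0)

S = (X.T @ X) / T
print(np.linalg.matrix_rank(S))          # at most T - 1 < p: degenerate
Theta_sat = np.linalg.pinv(S)            # saturated (dense) precision estimate
print(np.isclose(Theta_sat, 0).mean())   # essentially no exact zeros
```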
Standard Graphical Lasso: Sparse Penalized Maximum Likelihood
Model selection: how do we choose the regularization/sparsity/non-zero support?
$\hat{\Theta}(\lambda) = \arg\min_{\Theta \succ 0} \; -L(\hat{\Sigma}; \Theta) + \operatorname{Pen}_{\lambda}(\Theta)$
Friedman et al. (2007); Meinshausen and Buhlmann (2006); Banerjee et al. (2006); Rothman (2008); Hsieh et al.; Cai et al. (2011); and many more.
Cross Validation: Minimizes Type II Error
Split the data $X \to (X^{*,\mathrm{train}}, X^{*,\mathrm{test}})$, fit $\{\hat{\Theta}^{*}(\lambda)\}^{\mathrm{train}}$ on the training folds, and score $\operatorname{Loss}(\{\hat{\Theta}^{*}(\lambda)\}^{\mathrm{train}}; \{\hat{\Sigma}^{*}\}^{\mathrm{test}})$ on the hold-out folds, e.g. with the Kullback-Leibler divergence or the log-likelihood.
Yuan and Lin (2007); Bickel and Levina (2008)
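A sketch of cross-validated selection with skggm, assuming the `QuicGraphLassoCV` estimator, which fits a grid of penalties and scores them on held-out folds:

```python
import numpy as np
from inverse_covariance import QuicGraphLassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))       # stand-in for a centered data matrix

model = QuicGraphLassoCV(cv=4)           # 4-fold cross validation over lambdas
model.fit(X)

print(model.lam_)                        # selected penalty
print((model.precision_ != 0).sum())     # CV errs toward denser models (Type II)
```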
Extended BIC: Minimizes Type I Error
Privileges sparser models than BIC:
$\min_{\lambda} \operatorname{BIC}(\hat{E}(\lambda)) = \min_{\lambda} \; -L_n(\hat{\Sigma}; \hat{\Theta}) + |\hat{E}| \log(n) + 4 \gamma |\hat{E}| \log(p)$, where $|\hat{E}|$ = number of non-zeros in $\hat{\Theta}(\lambda)$
Foygel & Drton (2010). Alternatives (StARS, Liu et al.) coming soon.
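The corresponding sketch for EBIC-based selection, assuming the `QuicGraphLassoEBIC` estimator with EBIC parameter `gamma` (gamma = 0 recovers ordinary BIC; larger values privilege sparser models):

```python
import numpy as np
from inverse_covariance import QuicGraphLassoEBIC

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))

model = QuicGraphLassoEBIC(gamma=0.1)    # gamma > 0: stronger sparsity preference
model.fit(X)

print((model.precision_ != 0).sum())     # typically sparser than the CV choice
```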
One-Stage vs. Two-Stage Estimators
Use initial estimates to reduce bias in estimation.

Stage I, standard Graphical Lasso:
$\hat{\Theta}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}; \Theta) - \lambda \lVert \Theta \rVert_{1,\mathrm{off}}, \qquad \lVert \Theta \rVert_{1,\mathrm{off}} = \sum_{j \neq k} \lvert \theta_{j,k} \rvert$

Stage II, weighted Graphical Lasso:
$\hat{\Theta}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}; \Theta) - \lambda \lVert W * \Theta \rVert_{1,\mathrm{off}}, \qquad \lVert W * \Theta \rVert_{1,\mathrm{off}} = \sum_{j \neq k} \lvert w_{j,k} \theta_{j,k} \rvert$

E.g., adaptive weights $w_{jk} = 1 / \lvert \hat{\theta}^{\mathrm{init}}_{jk} \rvert$; see the sketch after this slide.
Zou (2006); Zhou et al. (2011); Buhlmann & Van De Geer (2011); Cai & Zhou (2015)
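The adaptive weights are simple to form by hand; a small sketch (function name mine) that turns any Stage I estimate into the weight matrix W for the Stage II weighted graphical lasso:

```python
import numpy as np

def adaptive_weights(theta_init, eps=1e-8):
    """w_jk = 1 / |theta_init_jk|: strong edges get small weights, shrink less."""
    W = 1.0 / (np.abs(theta_init) + eps)   # eps guards exactly-zero entries
    np.fill_diagonal(W, 0.0)               # the diagonal is not penalized
    return W
```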
Shrinkage of Edges: Lasso vs. Adaptive
With the Lasso, all edges shrink by the same value; with adaptive weights, strong edges shrink less.
Performance is very dependent on the weights, i.e., one needs good separation between strong and weak edges.
(Figure: solution paths, coefficient (entry of inverse covariance) vs. regularization parameter (lambda).)
Zou (2006)
Variety of Two-Stage Estimators
Weights can be specified in many ways:
Weights can be data dependent/adaptive; Stage I can be any estimator, not just the MLE, while Stage II is the adaptive MLE.
Use them to create randomized model averaging.
Locally linear approximations to non-convex penalties (coming soon to skggm).
Adaptive estimation: Zhou, Van De Geer, Buhlmann (2009); Breheny and Huang (2011); Cai et al. (2011); and others
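skggm packages this two-stage recipe; a sketch assuming the `AdaptiveGraphLasso` class, where `method='inverse'` corresponds to the 1/|θ̂_init| weights above and the fitted Stage II model is exposed as `estimator_`:

```python
import numpy as np
from inverse_covariance import AdaptiveGraphLasso, QuicGraphLassoEBIC

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))

model = AdaptiveGraphLasso(
    estimator=QuicGraphLassoEBIC(gamma=0.1),  # Stage I: any initial estimator
    method='inverse',                          # Stage II weights: 1 / |theta_init|
)
model.fit(X)

print(model.estimator_.precision_)             # refit, less-biased estimate
```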
High Sparsity Case: True Parameters
$n/p = 75$, degree $= 0.15p$
High Sample Size, High Sparsity ($n/p = 75$, degree $= 0.15p$)
The adaptive estimator improves on the initial estimator:
Difference in sparsity: 69 vs. 77; support error: 4.0 (false pos: 4.0, false neg: 0.0)
Difference in sparsity: 69 vs. 141; support error: 36.0 (false pos: 36.0, false neg: 0.0)
Low Sample Size, High Sparsity ($n/p = 15$, degree $= 0.15p$)
Adaptivity is less useful without a good initial estimate:
Difference in sparsity: 69 vs. 85; support error: 8.0 (false pos: 8.0, false neg: 0.0)
Difference in sparsity: 69 vs. 149; support error: 40.0 (false pos: 40.0, false neg: 0.0)
Moderate Sparsity Case: True Parameters
$n/p = 75$, degree $= 0.4p$
High Sample Size, Moderate Sparsity ($n/p = 75$, degree $= 0.4p$)
Nodes are more correlated with each other, but adaptivity still does well:
Difference in sparsity: 115 vs. 129; support error: 7.0 (false pos: 7.0, false neg: 0.0)
Difference in sparsity: 115 vs. 169; support error: 27.0 (false pos: 27.0, false neg: 0.0)
Low Sample Size, Moderate Sparsity ($n/p = 15$, degree $= 0.4p$)
Nodes are more correlated with each other, so there are more false negatives:
Difference in sparsity: 115 vs. 135; support error: 22.0 (false pos: 16.0, false neg: 6.0)
Difference in sparsity: 115 vs. 111; support error: 18.0 (false pos: 8.0, false neg: 10.0)
Model Averaging & Stability Selection ($n/p = 15$)
For any initial estimator, build an ensemble of estimators and aggregate:
$\hat{\Theta}^{*b}(\lambda) = \arg\max_{\Theta \succ 0} \; L(\hat{\Sigma}^{*b}; \Theta) - \operatorname{Pen}(W^{*b}(\lambda) * \Theta),$
$w^{*b}_{jk} = w^{*b}_{kj} \in \{\lambda / a, \; a\lambda\}$, drawn with $\operatorname{Ber}(\rho)$, for $j \neq k$
Aggregate $\mathbb{I}\big(\hat{\Theta}^{*b}(\lambda) \neq 0\big)$ over the ensemble.
Thresholding the stability scores => familywise error control over edges.
Meinshausen & Buhlmann (2010)
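A sketch of the ensemble step in skggm, assuming the `ModelAverage` class with `penalization='random'` (random penalty perturbations over subsampled data) and per-edge stability scores exposed as `proportion_`:

```python
import numpy as np
from inverse_covariance import QuicGraphLasso, ModelAverage

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))

model = ModelAverage(
    estimator=QuicGraphLasso(lam=0.5),  # base estimator for each trial
    n_trials=100,                       # size of the ensemble
    penalization='random',              # randomly perturbed penalty weights
)
model.fit(X)

# proportion_ holds per-edge selection frequencies in [0, 1];
# thresholding them gives familywise error control over edges.
stable_edges = model.proportion_ > 0.8
```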
skggm: Inverse covariance estimation, Version 0.1
Future plans include:
Computational scalability (BigQUIC, support for Apache Spark)
Monte Carlo “unit testing” of statistical error control
Novel case studies and more examples
Other estimator classes (pseudo-likelihood, non-convex, …)
Regularizers beyond sparsity: mixtures of regularizers, …
Other Markov network models for time series
Directed graphical models
