A General Framework for
Multiple Testing Dependence
Jeffrey Leek
Johns Hopkins University School of Medicine
Big Picture Ideas
High-dimensional multiple hypothesis testing is common.
Problem: Dependence between tests can result in incorrect statistical and scientific results.
A solution: Define and address multiple testing dependence at the level of the data, not the P-values.
High-Dimensional Multiple Testing Is Common
Spatial Epidemiology, Brain Imaging, Molecular Biology
Inflammation and the Host Response to Injury
[Figure: mRNA expression (~50,000 genes) and clinical data (>150 clinical variables) for Patients 1 to 166; MOF measures severity of injury.]
Data at Initial Time Point
Multiple Organ Failure
Simple Analysis
1. Fit the model to the data, $x_i$, for gene $i$: $x_i = a_i + b_i\,\mathrm{MOF} + e_i$
2. Calculate P-values for testing the hypotheses: $H_0: b_i = 0$ vs. $H_1: b_i \neq 0$
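The two steps of the simple analysis can be sketched as a gene-by-gene least-squares fit. This is a minimal illustration on simulated data, not the analysis from the talk: the names `simple_analysis`, `X`, and `mof` are hypothetical, and the P-values use a normal approximation to the t reference distribution, an assumption made to keep the sketch free of extra dependencies.

```python
import math
import numpy as np

def simple_analysis(X, mof):
    """Fit x_i = a_i + b_i * MOF + e_i to each row of X (genes x patients)."""
    m, n = X.shape
    D = np.column_stack([np.ones(n), mof])          # n x 2 design: intercept, MOF
    coef = np.linalg.lstsq(D, X.T, rcond=None)[0]   # 2 x m: row 0 = a_i, row 1 = b_i
    resid = X.T - D @ coef                          # n x m residuals
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)     # residual variance per gene
    se_b = np.sqrt(sigma2 * np.linalg.inv(D.T @ D)[1, 1])
    t_stat = coef[1] / se_b
    # ASSUMPTION: two-sided P-value from a normal approximation to the t distribution.
    p = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stat])
    return coef[1], t_stat, p

# Hypothetical simulated data: only gene 0 is truly associated with MOF.
rng = np.random.default_rng(0)
n_patients, n_genes = 100, 2000
mof = rng.normal(size=n_patients)
X = rng.normal(size=(n_genes, n_patients))
X[0] += 2.0 * mof
b_hat, t_stat, p_val = simple_analysis(X, mof)
print(b_hat[0], p_val[0], p_val[1:].mean())
```

With independent noise, as simulated here, the null P-values are close to uniform; the later slides show how this breaks down under dependence.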
Four "Replicated" Studies
[Figure: P-value histograms (Frequency vs. P-value) for Phase 1, Phase 2, Phase 3, and Phase 4.]
Start With The Whole Data
{m hypothesis tests, n observations per test}
• Data for test i: $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$
• "Primary variable(s)": $Y = (y_1, y_2, \ldots, y_n)$
• Model: $x_{ij} = a_i + \sum_{k=1}^{d} b_{ik}\, s_k(y_j) + e_{ij}$
• Hypothesis test i: $H_{0i}: b_i \in \Omega_0$ vs. $H_{1i}: b_i \in \Omega_1$
Underlying Model
$X = B\,S(Y) + E$, where the rows of $X$ are tests and the columns are observations.
A Simple Simulated Example
[Figure: simulated data, genes by arrays, under Independent E and Dependent E.]
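A simulated example of this kind can be sketched as follows. The talk does not give its simulation settings, so everything here is an assumption chosen for illustration: the two-group primary variable, the 10% of genes with signal, and a single shared factor `g` that induces the dependence in E.

```python
import numpy as np

def simulate(m=1000, n=20, dependent=False, seed=0):
    """Return X = B S + E with either independent or dependent noise E."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0.0, 1.0], n // 2)            # primary variable: two groups
    S = np.vstack([np.ones(n), y])               # d x n basis, d = 2
    B = np.zeros((m, 2))
    B[:, 0] = rng.normal(size=m)                 # gene-specific baselines
    B[: m // 10, 1] = rng.normal(size=m // 10)   # 10% of genes carry signal
    U = rng.normal(size=(m, n))                  # independent noise
    if dependent:
        g = rng.normal(size=(1, n))              # hypothetical shared factor
        gamma = 2.0 * rng.normal(size=(m, 1))    # gene-specific loadings
        E = gamma @ g + U
    else:
        E = U
    return B @ S + E, S, E

def mean_abs_row_corr(E, k=200):
    """Average absolute correlation between pairs of the first k rows."""
    C = np.corrcoef(E[:k])
    return float(np.abs(C[~np.eye(k, dtype=bool)]).mean())

X_ind, S, E_ind = simulate(dependent=False)
X_dep, _, E_dep = simulate(dependent=True)
corr_ind = mean_abs_row_corr(E_ind)
corr_dep = mean_abs_row_corr(E_dep)
print(corr_ind, corr_dep)
```

The average absolute correlation between genes is a crude summary, but it is enough to show that the two noise matrices differ in the way the figure contrasts.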
Null P-Value Distributions
[Figure: null P-value histograms (Frequency vs. P-value), four panels for Independent E and four panels for Dependent E. Panels are labeled with correlation |ρ| = 0.40, 0.31, 0.10, and 0.00.]
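The behavior in these histograms can be reproduced in a small simulation: with no true signal, the fraction of P-values below 0.05 is stable from data set to data set when E is independent, but swings widely when E is dependent. The sketch below is an assumption-laden illustration (one shared factor, normal approximation for P-values, hypothetical function names), not the simulation used in the talk.

```python
import math
import numpy as np

def null_pvalues(E, y):
    """Two-sided P-values for the slope on y in each row of a pure-noise matrix E."""
    m, n = E.shape
    D = np.column_stack([np.ones(n), y])
    coef = np.linalg.lstsq(D, E.T, rcond=None)[0]
    resid = E.T - D @ coef
    se = np.sqrt((resid ** 2).sum(axis=0) / (n - 2) * np.linalg.inv(D.T @ D)[1, 1])
    t_stat = coef[1] / se
    # ASSUMPTION: normal approximation to the t reference distribution.
    return np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stat])

def fraction_below(alpha, dependent, reps=40, m=500, n=40, seed=1):
    """Fraction of null P-values below alpha in each of `reps` simulated data sets."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0.0, 1.0], n // 2)
    out = []
    for _ in range(reps):
        E = rng.normal(size=(m, n))
        if dependent:
            # One shared factor with gene-specific loadings makes the rows dependent.
            E = E + rng.normal(size=(m, 1)) @ rng.normal(size=(1, n))
        out.append((null_pvalues(E, y) < alpha).mean())
    return np.array(out)

frac_ind = fraction_below(0.05, dependent=False)
frac_dep = fraction_below(0.05, dependent=True)
print(frac_ind.std(), frac_dep.std())
```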
Null Distribution Behavior
[Figure: panels for Dependent E and Independent E.]
False Discovery Rate Estimates
[Figure: panels for Independent E and Dependent E.]
Ranking Estimates
[Figure: panels for Independent E and Dependent E.]
When To Address Dependence?
Data X → Fit Model X = BS + E → Obtain $\hat{B}$ and R → Form Test-Statistics and Null Distribution → Calculate P-values → Form P-value Threshold
Existing Approaches
Empirical null approaches modify the null distribution at the test-statistic level.
Dependence adjustments conservatively modify the P-value threshold.
Examples of Existing Approaches
• Empirical Null
  – Devlin and Roeder, Biometrics (1999)
  – Efron, JASA (2004)
  – Schwartzman, AOAS (2008)
• Error Rate Adjustments
  – Benjamini and Yekutieli, Annals of Statistics (2001)
  – Romano, Shaikh, and Wolf, Test (2001)
  – Dudoit, Gilbert, and van der Laan, Biometrical Journal (2008)
Our Approach
Fit the model X = BS + ΓG + U, where G is a valid dependence kernel.
Dependence and bias are no longer present at any of these steps; standard methods can be used.
New Dependence Definitions
Definition – Data X are population-level multiple testing
dependent if:
Definition - Data X are estimation-level multiple testing
dependent if:
Leek and Storey (2008)
Structure in E
[Figure: genes-by-array panels for MOF1 showing Signal + Dependent Noise, Dependent Noise, and Independent Noise.]
Decomposing E
$X = BS + E$ (data = primary variables + random variation; rows are tests, columns are observations)
$X = BS + H + U$ (random variation split into dependent variation $H$ and independent variation $U$)
$X = BS + \Gamma G + U$ (dependent variation $H$ written as $\Gamma G$, where $G$ is the dependence kernel)
Decomposing E
Theorem. Let the data be distributed according to the model $X = BS + E$. Suppose that for each $e_i$ there is no Borel measurable function $g$ such that $e_i = g(e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_m)$ almost surely. Then there exist matrices $\Gamma$ ($m \times r$), $G$ ($r \times n$) ($r \le n$), and $U$ ($m \times n$) such that $X = BS + \Gamma G + U$, where the rows of $U$ are independent, $u_i \neq 0$, and $u_i = h_i(e_i)$ for a non-random Borel measurable function $h_i$.
Leek and Storey (2008)
Dependence Kernel
Definition (Dependence Kernel). An $r \times n$ matrix $G$ forms a dependence kernel for the data $X$ if the following equality holds:
$X = BS + E = BS + \Gamma G + U$,
where the rows of $U$ are independent.
Leek and Storey (2008)
Fitting S & G Results In Independent Tests
Theorem. Let $G$ be any valid dependence kernel for the data $X$. Suppose that the model $x_i = b_i S + \gamma_i G + u_i$ is fit by least squares, resulting in residuals $r_i = x_i - \hat{b}_i S - \hat{\gamma}_i G$. If the row space jointly spanned by $S$ and $G$ has dimension less than $n$, then the $r_i$ and the $\hat{b}_i$ are jointly independent given $S$ and $G$, and: [equation not recovered from the slide]
Leek and Storey (2008)
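The effect of fitting both S and G can be checked numerically: residuals from a fit on S alone stay correlated across tests, while residuals from a fit on S and G do not. This is a sketch under strong assumptions: the data are simulated, the kernel `G` is treated as known (in practice it must be estimated), and the function names are hypothetical.

```python
import numpy as np

def ls_residuals(X, rows):
    """Residuals from regressing each row of X on the rows of `rows` (k x n)."""
    coef = np.linalg.lstsq(rows.T, X.T, rcond=None)[0]   # k x m coefficients
    return (X.T - rows.T @ coef).T                       # m x n residuals

def mean_abs_row_corr(R, k=200):
    """Average absolute correlation between pairs of the first k rows."""
    C = np.corrcoef(R[:k])
    return float(np.abs(C[~np.eye(k, dtype=bool)]).mean())

rng = np.random.default_rng(0)
m, n = 1000, 20
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])   # d x n, d = 2
G = rng.normal(size=(1, n))                                  # known kernel, r = 1
X = (rng.normal(size=(m, 2)) @ S                             # B S
     + 2.0 * rng.normal(size=(m, 1)) @ G                     # Gamma G
     + rng.normal(size=(m, n)))                              # U

resid_sg = ls_residuals(X, np.vstack([S, G]))
corr_s_only = mean_abs_row_corr(ls_residuals(X, S))
corr_s_and_g = mean_abs_row_corr(resid_sg)
print(corr_s_only, corr_s_and_g)
```

The second number does not go to zero because sample correlations computed from only n = 20 observations are noisy even for independent rows; the point is the drop relative to the S-only fit.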
A "Blessing" of Dimensionality
$X = BS + \Gamma G + U$ (data = primary variables + dependence kernel + independent variation; rows are tests, columns are observations)
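Iteratively Reweighted Surrogate Variable Analysis
Whole data: $X = BS + \Gamma G + U$
Test i data: $x_i = b_i S + \gamma_i G + u_i$
1. Estimate the row dimension, $\hat{r}$, of $G$.
2. Form an initial estimate $\hat{G}_0$ equal to the first $\hat{r}$ right singular vectors of $R = X - \hat{B}S$.
Iterate for $b = 0, \ldots, B$:
3. Estimate the probability weight Pr(G & !S) for each test $i$, that is, the probability that test $i$ is associated with $G$ but not with $S$.
4. Weight the $i$th row of $X$ by its estimated probability and set $\hat{G}^{(b+1)}$ to be the first $\hat{r}$ right singular vectors of the weighted matrix.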
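Step 2 of the algorithm, the initial estimate of G, can be sketched with an SVD of the least-squares residuals. This covers only the initialization, not the reweighting iterations; the simulated data, the function name `initial_kernel_estimate`, and the `captured` diagnostic are illustrative assumptions.

```python
import numpy as np

def initial_kernel_estimate(X, S, r_hat):
    """First r_hat right singular vectors of R = X - B_hat S (least-squares B_hat)."""
    B_hat = np.linalg.lstsq(S.T, X.T, rcond=None)[0].T   # m x d coefficients
    R = X - B_hat @ S                                    # m x n residuals
    Vt = np.linalg.svd(R, full_matrices=False)[2]        # rows = right singular vectors
    return Vt[:r_hat]                                    # r_hat x n, orthonormal rows

rng = np.random.default_rng(0)
m, n, r = 1000, 20, 2
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])
G = rng.normal(size=(r, n))
X = (rng.normal(size=(m, 2)) @ S
     + rng.normal(size=(m, r)) @ G
     + rng.normal(size=(m, n)))
G0 = initial_kernel_estimate(X, S, r_hat=r)

# Diagnostic: share of the part of the true G orthogonal to S that lies in span(G0).
P = np.eye(n) - S.T @ np.linalg.solve(S @ S.T, S)
G_perp = G @ P
captured = np.linalg.norm(G_perp @ G0.T @ G0) ** 2 / np.linalg.norm(G_perp) ** 2
print(captured)
```

Because the residuals are orthogonal to S by construction, this initial estimate can only recover the part of G that is not already explained by S.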
An Example of the IRW-SVA Algorithm
[Figure, repeated over seven slides: panels "The Data", "True G", "Estimate of G", and "Pr(G & !S)".]
Estimating The Row Dimension of G
1. Buja and Eyuboglu (1992) proposed a permutation approach.
2. Patterson, Price, and Reich (2006) proposed a sequential testing strategy based on Tracy-Widom theory.
3. Leek (in preparation) proposes an eigenvalue estimator that is consistent in the number of tests.
Estimating The Row Dimension of G
1. Assume the data follow $X = BS + \Gamma G + U$, where $G$ and $S$ have row dimensions $r$ and $d$, $r + d < n$.
2. Calculate the singular values $s_1, \ldots, s_n$ of $X$ and choose $b$ such that $r + d < b$.
3. Calculate the eigenvalues $\lambda_1, \ldots, \lambda_n$ of $\frac{1}{m}\left[R^T R - s_b^2 P\right]$, where $P = I - S^T (S S^T)^{-1} S$ and $R = XP$.
4. Set $\hat{r} = \sum_{j=1}^{n} 1\left(\lambda_j > m^{-1/3}\right)$.
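The four steps above can be sketched directly in NumPy. This is an illustration under assumptions, not the author's implementation: S is stored as d x n, so the projection is written as P = I - Sᵀ(SSᵀ)⁻¹S; the choice b = 10 and the simulated noise scale (standard deviation 0.2) are arbitrary settings for the example.

```python
import numpy as np

def estimate_row_dimension(X, S, b):
    """r_hat = #{j : lambda_j > m^(-1/3)}, lambda_j eigenvalues of (1/m)[R'R - s_b^2 P]."""
    m, n = X.shape
    s = np.linalg.svd(X, compute_uv=False)               # s_1 >= ... >= s_n
    P = np.eye(n) - S.T @ np.linalg.solve(S @ S.T, S)    # projects out the rows of S
    R = X @ P
    M = (R.T @ R - s[b - 1] ** 2 * P) / m                # b is 1-indexed as on the slide
    lam = np.linalg.eigvalsh((M + M.T) / 2.0)            # symmetrize against round-off
    return int((lam > m ** (-1.0 / 3.0)).sum())

# Hypothetical simulated data with d = 2 primary-variable rows and r = 2 kernel rows.
rng = np.random.default_rng(0)
m, n, d, r = 5000, 20, 2, 2
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])
G = rng.normal(size=(r, n))
X = (rng.normal(size=(m, d)) @ S
     + rng.normal(size=(m, r)) @ G
     + 0.2 * rng.normal(size=(m, n)))
r_hat = estimate_row_dimension(X, S, b=10)
print(r_hat)
```

The threshold $m^{-1/3}$ is an asymptotic device; at a fixed number of tests the count depends on the noise scale, which is why the example uses a fairly large m and small noise.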
Estimating The Row Dimension of G
Theorem. As $m \to \infty$, $\hat{r} = \sum_{j=1}^{n} 1\left(\lambda_j > m^{-1/3}\right)$ is a consistent estimate of the row dimension of $G$, provided that:
(1) the $u_{ij}$ are independent
(2) $E[u_{ij}] = 0$
(3) $E[u_{ij}^2] = \sigma_i^2 < M_1$
(4) $E[u_{ij}^4] < M_2$
(5) $\lim_{m \to \infty} \frac{1}{m} \Gamma^T \Gamma$ is positive definite with unique eigenvalues
Leek (In Prep.)
Break The Estimation Into Two Components
Estimating the Probability Weights
1. Form F-statistics $F_1, \ldots, F_m$ for testing the hypotheses: [hypotheses not recovered from the slide]
2. Bootstrap from the conditional null model to obtain null statistics $F_1^{0k}, \ldots, F_m^{0k}$, $k = 1, \ldots, K$.
3. From Bayes' Theorem: [equation not recovered from the slide], where $F_i^{0k} \sim g_0$ and $F_i \sim \pi_0 g_0 + (1 - \pi_0) g_1$.
4. Estimate the ratio of the densities with a non-parametric logistic regression where the $F_i$ are "successes" and the $F_i^{0k}$ are "failures" (Anderson and Blair 1982).
5. Estimate $\pi_0$ according to Storey (2002).
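The weighting idea can be sketched under the two-group mixture $F_i \sim \pi_0 g_0 + (1 - \pi_0) g_1$, for which Bayes' theorem gives a posterior null probability of $\pi_0 g_0(F_i) / g(F_i)$, with $g$ the density of the observed statistics. Several pieces below are assumptions rather than the talk's method: the density ratio is estimated with shared quantile bins instead of the non-parametric logistic regression named on the slide, the null statistics are drawn directly from a known null rather than bootstrapped, and all function names are hypothetical. The $\pi_0$ estimate follows the form of Storey (2002) with a fixed tuning value of 0.5.

```python
import numpy as np

def empirical_pvalues(stat, null_stat):
    """P-value = (1 + #{null >= stat}) / (1 + #null)."""
    null_sorted = np.sort(null_stat)
    n_ge = null_sorted.size - np.searchsorted(null_sorted, stat, side="left")
    return (n_ge + 1.0) / (null_sorted.size + 1.0)

def storey_pi0(p, lam=0.5):
    """Storey (2002) estimate #{p_i > lam} / (m (1 - lam)), capped at 1."""
    return min(1.0, float((p > lam).mean() / (1.0 - lam)))

def posterior_alternative(stat, null_stat, n_bins=20):
    """1 - pi0 * g0(F) / g(F), with the ratio g0/g estimated in quantile bins."""
    pi0 = storey_pi0(empirical_pvalues(stat, null_stat))
    inner = np.quantile(stat, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    obs_bin = np.searchsorted(inner, stat, side="right")
    null_bin = np.searchsorted(inner, null_stat, side="right")
    g = np.bincount(obs_bin, minlength=n_bins) / stat.size          # observed share
    g0 = np.bincount(null_bin, minlength=n_bins) / null_stat.size   # null share
    ratio = g0[obs_bin] / g[obs_bin]    # g > 0 in every bin that holds an observation
    return np.clip(1.0 - pi0 * ratio, 0.0, 1.0), pi0

# Hypothetical statistics: the first 1000 of 5000 tests are true alternatives.
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
z[:1000] += 3.0
stat = z ** 2
null_stat = rng.normal(size=20000) ** 2     # stand-in for bootstrap null statistics
weights, pi0_hat = posterior_alternative(stat, null_stat)
print(pi0_hat, weights[:1000].mean(), weights[1000:].mean())
```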
Estimating the Probability Weights
Estimate of posterior
probability bi ≠ 0.
SVA-Adjusted Analysis
1. Estimate $G$ with IRW-SVA.
2. Fit the model $X = BS + \Gamma G + U$ using the estimate of $G$.
3. Test the hypotheses $H_{0i}: b_i \in \Omega_0$ vs. $H_{1i}: b_i \in \Omega_1$.
A Simple Simulated Example
[Figure: simulated data, genes by arrays, under Independent E and Dependent E.]
Null Distribution Behavior
[Figure: panels for Independent E, Dependent E, and Dependent E + IRW-SVA.]
False Discovery Rate Estimates
[Figure: Q-value vs. true false discovery rate for Independent E, Dependent E, and Dependent E + IRW-SVA.]
Ranking Estimates
[Figure: average ranking by t-statistic vs. ranking by true signal to noise for Independent E, Dependent E, and Dependent E + IRW-SVA.]
Inflammation and the Host Response to Injury
[Figure: mRNA expression (~50,000 genes) and clinical data (>150 clinical variables) for Patients 1 to 166; MOF1 measures severity of injury.]
Four "Replicated" Studies
[Figure: P-value histograms (Frequency vs. P-value) for Phase 1, Phase 2, Phase 3, and Phase 4, two panels per phase.]
Functional Enrichment Across Phases
[Figure: percent of total significant pathways vs. number of phases in which a significant pathway appears (1 of 4, 2 of 4, 3 of 4, 4 of 4), for Unadjusted and IRW-SVA Adjusted analyses.]
Summary
• High-dimensional hypothesis testing is common.
• Dependence between tests can result in incorrect statistical and scientific inference.
• We can define and address dependence at the level of the model using the dependence kernel.
• IRW-SVA can be used to improve inference in high-dimensional multiple hypothesis testing.
Future Work
• Multiple Testing
  – Develop dependence kernel estimates for spatial data
  – Develop diagnostic tests for multiple testing procedures
• High-Dimensional Asymptotics
  – Extend methods for asymptotic SVD to binary data
• Feature Selection for High-Dimensional Classifiers
  – Extensions of top-scoring pairs (TSP) to survival data
  – Theoretical connections to LDA and SVM
  – Embedding TSP in a logic regression framework
Thank You
Estimating The Row Dimension of G
1. Calculate the residuals $R = X - \hat{B}S$.
2. Calculate the singular values of $R$, $d_1, \ldots, d_n$.
For $k = 1, \ldots, K$ do steps 3-4:
3. Permute each row of $R$ individually to get $R_0$.
4. Take the SVD of the residuals $R^* = R_0 - \hat{B}_0 S$ to obtain null singular values $d_{i0}^{k}$.
5. Compare $d_i$ to $d_{i0}^{k}$ for $k = 1, \ldots, K$ to calculate a P-value for the $i$th right singular vector.
Buja and Eyuboglu (1992)
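The permutation procedure can be sketched as follows. The slide does not say how the comparison in step 5 becomes a P-value, so the formula used here, (1 + number of null values at least as large) / (K + 1), is an assumption, as are the simulated data and the function names.

```python
import numpy as np

def residualize(X, S):
    """Residuals X - B_hat S, with B_hat the least-squares fit on the rows of S."""
    B_hat = np.linalg.lstsq(S.T, X.T, rcond=None)[0].T
    return X - B_hat @ S

def permutation_pvalues(X, S, K=20, seed=0):
    rng = np.random.default_rng(seed)
    R = residualize(X, S)                                   # step 1
    d = np.linalg.svd(R, compute_uv=False)                  # step 2
    exceed = np.zeros(d.shape)
    for _ in range(K):
        # Step 3: shuffle the entries of each row independently of the other rows.
        order = np.argsort(rng.random(R.shape), axis=1)
        R0 = np.take_along_axis(R, order, axis=1)
        # Step 4: re-fit S to the permuted residuals and take null singular values.
        d0 = np.linalg.svd(residualize(R0, S), compute_uv=False)
        exceed += d0 >= d
    # Step 5 (ASSUMED form): permutation P-value for each singular vector.
    return (exceed + 1.0) / (K + 1.0)

rng = np.random.default_rng(1)
m, n, r = 1000, 20, 2
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])
G = rng.normal(size=(r, n))
X = (rng.normal(size=(m, 2)) @ S
     + rng.normal(size=(m, r)) @ G
     + rng.normal(size=(m, n)))
p_sv = permutation_pvalues(X, S)
print(p_sv[:4])
```

The residual matrix has rank at most n minus the number of rows of S, so the P-values for the last few singular vectors compare values that are numerically zero and carry no information.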
Why Does This Work?
Useful Fact:
$X = BS + E = BS + \Gamma G + U = BS + \Lambda H + U$
if $G$ and $H$ have the same row space.
Leek and Storey (2007), Leek and Storey (2008)
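This fact is easy to verify numerically: if H = AG for an invertible matrix A, so that the r x n matrices G and H span the same space, then least-squares fits with G or with H give identical estimates of B and identical residuals. The matrices below are arbitrary illustrative choices.

```python
import numpy as np

def fit(X, S, K):
    """Least-squares fit of each row of X on the rows of S and K; returns (B_hat, residuals)."""
    D = np.vstack([S, K]).T                          # n x (d + r) design
    coef = np.linalg.lstsq(D, X.T, rcond=None)[0]    # (d + r) x m
    return coef[: S.shape[0]].T, (X.T - D @ coef).T

rng = np.random.default_rng(0)
m, n, r = 200, 15, 2
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], [7, 8])])
G = rng.normal(size=(r, n))
A = np.array([[2.0, 1.0], [0.5, 3.0]])   # invertible (determinant 5.5)
H = A @ G                                # same span as G, different basis
X = rng.normal(size=(m, n))

B_g, R_g = fit(X, S, G)
B_h, R_h = fit(X, S, H)
print(np.abs(B_g - B_h).max(), np.abs(R_g - R_h).max())
```

Only the span of the kernel matters for inference on B, so an estimate of G needs to be correct only up to an invertible change of basis.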
References:
Benjamini Y and Hochberg Y. (1995) "Controlling the false discovery rate: a practical and powerful approach to multiple testing." JRSSB, 57: 289-300.
De Castro MC, Monte-Mor RL, Sawyer DO, and Singer BH. (2005) "Malaria risk on the Amazon frontier." PNAS, 103: 2452-2457.
Devlin B and Roeder K. (1999) "Genomic control for association studies." Biometrics, 55: 997-1004.
Efron B. (2004) "Large-scale simultaneous hypothesis testing: The choice of a null hypothesis." JASA, 99: 96-104.
Leek JT and Storey JD. (2008) "A general framework for multiple testing dependence." Proceedings of the National Academy of Sciences, 105: 18718-18723.
Leek JT and Storey JD. (2007) "Capturing heterogeneity in gene expression studies by 'Surrogate Variable Analysis'." PLoS Genetics, 3: e161.
Taylor JE and Worsley KJ. (2007) "Detecting sparse signals in random fields, with applications to brain mapping." JASA, 102: 913-928.
Empirical Null
1. Perform each hypothesis test individually.
2. Obtain the test-statistic for each test.
3. Compare the distribution of test-statistics to the theoretical null distribution.
4. Adjust the theoretical null so that it matches the observed statistics in a low signal region.
[Figure: Theoretical Null and Empirical Null, after Efron (2004).]
Empirical Null Results in Incorrect Null Distribution
[Figure: comparison including the dependence kernel ("Dep. Kernel").]
With Confounding Empirical Null is Ill-Posed
• Observed statistics or observed P-values come from a mixture distribution: $\pi_0 g_0 + \pi_1 g_1$
• Dependence distorts $g_0$ and can go either way.
• Must use the full data set to capture dependence.
