Intro to nonsmooth inference
Eric B. Laber
Department of Statistics, North Carolina State University
April 2019
SAMSI
Last time
SMARTs gold standard for est and eval of txt regimes
Highly configurable but choices driven by science
Looked at examples with varying scientific/clinical goals, which
led to different timing, txt options, response criteria, etc.
Often powered by simple comparisons
First-stage response rates
Fixed regimes (most- vs. least-intensive)
First stage txts (problematic)
If test statistic is regular and asymptotically normal under null
can use same basic template for power
1 / 87
Quick SMART review
[SMART schematic (figure): participants are randomized (R) to PCST-Full (Treatment 0) or PCST-Brief (Treatment 1); response status is then assessed, and responders and non-responders to each first-stage option are re-randomized among second-stage options: PCST-Full maintenance (Treatment 2), no further treatment/intervention (Treatment 3), PCST-Plus (Treatment 4), PCST-Brief maintenance (Treatment 5), or PCST-Full (Treatment 0).]
2 / 87
Refresher
Suppose that researchers are interested in comparing the
embedded regimes:
(e1) assign PCST-Full initially, assign PCST-Full maintenance to
responders, and assign PCST-Plus to non-responders;
(e2) assign PCST-Brief initially, assign no further intervention to
responders, and assign PCST-Brief maintenance to non-responders.
Recall our general template:
Test statistic: Tn = Vn(e1) − Vn(e2), where Vn is the IPWE
Use that √n Tn/σ̂_{e1,e2,n} is asy normal under the null and reject when it is large in
magnitude
3 / 87
Goals for today
Introduction to inference for txt regimes
Nonregular inference (and why we should care)
Basic strategies with a toy problem
Examples in one-stage problems
4 / 87
Warm up part I: quiz!
Discuss with your stat buddy:
What are some common scenarios where series approx or the
bootstrap cannot ensure correct op characteristics?
What is a local alternative?
How do we know if an asymptotic approx is adequate?
True or false
If n is large asymptotic approximations can be trusted.
The top review of CLT on yelp complains about the burritos
being too expensive.
The BBC produced a Hitler-themed sitcom titled ‘Heil Honey,
I’m home’ in the 1950s.
5 / 87
On reality and fantasy
Your cat didn’t say that. You know how I know? It’s a
cat. It doesn’t talk. If you died, it would eat you. Starting
with your face.
– Matt Zabka, recently single
6 / 87
Asymptotic approximations
Basic idea: study behavior of statistical procedure in terms of
dominating features while ignoring lower order ones
Often, but not always, consider diverging sample size
‘Dominating features’ intentionally ambiguous
Generate new insights and general statistical procedures as
large classes of problems share same dominating features
Asymptotics mustn’t be applied mindlessly
Disgusting trend in statistics: propose method, push through
irrelevant asymptotics, handpick simulation experiments
Require careful thought about what op characteristics are
needed scientifically and how to ensure these hold with the
kind of data that are likely to be observed
No panacea ⇒ handcrafted construction and evaluation
7 / 87
Inferential questions in precision medicine
Identify key tailoring variables
Evaluate performance of true optimal regime
Evaluate performance of estimated optimal regime
Compare performance of two+ (possibly data-driven) regimes
. . .
8 / 87
Toy problem: max of means
Simple problem that retains many of the salient features of
inference for txt regimes
Non-smooth function of smooth functionals
Well-studied in the literature
Basic notation
For Z1, . . . , Zn ∼i.i.d. P comprising ind copies of Z ∼ P, write
Pf(Z) = ∫ f(z)dP(z) and ℙnf(Z) = n^{-1} ∑_{i=1}^{n} f(Zi)
Use '⇝' to denote convergence in distribution
Check: assuming requisite moments exist: √n(ℙn − P)Z ⇝ ????
9 / 87
Max of means
Observe X1, . . . , Xn ∼i.i.d. P in R^p with µ0 = PX; define
θ0 = max_{j=1,...,p} µ0,j = max(µ0,1, . . . , µ0,p)
While we consider this estimand primarily for illustration, it
corresponds to the problem of estimating the mean outcome
under an optimal one-size-fits-all treatment recommendation,
where µ0,j is the mean outcome under treatment j = 1, . . . , p.
10 / 87
Max of means: estimation
Define µn = ℙnX; the plug-in estimator of θ0 is
θn = max_{j=1,...,p} µn,j
Warm-up:
Three minutes trying to derive the limiting distn of √n(θn − θ0)
Three minutes discussing soln with your stat buddy
11 / 87
Max of means: first result
For v ∈ R^p define U(v) = arg max_j v_j
Lemma
Assume regularity conditions under which √n(ℙn − P)X is
asymptotically normal with mean zero and variance-covariance
matrix Σ. Then
√n(θn − θ0) ⇝ max_{j∈U(µ0)} Z_j,
where Z ∼ Normal(0, Σ).
12 / 87
Max of means: proof of first result
13 / 87
Extra page if needed
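One possible sketch of the argument, written out here for reference (an assumption on my part; it need not match the in-class derivation):

\begin{align*}
\sqrt{n}(\theta_n - \theta_0)
  &= \max_{j=1,\dots,p} \sqrt{n}\bigl\{(\mu_{n,j}-\mu_{0,j}) + (\mu_{0,j}-\theta_0)\bigr\}.
\end{align*}
% For j not in U(mu_0), mu_{0,j} - theta_0 < 0 is fixed, so sqrt(n)(mu_{0,j} - theta_0) -> -infinity
% while sqrt(n)(mu_{n,j} - mu_{0,j}) = O_P(1); these indices drop out of the max with
% probability tending to one. For j in U(mu_0), mu_{0,j} = theta_0, leaving
\begin{align*}
\sqrt{n}(\theta_n - \theta_0)
  &= \max_{j\in U(\mu_0)} \sqrt{n}(\mu_{n,j}-\mu_{0,j}) + o_P(1)
   \;\rightsquigarrow\; \max_{j\in U(\mu_0)} Z_j,
\end{align*}
% by asymptotic normality of sqrt(n)(P_n - P)X and the continuous mapping theorem.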
Max of means: discussion of first result
Limiting distribution of √n(θn − θ0) depends abruptly on µ0
If µ0 = (0, 0)^T and Σ = I_2, the limiting distn is the max
of two ind std normals
If µ0 = (0, ε)^T for ε > 0 and Σ = I_2, the limiting distn is std
normal even if ε = 1×10^{−27}!!
How can we use such an asymptotic result in practice?!
Limiting distn of √n(θn − θ0) depends only on the submatrix of Σ
cor. to elements of U(µ0). What about in finite samples?
14 / 87
Max of means: discussion of first result cont’d
Suppose X1, . . . , Xn ∼i.i.d. Normal(µ0, I_p) and µ0 has a
unique maximizer, i.e., U(µ0) is a singleton, say U(µ0) = {1}
√n(θn − θ0) ⇝ Normal(0, 1)
P{√n(θn − θ0) ≤ t} = Φ(t) ∏_{j=2}^{p} Φ(t + √n(θ0 − µ0,j))
Quick break: derive this.
If the gaps θ0 − µ0,j are small relative to 1/√n, the finite sample
behavior can be quite different from the limit Φ(t)1
1
Note that the limiting distribution doesn’t depend on these gaps at all!
15 / 87
Max of means: normal approximation in pictures
Generate data from Normal(µ, I6) with µ1 = 2 and
µj = µ1 − δ for j = 2, . . . , 6. Results shown for n = 100.
[Figure: estimated densities of √n(θn − θ0) for δ = 0.5, δ = 0.1, and δ = 0.01 (x-axis from −4 to 4).]
16 / 87
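A short simulation sketch of the setting above (numpy assumed; the printed summary is only illustrative), comparing the sampling distribution of √n(θn − θ0) with its nominal Normal(0, 1) limit:

import numpy as np

rng = np.random.default_rng(0)
n, p, n_sims = 100, 6, 5000

for delta in (0.5, 0.1, 0.01):
    mu = np.full(p, 2.0 - delta); mu[0] = 2.0        # mu_1 = 2 is the unique maximizer
    theta0 = mu.max()
    X = rng.normal(mu, 1.0, size=(n_sims, n, p))     # n_sims data sets of size n
    theta_n = X.mean(axis=1).max(axis=1)             # plug-in estimator for each data set
    root_n_err = np.sqrt(n) * (theta_n - theta0)
    # Under the Normal(0,1) limit these should be near 0 and 1; they drift as delta shrinks.
    print(f"delta={delta:5.2f}  mean={root_n_err.mean():5.2f}  sd={root_n_err.std():4.2f}")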
Choosing the right asymptotic framework
Dangerous pattern of thinking:
In practice, none of the txt effect differences are zero.
I’ll build my asy approximations assuming a unique maximizer.
There are finitely many components, so the maximizer is
well-separated.
Idea! Plug in the estimated maximizer and use the asy normal approx.
The preceding pattern happens frequently, e.g., the oracle property in
model selection, max eigenvalues of matrices, and txt regimes
17 / 87
Choosing the right asymptotic framework
What goes wrong? After all, this thinking works well in many
other settings, e.g., everything you learned in stat 101.
Finite sample behavior driven by small (not necessarily zero)
differences in txt effectiveness
We saw this analytically in normal case
Intuition helped by thinking in extremes, e.g., what if all txts
were equal? What if one were infinitely better than the others?
Abrupt dependence of the limiting distribution on U(µ0) is a
red flag. It is tempting to construct procedures that will recover
this limiting distn even if some txt differences are exactly zero.
This is asymptotics for asymptotics' sake. Don't do it.
18 / 87
Asymptotic working assumptions
A useful asy approximation should be robust to the setting
where some (all) txt differences are zero
Necessary but not sufficient
Heuristic: in small samples, one cannot distinguish between
small (but nonzero) txt differences so use an asy framework
which allows for exact equality.
This heuristic has been misinterpreted and misused in lit.
Some procedures we’ll look at are designed for such robustness
19 / 87
Local asymptotics: horseshoes and hand grenades
Allowing null txt differences problematic
Asymptotically, differences either zero or infinite2
Txt differences are (probably) not exactly zero
Challenge: allow small differences to persist as n diverges
Local or moving parameter asy framework does this
Idea: allow the gen model to change with n so that the gaps
θ0 − max_{j∉U(µ0)} µ0,j shrink to zero as n increases3
2
In the stat sense that we have power one to discriminate between them.
3
This idea should be familiar from hypothesis testing.
20 / 87
Triangular arrays
For each n, X1,n, . . . , Xn,n ∼i.i.d. Pn
Observations Distribution
X1,1 P1
X1,2 X2,2 P2
X1,3 X2,3 X3,3 P3
X1,4 X2,4 X3,4 X4,4 P4
... ... ...
Define µ0,n = PnX and θ0,n = max_{j=1,...,p} µ0,n,j
Assume µ0,n = µ0 + s/√n, where s ∈ R^p is called the local parameter
Assume √n(ℙn − Pn)X ⇝ Normal(0, Σ)4
4
This is true under very mild conditions on the sequence of distributions
{Pn}n≥1. However, given our limited time we will not discuss such conditions.
See van der Vaart and Wellner (1996) for details.
21 / 87
Quick quiz
Suppose that X1,n, . . . , Xn,n ∼i.i.d. Normal(µ0 + s/√n, Σ);
what is the distribution of √n(ℙn − Pn)X?
22 / 87
Local alternatives anticipate unstable performance
Lemma
Let s ∈ R^p be fixed. Assume that for each n we observe {Xi,n}_{i=1}^{n}
drawn i.i.d. from Pn which satisfies: (i) PnX = µ0 + s/√n, and
(ii) √n(ℙn − Pn)X ⇝ Normal(0, Σ). Then, under Pn,
√n(θn − θ0,n) ⇝ max_{j∈U(µ0)} (Z_j + s_j) − max_{j∈U(µ0)} s_j,
where Z ∼ Normal(0, Σ).
Discussion/observations on the local limiting distn
Dependence of the limiting distn on s ⇒ nonregular
The set U(µ0) represents a set of near-maximizers, though s_j = 0
corresponds to exact equality (so we haven't ruled this out)
23 / 87
Proof of local limiting distribution
24 / 87
Extra page if needed
Intermission: more regularity after this
Comments on nonregularity
Sensitivity of estimator to local alternatives cannot be
rectified through the choice of a more clever estimator
Inherent property of the estimand5
This has not stopped some from trying...
Remainder of today’s notes: cataloging of confidence intervals
5
See van der Vaart (1991), Hirano and Porter (2012), and L. et al. (2011,
2014, 2019)
25 / 87
Projection region
Idea: exploit the following two facts
µn is nicely behaved (reg. asy normal)
If µ0 were known this would be trivial6
Given an acceptable error level α ∈ (0, 1), let ζn,1−α denote a
confidence region for µ0, e.g.,
ζn,1−α = {µ ∈ R^p : n(µn − µ)^T Σn^{−1} (µn − µ) ≤ χ²_{p,1−α}},
where Σn = ℙn(X − µn)(X − µn)^T
Projection CI:
Γn,1−α = {θ ∈ R : θ = max_{j=1,...,p} µ_j for some µ ∈ ζn,1−α}
6
In this problem, θ0 is a function of µ0 and is thus completely known when
µ0 is known. In more complicated problems, knowing the value of a nuisance
parameter will make the inference problem of interest regular.
26 / 87
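A numerical sketch of this projection interval for the max-of-means problem (numpy/scipy assumed; `projection_ci` is an illustrative helper, not a library routine, and the lower endpoint is found with a generic solver rather than in closed form):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def projection_ci(X, alpha=0.05):
    # Projection CI for theta0 = max_j mu_{0,j}.
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)
    Sig_inv = np.linalg.inv(Sigma_hat)
    c = chi2.ppf(1 - alpha, df=p)
    in_ellipsoid = lambda mu: c - n * (mu_hat - mu) @ Sig_inv @ (mu_hat - mu)

    # Upper endpoint: sup over the ellipsoid of max_j mu_j has a closed form.
    upper = np.max(mu_hat + np.sqrt(c * np.diag(Sigma_hat) / n))

    # Lower endpoint: minimize t subject to mu_j <= t for all j and mu in the ellipsoid.
    z0 = np.append(mu_hat, mu_hat.max())
    cons = [{"type": "ineq", "fun": lambda z: z[-1] - z[:-1]},
            {"type": "ineq", "fun": lambda z: in_ellipsoid(z[:-1])}]
    res = minimize(lambda z: z[-1], z0, method="SLSQP", constraints=cons)
    lower = res.x[-1]
    return lower, upper

X = np.random.default_rng(1).normal([0.0, 0.1, 0.1], 1.0, size=(200, 3))
print(projection_ci(X))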
Prove the following with your stat buddy
P(θ0 ∈ Γn,1−α) ≥ 1 − α + o(1)
27 / 87
Comments on projection regions
Useful when parameter of interest is a non-smooth functional
of a smooth (regular) parameter
Robust and widely applicable but conservative
Projection interval valid under local alternatives (why?)
Can reduce conservatism using pre-test (L. et al., 2014)
Berger and Boos (1991) and Robins (2004) for seminal papers
Consider as a first option in new non-reg problem
28 / 87
Bound-based confidence intervals
Idea: sandwich the non-smooth functional between smooth upper
and lower bounds, then bootstrap the bounds to form a conf region
Let {τn}n≥1 be a seq of pos constants such that τn → ∞ and
τn = o(√n) as n → ∞; define
Un(µ0) = {j : max_k √n(µn,k − µn,j)/σ_{j,k,n} ≤ τn},
where σ_{j,k,n} is an est of the asy variance of µn,k − µn,j.
Note* It may help to think of Un as the indices of txts that
we cannot distinguish from being optimal.
29 / 87
Bound-based confidence intervals cont’d
Given Un(µ0), define
Sn(µ0) = {s ∈ R^p : s_j = µ0,j if j ∈ Un(µ0)};
then it follows that
Un = sup_{s∈Sn(µ0)} √n { max_{j=1,...,p}(µn,j − µ0,j + s_j) − max_{j=1,...,p} s_j }
is an upper bound on √n(θn − θ0). (Why?) A lower bound,
Ln, is constructed by replacing the sup with an inf.
30 / 87
Dad, where do bounds come from?
Un obtained by taking a sup over all local, i.e., order 1/√n,
perturbations of the generative model
By construction, insensitive to local perturbations ⇒ regular
Un(µ0) is a conservative est of U(µ0); let's wave our hands:
31 / 87
Bootstrapping the bounds
Both Un and Ln are regular and their distns can be consistently
estimated via the nonpar bootstrap
Let u^{(b)}_{n,1−α/2} be the (1 − α/2) × 100 perc of the bootstrap distn of Un
and ℓ^{(b)}_{n,α/2} the (α/2) × 100 perc of the bootstrap distn of Ln
Bound-based confidence interval:
[ θn − u^{(b)}_{n,1−α/2}/√n,  θn − ℓ^{(b)}_{n,α/2}/√n ]
32 / 87
Bound-based intervals discussion
General approach; applies to implicitly defined estimators as well
as those with closed-form expressions like the one considered here
Less conservative than the projection interval but still
conservative; such conservatism is unavoidable
Bounds are tightest in some sense
Bounding quantiles directly rather than estimand may reduce
conservatism though possibly at price of addl complexity
See Fan et al. (2017) for other improvements/refinements
33 / 87
Bootstrap methods
Bootstrap is not consistent without modification
Due to instability (nonregularity)
Nondifferentiability of max operator causes this instability (see
Shao 1994 for a nice review)
Bootstrap is appealing for complex problems
Doesn’t require explicitly computing asy approximations7
Higher order convergence properties
7
There are exceptions to this, including parametric bootstrap and those
based on quadratic expansions
34 / 87
How about some witchcraft?
m-out-of-n bootstrap can be used to create valid confidence
intervals for non-smooth functionals
Idea: resample datasets of size mn = o(n) so sample-level
parameters converge ‘faster’ than bootstrap analogs
(i.e., witchcraft)
35 / 87
m-out-of-n bootstrap
Accepting some components of witchcraft on faith
√mn (µ^{(b)}_{mn} − µn) ⇝ Normal(0, Σ) conditional on the data
See Arcones and Gine (1989) for details
An even toyier example than our toy example: W1, . . . , Wn
i.i.d. w/ mean µ and variance σ²; derive the limit distns of
√n(|W̄n| − |µ|) and √mn(|W̄^{(b)}_{mn}| − |W̄n|)
36 / 87
m-out-of-n bootstrap with max of means
Derive the limiting distribution of √mn(θ^{(b)}_{mn} − θn):
37 / 87
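A minimal m-out-of-n bootstrap sketch for the max-of-means problem (numpy assumed; the resample-size rule m = ⌊n^0.8⌋ is an arbitrary illustrative choice, and the interval follows the /√mn convention used later in these slides):

import numpy as np

def m_out_of_n_ci(X, alpha=0.05, n_boot=2000, rng=None):
    # Percentile-type m-out-of-n bootstrap CI for theta0 = max_j mu_{0,j}.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = int(np.floor(n ** 0.8))                      # resample size m_n = o(n)
    theta_n = X.mean(axis=0).max()
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=m)             # sample m rows with replacement
        theta_b = X[idx].mean(axis=0).max()
        draws[b] = np.sqrt(m) * (theta_b - theta_n)  # bootstrap analog of sqrt(n)(theta_n - theta_0)
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return theta_n - hi / np.sqrt(m), theta_n - lo / np.sqrt(m)

X = np.random.default_rng(2).normal([0.0, 0.0, 0.3], 1.0, size=(250, 3))
print(m_out_of_n_ci(X))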
Extra page if needed
Intermission
I wouldn’t want to wind up hooked to a bunch of wires and
tubes, unless somehow the wires and tubes were keeping
me alive. —Don Alden Adams
Finally! Back to treatment regimes (briefly)
Consider a one-stage problem with observed data
{(Xi, Ai, Yi)}_{i=1}^{n}, where X ∈ R^p, A ∈ {−1, 1}, and Y ∈ R
Assume requisite causal conditions hold
Assume linear rules π(x) = sign(x^T β), where β ∈ R^p, and x
might contain polynomial terms etc.
38 / 87
Warm-up! Derive limiting distn of parameters in
linear Q-learning!!!!8
Posit the linear model Q(x, a; β) = x_0^T β_0 + a x_1^T β_1, indexed by
β = (β_0^T, β_1^T)^T, with x_0, x_1 known features.
βn = arg min_β ℙn{Y − Q(X, A; β)}²  and  β∗ = arg min_β P{Y − Q(X, A; β)}²
Derive the limiting distribution of √n(βn − β∗)
Construct a confidence interval for Q(x, a) assuming
Q(x, a) = Q(x, a; β∗)
8
He exclaimed.
39 / 87
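A sketch of the warm-up in code (numpy/scipy assumed; function names are illustrative, not from any package): fit the linear Q-function by OLS and form a Wald interval for Q(x, a; β∗) using a sandwich variance estimate.

import numpy as np
from scipy.stats import norm

def fit_linear_q(X0, X1, A, Y):
    # Design phi(x, a) = (x0, a * x1); OLS estimate of beta and its sandwich covariance.
    Phi = np.hstack([X0, A[:, None] * X1])
    n = len(Y)
    beta = np.linalg.lstsq(Phi, Y, rcond=None)[0]
    resid = Y - Phi @ beta
    bread = np.linalg.inv(Phi.T @ Phi / n)
    meat = (Phi * resid[:, None] ** 2).T @ Phi / n
    cov_beta = bread @ meat @ bread / n              # estimated Cov(beta_n)
    return beta, cov_beta

def q_ci(x0, x1, a, beta, cov_beta, alpha=0.05):
    # Wald interval for Q(x, a; beta*) = phi(x, a)' beta*.
    phi = np.concatenate([x0, a * x1])
    est = phi @ beta
    se = np.sqrt(phi @ cov_beta @ phi)
    z = norm.ppf(1 - alpha / 2)
    return est - z * se, est + z * se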
Extra page if needed
Extra page if needed
Parameters in (1-stage) Q-learning are easy!
Similar arguments show that the coefficients indexing
g-computation and outcome weighted learning are asy normal
Preview: consider the (regression-based) estimator of the
value of πn(x) = sign(x_1^T β_{1,n}), which you'll recall is
Vn(β_{1,n}) = ℙn max_a Q(X, a; βn) = ℙn X_0^T β_{0,n} + ℙn |X_1^T β_{1,n}|
What is the limit of √n{Vn(β_{1,n}) − V(β_{1,n})} and can we use it
to derive a CI for V(βn)? What about a CI for V(β∗)?
40 / 87
Parameters in (1-stage) OWL are easy!
To illustrate, assume P(A = 1|X) = P(A = −1|X) = 1/2 wp1
Recall OWL is based on a cvx relaxation of the IPWE
Vn(β) = ℙn [ Y 1{A = sign(X^T β)} / P(A|X) ] = 2 ℙn Y 1{A sign(X^T β) > 0}
Let ℓ : R → R be cvx; the OWL estimator is
βn = arg min_{β∈R^p} ℙn |Y| ℓ(W^T β),
where W = sign(Y) A X
41 / 87
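A hedged sketch of a convex-surrogate OWL fit in this spirit, using a logistic surrogate ℓ(u) = log(1 + e^{−u}) (one convex choice among many; scipy assumed, 1:1 randomization as on the slide, and the data-generating lines are purely illustrative):

import numpy as np
from scipy.optimize import minimize

def owl_logistic(X, A, Y):
    # Minimize P_n |Y| * ell(W' beta) with W = sign(Y) * A * X and a logistic surrogate.
    W = (np.sign(Y) * A)[:, None] * X
    w_abs = np.abs(Y)

    def objective(beta):
        u = W @ beta
        return np.mean(w_abs * np.logaddexp(0.0, -u))   # ell(u) = log(1 + exp(-u))

    res = minimize(objective, x0=np.zeros(X.shape[1]), method="BFGS")
    return res.x

rng = np.random.default_rng(3)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])  # include an intercept column
A = rng.choice([-1, 1], size=n)
Y = 1.0 + 0.8 * A * X[:, 1] + rng.normal(size=n)           # txt effect depends on X[:, 1]
beta_owl = owl_logistic(X, A, Y)
print(beta_owl, "decision rule: sign(x' beta)")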
Extra page if needed
Some facts about OWL (and more generally
convex M-estimators)
The map β ↦ |y| ℓ(w^T β) is the composition of a linear and a cvx function and
is thus cvx in β for each (y, w) ⇒ greatly simplifies inference!
Regularity conditions
β∗ = arg min_β P|Y| ℓ(W^T β) exists and is unique
The map β ↦ P|Y| ℓ(W^T β) is differentiable in a nbrhd of β∗9
Under these conditions, √n(βn − β∗) is regular10 and
asymptotically normal ⇒ many results from Q-learning port
9
More formally, require
|y| ℓ{w^T(β∗ + δ)} − |y| ℓ(w^T β∗) = S(y, w; β∗)^T δ + R(y, w, δ; β∗) where
PS(Y, W; β∗) = 0, Σ_O = PS(Y, W; β∗)S(Y, W; β∗)^T is finite, and
PR(Y, W, δ; β∗) = (1/2)δ^T Ω_O δ + o(||δ||²) (Haberman, 1989; Niemiro, 1992;
Hjort and Pollard, 2011).
10
I am being a bit loose with language in this course by referring to both
estimands and rescaled estimators as ‘regular’ or ‘non-regular.’
42 / 87
Value function(s)
Three ways to measure performance
Conditional value: V(πn) = PY*(πn) = E{Y*(πn) | πn},
measures the performance of an estimated decision rule as if it
were to be deployed in the popn (note* this is a random variable)
Unconditional value: Vn = E V(πn), measures the average
performance of the algorithm used to construct πn with a sample
of size n
Population-level value: V(π∗), where π∗(x) = sign(x^T β∗),
measures the potential of applying a precision medicine strategy
in a given domain if the algorithm for constructing πn will be used
Discuss these measures with your stat buddy. Is there a
meaningful distinction as the sample size grows large?
43 / 87
It’s a wacky world out there
The three value measures need not coincide asymptotically
Let πn(x) = sign(x^T βn) and suppose √n βn ⇝ Normal(0, Σ),
so that β∗ ≡ 0 and π∗(x) ≡ −1. With your stat buddy, compute:
V(βn) ⇝ ????
Vn = E V(βn) → ????
V(β∗) = ????
44 / 87
Calculon!
Extra page if needed
Have some confidence you useless pile!
We’ll construct confidence sets for V(βn) and V(β∗) as these
are most commonly of interest in application
Starting with the conditional value fn, assume that the
data-generating model is a triangular array Pn such that:
(A0) πn(x) = sign(x^T βn)
(A1) ∃ β∗_n s.t. β∗_n = β∗ + s/√n for some s ∈ R^p and
√n(βn − β∗_n) = √n(ℙn − Pn)u(X, A, Y) + o_{Pn}(1), where u
does not depend on s, sup_n Pn||u(X, A, Y)||² < ∞, and
Cov{u(X, A, Y)} is p.d.
(A2) If F is a uniformly bounded Donsker class and
√n(ℙn − P) ⇝ T in ℓ^∞(F) under P, then
√n(ℙn − Pn) ⇝ T in ℓ^∞(F) under Pn.
(A3) sup_n Pn||Y||² < ∞.
Detailed discussion is beyond the scope of this class. Laber
will wave his hands a bit. Our goal is to understand the key ideas.
45 / 87
Building block: joint distribution before
nonsmooth operator
Define the class of functions
G = { g(X, A, Y; δ) = Y 1{A X^T δ > 0} 1{X^T β∗ = 0} : δ ∈ R^p }
and view √n(ℙn − Pn), indexed by G, as a random element of ℓ^∞(R^p).
Lemma
Assume (A0)-(A3). Then
√n ( ℙn − Pn,  βn − β∗_n,  (ℙn − Pn)Y 1{A X^T β∗ > 0} ) ⇝ ( T, Z, W )
in ℓ^∞(R^p) × R^p × R under Pn.
46 / 87
Limiting distn of V (βn)
Corollary
Assume (A0)-(A3). Then,
√n{Vn(βn) − V(βn)} ⇝ T(Z + s) + W.
Notes
Presence of s shows this is nonregular
T is a Brownian bridge indexed by R^p
W and Z are normal
47 / 87
Hand-waving!
Extra page if needed
Bound-based confidence interval
Limiting distribution: T(Z + s) + W
The local parameter only appears in the first term
(Asy) bound should only affect this term
Schematic for constructing a bound:
Partition the input space into points that are ‘near’ the decision
boundary x^T β∗ = 0 vs. those that are ‘far’ from the boundary
Take a sup/inf over local perturbations of the points in the ‘near’ group
48 / 87
Upper bound
Let Σn be an estimator of the asy var of βn; an upper bound on
√n{Vn(βn) − V(βn)} is
Un = sup_{ω∈R^p} √n(ℙn − Pn) Y 1{A X^T ω > 0} 1{ n(X^T βn)² / (X^T Σn X) ≤ τn }
   + √n(ℙn − Pn) Y 1{A X^T βn > 0} 1{ n(X^T βn)² / (X^T Σn X) > τn },
where τn is a seq of tuning parameters s.t. τn → ∞ and
τn = o(n) as n → ∞. The lower bound is constructed by replacing the
sup with an inf.
49 / 87
Limiting distribution of bounds
Theorem
Assume (A0)-(A3). Then
(Ln, Un) ⇝ ( inf_{ω∈R^p} T(ω) + W,  sup_{ω∈R^p} T(ω) + W )
under Pn.
Recall the limit distn of √n{Vn(βn) − V(βn)} is T(Z + s) + W
Bounds are equiv to a sup/inf over local perturbations
If all subjects have large txt effects, the bounds are tight
Bootstrap the bounds to construct a confidence interval; theoretical
results for the bootstrap bounds are given in the book
50 / 87
Note on tuning
Seq {τn}n≥1 can affect finite sample performance
Idea: tune using double bootstrap, i.e., bootstrap the
bootstrap samples to estimate coverage and adapt τn
Double bootstrap considered computationally expensive but
not much of a burden in most problems with modern
computing infrastructure
Tuning can be done without affecting theoretical results
51 / 87
Algy the friendly tuning algorithm
Alg. 1.5: Tuning the critical value τn using the double bootstrap
Input: {(Xi, Ai, Yi)}_{i=1}^{n}, M, α ∈ (0, 1), candidate values τn^{(1)}, . . . , τn^{(L)}
1  V = Vn(dn)
2  for j = 1, . . . , L do
3      c(j) = 0
4      for b = 1, . . . , M do
5          Draw a sample of size n, say Sn^{(b)}, from {(Xi, Ai, Yi)}_{i=1}^{n} with replacement
6          Compute the bound-based confidence set, ζ^{(b)}, using sample Sn^{(b)} and critical value τn^{(j)}
7          if V ∈ ζ^{(b)} then c(j) = c(j) + 1
8      end
9  end
10 Set j∗ = arg min_{j : c(j) ≥ M(1−α)} c(j)
Output: Return τn^{(j∗)}
52 / 87
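A generic Python sketch of Alg. 1.5 (numpy assumed). Here `estimate_fn` and `ci_fn` are user-supplied placeholders, not library functions: `ci_fn(data, tau)` should return the bound-based confidence set computed on `data` with critical value `tau`.

import numpy as np

def tune_tau_double_bootstrap(data, estimate_fn, ci_fn, taus, M=200, alpha=0.05, rng=None):
    # data: array of rows (x_i, a_i, y_i); estimate_fn(data) -> point estimate V.
    rng = np.random.default_rng(rng)
    n = len(data)
    V_hat = estimate_fn(data)
    coverage = []
    for tau in taus:
        c = 0
        for _ in range(M):
            boot = data[rng.integers(0, n, size=n)]   # first-level bootstrap sample
            lo, hi = ci_fn(boot, tau)                 # CI built on the bootstrap sample
            c += (lo <= V_hat <= hi)                  # does it cover the full-sample estimate?
        coverage.append(c)
    # Smallest estimated coverage among taus meeting the nominal level (Alg. 1.5, line 10);
    # fall back to the best-covering tau if none meets it.
    ok = [j for j, c in enumerate(coverage) if c >= M * (1 - alpha)]
    j_star = min(ok, key=lambda j: coverage[j]) if ok else int(np.argmax(coverage))
    return taus[j_star]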
Intermission
If any man says he hates war more than I do, he better
have a knife, that’s all I have to say. –Gandhi
53 / 87
m-out-of-n bootstrap
Bound-based intervals are complex (conceptually and technically)
Subsampling is easier to implement and understand11
Does not require specialized code etc.
Let mn be the resample size s.t. mn → ∞ and mn = o(n), and
let ℙ^{(b)}_{mn} be the bootstrap empirical distn
Approximate √n{Vn(βn) − V(βn)} with its bootstrap analog
√mn{V^{(b)}_{mn}(β^{(b)}_{mn}) − Vn(βn)} (Laber might draw a picture)
Let ℓ_{mn} and u_{mn} be the (α/2) × 100 and (1 − α/2) × 100
percentiles of √mn{V^{(b)}_{mn}(β^{(b)}_{mn}) − Vn(βn)}; the CI is given by
[ Vn(βn) − u_{mn}/√mn,  Vn(βn) − ℓ_{mn}/√mn ]
11
Though the theory underpinning subsampling can be non-trivial so
‘understand’ here is meant more mechanically.
54 / 87
Emmy the subsampling bootstrap algo
Alg. 1.6: m-out-of-n bootstrap confidence set for the conditional value
Input: mn, {(Xi, Ai, Yi)}_{i=1}^{n}, M, α ∈ (0, 1)
1  for b = 1, . . . , M do
2      Draw a sample of size mn, say S^{(b)}_{mn}, from {(Xi, Ai, Yi)}_{i=1}^{n} with replacement
3      Compute β^{(b)}_{mn} on S^{(b)}_{mn}
4      Δ^{(b)}_{mn} = √mn [ mn^{-1} ∑_{i∈S^{(b)}_{mn}} Yi 1{Ai Xi^T β^{(b)}_{mn} > 0} − n^{-1} ∑_{k=1}^{n} Yk 1{Ak Xk^T β^{(b)}_{mn} > 0} ]
5  end
6  Relabel so that Δ^{(1)}_{mn} ≤ Δ^{(2)}_{mn} ≤ · · · ≤ Δ^{(M)}_{mn}
7  ℓ_{mn} = Δ^{(⌈Mα/2⌉)}_{mn}
8  u_{mn} = Δ^{(⌈M(1−α/2)⌉)}_{mn}
Output: [ Vn(dn) − u_{mn}/√mn,  Vn(dn) − ℓ_{mn}/√mn ]
55 / 87
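A Python sketch of Alg. 1.6 (numpy assumed). `fit_rule` is a placeholder for whatever estimator of the linear rule is used (e.g., the Q-learning or OWL fits sketched earlier), and the plug-in value mirrors line 4 of the algorithm (up to a known constant from 1:1 randomization):

import numpy as np

def mn_bootstrap_value_ci(X, A, Y, fit_rule, m, M=1000, alpha=0.05, rng=None):
    # fit_rule(X, A, Y) -> beta, the estimated coefficients of the linear rule.
    rng = np.random.default_rng(rng)
    n = len(Y)
    beta_n = fit_rule(X, A, Y)
    value = lambda beta, Xs, As, Ys: np.mean(Ys * (As * (Xs @ beta) > 0))  # plug-in value
    V_n = value(beta_n, X, A, Y)
    deltas = np.empty(M)
    for b in range(M):
        idx = rng.integers(0, n, size=m)              # resample of size m_n
        beta_b = fit_rule(X[idx], A[idx], Y[idx])
        deltas[b] = np.sqrt(m) * (value(beta_b, X[idx], A[idx], Y[idx])
                                  - value(beta_b, X, A, Y))
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return V_n - hi / np.sqrt(m), V_n - lo / np.sqrt(m)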
m-out-of-n cont’d
Provides valid confidence intervals under (A0)-(A3)
Proof omitted (tedious)
Can tune mn using double bootstrap
Reliance on asymptotic tomfoolery makes me hesitant to use
this in practice12
12
I did not always think this way, see Chakraborty, L., and Zhao (2014ab).
Also, I should not be so dismissive of these methods. Some of the work in this
area has been quite deep and produced general uniformly convergent methods.
See work by Romano and colleagues.
56 / 87
Confidence interval for opt regime within a class
Let π∗ denote the optimal txt regime within a given class; our
goal is to construct a CI for V(π∗)
Were π∗ known, one could use √n{Vn(π∗) − V(π∗)}; with
your stat buddy, compute this limiting distn
Suppose that we could construct a valid confidence region for
π∗; suggest a method for a CI for V(π∗)
57 / 87
Projection interval
For any fixed π, let ζn,1−ν(π) be a (1 − ν) × 100% confidence
set for V(π), e.g., using the asymptotic approx on the previous slide
Let Dn,1−η denote a (1 − η) × 100% confidence set for π∗;
then a (1 − η − ν) × 100% confidence region for V(π∗) is
⋃_{π∈Dn,1−η} ζn,1−ν(π)
Why?
58 / 87
Ex. projection interval for linear regime
Consider regimes of the form π(x; β) = sign(x^T β); then
√n{Vn(β) − V(β)} = √n(ℙn − P)[ Y 1{A X^T β > 0} / P(A|X) ] ⇝ Normal(0, σ²(β)),
take
ζn,1−ν(β) = [ Vn(β) − z_{1−ν/2} σn(β)/√n,  Vn(β) + z_{1−ν/2} σn(β)/√n ]
If √n(βn − β∗) ⇝ Normal(0, Σ), take
Dn,1−η = {β : n(βn − β)^T Σn^{−1} (βn − β) ≤ χ²_{p,1−η}}
Projection interval:
⋃_{β∈Dn,1−η} ζn,1−ν(β)
59 / 87
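A sketch of ζn,1−ν(β) for a fixed rule (numpy/scipy assumed; `value_ci_fixed_rule` is an illustrative helper, and `p_a` is the known randomization probability, taken as 1/2 in the trials discussed here):

import numpy as np
from scipy.stats import norm

def value_ci_fixed_rule(X, A, Y, beta, nu=0.05, p_a=0.5):
    # IPWE of V(beta) for the fixed rule pi(x) = sign(x' beta) and its Wald CI,
    # with A coded in {-1, 1} and known P(A = a | X) = p_a.
    agree = (A * (X @ beta) > 0).astype(float)   # indicator A = sign(x' beta)
    terms = Y * agree / p_a
    v_hat = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(Y))
    z = norm.ppf(1 - nu / 2)
    return v_hat - z * se, v_hat + z * se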
Quiz break!
What does the ‘Q’ in Q-learning stand for?
In txt regimes, which of the following is not yet a thing:
A-learning, B-learning, C-Learning, D-learning, E-learning?
Write down the two-stage Q-learning algorithm assuming
binary treatments and linear models at each stage
True or false:
I would rather have a zombie ice dragon than two live fire
dragons.
The story of the Easter bunny is based on the little known
story of Jesus swapping the internal organs of chickens and
rabbits to prevent a widespread famine
Q-learning has been used to obtain state-of-the-art
performance in game-playing domains like chess, backgammon,
and atari
60 / 87
Inference for two-stage linear Q-learning
Learning objectives
Identify source of nonregularity
Understand implications on coverage and asy bias
Intuition behind bounds
Hopefully this will be trivial for you now!
61 / 87
Reminder: setup and notation
Observe {(X_{1,i}, A_{1,i}, X_{2,i}, A_{2,i}, Y_i)}_{i=1}^{n}, i.i.d. from P
X1 ∈ R^{p1} : baseline subj. info.
A1 ∈ {0, 1} : first treatment
X2 ∈ R^{p2} : interim subj. info. collected during course of A1
A2 ∈ {0, 1} : second treatment
Y ∈ R : outcome, higher is better
Define the histories H1 = X1 and H2 = (X1, A1, X2)
A DTR is π = (π1, π2) where πt : supp Ht → supp At; a
patient presenting with Ht = ht is assigned treatment πt(ht)
62 / 87
Characterizing optimal DTR
Optimal regime maximizes the value E Y*(π)
Define the Q-functions
Q2(h2, a2) = E[ Y | H2 = h2, A2 = a2 ]
Q1(h1, a1) = E[ max_{a2} Q2(H2, a2) | H1 = h1, A1 = a1 ]
Dynamic programming (Bellman, 1957)
π^{opt}_t(ht) = arg max_{at} Qt(ht, at)
63 / 87
Q-learning
Regression-based dynamic programming algorithm
(Q0) Postulate working models for the Q-functions:
Qt(ht, at; βt) = h_{t,0}^T β_{t,0} + a_t h_{t,1}^T β_{t,1}, with h_{t,0}, h_{t,1} features of ht
(Q1) Compute β2 = arg min_{β2} ℙn{Y − Q2(H2, A2; β2)}²
(Q2) Compute
β1 = arg min_{β1} ℙn{ max_{a2} Q2(H2, a2; β2) − Q1(H1, A1; β1) }²
(Q3) πt(ht) = arg max_{at} Qt(ht, at; βt)
Population parameters β∗_t obtained by replacing ℙn with P
Inference for β∗_2 is standard, just OLS
Focus on confidence intervals for c^T β∗_1 for fixed c ∈ R^{dim β∗_1}
64 / 87
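A sketch of this two-stage algorithm with binary treatments coded 0/1 and linear working models (numpy assumed; the function and argument names are illustrative); compare with the one-stage fit sketched earlier:

import numpy as np

def two_stage_q_learning(H10, H11, A1, H20, H21, A2, Y):
    # Stage-2 regression: Y ~ (h20, a2 * h21).
    Phi2 = np.hstack([H20, A2[:, None] * H21])
    beta2 = np.linalg.lstsq(Phi2, Y, rcond=None)[0]
    b20, b21 = beta2[:H20.shape[1]], beta2[H20.shape[1]:]
    # Pseudo-outcome: predicted outcome under the best second-stage treatment,
    # max_{a2 in {0,1}} Q2 = h20' b20 + [h21' b21]_+ .
    Ytilde = H20 @ b20 + np.maximum(H21 @ b21, 0.0)
    # Stage-1 regression: Ytilde ~ (h10, a1 * h11).
    Phi1 = np.hstack([H10, A1[:, None] * H11])
    beta1 = np.linalg.lstsq(Phi1, Ytilde, rcond=None)[0]
    b10, b11 = beta1[:H10.shape[1]], beta1[H10.shape[1]:]
    # Estimated rules: assign a_t = 1 when h_{t,1}' b_{t,1} > 0.
    pi1 = lambda h11: (h11 @ b11 > 0).astype(int)
    pi2 = lambda h21: (h21 @ b21 > 0).astype(int)
    return (b10, b11), (b20, b21), pi1, pi2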
Inference for c^T β∗_1
Non-smooth max operator makes β1 non-regular
Distn of c^T √n(β1 − β∗_1) is sensitive to small perturbations of P
Limiting distn does not have mean zero (asymptotic bias)
Occurs with small second-stage txt effects, H_{2,1}^T β∗_{2,1} ≈ 0
Confidence intervals based on series approximations or the
bootstrap can perform poorly; proposed remedies include:
Apply shrinkage to reduce asymptotic bias
Form conservative estimates of the tail probabilities of
c^T √n(β1 − β∗_1)
65 / 87
Characterizing asymptotic bias
Definition
For a constant c ∈ R^{dim β∗_1} and a √n-consistent estimator β1 of β∗_1
with √n(β1 − β∗_1) ⇝ M, define the c-directional asymptotic bias
Bias(β1, c) = E c^T M.
66 / 87
Characterizing asymptotic bias cont’d
Theorem (Asymptotic bias of Q-learning)
Let c ∈ R^{dim β∗_1} be fixed. Under moment conditions:
Bias(β1, c) = c^T Σ^{−1}_{1,∞} P[ B1 √(H_{2,1}^T Σ_{21,21} H_{2,1}) 1{H_{2,1}^T β∗_{2,1} = 0} ] / √(2π),
where B1 = (H_{1,0}^T, A1 H_{1,1}^T)^T, Σ_{1,∞} = P B1 B1^T, and Σ_{21,21} is the
asy. cov. of √n(β_{2,1} − β∗_{2,1}).
Asymptotic bias for Q-learning
Ave. of c^T Σ^{−1}_{1,∞} B1 with wts ∝ Var( H_{2,1}^T β_{2,1} 1{H_{2,1}^T β∗_{2,1} = 0} | H_{2,1} )
May be reduced by shrinking h_{2,1}^T β_{2,1} when h_{2,1}^T β∗_{2,1} = 0
67 / 87
Reducing asymptotic bias to improve inference
Shrinkage is a popular method for reducing asymptotic bias
with the goal of improving interval coverage
Chakraborty et al. (2009) apply soft-thresholding
Moodie et al. (2010) apply hard-thresholding
Goldberg et al. (2013) and Song et al. (2014) use lasso-type
penalization
Shrinkage methods target
max_{a2} Q2(h2, a2; β2) = h_{2,0}^T β_{2,0} + [ h_{2,1}^T β_{2,1} ]_+
68 / 87
Soft-thresholding (Chakraborty et al., 2009)
In Q-learning, replace max_{a2} Q2(H2, a2; β2) with
H_{2,0}^T β_{2,0} + [ H_{2,1}^T β_{2,1} ]_+ ( 1 − σ H_{2,1}^T Σ_{21,21} H_{2,1} / { n (H_{2,1}^T β_{2,1})² } )_+
Amount of shrinkage governed by σ > 0
Penalization schemes (Goldberg et al., 2013; Song et al.,
2014) reduce to this estimator under certain designs
No theoretical justification in Chakraborty et al. (2009) but
improved coverage of bootstrap intervals in some settings
69 / 87
Soft-thresholding and asymptotic bias
Theorem
Let c ∈ R^{dim β∗_1} and let β^σ_1 denote the soft-thresholding estimator.
Under moment conditions:
1. |Bias(β^σ_1, c)| ≤ |Bias(β1, c)| for any σ > 0.
2. If Bias(β1, c) ≠ 0, then for σ > 0
Bias(β^σ_1, c) / Bias(β1, c) = exp(−σ/2) − σ ∫_{√σ}^{∞} (1/x) exp(−x²/2) dx
70 / 87
Soft-thresholding and asymptotic bias cont’d
Is thresholding useful in reducing asymptotic bias?
Preceding theorem says yes, and more shrinkage is better
Chakraborty et al. suggest σ = 3, which corresponds to
13-fold decrease in asymptotic bias
However, the preceding theorem is based on pointwise, i.e.,
fixed parameter, asymptotics and may not faithfully reflect
small sample performance
71 / 87
Local generative model
Use local asymptotics to approximate the small sample behavior of
soft-thresholding
Assume:
1. For any s ∈ R^{dim β∗_{2,1}} there exists a sequence of distributions Pn
so that
∫ [ √n( dPn^{1/2} − dP^{1/2} ) − (1/2) νs dP^{1/2} ]² → 0,
for some measurable function νs.
2. β∗_{2,1,n} = β∗_{2,1} + s/√n, where
β∗_{2,n} = arg min_{β2} Pn{Y − Q2(H2, A2; β2)}²
72 / 87
Local asymptotics view of soft-thresholding
Theorem
Let c ∈ R^{dim β∗_1} be fixed. Under the local generative model and
moment conditions:
1. sup_{s∈R^{dim β∗_{2,1}}} |Bias(β1, c)| ≤ K < ∞.
2. sup_{s∈R^{dim β∗_{2,1}}} |Bias(β^σ_1, c)| → ∞ as σ → ∞.
Thresholding can be infinitely worse than doing nothing if
done too aggressively in small samples
73 / 87
Data-driven tuning
Is it possible to construct a data-driven choice of σ that
consistently leads to less asymptotic bias than no shrinkage?
Consider data from a two-arm randomized trial {(Ai, Yi)}_{i=1}^{n},
A ∈ {0, 1}, Y ∈ R coded so that higher is better13
Define µ∗_a = E(Y|A = a) and µa = ℙn Y 1{A=a} / ℙn 1{A=a}
Mean outcome under optimal treatment assignment:
θ∗ = max(µ∗_0, µ∗_1), corresponding estimator
θ = max(µ0, µ1) = µ0 + [µ1 − µ0]_+
Soft-thresholding estimator
θ^σ = µ0 + [µ1 − µ0]_+ ( 1 − 4σ / { n (µ1 − µ0)² } )_+
13
This is equivalent to two-stage Q-learning with no covariates and a single
first-stage treatment.
74 / 87
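A small numerical sketch of this toy estimator (numpy assumed; the generative choices below, including the 0.2 effect size, are only illustrative):

import numpy as np

def soft_threshold_theta(A, Y, sigma):
    # Plug-in and soft-thresholded estimators of theta* = max(mu0, mu1).
    n = len(Y)
    mu0, mu1 = Y[A == 0].mean(), Y[A == 1].mean()
    diff = mu1 - mu0
    shrink = max(0.0, 1.0 - 4.0 * sigma / (n * diff ** 2)) if diff != 0 else 0.0
    theta_plugin = mu0 + max(diff, 0.0)            # sigma = 0: no shrinkage
    theta_sigma = mu0 + max(diff, 0.0) * shrink
    return theta_plugin, theta_sigma

rng = np.random.default_rng(4)
n = 50
A = rng.integers(0, 2, size=n)
Y = 0.2 * A + rng.normal(size=n)                   # small true effect mu1* - mu0* = 0.2
print(soft_threshold_theta(A, Y, sigma=3.0))       # sigma = 3 as suggested by Chakraborty et al.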
Data-driven tuning: toy example
[Figure: bias of the soft-thresholded estimator θ^σ as a function of σ; panels for several sample sizes (n = 10, n = 100, ...).]
Optimal value of σ depends on µ∗_1 − µ∗_0
Variability in µ1 − µ0 prevents identification of the optimal value of σ;
using a plug-in estimator may lead to large bias
A data-driven σ that significantly improves asymptotic bias over no
shrinkage is difficult to construct
75 / 87
Asymptotic bias: discussion
Asymptotic bias exists in Q-learning
Local asymptotics show that aggressively shrinking to reduce
asymptotic bias can be infinitely worse than no shrinkage
Data-driven tuning seems to require choosing σ very small or
risking large bias
76 / 87
Confidence intervals for c^T β∗_1
Possible to construct valid confidence intervals in the presence of
asymptotic bias
Idea: construct regular bounds on c^T √n(β1 − β∗_1)
Bootstrap bounds to form confidence interval
Tightest among all regular bounds ⇒ automatic adaptivity
Local uniform convergence
Can also obtain conditional properties (Robins and Rotnitzky,
2014) and global uniform convergence (Wu, 2014)
77 / 87
Regular bounds on c^T √n(β1 − β∗_1)
Define
Vn(c, γ) = c^T Sn + c^T Σ^{−1}_1 ℙn Un(γ),
where
Sn = Σ̂^{−1}_1 √n(ℙn − P) B1 { ( H_{2,0}^T β∗_{2,0} + [H_{2,1}^T β∗_{2,1}]_+ ) − B1^T β∗_1 }
   + Σ̂^{−1}_1 √n ℙn B1 H_{2,0}^T ( β_{2,0} − β∗_{2,0} ),
Un(γ) = B1 { [ H_{2,1}^T (Zn + γ) ]_+ − [ H_{2,1}^T γ ]_+ }
Sn is smooth and Un(γ) is non-smooth
78 / 87
Regular bounds on c^T √n(β1 − β∗_1) cont’d
It can be shown that c^T √n(β1 − β∗_1) = Vn(c, β∗_{2,1})
Use pretesting to construct an upper bound
Un(c) = c^T Sn + c^T Σ^{−1}_1 ℙn Un(β∗_{2,1}) 1{Tn(H_{2,1}) > λn}
      + sup_{γ} c^T Σ^{−1}_1 ℙn Un(γ) 1{Tn(H_{2,1}) ≤ λn},
where Tn(h_{2,1}) is a test statistic for the null h_{2,1}^T β∗_{2,1} = 0 and λn is a
critical value
The lower bound, Ln(c), is obtained by taking an inf
Bootstrap the bounds to find a confidence interval
79 / 87
Validity of the bounds
Theorem
Let c ∈ R^{dim β∗_1} be fixed. Assume the local generative model;
under moment conditions and conditions on the pretest:
1. c^T √n(β1 − β∗_{1,n}) ⇝ c^T S∞ + c^T Σ^{−1}_{1,∞} P B1 H_{2,1}^T Z∞ 1{H_{2,1}^T β∗_{2,1} > 0}
   + c^T Σ^{−1}_{1,∞} P B1 { [H_{2,1}^T (Z∞ + s)]_+ − [H_{2,1}^T s]_+ } 1{H_{2,1}^T β∗_{2,1} = 0}
2. Un(c) ⇝ c^T S∞ + c^T Σ^{−1}_{1,∞} P B1 H_{2,1}^T Z∞ 1{H_{2,1}^T β∗_{2,1} > 0}
   + sup_{γ} c^T Σ^{−1}_{1,∞} P B1 { [H_{2,1}^T (Z∞ + γ)]_+ − [H_{2,1}^T γ]_+ } 1{H_{2,1}^T β∗_{2,1} = 0}
80 / 87
Validity of the bootstrap bounds
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗_1}. Let ℓ and u denote the
(α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap
distribution of the bounds and let PM denote the distribution with
respect to the bootstrap weights. Under moment conditions and
conditions on the pretest, for any ε > 0:
P{ PM( c^T β1 − u/√n ≤ c^T β∗_1 ≤ c^T β1 − ℓ/√n ) < 1 − α − ε } = o(1).
81 / 87
Uniform validity of the bootstrap bounds
(Tianshuang Wu)
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗_1}. Let ℓ and u denote the
(α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap
distribution of the bounds and let PM denote the distribution with
respect to the bootstrap weights. Under moment conditions and
conditions on the pretest, for any ε > 0:
sup_{P∈P} P{ PM( c^T β1 − u/√n ≤ c^T β∗_1 ≤ c^T β1 − ℓ/√n ) < 1 − α − ε }
converges to zero for a large class of distributions P.
82 / 87
Simulation experiments
Class of generative models
Xt ∈ {−1, 1}, At ∈ {−1, 1}, t ∈ {1, 2}
P(At = 1) = P(At = −1) = 0.5, t ∈ {1, 2}
X1 ∼ Bernoulli(0.5)
X2 | X1, A1 ∼ Bernoulli{expit(δ1 X1 + δ2 A1)}
ε ∼ N(0, 1)
Y = γ1 + γ2 X1 + γ3 A1 + γ4 X1 A1 + γ5 A2 + γ6 X2 A2 + γ7 A1 A2 + ε
Vary parameters to obtain a range of effect sizes; classify the
generative models as
Non-regular (NR)
Nearly non-regular (NNR)
Regular (R)
83 / 87
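A data-generating sketch for this class of models (numpy assumed; coding X1 and X2 in {−1, 1} is one reading of the Bernoulli statements above, and the example parameter values are only illustrative):

import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def generate(n, gamma, delta, rng=None):
    # gamma: (gamma_1, ..., gamma_7); delta: (delta_1, delta_2).
    rng = np.random.default_rng(rng)
    X1 = 2 * rng.binomial(1, 0.5, size=n) - 1          # assumed {-1, 1} coding
    A1 = rng.choice([-1, 1], size=n)
    X2 = 2 * rng.binomial(1, expit(delta[0] * X1 + delta[1] * A1)) - 1
    A2 = rng.choice([-1, 1], size=n)
    eps = rng.normal(size=n)
    Y = (gamma[0] + gamma[1] * X1 + gamma[2] * A1 + gamma[3] * X1 * A1
         + gamma[4] * A2 + gamma[5] * X2 * A2 + gamma[6] * A1 * A2 + eps)
    return X1, A1, X2, A2, Y

# Example: a 'non-regular'-style setting with all treatment effects set to zero.
X1, A1, X2, A2, Y = generate(150, gamma=[0, 0, 0, 0, 0, 0, 0], delta=[0.5, 0.5], rng=0)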
Simulation experiments cont’d
Compare bounding confidence interval (ACI) with bootstrap
(BOOT) and bootstrap thresholding (THRESH)
Compare in terms of width and coverage (target 95%)
Results based on 1000 Monte Carlo replications with datasets
of size n = 150
Bootstrap computed with 1000 resamples
Tuning parameter λn chosen with double bootstrap
84 / 87
Simulation experiments: results
Coverage (target 95%)

Method    Ex. 1 (NNR)  Ex. 2 (NR)  Ex. 3 (NNR)  Ex. 4 (R)  Ex. 5 (NR)  Ex. 6 (NNR)
BOOT      0.935*       0.930*      0.933*       0.928*     0.925*      0.928*
THRESH    0.945        0.938       0.942        0.943      0.759*      0.762*
ACI       0.971        0.958       0.961        0.943      0.953       0.953

Average width

Method    Ex. 1 (NNR)  Ex. 2 (NR)  Ex. 3 (NNR)  Ex. 4 (R)  Ex. 5 (NR)  Ex. 6 (NNR)
BOOT      0.385*       0.430*      0.430*       0.436*     0.428*      0.428*
THRESH    0.339        0.426       0.427        0.436      0.426*      0.424*
ACI       0.441        0.470       0.470        0.469      0.473       0.473
85 / 87
Ex. DTR for ADHD without uncertainty
[Figure: example DTR for ADHD, displayed without uncertainty. The first-stage treatment (low-dose MEDS vs. low-dose BMOD) depends on prior medication; responders continue their initial treatment; non-responders either intensify the current treatment or add the other modality, depending on adherence.]
86 / 87
Ex. DTR for ADHD with uncertainty
[Figure: the same DTR displayed with uncertainty: several decision points now list a set of acceptable options (e.g., ‘low-dose MEDS ∼OR∼ BMOD’, ‘add OTHER ∼OR∼ intensify SAME’, ‘add MEDS ∼OR∼ intensify BMOD’) rather than a single recommendation.]
87 / 87

More Related Content

PPTX
The False Discovery Rate: An Overview
PDF
파이썬으로 익히는 딥러닝 기본 (18년)
PDF
Neural Processes
PDF
차원축소 훑어보기 (PCA, SVD, NMF)
PDF
Helib
PPTX
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
PDF
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
PPT
1609 probability function p on subspace of s
The False Discovery Rate: An Overview
파이썬으로 익히는 딥러닝 기본 (18년)
Neural Processes
차원축소 훑어보기 (PCA, SVD, NMF)
Helib
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
1609 probability function p on subspace of s

Similar to 2019 PMED Spring Course - Introduction to Nonsmooth Inference - Eric Laber, April 17, 2019 (20)

PDF
Asymptotics of ABC, lecture, Collège de France
PDF
Estimation rs
PDF
GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...
PDF
the ABC of ABC
PDF
SAS Homework Help
PDF
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
PDF
eatonmuirheadsoaita
PDF
Stochastic Approximation And Nonlinear Regression Arthur E Albert Leland A Ga...
PDF
Statistics (1): estimation, Chapter 1: Models
PDF
Thesis_NickyGrant_2013
PDF
3_MLE_printable.pdf
PDF
Workshop in honour of Don Poskitt and Gael Martin
PDF
Machine Learning With MapReduce, K-Means, MLE
PDF
Optimum Engineering Design - Day 2b. Classical Optimization methods
PDF
Optimal Estimating Sequence for a Hilbert Space Valued Parameter
PDF
20320130406030
PDF
PPT
Multivariate outlier detection
PDF
Regression on gaussian symbols
PPTX
probability assignment help (2)
Asymptotics of ABC, lecture, Collège de France
Estimation rs
GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...
the ABC of ABC
SAS Homework Help
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
eatonmuirheadsoaita
Stochastic Approximation And Nonlinear Regression Arthur E Albert Leland A Ga...
Statistics (1): estimation, Chapter 1: Models
Thesis_NickyGrant_2013
3_MLE_printable.pdf
Workshop in honour of Don Poskitt and Gael Martin
Machine Learning With MapReduce, K-Means, MLE
Optimum Engineering Design - Day 2b. Classical Optimization methods
Optimal Estimating Sequence for a Hilbert Space Valued Parameter
20320130406030
Multivariate outlier detection
Regression on gaussian symbols
probability assignment help (2)
Ad

More from The Statistical and Applied Mathematical Sciences Institute (20)

PDF
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
PDF
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
PDF
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
PDF
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
PDF
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
PDF
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
PPTX
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
PDF
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
PDF
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
PPTX
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
PDF
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
PDF
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
PDF
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
PDF
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
PDF
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
PDF
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
PPTX
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
PPTX
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
PDF
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
PDF
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
Ad

Recently uploaded (20)

PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
Orientation - ARALprogram of Deped to the Parents.pptx
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Weekly quiz Compilation Jan -July 25.pdf
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PPTX
Lesson notes of climatology university.
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
GDM (1) (1).pptx small presentation for students
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
RMMM.pdf make it easy to upload and study
PDF
Complications of Minimal Access Surgery at WLH
PDF
A systematic review of self-coping strategies used by university students to ...
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
PDF
Trump Administration's workforce development strategy
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Orientation - ARALprogram of Deped to the Parents.pptx
Supply Chain Operations Speaking Notes -ICLT Program
Weekly quiz Compilation Jan -July 25.pdf
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
Lesson notes of climatology university.
2.FourierTransform-ShortQuestionswithAnswers.pdf
Chinmaya Tiranga quiz Grand Finale.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
GDM (1) (1).pptx small presentation for students
Microbial diseases, their pathogenesis and prophylaxis
RMMM.pdf make it easy to upload and study
Complications of Minimal Access Surgery at WLH
A systematic review of self-coping strategies used by university students to ...
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
Trump Administration's workforce development strategy
Module 4: Burden of Disease Tutorial Slides S2 2025

2019 PMED Spring Course - Introduction to Nonsmooth Inference - Eric Laber, April 17, 2019

  • 1. Intro to nonsmooth inference Eric B. Laber Department of Statistics, North Carolina State University April 2019 SAMSI
  • 2. Last time SMARTs gold standard for est and eval of txt regimes Highly configurable but choices driven by science Looked at examples with varying scientific/clinical goals which lead to different timing, txt options, response criteria etc. Often powered by simple comparisons First-stage response rates Fixed regimes (most- vs. least-intensive) First stage txts (problematic) If test statistic is regular and asymptotically normal under null can use same basic template for power 1 / 87
  • 3. Quick SMART review R PCST-Full Treatment 0 PCST-Brief Treatment 1 Response? Response? R No R No R Yes R Yes PCST-Full maintenance Treatment 2 No further treatment Treatment 3 PCST-Plus Treatment 4 PCST-Full maintenance Treatment 2 PCST-Brief maintenance Treatment 5 No further intervention Treatment 3 PCST-Full Treatment 0 PCST-Brief maintenance Treatment 5 2 / 87
  • 4. Refresher Suppose that researchers are interested in comparing the embedded regimes: (e1) assign PCST-Full initially, assign PCST-Full maintenance to responders, and assign PCST-Plus to non-responders; (e2) assign PCST-Brief initially, assign no further intervention to responders, and assign PCST-Brief maintenance to responders. Recall our general template: Test statistic: Vn(e1) − Vn(e2), where Vn is IPWE Use √ nTn/σ2 e1,e2,n asy normal and reject when this is large in magnitude 3 / 87
  • 5. Goals for today Introduction to inference for txt regimes Nonregular inference (and why we should care) Basic strategies with a toy problem Examples in one-stage problems 4 / 87
  • 6. Warm up part I: quiz! Discuss with your stat buddy: What are some common scenarios where series approx or the bootstrap cannot ensure correct op characteristics? What is a local alternative? How do we know if an asymptotic approx is adequate? True or false If n is large asymptotic approximations can be trusted. The top review of CLT on yelp complains about the burritos being too expensive. The BBC produced an Hitler-themed sitcom titled ‘Heil Honey, I’m home’ in the 1950s. 5 / 87
  • 7. On reality and fantasy Your cat didn’t say that. You know how I know? It’s a cat. It doesn’t talk. If you died, it would eat you. Starting with your face. – Matt Zabka, recently single 6 / 87
  • 8. Asymptotic approximations Basic idea: study behavior of statistical procedure in terms of dominating features while ignoring lower order ones Often, but not always, consider diverging sample size ‘Dominating features’ intentionally ambiguous Generate new insights and general statistical procedures as large classes of problems share same dominating features Asymptotics mustn’t be applied mindlessly Disgusting trend in statistics: propose method, push through irrelevant asymptotics, handpick simulation experiments Require careful thought about what op characteristics are needed scientifically and how to ensure these hold with the kind of data that are likely to be observed No panacea ⇒ handcrafted construction and evaluation 7 / 87
  • 9. Inferential questions in precision medicine Identify key tailoring variables Evaluate performance of true optimal regime Evaluate performance of estimated optimal regime Compare performance of two+ (possibly data-driven) regimes . . . 8 / 87
  • 10. Toy problem: max of means Simple problem that retains many of the salient features of inference for txt regimes Non-smooth function of smooth functionals Well-studied in the literature Basic notation For Z1, . . . , Zn ∼i.i.d. P comprising ind copies of Z ∼ P write Pf (Z) = f (z)dP(z) and Pnf (Z) = n−1 n i=1 f (Zi ) Use ‘ ’ to denote convergence in distribution Check: assuming requisite moments exist: √ n (Pn − P) Z ???? 9 / 87
  • 11. Max of means Observe X1, . . . , Xn ∼i.i.d. P in Rp with µ0 = PX, define θ0 = p j=1 µ0,j = max(µ0,1, . . . , mu0,p) While we consider this estimand primarily for illustration, it corresponds to problem of estimating the mean outcome under an optimal one-size-fits-all treatment recommendation where µ0,j is mean outcome under treatment j = 1, . . . , p. 10 / 87
  • 12. Max of means: estimation Define µn = PnX, the plug-in estimator of θ0 is θn = p j=1 µn,j Warm-up: Three minutes trying to derive limiting distn of √ n(θn − θ0) Three minutes discussing soln with your stat buddy 11 / 87
  • 13. Max of means: first result For v ∈ Rp define U(v) = arg max j vj Lemma Assume regularity conditions under which √ n(Pn − P)X is asymptotically normal with mean zero and variance-covariance matrix Σ. Then √ n θn − θ0 j∈U(µ0) Zj , where Z ∼ Normal(0, Σ). 12 / 87
  • 14. Max of means: proof of first result 13 / 87
  • 15. Extra page if needed
  • 16. Max of means: discussion of first result Limiting distribution of √ n(θn − θ0) depends abruptly on µ0 If µ0 = (0, 0) T and Σ = I2, the the limiting distn is the max of two ind std normals If µ0 = (0, ) T for > 0 and Σ = I2, the limiting distn is std normal even if = 1x10−27 !! How can we use such an asymptotic result in practice?! Limiting distn of √ n(θn − θ0) depends only on submatrix of Σ cor. to elements of U(θ0). What about in finite samples? 14 / 87
  • 17. Max of means: discussion of first result cont’d Suppose X1, . . . , Xn ∼i.i.d. Normal(θ0, Ip) and µ0 has a unique maximizer, i.e., U(µ0) a singleton, say {µ0,1} √ n(θn − θ0) Normal(0, 1) P √ n θn − θ0 ≤ t = Φ(t) p j=2 Φ t + √ n(θ0 − µ0,j ) Quick break: derive this. If the gaps θ0 − µ0,j are small relative to √ n, the finite sample behavior can be quite different from limit Φ(t)1 1 Note that the limiting distribution doesn’t depend on these gaps at all! 15 / 87
  • 18. Max of means: normal approximation in pictures Generate data from Normal(µ, I6) with µ1 = 2 and µj = µ1 − δ for j = 2, . . . , 6. Results shown for n = 100. n(θ^ n − θ0) Density,δ=0.5 −4 −2 0 2 4 0.00.10.20.30.40.50.6 n(θ^ n − θ0) Density,δ=0.1 −4 −2 0 2 4 0.00.10.20.30.40.50.6 n(θ^ n − θ0) Density,δ=0.01 −4 −2 0 2 4 0.00.10.20.30.40.50.6 16 / 87
  • 19. Choosing the right asymptotic framework Dangerous pattern of thinking: In practice, none of the txt effect differences are zero. I’ll build my asy approximations assuming a unique maximizer. 17 / 87
  • 20. Choosing the right asymptotic framework Dangerous pattern of thinking: In practice, none of the txt effect differences are zero. I’ll build my asy approximations assuming a unique maximizer. There finitely many components so maximizer is well-separated. Idea! Plug-in estimated mazimizer and use asy normal approx. Preceding pattern happens frequently, e.g., oracle property in model selection, max eigenvalues in matrix , and txt regimes 17 / 87
  • 21. Choosing the right asymptotic framework What goes wrong? After all, this thinking works well in many other settings, e.g., everything you learned in stat 101. 18 / 87
  • 22. Choosing the right asymptotic framework What goes wrong? After all, this thinking works well in many other settings, e.g., everything you learned in stat 101. Finite sample behavior driven by small (not necessarily zero) differences in txt effectiveness We saw this analytically in normal case Intuition helped by thinking in extremes, e.g,. what if all txts were equal? What if one were infinitely better than others? Abrupt dependence of limiting distribution on U(µ0) is a redflag. It is tempting to construct procedures that will recover this limiting distn even if some txt differences are exactly zero. This is asymptotics for asymptotics sake. Don’t do it. 18 / 87
  • 23. Asymptotic working assumptions A useful asy approximation should be robust to the setting where some (all) txt differences are zero Necessary but not sufficient Heuristic: in small samples, one cannot distinguish between small (but nonzero) txt differences so use an asy framework which allows for exact equality. This heuristic has been misinterpreted and misused in lit. Some procedures we’ll look at are designed for such robustness 19 / 87
  • 24. Local asymptotics: horseshoes and hand grenades Allowing null txt differences problematic Asymptotically, differences either zero or infinite2 Txt differences are (probably) not exactly zero Challenge: allow small differences to persist as n diverges Local or moving parameter asy framework does this Idea: allow gen model to change with n so that gaps θ0 − max j /∈U(µ0) µ0,j shrink to zero as n increases3 2 In the stat sense that we have power one to discriminate between them. 3 This idea should be familiar from hypothesis testing. 20 / 87
  • 25. Triangular arrays For each n, X1,n, . . . , Xn,n ∼i.i.d. Pn Observations Distribution X1,1 P1 X1,2 X2,2 P2 X1,3 X2,3 X3,3 P3 X1,4 X2,4 X3,4 X4,4 P4 ... ... ... ... ... ... Define µ0,n = PnX and θ0,n = p j=1 µ0,j Assume µ0,n = µ0 + s/ √ n where s ∈ Rp called local parameter Assume √ n(Pn − Pn)X Normal(0, Σ)4 4 This is true under very mild conditions on the sequence of distributions {Pn}n≥1. However, given our limited time we will not discuss such conditions. See van der Vaart and Wellner (1996) for details. 21 / 87
  • 26. Quick quiz Suppose that X1,n, . . . , Xn,n ∼i.i.d. Normal(µ0 + s/ √ n, Σ) what is is distribution of √ n(Pn − Pn)X? 22 / 87
  • 27. Local alternatives anticipate unstable performance Lemma Let s ∈ Rp be fixed. Assume that for each n we observe {Xi,n}n i=1 drawn i.i.d. from Pn which satisfies: (i) PnX = mu0 + s/ √ n, and (ii) √ n(Pn − Pn)X Normal(0, Σ). Then, under Pn, √ n θn − θ0,n j∈U(µ0) (Zj + sj ) − j∈U(µ0) sj , where Z ∼ Normal(0, Σ). 23 / 87
  • 28. Local alternatives anticipate unstable performance Lemma Let s ∈ Rp be fixed. Assume that for each n we observe {Xi,n}n i=1 drawn i.i.d. from Pn which satisfies: (i) PnX = mu0 + s/ √ n, and (ii) √ n(Pn − Pn)X Normal(0, Σ). Then, under Pn, √ n θn − θ0,n j∈U(µ0) (Zj + sj ) − j∈U(µ0) sj , where Z ∼ Normal(0, Σ). Discussion/observations on local limiting distn Dependence of limiting distn on s ⇒ nonregular Set U(µ0) represents set of near-maximizers though sj = 0 corresponds to exact equality (so haven’t ruled this out) 23 / 87
  • 29. Proof of local limiting distribution 24 / 87
  • 30. Extra page if needed
  • 32. Comments on nonregularity Sensitivity of estimator to local alternatives cannot be rectified through the choice of a more clever estimator Inherent property of the estimand5 This has not stopped some from trying... Remainder of today’s notes: cataloging of confidence intervals 5 See van der Vaart (1991), Hirano and Porter (2012), and L. et al. (2011, 2014, 2019) 25 / 87
  • 33. Projection region Idea: exploit the following two facts µn is nicely behaved (reg. asy normal) If µ0 were known this would be trivial6 Given α ∈ (0, 1) denote acceptable error level and ζn,1−α a confidence region for µ0, e.g., ζn,1−α = µ ∈ Rp : n(µn − µ) T Σn(µn − µ) ≤ χ2 p,1−α , where Σn = Pn(X − µn)(X − µn) T Projection CI: Γn,1−α =    θ ∈ R : θ = p j=1 µj for some µ ∈ ζn,1−α    6 In this problem, θ0 is a function of µ0 and is thus completely known when µ0 is known. In more complicated problems, knowing the value of a nuisance parameter will make the inference problem of interest regular. 26 / 87
  • 34. Prove the following with your stat buddy P (θ0 ∈ Γn,1−α) ≥ 1 − α + oP(1) 27 / 87
  • 35. Comments on projection regions Useful when parameter of interest is a non-smooth functional of a smooth (regular) parameter Robust and widely applicable but conservative Projection interval valid under local alternatives (why?) Can reduce conservatism using pre-test (L. et al., 2014) Berger and Boos (1991) and Robins (2004) for seminal papers Consider as a first option in new non-reg problem 28 / 87
Bound-based confidence intervals
Idea: sandwich the non-smooth functional between smooth upper and lower bounds, then bootstrap the bounds to form a confidence region
Let {τn}n≥1 be a sequence of positive constants such that τn → ∞ and τn = o(√n) as n → ∞, and define
Un(µ0) = { j : max_k √n(µn,k − µn,j)/σj,k,n ≤ τn },
where σj,k,n is an estimator of the asymptotic variance of µn,k − µn,j.
Note* It may help to think of Un as the indices of txts that we cannot distinguish from being optimal.
29 / 87

Bound-based confidence intervals cont'd
Given Un(µ0), define Sn(µ0) = { s ∈ Rp : sj = µ0,j if j ∉ Un(µ0) }; then it follows that
Un = sup_{s∈Sn(µ0)} √n { max_{1≤j≤p}(µn,j − µ0,j + sj) − max_{1≤j≤p} sj }
is an upper bound on √n(θn − θ0). (Why?)
A lower bound, Ln, is constructed by replacing the sup with an inf.
30 / 87

Dad, where do bounds come from?
Un is obtained by taking the sup over all local, i.e., order 1/√n, perturbations of the generative model
By construction, insensitive to local perturbations ⇒ regular
Un(µ0) is a conservative estimate of U(µ0); let's wave our hands:
31 / 87

Bootstrapping the bounds
Both Un and Ln are regular and their distributions can be consistently estimated via the nonparametric bootstrap
Let u(b)_{n,1−α/2} be the (1 − α/2) × 100 percentile of the bootstrap distribution of Un and ℓ(b)_{n,α/2} the (α/2) × 100 percentile of the bootstrap distribution of Ln
Bound-based confidence interval:
[ θn − u(b)_{n,1−α/2}/√n , θn − ℓ(b)_{n,α/2}/√n ]
32 / 87
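A sketch of the bound-based interval for the max of means. I use the computable form of the sup/inf over local perturbations: with high probability the maximizer of µ0 lies in Un, so max over Un of √n(µn,j − µ0,j) upper-bounds √n(θn − θ0) and the corresponding min lower-bounds it. In this sketch Un is computed on the observed data and held fixed across bootstrap draws, and τn = √(log n) is one simple assumed choice (the slides tune τn with the double bootstrap).

import numpy as np

def bound_based_ci(X, alpha=0.05, B=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    theta_hat = mu_hat.max()
    tau_n = np.sqrt(np.log(n))                               # tau_n -> infinity, tau_n = o(sqrt(n))
    S = np.cov(X, rowvar=False)
    var_diff = np.diag(S)[None, :] + np.diag(S)[:, None] - 2 * S
    sd_diff = np.sqrt(np.maximum(var_diff, 1e-12))           # sd of mu_hat_k - mu_hat_j
    T = np.sqrt(n) * (mu_hat[None, :] - mu_hat[:, None]) / sd_diff
    U_set = np.where(T.max(axis=1) <= tau_n)[0]              # arms indistinguishable from the best

    U_b, L_b = np.empty(B), np.empty(B)
    for b in range(B):
        diff = np.sqrt(n) * (X[rng.integers(n, size=n)].mean(axis=0) - mu_hat)
        U_b[b], L_b[b] = diff[U_set].max(), diff[U_set].min()
    u = np.quantile(U_b, 1 - alpha / 2)
    low = np.quantile(L_b, alpha / 2)
    return theta_hat - u / np.sqrt(n), theta_hat - low / np.sqrt(n)

rng = np.random.default_rng(1)
print(bound_based_ci(rng.normal(loc=[0.0, 0.0, -0.6], size=(250, 3))))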
Bound-based intervals discussion
General approach; applies to implicitly defined estimators as well as to those with closed-form expressions like the one considered here
Less conservative than the projection interval but still conservative; such conservatism is unavoidable
Bounds are tightest in a certain sense
Bounding quantiles directly rather than the estimand may reduce conservatism, though possibly at the price of additional complexity
See Fan et al. (2017) for other improvements/refinements
33 / 87

Bootstrap methods
Bootstrap is not consistent without modification
Due to instability (nonregularity)
Nondifferentiability of the max operator causes this instability (see Shao 1994 for a nice review)
Bootstrap is appealing for complex problems
Doesn't require explicitly computing asymptotic approximations7
Higher-order convergence properties
7 There are exceptions to this, including the parametric bootstrap and those based on quadratic expansions
34 / 87

How about some witchcraft?
m-out-of-n bootstrap can be used to create valid confidence intervals for non-smooth functionals
Idea: resample datasets of size mn = o(n) so that sample-level parameters converge 'faster' than their bootstrap analogs (i.e., witchcraft)
35 / 87
m-out-of-n bootstrap
Accepting some components of witchcraft on faith
√mn(µ(b)_{mn} − µn) ⇝ Normal(0, Σ) conditional on the data
See Arcones and Giné (1989) for details
An even toyier example than our toy example: W1, . . . , Wn i.i.d. with mean µ and variance σ²; derive the limit distributions of √n(|W̄n| − |µ|) and √mn(|W̄(b)_{mn}| − |W̄n|)
36 / 87

m-out-of-n bootstrap with max of means
Derive the limiting distribution of √mn(θ(b)_{mn} − θn):
37 / 87
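A numerical comparison (my own, with assumed choices n = 400, p = 2, mn = n^{2/3}) of the sampling distribution of √n(θn − θ0) in the tied case against the usual bootstrap and the m-out-of-n bootstrap; the former is inconsistent here, while the latter typically tracks the target distribution much more closely.

import numpy as np

rng = np.random.default_rng(2)
n, p, B = 400, 2, 2000
m = int(n ** (2 / 3))                                # m_n -> infinity, m_n = o(n)

def theta(X):
    return X.mean(axis=0).max()

# sampling distribution of sqrt(n)(theta_n - theta_0) when mu_0 = (0, 0), i.e., theta_0 = 0
samp = np.array([np.sqrt(n) * theta(rng.normal(size=(n, p))) for _ in range(B)])

X = rng.normal(size=(n, p))                          # one observed data set
th = theta(X)
boot_n = np.array([np.sqrt(n) * (theta(X[rng.integers(n, size=n)]) - th) for _ in range(B)])
boot_m = np.array([np.sqrt(m) * (theta(X[rng.integers(n, size=m)]) - th) for _ in range(B)])

for name, d in [("sampling dist", samp), ("n-out-of-n", boot_n), ("m-out-of-n", boot_m)]:
    print(name, np.round(np.quantile(d, [0.05, 0.5, 0.95]), 2))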
Extra page if needed

Intermission
I wouldn't want to wind up hooked to a bunch of wires and tubes, unless somehow the wires and tubes were keeping me alive.
—Don Alden Adams

Finally! Back to treatment regimes (briefly)
Consider a one-stage problem with observed data {(Xi, Ai, Yi)}, i = 1, . . . , n, where X ∈ Rp, A ∈ {−1, 1}, and Y ∈ R
Assume the requisite causal conditions hold
Assume linear rules π(x) = sign(xTβ), where β ∈ Rp and x might contain polynomial terms etc.
38 / 87
Warm-up!
Derive the limiting distribution of the parameters in linear Q-learning!!!!8
Posit the linear model Q(x, a; β) = x0Tβ0 + a x1Tβ1, indexed by β = (β0T, β1T)T, with x0, x1 known features of x
βn = arg minβ Pn{Y − Q(X, A; β)}² and β∗ = arg minβ P{Y − Q(X, A; β)}²
Derive the limiting distribution of √n(βn − β∗)
Construct a confidence interval for Q(x, a) assuming Q(x, a) = Q(x, a; β∗)
8 He exclaimed.
39 / 87
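A sketch of the warm-up: one-stage linear Q-learning is ordinary least squares, so a sandwich covariance estimate and a Wald interval for Q(x, a; β∗) follow directly. The feature maps x0 = x1 = (1, x) below are an illustrative choice.

import numpy as np
from scipy.stats import norm

def fit_q(X, A, Y):
    """OLS fit of Q(x, a; beta) = x0'b0 + a * x1'b1 with x0 = x1 = (1, x)."""
    n = len(Y)
    x0 = np.column_stack([np.ones(n), X])
    D = np.column_stack([x0, A[:, None] * x0])            # design (x0, a*x1)
    beta = np.linalg.solve(D.T @ D, D.T @ Y)
    resid = Y - D @ beta
    bread = np.linalg.inv(D.T @ D / n)
    meat = (D * resid[:, None]).T @ (D * resid[:, None]) / n
    cov = bread @ meat @ bread / n                        # sandwich estimate of Cov(beta_n)
    return beta, cov

def q_ci(x, a, beta, cov, alpha=0.05):
    phi = np.concatenate([[1.0], np.atleast_1d(x)])
    d = np.concatenate([phi, a * phi])                    # gradient of Q(x, a; beta) in beta
    est, se = d @ beta, np.sqrt(d @ cov @ d)
    z = norm.ppf(1 - alpha / 2)
    return est - z * se, est + z * se

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
A = rng.choice([-1, 1], size=n)
Y = 1 + X + A * (0.5 - X) + rng.normal(size=n)
beta, cov = fit_q(X, A, Y)
print(q_ci(x=0.3, a=1, beta=beta, cov=cov))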
Extra page if needed

Extra page if needed

Parameters in (1-stage) Q-learning are easy!
Similar arguments show that the coefficients indexing g-computation and outcome weighted learning are asymptotically normal
Preview: consider the (regression-based) estimator of the value of πn(x) = sign(x1Tβ1,n), which you'll recall is
Vn(β1,n) = Pn maxa Q(X, a; βn) = PnX0Tβ0,n + Pn|X1Tβ1,n|
What is the limit of √n{Vn(β1,n) − V(β1,n)} and can we use it to derive a CI for V(βn)? What about a CI for V(β∗)?
40 / 87

Parameters in (1-stage) OWL are easy!
To illustrate, assume P(A = 1|X) = P(A = −1|X) = 1/2 wp1
Recall OWL is based on a convex relaxation of the IPWE
Vn(β) = Pn[ Y 1{A = sign(XTβ)} / P(A|X) ] = 2 Pn Y 1{A sign(XTβ) > 0}
Let ℓ : R → R be convex; the OWL estimator is
βn = arg min_{β∈Rp} Pn |Y| ℓ(WTβ), where W = sign(Y) A X
41 / 87
Extra page if needed

Some facts about OWL (and more generally convex M-estimators)
The map β ↦ |y| ℓ(wTβ) is the composition of a linear and a convex function and is thus convex in β for each (y, w) ⇒ greatly simplifies inference!
Regularity conditions
β∗ = arg minβ P|Y| ℓ(WTβ) exists and is unique
The map β ↦ P|Y| ℓ(WTβ) is differentiable in a neighborhood of β∗9
Under these conditions, √n(βn − β∗) is regular10 and asymptotically normal ⇒ many results from Q-learning port over
9 More formally, require |y|{ℓ(wT(β∗ + δ)) − ℓ(wTβ∗)} = S(y, w; β∗)Tδ + R(y, w, δ; β∗), where PS(Y, W; β∗) = 0, ΣO = PS(Y, W; β∗)S(Y, W; β∗)T is finite, and PR(Y, W, δ; β∗) = (1/2)δTΩOδ + o(||δ||²) (Haberman, 1989; Niemiro, 1992; Hjort and Pollard, 2011).
10 I am being a bit loose with language in this course by referring to both estimands and rescaled estimators as 'regular' or 'non-regular.'
42 / 87
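A sketch of an OWL-type convex M-estimator under the setup above (P(A = 1 | X) = 1/2). I use the logistic surrogate ℓ(u) = log(1 + e^{−u}), chosen here for smoothness; the hinge loss of the original OWL proposal is another convex choice, and the intercept-plus-X feature map is illustrative.

import numpy as np
from scipy.optimize import minimize

def owl_fit(X, A, Y):
    n = len(Y)
    Phi = np.column_stack([np.ones(n), X])              # illustrative features
    W = np.sign(Y)[:, None] * A[:, None] * Phi          # W = sign(Y) * A * features
    w = np.abs(Y)                                       # weights |Y|

    def objective(beta):                                # P_n |Y| ell(W'beta), ell = logistic
        return np.mean(w * np.logaddexp(0.0, -(W @ beta)))

    return minimize(objective, np.zeros(Phi.shape[1]), method="BFGS").x

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=n)
A = rng.choice([-1, 1], size=n)
Y = 0.5 + A * (1.0 - X) + rng.normal(size=n)            # optimal rule: treat iff 1 - x > 0
print(np.round(owl_fit(X, A, Y), 2))                    # roughly proportional to (1, -1)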
Value function(s)
Three ways to measure performance
Conditional value: V(πn) = E{ Y∗(πn) | πn }, measures the performance of an estimated decision rule as if it were to be deployed in the population (note* this is a random variable)
Unconditional value: Vn = E V(πn), measures the average performance of the algorithm used to construct πn with a sample of size n
Population-level value: V(π∗), where π∗(x) = sign(xTβ∗), measures the potential of applying a precision medicine strategy in a given domain if the algorithm for constructing πn will be used
Discuss these measures with your stat buddy. Is there a meaningful distinction as the sample size grows large?
43 / 87

It's a wacky world out there
The three value measures need not coincide asymptotically
Let πn(x) = sign(xTβn) and suppose √n βn ⇝ Normal(0, Σ), so that β∗ ≡ 0 and π∗(x) ≡ −1. With your stat buddy, compute:
V(βn) ⇝ ????
Vn = E V(βn) → ????
V(β∗) = ????
44 / 87
Extra page if needed

Have some confidence you useless pile!
We'll construct confidence sets for V(βn) and V(β∗) as these are most commonly of interest in application
Starting with the conditional value, assume that the data-generating model is a triangular array Pn such that:
(A0) πn(x) = sign(xTβn)
(A1) ∃ β∗n s.t. β∗n = β∗ + s/√n for some s ∈ Rp and √n(βn − β∗n) = √n(ℙn − Pn)u(X, A, Y) + oPn(1), where u does not depend on s, supn Pn||u(X, A, Y)||² < ∞, and Cov{u(X, A, Y)} is p.d.
(A2) If F is a uniformly bounded Donsker class and √n(ℙn − P) ⇝ T in ℓ∞(F) under P, then √n(ℙn − Pn) ⇝ T in ℓ∞(F) under Pn
(A3) supn Pn||Y||² < ∞
Detailed discussion is beyond the scope of this class. Laber will wave his hands a bit. Our goal is to understand the key ideas.
45 / 87

Building block: joint distribution before the nonsmooth operator
Define the class of functions
G = { g(X, A, Y; δ) = Y 1{A XTδ > 0} 1{XTβ∗ = 0} : δ ∈ Rp }
and view √n(ℙn − Pn) as a random element of ℓ∞(Rp).
Lemma
Assume (A0)-(A3). Then
√n( (ℙn − Pn)g(·; ·), βn − β∗, (ℙn − Pn)Y 1{A XTβ∗ > 0} ) ⇝ (T, Z, W)
in ℓ∞(Rp) × Rp × R under Pn.
46 / 87
Limiting distribution of V(βn)
Corollary
Assume (A0)-(A3). Then √n{Vn(βn) − V(βn)} ⇝ T(Z + s) + W.
Notes
Presence of s shows this is nonregular
T is a Brownian bridge indexed by Rp
W and Z are normal
47 / 87

Extra page if needed

Bound-based confidence interval
Limiting distribution: T(Z + s) + W
Local parameter only appears in the first term
(Asy) bound should only affect this term
Schematic for constructing a bound
Partition the input space into points that are 'near' the decision boundary xTβ∗ = 0 vs. those that are 'far' from the boundary
Take the sup/inf over local perturbations of points in the 'near' group
48 / 87
Upper bound
Let Σn be an estimator of the asymptotic variance of βn; an upper bound on √n{Vn(βn) − V(βn)} is
Un = sup_{ω∈Rp} √n(ℙn − Pn)[ Y 1{A XTω > 0} 1{ n(XTβn)² / (XTΣnX) ≤ τn } ]
       + √n(ℙn − Pn)[ Y 1{A XTβn > 0} 1{ n(XTβn)² / (XTΣnX) > τn } ],
where τn is a sequence of tuning parameters such that τn → ∞ and τn = o(n) as n → ∞.
The lower bound is constructed by replacing the sup with an inf.
49 / 87

Limiting distribution of the bounds
Theorem
Assume (A0)-(A3). Then
(Ln, Un) ⇝ ( inf_{ω∈Rp} T(ω) + W, sup_{ω∈Rp} T(ω) + W )
under Pn.
Recall the limit distribution of √n{Vn(βn) − V(βn)} is T(Z + s) + W
Bounds are equivalent to a sup/inf over local perturbations
If all subjects have large txt effects, the bounds are tight
Bootstrap the bounds to construct a confidence interval; theoretical results for the bootstrap bounds are given in the book
50 / 87

Note on tuning
The sequence {τn}n≥1 can affect finite-sample performance
Idea: tune using the double bootstrap, i.e., bootstrap the bootstrap samples to estimate coverage and adapt τn
The double bootstrap is considered computationally expensive but is not much of a burden in most problems with modern computing infrastructure
Tuning can be done without affecting the theoretical results
51 / 87
Algy the friendly tuning algorithm
Alg. 1.5: Tuning the critical value τn using the double bootstrap
Input: {(Xi, Ai, Yi)}, i = 1, . . . , n; M; α ∈ (0, 1); τ(1)n, . . . , τ(L)n
1  V = Vn(dn)
2  for j = 1, . . . , L do
3      c(j) = 0
4      for b = 1, . . . , M do
5          Draw a sample of size n, say S(b)n, from {(Xi, Ai, Yi)} with replacement
6          Compute the bound-based confidence set, ζ(b)n, using sample S(b)n and critical value τ(j)n
7          if V ∈ ζ(b)n then
8              c(j) = c(j) + 1
9          end
10     end
11 end
12 Set j∗ = arg min_{j : c(j) ≥ M(1−α)} c(j)
Output: Return τ(j∗)n
52 / 87
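A compact sketch of Alg. 1.5, written generically: ci_fn(data, tau) and estimate_fn(data) are placeholder interfaces (my assumptions, not from the slides) for the bound-based confidence set and the point estimate Vn(dn), and the fallback when no critical value reaches nominal estimated coverage is also my addition.

import numpy as np

def tune_tau(data, ci_fn, estimate_fn, taus, M=200, alpha=0.05, rng=None):
    """Double-bootstrap tuning; data is an array whose rows are observations."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    V = estimate_fn(data)                         # observed estimate plays the role of the truth
    counts = np.zeros(len(taus), dtype=int)
    for b in range(M):
        boot = data[rng.integers(n, size=n)]      # first-level bootstrap sample
        for j, tau in enumerate(taus):
            lo, hi = ci_fn(boot, tau)             # inner (second-level) resampling happens in ci_fn
            counts[j] += (lo <= V <= hi)
    ok = np.where(counts >= M * (1 - alpha))[0]
    if len(ok) == 0:                              # fallback (not in Alg. 1.5): most-conservative choice
        return taus[int(np.argmax(counts))]
    j_star = ok[np.argmin(counts[ok])]            # least conservative tau with nominal estimated coverage
    return taus[j_star]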
Intermission
If any man says he hates war more than I do, he better have a knife, that's all I have to say.
–Gandhi
53 / 87

m-out-of-n bootstrap
Bound-based intervals are complex (conceptually and technically)
Subsampling is easier to implement and understand11
Does not require specialized code etc.
Let mn be the resample size with mn → ∞ and mn = o(n), and let P(b)_{mn} be the bootstrap empirical distribution
Approximate √n{Vn(βn) − V(βn)} with its bootstrap analog √mn{V(b)_{mn}(β(b)_{mn}) − Vn(βn)} (Laber might draw a picture)
Let ℓmn and umn be the (α/2) × 100 and (1 − α/2) × 100 percentiles of √mn{V(b)_{mn}(β(b)_{mn}) − Vn(βn)}; the CI is given by
[ Vn(βn) − umn/√mn , Vn(βn) − ℓmn/√mn ]
11 Though the theory underpinning subsampling can be non-trivial, so 'understand' here is meant more mechanically.
54 / 87
Emmy the subsampling bootstrap algorithm
Alg. 1.6: m-out-of-n bootstrap confidence set for the conditional value
Input: mn; {(Xi, Ai, Yi)}, i = 1, . . . , n; M; α ∈ (0, 1)
1  for b = 1, . . . , M do
2      Draw a sample of size mn, say S(b)_{mn}, from {(Xi, Ai, Yi)} with replacement
3      Compute β(b)_{mn} on S(b)_{mn}
4      Δ(b)_{mn} = √mn { (1/mn) Σ_{i∈S(b)_{mn}} Yi 1{Ai XiTβ(b)_{mn} > 0} − (1/n) Σ_{k=1}^{n} Yk 1{Ak XkTβ(b)_{mn} > 0} }
5  end
6  Relabel so that Δ(1)_{mn} ≤ Δ(2)_{mn} ≤ · · · ≤ Δ(M)_{mn}
7  ℓmn = Δ(⌈Mα/2⌉)_{mn}
8  umn = Δ(⌈M(1−α/2)⌉)_{mn}
Output: [ Vn(dn) − umn/√mn , Vn(dn) − ℓmn/√mn ]
55 / 87
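A sketch of Alg. 1.6 for a linear rule. Here fit_rule is a placeholder returning β from a data set (e.g., one-stage Q-learning or OWL), the value estimator is the randomized-trial form Vn(β) = 2 Pn Y 1{A XTβ > 0} from the OWL slide (P(A = 1 | X) = 1/2 assumed), and mn = n^{2/3} is an assumed default (the slides tune mn with the double bootstrap).

import numpy as np

def value(beta, X, A, Y):
    return 2 * np.mean(Y * (A * (X @ beta) > 0))         # assumes P(A = 1 | X) = 1/2

def m_out_of_n_value_ci(X, A, Y, fit_rule, m=None, B=1000, alpha=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(Y)
    m = m or int(n ** (2 / 3))
    beta_hat = fit_rule(X, A, Y)
    V_hat = value(beta_hat, X, A, Y)
    delta = np.empty(B)
    for b in range(B):
        idx = rng.integers(n, size=m)
        beta_b = fit_rule(X[idx], A[idx], Y[idx])
        # resample value of the resample rule minus the full-sample value of the same rule
        delta[b] = np.sqrt(m) * (value(beta_b, X[idx], A[idx], Y[idx]) - value(beta_b, X, A, Y))
    u, low = np.quantile(delta, [1 - alpha / 2, alpha / 2])
    return V_hat - u / np.sqrt(m), V_hat - low / np.sqrt(m)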
m-out-of-n cont'd
Provides valid confidence intervals under (A0)-(A3)
Proof omitted (tedious)
Can tune mn using the double bootstrap
Reliance on asymptotic tomfoolery makes me hesitant to use this in practice12
12 I did not always think this way, see Chakraborty, L., and Zhao (2014ab). Also, I should not be so dismissive of these methods. Some of the work in this area has been quite deep and produced general uniformly convergent methods. See work by Romano and colleagues.
56 / 87

Confidence interval for the optimal regime within a class
Let π∗ denote the optimal txt regime within a given class; our goal is to construct a CI for V(π∗)
Were π∗ known, one could use √n{Vn(π∗) − V(π∗)}; with your stat buddy, compute this limiting distribution
Suppose that we could construct a valid confidence region for π∗; suggest a method for a CI for V(π∗)
57 / 87

Projection interval
For any fixed π, let ζn,1−ν(π) be a (1 − ν) × 100% confidence set for V(π), e.g., using the asymptotic approximation on the previous slide
Let Dn,1−η denote a (1 − η) × 100% confidence set for π∗; then a (1 − η − ν) × 100% confidence region for V(π∗) is
⋃_{π∈Dn,1−η} ζn,1−ν(π)
Why?
58 / 87
Ex. projection interval for a linear regime
Consider regimes of the form π(x; β) = sign(xTβ); then
√n{Vn(β) − V(β)} = √n(Pn − P)[ Y 1{A XTβ > 0} / P(A|X) ] ⇝ Normal(0, σ²(β)),
so take ζn,1−ν(β) = [ Vn(β) − z_{1−ν/2} σn(β)/√n , Vn(β) + z_{1−ν/2} σn(β)/√n ]
If √n(βn − β∗) ⇝ Normal(0, Σ), take
Dn,1−η = { β : n(βn − β)T Σn^{-1} (βn − β) ≤ χ²_{p,1−η} }
Projection interval: ⋃_{β∈Dn,1−η} ζn,1−ν(β)
59 / 87
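A sketch of the projection interval for V(π∗) with linear regimes, taking the union over a random search of the Wald ellipsoid Dn,1−η (a finite search can only shrink the union, so a dense search or grid is used in practice); the estimated β, its covariance Σ, and the randomization probability P(A | X) = 1/2 are assumed inputs.

import numpy as np
from scipy.stats import chi2, norm

def value_and_se(beta, X, A, Y, p_a=0.5):
    v = Y * (A * (X @ beta) > 0) / p_a                 # IPW pseudo-observations
    return v.mean(), v.std(ddof=1) / np.sqrt(len(Y))

def projection_interval(X, A, Y, beta_hat, Sigma_hat, eta=0.025, nu=0.025,
                        n_search=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    z = norm.ppf(1 - nu / 2)
    r = np.sqrt(chi2.ppf(1 - eta, df=p))
    L = np.linalg.cholesky(Sigma_hat / n)              # Cov(beta_n) is approximately Sigma / n
    lo, hi = np.inf, -np.inf
    for _ in range(n_search):
        u = rng.normal(size=p)
        u *= rng.uniform() ** (1 / p) * r / np.linalg.norm(u)   # uniform draw from the radius-r ball
        v, se = value_and_se(beta_hat + L @ u, X, A, Y)
        lo, hi = min(lo, v - z * se), max(hi, v + z * se)
    return lo, hi                                      # approx (1 - eta - nu) x 100% interval for V(pi*)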
Quiz break!
What does the 'Q' in Q-learning stand for?
In txt regimes, which of the following is not yet a thing: A-learning, B-learning, C-learning, D-learning, E-learning?
Write down the two-stage Q-learning algorithm assuming binary treatments and linear models at each stage
True or false: I would rather have a zombie ice dragon than two live fire dragons.
The story of the Easter bunny is based on the little-known story of Jesus swapping the internal organs of chickens and rabbits to prevent a widespread famine
Q-learning has been used to obtain state-of-the-art performance in game-playing domains like chess, backgammon, and Atari
60 / 87

Inference for two-stage linear Q-learning
Learning objectives
Identify the source of nonregularity
Understand the implications for coverage and asymptotic bias
Intuition behind the bounds
Hopefully this will be trivial for you now!
61 / 87

Reminder: setup and notation
Observe {(X1,i, A1,i, X2,i, A2,i, Yi)}, i = 1, . . . , n, i.i.d. from P
X1 ∈ Rp1 : baseline subj. info.
A1 ∈ {0, 1} : first treatment
X2 ∈ Rp2 : interim subj. info. during course of A1
A2 ∈ {0, 1} : second treatment
Y ∈ R : outcome, higher is better
Define histories H1 = X1, H2 = (X1, A1, X2)
DTR π = (π1, π2) where πt : supp Ht → supp At; a patient presenting with Ht = ht is assigned treatment πt(ht)
62 / 87

Characterizing the optimal DTR
The optimal regime maximizes the value E Y∗(π)
Define Q-functions
Q2(h2, a2) = E( Y | H2 = h2, A2 = a2 )
Q1(h1, a1) = E( maxa2 Q2(H2, a2) | H1 = h1, A1 = a1 )
Dynamic programming (Bellman, 1957)
πopt_t(ht) = arg maxat Qt(ht, at)
63 / 87
Q-learning
Regression-based dynamic programming algorithm
(Q0) Postulate working models for the Q-functions: Qt(ht, at; βt) = ht,0Tβt,0 + at ht,1Tβt,1, with ht,0, ht,1 features of ht
(Q1) Compute β2 = arg minβ2 Pn{Y − Q2(H2, A2; β2)}²
(Q2) Compute β1 = arg minβ1 Pn{ maxa2 Q2(H2, a2; β2) − Q1(H1, A1; β1) }²
(Q3) πt(ht) = arg maxat Qt(ht, at; βt)
Population parameters β∗t are obtained by replacing Pn with P
Inference for β∗2 is standard, just OLS
Focus on confidence intervals for cTβ∗1 for fixed c ∈ R^{dim β∗1}
64 / 87
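A sketch of steps (Q0)-(Q2) with treatments coded in {0, 1}; the feature maps (intercepts plus raw covariates) are illustrative choices, and the estimated rules in (Q3) are π̂t(ht) = 1{ht,1T β̂t,1 > 0}.

import numpy as np

def ols(D, y):
    return np.linalg.solve(D.T @ D, D.T @ y)

def q_learning(X1, A1, X2, A2, Y):
    n = len(Y)
    # stage 2: h20 = (1, X1, A1, X2), h21 = (1, X2)
    h20 = np.column_stack([np.ones(n), X1, A1, X2])
    h21 = np.column_stack([np.ones(n), X2])
    beta2 = ols(np.column_stack([h20, A2[:, None] * h21]), Y)            # (Q1)
    b20, b21 = beta2[:h20.shape[1]], beta2[h20.shape[1]:]
    # pseudo-outcome: max over a2 in {0,1} of Q2 = h20'b20 + max(0, h21'b21)
    Ytilde = h20 @ b20 + np.maximum(h21 @ b21, 0.0)
    # stage 1: h10 = h11 = (1, X1)
    h10 = np.column_stack([np.ones(n), X1])
    beta1 = ols(np.column_stack([h10, A1[:, None] * h10]), Ytilde)       # (Q2)
    return beta1, beta2
# (Q3): pi_hat_2(h2) = 1{h21'b21 > 0}; pi_hat_1(h1) = 1{h10'beta1[2:] > 0}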
Inference for cTβ∗1
The non-smooth max operator makes β1 non-regular
The distribution of cT√n(β1 − β∗1) is sensitive to small perturbations of P
The limiting distribution does not have mean zero (asymptotic bias)
Occurs with small second-stage txt effects, H2,1Tβ∗2,1 ≈ 0
Confidence intervals based on series approximations or the bootstrap can perform poorly; proposed remedies include:
Apply shrinkage to reduce asymptotic bias
Form conservative estimates of the tail probabilities of cT√n(β1 − β∗1)
65 / 87

Characterizing asymptotic bias
Definition
For a constant c ∈ R^{dim β∗1} and a √n-consistent estimator β1 of β∗1 with √n(β1 − β∗1) ⇝ M, define the c-directional asymptotic bias
Bias(β1, c) ≜ E cTM.
66 / 87
Characterizing asymptotic bias cont'd
Theorem (Asymptotic bias of Q-learning)
Let c ∈ R^{dim β∗1} be fixed. Under moment conditions:
Bias(β1, c) = cTΣ1,∞^{-1} P[ B1 √(H2,1TΣ21,21H2,1) 1{H2,1Tβ∗2,1 = 0} ] / √(2π),
where B1 = (H1,0T, A1H1,1T)T, Σ1,∞ = P B1B1T, and Σ21,21 is the asymptotic covariance of √n(β2,1 − β∗2,1).

Asymptotic bias for Q-learning
An average of cTΣ1,∞^{-1}B1 with weights ∝ Var( H2,1Tβ2,1 1{H2,1Tβ∗2,1 = 0} | H2,1 )
May be reduced by shrinking h2,1Tβ2,1 when h2,1Tβ∗2,1 = 0
67 / 87
Reducing asymptotic bias to improve inference
Shrinkage is a popular method for reducing asymptotic bias with the goal of improving interval coverage
Chakraborty et al. (2009) apply soft-thresholding
Moodie et al. (2010) apply hard-thresholding
Goldberg et al. (2013) and Song et al. (2015) use lasso-type penalization
Shrinkage methods target
maxa2 Q2(h2, a2; β2) = h2,0Tβ2,0 + max_{a2∈{0,1}} a2 h2,1Tβ2,1 = h2,0Tβ2,0 + [h2,1Tβ2,1]+
68 / 87
Soft-thresholding (Chakraborty et al., 2009)
In Q-learning, replace maxa2 Q2(H2, a2; β2) with
H2,0Tβ2,0 + [H2,1Tβ2,1]+ { 1 − σ H2,1TΣ21,21H2,1 / ( n (H2,1Tβ2,1)² ) }+
The amount of shrinkage is governed by σ > 0
Penalization schemes (Goldberg et al., 2013; Song et al., 2015) reduce to this estimator under certain designs
No theoretical justification in Chakraborty et al. (2009) but improved coverage of bootstrap intervals in some settings
69 / 87
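A sketch of the soft-thresholded pseudo-outcome for a single subject; Sigma21 is an estimate of the asymptotic covariance of √n(β̂2,1 − β∗2,1), e.g., the corresponding block of the stage-2 sandwich covariance (an assumed input).

import numpy as np

def soft_threshold_pseudo_outcome(h20, h21, b20, b21, Sigma21, n, sigma=3.0):
    z = h21 @ b21                                  # estimated stage-2 effect h21'beta21
    var_hat = h21 @ Sigma21 @ h21 / n              # plug-in variance of h21'beta21_hat
    shrink = max(0.0, 1.0 - sigma * var_hat / z**2) if z != 0 else 0.0
    return h20 @ b20 + max(z, 0.0) * shrink        # thresholded version of h20'b20 + [h21'b21]_+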
Soft-thresholding and asymptotic bias
Theorem
Let c ∈ R^{dim β∗1} and let βσ1 denote the soft-thresholding estimator. Under moment conditions:
1. |Bias(βσ1, c)| ≤ |Bias(β1, c)| for any σ > 0.
2. If Bias(β1, c) ≠ 0, then for σ > 0
Bias(βσ1, c) / Bias(β1, c) = exp(−σ/2) − σ ∫_{√σ}^{∞} (1/x) exp(−x²/2) dx
70 / 87

Soft-thresholding and asymptotic bias cont'd
Is thresholding useful in reducing asymptotic bias?
The preceding theorem says yes, and more shrinkage is better
Chakraborty et al. suggest σ = 3, which corresponds to a 13-fold decrease in asymptotic bias
However, the preceding theorem is based on pointwise, i.e., fixed-parameter, asymptotics and may not faithfully reflect small-sample performance
71 / 87
Local generative model
Use local asymptotics to approximate the small-sample behavior of soft-thresholding
Assume:
1. For any s ∈ R^{dim β∗2,1} there exists a sequence of distributions Pn so that
∫ [ √n( dPn^{1/2} − dP^{1/2} ) − (1/2) νs dP^{1/2} ]² → 0,
for some measurable function νs.
2. β∗2,1,n = β∗2,1 + s/√n, where β∗2,n = arg minβ2 Pn{Y − Q2(H2, A2; β2)}²
72 / 87
Local asymptotics view of soft-thresholding
Theorem
Let c ∈ R^{dim β∗1} be fixed. Under the local generative model and moment conditions:
1. sup_{s∈R^{dim β∗2,1}} |Bias(β1, c)| ≤ K < ∞.
2. sup_{s∈R^{dim β∗2,1}} |Bias(βσ1, c)| → ∞ as σ → ∞.
Thresholding can be infinitely worse than doing nothing if done too aggressively in small samples
73 / 87
Data-driven tuning
Is it possible to construct a data-driven choice of σ that consistently leads to less asymptotic bias than no shrinkage?
Consider data from a two-arm randomized trial {(Ai, Yi)}, i = 1, . . . , n, with A ∈ {0, 1} and Y ∈ R coded so that higher is better13
Define µ∗a = E(Y | A = a) and µa = PnY 1{A = a} / Pn1{A = a}
Mean outcome under optimal treatment assignment: θ∗ = max(µ∗0, µ∗1), with corresponding estimator
θ = max(µ0, µ1) = µ0 + [µ1 − µ0]+
Soft-thresholding estimator:
θσ = µ0 + [µ1 − µ0]+ { 1 − 4σ / ( n(µ1 − µ0)² ) }+
13 This is equivalent to two-stage Q-learning with no covariates and a single first-stage treatment.
74 / 87
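A simulation of the toy example (my own numbers: n = 100, unit-variance outcomes) comparing the bias of θ and of the soft-thresholded θσ across true effects µ∗1 − µ∗0; it mirrors the message of the local analysis: shrinkage helps a great deal at an exact tie but can over-correct for effects of order 1/√n.

import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 20000

def biases(delta_true, sigma):
    A = rng.integers(0, 2, size=(reps, n))
    Y = delta_true * A + rng.normal(size=(reps, n))          # mu0* = 0, mu1* = delta_true
    n1 = np.maximum(A.sum(axis=1), 1)
    n0 = np.maximum((1 - A).sum(axis=1), 1)
    mu1 = (Y * A).sum(axis=1) / n1
    mu0 = (Y * (1 - A)).sum(axis=1) / n0
    d = mu1 - mu0
    theta = mu0 + np.maximum(d, 0.0)
    theta_sig = mu0 + np.maximum(d, 0.0) * np.maximum(0.0, 1.0 - 4 * sigma / (n * d**2))
    truth = max(0.0, delta_true)
    return theta.mean() - truth, theta_sig.mean() - truth

for delta in (0.0, 0.1, 0.3):
    for sigma in (1.0, 3.0):
        b, bs = biases(delta, sigma)
        print(f"delta={delta:.1f} sigma={sigma:.0f}: bias(theta)={b:+.3f}  bias(theta_sigma)={bs:+.3f}")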
Data-driven tuning: toy example
[Figure: bias of the soft-thresholded estimator as a function of the tuning parameter for several sample sizes (n = 10, n = 100, ...)]
The optimal value of σ depends on µ∗1 − µ∗0
Variability in µ1 − µ0 prevents identification of the optimal value of σ; using a plug-in estimator may lead to large bias
A data-driven σ that significantly improves asymptotic bias over no shrinkage is difficult to obtain
75 / 87
Asymptotic bias: discussion
Asymptotic bias exists in Q-learning
Local asymptotics show that aggressively shrinking to reduce asymptotic bias can be infinitely worse than no shrinkage
Data-driven tuning seems to require choosing σ very small or risking large bias
76 / 87

Confidence intervals for cTβ∗1
Possible to construct valid confidence intervals in the presence of asymptotic bias
Idea: construct regular bounds on cT√n(β1 − β∗1)
Bootstrap the bounds to form a confidence interval
Tightest among all regular bounds ⇒ automatic adaptivity
Local uniform convergence
Can also obtain conditional properties (Robins and Rotnitzky, 2014) and global uniform convergence (Wu, 2014)
77 / 87
Regular bounds on cT√n(β1 − β∗1)
Define Vn(c, γ) = cTSn + cTΣ1^{-1} Pn Un(γ), where
Sn = Σ1^{-1} √n(Pn − P) B1[ H2,0Tβ∗2,0 + (H2,1Tβ∗2,1)+ − B1Tβ∗1 ] + Σ1^{-1} √n Pn B1 H2,0T(β2,0 − β∗2,0),
Un(γ) = B1[ ( H2,1T(Zn + γ) )+ − ( H2,1Tγ )+ ]
Sn is smooth and Un(γ) is non-smooth
78 / 87

Regular bounds on cT√n(β1 − β∗1) cont'd
It can be shown that cT√n(β1 − β∗1) = Vn(c, β∗2,1)
Use pretesting to construct an upper bound:
Un(c) = cTSn + cTΣ1^{-1} Pn[ Un(β∗2,1) 1{Tn(H2,1) > λn} ] + sup_γ cTΣ1^{-1} Pn[ Un(γ) 1{Tn(H2,1) ≤ λn} ],
where Tn(h2,1) is a test statistic for the null h2,1Tβ∗2,1 = 0 and λn is a critical value
The lower bound, Ln(c), is obtained by taking an inf
Bootstrap the bounds to find a confidence interval
79 / 87

Validity of the bounds
Theorem
Let c ∈ R^{dim β∗1} be fixed. Assume the local generative model; under moment conditions and conditions on the pretest:
1. cT√n(β1 − β∗1,n) ⇝ cTS∞ + cTΣ1,∞^{-1} P[ B1 H2,1TZ∞ 1{H2,1Tβ∗2,1 > 0} ] + cTΣ1,∞^{-1} P[ B1 { (H2,1T(Z∞ + s))+ − (H2,1Ts)+ } 1{H2,1Tβ∗2,1 = 0} ]
2. Un(c) ⇝ cTS∞ + cTΣ1,∞^{-1} P[ B1 H2,1TZ∞ 1{H2,1Tβ∗2,1 > 0} ] + sup_γ cTΣ1,∞^{-1} P[ B1 { (H2,1T(Z∞ + γ))+ − (H2,1Tγ)+ } 1{H2,1Tβ∗2,1 = 0} ]
80 / 87
Validity of the bootstrap bounds
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗1}. Let ℓ and u denote the (α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap distribution of the bounds, and let PM denote the probability with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any ε > 0:
P( PM{ cTβ1 − u/√n ≤ cTβ∗1 ≤ cTβ1 − ℓ/√n } < 1 − α − ε ) = o(1).
81 / 87

Uniform validity of the bootstrap bounds (Tianshuang Wu)
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗1}. Let ℓ and u denote the (α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap distribution of the bounds, and let PM denote the probability with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any ε > 0:
sup_{P∈𝒫} P( PM{ cTβ1 − u/√n ≤ cTβ∗1 ≤ cTβ1 − ℓ/√n } < 1 − α − ε )
converges to zero for a large class of distributions 𝒫.
82 / 87
Simulation experiments
Class of generative models
Xt ∈ {−1, 1}, At ∈ {−1, 1}, t ∈ {1, 2}
P(At = 1) = P(At = −1) = 0.5, t ∈ {1, 2}
X1 ∼ Bernoulli(0.5)
X2 | X1, A1 ∼ Bernoulli{expit(δ1X1 + δ2A1)}
ε ∼ N(0, 1)
Y = γ1 + γ2X1 + γ3A1 + γ4X1A1 + γ5A2 + γ6X2A2 + γ7A1A2 + ε
Vary the parameters to obtain a range of effect sizes; classify generative models as
Non-regular (NR)
Nearly non-regular (NNR)
Regular (R)
83 / 87

Simulation experiments cont'd
Compare the bound-based confidence interval (ACI) with the bootstrap (BOOT) and bootstrap thresholding (THRESH)
Compare in terms of width and coverage (target 95%)
Results based on 1000 Monte Carlo replications with datasets of size n = 150
Bootstrap computed with 1000 resamples
Tuning parameter λn chosen with the double bootstrap
84 / 87
Simulation experiments: results
Coverage (target 95%)
Method   Ex. 1 NNR  Ex. 2 NR  Ex. 3 NNR  Ex. 4 R  Ex. 5 NR  Ex. 6 NNR
BOOT     0.935*     0.930*    0.933*     0.928*   0.925*    0.928*
THRESH   0.945      0.938     0.942      0.943    0.759*    0.762*
ACI      0.971      0.958     0.961      0.943    0.953     0.953

Average width
Method   Ex. 1 NNR  Ex. 2 NR  Ex. 3 NNR  Ex. 4 R  Ex. 5 NR  Ex. 6 NNR
BOOT     0.385*     0.430*    0.430*     0.436*   0.428*    0.428*
THRESH   0.339      0.426     0.427      0.436    0.426*    0.424*
ACI      0.441      0.470     0.470      0.469    0.473     0.473
85 / 87
Ex. DTR for ADHD without uncertainty
[Flowchart: the estimated regime tailors the initial treatment (low-dose MEDS vs. low-dose BMOD) on prior medication; responders continue their initial treatment; non-responders are assigned to intensify or augment treatment depending on adherence]
86 / 87

Ex. DTR for ADHD with uncertainty
[Flowchart: as above, but nodes where the data cannot distinguish between options are displayed as sets, e.g., "Low dose MEDS ∼OR∼ BMOD" and "Add OTHER ∼OR∼ Intensify SAME"]
87 / 87