Intro to nonsmooth inference
Eric B. Laber
Department of Statistics, North Carolina State University
April 2019
SAMSI
Last time
SMARTs gold standard for est and eval of txt regimes
Highly configurable but choices driven by science
Looked at examples with varying scientific/clinical goals, which
led to different timing, txt options, response criteria, etc.
Often powered by simple comparisons
First-stage response rates
Fixed regimes (most- vs. least-intensive)
First stage txts (problematic)
If test statistic is regular and asymptotically normal under null
can use same basic template for power
1 / 87
Quick SMART review
[SMART schematic (figure): participants are randomized (R) to PCST-Full (Treatment 0) or PCST-Brief (Treatment 1); response status is then assessed, and responders and non-responders to each first-stage option are re-randomized among second-stage options: PCST-Full maintenance (Treatment 2), no further treatment/intervention (Treatment 3), PCST-Plus (Treatment 4), PCST-Brief maintenance (Treatment 5), or PCST-Full (Treatment 0).]
2 / 87
Refresher
Suppose that researchers are interested in comparing the
embedded regimes:
(e1) assign PCST-Full initially, assign PCST-Full maintenance to
responders, and assign PCST-Plus to non-responders;
(e2) assign PCST-Brief initially, assign no further intervention to
responders, and assign PCST-Brief maintenance to non-responders.
Recall our general template:
Test statistic: Tn = Vn(e1) − Vn(e2), where Vn is the IPWE
Use that √n Tn/σ̂_{e1,e2,n} is asy normal under the null and reject when it is large in
magnitude
3 / 87
Goals for today
Introduction to inference for txt regimes
Nonregular inference (and why we should care)
Basic strategies with a toy problem
Examples in one-stage problems
4 / 87
Warm up part I: quiz!
Discuss with your stat buddy:
What are some common scenarios where series approx or the
bootstrap cannot ensure correct op characteristics?
What is a local alternative?
How do we know if an asymptotic approx is adequate?
True or false
If n is large asymptotic approximations can be trusted.
The top review of CLT on yelp complains about the burritos
being too expensive.
The BBC produced a Hitler-themed sitcom titled ‘Heil Honey,
I’m home’ in the 1950s.
5 / 87
On reality and fantasy
Your cat didn’t say that. You know how I know? It’s a
cat. It doesn’t talk. If you died, it would eat you. Starting
with your face.
– Matt Zabka, recently single
6 / 87
Asymptotic approximations
Basic idea: study behavior of statistical procedure in terms of
dominating features while ignoring lower order ones
Often, but not always, consider diverging sample size
‘Dominating features’ intentionally ambiguous
Generate new insights and general statistical procedures as
large classes of problems share same dominating features
Asymptotics mustn’t be applied mindlessly
Disgusting trend in statistics: propose method, push through
irrelevant asymptotics, handpick simulation experiments
Require careful thought about what op characteristics are
needed scientifically and how to ensure these hold with the
kind of data that are likely to be observed
No panacea ⇒ handcrafted construction and evaluation
7 / 87
Inferential questions in precision medicine
Identify key tailoring variables
Evaluate performance of true optimal regime
Evaluate performance of estimated optimal regime
Compare performance of two+ (possibly data-driven) regimes
. . .
8 / 87
Toy problem: max of means
Simple problem that retains many of the salient features of
inference for txt regimes
Non-smooth function of smooth functionals
Well-studied in the literature
Basic notation
For Z1, . . . , Zn ∼i.i.d. P comprising ind copies of Z ∼ P, write
Pf(Z) = ∫ f(z)dP(z) and ℙnf(Z) = n^{-1} ∑_{i=1}^{n} f(Zi)
Use '⇝' to denote convergence in distribution
Check: assuming requisite moments exist: √n(ℙn − P)Z ⇝ ????
9 / 87
Max of means
Observe X1, . . . , Xn ∼i.i.d. P in R^p with µ0 = PX; define
θ0 = max_{j=1,...,p} µ0,j = max(µ0,1, . . . , µ0,p)
While we consider this estimand primarily for illustration, it
corresponds to the problem of estimating the mean outcome
under an optimal one-size-fits-all treatment recommendation,
where µ0,j is the mean outcome under treatment j = 1, . . . , p.
10 / 87
Max of means: estimation
Define µn = ℙnX; the plug-in estimator of θ0 is
θn = max_{j=1,...,p} µn,j
Warm-up:
Three minutes trying to derive the limiting distn of √n(θn − θ0)
Three minutes discussing soln with your stat buddy
11 / 87
Max of means: first result
For v ∈ R^p define U(v) = arg max_j v_j
Lemma
Assume regularity conditions under which √n(ℙn − P)X is
asymptotically normal with mean zero and variance-covariance
matrix Σ. Then
√n(θn − θ0) ⇝ max_{j∈U(µ0)} Z_j,
where Z ∼ Normal(0, Σ).
12 / 87
Max of means: proof of first result
13 / 87
Extra page if needed
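One possible sketch of the argument, written out here for reference (an assumption on my part; it need not match the in-class derivation):

\begin{align*}
\sqrt{n}(\theta_n - \theta_0)
  &= \max_{j=1,\dots,p} \sqrt{n}\bigl\{(\mu_{n,j}-\mu_{0,j}) + (\mu_{0,j}-\theta_0)\bigr\}.
\end{align*}
% For j not in U(mu_0), mu_{0,j} - theta_0 < 0 is fixed, so sqrt(n)(mu_{0,j} - theta_0) -> -infinity
% while sqrt(n)(mu_{n,j} - mu_{0,j}) = O_P(1); these indices drop out of the max with
% probability tending to one. For j in U(mu_0), mu_{0,j} = theta_0, leaving
\begin{align*}
\sqrt{n}(\theta_n - \theta_0)
  &= \max_{j\in U(\mu_0)} \sqrt{n}(\mu_{n,j}-\mu_{0,j}) + o_P(1)
   \;\rightsquigarrow\; \max_{j\in U(\mu_0)} Z_j,
\end{align*}
% by asymptotic normality of sqrt(n)(P_n - P)X and the continuous mapping theorem.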
Max of means: discussion of first result
Limiting distribution of √n(θn − θ0) depends abruptly on µ0
If µ0 = (0, 0)^T and Σ = I_2, the limiting distn is the max
of two ind std normals
If µ0 = (0, ε)^T for ε > 0 and Σ = I_2, the limiting distn is std
normal even if ε = 1×10^{−27}!!
How can we use such an asymptotic result in practice?!
Limiting distn of √n(θn − θ0) depends only on the submatrix of Σ
cor. to elements of U(µ0). What about in finite samples?
14 / 87
Max of means: discussion of first result cont’d
Suppose X1, . . . , Xn ∼i.i.d. Normal(µ0, I_p) and µ0 has a
unique maximizer, i.e., U(µ0) is a singleton, say U(µ0) = {1}
√n(θn − θ0) ⇝ Normal(0, 1)
P{√n(θn − θ0) ≤ t} = Φ(t) ∏_{j=2}^{p} Φ(t + √n(θ0 − µ0,j))
Quick break: derive this.
If the gaps θ0 − µ0,j are small relative to 1/√n, the finite sample
behavior can be quite different from the limit Φ(t)1
1
Note that the limiting distribution doesn’t depend on these gaps at all!
15 / 87
Max of means: normal approximation in pictures
Generate data from Normal(µ, I6) with µ1 = 2 and
µj = µ1 − δ for j = 2, . . . , 6. Results shown for n = 100.
[Figure: estimated densities of √n(θn − θ0) for δ = 0.5, δ = 0.1, and δ = 0.01 (x-axis from −4 to 4).]
16 / 87
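A short simulation sketch of the setting above (numpy assumed; the printed summary is only illustrative), comparing the sampling distribution of √n(θn − θ0) with its nominal Normal(0, 1) limit:

import numpy as np

rng = np.random.default_rng(0)
n, p, n_sims = 100, 6, 5000

for delta in (0.5, 0.1, 0.01):
    mu = np.full(p, 2.0 - delta); mu[0] = 2.0        # mu_1 = 2 is the unique maximizer
    theta0 = mu.max()
    X = rng.normal(mu, 1.0, size=(n_sims, n, p))     # n_sims data sets of size n
    theta_n = X.mean(axis=1).max(axis=1)             # plug-in estimator for each data set
    root_n_err = np.sqrt(n) * (theta_n - theta0)
    # Under the Normal(0,1) limit these should be near 0 and 1; they drift as delta shrinks.
    print(f"delta={delta:5.2f}  mean={root_n_err.mean():5.2f}  sd={root_n_err.std():4.2f}")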
Choosing the right asymptotic framework
Dangerous pattern of thinking:
In practice, none of the txt effect differences are zero.
I’ll build my asy approximations assuming a unique maximizer.
There are finitely many components, so the maximizer is
well-separated.
Idea! Plug in the estimated maximizer and use the asy normal approx.
The preceding pattern happens frequently, e.g., the oracle property in
model selection, max eigenvalues of matrices, and txt regimes
17 / 87
Choosing the right asymptotic framework
What goes wrong? After all, this thinking works well in many
other settings, e.g., everything you learned in stat 101.
Finite sample behavior driven by small (not necessarily zero)
differences in txt effectiveness
We saw this analytically in normal case
Intuition helped by thinking in extremes, e.g., what if all txts
were equal? What if one were infinitely better than the others?
Abrupt dependence of the limiting distribution on U(µ0) is a
red flag. It is tempting to construct procedures that will recover
this limiting distn even if some txt differences are exactly zero.
This is asymptotics for asymptotics' sake. Don't do it.
18 / 87
Asymptotic working assumptions
A useful asy approximation should be robust to the setting
where some (all) txt differences are zero
Necessary but not sufficient
Heuristic: in small samples, one cannot distinguish between
small (but nonzero) txt differences so use an asy framework
which allows for exact equality.
This heuristic has been misinterpreted and misused in lit.
Some procedures we’ll look at are designed for such robustness
19 / 87
Local asymptotics: horseshoes and hand grenades
Allowing null txt differences problematic
Asymptotically, differences either zero or infinite2
Txt differences are (probably) not exactly zero
Challenge: allow small differences to persist as n diverges
Local or moving parameter asy framework does this
Idea: allow the gen model to change with n so that the gaps
θ0 − max_{j∉U(µ0)} µ0,j shrink to zero as n increases3
2
In the stat sense that we have power one to discriminate between them.
3
This idea should be familiar from hypothesis testing.
20 / 87
Triangular arrays
For each n, X1,n, . . . , Xn,n ∼i.i.d. Pn
Observations Distribution
X1,1 P1
X1,2 X2,2 P2
X1,3 X2,3 X3,3 P3
X1,4 X2,4 X3,4 X4,4 P4
... ... ...
Define µ0,n = PnX and θ0,n = max_{j=1,...,p} µ0,n,j
Assume µ0,n = µ0 + s/√n, where s ∈ R^p is called the local parameter
Assume √n(ℙn − Pn)X ⇝ Normal(0, Σ)4
4
This is true under very mild conditions on the sequence of distributions
{Pn}n≥1. However, given our limited time we will not discuss such conditions.
See van der Vaart and Wellner (1996) for details.
21 / 87
Quick quiz
Suppose that X1,n, . . . , Xn,n ∼i.i.d. Normal(µ0 + s/√n, Σ);
what is the distribution of √n(ℙn − Pn)X?
22 / 87
Local alternatives anticipate unstable performance
Lemma
Let s ∈ R^p be fixed. Assume that for each n we observe {Xi,n}_{i=1}^{n}
drawn i.i.d. from Pn which satisfies: (i) PnX = µ0 + s/√n, and
(ii) √n(ℙn − Pn)X ⇝ Normal(0, Σ). Then, under Pn,
√n(θn − θ0,n) ⇝ max_{j∈U(µ0)} (Z_j + s_j) − max_{j∈U(µ0)} s_j,
where Z ∼ Normal(0, Σ).
Discussion/observations on the local limiting distn
Dependence of the limiting distn on s ⇒ nonregular
The set U(µ0) represents a set of near-maximizers, though s_j = 0
corresponds to exact equality (so we haven't ruled this out)
23 / 87
Proof of local limiting distribution
24 / 87
Extra page if needed
Intermission: more regularity after this
Comments on nonregularity
Sensitivity of estimator to local alternatives cannot be
rectified through the choice of a more clever estimator
Inherent property of the estimand5
This has not stopped some from trying...
Remainder of today’s notes: cataloging of confidence intervals
5
See van der Vaart (1991), Hirano and Porter (2012), and L. et al. (2011,
2014, 2019)
25 / 87
Projection region
Idea: exploit the following two facts
µn is nicely behaved (reg. asy normal)
If µ0 were known this would be trivial6
Given an acceptable error level α ∈ (0, 1), let ζn,1−α denote a
confidence region for µ0, e.g.,
ζn,1−α = {µ ∈ R^p : n(µn − µ)^T Σn^{−1} (µn − µ) ≤ χ²_{p,1−α}},
where Σn = ℙn(X − µn)(X − µn)^T
Projection CI:
Γn,1−α = {θ ∈ R : θ = max_{j=1,...,p} µ_j for some µ ∈ ζn,1−α}
6
In this problem, θ0 is a function of µ0 and is thus completely known when
µ0 is known. In more complicated problems, knowing the value of a nuisance
parameter will make the inference problem of interest regular.
26 / 87
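A numerical sketch of this projection interval for the max-of-means problem (numpy/scipy assumed; `projection_ci` is an illustrative helper, not a library routine, and the lower endpoint is found with a generic solver rather than in closed form):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def projection_ci(X, alpha=0.05):
    # Projection CI for theta0 = max_j mu_{0,j}.
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)
    Sig_inv = np.linalg.inv(Sigma_hat)
    c = chi2.ppf(1 - alpha, df=p)
    in_ellipsoid = lambda mu: c - n * (mu_hat - mu) @ Sig_inv @ (mu_hat - mu)

    # Upper endpoint: sup over the ellipsoid of max_j mu_j has a closed form.
    upper = np.max(mu_hat + np.sqrt(c * np.diag(Sigma_hat) / n))

    # Lower endpoint: minimize t subject to mu_j <= t for all j and mu in the ellipsoid.
    z0 = np.append(mu_hat, mu_hat.max())
    cons = [{"type": "ineq", "fun": lambda z: z[-1] - z[:-1]},
            {"type": "ineq", "fun": lambda z: in_ellipsoid(z[:-1])}]
    res = minimize(lambda z: z[-1], z0, method="SLSQP", constraints=cons)
    lower = res.x[-1]
    return lower, upper

X = np.random.default_rng(1).normal([0.0, 0.1, 0.1], 1.0, size=(200, 3))
print(projection_ci(X))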
Prove the following with your stat buddy
P(θ0 ∈ Γn,1−α) ≥ 1 − α + o(1)
27 / 87
Comments on projection regions
Useful when parameter of interest is a non-smooth functional
of a smooth (regular) parameter
Robust and widely applicable but conservative
Projection interval valid under local alternatives (why?)
Can reduce conservatism using pre-test (L. et al., 2014)
Berger and Boos (1991) and Robins (2004) for seminal papers
Consider as a first option in new non-reg problem
28 / 87
Bound-based confidence intervals
Idea: sandwich the non-smooth functional between smooth upper
and lower bounds, then bootstrap the bounds to form a conf region
Let {τn}n≥1 be a seq of pos constants such that τn → ∞ and
τn = o(√n) as n → ∞; define
Un(µ0) = {j : max_k √n(µn,k − µn,j)/σ_{j,k,n} ≤ τn},
where σ_{j,k,n} is an est of the asy variance of µn,k − µn,j.
Note* It may help to think of Un as the indices of txts that
we cannot distinguish from being optimal.
29 / 87
Bound-based confidence intervals cont’d
Given Un(µ0), define
Sn(µ0) = {s ∈ R^p : s_j = µ0,j if j ∈ Un(µ0)};
then it follows that
Un = sup_{s∈Sn(µ0)} √n { max_{j=1,...,p}(µn,j − µ0,j + s_j) − max_{j=1,...,p} s_j }
is an upper bound on √n(θn − θ0). (Why?) A lower bound,
Ln, is constructed by replacing the sup with an inf.
30 / 87
Dad, where do bounds come from?
Un obtained by taking a sup over all local, i.e., order 1/√n,
perturbations of the generative model
By construction, insensitive to local perturbations ⇒ regular
Un(µ0) is a conservative est of U(µ0); let's wave our hands:
31 / 87
Bootstrapping the bounds
Both Un and Ln are regular and their distns can be consistently
estimated via the nonpar bootstrap
Let u^{(b)}_{n,1−α/2} be the (1 − α/2) × 100 perc of the bootstrap distn of Un
and ℓ^{(b)}_{n,α/2} the (α/2) × 100 perc of the bootstrap distn of Ln
Bound-based confidence interval:
[ θn − u^{(b)}_{n,1−α/2}/√n,  θn − ℓ^{(b)}_{n,α/2}/√n ]
32 / 87
Bound-based intervals discussion
General approach; applies to implicitly defined estimators as well
as those with closed-form expressions like the one considered here
Less conservative than the projection interval but still
conservative; such conservatism is unavoidable
Bounds are tightest in some sense
Bounding quantiles directly rather than estimand may reduce
conservatism though possibly at price of addl complexity
See Fan et al. (2017) for other improvements/refinements
33 / 87
Bootstrap methods
Bootstrap is not consistent without modification
Due to instability (nonregularity)
Nondifferentiability of max operator causes this instability (see
Shao 1994 for a nice review)
Bootstrap is appealing for complex problems
Doesn’t require explicitly computing asy approximations7
Higher order convergence properties
7
There are exceptions to this, including parametric bootstrap and those
based on quadratic expansions
34 / 87
How about some witchcraft?
m-out-of-n bootstrap can be used to create valid confidence
intervals for non-smooth functionals
Idea: resample datasets of size mn = o(n) so sample-level
parameters converge ‘faster’ than bootstrap analogs
(i.e., witchcraft)
35 / 87
m-out-of-n bootstrap
Accepting some components of witchcraft on faith
√mn (µ^{(b)}_{mn} − µn) ⇝ Normal(0, Σ) conditional on the data
See Arcones and Gine (1989) for details
An even toyier example than our toy example: W1, . . . , Wn
i.i.d. w/ mean µ and variance σ²; derive the limit distns of
√n(|W̄n| − |µ|) and √mn(|W̄^{(b)}_{mn}| − |W̄n|)
36 / 87
m-out-of-n bootstrap with max of means
Derive the limiting distribution of √mn(θ^{(b)}_{mn} − θn):
37 / 87
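A minimal m-out-of-n bootstrap sketch for the max-of-means problem (numpy assumed; the resample-size rule m = ⌊n^0.8⌋ is an arbitrary illustrative choice, and the interval follows the /√mn convention used later in these slides):

import numpy as np

def m_out_of_n_ci(X, alpha=0.05, n_boot=2000, rng=None):
    # Percentile-type m-out-of-n bootstrap CI for theta0 = max_j mu_{0,j}.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = int(np.floor(n ** 0.8))                      # resample size m_n = o(n)
    theta_n = X.mean(axis=0).max()
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=m)             # sample m rows with replacement
        theta_b = X[idx].mean(axis=0).max()
        draws[b] = np.sqrt(m) * (theta_b - theta_n)  # bootstrap analog of sqrt(n)(theta_n - theta_0)
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return theta_n - hi / np.sqrt(m), theta_n - lo / np.sqrt(m)

X = np.random.default_rng(2).normal([0.0, 0.0, 0.3], 1.0, size=(250, 3))
print(m_out_of_n_ci(X))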
Extra page if needed
Intermission
I wouldn’t want to wind up hooked to a bunch of wires and
tubes, unless somehow the wires and tubes were keeping
me alive. —Don Alden Adams
Finally! Back to treatment regimes (briefly)
Consider a one-stage problem with observed data
{(Xi, Ai, Yi)}_{i=1}^{n}, where X ∈ R^p, A ∈ {−1, 1}, and Y ∈ R
Assume requisite causal conditions hold
Assume linear rules π(x) = sign(x^T β), where β ∈ R^p, and x
might contain polynomial terms etc.
38 / 87
Warm-up! Derive limiting distn of parameters in
linear Q-learning!!!!8
Posit the linear model Q(x, a; β) = x_0^T β_0 + a x_1^T β_1, indexed by
β = (β_0^T, β_1^T)^T, with x_0, x_1 known features.
βn = arg min_β ℙn{Y − Q(X, A; β)}²  and  β∗ = arg min_β P{Y − Q(X, A; β)}²
Derive the limiting distribution of √n(βn − β∗)
Construct a confidence interval for Q(x, a) assuming
Q(x, a) = Q(x, a; β∗)
8
He exclaimed.
39 / 87
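A sketch of the warm-up in code (numpy/scipy assumed; function names are illustrative, not from any package): fit the linear Q-function by OLS and form a Wald interval for Q(x, a; β∗) using a sandwich variance estimate.

import numpy as np
from scipy.stats import norm

def fit_linear_q(X0, X1, A, Y):
    # Design phi(x, a) = (x0, a * x1); OLS estimate of beta and its sandwich covariance.
    Phi = np.hstack([X0, A[:, None] * X1])
    n = len(Y)
    beta = np.linalg.lstsq(Phi, Y, rcond=None)[0]
    resid = Y - Phi @ beta
    bread = np.linalg.inv(Phi.T @ Phi / n)
    meat = (Phi * resid[:, None] ** 2).T @ Phi / n
    cov_beta = bread @ meat @ bread / n              # estimated Cov(beta_n)
    return beta, cov_beta

def q_ci(x0, x1, a, beta, cov_beta, alpha=0.05):
    # Wald interval for Q(x, a; beta*) = phi(x, a)' beta*.
    phi = np.concatenate([x0, a * x1])
    est = phi @ beta
    se = np.sqrt(phi @ cov_beta @ phi)
    z = norm.ppf(1 - alpha / 2)
    return est - z * se, est + z * se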
Extra page if needed
Extra page if needed
Parameters in (1-stage) Q-learning are easy!
Similar arguments show that the coefficients indexing
g-computation and outcome weighted learning are asy normal
Preview: consider the (regression-based) estimator of the
value of πn(x) = sign(x_1^T β_{1,n}), which you'll recall is
Vn(β_{1,n}) = ℙn max_a Q(X, a; βn) = ℙn X_0^T β_{0,n} + ℙn |X_1^T β_{1,n}|
What is the limit of √n{Vn(β_{1,n}) − V(β_{1,n})} and can we use it
to derive a CI for V(βn)? What about a CI for V(β∗)?
40 / 87
Parameters in (1-stage) OWL are easy!
To illustrate, assume P(A = 1|X) = P(A = −1|X) = 1/2 wp1
Recall OWL is based on a cvx relaxation of the IPWE
Vn(β) = ℙn [ Y 1{A = sign(X^T β)} / P(A|X) ] = 2 ℙn Y 1{A sign(X^T β) > 0}
Let ℓ : R → R be cvx; the OWL estimator is
βn = arg min_{β∈R^p} ℙn |Y| ℓ(W^T β),
where W = sign(Y) A X
41 / 87
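A hedged sketch of a convex-surrogate OWL fit in this spirit, using a logistic surrogate ℓ(u) = log(1 + e^{−u}) (one convex choice among many; scipy assumed, 1:1 randomization as on the slide, and the data-generating lines are purely illustrative):

import numpy as np
from scipy.optimize import minimize

def owl_logistic(X, A, Y):
    # Minimize P_n |Y| * ell(W' beta) with W = sign(Y) * A * X and a logistic surrogate.
    W = (np.sign(Y) * A)[:, None] * X
    w_abs = np.abs(Y)

    def objective(beta):
        u = W @ beta
        return np.mean(w_abs * np.logaddexp(0.0, -u))   # ell(u) = log(1 + exp(-u))

    res = minimize(objective, x0=np.zeros(X.shape[1]), method="BFGS")
    return res.x

rng = np.random.default_rng(3)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])  # include an intercept column
A = rng.choice([-1, 1], size=n)
Y = 1.0 + 0.8 * A * X[:, 1] + rng.normal(size=n)           # txt effect depends on X[:, 1]
beta_owl = owl_logistic(X, A, Y)
print(beta_owl, "decision rule: sign(x' beta)")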
Extra page if needed
Some facts about OWL (and more generally
convex M-estimators)
The map β ↦ |y| ℓ(w^T β) is the composition of a linear and a cvx function and
is thus cvx in β for each (y, w) ⇒ greatly simplifies inference!
Regularity conditions
β∗ = arg min_β P|Y| ℓ(W^T β) exists and is unique
The map β ↦ P|Y| ℓ(W^T β) is differentiable in a nbrhd of β∗9
Under these conditions, √n(βn − β∗) is regular10 and
asymptotically normal ⇒ many results from Q-learning port
9
More formally, require
|y| ℓ{w^T(β∗ + δ)} − |y| ℓ(w^T β∗) = S(y, w; β∗)^T δ + R(y, w, δ; β∗) where
PS(Y, W; β∗) = 0, Σ_O = PS(Y, W; β∗)S(Y, W; β∗)^T is finite, and
PR(Y, W, δ; β∗) = (1/2)δ^T Ω_O δ + o(||δ||²) (Haberman, 1989; Niemiro, 1992;
Hjort and Pollard, 2011).
10
I am being a bit loose with language in this course by referring to both
estimands and rescaled estimators as ‘regular’ or ‘non-regular.’
42 / 87
Value function(s)
Three ways to measure performance
Conditional value: V(πn) = PY*(πn) = E{Y*(πn) | πn},
measures the performance of an estimated decision rule as if it
were to be deployed in the popn (note* this is a random variable)
Unconditional value: Vn = E V(πn), measures the average
performance of the algorithm used to construct πn with a sample
of size n
Population-level value: V(π∗), where π∗(x) = sign(x^T β∗),
measures the potential of applying a precision medicine strategy
in a given domain if the algorithm for constructing πn will be used
Discuss these measures with your stat buddy. Is there a
meaningful distinction as the sample size grows large?
43 / 87
It’s a wacky world out there
The three value measures need not coincide asymptotically
Let πn(x) = sign(x^T βn) and suppose √n βn ⇝ Normal(0, Σ),
so that β∗ ≡ 0 and π∗(x) ≡ −1. With your stat buddy, compute:
V(βn) ⇝ ????
Vn = E V(βn) → ????
V(β∗) = ????
44 / 87
Calculon!
Extra page if needed
Have some confidence you useless pile!
We’ll construct confidence sets for V(βn) and V(β∗) as these
are most commonly of interest in application
Starting with the conditional value fn, assume that the
data-generating model is a triangular array Pn such that:
(A0) πn(x) = sign(x^T βn)
(A1) ∃ β∗_n s.t. β∗_n = β∗ + s/√n for some s ∈ R^p and
√n(βn − β∗_n) = √n(ℙn − Pn)u(X, A, Y) + o_{Pn}(1), where u
does not depend on s, sup_n Pn||u(X, A, Y)||² < ∞, and
Cov{u(X, A, Y)} is p.d.
(A2) If F is a uniformly bounded Donsker class and
√n(ℙn − P) ⇝ T in ℓ^∞(F) under P, then
√n(ℙn − Pn) ⇝ T in ℓ^∞(F) under Pn.
(A3) sup_n Pn||Y||² < ∞.
Detailed discussion is beyond the scope of this class. Laber
will wave his hands a bit. Our goal is to understand the key ideas.
45 / 87
Building block: joint distribution before
nonsmooth operator
Define the class of functions
G = { g(X, A, Y; δ) = Y 1{A X^T δ > 0} 1{X^T β∗ = 0} : δ ∈ R^p }
and view √n(ℙn − Pn), indexed by G, as a random element of ℓ^∞(R^p).
Lemma
Assume (A0)-(A3). Then
√n ( ℙn − Pn,  βn − β∗_n,  (ℙn − Pn)Y 1{A X^T β∗ > 0} ) ⇝ ( T, Z, W )
in ℓ^∞(R^p) × R^p × R under Pn.
46 / 87
Limiting distn of V (βn)
Corollary
Assume (A0)-(A3). Then,
√n{Vn(βn) − V(βn)} ⇝ T(Z + s) + W.
Notes
Presence of s shows this is nonregular
T is a Brownian bridge indexed by R^p
W and Z are normal
47 / 87
Hand-waving!
Extra page if needed
Bound-based confidence interval
Limiting distribution: T(Z + s) + W
The local parameter only appears in the first term
(Asy) bound should only affect this term
Schematic for constructing a bound:
Partition the input space into points that are ‘near’ the decision
boundary x^T β∗ = 0 vs. those that are ‘far’ from the boundary
Take a sup/inf over local perturbations of the points in the ‘near’ group
48 / 87
Upper bound
Let Σn be an estimator of the asy var of βn; an upper bound on
√n{Vn(βn) − V(βn)} is
Un = sup_{ω∈R^p} √n(ℙn − Pn) Y 1{A X^T ω > 0} 1{ n(X^T βn)² / (X^T Σn X) ≤ τn }
   + √n(ℙn − Pn) Y 1{A X^T βn > 0} 1{ n(X^T βn)² / (X^T Σn X) > τn },
where τn is a seq of tuning parameters s.t. τn → ∞ and
τn = o(n) as n → ∞. The lower bound is constructed by replacing the
sup with an inf.
49 / 87
Limiting distribution of bounds
Theorem
Assume (A0)-(A3). Then
(Ln, Un) ⇝ ( inf_{ω∈R^p} T(ω) + W,  sup_{ω∈R^p} T(ω) + W )
under Pn.
Recall the limit distn of √n{Vn(βn) − V(βn)} is T(Z + s) + W
Bounds are equiv to a sup/inf over local perturbations
If all subjects have large txt effects, the bounds are tight
Bootstrap the bounds to construct a confidence interval; theoretical
results for the bootstrap bounds are given in the book
50 / 87
Note on tuning
Seq {τn}n≥1 can affect finite sample performance
Idea: tune using double bootstrap, i.e., bootstrap the
bootstrap samples to estimate coverage and adapt τn
Double bootstrap considered computationally expensive but
not much of a burden in most problems with modern
computing infrastructure
Tuning can be done without affecting theoretical results
51 / 87
Algy the friendly tuning algorithm
Alg. 1.5: Tuning the critical value τn using the double bootstrap
Input: {(Xi, Ai, Yi)}_{i=1}^{n}, M, α ∈ (0, 1), candidate values τn^{(1)}, . . . , τn^{(L)}
1  V = Vn(dn)
2  for j = 1, . . . , L do
3      c(j) = 0
4      for b = 1, . . . , M do
5          Draw a sample of size n, say Sn^{(b)}, from {(Xi, Ai, Yi)}_{i=1}^{n} with replacement
6          Compute the bound-based confidence set, ζ^{(b)}, using sample Sn^{(b)} and critical value τn^{(j)}
7          if V ∈ ζ^{(b)} then c(j) = c(j) + 1
8      end
9  end
10 Set j∗ = arg min_{j : c(j) ≥ M(1−α)} c(j)
Output: Return τn^{(j∗)}
52 / 87
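A generic Python sketch of Alg. 1.5 (numpy assumed). Here `estimate_fn` and `ci_fn` are user-supplied placeholders, not library functions: `ci_fn(data, tau)` should return the bound-based confidence set computed on `data` with critical value `tau`.

import numpy as np

def tune_tau_double_bootstrap(data, estimate_fn, ci_fn, taus, M=200, alpha=0.05, rng=None):
    # data: array of rows (x_i, a_i, y_i); estimate_fn(data) -> point estimate V.
    rng = np.random.default_rng(rng)
    n = len(data)
    V_hat = estimate_fn(data)
    coverage = []
    for tau in taus:
        c = 0
        for _ in range(M):
            boot = data[rng.integers(0, n, size=n)]   # first-level bootstrap sample
            lo, hi = ci_fn(boot, tau)                 # CI built on the bootstrap sample
            c += (lo <= V_hat <= hi)                  # does it cover the full-sample estimate?
        coverage.append(c)
    # Smallest estimated coverage among taus meeting the nominal level (Alg. 1.5, line 10);
    # fall back to the best-covering tau if none meets it.
    ok = [j for j, c in enumerate(coverage) if c >= M * (1 - alpha)]
    j_star = min(ok, key=lambda j: coverage[j]) if ok else int(np.argmax(coverage))
    return taus[j_star]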
Intermission
If any man says he hates war more than I do, he better
have a knife, that’s all I have to say. –Gandhi
53 / 87
m-out-of-n bootstrap
Bound-based intervals are complex (conceptually and technically)
Subsampling is easier to implement and understand11
Does not require specialized code etc.
Let mn be the resample size s.t. mn → ∞ and mn = o(n), and
let ℙ^{(b)}_{mn} be the bootstrap empirical distn
Approximate √n{Vn(βn) − V(βn)} with its bootstrap analog
√mn{V^{(b)}_{mn}(β^{(b)}_{mn}) − Vn(βn)} (Laber might draw a picture)
Let ℓ_{mn} and u_{mn} be the (α/2) × 100 and (1 − α/2) × 100
percentiles of √mn{V^{(b)}_{mn}(β^{(b)}_{mn}) − Vn(βn)}; the CI is given by
[ Vn(βn) − u_{mn}/√mn,  Vn(βn) − ℓ_{mn}/√mn ]
11
Though the theory underpinning subsampling can be non-trivial so
‘understand’ here is meant more mechanically.
54 / 87
Emmy the subsampling bootstrap algo
Alg. 1.6: m-out-of-n bootstrap confidence set for the conditional value
Input: mn, {(Xi, Ai, Yi)}_{i=1}^{n}, M, α ∈ (0, 1)
1  for b = 1, . . . , M do
2      Draw a sample of size mn, say S^{(b)}_{mn}, from {(Xi, Ai, Yi)}_{i=1}^{n} with replacement
3      Compute β^{(b)}_{mn} on S^{(b)}_{mn}
4      Δ^{(b)}_{mn} = √mn [ mn^{-1} ∑_{i∈S^{(b)}_{mn}} Yi 1{Ai Xi^T β^{(b)}_{mn} > 0} − n^{-1} ∑_{k=1}^{n} Yk 1{Ak Xk^T β^{(b)}_{mn} > 0} ]
5  end
6  Relabel so that Δ^{(1)}_{mn} ≤ Δ^{(2)}_{mn} ≤ · · · ≤ Δ^{(M)}_{mn}
7  ℓ_{mn} = Δ^{(⌈Mα/2⌉)}_{mn}
8  u_{mn} = Δ^{(⌈M(1−α/2)⌉)}_{mn}
Output: [ Vn(dn) − u_{mn}/√mn,  Vn(dn) − ℓ_{mn}/√mn ]
55 / 87
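A Python sketch of Alg. 1.6 (numpy assumed). `fit_rule` is a placeholder for whatever estimator of the linear rule is used (e.g., the Q-learning or OWL fits sketched earlier), and the plug-in value mirrors line 4 of the algorithm (up to a known constant from 1:1 randomization):

import numpy as np

def mn_bootstrap_value_ci(X, A, Y, fit_rule, m, M=1000, alpha=0.05, rng=None):
    # fit_rule(X, A, Y) -> beta, the estimated coefficients of the linear rule.
    rng = np.random.default_rng(rng)
    n = len(Y)
    beta_n = fit_rule(X, A, Y)
    value = lambda beta, Xs, As, Ys: np.mean(Ys * (As * (Xs @ beta) > 0))  # plug-in value
    V_n = value(beta_n, X, A, Y)
    deltas = np.empty(M)
    for b in range(M):
        idx = rng.integers(0, n, size=m)              # resample of size m_n
        beta_b = fit_rule(X[idx], A[idx], Y[idx])
        deltas[b] = np.sqrt(m) * (value(beta_b, X[idx], A[idx], Y[idx])
                                  - value(beta_b, X, A, Y))
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return V_n - hi / np.sqrt(m), V_n - lo / np.sqrt(m)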
m-out-of-n cont’d
Provides valid confidence intervals under (A0)-(A3)
Proof omitted (tedious)
Can tune mn using double bootstrap
Reliance on asymptotic tomfoolery makes me hesitant to use
this in practice12
12
I did not always think this way, see Chakraborty, L., and Zhao (2014ab).
Also, I should not be so dismissive of these methods. Some of the work in this
area has been quite deep and produced general uniformly convergent methods.
See work by Romano and colleagues.
56 / 87
Confidence interval for opt regime within a class
Let π∗ denote the optimal txt regime within a given class; our
goal is to construct a CI for V(π∗)
Were π∗ known, one could use √n{Vn(π∗) − V(π∗)}; with
your stat buddy, compute this limiting distn
Suppose that we could construct a valid confidence region for
π∗; suggest a method for a CI for V(π∗)
57 / 87
Projection interval
For any fixed π, let ζn,1−ν(π) be a (1 − ν) × 100% confidence
set for V(π), e.g., using the asymptotic approx on the previous slide
Let Dn,1−η denote a (1 − η) × 100% confidence set for π∗;
then a (1 − η − ν) × 100% confidence region for V(π∗) is
⋃_{π∈Dn,1−η} ζn,1−ν(π)
Why?
58 / 87
Ex. projection interval for linear regime
Consider regimes of the form π(x; β) = sign(x^T β); then
√n{Vn(β) − V(β)} = √n(ℙn − P)[ Y 1{A X^T β > 0} / P(A|X) ] ⇝ Normal(0, σ²(β)),
take
ζn,1−ν(β) = [ Vn(β) − z_{1−ν/2} σn(β)/√n,  Vn(β) + z_{1−ν/2} σn(β)/√n ]
If √n(βn − β∗) ⇝ Normal(0, Σ), take
Dn,1−η = {β : n(βn − β)^T Σn^{−1} (βn − β) ≤ χ²_{p,1−η}}
Projection interval:
⋃_{β∈Dn,1−η} ζn,1−ν(β)
59 / 87
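A sketch of ζn,1−ν(β) for a fixed rule (numpy/scipy assumed; `value_ci_fixed_rule` is an illustrative helper, and `p_a` is the known randomization probability, taken as 1/2 in the trials discussed here):

import numpy as np
from scipy.stats import norm

def value_ci_fixed_rule(X, A, Y, beta, nu=0.05, p_a=0.5):
    # IPWE of V(beta) for the fixed rule pi(x) = sign(x' beta) and its Wald CI,
    # with A coded in {-1, 1} and known P(A = a | X) = p_a.
    agree = (A * (X @ beta) > 0).astype(float)   # indicator A = sign(x' beta)
    terms = Y * agree / p_a
    v_hat = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(Y))
    z = norm.ppf(1 - nu / 2)
    return v_hat - z * se, v_hat + z * se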
Quiz break!
What does the ‘Q’ in Q-learning stand for?
In txt regimes, which of the following is not yet a thing:
A-learning, B-learning, C-Learning, D-learning, E-learning?
Write down the two-stage Q-learning algorithm assuming
binary treatments and linear models at each stage
True or false:
I would rather have a zombie ice dragon than two live fire
dragons.
The story of the Easter bunny is based on the little known
story of Jesus swapping the internal organs of chickens and
rabbits to prevent a widespread famine
Q-learning has been used to obtain state-of-the-art
performance in game-playing domains like chess, backgammon,
and atari
60 / 87
Inference for two-stage linear Q-learning
Learning objectives
Identify source of nonregularity
Understand implications on coverage and asy bias
Intuition behind bounds
Hopefully this will be trivial for you now!
61 / 87
Reminder: setup and notation
Observe {(X_{1,i}, A_{1,i}, X_{2,i}, A_{2,i}, Y_i)}_{i=1}^{n}, i.i.d. from P
X1 ∈ R^{p1} : baseline subj. info.
A1 ∈ {0, 1} : first treatment
X2 ∈ R^{p2} : interim subj. info. collected during course of A1
A2 ∈ {0, 1} : second treatment
Y ∈ R : outcome, higher is better
Define the histories H1 = X1 and H2 = (X1, A1, X2)
A DTR is π = (π1, π2) where πt : supp Ht → supp At; a
patient presenting with Ht = ht is assigned treatment πt(ht)
62 / 87
Characterizing optimal DTR
Optimal regime maximizes the value E Y*(π)
Define the Q-functions
Q2(h2, a2) = E[ Y | H2 = h2, A2 = a2 ]
Q1(h1, a1) = E[ max_{a2} Q2(H2, a2) | H1 = h1, A1 = a1 ]
Dynamic programming (Bellman, 1957)
π^{opt}_t(ht) = arg max_{at} Qt(ht, at)
63 / 87
Q-learning
Regression-based dynamic programming algorithm
(Q0) Postulate working models for the Q-functions:
Qt(ht, at; βt) = h_{t,0}^T β_{t,0} + a_t h_{t,1}^T β_{t,1}, with h_{t,0}, h_{t,1} features of ht
(Q1) Compute β2 = arg min_{β2} ℙn{Y − Q2(H2, A2; β2)}²
(Q2) Compute
β1 = arg min_{β1} ℙn{ max_{a2} Q2(H2, a2; β2) − Q1(H1, A1; β1) }²
(Q3) πt(ht) = arg max_{at} Qt(ht, at; βt)
Population parameters β∗_t obtained by replacing ℙn with P
Inference for β∗_2 is standard, just OLS
Focus on confidence intervals for c^T β∗_1 for fixed c ∈ R^{dim β∗_1}
64 / 87
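A sketch of this two-stage algorithm with binary treatments coded 0/1 and linear working models (numpy assumed; the function and argument names are illustrative); compare with the one-stage fit sketched earlier:

import numpy as np

def two_stage_q_learning(H10, H11, A1, H20, H21, A2, Y):
    # Stage-2 regression: Y ~ (h20, a2 * h21).
    Phi2 = np.hstack([H20, A2[:, None] * H21])
    beta2 = np.linalg.lstsq(Phi2, Y, rcond=None)[0]
    b20, b21 = beta2[:H20.shape[1]], beta2[H20.shape[1]:]
    # Pseudo-outcome: predicted outcome under the best second-stage treatment,
    # max_{a2 in {0,1}} Q2 = h20' b20 + [h21' b21]_+ .
    Ytilde = H20 @ b20 + np.maximum(H21 @ b21, 0.0)
    # Stage-1 regression: Ytilde ~ (h10, a1 * h11).
    Phi1 = np.hstack([H10, A1[:, None] * H11])
    beta1 = np.linalg.lstsq(Phi1, Ytilde, rcond=None)[0]
    b10, b11 = beta1[:H10.shape[1]], beta1[H10.shape[1]:]
    # Estimated rules: assign a_t = 1 when h_{t,1}' b_{t,1} > 0.
    pi1 = lambda h11: (h11 @ b11 > 0).astype(int)
    pi2 = lambda h21: (h21 @ b21 > 0).astype(int)
    return (b10, b11), (b20, b21), pi1, pi2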
Inference for c^T β∗_1
Non-smooth max operator makes β1 non-regular
Distn of c^T √n(β1 − β∗_1) is sensitive to small perturbations of P
Limiting distn does not have mean zero (asymptotic bias)
Occurs with small second-stage txt effects, H_{2,1}^T β∗_{2,1} ≈ 0
Confidence intervals based on series approximations or the
bootstrap can perform poorly; proposed remedies include:
Apply shrinkage to reduce asymptotic bias
Form conservative estimates of the tail probabilities of
c^T √n(β1 − β∗_1)
65 / 87
Characterizing asymptotic bias
Definition
For a constant c ∈ R^{dim β∗_1} and a √n-consistent estimator β1 of β∗_1
with √n(β1 − β∗_1) ⇝ M, define the c-directional asymptotic bias
Bias(β1, c) = E c^T M.
66 / 87
Characterizing asymptotic bias cont’d
Theorem (Asymptotic bias of Q-learning)
Let c ∈ R^{dim β∗_1} be fixed. Under moment conditions:
Bias(β1, c) = c^T Σ^{−1}_{1,∞} P[ B1 √(H_{2,1}^T Σ_{21,21} H_{2,1}) 1{H_{2,1}^T β∗_{2,1} = 0} ] / √(2π),
where B1 = (H_{1,0}^T, A1 H_{1,1}^T)^T, Σ_{1,∞} = P B1 B1^T, and Σ_{21,21} is the
asy. cov. of √n(β_{2,1} − β∗_{2,1}).
Asymptotic bias for Q-learning
Ave. of c^T Σ^{−1}_{1,∞} B1 with wts ∝ Var( H_{2,1}^T β_{2,1} 1{H_{2,1}^T β∗_{2,1} = 0} | H_{2,1} )
May be reduced by shrinking h_{2,1}^T β_{2,1} when h_{2,1}^T β∗_{2,1} = 0
67 / 87
Reducing asymptotic bias to improve inference
Shrinkage is a popular method for reducing asymptotic bias
with the goal of improving interval coverage
Chakraborty et al. (2009) apply soft-thresholding
Moodie et al. (2010) apply hard-thresholding
Goldberg et al. (2013) and Song et al. (2014) use lasso-type
penalization
Shrinkage methods target
max_{a2} Q2(h2, a2; β2) = h_{2,0}^T β_{2,0} + [ h_{2,1}^T β_{2,1} ]_+
68 / 87
Soft-thresholding (Chakraborty et al., 2009)
In Q-learning, replace max_{a2} Q2(H2, a2; β2) with
H_{2,0}^T β_{2,0} + [ H_{2,1}^T β_{2,1} ]_+ ( 1 − σ H_{2,1}^T Σ_{21,21} H_{2,1} / { n (H_{2,1}^T β_{2,1})² } )_+
Amount of shrinkage governed by σ > 0
Penalization schemes (Goldberg et al., 2013; Song et al.,
2014) reduce to this estimator under certain designs
No theoretical justification in Chakraborty et al. (2009) but
improved coverage of bootstrap intervals in some settings
69 / 87
Soft-thresholding and asymptotic bias
Theorem
Let c ∈ R^{dim β∗_1} and let β^σ_1 denote the soft-thresholding estimator.
Under moment conditions:
1. |Bias(β^σ_1, c)| ≤ |Bias(β1, c)| for any σ > 0.
2. If Bias(β1, c) ≠ 0, then for σ > 0
Bias(β^σ_1, c) / Bias(β1, c) = exp(−σ/2) − σ ∫_{√σ}^{∞} (1/x) exp(−x²/2) dx
70 / 87
Soft-thresholding and asymptotic bias cont’d
Is thresholding useful in reducing asymptotic bias?
Preceding theorem says yes, and more shrinkage is better
Chakraborty et al. suggest σ = 3, which corresponds to
13-fold decrease in asymptotic bias
However, the preceding theorem is based on pointwise, i.e.,
fixed parameter, asymptotics and may not faithfully reflect
small sample performance
71 / 87
Local generative model
Use local asymptotics to approximate the small sample behavior of
soft-thresholding
Assume:
1. For any s ∈ R^{dim β∗_{2,1}} there exists a sequence of distributions Pn
so that
∫ [ √n( dPn^{1/2} − dP^{1/2} ) − (1/2) νs dP^{1/2} ]² → 0,
for some measurable function νs.
2. β∗_{2,1,n} = β∗_{2,1} + s/√n, where
β∗_{2,n} = arg min_{β2} Pn{Y − Q2(H2, A2; β2)}²
72 / 87
Local asymptotics view of soft-thresholding
Theorem
Let c ∈ R^{dim β∗_1} be fixed. Under the local generative model and
moment conditions:
1. sup_{s∈R^{dim β∗_{2,1}}} |Bias(β1, c)| ≤ K < ∞.
2. sup_{s∈R^{dim β∗_{2,1}}} |Bias(β^σ_1, c)| → ∞ as σ → ∞.
Thresholding can be infinitely worse than doing nothing if
done too aggressively in small samples
73 / 87
Data-driven tuning
Is it possible to construct a data-driven choice of σ that
consistently leads to less asymptotic bias than no shrinkage?
Consider data from a two-arm randomized trial {(Ai, Yi)}_{i=1}^{n},
A ∈ {0, 1}, Y ∈ R coded so that higher is better13
Define µ∗_a = E(Y|A = a) and µa = ℙn Y 1{A=a} / ℙn 1{A=a}
Mean outcome under optimal treatment assignment:
θ∗ = max(µ∗_0, µ∗_1), corresponding estimator
θ = max(µ0, µ1) = µ0 + [µ1 − µ0]_+
Soft-thresholding estimator
θ^σ = µ0 + [µ1 − µ0]_+ ( 1 − 4σ / { n (µ1 − µ0)² } )_+
13
This is equivalent to two-stage Q-learning with no covariates and a single
first-stage treatment.
74 / 87
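A small numerical sketch of this toy estimator (numpy assumed; the generative choices below, including the 0.2 effect size, are only illustrative):

import numpy as np

def soft_threshold_theta(A, Y, sigma):
    # Plug-in and soft-thresholded estimators of theta* = max(mu0, mu1).
    n = len(Y)
    mu0, mu1 = Y[A == 0].mean(), Y[A == 1].mean()
    diff = mu1 - mu0
    shrink = max(0.0, 1.0 - 4.0 * sigma / (n * diff ** 2)) if diff != 0 else 0.0
    theta_plugin = mu0 + max(diff, 0.0)            # sigma = 0: no shrinkage
    theta_sigma = mu0 + max(diff, 0.0) * shrink
    return theta_plugin, theta_sigma

rng = np.random.default_rng(4)
n = 50
A = rng.integers(0, 2, size=n)
Y = 0.2 * A + rng.normal(size=n)                   # small true effect mu1* - mu0* = 0.2
print(soft_threshold_theta(A, Y, sigma=3.0))       # sigma = 3 as suggested by Chakraborty et al.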
Data-driven tuning: toy example
[Figure: bias of the soft-thresholded estimator θ^σ as a function of σ; panels for several sample sizes (n = 10, n = 100, ...).]
Optimal value of σ depends on µ∗_1 − µ∗_0
Variability in µ1 − µ0 prevents identification of the optimal value of σ;
using a plug-in estimator may lead to large bias
A data-driven σ that significantly improves asymptotic bias over no
shrinkage is difficult to construct
75 / 87
Asymptotic bias: discussion
Asymptotic bias exists in Q-learning
Local asymptotics show that aggressively shrinking to reduce
asymptotic bias can be infinitely worse than no shrinkage
Data-driven tuning seems to require choosing σ very small or
risking large bias
76 / 87
Confidence intervals for c^T β∗_1
Possible to construct valid confidence intervals in the presence of
asymptotic bias
Idea: construct regular bounds on c^T √n(β1 − β∗_1)
Bootstrap bounds to form confidence interval
Tightest among all regular bounds ⇒ automatic adaptivity
Local uniform convergence
Can also obtain conditional properties (Robins and Rotnitzky,
2014) and global uniform convergence (Wu, 2014)
77 / 87
Regular bounds on c^T √n(β1 − β∗_1)
Define
Vn(c, γ) = c^T Sn + c^T Σ^{−1}_1 ℙn Un(γ),
where
Sn = Σ̂^{−1}_1 √n(ℙn − P) B1 { ( H_{2,0}^T β∗_{2,0} + [H_{2,1}^T β∗_{2,1}]_+ ) − B1^T β∗_1 }
   + Σ̂^{−1}_1 √n ℙn B1 H_{2,0}^T ( β_{2,0} − β∗_{2,0} ),
Un(γ) = B1 { [ H_{2,1}^T (Zn + γ) ]_+ − [ H_{2,1}^T γ ]_+ }
Sn is smooth and Un(γ) is non-smooth
78 / 87
Regular bounds on c^T √n(β1 − β∗_1) cont’d
It can be shown that c^T √n(β1 − β∗_1) = Vn(c, β∗_{2,1})
Use pretesting to construct an upper bound
Un(c) = c^T Sn + c^T Σ^{−1}_1 ℙn Un(β∗_{2,1}) 1{Tn(H_{2,1}) > λn}
      + sup_{γ} c^T Σ^{−1}_1 ℙn Un(γ) 1{Tn(H_{2,1}) ≤ λn},
where Tn(h_{2,1}) is a test statistic for the null h_{2,1}^T β∗_{2,1} = 0 and λn is a
critical value
The lower bound, Ln(c), is obtained by taking an inf
Bootstrap the bounds to find a confidence interval
79 / 87
Validity of the bounds
Theorem
Let c ∈ R^{dim β∗_1} be fixed. Assume the local generative model;
under moment conditions and conditions on the pretest:
1. c^T √n(β1 − β∗_{1,n}) ⇝ c^T S∞ + c^T Σ^{−1}_{1,∞} P B1 H_{2,1}^T Z∞ 1{H_{2,1}^T β∗_{2,1} > 0}
   + c^T Σ^{−1}_{1,∞} P B1 { [H_{2,1}^T (Z∞ + s)]_+ − [H_{2,1}^T s]_+ } 1{H_{2,1}^T β∗_{2,1} = 0}
2. Un(c) ⇝ c^T S∞ + c^T Σ^{−1}_{1,∞} P B1 H_{2,1}^T Z∞ 1{H_{2,1}^T β∗_{2,1} > 0}
   + sup_{γ} c^T Σ^{−1}_{1,∞} P B1 { [H_{2,1}^T (Z∞ + γ)]_+ − [H_{2,1}^T γ]_+ } 1{H_{2,1}^T β∗_{2,1} = 0}
80 / 87
Validity of the bootstrap bounds
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗_1}. Let ℓ and u denote the
(α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap
distribution of the bounds and let PM denote the distribution with
respect to the bootstrap weights. Under moment conditions and
conditions on the pretest, for any ε > 0:
P{ PM( c^T β1 − u/√n ≤ c^T β∗_1 ≤ c^T β1 − ℓ/√n ) < 1 − α − ε } = o(1).
81 / 87
Uniform validity of the bootstrap bounds
(Tianshuang Wu)
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗_1}. Let ℓ and u denote the
(α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap
distribution of the bounds and let PM denote the distribution with
respect to the bootstrap weights. Under moment conditions and
conditions on the pretest, for any ε > 0:
sup_{P∈P} P{ PM( c^T β1 − u/√n ≤ c^T β∗_1 ≤ c^T β1 − ℓ/√n ) < 1 − α − ε }
converges to zero for a large class of distributions P.
82 / 87
Simulation experiments
Class of generative models
Xt ∈ {−1, 1}, At ∈ {−1, 1}, t ∈ {1, 2}
P(At = 1) = P(At = −1) = 0.5, t ∈ {1, 2}
X1 ∼ Bernoulli(0.5)
X2 | X1, A1 ∼ Bernoulli{expit(δ1 X1 + δ2 A1)}
ε ∼ N(0, 1)
Y = γ1 + γ2 X1 + γ3 A1 + γ4 X1 A1 + γ5 A2 + γ6 X2 A2 + γ7 A1 A2 + ε
Vary parameters to obtain a range of effect sizes; classify the
generative models as
Non-regular (NR)
Nearly non-regular (NNR)
Regular (R)
83 / 87
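A data-generating sketch for this class of models (numpy assumed; coding X1 and X2 in {−1, 1} is one reading of the Bernoulli statements above, and the example parameter values are only illustrative):

import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def generate(n, gamma, delta, rng=None):
    # gamma: (gamma_1, ..., gamma_7); delta: (delta_1, delta_2).
    rng = np.random.default_rng(rng)
    X1 = 2 * rng.binomial(1, 0.5, size=n) - 1          # assumed {-1, 1} coding
    A1 = rng.choice([-1, 1], size=n)
    X2 = 2 * rng.binomial(1, expit(delta[0] * X1 + delta[1] * A1)) - 1
    A2 = rng.choice([-1, 1], size=n)
    eps = rng.normal(size=n)
    Y = (gamma[0] + gamma[1] * X1 + gamma[2] * A1 + gamma[3] * X1 * A1
         + gamma[4] * A2 + gamma[5] * X2 * A2 + gamma[6] * A1 * A2 + eps)
    return X1, A1, X2, A2, Y

# Example: a 'non-regular'-style setting with all treatment effects set to zero.
X1, A1, X2, A2, Y = generate(150, gamma=[0, 0, 0, 0, 0, 0, 0], delta=[0.5, 0.5], rng=0)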
Simulation experiments cont’d
Compare bounding confidence interval (ACI) with bootstrap
(BOOT) and bootstrap thresholding (THRESH)
Compare in terms of width and coverage (target 95%)
Results based on 1000 Monte Carlo replications with datasets
of size n = 150
Bootstrap computed with 1000 resamples
Tuning parameter λn chosen with double bootstrap
84 / 87
Simulation experiments: results
Coverage (target 95%)

Method    Ex. 1 (NNR)  Ex. 2 (NR)  Ex. 3 (NNR)  Ex. 4 (R)  Ex. 5 (NR)  Ex. 6 (NNR)
BOOT      0.935*       0.930*      0.933*       0.928*     0.925*      0.928*
THRESH    0.945        0.938       0.942        0.943      0.759*      0.762*
ACI       0.971        0.958       0.961        0.943      0.953       0.953

Average width

Method    Ex. 1 (NNR)  Ex. 2 (NR)  Ex. 3 (NNR)  Ex. 4 (R)  Ex. 5 (NR)  Ex. 6 (NNR)
BOOT      0.385*       0.430*      0.430*       0.436*     0.428*      0.428*
THRESH    0.339        0.426       0.427        0.436      0.426*      0.424*
ACI       0.441        0.470       0.470        0.469      0.473       0.473
85 / 87
Ex. DTR for ADHD without uncertainty
[Figure: example DTR for ADHD, displayed without uncertainty. The first-stage treatment (low-dose MEDS vs. low-dose BMOD) depends on prior medication; responders continue their initial treatment; non-responders either intensify the current treatment or add the other modality, depending on adherence.]
86 / 87
Ex. DTR for ADHD with uncertainty
[Figure: the same DTR displayed with uncertainty: several decision points now list a set of acceptable options (e.g., ‘low-dose MEDS ∼OR∼ BMOD’, ‘add OTHER ∼OR∼ intensify SAME’, ‘add MEDS ∼OR∼ intensify BMOD’) rather than a single recommendation.]
87 / 87

More Related Content

PPTX
The False Discovery Rate: An Overview
PDF
파이썬으로 익히는 딥러닝 기본 (18년)
PDF
Neural Processes
PDF
차원축소 훑어보기 (PCA, SVD, NMF)
PDF
Helib
PPTX
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
PDF
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
PPT
1609 probability function p on subspace of s
The False Discovery Rate: An Overview
파이썬으로 익히는 딥러닝 기본 (18년)
Neural Processes
차원축소 훑어보기 (PCA, SVD, NMF)
Helib
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
1609 probability function p on subspace of s

Similar to 2019 PMED Spring Course - Introduction to Nonsmooth Inference - Eric Laber, April 17, 2019 (20)

PDF
Asymptotics of ABC, lecture, Collège de France
PDF
Estimation rs
PDF
GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...
PDF
the ABC of ABC
PDF
SAS Homework Help
PDF
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
PDF
eatonmuirheadsoaita
PDF
Stochastic Approximation And Nonlinear Regression Arthur E Albert Leland A Ga...
PDF
Statistics (1): estimation, Chapter 1: Models
PDF
Thesis_NickyGrant_2013
PDF
3_MLE_printable.pdf
PDF
Workshop in honour of Don Poskitt and Gael Martin
PDF
Machine Learning With MapReduce, K-Means, MLE
PDF
Optimum Engineering Design - Day 2b. Classical Optimization methods
PDF
Optimal Estimating Sequence for a Hilbert Space Valued Parameter
PDF
20320130406030
PDF
PPT
Multivariate outlier detection
PDF
Regression on gaussian symbols
PPTX
probability assignment help (2)
Asymptotics of ABC, lecture, Collège de France
Estimation rs
GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...
the ABC of ABC
SAS Homework Help
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
eatonmuirheadsoaita
Stochastic Approximation And Nonlinear Regression Arthur E Albert Leland A Ga...
Statistics (1): estimation, Chapter 1: Models
Thesis_NickyGrant_2013
3_MLE_printable.pdf
Workshop in honour of Don Poskitt and Gael Martin
Machine Learning With MapReduce, K-Means, MLE
Optimum Engineering Design - Day 2b. Classical Optimization methods
Optimal Estimating Sequence for a Hilbert Space Valued Parameter
20320130406030
Multivariate outlier detection
Regression on gaussian symbols
probability assignment help (2)
Ad

More from The Statistical and Applied Mathematical Sciences Institute (20)

PDF
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
PDF
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
PDF
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
PDF
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
PDF
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
PDF
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
PPTX
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
PDF
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
PDF
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
PPTX
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
PDF
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
PDF
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
PDF
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
PDF
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
PDF
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
PDF
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
PPTX
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
PPTX
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
PDF
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
PDF
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
Ad

Recently uploaded (20)

PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
Orientation - ARALprogram of Deped to the Parents.pptx
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Weekly quiz Compilation Jan -July 25.pdf
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PPTX
Lesson notes of climatology university.
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
GDM (1) (1).pptx small presentation for students
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
RMMM.pdf make it easy to upload and study
PDF
Complications of Minimal Access Surgery at WLH
PDF
A systematic review of self-coping strategies used by university students to ...
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
PDF
Trump Administration's workforce development strategy
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Orientation - ARALprogram of Deped to the Parents.pptx
Supply Chain Operations Speaking Notes -ICLT Program
Weekly quiz Compilation Jan -July 25.pdf
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
Lesson notes of climatology university.
2.FourierTransform-ShortQuestionswithAnswers.pdf
Chinmaya Tiranga quiz Grand Finale.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
GDM (1) (1).pptx small presentation for students
Microbial diseases, their pathogenesis and prophylaxis
RMMM.pdf make it easy to upload and study
Complications of Minimal Access Surgery at WLH
A systematic review of self-coping strategies used by university students to ...
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
Trump Administration's workforce development strategy
Module 4: Burden of Disease Tutorial Slides S2 2025

2019 PMED Spring Course - Introduction to Nonsmooth Inference - Eric Laber, April 17, 2019

  • 1. Intro to nonsmooth inference Eric B. Laber Department of Statistics, North Carolina State University April 2019 SAMSI
  • 2. Last time SMARTs gold standard for est and eval of txt regimes Highly configurable but choices driven by science Looked at examples with varying scientific/clinical goals which lead to different timing, txt options, response criteria etc. Often powered by simple comparisons First-stage response rates Fixed regimes (most- vs. least-intensive) First stage txts (problematic) If test statistic is regular and asymptotically normal under null can use same basic template for power 1 / 87
  • 3. Quick SMART review R PCST-Full Treatment 0 PCST-Brief Treatment 1 Response? Response? R No R No R Yes R Yes PCST-Full maintenance Treatment 2 No further treatment Treatment 3 PCST-Plus Treatment 4 PCST-Full maintenance Treatment 2 PCST-Brief maintenance Treatment 5 No further intervention Treatment 3 PCST-Full Treatment 0 PCST-Brief maintenance Treatment 5 2 / 87
  • 4. Refresher Suppose that researchers are interested in comparing the embedded regimes: (e1) assign PCST-Full initially, assign PCST-Full maintenance to responders, and assign PCST-Plus to non-responders; (e2) assign PCST-Brief initially, assign no further intervention to responders, and assign PCST-Brief maintenance to responders. Recall our general template: Test statistic: Vn(e1) − Vn(e2), where Vn is IPWE Use √ nTn/σ2 e1,e2,n asy normal and reject when this is large in magnitude 3 / 87
  • 5. Goals for today Introduction to inference for txt regimes Nonregular inference (and why we should care) Basic strategies with a toy problem Examples in one-stage problems 4 / 87
  • 6. Warm up part I: quiz! Discuss with your stat buddy: What are some common scenarios where series approx or the bootstrap cannot ensure correct op characteristics? What is a local alternative? How do we know if an asymptotic approx is adequate? True or false If n is large asymptotic approximations can be trusted. The top review of CLT on yelp complains about the burritos being too expensive. The BBC produced an Hitler-themed sitcom titled ‘Heil Honey, I’m home’ in the 1950s. 5 / 87
  • 7. On reality and fantasy Your cat didn’t say that. You know how I know? It’s a cat. It doesn’t talk. If you died, it would eat you. Starting with your face. – Matt Zabka, recently single 6 / 87
  • 8. Asymptotic approximations Basic idea: study behavior of statistical procedure in terms of dominating features while ignoring lower order ones Often, but not always, consider diverging sample size ‘Dominating features’ intentionally ambiguous Generate new insights and general statistical procedures as large classes of problems share same dominating features Asymptotics mustn’t be applied mindlessly Disgusting trend in statistics: propose method, push through irrelevant asymptotics, handpick simulation experiments Require careful thought about what op characteristics are needed scientifically and how to ensure these hold with the kind of data that are likely to be observed No panacea ⇒ handcrafted construction and evaluation 7 / 87
  • 9. Inferential questions in precision medicine Identify key tailoring variables Evaluate performance of true optimal regime Evaluate performance of estimated optimal regime Compare performance of two+ (possibly data-driven) regimes . . . 8 / 87
  • 10. Toy problem: max of means Simple problem that retains many of the salient features of inference for txt regimes Non-smooth function of smooth functionals Well-studied in the literature Basic notation For Z1, . . . , Zn ∼i.i.d. P comprising ind copies of Z ∼ P write Pf (Z) = f (z)dP(z) and Pnf (Z) = n−1 n i=1 f (Zi ) Use ‘ ’ to denote convergence in distribution Check: assuming requisite moments exist: √ n (Pn − P) Z ???? 9 / 87
  • 11. Max of means Observe X1, . . . , Xn ∼i.i.d. P in Rp with µ0 = PX, define θ0 = p j=1 µ0,j = max(µ0,1, . . . , mu0,p) While we consider this estimand primarily for illustration, it corresponds to problem of estimating the mean outcome under an optimal one-size-fits-all treatment recommendation where µ0,j is mean outcome under treatment j = 1, . . . , p. 10 / 87
  • 12. Max of means: estimation Define µn = PnX, the plug-in estimator of θ0 is θn = p j=1 µn,j Warm-up: Three minutes trying to derive limiting distn of √ n(θn − θ0) Three minutes discussing soln with your stat buddy 11 / 87
  • 13. Max of means: first result For v ∈ Rp define U(v) = arg max j vj Lemma Assume regularity conditions under which √ n(Pn − P)X is asymptotically normal with mean zero and variance-covariance matrix Σ. Then √ n θn − θ0 j∈U(µ0) Zj , where Z ∼ Normal(0, Σ). 12 / 87
  • 14. Max of means: proof of first result 13 / 87
  • 15. Extra page if needed
  • 16. Max of means: discussion of first result Limiting distribution of √ n(θn − θ0) depends abruptly on µ0 If µ0 = (0, 0) T and Σ = I2, the the limiting distn is the max of two ind std normals If µ0 = (0, ) T for > 0 and Σ = I2, the limiting distn is std normal even if = 1x10−27 !! How can we use such an asymptotic result in practice?! Limiting distn of √ n(θn − θ0) depends only on submatrix of Σ cor. to elements of U(θ0). What about in finite samples? 14 / 87
  • 17. Max of means: discussion of first result cont’d Suppose X1, . . . , Xn ∼i.i.d. Normal(θ0, Ip) and µ0 has a unique maximizer, i.e., U(µ0) a singleton, say {µ0,1} √ n(θn − θ0) Normal(0, 1) P √ n θn − θ0 ≤ t = Φ(t) p j=2 Φ t + √ n(θ0 − µ0,j ) Quick break: derive this. If the gaps θ0 − µ0,j are small relative to √ n, the finite sample behavior can be quite different from limit Φ(t)1 1 Note that the limiting distribution doesn’t depend on these gaps at all! 15 / 87
  • 18. Max of means: normal approximation in pictures Generate data from Normal(µ, I6) with µ1 = 2 and µj = µ1 − δ for j = 2, . . . , 6. Results shown for n = 100. n(θ^ n − θ0) Density,δ=0.5 −4 −2 0 2 4 0.00.10.20.30.40.50.6 n(θ^ n − θ0) Density,δ=0.1 −4 −2 0 2 4 0.00.10.20.30.40.50.6 n(θ^ n − θ0) Density,δ=0.01 −4 −2 0 2 4 0.00.10.20.30.40.50.6 16 / 87
  • 19. Choosing the right asymptotic framework Dangerous pattern of thinking: In practice, none of the txt effect differences are zero. I’ll build my asy approximations assuming a unique maximizer. 17 / 87
  • 20. Choosing the right asymptotic framework Dangerous pattern of thinking: In practice, none of the txt effect differences are zero. I’ll build my asy approximations assuming a unique maximizer. There finitely many components so maximizer is well-separated. Idea! Plug-in estimated mazimizer and use asy normal approx. Preceding pattern happens frequently, e.g., oracle property in model selection, max eigenvalues in matrix , and txt regimes 17 / 87
  • 21. Choosing the right asymptotic framework What goes wrong? After all, this thinking works well in many other settings, e.g., everything you learned in stat 101. 18 / 87
  • 22. Choosing the right asymptotic framework What goes wrong? After all, this thinking works well in many other settings, e.g., everything you learned in stat 101. Finite sample behavior driven by small (not necessarily zero) differences in txt effectiveness We saw this analytically in normal case Intuition helped by thinking in extremes, e.g,. what if all txts were equal? What if one were infinitely better than others? Abrupt dependence of limiting distribution on U(µ0) is a redflag. It is tempting to construct procedures that will recover this limiting distn even if some txt differences are exactly zero. This is asymptotics for asymptotics sake. Don’t do it. 18 / 87
  • 23. Asymptotic working assumptions A useful asy approximation should be robust to the setting where some (all) txt differences are zero Necessary but not sufficient Heuristic: in small samples, one cannot distinguish between small (but nonzero) txt differences so use an asy framework which allows for exact equality. This heuristic has been misinterpreted and misused in lit. Some procedures we’ll look at are designed for such robustness 19 / 87
  • 24. Local asymptotics: horseshoes and hand grenades Allowing null txt differences problematic Asymptotically, differences either zero or infinite2 Txt differences are (probably) not exactly zero Challenge: allow small differences to persist as n diverges Local or moving parameter asy framework does this Idea: allow gen model to change with n so that gaps θ0 − max j /∈U(µ0) µ0,j shrink to zero as n increases3 2 In the stat sense that we have power one to discriminate between them. 3 This idea should be familiar from hypothesis testing. 20 / 87
  • 25. Triangular arrays For each n, X1,n, . . . , Xn,n ∼i.i.d. Pn Observations Distribution X1,1 P1 X1,2 X2,2 P2 X1,3 X2,3 X3,3 P3 X1,4 X2,4 X3,4 X4,4 P4 ... ... ... ... ... ... Define µ0,n = PnX and θ0,n = p j=1 µ0,j Assume µ0,n = µ0 + s/ √ n where s ∈ Rp called local parameter Assume √ n(Pn − Pn)X Normal(0, Σ)4 4 This is true under very mild conditions on the sequence of distributions {Pn}n≥1. However, given our limited time we will not discuss such conditions. See van der Vaart and Wellner (1996) for details. 21 / 87
  • 26. Quick quiz Suppose that X1,n, . . . , Xn,n ∼i.i.d. Normal(µ0 + s/ √ n, Σ) what is is distribution of √ n(Pn − Pn)X? 22 / 87
  • 27. Local alternatives anticipate unstable performance Lemma Let s ∈ Rp be fixed. Assume that for each n we observe {Xi,n}n i=1 drawn i.i.d. from Pn which satisfies: (i) PnX = mu0 + s/ √ n, and (ii) √ n(Pn − Pn)X Normal(0, Σ). Then, under Pn, √ n θn − θ0,n j∈U(µ0) (Zj + sj ) − j∈U(µ0) sj , where Z ∼ Normal(0, Σ). 23 / 87
  • 28. Local alternatives anticipate unstable performance Lemma Let s ∈ Rp be fixed. Assume that for each n we observe {Xi,n}n i=1 drawn i.i.d. from Pn which satisfies: (i) PnX = mu0 + s/ √ n, and (ii) √ n(Pn − Pn)X Normal(0, Σ). Then, under Pn, √ n θn − θ0,n j∈U(µ0) (Zj + sj ) − j∈U(µ0) sj , where Z ∼ Normal(0, Σ). Discussion/observations on local limiting distn Dependence of limiting distn on s ⇒ nonregular Set U(µ0) represents set of near-maximizers though sj = 0 corresponds to exact equality (so haven’t ruled this out) 23 / 87
  • 29. Proof of local limiting distribution 24 / 87
  • 30. Extra page if needed
  • 32. Comments on nonregularity Sensitivity of estimator to local alternatives cannot be rectified through the choice of a more clever estimator Inherent property of the estimand5 This has not stopped some from trying... Remainder of today’s notes: cataloging of confidence intervals 5 See van der Vaart (1991), Hirano and Porter (2012), and L. et al. (2011, 2014, 2019) 25 / 87
  • 33. Projection region Idea: exploit the following two facts µn is nicely behaved (reg. asy normal) If µ0 were known this would be trivial6 Given α ∈ (0, 1) denote acceptable error level and ζn,1−α a confidence region for µ0, e.g., ζn,1−α = µ ∈ Rp : n(µn − µ) T Σn(µn − µ) ≤ χ2 p,1−α , where Σn = Pn(X − µn)(X − µn) T Projection CI: Γn,1−α =    θ ∈ R : θ = p j=1 µj for some µ ∈ ζn,1−α    6 In this problem, θ0 is a function of µ0 and is thus completely known when µ0 is known. In more complicated problems, knowing the value of a nuisance parameter will make the inference problem of interest regular. 26 / 87
  • 34. Prove the following with your stat buddy P (θ0 ∈ Γn,1−α) ≥ 1 − α + oP(1) 27 / 87
  • 35. Comments on projection regions Useful when parameter of interest is a non-smooth functional of a smooth (regular) parameter Robust and widely applicable but conservative Projection interval valid under local alternatives (why?) Can reduce conservatism using pre-test (L. et al., 2014) Berger and Boos (1991) and Robins (2004) for seminal papers Consider as a first option in new non-reg problem 28 / 87
Bound-based confidence intervals
Idea: sandwich the non-smooth functional between smooth upper and lower bounds, then bootstrap the bounds to form a confidence region
Let {τn}n≥1 be a sequence of positive constants such that τn → ∞ and τn = o(√n) as n → ∞, and define
Un(µ0) = { j : max_k √n(µn,k − µn,j)/σj,k,n ≤ τn },
where σj,k,n is an estimator of the asymptotic variance of µn,k − µn,j.
Note* It may help to think of Un as the indices of txts that we cannot distinguish from being optimal.
29 / 87

Bound-based confidence intervals cont'd
Given Un(µ0), define Sn(µ0) = { s ∈ Rp : sj = µ0,j if j ∉ Un(µ0) }; then it follows that
Un = sup_{s∈Sn(µ0)} √n { max_{1≤j≤p}(µn,j − µ0,j + sj) − max_{1≤j≤p} sj }
is an upper bound on √n(θn − θ0). (Why?)
A lower bound, Ln, is constructed by replacing the sup with an inf.
30 / 87

Dad, where do bounds come from?
Un is obtained by taking the sup over all local, i.e., order 1/√n, perturbations of the generative model
By construction, insensitive to local perturbations ⇒ regular
Un(µ0) is a conservative estimate of U(µ0); let's wave our hands:
31 / 87

Bootstrapping the bounds
Both Un and Ln are regular and their distributions can be consistently estimated via the nonparametric bootstrap
Let u(b)_{n,1−α/2} be the (1 − α/2) × 100 percentile of the bootstrap distribution of Un and ℓ(b)_{n,α/2} the (α/2) × 100 percentile of the bootstrap distribution of Ln
Bound-based confidence interval:
[ θn − u(b)_{n,1−α/2}/√n , θn − ℓ(b)_{n,α/2}/√n ]
32 / 87
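A sketch of the bound-based interval for the max of means. I use the computable form of the sup/inf over local perturbations: with high probability the maximizer of µ0 lies in Un, so max over Un of √n(µn,j − µ0,j) upper-bounds √n(θn − θ0) and the corresponding min lower-bounds it. In this sketch Un is computed on the observed data and held fixed across bootstrap draws, and τn = √(log n) is one simple assumed choice (the slides tune τn with the double bootstrap).

import numpy as np

def bound_based_ci(X, alpha=0.05, B=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    theta_hat = mu_hat.max()
    tau_n = np.sqrt(np.log(n))                               # tau_n -> infinity, tau_n = o(sqrt(n))
    S = np.cov(X, rowvar=False)
    var_diff = np.diag(S)[None, :] + np.diag(S)[:, None] - 2 * S
    sd_diff = np.sqrt(np.maximum(var_diff, 1e-12))           # sd of mu_hat_k - mu_hat_j
    T = np.sqrt(n) * (mu_hat[None, :] - mu_hat[:, None]) / sd_diff
    U_set = np.where(T.max(axis=1) <= tau_n)[0]              # arms indistinguishable from the best

    U_b, L_b = np.empty(B), np.empty(B)
    for b in range(B):
        diff = np.sqrt(n) * (X[rng.integers(n, size=n)].mean(axis=0) - mu_hat)
        U_b[b], L_b[b] = diff[U_set].max(), diff[U_set].min()
    u = np.quantile(U_b, 1 - alpha / 2)
    low = np.quantile(L_b, alpha / 2)
    return theta_hat - u / np.sqrt(n), theta_hat - low / np.sqrt(n)

rng = np.random.default_rng(1)
print(bound_based_ci(rng.normal(loc=[0.0, 0.0, -0.6], size=(250, 3))))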
Bound-based intervals discussion
General approach; applies to implicitly defined estimators as well as to those with closed-form expressions like the one considered here
Less conservative than the projection interval but still conservative; such conservatism is unavoidable
Bounds are tightest in a certain sense
Bounding quantiles directly rather than the estimand may reduce conservatism, though possibly at the price of additional complexity
See Fan et al. (2017) for other improvements/refinements
33 / 87

Bootstrap methods
Bootstrap is not consistent without modification
Due to instability (nonregularity)
Nondifferentiability of the max operator causes this instability (see Shao 1994 for a nice review)
Bootstrap is appealing for complex problems
Doesn't require explicitly computing asymptotic approximations7
Higher-order convergence properties
7 There are exceptions to this, including the parametric bootstrap and those based on quadratic expansions
34 / 87

How about some witchcraft?
m-out-of-n bootstrap can be used to create valid confidence intervals for non-smooth functionals
Idea: resample datasets of size mn = o(n) so that sample-level parameters converge 'faster' than their bootstrap analogs (i.e., witchcraft)
35 / 87
m-out-of-n bootstrap
Accepting some components of witchcraft on faith
√mn(µ(b)_{mn} − µn) ⇝ Normal(0, Σ) conditional on the data
See Arcones and Giné (1989) for details
An even toyier example than our toy example: W1, . . . , Wn i.i.d. with mean µ and variance σ²; derive the limit distributions of √n(|W̄n| − |µ|) and √mn(|W̄(b)_{mn}| − |W̄n|)
36 / 87

m-out-of-n bootstrap with max of means
Derive the limiting distribution of √mn(θ(b)_{mn} − θn):
37 / 87
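A numerical comparison (my own, with assumed choices n = 400, p = 2, mn = n^{2/3}) of the sampling distribution of √n(θn − θ0) in the tied case against the usual bootstrap and the m-out-of-n bootstrap; the former is inconsistent here, while the latter typically tracks the target distribution much more closely.

import numpy as np

rng = np.random.default_rng(2)
n, p, B = 400, 2, 2000
m = int(n ** (2 / 3))                                # m_n -> infinity, m_n = o(n)

def theta(X):
    return X.mean(axis=0).max()

# sampling distribution of sqrt(n)(theta_n - theta_0) when mu_0 = (0, 0), i.e., theta_0 = 0
samp = np.array([np.sqrt(n) * theta(rng.normal(size=(n, p))) for _ in range(B)])

X = rng.normal(size=(n, p))                          # one observed data set
th = theta(X)
boot_n = np.array([np.sqrt(n) * (theta(X[rng.integers(n, size=n)]) - th) for _ in range(B)])
boot_m = np.array([np.sqrt(m) * (theta(X[rng.integers(n, size=m)]) - th) for _ in range(B)])

for name, d in [("sampling dist", samp), ("n-out-of-n", boot_n), ("m-out-of-n", boot_m)]:
    print(name, np.round(np.quantile(d, [0.05, 0.5, 0.95]), 2))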
Extra page if needed

Intermission
I wouldn't want to wind up hooked to a bunch of wires and tubes, unless somehow the wires and tubes were keeping me alive.
—Don Alden Adams

Finally! Back to treatment regimes (briefly)
Consider a one-stage problem with observed data {(Xi, Ai, Yi)}, i = 1, . . . , n, where X ∈ Rp, A ∈ {−1, 1}, and Y ∈ R
Assume the requisite causal conditions hold
Assume linear rules π(x) = sign(xTβ), where β ∈ Rp and x might contain polynomial terms etc.
38 / 87
Warm-up!
Derive the limiting distribution of the parameters in linear Q-learning!!!!8
Posit the linear model Q(x, a; β) = x0Tβ0 + a x1Tβ1, indexed by β = (β0T, β1T)T, with x0, x1 known features of x
βn = arg minβ Pn{Y − Q(X, A; β)}² and β∗ = arg minβ P{Y − Q(X, A; β)}²
Derive the limiting distribution of √n(βn − β∗)
Construct a confidence interval for Q(x, a) assuming Q(x, a) = Q(x, a; β∗)
8 He exclaimed.
39 / 87
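A sketch of the warm-up: one-stage linear Q-learning is ordinary least squares, so a sandwich covariance estimate and a Wald interval for Q(x, a; β∗) follow directly. The feature maps x0 = x1 = (1, x) below are an illustrative choice.

import numpy as np
from scipy.stats import norm

def fit_q(X, A, Y):
    """OLS fit of Q(x, a; beta) = x0'b0 + a * x1'b1 with x0 = x1 = (1, x)."""
    n = len(Y)
    x0 = np.column_stack([np.ones(n), X])
    D = np.column_stack([x0, A[:, None] * x0])            # design (x0, a*x1)
    beta = np.linalg.solve(D.T @ D, D.T @ Y)
    resid = Y - D @ beta
    bread = np.linalg.inv(D.T @ D / n)
    meat = (D * resid[:, None]).T @ (D * resid[:, None]) / n
    cov = bread @ meat @ bread / n                        # sandwich estimate of Cov(beta_n)
    return beta, cov

def q_ci(x, a, beta, cov, alpha=0.05):
    phi = np.concatenate([[1.0], np.atleast_1d(x)])
    d = np.concatenate([phi, a * phi])                    # gradient of Q(x, a; beta) in beta
    est, se = d @ beta, np.sqrt(d @ cov @ d)
    z = norm.ppf(1 - alpha / 2)
    return est - z * se, est + z * se

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
A = rng.choice([-1, 1], size=n)
Y = 1 + X + A * (0.5 - X) + rng.normal(size=n)
beta, cov = fit_q(X, A, Y)
print(q_ci(x=0.3, a=1, beta=beta, cov=cov))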
Extra page if needed

Extra page if needed

Parameters in (1-stage) Q-learning are easy!
Similar arguments show that the coefficients indexing g-computation and outcome weighted learning are asymptotically normal
Preview: consider the (regression-based) estimator of the value of πn(x) = sign(x1Tβ1,n), which you'll recall is
Vn(β1,n) = Pn maxa Q(X, a; βn) = PnX0Tβ0,n + Pn|X1Tβ1,n|
What is the limit of √n{Vn(β1,n) − V(β1,n)} and can we use it to derive a CI for V(βn)? What about a CI for V(β∗)?
40 / 87

Parameters in (1-stage) OWL are easy!
To illustrate, assume P(A = 1|X) = P(A = −1|X) = 1/2 wp1
Recall OWL is based on a convex relaxation of the IPWE
Vn(β) = Pn[ Y 1{A = sign(XTβ)} / P(A|X) ] = 2 Pn Y 1{A sign(XTβ) > 0}
Let ℓ : R → R be convex; the OWL estimator is
βn = arg min_{β∈Rp} Pn |Y| ℓ(WTβ), where W = sign(Y) A X
41 / 87
Extra page if needed

Some facts about OWL (and more generally convex M-estimators)
The map β ↦ |y| ℓ(wTβ) is the composition of a linear and a convex function and is thus convex in β for each (y, w) ⇒ greatly simplifies inference!
Regularity conditions
β∗ = arg minβ P|Y| ℓ(WTβ) exists and is unique
The map β ↦ P|Y| ℓ(WTβ) is differentiable in a neighborhood of β∗9
Under these conditions, √n(βn − β∗) is regular10 and asymptotically normal ⇒ many results from Q-learning port over
9 More formally, require |y|{ℓ(wT(β∗ + δ)) − ℓ(wTβ∗)} = S(y, w; β∗)Tδ + R(y, w, δ; β∗), where PS(Y, W; β∗) = 0, ΣO = PS(Y, W; β∗)S(Y, W; β∗)T is finite, and PR(Y, W, δ; β∗) = (1/2)δTΩOδ + o(||δ||²) (Haberman, 1989; Niemiro, 1992; Hjort and Pollard, 2011).
10 I am being a bit loose with language in this course by referring to both estimands and rescaled estimators as 'regular' or 'non-regular.'
42 / 87
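A sketch of an OWL-type convex M-estimator under the setup above (P(A = 1 | X) = 1/2). I use the logistic surrogate ℓ(u) = log(1 + e^{−u}), chosen here for smoothness; the hinge loss of the original OWL proposal is another convex choice, and the intercept-plus-X feature map is illustrative.

import numpy as np
from scipy.optimize import minimize

def owl_fit(X, A, Y):
    n = len(Y)
    Phi = np.column_stack([np.ones(n), X])              # illustrative features
    W = np.sign(Y)[:, None] * A[:, None] * Phi          # W = sign(Y) * A * features
    w = np.abs(Y)                                       # weights |Y|

    def objective(beta):                                # P_n |Y| ell(W'beta), ell = logistic
        return np.mean(w * np.logaddexp(0.0, -(W @ beta)))

    return minimize(objective, np.zeros(Phi.shape[1]), method="BFGS").x

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=n)
A = rng.choice([-1, 1], size=n)
Y = 0.5 + A * (1.0 - X) + rng.normal(size=n)            # optimal rule: treat iff 1 - x > 0
print(np.round(owl_fit(X, A, Y), 2))                    # roughly proportional to (1, -1)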
Value function(s)
Three ways to measure performance
Conditional value: V(πn) = E{ Y∗(πn) | πn }, measures the performance of an estimated decision rule as if it were to be deployed in the population (note* this is a random variable)
Unconditional value: Vn = E V(πn), measures the average performance of the algorithm used to construct πn with a sample of size n
Population-level value: V(π∗), where π∗(x) = sign(xTβ∗), measures the potential of applying a precision medicine strategy in a given domain if the algorithm for constructing πn will be used
Discuss these measures with your stat buddy. Is there a meaningful distinction as the sample size grows large?
43 / 87

It's a wacky world out there
The three value measures need not coincide asymptotically
Let πn(x) = sign(xTβn) and suppose √n βn ⇝ Normal(0, Σ), so that β∗ ≡ 0 and π∗(x) ≡ −1. With your stat buddy, compute:
V(βn) ⇝ ????
Vn = E V(βn) → ????
V(β∗) = ????
44 / 87
Extra page if needed

Have some confidence you useless pile!
We'll construct confidence sets for V(βn) and V(β∗) as these are most commonly of interest in application
Starting with the conditional value, assume that the data-generating model is a triangular array Pn such that:
(A0) πn(x) = sign(xTβn)
(A1) ∃ β∗n s.t. β∗n = β∗ + s/√n for some s ∈ Rp and √n(βn − β∗n) = √n(ℙn − Pn)u(X, A, Y) + oPn(1), where u does not depend on s, supn Pn||u(X, A, Y)||² < ∞, and Cov{u(X, A, Y)} is p.d.
(A2) If F is a uniformly bounded Donsker class and √n(ℙn − P) ⇝ T in ℓ∞(F) under P, then √n(ℙn − Pn) ⇝ T in ℓ∞(F) under Pn
(A3) supn Pn||Y||² < ∞
Detailed discussion is beyond the scope of this class. Laber will wave his hands a bit. Our goal is to understand the key ideas.
45 / 87

Building block: joint distribution before the nonsmooth operator
Define the class of functions
G = { g(X, A, Y; δ) = Y 1{A XTδ > 0} 1{XTβ∗ = 0} : δ ∈ Rp }
and view √n(ℙn − Pn) as a random element of ℓ∞(Rp).
Lemma
Assume (A0)-(A3). Then
√n( (ℙn − Pn)g(·; ·), βn − β∗, (ℙn − Pn)Y 1{A XTβ∗ > 0} ) ⇝ (T, Z, W)
in ℓ∞(Rp) × Rp × R under Pn.
46 / 87
Limiting distribution of V(βn)
Corollary
Assume (A0)-(A3). Then √n{Vn(βn) − V(βn)} ⇝ T(Z + s) + W.
Notes
Presence of s shows this is nonregular
T is a Brownian bridge indexed by Rp
W and Z are normal
47 / 87

Extra page if needed

Bound-based confidence interval
Limiting distribution: T(Z + s) + W
Local parameter only appears in the first term
(Asy) bound should only affect this term
Schematic for constructing a bound
Partition the input space into points that are 'near' the decision boundary xTβ∗ = 0 vs. those that are 'far' from the boundary
Take the sup/inf over local perturbations of points in the 'near' group
48 / 87
Upper bound
Let Σn be an estimator of the asymptotic variance of βn; an upper bound on √n{Vn(βn) − V(βn)} is
Un = sup_{ω∈Rp} √n(ℙn − Pn)[ Y 1{A XTω > 0} 1{ n(XTβn)² / (XTΣnX) ≤ τn } ]
       + √n(ℙn − Pn)[ Y 1{A XTβn > 0} 1{ n(XTβn)² / (XTΣnX) > τn } ],
where τn is a sequence of tuning parameters such that τn → ∞ and τn = o(n) as n → ∞.
The lower bound is constructed by replacing the sup with an inf.
49 / 87

Limiting distribution of the bounds
Theorem
Assume (A0)-(A3). Then
(Ln, Un) ⇝ ( inf_{ω∈Rp} T(ω) + W, sup_{ω∈Rp} T(ω) + W )
under Pn.
Recall the limit distribution of √n{Vn(βn) − V(βn)} is T(Z + s) + W
Bounds are equivalent to a sup/inf over local perturbations
If all subjects have large txt effects, the bounds are tight
Bootstrap the bounds to construct a confidence interval; theoretical results for the bootstrap bounds are given in the book
50 / 87

Note on tuning
The sequence {τn}n≥1 can affect finite-sample performance
Idea: tune using the double bootstrap, i.e., bootstrap the bootstrap samples to estimate coverage and adapt τn
The double bootstrap is considered computationally expensive but is not much of a burden in most problems with modern computing infrastructure
Tuning can be done without affecting the theoretical results
51 / 87
Algy the friendly tuning algorithm
Alg. 1.5: Tuning the critical value τn using the double bootstrap
Input: {(Xi, Ai, Yi)}, i = 1, . . . , n; M; α ∈ (0, 1); τ(1)n, . . . , τ(L)n
1  V = Vn(dn)
2  for j = 1, . . . , L do
3      c(j) = 0
4      for b = 1, . . . , M do
5          Draw a sample of size n, say S(b)n, from {(Xi, Ai, Yi)} with replacement
6          Compute the bound-based confidence set, ζ(b)n, using sample S(b)n and critical value τ(j)n
7          if V ∈ ζ(b)n then
8              c(j) = c(j) + 1
9          end
10     end
11 end
12 Set j∗ = arg min_{j : c(j) ≥ M(1−α)} c(j)
Output: Return τ(j∗)n
52 / 87
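A compact sketch of Alg. 1.5, written generically: ci_fn(data, tau) and estimate_fn(data) are placeholder interfaces (my assumptions, not from the slides) for the bound-based confidence set and the point estimate Vn(dn), and the fallback when no critical value reaches nominal estimated coverage is also my addition.

import numpy as np

def tune_tau(data, ci_fn, estimate_fn, taus, M=200, alpha=0.05, rng=None):
    """Double-bootstrap tuning; data is an array whose rows are observations."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    V = estimate_fn(data)                         # observed estimate plays the role of the truth
    counts = np.zeros(len(taus), dtype=int)
    for b in range(M):
        boot = data[rng.integers(n, size=n)]      # first-level bootstrap sample
        for j, tau in enumerate(taus):
            lo, hi = ci_fn(boot, tau)             # inner (second-level) resampling happens in ci_fn
            counts[j] += (lo <= V <= hi)
    ok = np.where(counts >= M * (1 - alpha))[0]
    if len(ok) == 0:                              # fallback (not in Alg. 1.5): most-conservative choice
        return taus[int(np.argmax(counts))]
    j_star = ok[np.argmin(counts[ok])]            # least conservative tau with nominal estimated coverage
    return taus[j_star]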
Intermission
If any man says he hates war more than I do, he better have a knife, that's all I have to say.
–Gandhi
53 / 87

m-out-of-n bootstrap
Bound-based intervals are complex (conceptually and technically)
Subsampling is easier to implement and understand11
Does not require specialized code etc.
Let mn be the resample size with mn → ∞ and mn = o(n), and let P(b)_{mn} be the bootstrap empirical distribution
Approximate √n{Vn(βn) − V(βn)} with its bootstrap analog √mn{V(b)_{mn}(β(b)_{mn}) − Vn(βn)} (Laber might draw a picture)
Let ℓmn and umn be the (α/2) × 100 and (1 − α/2) × 100 percentiles of √mn{V(b)_{mn}(β(b)_{mn}) − Vn(βn)}; the CI is given by
[ Vn(βn) − umn/√mn , Vn(βn) − ℓmn/√mn ]
11 Though the theory underpinning subsampling can be non-trivial, so 'understand' here is meant more mechanically.
54 / 87
Emmy the subsampling bootstrap algorithm
Alg. 1.6: m-out-of-n bootstrap confidence set for the conditional value
Input: mn; {(Xi, Ai, Yi)}, i = 1, . . . , n; M; α ∈ (0, 1)
1  for b = 1, . . . , M do
2      Draw a sample of size mn, say S(b)_{mn}, from {(Xi, Ai, Yi)} with replacement
3      Compute β(b)_{mn} on S(b)_{mn}
4      Δ(b)_{mn} = √mn { (1/mn) Σ_{i∈S(b)_{mn}} Yi 1{Ai XiTβ(b)_{mn} > 0} − (1/n) Σ_{k=1}^{n} Yk 1{Ak XkTβ(b)_{mn} > 0} }
5  end
6  Relabel so that Δ(1)_{mn} ≤ Δ(2)_{mn} ≤ · · · ≤ Δ(M)_{mn}
7  ℓmn = Δ(⌈Mα/2⌉)_{mn}
8  umn = Δ(⌈M(1−α/2)⌉)_{mn}
Output: [ Vn(dn) − umn/√mn , Vn(dn) − ℓmn/√mn ]
55 / 87
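A sketch of Alg. 1.6 for a linear rule. Here fit_rule is a placeholder returning β from a data set (e.g., one-stage Q-learning or OWL), the value estimator is the randomized-trial form Vn(β) = 2 Pn Y 1{A XTβ > 0} from the OWL slide (P(A = 1 | X) = 1/2 assumed), and mn = n^{2/3} is an assumed default (the slides tune mn with the double bootstrap).

import numpy as np

def value(beta, X, A, Y):
    return 2 * np.mean(Y * (A * (X @ beta) > 0))         # assumes P(A = 1 | X) = 1/2

def m_out_of_n_value_ci(X, A, Y, fit_rule, m=None, B=1000, alpha=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(Y)
    m = m or int(n ** (2 / 3))
    beta_hat = fit_rule(X, A, Y)
    V_hat = value(beta_hat, X, A, Y)
    delta = np.empty(B)
    for b in range(B):
        idx = rng.integers(n, size=m)
        beta_b = fit_rule(X[idx], A[idx], Y[idx])
        # resample value of the resample rule minus the full-sample value of the same rule
        delta[b] = np.sqrt(m) * (value(beta_b, X[idx], A[idx], Y[idx]) - value(beta_b, X, A, Y))
    u, low = np.quantile(delta, [1 - alpha / 2, alpha / 2])
    return V_hat - u / np.sqrt(m), V_hat - low / np.sqrt(m)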
m-out-of-n cont'd
Provides valid confidence intervals under (A0)-(A3)
Proof omitted (tedious)
Can tune mn using the double bootstrap
Reliance on asymptotic tomfoolery makes me hesitant to use this in practice12
12 I did not always think this way, see Chakraborty, L., and Zhao (2014ab). Also, I should not be so dismissive of these methods. Some of the work in this area has been quite deep and produced general uniformly convergent methods. See work by Romano and colleagues.
56 / 87

Confidence interval for the optimal regime within a class
Let π∗ denote the optimal txt regime within a given class; our goal is to construct a CI for V(π∗)
Were π∗ known, one could use √n{Vn(π∗) − V(π∗)}; with your stat buddy, compute this limiting distribution
Suppose that we could construct a valid confidence region for π∗; suggest a method for a CI for V(π∗)
57 / 87

Projection interval
For any fixed π, let ζn,1−ν(π) be a (1 − ν) × 100% confidence set for V(π), e.g., using the asymptotic approximation on the previous slide
Let Dn,1−η denote a (1 − η) × 100% confidence set for π∗; then a (1 − η − ν) × 100% confidence region for V(π∗) is
⋃_{π∈Dn,1−η} ζn,1−ν(π)
Why?
58 / 87
Ex. projection interval for a linear regime
Consider regimes of the form π(x; β) = sign(xTβ); then
√n{Vn(β) − V(β)} = √n(Pn − P)[ Y 1{A XTβ > 0} / P(A|X) ] ⇝ Normal(0, σ²(β)),
so take ζn,1−ν(β) = [ Vn(β) − z_{1−ν/2} σn(β)/√n , Vn(β) + z_{1−ν/2} σn(β)/√n ]
If √n(βn − β∗) ⇝ Normal(0, Σ), take
Dn,1−η = { β : n(βn − β)T Σn^{-1} (βn − β) ≤ χ²_{p,1−η} }
Projection interval: ⋃_{β∈Dn,1−η} ζn,1−ν(β)
59 / 87
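A sketch of the projection interval for V(π∗) with linear regimes, taking the union over a random search of the Wald ellipsoid Dn,1−η (a finite search can only shrink the union, so a dense search or grid is used in practice); the estimated β, its covariance Σ, and the randomization probability P(A | X) = 1/2 are assumed inputs.

import numpy as np
from scipy.stats import chi2, norm

def value_and_se(beta, X, A, Y, p_a=0.5):
    v = Y * (A * (X @ beta) > 0) / p_a                 # IPW pseudo-observations
    return v.mean(), v.std(ddof=1) / np.sqrt(len(Y))

def projection_interval(X, A, Y, beta_hat, Sigma_hat, eta=0.025, nu=0.025,
                        n_search=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    z = norm.ppf(1 - nu / 2)
    r = np.sqrt(chi2.ppf(1 - eta, df=p))
    L = np.linalg.cholesky(Sigma_hat / n)              # Cov(beta_n) is approximately Sigma / n
    lo, hi = np.inf, -np.inf
    for _ in range(n_search):
        u = rng.normal(size=p)
        u *= rng.uniform() ** (1 / p) * r / np.linalg.norm(u)   # uniform draw from the radius-r ball
        v, se = value_and_se(beta_hat + L @ u, X, A, Y)
        lo, hi = min(lo, v - z * se), max(hi, v + z * se)
    return lo, hi                                      # approx (1 - eta - nu) x 100% interval for V(pi*)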
Quiz break!
What does the 'Q' in Q-learning stand for?
In txt regimes, which of the following is not yet a thing: A-learning, B-learning, C-learning, D-learning, E-learning?
Write down the two-stage Q-learning algorithm assuming binary treatments and linear models at each stage
True or false: I would rather have a zombie ice dragon than two live fire dragons.
The story of the Easter bunny is based on the little-known story of Jesus swapping the internal organs of chickens and rabbits to prevent a widespread famine
Q-learning has been used to obtain state-of-the-art performance in game-playing domains like chess, backgammon, and Atari
60 / 87

Inference for two-stage linear Q-learning
Learning objectives
Identify the source of nonregularity
Understand the implications for coverage and asymptotic bias
Intuition behind the bounds
Hopefully this will be trivial for you now!
61 / 87

Reminder: setup and notation
Observe {(X1,i, A1,i, X2,i, A2,i, Yi)}, i = 1, . . . , n, i.i.d. from P
X1 ∈ Rp1 : baseline subj. info.
A1 ∈ {0, 1} : first treatment
X2 ∈ Rp2 : interim subj. info. during course of A1
A2 ∈ {0, 1} : second treatment
Y ∈ R : outcome, higher is better
Define histories H1 = X1, H2 = (X1, A1, X2)
DTR π = (π1, π2) where πt : supp Ht → supp At; a patient presenting with Ht = ht is assigned treatment πt(ht)
62 / 87

Characterizing the optimal DTR
The optimal regime maximizes the value E Y∗(π)
Define Q-functions
Q2(h2, a2) = E( Y | H2 = h2, A2 = a2 )
Q1(h1, a1) = E( maxa2 Q2(H2, a2) | H1 = h1, A1 = a1 )
Dynamic programming (Bellman, 1957)
πopt_t(ht) = arg maxat Qt(ht, at)
63 / 87
Q-learning
Regression-based dynamic programming algorithm
(Q0) Postulate working models for the Q-functions: Qt(ht, at; βt) = ht,0Tβt,0 + at ht,1Tβt,1, with ht,0, ht,1 features of ht
(Q1) Compute β2 = arg minβ2 Pn{Y − Q2(H2, A2; β2)}²
(Q2) Compute β1 = arg minβ1 Pn{ maxa2 Q2(H2, a2; β2) − Q1(H1, A1; β1) }²
(Q3) πt(ht) = arg maxat Qt(ht, at; βt)
Population parameters β∗t are obtained by replacing Pn with P
Inference for β∗2 is standard, just OLS
Focus on confidence intervals for cTβ∗1 for fixed c ∈ R^{dim β∗1}
64 / 87
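A sketch of steps (Q0)-(Q2) with treatments coded in {0, 1}; the feature maps (intercepts plus raw covariates) are illustrative choices, and the estimated rules in (Q3) are π̂t(ht) = 1{ht,1T β̂t,1 > 0}.

import numpy as np

def ols(D, y):
    return np.linalg.solve(D.T @ D, D.T @ y)

def q_learning(X1, A1, X2, A2, Y):
    n = len(Y)
    # stage 2: h20 = (1, X1, A1, X2), h21 = (1, X2)
    h20 = np.column_stack([np.ones(n), X1, A1, X2])
    h21 = np.column_stack([np.ones(n), X2])
    beta2 = ols(np.column_stack([h20, A2[:, None] * h21]), Y)            # (Q1)
    b20, b21 = beta2[:h20.shape[1]], beta2[h20.shape[1]:]
    # pseudo-outcome: max over a2 in {0,1} of Q2 = h20'b20 + max(0, h21'b21)
    Ytilde = h20 @ b20 + np.maximum(h21 @ b21, 0.0)
    # stage 1: h10 = h11 = (1, X1)
    h10 = np.column_stack([np.ones(n), X1])
    beta1 = ols(np.column_stack([h10, A1[:, None] * h10]), Ytilde)       # (Q2)
    return beta1, beta2
# (Q3): pi_hat_2(h2) = 1{h21'b21 > 0}; pi_hat_1(h1) = 1{h10'beta1[2:] > 0}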
Inference for cTβ∗1
The non-smooth max operator makes β1 non-regular
The distribution of cT√n(β1 − β∗1) is sensitive to small perturbations of P
The limiting distribution does not have mean zero (asymptotic bias)
Occurs with small second-stage txt effects, H2,1Tβ∗2,1 ≈ 0
Confidence intervals based on series approximations or the bootstrap can perform poorly; proposed remedies include:
Apply shrinkage to reduce asymptotic bias
Form conservative estimates of the tail probabilities of cT√n(β1 − β∗1)
65 / 87

Characterizing asymptotic bias
Definition
For a constant c ∈ R^{dim β∗1} and a √n-consistent estimator β1 of β∗1 with √n(β1 − β∗1) ⇝ M, define the c-directional asymptotic bias
Bias(β1, c) ≜ E cTM.
66 / 87
Characterizing asymptotic bias cont'd
Theorem (Asymptotic bias of Q-learning)
Let c ∈ R^{dim β∗1} be fixed. Under moment conditions:
Bias(β1, c) = cTΣ1,∞^{-1} P[ B1 √(H2,1TΣ21,21H2,1) 1{H2,1Tβ∗2,1 = 0} ] / √(2π),
where B1 = (H1,0T, A1H1,1T)T, Σ1,∞ = P B1B1T, and Σ21,21 is the asymptotic covariance of √n(β2,1 − β∗2,1).

Asymptotic bias for Q-learning
An average of cTΣ1,∞^{-1}B1 with weights ∝ Var( H2,1Tβ2,1 1{H2,1Tβ∗2,1 = 0} | H2,1 )
May be reduced by shrinking h2,1Tβ2,1 when h2,1Tβ∗2,1 = 0
67 / 87
Reducing asymptotic bias to improve inference
Shrinkage is a popular method for reducing asymptotic bias with the goal of improving interval coverage
Chakraborty et al. (2009) apply soft-thresholding
Moodie et al. (2010) apply hard-thresholding
Goldberg et al. (2013) and Song et al. (2015) use lasso-type penalization
Shrinkage methods target
maxa2 Q2(h2, a2; β2) = h2,0Tβ2,0 + max_{a2∈{0,1}} a2 h2,1Tβ2,1 = h2,0Tβ2,0 + [h2,1Tβ2,1]+
68 / 87
Soft-thresholding (Chakraborty et al., 2009)
In Q-learning, replace maxa2 Q2(H2, a2; β2) with
H2,0Tβ2,0 + [H2,1Tβ2,1]+ { 1 − σ H2,1TΣ21,21H2,1 / ( n (H2,1Tβ2,1)² ) }+
The amount of shrinkage is governed by σ > 0
Penalization schemes (Goldberg et al., 2013; Song et al., 2015) reduce to this estimator under certain designs
No theoretical justification in Chakraborty et al. (2009) but improved coverage of bootstrap intervals in some settings
69 / 87
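A sketch of the soft-thresholded pseudo-outcome for a single subject; Sigma21 is an estimate of the asymptotic covariance of √n(β̂2,1 − β∗2,1), e.g., the corresponding block of the stage-2 sandwich covariance (an assumed input).

import numpy as np

def soft_threshold_pseudo_outcome(h20, h21, b20, b21, Sigma21, n, sigma=3.0):
    z = h21 @ b21                                  # estimated stage-2 effect h21'beta21
    var_hat = h21 @ Sigma21 @ h21 / n              # plug-in variance of h21'beta21_hat
    shrink = max(0.0, 1.0 - sigma * var_hat / z**2) if z != 0 else 0.0
    return h20 @ b20 + max(z, 0.0) * shrink        # thresholded version of h20'b20 + [h21'b21]_+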
Soft-thresholding and asymptotic bias
Theorem
Let c ∈ R^{dim β∗1} and let βσ1 denote the soft-thresholding estimator. Under moment conditions:
1. |Bias(βσ1, c)| ≤ |Bias(β1, c)| for any σ > 0.
2. If Bias(β1, c) ≠ 0, then for σ > 0
Bias(βσ1, c) / Bias(β1, c) = exp(−σ/2) − σ ∫_{√σ}^{∞} (1/x) exp(−x²/2) dx
70 / 87

Soft-thresholding and asymptotic bias cont'd
Is thresholding useful in reducing asymptotic bias?
The preceding theorem says yes, and more shrinkage is better
Chakraborty et al. suggest σ = 3, which corresponds to a 13-fold decrease in asymptotic bias
However, the preceding theorem is based on pointwise, i.e., fixed-parameter, asymptotics and may not faithfully reflect small-sample performance
71 / 87
Local generative model
Use local asymptotics to approximate the small-sample behavior of soft-thresholding
Assume:
1. For any s ∈ R^{dim β∗2,1} there exists a sequence of distributions Pn so that
∫ [ √n( dPn^{1/2} − dP^{1/2} ) − (1/2) νs dP^{1/2} ]² → 0,
for some measurable function νs.
2. β∗2,1,n = β∗2,1 + s/√n, where β∗2,n = arg minβ2 Pn{Y − Q2(H2, A2; β2)}²
72 / 87
Local asymptotics view of soft-thresholding
Theorem
Let c ∈ R^{dim β∗1} be fixed. Under the local generative model and moment conditions:
1. sup_{s∈R^{dim β∗2,1}} |Bias(β1, c)| ≤ K < ∞.
2. sup_{s∈R^{dim β∗2,1}} |Bias(βσ1, c)| → ∞ as σ → ∞.
Thresholding can be infinitely worse than doing nothing if done too aggressively in small samples
73 / 87
Data-driven tuning
Is it possible to construct a data-driven choice of σ that consistently leads to less asymptotic bias than no shrinkage?
Consider data from a two-arm randomized trial {(Ai, Yi)}, i = 1, . . . , n, with A ∈ {0, 1} and Y ∈ R coded so that higher is better13
Define µ∗a = E(Y | A = a) and µa = PnY 1{A = a} / Pn1{A = a}
Mean outcome under optimal treatment assignment: θ∗ = max(µ∗0, µ∗1), with corresponding estimator
θ = max(µ0, µ1) = µ0 + [µ1 − µ0]+
Soft-thresholding estimator:
θσ = µ0 + [µ1 − µ0]+ { 1 − 4σ / ( n(µ1 − µ0)² ) }+
13 This is equivalent to two-stage Q-learning with no covariates and a single first-stage treatment.
74 / 87
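A simulation of the toy example (my own numbers: n = 100, unit-variance outcomes) comparing the bias of θ and of the soft-thresholded θσ across true effects µ∗1 − µ∗0; it mirrors the message of the local analysis: shrinkage helps a great deal at an exact tie but can over-correct for effects of order 1/√n.

import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 20000

def biases(delta_true, sigma):
    A = rng.integers(0, 2, size=(reps, n))
    Y = delta_true * A + rng.normal(size=(reps, n))          # mu0* = 0, mu1* = delta_true
    n1 = np.maximum(A.sum(axis=1), 1)
    n0 = np.maximum((1 - A).sum(axis=1), 1)
    mu1 = (Y * A).sum(axis=1) / n1
    mu0 = (Y * (1 - A)).sum(axis=1) / n0
    d = mu1 - mu0
    theta = mu0 + np.maximum(d, 0.0)
    theta_sig = mu0 + np.maximum(d, 0.0) * np.maximum(0.0, 1.0 - 4 * sigma / (n * d**2))
    truth = max(0.0, delta_true)
    return theta.mean() - truth, theta_sig.mean() - truth

for delta in (0.0, 0.1, 0.3):
    for sigma in (1.0, 3.0):
        b, bs = biases(delta, sigma)
        print(f"delta={delta:.1f} sigma={sigma:.0f}: bias(theta)={b:+.3f}  bias(theta_sigma)={bs:+.3f}")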
Data-driven tuning: toy example
[Figure: bias of the soft-thresholded estimator as a function of the tuning parameter for several sample sizes (n = 10, n = 100, ...)]
The optimal value of σ depends on µ∗1 − µ∗0
Variability in µ1 − µ0 prevents identification of the optimal value of σ; using a plug-in estimator may lead to large bias
A data-driven σ that significantly improves asymptotic bias over no shrinkage is difficult to obtain
75 / 87
Asymptotic bias: discussion
Asymptotic bias exists in Q-learning
Local asymptotics show that aggressively shrinking to reduce asymptotic bias can be infinitely worse than no shrinkage
Data-driven tuning seems to require choosing σ very small or risking large bias
76 / 87

Confidence intervals for cTβ∗1
Possible to construct valid confidence intervals in the presence of asymptotic bias
Idea: construct regular bounds on cT√n(β1 − β∗1)
Bootstrap the bounds to form a confidence interval
Tightest among all regular bounds ⇒ automatic adaptivity
Local uniform convergence
Can also obtain conditional properties (Robins and Rotnitzky, 2014) and global uniform convergence (Wu, 2014)
77 / 87
Regular bounds on cT√n(β1 − β∗1)
Define Vn(c, γ) = cTSn + cTΣ1^{-1} Pn Un(γ), where
Sn = Σ1^{-1} √n(Pn − P) B1[ H2,0Tβ∗2,0 + (H2,1Tβ∗2,1)+ − B1Tβ∗1 ] + Σ1^{-1} √n Pn B1 H2,0T(β2,0 − β∗2,0),
Un(γ) = B1[ ( H2,1T(Zn + γ) )+ − ( H2,1Tγ )+ ]
Sn is smooth and Un(γ) is non-smooth
78 / 87

Regular bounds on cT√n(β1 − β∗1) cont'd
It can be shown that cT√n(β1 − β∗1) = Vn(c, β∗2,1)
Use pretesting to construct an upper bound:
Un(c) = cTSn + cTΣ1^{-1} Pn[ Un(β∗2,1) 1{Tn(H2,1) > λn} ] + sup_γ cTΣ1^{-1} Pn[ Un(γ) 1{Tn(H2,1) ≤ λn} ],
where Tn(h2,1) is a test statistic for the null h2,1Tβ∗2,1 = 0 and λn is a critical value
The lower bound, Ln(c), is obtained by taking an inf
Bootstrap the bounds to find a confidence interval
79 / 87

Validity of the bounds
Theorem
Let c ∈ R^{dim β∗1} be fixed. Assume the local generative model; under moment conditions and conditions on the pretest:
1. cT√n(β1 − β∗1,n) ⇝ cTS∞ + cTΣ1,∞^{-1} P[ B1 H2,1TZ∞ 1{H2,1Tβ∗2,1 > 0} ] + cTΣ1,∞^{-1} P[ B1 { (H2,1T(Z∞ + s))+ − (H2,1Ts)+ } 1{H2,1Tβ∗2,1 = 0} ]
2. Un(c) ⇝ cTS∞ + cTΣ1,∞^{-1} P[ B1 H2,1TZ∞ 1{H2,1Tβ∗2,1 > 0} ] + sup_γ cTΣ1,∞^{-1} P[ B1 { (H2,1T(Z∞ + γ))+ − (H2,1Tγ)+ } 1{H2,1Tβ∗2,1 = 0} ]
80 / 87
Validity of the bootstrap bounds
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗1}. Let ℓ and u denote the (α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap distribution of the bounds, and let PM denote the probability with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any ε > 0:
P( PM{ cTβ1 − u/√n ≤ cTβ∗1 ≤ cTβ1 − ℓ/√n } < 1 − α − ε ) = o(1).
81 / 87

Uniform validity of the bootstrap bounds (Tianshuang Wu)
Theorem
Fix α ∈ (0, 1) and c ∈ R^{dim β∗1}. Let ℓ and u denote the (α/2) × 100 and (1 − α/2) × 100 percentiles of the bootstrap distribution of the bounds, and let PM denote the probability with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any ε > 0:
sup_{P∈𝒫} P( PM{ cTβ1 − u/√n ≤ cTβ∗1 ≤ cTβ1 − ℓ/√n } < 1 − α − ε )
converges to zero for a large class of distributions 𝒫.
82 / 87
Simulation experiments
Class of generative models
Xt ∈ {−1, 1}, At ∈ {−1, 1}, t ∈ {1, 2}
P(At = 1) = P(At = −1) = 0.5, t ∈ {1, 2}
X1 ∼ Bernoulli(0.5)
X2 | X1, A1 ∼ Bernoulli{expit(δ1X1 + δ2A1)}
ε ∼ N(0, 1)
Y = γ1 + γ2X1 + γ3A1 + γ4X1A1 + γ5A2 + γ6X2A2 + γ7A1A2 + ε
Vary the parameters to obtain a range of effect sizes; classify generative models as
Non-regular (NR)
Nearly non-regular (NNR)
Regular (R)
83 / 87

Simulation experiments cont'd
Compare the bound-based confidence interval (ACI) with the bootstrap (BOOT) and bootstrap thresholding (THRESH)
Compare in terms of width and coverage (target 95%)
Results based on 1000 Monte Carlo replications with datasets of size n = 150
Bootstrap computed with 1000 resamples
Tuning parameter λn chosen with the double bootstrap
84 / 87
Simulation experiments: results
Coverage (target 95%)
Method   Ex. 1 NNR  Ex. 2 NR  Ex. 3 NNR  Ex. 4 R  Ex. 5 NR  Ex. 6 NNR
BOOT     0.935*     0.930*    0.933*     0.928*   0.925*    0.928*
THRESH   0.945      0.938     0.942      0.943    0.759*    0.762*
ACI      0.971      0.958     0.961      0.943    0.953     0.953

Average width
Method   Ex. 1 NNR  Ex. 2 NR  Ex. 3 NNR  Ex. 4 R  Ex. 5 NR  Ex. 6 NNR
BOOT     0.385*     0.430*    0.430*     0.436*   0.428*    0.428*
THRESH   0.339      0.426     0.427      0.436    0.426*    0.424*
ACI      0.441      0.470     0.470      0.469    0.473     0.473
85 / 87
Ex. DTR for ADHD without uncertainty
[Flowchart: the estimated regime tailors the initial treatment (low-dose MEDS vs. low-dose BMOD) on prior medication; responders continue their initial treatment; non-responders are assigned to intensify or augment treatment depending on adherence]
86 / 87

Ex. DTR for ADHD with uncertainty
[Flowchart: as above, but nodes where the data cannot distinguish between options are displayed as sets, e.g., "Low dose MEDS ∼OR∼ BMOD" and "Add OTHER ∼OR∼ Intensify SAME"]
87 / 87