Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Advanced Econometrics #3: Model & Variable Selection*
A. Charpentier (Université de Rennes 1)
Université de Rennes 1,
Graduate Course, 2017.
@freakonometrics 1
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
“Great plot.
Now need to find the theory that explains it”
Deville (2017) http://twitter.com
@freakonometrics 2
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Problem : x⋆ ∈ argmin{f(x); x ∈ R^d}
Gradient descent : x_{k+1} = x_k − η ∇f(x_k), starting from some x_0
Problem : x⋆ ∈ argmin{f(x); x ∈ X ⊂ R^d}
Projected descent : x_{k+1} = Π_X ( x_k − η ∇f(x_k) ), starting from some x_0
A constrained problem is said to be convex if
    min{f(x)} with f convex
    s.t. gi(x) = 0, ∀i = 1, · · · , n, with gi linear
         hi(x) ≤ 0, ∀i = 1, · · · , m, with hi convex
Lagrangian : L(x, λ, µ) = f(x) + Σ_{i=1}^n λi gi(x) + Σ_{i=1}^m µi hi(x), where x are primal
variables and (λ, µ) are dual variables.
Remark: L is an affine function in (λ, µ).
@freakonometrics 3
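As an illustration, here is a minimal R sketch of the gradient-descent recursion on a hypothetical quadratic objective (the function f, the step size η = 0.1 and the starting point are illustrative choices, not taken from the slides):

# gradient descent x_{k+1} = x_k - eta * grad f(x_k) on a toy convex objective
f      <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 1)^2
grad_f <- function(x) c(2 * (x[1] - 1), 4 * (x[2] + 1))
x   <- c(0, 0)    # starting point x_0
eta <- 0.1        # step size
for (k in 1:200) x <- x - eta * grad_f(x)
x                 # converges to the minimizer (1, -1)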
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Karush–Kuhn–Tucker conditions : a convex problem has a solution x⋆ if and
only if there are (λ⋆, µ⋆) such that the following conditions hold
• stationarity : ∇_x L(x, λ, µ) = 0 at (x⋆, λ⋆, µ⋆)
• primal admissibility : gi(x⋆) = 0 and hi(x⋆) ≤ 0, ∀i
• dual admissibility : µ⋆ ≥ 0
Let L denote the associated dual function, L(λ, µ) = min_x {L(x, λ, µ)}.
L is a concave function in (λ, µ) and the dual problem is max_{λ,µ} {L(λ, µ)}.
@freakonometrics 4
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
References
Motivation
Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip:
Identifying Central Individuals in a Social Network.
References
Belloni, A. & Chernozhukov, V. (2009). Least squares after model selection in
high-dimensional sparse models.
Hastie, T., Tibshirani, R. & Wainwright, M. (2015). Statistical Learning with
Sparsity: The Lasso and Generalizations. CRC Press.
@freakonometrics 5
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preambule
Assume that y = m(x) + ε, where ε is some idiosyncratic, unpredictable noise.
The error E[(y − m̂(x))^2] is the sum of three terms
• variance of the estimator : E[(m̂(x) − E[m̂(x)])^2]
• bias^2 of the estimator : [E[m̂(x)] − m(x)]^2
• variance of the noise : E[(y − m(x))^2]
(the latter exists, even with a ‘perfect’ model).
@freakonometrics 6
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preambule
Consider a parametric model, with true (unknown) parameter θ, then
mse(θ̂) = E[(θ̂ − θ)^2] = E[(θ̂ − E[θ̂])^2]  (variance)  + (E[θ̂] − θ)^2  (bias^2)
Let θ̃ denote an unbiased estimator of θ. Then
θ̂ = θ^2/(θ^2 + mse(θ̃)) · θ̃ = θ̃ − mse(θ̃)/(θ^2 + mse(θ̃)) · θ̃   (the second term acting as a penalty)
satisfies mse(θ̂) ≤ mse(θ̃).
@freakonometrics 7
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Occam’s Razor
The “law of parsimony”, “lex parsimoniæ”
Penalize too complex models
@freakonometrics 8
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Let X ∼ N(µ, σ^2 I). We want to estimate µ.
µ̂_mle = X̄_n ∼ N( µ, (σ^2/n) I ).
From James & Stein (1961) Estimation with quadratic loss,
µ̂_JS = ( 1 − (d − 2)σ^2 / (n‖y‖^2) ) · y
where ‖·‖ is the Euclidean norm.
One can prove that if d ≥ 3,
E[ ‖µ̂_JS − µ‖^2 ] < E[ ‖µ̂_mle − µ‖^2 ]
Samworth (2015) Stein’s paradox, “one should use the price of tea in China to
obtain a better estimate of the chance of rain in Melbourne”.
@freakonometrics 9
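A quick simulation makes the paradox visible; the following R sketch (illustrative values of d, µ and σ, with one observation per replication — all assumptions, not taken from the slides) compares the quadratic losses of the two estimators:

# James-Stein vs MLE, with d = 10 >= 3
set.seed(1)
d <- 10; sigma <- 1; mu <- rep(0.5, d); nsim <- 1000
err_mle <- err_js <- numeric(nsim)
for (s in 1:nsim) {
  y <- rnorm(d, mu, sigma)                          # one draw of X ~ N(mu, sigma^2 I)
  mu_js <- (1 - (d - 2) * sigma^2 / sum(y^2)) * y   # shrink towards 0
  err_mle[s] <- sum((y - mu)^2)
  err_js[s]  <- sum((mu_js - mu)^2)
}
c(mle = mean(err_mle), js = mean(err_js))           # JS has the smaller risk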
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Heuristics : consider a biased estimator, to decrease the variance.
See Efron (2010) Large-Scale Inference
@freakonometrics 10
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Motivation: Avoiding Overfit
Generalization : the model should perform well on new data (and not only on the
training ones).
@freakonometrics 11
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
Use principal components to reduce dimension (on centered and scaled variables):
we want d vectors z_1, · · · , z_d such that
the first component is z_1 = Xω_1 where
ω_1 = argmax_{‖ω‖=1} ‖X · ω‖^2 = argmax_{‖ω‖=1} ω^T X^T X ω
and the second component is z_2 = Xω_2 where
ω_2 = argmax_{‖ω‖=1} ‖X^{(1)} · ω‖^2
[Figure: log mortality rates against age, and the first two PC scores (PC score 1 vs PC score 2); years such as 1914–1919 and 1940–1944 stand out.]
with X^{(1)} = X − Xω_1 ω_1^T, where Xω_1 = z_1.
@freakonometrics 12
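In R, the components above can be obtained with prcomp; the simulated design below is an illustrative placeholder (on real data one would pass the centered and scaled variables):

# first two principal components on standardized variables
X  <- scale(matrix(rnorm(200 * 5), 200, 5))      # illustrative design matrix
pc <- prcomp(X, center = FALSE, scale. = FALSE)  # already standardized above
omega1 <- pc$rotation[, 1]                       # loading vector omega_1
z1 <- X %*% omega1                               # first component z_1 = X omega_1
z2 <- pc$x[, 2]                                  # second component z_2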
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
A regression on (the d) principal components, y = z^T β + η, could be an
interesting idea; unfortunately, principal components have no reason to be
correlated with y. The first component was z_1 = Xω_1 where
ω_1 = argmax_{‖ω‖=1} ‖X · ω‖^2 = argmax_{‖ω‖=1} ω^T X^T X ω
It is an unsupervised technique.
Instead, use partial least squares, introduced in Wold (1966) Estimation of
Principal Components and Related Models by Iterative Least Squares. The first
component is z_1 = Xω_1 where
ω_1 = argmax_{‖ω‖=1} { ⟨y, X · ω⟩ } = argmax_{‖ω‖=1} ω^T X^T y y^T X ω
@freakonometrics 13
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Terminology
Consider a dataset {y_i, x_i}, assumed to be generated from (Y, X), from an
unknown distribution P.
Let m_0(·) be the “true” model. Assume that y_i = m_0(x_i) + ε_i.
In a regression context (quadratic loss function), the risk associated to m is
R(m) = E_P[ (Y − m(X))^2 ]
An optimal model m⋆ within a class M satisfies
R(m⋆) = inf_{m∈M} R(m)
Such a model m⋆ is usually called oracle.
Observe that m⋆(x) = E[Y |X = x] is the solution of
R(m⋆) = inf_{m∈M} R(m), where M is the set of measurable functions.
@freakonometrics 14
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The empirical risk is
R̂_n(m) = (1/n) Σ_{i=1}^n ( y_i − m(x_i) )^2
For instance, m can be a linear predictor, m(x) = β_0 + x^T β, where θ = (β_0, β)
should be estimated (trained).
E[R̂_n(m)] = E[(m(X) − Y)^2] can be expressed as
E[(m(X) − E[m(X)|X])^2]   (variance of m)
+ E[(E[m(X)|X] − E[Y |X])^2]   (bias of m, where E[Y |X] = m_0(X))
+ E[(Y − E[Y |X])^2]   (variance of the noise)
The third term is the risk of the “optimal” estimator m⋆, which cannot be
decreased.
@freakonometrics 15
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
Consider a linear predictor (see #1), i.e. ŷ = m̂(x) = Ay.
Assume that y = m_0(x) + ε, with E[ε] = 0 and Var[ε] = σ^2 I.
Let ‖·‖ denote the Euclidean norm.
Empirical risk : R̂_n(m̂) = (1/n) ‖y − m̂(x)‖^2
Vapnik’s risk : E[R_n(m̂)] = (1/n) ‖m_0(x) − m̂(x)‖^2 + (1/n) E[ ‖y − m_0(x)‖^2 ], with
m_0(x) = E[Y |X = x].
Observe that
n E[R̂_n(m̂)] = E[ ‖y − m̂(x)‖^2 ] = ‖(I − A)m_0‖^2 + σ^2 ‖I − A‖^2
while
E[ ‖m_0(x) − m̂(x)‖^2 ] = ‖(I − A)m_0‖^2   (bias)   + σ^2 ‖A‖^2   (variance)
@freakonometrics 16
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
One can obtain
E[R_n(m̂)] = E[R̂_n(m̂)] + 2 (σ^2/n) trace(A),
where R̂_n is the empirical risk: if trace(A) ≥ 0, the empirical risk underestimates
the true risk of the estimator.
The number of degrees of freedom of the (linear) predictor is related to trace(A).
2 (σ^2/n) trace(A) is called Mallows’ penalty C_L.
If A is a projection matrix, trace(A) is the dimension of the projection space, p,
and we obtain Mallows’ C_P , 2 (σ^2/n) p.
Remark : Mallows (1973) Some Comments on C_p introduced this penalty while
focusing on the R^2.
@freakonometrics 17
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
C_P is associated to a quadratic risk;
an alternative is to use a distance on the (conditional) distribution of Y , namely
the Kullback-Leibler distance
discrete case : D_KL(P‖Q) = Σ_i P(i) log[ P(i)/Q(i) ]
continuous case : D_KL(P‖Q) = ∫_{−∞}^{∞} p(x) log[ p(x)/q(x) ] dx
Let f denote the true (unknown) density, and f_θ some parametric distribution,
D_KL(f‖f_θ) = ∫ f(x) log[ f(x)/f_θ(x) ] dx = ∫ f(x) log[f(x)] dx − ∫ f(x) log[f_θ(x)] dx
where the first term is the relative information.
Hence
minimize {D_KL(f‖f_θ)} ←→ maximize E[ log f_θ(X) ]
@freakonometrics 18
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Akaike (1974) A new look at the statistical model identification observed that for n
large enough
E[ log f_θ(X) ] ∼ log[L(θ̂)] − dim(θ)
Thus
AIC = −2 log L(θ̂) + 2 dim(θ)
Example : in a (Gaussian) linear model, y_i = β_0 + x_i^T β + ε_i,
AIC = n log( (1/n) Σ_{i=1}^n ε̂_i^2 ) + 2[dim(β) + 2]
@freakonometrics 19
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Remark : this is valid for large samples (rule of thumb: n/dim(θ) > 40);
otherwise use a corrected AIC,
AICc = AIC + 2k(k + 1)/(n − k − 1)   (bias correction)
where k = dim(θ),
see Sugiura (1978) Further analysis of the data by Akaike’s information criterion and
the finite corrections, the second-order AIC.
Using a Bayesian interpretation, Schwarz (1978) Estimating the dimension of a
model obtained
BIC = −2 log L(θ̂) + log(n) dim(θ).
Observe that the criteria considered are of the form
criterion = −function(L(θ̂)) + penalty(complexity)
@freakonometrics 20
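As a small illustration, these criteria are readily available in R for a Gaussian linear model (the dataset and model below are illustrative choices, not taken from the slides):

# AIC, AICc and BIC for a Gaussian linear model
fit <- lm(dist ~ speed, data = cars)        # illustrative model on a built-in dataset
k <- length(coef(fit)) + 1                  # dim(theta), counting sigma^2
n <- nobs(fit)
AIC(fit)                                    # -2 log L + 2 k
AIC(fit) + 2 * k * (k + 1) / (n - k - 1)    # corrected AICc
BIC(fit)                                    # -2 log L + log(n) k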
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
Consider a naive bootstrap procedure, based on a bootstrap sample
S_b = {(y_i^{(b)}, x_i^{(b)})}.
The plug-in estimator of the empirical risk is
R̂_n(m̂^{(b)}) = (1/n) Σ_{i=1}^n ( y_i − m̂^{(b)}(x_i) )^2
and then
R̂_n = (1/B) Σ_{b=1}^B R̂_n(m̂^{(b)}) = (1/B) Σ_{b=1}^B (1/n) Σ_{i=1}^n ( y_i − m̂^{(b)}(x_i) )^2
@freakonometrics 21
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
One might improve this estimate using an out-of-bag procedure
R̂_n = (1/n) Σ_{i=1}^n (1/#B_i) Σ_{b∈B_i} ( y_i − m̂^{(b)}(x_i) )^2
where B_i is the set of all bootstrap samples that do not contain (y_i, x_i).
Remark: P( (y_i, x_i) ∉ S_b ) = (1 − 1/n)^n ∼ e^{−1} = 36.78%.
@freakonometrics 22
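A minimal R sketch of this out-of-bag risk estimate, on simulated data with an illustrative polynomial regression (all choices below are assumptions, not taken from the slides):

# out-of-bag estimate of the risk
set.seed(1)
n <- 100; B <- 200
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = .3)
err <- matrix(NA, B, n)
for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)                 # bootstrap sample S_b
  fit <- lm(y ~ poly(x, 3), data = data.frame(x = x[idx], y = y[idx]))
  oob <- setdiff(1:n, idx)                              # observations not in S_b
  err[b, oob] <- (y[oob] - predict(fit, newdata = data.frame(x = x[oob])))^2
}
mean(colMeans(err, na.rm = TRUE), na.rm = TRUE)         # out-of-bag risk estimate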
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Least Squares Estimator : β̂ = (X^T X)^{-1} X^T y
Unbiased Estimator : E[β̂] = β
Variance : Var[β̂] = σ^2 (X^T X)^{-1},
which can be (extremely) large when det[(X^T X)] ∼ 0.
X =
  [ 1 −1  2 ]
  [ 1  0  1 ]
  [ 1  2 −1 ]
  [ 1  1  0 ]
then
X^T X =
  [ 4  2  2 ]
  [ 2  6 −4 ]
  [ 2 −4  6 ]
while
X^T X + I =
  [ 5  2  2 ]
  [ 2  7 −4 ]
  [ 2 −4  7 ]
eigenvalues : {10, 6, 0} vs. {11, 7, 1}
Ad-hoc strategy: use X^T X + λI
@freakonometrics 23
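The small numerical example above can be checked directly in R:

# eigenvalues of X'X and of the regularized X'X + I
X <- matrix(c(1, -1,  2,
              1,  0,  1,
              1,  2, -1,
              1,  1,  0), nrow = 4, byrow = TRUE)
eigen(t(X) %*% X)$values            # 10, 6, 0 : X'X is singular
eigen(t(X) %*% X + diag(3))$values  # 11, 7, 1 : regularization removes the null eigenvalue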
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Evolution of (β_1, β_2) → Σ_{i=1}^n [y_i − (β_1 x_{1,i} + β_2 x_{2,i})]^2
when cor(X_1, X_2) = r ∈ [0, 1], on top.
Below, Ridge regression
(β_1, β_2) → Σ_{i=1}^n [y_i − (β_1 x_{1,i} + β_2 x_{2,i})]^2 + λ(β_1^2 + β_2^2)
where λ ∈ [0, ∞),
when cor(X_1, X_2) ∼ 1 (collinearity).
@freakonometrics 24
[Figure: contours of the two objectives in (β_1, β_2).]
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Normalization : Euclidean ℓ_2 vs. Mahalanobis
We want to penalize complicated models :
if β_k is “too small”, we prefer to have β_k = 0.
Instead of d(x, y) = (x − y)^T(x − y),
use d_Σ(x, y) = (x − y)^T Σ^{-1} (x − y).
[Figure: contours of the penalized objective in (β_1, β_2) under the two distances.]
@freakonometrics 25
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
... like least squares, but it shrinks estimated coefficients towards 0.
β̂_λ^{ridge} = argmin { Σ_{i=1}^n (y_i − x_i^T β)^2 + λ Σ_{j=1}^p β_j^2 }
β̂_λ^{ridge} = argmin { ‖y − Xβ‖_2^2   (criteria)   + λ‖β‖_2^2   (penalty) }
λ ≥ 0 is a tuning parameter.
The constant is usually unpenalized. The true equation is
β̂_λ^{ridge} = argmin { ‖y − (β_0 + Xβ)‖_2^2   (criteria)   + λ‖β‖_2^2   (penalty) }
@freakonometrics 26
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
β̂_λ^{ridge} = argmin { ‖y − (β_0 + Xβ)‖_2^2 + λ‖β‖_2^2 }
can be seen as a constrained optimization problem
β̂_λ^{ridge} = argmin_{‖β‖_2^2 ≤ h_λ} { ‖y − (β_0 + Xβ)‖_2^2 }
Explicit solution
β̂_λ = (X^T X + λI)^{-1} X^T y
If λ → 0, β̂_0^{ridge} = β̂^{ols};
if λ → ∞, β̂_∞^{ridge} = 0.
[Figure: contours of the least-squares objective and the ℓ_2 constraint region in (β_1, β_2), for two values of λ.]
@freakonometrics 27
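The explicit solution and its two limits can be checked with a few lines of R (simulated, standardized data; the design and the values of λ are illustrative):

# explicit ridge solution (X'X + lambda I)^{-1} X'y and its limits
set.seed(1)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))
y <- X %*% c(1, -2, 0) + rnorm(n)
ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)
cbind(ols = c(ridge(0)), lambda_10 = c(ridge(10)), lambda_1e6 = c(ridge(1e6)))
# lambda -> 0 gives the OLS estimate, lambda -> infinity shrinks towards 0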
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
This penalty can be seen as rather unfair if components of x are not expressed
on the same scale
• center: x̄_j = 0, then β̂_0 = ȳ
• scale: x_j^T x_j = 1
Then compute
β̂_λ^{ridge} = argmin { ‖y − Xβ‖_2^2   (loss)   + λ‖β‖_2^2   (penalty) }
[Figure: contours and constraint regions in (β_1, β_2), before and after standardization.]
@freakonometrics 28
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
Observe that if x_{j_1} ⊥ x_{j_2}, then
β̂_λ^{ridge} = [1 + λ]^{-1} β̂^{ols}
which explains the relationship with shrinkage.
But generally, it is not the case...
Theorem. There exists λ such that mse[β̂_λ^{ridge}] ≤ mse[β̂^{ols}].
@freakonometrics 29
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
L_λ(β) = Σ_{i=1}^n (y_i − β_0 − x_i^T β)^2 + λ Σ_{j=1}^p β_j^2
∂L_λ(β)/∂β = −2X^T y + 2(X^T X + λI)β
∂^2 L_λ(β)/∂β∂β^T = 2(X^T X + λI)
where X^T X is a semi-positive definite matrix, and λI is a positive definite
matrix, so that
β̂_λ = (X^T X + λI)^{-1} X^T y
@freakonometrics 30
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Bayesian Interpretation
From a Bayesian perspective,
P[θ|y]   (posterior)   ∝ P[y|θ]   (likelihood)   · P[θ]   (prior)
i.e. log P[θ|y] = log P[y|θ]   (log-likelihood)   + log P[θ]   (penalty)
If β has a prior N(0, τ^2 I) distribution, then its posterior distribution has mean
E[β|y, X] = ( X^T X + (σ^2/τ^2) I )^{-1} X^T y.
@freakonometrics 31
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
β̂_λ = (X^T X + λI)^{-1} X^T y
E[β̂_λ] = X^T X(λI + X^T X)^{-1} β,
i.e. E[β̂_λ] ≠ β (the ridge estimator is biased).
Observe that E[β̂_λ] → 0 as λ → ∞.
Assume that X is an orthogonal design matrix, i.e. X^T X = I, then
β̂_λ = (1 + λ)^{-1} β̂^{ols}.
@freakonometrics 32
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Set W_λ = (I + λ[X^T X]^{-1})^{-1}. One can prove that
W_λ β̂^{ols} = β̂_λ.
Thus,
Var[β̂_λ] = W_λ Var[β̂^{ols}] W_λ^T
and
Var[β̂_λ] = σ^2 (X^T X + λI)^{-1} X^T X [(X^T X + λI)^{-1}]^T.
Observe that
Var[β̂^{ols}] − Var[β̂_λ] = σ^2 W_λ [ 2λ(X^T X)^{-2} + λ^2(X^T X)^{-3} ] W_λ^T ≥ 0.
@freakonometrics 33
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is
indeed smaller than the OLS one.
If X is an orthogonal design matrix,
Var[β̂_λ] = σ^2 (1 + λ)^{-2} I.
mse[β̂_λ] = σ^2 trace( W_λ(X^T X)^{-1} W_λ^T ) + β^T (W_λ − I)^T (W_λ − I) β.
If X is an orthogonal design matrix,
mse[β̂_λ] = pσ^2/(1 + λ)^2 + λ^2/(1 + λ)^2 · β^T β
@freakonometrics 34
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
mse[β̂_λ] = pσ^2/(1 + λ)^2 + λ^2/(1 + λ)^2 · β^T β
is minimal for
λ⋆ = pσ^2 / β^T β
Note that there exists λ > 0 such that mse[β̂_λ] < mse[β̂_0] = mse[β̂^{ols}].
@freakonometrics 35
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
SVD decomposition
Consider the singular value decomposition X = UDV^T. Then
β̂^{ols} = V D^{-2} D U^T y
β̂_λ = V (D^2 + λI)^{-1} D U^T y
Observe that
D_{i,i}^{-1} ≥ D_{i,i} / (D_{i,i}^2 + λ)
hence, the ridge penalty shrinks singular values.
Set now R = UD (an n × n matrix), so that X = RV^T; then
β̂_λ = V (R^T R + λI)^{-1} R^T y
@freakonometrics 36
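The SVD expression can be verified numerically in R (illustrative simulated design and value of λ):

# ridge coefficients through the SVD X = U D V'
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3)); y <- rnorm(100)
lambda <- 2
s <- svd(X)
beta_svd   <- s$v %*% diag(s$d / (s$d^2 + lambda)) %*% t(s$u) %*% y
beta_ridge <- solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)
max(abs(beta_svd - beta_ridge))   # numerically zero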
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Hat matrix and Degrees of Freedom
Recall that Ŷ = HY with
H = X(X^T X)^{-1} X^T
Similarly
H_λ = X(X^T X + λI)^{-1} X^T
trace[H_λ] = Σ_{j=1}^p d_{j,j}^2 / (d_{j,j}^2 + λ) → 0, as λ → ∞.
@freakonometrics 37
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Sparsity Issues
In several applications, k can be (very) large, but a lot of features are just noise:
β_j = 0 for many j’s. Let s denote the number of relevant features, with s ≪ k,
cf. Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity,
s = card{S} where S = {j; β_j ≠ 0}
The model is now y = X_S^T β_S + ε, where X_S^T X_S is a full rank matrix.
@freakonometrics 38
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
The Ridge regression problem was to solve
β̂ = argmin_{β∈{‖β‖_2 ≤ s}} { ‖Y − X^T β‖_2^2 }
Define ‖a‖_0 = Σ 1(|a_i| > 0).
Here dim(β) = k but ‖β‖_0 = s.
We wish we could solve
β̂ = argmin_{β∈{‖β‖_0 = s}} { ‖Y − X^T β‖_2^2 }
Problem: it is usually not possible to describe all possible constraints, since
there are (k choose s) subsets of coefficients to consider (with k (very) large).
@freakonometrics 39
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
In a convex problem, solve the dual problem,
e.g. in the Ridge regression : primal problem
min_{β∈{‖β‖_2 ≤ s}} { ‖Y − X^T β‖_2^2 }
and the dual problem
min_{β∈{‖Y − X^T β‖_2 ≤ t}} { ‖β‖_2^2 }
[Figure: primal and dual constraint regions with the objective contours in (β_1, β_2).]
@freakonometrics 40
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
Idea: solve the dual problem
β̂ = argmin_{β∈{‖Y − X^T β‖_2 ≤ h}} { ‖β‖_0 }
where we might convexify the ℓ_0 norm, ‖·‖_0.
@freakonometrics 41
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
On [−1, +1]^k, the convex hull of ‖β‖_0 is ‖β‖_1.
On [−a, +a]^k, the convex hull of ‖β‖_0 is a^{-1}‖β‖_1.
Hence, why not solve
β̂ = argmin_{β; ‖β‖_1 ≤ s̃} { ‖Y − X^T β‖_2 }
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization
problem
β̂ = argmin { ‖Y − X^T β‖_2^2 + λ‖β‖_1 }
@freakonometrics 42
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Least Absolute Shrinkage and Selection Operator
β̂ ∈ argmin { ‖Y − X^T β‖_2^2 + λ‖β‖_1 }
is a convex problem (several algorithms), but not strictly convex (no uniqueness of
the minimum). Nevertheless, predictions ŷ = x^T β̂ are unique.
Algorithms: MM (minimize majorization), coordinate descent; see Hunter & Lange (2003) A
Tutorial on MM Algorithms.
@freakonometrics 43
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
No explicit solution...
If λ → 0, β̂_0^{lasso} = β̂^{ols};
if λ → ∞, β̂_∞^{lasso} = 0.
[Figure: contours of the least-squares objective and the ℓ_1 constraint region in (β_1, β_2).]
@freakonometrics 44
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
For some λ, there are k’s such that β̂_{k,λ}^{lasso} = 0.
Further, λ → β̂_{k,λ}^{lasso} is piecewise linear.
[Figure: contours and ℓ_1 constraint regions in (β_1, β_2), for two values of λ.]
@freakonometrics 45
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
In the orthogonal case, X^T X = I,
β̂_{k,λ}^{lasso} = sign(β̂_k^{ols}) · ( |β̂_k^{ols}| − λ/2 )_+
i.e. the LASSO estimate is related to the soft-threshold function...
@freakonometrics 46
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimal LASSO Penalty
Use cross validation, e.g. K-fold,
β̂_{(−k)}(λ) = argmin { Σ_{i∉I_k} [y_i − x_i^T β]^2 + λ‖β‖_1 }
then compute the sum of the squared errors,
Q_k(λ) = Σ_{i∈I_k} [y_i − x_i^T β̂_{(−k)}(λ)]^2
and finally solve
λ⋆ = argmin { Q(λ) = (1/K) Σ_k Q_k(λ) }
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements
of Statistical Learning suggest the largest λ such that
Q(λ) ≤ Q(λ⋆) + se[λ⋆]   with   se[λ]^2 = (1/K^2) Σ_{k=1}^K [Q_k(λ) − Q(λ)]^2
@freakonometrics 47
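With glmnet, this K-fold strategy (including the one-standard-error rule) is available directly; the data below are an illustrative sparse design, not from the slides:

# K-fold cross validation for the LASSO penalty
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)   # sparse signal
cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cv$lambda.min   # lambda minimizing the cross-validated error
cv$lambda.1se   # largest lambda within one standard error of the minimum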
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO and Ridge, with R
> library(glmnet)
> chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header=TRUE, sep=";")
> standardize <- function(x) (x - mean(x)) / sd(x)
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge   <- glmnet(cbind(z1, z2), z0, alpha = 0,  intercept = FALSE, lambda = 1)
> lasso   <- glmnet(cbind(z1, z2), z0, alpha = 1,  intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net penalty : λ_1‖β‖_1 + λ_2‖β‖_2^2
@freakonometrics 48
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression, Smoothing and Overfit
LASSO can be used to avoid overfit.
@freakonometrics 49
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge vs. LASSO
Consider simulated data (output on the right).
With orthogonal variables, the shrinkage operators are shown below.
[Figure: β̂(ridge) and β̂(lasso) plotted against β̂, and coefficient paths against the L1 norm.]
@freakonometrics 50
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
First idea: given some initial guess β_(0), |β| ∼ |β_(0)| + (1/(2|β_(0)|)) (β^2 − β_(0)^2).
The LASSO estimate can probably be derived from iterated Ridge estimates:
‖y − Xβ_(k+1)‖_2^2 + λ‖β_(k+1)‖_1 ∼ ‖y − Xβ_(k+1)‖_2^2 + (λ/2) Σ_j [β_{j,(k+1)}]^2 / |β_{j,(k)}|
which is a weighted ridge penalty function.
Thus,
β_(k+1) = ( X^T X + λ∆_(k) )^{-1} X^T y
where ∆_(k) = diag[ |β_{j,(k)}|^{-1} ]. Then β_(k) → β̂^{lasso}, as k → ∞.
@freakonometrics 51
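A sketch of this reweighted-ridge recursion in R, following the recursion on the slide (the tolerance eps guarding against division by zero and the fixed iteration count are added implementation details):

# iterated ridge approximation of the LASSO estimate
lasso_irr <- function(X, y, lambda, n_iter = 50, eps = 1e-8) {
  beta <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)        # ridge start
  for (k in 1:n_iter) {
    Delta <- diag(1 / pmax(abs(as.vector(beta)), eps), nrow = ncol(X))  # diag(|beta_j,(k)|^{-1})
    beta  <- solve(t(X) %*% X + lambda * Delta, t(X) %*% y)
  }
  beta
}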
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of LASSO Estimate
From this iterative technique,
β̂_λ^{lasso} ∼ ( X^T X + λ∆ )^{-1} X^T y
where ∆ = diag[ |β̂_{j,λ}^{lasso}|^{-1} ] if β̂_{j,λ}^{lasso} ≠ 0, and 0 otherwise.
Thus,
E[β̂_λ^{lasso}] ∼ ( X^T X + λ∆ )^{-1} X^T X β
and
Var[β̂_λ^{lasso}] ∼ σ^2 ( X^T X + λ∆ )^{-1} X^T X ( X^T X + λ∆ )^{-1}
@freakonometrics 52
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Consider here a simplified problem: min_{a∈R} { (1/2)(a − b)^2 + λ|a| }, with λ > 0;
call the objective g(a). Observe that g′(0^±) = −b ± λ. Then
• if |b| ≤ λ, then a⋆ = 0
• if b ≥ λ, then a⋆ = b − λ
• if b ≤ −λ, then a⋆ = b + λ
a⋆ = argmin_{a∈R} { (1/2)(a − b)^2 + λ|a| } = S_λ(b) = sign(b) · (|b| − λ)_+,
also called the soft-thresholding operator.
@freakonometrics 53
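The soft-thresholding operator is one line of R:

# soft-thresholding S_lambda(b) = sign(b) * (|b| - lambda)_+
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-3, -0.5, 0, 0.5, 3), lambda = 1)   # -2  0  0  0  2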
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Definition: for any convex function h, define the proximal operator of h,
proximal_h(y) = argmin_{x∈R^d} { (1/2)‖x − y‖_2^2 + h(x) }
Note that
proximal_{λ‖·‖_2^2}(y) = (1/(1 + λ)) y   (shrinkage operator)
proximal_{λ‖·‖_1}(y) = S_λ(y) = sign(y) · (|y| − λ)_+
@freakonometrics 54
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
We want to solve here
θ̂ ∈ argmin_{θ∈R^d} { (1/n)‖y − m_θ(x)‖_2^2   (= f(θ))   + λ penalty(θ)   (= g(θ)) }
where f is convex and smooth, and g is convex, but not smooth...
1. Focus on f : descent lemma, ∀θ, θ′,
f(θ) ≤ f(θ′) + ⟨∇f(θ′), θ − θ′⟩ + (t/2)‖θ − θ′‖_2^2
Consider a gradient descent sequence θ_k, i.e. θ_{k+1} = θ_k − t^{-1}∇f(θ_k); then
f(θ) ≤ φ(θ) := f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (t/2)‖θ − θ_k‖_2^2,   with θ_{k+1} = argmin{φ(θ)}
@freakonometrics 55
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
2. Add the function g:
f(θ) + g(θ) ≤ ψ(θ) := f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (t/2)‖θ − θ_k‖_2^2 + g(θ)
And one can prove that
θ_{k+1} = argmin_{θ∈R^d} { ψ(θ) } = proximal_{g/t}( θ_k − t^{-1}∇f(θ_k) )
the so-called proximal gradient descent algorithm, since
argmin {ψ(θ)} = argmin { (t/2)‖θ − (θ_k − t^{-1}∇f(θ_k))‖_2^2 + g(θ) }
@freakonometrics 56
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
Consider some convex differentiable function f : R^k → R.
Consider x⋆ ∈ R^k obtained by minimizing along each coordinate axis, i.e.
f(x_1^⋆, · · · , x_{i−1}^⋆, x_i, x_{i+1}^⋆, · · · , x_k^⋆) ≥ f(x_1^⋆, · · · , x_{i−1}^⋆, x_i^⋆, x_{i+1}^⋆, · · · , x_k^⋆)
for all i. Is x⋆ a global minimizer? i.e.
f(x) ≥ f(x⋆), ∀x ∈ R^k.
Yes, if f is convex and differentiable:
∇f(x)|_{x=x⋆} = ( ∂f(x)/∂x_1, · · · , ∂f(x)/∂x_k ) = 0
There might be problems if f is not differentiable (except in each axis direction).
If f(x) = g(x) + Σ_{i=1}^k h_i(x_i) with g convex and differentiable, yes, since
f(x) − f(x⋆) ≥ ∇g(x⋆)^T(x − x⋆) + Σ_i [h_i(x_i) − h_i(x_i^⋆)]
@freakonometrics 57
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
f(x) − f(x⋆) ≥ Σ_i [ ∇_i g(x⋆)^T(x_i − x_i^⋆) + h_i(x_i) − h_i(x_i^⋆) ] ≥ 0
Thus, for functions f(x) = g(x) + Σ_{i=1}^k h_i(x_i) we can use coordinate descent to
find a minimizer, i.e. at step j
x_1^{(j)} ∈ argmin_{x_1} f(x_1, x_2^{(j−1)}, x_3^{(j−1)}, · · · , x_k^{(j−1)})
x_2^{(j)} ∈ argmin_{x_2} f(x_1^{(j)}, x_2, x_3^{(j−1)}, · · · , x_k^{(j−1)})
x_3^{(j)} ∈ argmin_{x_3} f(x_1^{(j)}, x_2^{(j)}, x_3, · · · , x_k^{(j−1)})
Tseng (2001) Convergence of Block Coordinate Descent Method: if f is continuous,
then x^∞ is a minimizer of f.
@freakonometrics 58
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application in Linear Regression
Let f(x) = (1/2)‖y − Ax‖^2, with y ∈ R^n and A ∈ M_{n×k}. Let A = [A_1, · · · , A_k].
Let us minimize in direction i. Let x_{−i} denote the vector in R^{k−1} without x_i.
Here
0 = ∂f(x)/∂x_i = A_i^T[Ax − y] = A_i^T[A_i x_i + A_{−i}x_{−i} − y]
thus, the optimal value is here
x_i^⋆ = A_i^T[ y − A_{−i}x_{−i} ] / (A_i^T A_i)
@freakonometrics 59
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to LASSO
Let f(x) = (1/2)‖y − Ax‖^2 + λ‖x‖_1, so that the non-differentiable part is
separable, since ‖x‖_1 = Σ_{i=1}^k |x_i|.
Let us minimize in direction i. Let x_{−i} denote the vector in R^{k−1} without x_i.
Here
0 = ∂f(x)/∂x_i = A_i^T[A_i x_i + A_{−i}x_{−i} − y] + λs_i
where s_i ∈ ∂|x_i|. Thus, the solution is obtained by soft-thresholding
x_i^⋆ = S_{λ/‖A_i‖^2}( A_i^T[ y − A_{−i}x_{−i} ] / (A_i^T A_i) )
@freakonometrics 60
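Putting the two previous slides together, here is a minimal R sketch of cyclic coordinate descent for the LASSO (a fixed number of sweeps is used instead of a convergence test, which is an implementation shortcut):

# cyclic coordinate descent for (1/2)||y - A x||^2 + lambda ||x||_1
lasso_cd <- function(A, y, lambda, n_sweeps = 100) {
  k <- ncol(A); x <- rep(0, k)
  soft <- function(b, t) sign(b) * pmax(abs(b) - t, 0)
  for (j in 1:n_sweeps) {
    for (i in 1:k) {
      r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]        # partial residual
      x[i] <- soft(sum(A[, i] * r_i), lambda) / sum(A[, i]^2)
    }
  }
  x
}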
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Convergence rate for LASSO
Let f(x) = g(x) + λ‖x‖_1 with
• g convex, ∇g Lipschitz with constant L > 0, and Id − ∇g/L monotone
increasing in each component
• there exists z such that, componentwise, either z ≥ S_λ(z − ∇g(z)) or
z ≤ S_λ(z − ∇g(z))
Saha & Tewari (2010), On the finite time convergence of cyclic coordinate descent
methods, proved that a coordinate descent starting from z satisfies
f(x^{(j)}) − f(x⋆) ≤ L‖z − x⋆‖^2 / (2j)
@freakonometrics 61
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Graphical Lasso and Covariance Estimation
We want to estimate an (unknown) covariance matrix Σ, or Σ^{-1}.
An estimate for Σ^{-1} is Θ̂, solution of
Θ̂ ∈ argmin_{Θ∈M_{k×k}} { − log[det(Θ)] + trace[SΘ] + λ‖Θ‖_1 }   where S = X^T X / n
and where ‖Θ‖_1 = Σ |Θ_{i,j}|.
See van Wieringen (2016) Undirected network reconstruction from high-dimensional
data and https://github.com/kaizhang/glasso
@freakonometrics 62
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Network Simplification
Can be applied to networks, to spot ‘significant’ connections...
Source: http://khughitt.github.io/graphical-lasso/
@freakonometrics 63
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Extension of Penalization Techniques
In a more general context, we want to solve
θ̂ ∈ argmin_{θ∈R^d} { (1/n) Σ_{i=1}^n ℓ(y_i, m_θ(x_i)) + λ · penalty(θ) }.
@freakonometrics 64
More Related Content

PDF
Slides ineq-3b
PDF
Classification
PDF
Slides ACTINFO 2016
PDF
Slides ensae-2016-11
PDF
Slides edf-2
PDF
Graduate Econometrics Course, part 4, 2017
PDF
Slides ineq-2
PDF
Slides ensae 8
Slides ineq-3b
Classification
Slides ACTINFO 2016
Slides ensae-2016-11
Slides edf-2
Graduate Econometrics Course, part 4, 2017
Slides ineq-2
Slides ensae 8

What's hot (20)

PDF
Slides ensae 11bis
PDF
Inequalities #3
PDF
Inequality, slides #2
PDF
Slides ensae 9
PDF
Slides erasmus
PDF
Multiattribute utility copula
PDF
Proba stats-r1-2017
PDF
Slides ineq-4
PDF
Quantile and Expectile Regression
PDF
Inequality #4
PDF
Slides erm-cea-ia
PDF
Slides barcelona Machine Learning
PDF
Slides amsterdam-2013
PDF
Slides econometrics-2018-graduate-2
PDF
Slides toulouse
PDF
Inequalities #2
PDF
Slides risk-rennes
PDF
Slides simplexe
PDF
Slides Bank England
PDF
Slides astin
Slides ensae 11bis
Inequalities #3
Inequality, slides #2
Slides ensae 9
Slides erasmus
Multiattribute utility copula
Proba stats-r1-2017
Slides ineq-4
Quantile and Expectile Regression
Inequality #4
Slides erm-cea-ia
Slides barcelona Machine Learning
Slides amsterdam-2013
Slides econometrics-2018-graduate-2
Slides toulouse
Inequalities #2
Slides risk-rennes
Slides simplexe
Slides Bank England
Slides astin
Ad

Viewers also liked (20)

PDF
Slides econometrics-2017-graduate-2
PDF
Econometrics, PhD Course, #1 Nonlinearities
PDF
Soutenance julie viard_partie_1
PDF
Lg ph d_slides_vfinal
PDF
Slides act6420-e2014-ts-1
PDF
Slides inequality 2017
PDF
So a webinar-2013-2
PDF
Slides arbres-ubuntu
PDF
Julien slides - séminaire quantact
PDF
Slides 2040-6
PDF
Slides 2040-4
PDF
Slides guanauato
PDF
Slides ensae-2016-4
PDF
Slides ensae-2016-8
PDF
Slides ensae-2016-9
PDF
Slides ensae-2016-5
PDF
Slides ensae-2016-7
PDF
Slides ensae-2016-6
PDF
Slides ensae-2016-10
Slides econometrics-2017-graduate-2
Econometrics, PhD Course, #1 Nonlinearities
Soutenance julie viard_partie_1
Lg ph d_slides_vfinal
Slides act6420-e2014-ts-1
Slides inequality 2017
So a webinar-2013-2
Slides arbres-ubuntu
Julien slides - séminaire quantact
Slides 2040-6
Slides 2040-4
Slides guanauato
Slides ensae-2016-4
Slides ensae-2016-8
Slides ensae-2016-9
Slides ensae-2016-5
Slides ensae-2016-7
Slides ensae-2016-6
Slides ensae-2016-10
Ad

Similar to Econometrics 2017-graduate-3 (20)

PDF
Slides econometrics-2018-graduate-3
PDF
Varese italie #2
PDF
Varese italie seminar
PDF
Lausanne 2019 #1
PDF
Estimation rs
PDF
Varese italie #2
PDF
ch02ans.pdf The Simple Linear Regression Model: Specification and Estimation
PDF
199ae1e6bc77d0ed5efc0cd2d83cc532_econometrics.pdf
PDF
Econometric Analysis 8th Edition Greene Solutions Manual
PPTX
Topic 1.4
PDF
Side 2019 #3
PPTX
Introduction to Econometrics
PDF
Side 2019 #9
PDF
econometrics
PDF
Slides econometrics-2018-graduate-1
PDF
PanelDadasdsadadsadasdasdasdataNotes-1b.pdf
PDF
A basic introduction to learning
PPTX
Static Models of Continuous Variables
PPTX
Linear regression, costs & gradient descent
PDF
Detection & Estimation Theory
Slides econometrics-2018-graduate-3
Varese italie #2
Varese italie seminar
Lausanne 2019 #1
Estimation rs
Varese italie #2
ch02ans.pdf The Simple Linear Regression Model: Specification and Estimation
199ae1e6bc77d0ed5efc0cd2d83cc532_econometrics.pdf
Econometric Analysis 8th Edition Greene Solutions Manual
Topic 1.4
Side 2019 #3
Introduction to Econometrics
Side 2019 #9
econometrics
Slides econometrics-2018-graduate-1
PanelDadasdsadadsadasdasdasdataNotes-1b.pdf
A basic introduction to learning
Static Models of Continuous Variables
Linear regression, costs & gradient descent
Detection & Estimation Theory

More from Arthur Charpentier (20)

PDF
Family History and Life Insurance
PDF
ACT6100 introduction
PDF
Family History and Life Insurance (UConn actuarial seminar)
PDF
Control epidemics
PDF
STT5100 Automne 2020, introduction
PDF
Family History and Life Insurance
PDF
Machine Learning in Actuarial Science & Insurance
PDF
Reinforcement Learning in Economics and Finance
PDF
Optimal Control and COVID-19
PDF
Slides OICA 2020
PDF
Lausanne 2019 #3
PDF
Lausanne 2019 #4
PDF
Lausanne 2019 #2
PDF
Side 2019 #10
PDF
Side 2019 #11
PDF
Side 2019 #12
PDF
Side 2019 #8
PDF
Side 2019 #7
PDF
Side 2019 #6
PDF
Side 2019 #5
Family History and Life Insurance
ACT6100 introduction
Family History and Life Insurance (UConn actuarial seminar)
Control epidemics
STT5100 Automne 2020, introduction
Family History and Life Insurance
Machine Learning in Actuarial Science & Insurance
Reinforcement Learning in Economics and Finance
Optimal Control and COVID-19
Slides OICA 2020
Lausanne 2019 #3
Lausanne 2019 #4
Lausanne 2019 #2
Side 2019 #10
Side 2019 #11
Side 2019 #12
Side 2019 #8
Side 2019 #7
Side 2019 #6
Side 2019 #5

Recently uploaded (20)

PPTX
Cell Types and Its function , kingdom of life
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Classroom Observation Tools for Teachers
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PPTX
Cell Structure & Organelles in detailed.
PDF
Business Ethics Teaching Materials for college
PDF
O7-L3 Supply Chain Operations - ICLT Program
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Complications of Minimal Access Surgery at WLH
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PDF
TR - Agricultural Crops Production NC III.pdf
PDF
Basic Mud Logging Guide for educational purpose
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
Pre independence Education in Inndia.pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
Cell Types and Its function , kingdom of life
STATICS OF THE RIGID BODIES Hibbelers.pdf
VCE English Exam - Section C Student Revision Booklet
Classroom Observation Tools for Teachers
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Cell Structure & Organelles in detailed.
Business Ethics Teaching Materials for college
O7-L3 Supply Chain Operations - ICLT Program
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
Supply Chain Operations Speaking Notes -ICLT Program
Complications of Minimal Access Surgery at WLH
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
TR - Agricultural Crops Production NC III.pdf
Basic Mud Logging Guide for educational purpose
102 student loan defaulters named and shamed – Is someone you know on the list?
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
Pre independence Education in Inndia.pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra

Econometrics 2017-graduate-3

  • 1. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Advanced Econometrics #3: Model & Variable Selection* A. Charpentier (Université de Rennes 1) Université de Rennes 1, Graduate Course, 2017. @freakonometrics 1
  • 2. Arthur CHARPENTIER, Advanced Econometrics Graduate Course “Great plot. Now need to find the theory that explains it” Deville (2017) http://twitter.com @freakonometrics 2
  • 3. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preliminary Results: Numerical Optimization Problem : x ∈ argmin{f(x); x ∈ Rd } Gradient descent : xk+1 = xk − η f(xk) starting from some x0 Problem : x ∈ argmin{f(x); x ∈ X ⊂ Rd } Projected descent : xk+1 = ΠX xk − η f(xk) starting from some x0 A constrained problem is said to be convex if    min{f(x)} with f convex s.t. gi(x) = 0, ∀i = 1, · · · , n with gi linear hi(x) ≤ 0, ∀i = 1, · · · , m with hi convex Lagrangian : L(x, λ, µ) = f(x) + n i=1 λigi(x) + m i=1 µihi(x) where x are primal variables and (λ, µ) are dual variables. Remark L is an affine function in (λ, µ) @freakonometrics 3
  • 4. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preliminary Results: Numerical Optimization Karush–Kuhn–Tucker conditions : a convex problem has a solution x if and only if there are (λ , µ ) such that the following condition hold • stationarity : xL(x, λ, µ) = 0 at (x , λ , µ ) • primal admissibility : gi(x ) = 0 and hi(x ) ≤ 0, ∀i • dual admissibility : µ ≥ 0 Let L denote the associated dual function L(λ, µ) = min x {L(x, λ, µ)} L is a convex function in (λ, µ) and the dual problem is max λ,µ {L(λ, µ)}. @freakonometrics 4
  • 5. Arthur CHARPENTIER, Advanced Econometrics Graduate Course References Motivation Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip: Identifying Central Individuals in a Social Networks. References Belloni, A. & Chernozhukov, V. 2009. Least squares after model selection in high-dimensional sparse models. Hastie, T., Tibshirani, R. & Wainwright, M. 2015 Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press. @freakonometrics 5
  • 6. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preambule Assume that y = m(x) + ε, where ε is some idosyncatic impredictible noise. The error E[(y − m(x))2 ] is the sume of three terms • variance of the estimator : E[(y − m(x))2 ] • bias2 of the estimator : [m(x − m(x)]2 • variance of the noise : E[(y − m(x))2 ] (the latter exists, even with a ‘perfect’ model). @freakonometrics 6
  • 7. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preambule Consider a parametric model, with true (unkown) parameter θ, then mse(ˆθ) = E (ˆθ − θ)2 = E (ˆθ − E ˆθ )2 variance + E (E ˆθ − θ)2 bias2 Let θ denote an unbiased estimator of θ. Then ˆθ = θ2 θ2 + mse(θ) · θ = θ − mse(θ) θ2 + mse(θ) · θ penalty satisfies mse(ˆθ) ≤ mse(θ). @freakonometrics 7 −2 −1 0 1 2 3 4 0.00.20.40.60.8 variance
  • 8. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Occam’s Razor The “law of parsimony”, “lex parsimoniæ” Penalize too complex models @freakonometrics 8
  • 9. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Let X ∼ N(µ, σ2 I). We want to estimate µ. µmle = Xn ∼ N µ, σ2 n I . From James & Stein (1961) Estimation with quadratic loss µJS = 1 − (d − 2)σ2 n y 2 y where · is the Euclidean norm. One can prove that if d ≥ 3, E µJS − µ 2 < E µmle − µ 2 Samworth (2015) Stein’s paradox, “one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne”. @freakonometrics 9
  • 10. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Heuristics : consider a biased estimator, to decrease the variance. See Efron (2010) Large-Scale Inference @freakonometrics 10
  • 11. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Motivation: Avoiding Overfit Generalization : the model should perform well on new data (and not only on the training ones). q q q q q q q q q q q q q 2 4 6 8 10 12 0.00.20.40.6 q q q q q q q q q q q q q @freakonometrics 11 q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 −1.5−1.0−0.50.00.51.01.5 q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 −1.5−1.0−0.50.00.51.01.5 q q q q q q q q q qq
  • 12. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Reducing Dimension with PCA Use principal components to reduce dimension (on centered and scaled variables): we want d vectors z1, · · · , zd such that First Compoment is z1 = Xω1 where ω1 = argmax ω =1 X · ω 2 = argmax ω =1 ωT XT Xω Second Compoment is z2 = Xω2 where ω2 = argmax ω =1 X (1) · ω 2 0 20 40 60 80 −8−6−4−2 Age LogMortalityRate −10 −5 0 5 10 15 −101234 PC score 1 PCscore2 q q qq q q qqq q qqq qq q qqq q q q qq q q q q q q qqq q q q q q q qq q qq q q qq q q q q q q qq q qqq q q q q q q q q q q q q q q qq q q qqq q qqq qq q qqq q q q q q q q q q q q q q q q q q q q qq q q q q qq q q qq q q qqqq q q q q q q q q q qqq q qqq qq q qq q q q q qq q q q q q q q q q q q q q 1914 1915 1916 1917 1918 1919 1940 1942 1943 1944 0 20 40 60 80 −10−8−6−4−2 Age LogMortalityRate −10 −5 0 5 10 15 −10123 PC score 1 PCscore2 qq q q q q q qq qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q q q q q q q q q q q q q q q q q qq q q q q q q qq q q q q q q q q q qq q q q q q q q q q q qq qq q q q q q qq qqqq q qqqq q q q q q q with X (1) = X − Xω1 z1 ωT 1 . @freakonometrics 12
  • 13. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Reducing Dimension with PCA A regression on (the d) principal components, y = zT β + η could be an interesting idea, unfortunatley, principal components have no reason to be correlated with y. First compoment was z1 = Xω1 where ω1 = argmax ω =1 X · ω 2 = argmax ω =1 ωT XT Xω It is a non-supervised technique. Instead, use partial least squares, introduced in Wold (1966) Estimation of Principal Components and Related Models by Iterative Least squares. First compoment is z1 = Xω1 where ω1 = argmax ω =1 { y, X · ω } = argmax ω =1 ωT XT yyT Xω @freakonometrics 13
  • 14. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Terminology Consider a dataset {yi, xi}, assumed to be generated from Y, X, from an unknown distribution P. Let m0(·) be the “true” model. Assume that yi = m0(xi) + εi. In a regression context (quadratic loss function function), the risk associated to m is R(m) = EP Y − m(X) 2 An optimal model m within a class M satisfies R(m ) = inf m∈M R(m) Such a model m is usually called oracle. Observe that m (x) = E[Y |X = x] is the solution of R(m ) = inf m∈M R(m) where M is the set of measurable functions @freakonometrics 14
  • 15. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The empirical risk is Rn(m) = 1 n n i=1 yi − m(xi) 2 For instance, m can be a linear predictor, m(x) = β0 + xT β, where θ = (β0, β) should estimated (trained). E Rn(m) = E (m(X) − Y )2 can be expressed as E (m(X) − E[m(X)|X])2 variance of m + E E[m(X)|X] − E[Y |X] m0(X) 2 bias of m + E Y − E[Y |X] m0(X) )2 variance of the noise The third term is the risk of the “optimal” estimator m, that cannot be decreased. @freakonometrics 15
  • 16. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity Consider a linear predictor (see #1), i.e. y = m(x) = Ay. Assume that y = m0(x) + ε, with E[ε] = 0 and Var[ε] = σ2 I. Let · denote the Euclidean norm Empirical risk : Rn(m) = 1 n y − m(x) 2 Vapnik’s risk : E[Rn(m)] = 1 n m0(x − m(x) 2 + 1 n E y − m0(x 2 with m0(x = E[Y |X = x]. Observe that nE Rn(m) = E y − m(x) 2 = (I − A)m0 2 + σ2 I − A 2 while = E m0(x) − m(x) 2 = 2 (I − A)m0 bias + σ2 A 2 variance @freakonometrics 16
  • 17. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity One can obtain E Rn(m) = E Rn(m) + 2 σ2 n trace(A). If trace(A) ≥ 0 the empirical risk underestimate the true risk of the estimator. The number of degrees of freedom of the (linear) predictor is related to trace(A) 2 σ2 n trace(A) is called Mallow’s penalty CL. If A is a projection matrix, trace(A) is the dimension of the projection space, p, then we obtain Mallow’s CP , 2 σ2 n p. Remark : Mallows (1973) Some Comments on Cp introduced this penalty while focusing on the R2 . @freakonometrics 17
  • 18. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood CP is associated to a quadratic risk an alternative is to use a distance on the (conditional) distribution of Y , namely Kullback-Leibler distance discrete case: DKL(P Q) = i P(i) log P(i) Q(i) continuous case : DKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dxDKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dx Let f denote the true (unknown) density, and fθ some parametric distribution, DKL(f fθ) = ∞ −∞ f(x) log f(x) fθ(x) dx= f(x) log[f(x)] dx− f(x) log[fθ(x)] dx relative information Hence minimize {DKL(f fθ)} ←→ maximize E log[fθ(X)] @freakonometrics 18
  • 19. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood Akaike (1974) A new look at the statistical model identification observe that for n large enough E log[fθ(X)] ∼ log[L(θ)] − dim(θ) Thus AIC = −2 log L(θ) + 2dim(θ) Example : in a (Gaussian) linear model, yi = β0 + xT i β + εi AIC = n log 1 n n i=1 εi + 2[dim(β) + 2] @freakonometrics 19
  • 20. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood Remark : this is valid for large sample (rule of thumb n/dim(θ) > 40), otherwise use a corrected AIC AICc = AIC + 2k(k + 1) n − k − 1 bias correction where k = dim(θ) see Sugiura (1978) Further analysis of the data by Akaike’s information criterion and the finite corrections second order AIC. Using a Bayesian interpretation, Schwarz (1978) Estimating the dimension of a model obtained BIC = −2 log L(θ) + log(n)dim(θ). Observe that the criteria considered is criteria = −function L(θ) + penality complexity @freakonometrics 20
  • 21. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Estimation of the Risk Consider a naive bootstrap procedure, based on a bootstrap sample Sb = {(y (b) i , x (b) i )}. The plug-in estimator of the empirical risk is Rn(m(b) ) = 1 n n i=1 yi − m(b) (xi) 2 and then Rn = 1 B B b=1 Rn(m(b) ) = 1 B B b=1 1 n n i=1 yi − m(b) (xi) 2 @freakonometrics 21
  • 22. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Estimation of the Risk One might improve this estimate using a out-of-bag procedure Rn = 1 n n i=1 1 #Bi b∈Bi yi − m(b) (xi) 2 where Bi is the set of all boostrap sample that contain (yi, xi). Remark: P ((yi, xi) /∈ Sb) = 1 − 1 n n ∼ e−1 = 36, 78%. @freakonometrics 22
  • 23. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Linear Regression Shortcoming Least Squares Estimator β = (XT X)−1 XT y Unbiased Estimator E[β] = β Variance Var[β] = σ2 (XT X)−1 which can be (extremely) large when det[(XT X)] ∼ 0. X =        1 −1 2 1 0 1 1 2 −1 1 1 0        then XT X =     4 2 2 2 6 −4 2 −4 6     while XT X+I =     5 2 2 2 7 −4 2 −4 7     eigenvalues : {10, 6, 0} {11, 7, 1} Ad-hoc strategy: use XT X + λI @freakonometrics 23
  • 24. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Linear Regression Shortcoming Evolution of (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 when cor(X1, X2) = r ∈ [0, 1], on top. Below, Ridge regression (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 +λ(β2 1 + β2 2) where λ ∈ [0, ∞), below, when cor(X1, X2) ∼ 1 (colinearity). @freakonometrics 24 −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 500 1000 1500 2000 2000 2500 2500 2500 2500 3000 3000 3000 3000 3500 q −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 1000 1000 2000 2000 3000 3000 4000 4000 5000 5000 6000 6000 7000
  • 25. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Normalization : Euclidean 2 vs. Mahalonobis We want to penalize complicated models : if βk is “too small”, we prefer to have βk = 0. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 Instead of d(x, y) = (x − y)T (x − y) use dΣ(x, y) = (x − y)TΣ−1 (x − y) beta1 beta2 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 25
  • 26. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression ... like the least square, but it shrinks estimated coefficients towards 0. β ridge λ = argmin    n i=1 (yi − xT i β)2 + λ p j=1 β2 j    β ridge λ = argmin    y − Xβ 2 2 =criteria + λ β 2 2 =penalty    λ ≥ 0 is a tuning parameter. The constant is usually unpenalized. The true equation is β ridge λ = argmin    y − (β0 + Xβ) 2 2 =criteria + λ β 2 2 =penalty    @freakonometrics 26
  • 27. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression β ridge λ = argmin y − (β0 + Xβ) 2 2 + λ β 2 2 can be seen as a constrained optimization problem β ridge λ = argmin β 2 2 ≤hλ y − (β0 + Xβ) 2 2 Explicit solution βλ = (XT X + λI)−1 XT y If λ → 0, β ridge 0 = β ols If λ → ∞, β ridge ∞ = 0. beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 27
  • 28. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression This penalty can be seen as rather unfair if compo- nents of x are not expressed on the same scale • center: xj = 0, then β0 = y • scale: xT j xj = 1 Then compute β ridge λ = argmin    y − Xβ 2 2 =loss + λ β 2 2 =penalty    beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 40 40 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 28
  • 29. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression Observe that if xj1 ⊥ xj2 , then β ridge λ = [1 + λ]−1 β ols λ which explain relationship with shrinkage. But generally, it is not the case... q q Theorem There exists λ such that mse[β ridge λ ] ≤ mse[β ols λ ] Ridge Regression @freakonometrics 29
  • 30. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Lλ(β) = n i=1 (yi − β0 − xT i β)2 + λ p j=1 β2 j ∂Lλ(β) ∂β = −2XT y + 2(XT X + λI)β ∂2 Lλ(β) ∂β∂βT = 2(XT X + λI) where XT X is a semi-positive definite matrix, and λI is a positive definite matrix, and βλ = (XT X + λI)−1 XT y @freakonometrics 30
  • 31. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The Bayesian Interpretation From a Bayesian perspective, P[θ|y] posterior ∝ P[y|θ] likelihood · P[θ] prior i.e. log P[θ|y] = log P[y|θ] log likelihood + log P[θ] penalty If β has a prior N(0, τ2 I) distribution, then its posterior distribution has mean E[β|y, X] = XT X + σ2 τ2 I −1 XT y. @freakonometrics 31
  • 32. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator βλ = (XT X + λI)−1 XT y E[βλ] = XT X(λI + XT X)−1 β. i.e. E[βλ] = β. Observe that E[βλ] → 0 as λ → ∞. Assume that X is an orthogonal design matrix, i.e. XT X = I, then βλ = (1 + λ)−1 β ols . @freakonometrics 32
  • 33. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator Set W λ = (I + λ[XT X]−1 )−1 . One can prove that W λβ ols = βλ. Thus, Var[βλ] = W λVar[β ols ]W T λ and Var[βλ] = σ2 (XT X + λI)−1 XT X[(XT X + λI)−1 ]T . Observe that Var[β ols ] − Var[βλ] = σ2 W λ[2λ(XT X)−2 + λ2 (XT X)−3 ]W T λ ≥ 0. @freakonometrics 33
  • 34. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator Hence, the confidence ellipsoid of ridge estimator is indeed smaller than the OLS, If X is an orthogonal design matrix, Var[βλ] = σ2 (1 + λ)−2 I. mse[βλ] = σ2 trace(W λ(XT X)−1 W T λ) + βT (W λ − I)T (W λ − I)β. If X is an orthogonal design matrix, mse[βλ] = pσ2 (1 + λ)2 + λ2 (1 + λ)2 βT β Properties of the Ridge Estimator @freakonometrics 34 0.0 0.2 0.4 0.6 0.8 −1.0−0.8−0.6−0.4−0.2 β1 β2 1 2 3 4 5 6 7
  • 35. Arthur CHARPENTIER, Advanced Econometrics Graduate Course mse[βλ] = pσ2 (1 + λ)2 + λ2 (1 + λ)2 βT β is minimal for λ = pσ2 βT β Note that there exists λ > 0 such that mse[βλ] < mse[β0] = mse[β ols ]. @freakonometrics 35
  • 36. Arthur CHARPENTIER, Advanced Econometrics Graduate Course SVD decomposition Consider the singular value decomposition X = UDV T . Then β ols = V D−2 D UT y βλ = V (D2 + λI)−1 D UT y Observe that D−1 i,i ≥ Di,i D2 i,i + λ hence, the ridge penality shrinks singular values. Set now R = UD (n × n matrix), so that X = RV T , βλ = V (RT R + λI)−1 RT y @freakonometrics 36
  • 37. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Hat matrix and Degrees of Freedom Recall that Y = HY with H = X(XT X)−1 XT Similarly Hλ = X(XT X + λI)−1 XT trace[Hλ] = p j=1 d2 j,j d2 j,j + λ → 0, as λ → ∞. @freakonometrics 37
  • 38. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Sparsity Issues In severall applications, k can be (very) large, but a lot of features are just noise: βj = 0 for many j’s. Let s denote the number of relevent features, with s << k, cf Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity, s = card{S} where S = {j; βj = 0} The model is now y = XT SβS + ε, where XT SXS is a full rank matrix. @freakonometrics 38 q q = . +
  • 39. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues The Ridge regression problem was to solve β = argmin β∈{ β 2 ≤s} { Y − XT β 2 2 } Define a 0 = 1(|ai| > 0). Here dim(β) = k but β 0 = s. We wish we could solve β = argmin β∈{ β 0 =s} { Y − XT β 2 2 } Problem: it is usually not possible to describe all possible constraints, since s k coefficients should be chosen here (with k (very) large). @freakonometrics 39 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0
  • 40. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues In a convex problem, solve the dual problem, e.g. in the Ridge regression : primal problem min β∈{ β 2 ≤s} { Y − XT β 2 2 } and the dual problem min β∈{ Y −XTβ 2 ≤t} { β 2 2 } beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 26 27 30 32 35 40 40 50 60 70 80 90 100 110 120 120 130 130 140 140 X q −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 26 27 30 32 35 40 40 50 60 70 80 90 100 110 120 120 130 130 140 140 X q −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 40
  • 41. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues Idea: solve the dual problem β = argmin β∈{ Y −XTβ 2 ≤h} { β 0 } where we might convexify the 0 norm, · 0 . @freakonometrics 41
  • 42. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues On [−1, +1]k , the convex hull of β 0 is β 1 On [−a, +a]k , the convex hull of β 0 is a−1 β 1 Hence, why not solve β = argmin β; β 1 ≤˜s { Y − XT β 2 } which is equivalent (Kuhn-Tucker theorem) to the Lagragian optimization problem β = argmin{ Y − XT β 2 2 +λ β 1 } @freakonometrics 42
  • 43. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Least Absolute Shrinkage and Selection Operator β ∈ argmin{ Y − XT β 2 2 +λ β 1 } is a convex problem (several algorithms ), but not strictly convex (no unicity of the minimum). Nevertheless, predictions y = xT β are unique. MM, minimize majorization, coordinate descent Hunter & Lange (2003) A Tutorial on MM Algorithms. @freakonometrics 43
  • 44. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Regression No explicit solution... If λ → 0, β lasso 0 = β ols If λ → ∞, β lasso ∞ = 0. beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 44
  • 45. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Regression For some λ, there are k’s such that β lasso k,λ = 0. Further, λ → β lasso k,λ is piecewise linear beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 40 40 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 45
  • 46. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
In the orthogonal case, $X^TX = \mathbb{I}$,
$\widehat{\beta}_{k,\lambda}^{\text{lasso}} = \text{sign}(\widehat{\beta}_k^{\text{ols}})\left(|\widehat{\beta}_k^{\text{ols}}| - \frac{\lambda}{2}\right)_+$
i.e. the LASSO estimate is related to the soft-threshold function...
@freakonometrics 46
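A minimal R sketch of this relation, on a simulated orthonormal design (the data, the value of λ and the variable names are only for illustration; the λ/2 threshold follows the parametrization of the objective used on the slides):
set.seed(1)
n <- 100; k <- 4
X <- qr.Q(qr(matrix(rnorm(n * k), n, k)))     # orthonormal columns, so X'X = I
beta <- c(2, -1, .2, 0)
y <- X %*% beta + rnorm(n, sd = .5)
b_ols <- drop(crossprod(X, y))                # OLS estimate, since X'X = I
lambda <- 1
b_lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)   # soft-thresholding
cbind(ols = b_ols, lasso = b_lasso)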
  • 47. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimal LASSO Penalty
Use cross validation, e.g. K-fold: fit on all folds but the k-th,
$\widehat{\beta}_{(-k)}(\lambda) = \text{argmin}\Big\{\sum_{i \notin I_k} [y_i - x_i^T\beta]^2 + \lambda\|\beta\|_1\Big\}$
then compute the sum of the squared errors on the held-out fold,
$Q_k(\lambda) = \sum_{i \in I_k} [y_i - x_i^T\widehat{\beta}_{(-k)}(\lambda)]^2$
and finally solve
$\lambda^\star = \text{argmin}\Big\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\Big\}$
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements of Statistical Learning suggest the largest λ such that
$\overline{Q}(\lambda) \leq \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star]$ with $\text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K [Q_k(\lambda) - \overline{Q}(\lambda)]^2$
@freakonometrics 47
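In R, cv.glmnet automates this K-fold procedure: lambda.min is the minimizer of the cross-validated error and lambda.1se implements the one-standard-error rule above. A sketch on simulated data (the design and the sparse signal are only for illustration; note that glmnet scales the squared loss by 1/(2n), so its λ is not on exactly the same scale as on the slide):
library(glmnet)
set.seed(1)
n <- 200; k <- 10
X <- matrix(rnorm(n * k), n, k)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)   # alpha = 1 : LASSO penalty
cvfit$lambda.min    # lambda minimizing the cross-validated error
cvfit$lambda.1se    # largest lambda within one standard error of the minimum
plot(cvfit)         # cross-validated error with +/- one standard error bands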
  • 48. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO and Ridge, with R
> library(glmnet)
> chicago = read.table("http://freakonometrics.free.fr/chicago.txt", header=TRUE, sep=";")
> standardize <- function(x) {(x - mean(x)) / sd(x)}
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge <- glmnet(cbind(z1, z2), z0, alpha = 0, intercept = FALSE, lambda = 1)
> lasso <- glmnet(cbind(z1, z2), z0, alpha = 1, intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net: $\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$
@freakonometrics 48
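As a short follow-up to the listing above, the fitted coefficients can be read off with coef(); with this (arbitrary) λ = 1, the LASSO and elastic net may set some coefficients exactly to zero while the Ridge only shrinks them:
> coef(ridge)      # shrunk, but non-zero, coefficients
> coef(lasso)      # some coefficients set exactly to zero
> coef(elastic)    # a compromise between the two penalties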
  • 49. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression, Smoothing and Overfit
LASSO can be used to avoid overfitting.
[Figure: simulated observations (x, y) with x ∈ [0, 1], used to illustrate smoothing.]
@freakonometrics 49
  • 50. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge vs. LASSO
Consider simulated data (output on the right). With orthogonal variables, the shrinkage operators are as shown below (a small R sketch follows this slide).
[Figure: left, the shrinkage operators β vs. β̂(ridge) and β vs. β̂(lasso); right, coefficient paths against the L1 norm.]
@freakonometrics 50
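The two shrinkage operators in the orthonormal case can be drawn directly; a minimal sketch (the value of λ is arbitrary, and the 1/(1 + λ) Ridge shrinkage matches the proximal-operator convention used a few slides below):
lambda <- 1
b <- seq(0, 5, by = .01)
plot(b, b / (1 + lambda), type = "l", col = "blue",
     xlab = expression(beta), ylab = "shrunk coefficient")   # Ridge: proportional shrinkage
lines(b, sign(b) * pmax(abs(b) - lambda, 0), col = "red")    # LASSO: soft-thresholding
abline(0, 1, lty = 2)                                        # no shrinkage (OLS)
legend("topleft", c("Ridge", "LASSO"), col = c("blue", "red"), lty = 1)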
  • 51. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
First idea: given some initial guess $\beta_{(0)}$,
$|\beta| \sim |\beta_{(0)}| + \frac{1}{2|\beta_{(0)}|}(\beta^2 - \beta_{(0)}^2)$
so the LASSO estimate can be derived from iterated Ridge estimates:
$\|y - X\beta_{(k+1)}\|_2^2 + \lambda\|\beta_{(k+1)}\|_1 \sim \|y - X\beta_{(k+1)}\|_2^2 + \frac{\lambda}{2}\sum_j \frac{[\beta_{j,(k+1)}]^2}{|\beta_{j,(k)}|}$
which is a weighted ridge penalty function. Thus
$\beta_{(k+1)} = \big(X^TX + \lambda\Delta_{(k)}\big)^{-1}X^Ty$, where $\Delta_{(k)} = \text{diag}[|\beta_{j,(k)}|^{-1}]$.
Then $\beta_{(k)} \to \widehat{\beta}^{\text{lasso}}$ as $k \to \infty$.
@freakonometrics 51
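A sketch of this iterated (weighted) Ridge heuristic, following the update above (the eps guard against division by zero and the fixed number of iterations are implementation choices, not part of the slide):
iterated_ridge <- function(X, y, lambda, n_iter = 100, eps = 1e-8) {
  k <- ncol(X)
  beta <- solve(crossprod(X) + lambda * diag(k), crossprod(X, y))   # Ridge as starting point
  for (it in 1:n_iter) {
    Delta <- diag(1 / pmax(abs(drop(beta)), eps), k)                # Delta_(k) = diag(1/|beta_j,(k)|)
    beta <- solve(crossprod(X) + lambda * Delta, crossprod(X, y))
  }
  drop(beta)
}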
  • 52. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the LASSO Estimate
From this iterative technique,
$\widehat{\beta}_\lambda^{\text{lasso}} \sim \big(X^TX + \lambda\Delta\big)^{-1}X^Ty$
where $\Delta = \text{diag}[|\widehat{\beta}_{j,\lambda}^{\text{lasso}}|^{-1}]$ if $\widehat{\beta}_{j,\lambda}^{\text{lasso}} \neq 0$, and 0 otherwise. Thus
$\mathbb{E}[\widehat{\beta}_\lambda^{\text{lasso}}] \sim \big(X^TX + \lambda\Delta\big)^{-1}X^TX\beta$
and
$\text{Var}[\widehat{\beta}_\lambda^{\text{lasso}}] \sim \sigma^2\big(X^TX + \lambda\Delta\big)^{-1}X^TX\big(X^TX + \lambda\Delta\big)^{-1}$
@freakonometrics 52
  • 53. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Consider here a simplified problem,
$\min_{a\in\mathbb{R}}\Big\{\underbrace{\frac{1}{2}(a - b)^2 + \lambda|a|}_{g(a)}\Big\}$ with λ > 0.
Observe that $g'(0^\pm) = -b \pm \lambda$. Then
• if |b| ≤ λ, then $a^\star = 0$
• if b ≥ λ, then $a^\star = b - \lambda$
• if b ≤ −λ, then $a^\star = b + \lambda$
i.e.
$a^\star = \underset{a\in\mathbb{R}}{\text{argmin}}\Big\{\frac{1}{2}(a - b)^2 + \lambda|a|\Big\} = S_\lambda(b) = \text{sign}(b)\cdot(|b| - \lambda)_+$
also called the soft-thresholding operator.
@freakonometrics 53
  • 54. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Definition: for any convex function h, define the proximal operator of h,
$\text{proximal}_h(y) = \underset{x\in\mathbb{R}^d}{\text{argmin}}\Big\{\frac{1}{2}\|x - y\|_2^2 + h(x)\Big\}$
Note that
$\text{proximal}_{\lambda\|\cdot\|_2^2}(y) = \frac{1}{1+\lambda}\,y$, the shrinkage operator, and
$\text{proximal}_{\lambda\|\cdot\|_1}(y) = S_\lambda(y) = \text{sign}(y)\cdot(|y| - \lambda)_+$
@freakonometrics 54
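The two proximal operators of this slide, written directly as R functions (a direct transcription, keeping the same 1/(1 + λ) convention):
prox_l2 <- function(y, lambda) y / (1 + lambda)                    # shrinkage operator
prox_l1 <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)  # soft-thresholding S_lambda
prox_l1(c(-3, -.5, .2, 2), lambda = 1)                             # returns -2, 0, 0, 1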
  • 55. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
We want to solve here
$\widehat{\theta} \in \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\Big\{\underbrace{\frac{1}{n}\|y - m_\theta(x)\|_2^2}_{f(\theta)} + \underbrace{\lambda\,\text{penalty}(\theta)}_{g(\theta)}\Big\}$
where f is convex and smooth, and g is convex, but not smooth...
1. Focus on f: descent lemma, ∀θ, θ′,
$f(\theta) \leq f(\theta') + \langle\nabla f(\theta'), \theta - \theta'\rangle + \frac{t}{2}\|\theta - \theta'\|_2^2$
Consider a gradient descent sequence $\theta_k$, i.e. $\theta_{k+1} = \theta_k - t^{-1}\nabla f(\theta_k)$; then
$f(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k), \theta - \theta_k\rangle + \frac{t}{2}\|\theta - \theta_k\|_2^2}_{\varphi(\theta)}$ with $\theta_{k+1} = \text{argmin}\{\varphi(\theta)\}$
@freakonometrics 55
  • 56. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
2. Add the function g:
$f(\theta) + g(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k), \theta - \theta_k\rangle + \frac{t}{2}\|\theta - \theta_k\|_2^2 + g(\theta)}_{\psi(\theta)}$
And one can prove that
$\theta_{k+1} = \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\{\psi(\theta)\} = \text{proximal}_{g/t}\big(\theta_k - t^{-1}\nabla f(\theta_k)\big)$
the so-called proximal gradient descent algorithm, since
$\text{argmin}\{\psi(\theta)\} = \text{argmin}\Big\{\frac{t}{2}\big\|\theta - \big(\theta_k - t^{-1}\nabla f(\theta_k)\big)\big\|_2^2 + g(\theta)\Big\}$
@freakonometrics 56
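A minimal sketch of this proximal gradient iteration (ISTA) for the LASSO objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$; the step size $t^{-1}$ uses the largest eigenvalue of $X^TX$ as Lipschitz constant, and the fixed number of iterations is only for illustration:
prox_grad_lasso <- function(X, y, lambda, n_iter = 500) {
  t <- max(eigen(crossprod(X), symmetric = TRUE)$values)   # Lipschitz constant of grad f
  beta <- rep(0, ncol(X))
  for (k in 1:n_iter) {
    grad <- drop(crossprod(X, X %*% beta - y))             # gradient of the smooth part f
    z <- beta - grad / t                                   # plain gradient step
    beta <- sign(z) * pmax(abs(z) - lambda / t, 0)         # proximal step, prox_{g/t} = S_{lambda/t}
  }
  beta
}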
  • 57. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
Consider some convex differentiable function $f : \mathbb{R}^k \to \mathbb{R}$. Consider $x^\star \in \mathbb{R}^k$ obtained by minimizing along each coordinate axis, i.e.
$f(x_1^\star, \cdots, x_{i-1}^\star, x_i, x_{i+1}^\star, \cdots, x_k^\star) \geq f(x_1^\star, \cdots, x_{i-1}^\star, x_i^\star, x_{i+1}^\star, \cdots, x_k^\star)$ for all i (and all $x_i$).
Is $x^\star$ a global minimizer, i.e. $f(x) \geq f(x^\star)$, $\forall x \in \mathbb{R}^k$?
Yes, if f is convex and differentiable, since
$\nabla f(x)\big|_{x=x^\star} = \Big(\frac{\partial f(x^\star)}{\partial x_1}, \cdots, \frac{\partial f(x^\star)}{\partial x_k}\Big) = \boldsymbol{0}$
There might be a problem if f is not differentiable (except in each axis direction).
If $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ with g convex and differentiable, the answer is again yes, since
$f(x) - f(x^\star) \geq \nabla g(x^\star)^T(x - x^\star) + \sum_i [h_i(x_i) - h_i(x_i^\star)]$
@freakonometrics 57
  • 58. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
$f(x) - f(x^\star) \geq \sum_i \underbrace{\big[\nabla_i g(x^\star)(x_i - x_i^\star) + h_i(x_i) - h_i(x_i^\star)\big]}_{\geq 0} \geq 0$
Thus, for functions $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ we can use coordinate descent to find a minimizer, i.e. at step j
$x_1^{(j)} \in \underset{x_1}{\text{argmin}}\ f(x_1, x_2^{(j-1)}, x_3^{(j-1)}, \cdots, x_k^{(j-1)})$
$x_2^{(j)} \in \underset{x_2}{\text{argmin}}\ f(x_1^{(j)}, x_2, x_3^{(j-1)}, \cdots, x_k^{(j-1)})$
$x_3^{(j)} \in \underset{x_3}{\text{argmin}}\ f(x_1^{(j)}, x_2^{(j)}, x_3, \cdots, x_k^{(j-1)})$
and so on. Tseng (2001) Convergence of a Block Coordinate Descent Method: if f is continuous, then $x^\infty$ is a minimizer of f.
@freakonometrics 58
  • 59. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Linear Regression
Let $f(x) = \frac{1}{2}\|y - Ax\|^2$, with $y\in\mathbb{R}^n$ and $A\in\mathcal{M}_{n\times k}$. Let $A = [A_1, \cdots, A_k]$. Let us minimize in direction i, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$0 = \frac{\partial f(x)}{\partial x_i} = A_i^T[Ax - y] = A_i^T[A_i x_i + A_{-i}x_{-i} - y]$
thus, the optimal value is
$x_i^\star = \frac{A_i^T[y - A_{-i}x_{-i}]}{A_i^TA_i}$
@freakonometrics 59
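One full sweep of this coordinate-wise update, as a small R sketch (the function and argument names are only for illustration):
coord_sweep_ols <- function(A, y, x) {
  for (i in seq_len(ncol(A))) {
    r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]          # partial residual, direction i left out
    x[i] <- drop(crossprod(A[, i], r_i)) / sum(A[, i]^2)
  }
  x
}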
  • 60. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to LASSO
Let $f(x) = \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1$, so that the non-differentiable part is separable, since $\|x\|_1 = \sum_{i=1}^k |x_i|$. Let us minimize in direction i, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$0 = \frac{\partial f(x)}{\partial x_i} = A_i^T[A_i x_i + A_{-i}x_{-i} - y] + \lambda s_i$
where $s_i \in \partial|x_i|$. Thus, the solution is obtained by soft-thresholding,
$x_i^\star = S_{\lambda/\|A_i\|^2}\left(\frac{A_i^T[y - A_{-i}x_{-i}]}{A_i^TA_i}\right)$
@freakonometrics 60
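Adding the soft-threshold to the previous sweep gives a cyclic coordinate-descent LASSO, sketched below (the number of sweeps is arbitrary; a comparison with glmnet would require matching its 1/(2n) scaling of the squared loss):
cd_lasso <- function(A, y, lambda, n_sweep = 200) {
  x <- rep(0, ncol(A))
  for (s in 1:n_sweep) {
    for (i in seq_len(ncol(A))) {
      r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]          # partial residual
      z    <- drop(crossprod(A[, i], r_i)) / sum(A[, i]^2) # unpenalized coordinate update
      x[i] <- sign(z) * max(abs(z) - lambda / sum(A[, i]^2), 0)  # soft-thresholding
    }
  }
  x
}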
  • 61. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Convergence rate for the LASSO
Let $f(x) = g(x) + \lambda\|x\|_1$ with
• g convex, $\nabla g$ Lipschitz with constant L > 0, and $\text{Id} - \nabla g / L$ monotone increasing in each component
• there exists z such that, componentwise, either $z \geq S_\lambda(z - \nabla g(z))$ or $z \leq S_\lambda(z - \nabla g(z))$
Saha & Tewari (2010) On the Finite Time Convergence of Cyclic Coordinate Descent Methods proved that a coordinate descent sequence started from z satisfies
$f(x^{(j)}) - f(x^\star) \leq \frac{L\|z - x^\star\|^2}{2j}$
@freakonometrics 61
  • 62. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Graphical Lasso and Covariance Estimation
We want to estimate an (unknown) covariance matrix Σ, or its inverse $\Sigma^{-1}$. An estimate of $\Sigma^{-1}$ is the solution $\widehat{\Theta}$ of
$\widehat{\Theta} \in \underset{\Theta\in\mathcal{M}_{k\times k}}{\text{argmin}}\big\{-\log[\det(\Theta)] + \text{trace}[S\Theta] + \lambda\|\Theta\|_1\big\}$
where $S = \frac{X^TX}{n}$ and where $\|\Theta\|_1 = \sum|\Theta_{i,j}|$.
See van Wieringen (2016) Undirected network reconstruction from high-dimensional data and https://github.com/kaizhang/glasso
@freakonometrics 62
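The glasso package provides an implementation of this estimator; a short sketch on simulated data (the penalty rho = .1 and the dimensions are arbitrary):
library(glasso)
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
S <- crossprod(scale(X, scale = FALSE)) / nrow(X)   # S = X'X / n, with centered columns
fit <- glasso(S, rho = .1)      # rho is the l1 penalty on Theta
round(fit$wi, 2)                # estimated (sparse) precision matrix Theta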
  • 63. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Network Simplification
This can be applied to networks, to spot 'significant' connections...
Source: http://khughitt.github.io/graphical-lasso/
@freakonometrics 63
  • 64. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Extension of Penalization Techniques
In a more general context, we want to solve
$\widehat{\theta} \in \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_i, m_\theta(x_i)) + \lambda\cdot\text{penalty}(\theta)\Big\}$
@freakonometrics 64