Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Advanced Econometrics #3: Model & Variable Selection*
A. Charpentier (Université de Rennes 1)
Université de Rennes 1,
Graduate Course, 2017.
@freakonometrics 1
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
“Great plot.
Now need to find the theory that explains it”
Deville (2017) http://twitter.com
@freakonometrics 2
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Problem : x⋆ ∈ argmin{f(x); x ∈ R^d}
Gradient descent : x_{k+1} = x_k − η ∇f(x_k), starting from some x_0
Problem : x⋆ ∈ argmin{f(x); x ∈ X ⊂ R^d}
Projected descent : x_{k+1} = Π_X ( x_k − η ∇f(x_k) ), starting from some x_0
A constrained problem is said to be convex if
    min{f(x)} with f convex
    s.t. gi(x) = 0, ∀i = 1, · · · , n, with gi linear
         hi(x) ≤ 0, ∀i = 1, · · · , m, with hi convex
Lagrangian : L(x, λ, µ) = f(x) + Σ_{i=1}^n λi gi(x) + Σ_{i=1}^m µi hi(x), where x are primal
variables and (λ, µ) are dual variables.
Remark: L is an affine function in (λ, µ).
@freakonometrics 3
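As an illustration, here is a minimal R sketch of the gradient-descent recursion on a hypothetical quadratic objective (the function f, the step size η = 0.1 and the starting point are illustrative choices, not taken from the slides):

# gradient descent x_{k+1} = x_k - eta * grad f(x_k) on a toy convex objective
f      <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 1)^2
grad_f <- function(x) c(2 * (x[1] - 1), 4 * (x[2] + 1))
x   <- c(0, 0)    # starting point x_0
eta <- 0.1        # step size
for (k in 1:200) x <- x - eta * grad_f(x)
x                 # converges to the minimizer (1, -1)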
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Karush–Kuhn–Tucker conditions : a convex problem has a solution x⋆ if and
only if there are (λ⋆, µ⋆) such that the following conditions hold
• stationarity : ∇_x L(x, λ, µ) = 0 at (x⋆, λ⋆, µ⋆)
• primal admissibility : gi(x⋆) = 0 and hi(x⋆) ≤ 0, ∀i
• dual admissibility : µ⋆ ≥ 0
Let L denote the associated dual function, L(λ, µ) = min_x {L(x, λ, µ)}.
L is a concave function in (λ, µ) and the dual problem is max_{λ,µ} {L(λ, µ)}.
@freakonometrics 4
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
References
Motivation
Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip:
Identifying Central Individuals in a Social Network.
References
Belloni, A. & Chernozhukov, V. (2009). Least squares after model selection in
high-dimensional sparse models.
Hastie, T., Tibshirani, R. & Wainwright, M. (2015). Statistical Learning with
Sparsity: The Lasso and Generalizations. CRC Press.
@freakonometrics 5
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preambule
Assume that y = m(x) + ε, where ε is some idiosyncratic, unpredictable noise.
The error E[(y − m̂(x))^2] is the sum of three terms
• variance of the estimator : E[(m̂(x) − E[m̂(x)])^2]
• bias^2 of the estimator : [E[m̂(x)] − m(x)]^2
• variance of the noise : E[(y − m(x))^2]
(the latter exists, even with a ‘perfect’ model).
@freakonometrics 6
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preambule
Consider a parametric model, with true (unknown) parameter θ, then
mse(θ̂) = E[(θ̂ − θ)^2] = E[(θ̂ − E[θ̂])^2]  (variance)  + (E[θ̂] − θ)^2  (bias^2)
Let θ̃ denote an unbiased estimator of θ. Then
θ̂ = θ^2/(θ^2 + mse(θ̃)) · θ̃ = θ̃ − mse(θ̃)/(θ^2 + mse(θ̃)) · θ̃   (the second term acting as a penalty)
satisfies mse(θ̂) ≤ mse(θ̃).
@freakonometrics 7
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Occam’s Razor
The “law of parsimony”, “lex parsimoniæ”
Penalize too complex models
@freakonometrics 8
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Let X ∼ N(µ, σ^2 I). We want to estimate µ.
µ̂_mle = X̄_n ∼ N( µ, (σ^2/n) I ).
From James & Stein (1961) Estimation with quadratic loss,
µ̂_JS = ( 1 − (d − 2)σ^2 / (n‖y‖^2) ) · y
where ‖·‖ is the Euclidean norm.
One can prove that if d ≥ 3,
E[ ‖µ̂_JS − µ‖^2 ] < E[ ‖µ̂_mle − µ‖^2 ]
Samworth (2015) Stein’s paradox, “one should use the price of tea in China to
obtain a better estimate of the chance of rain in Melbourne”.
@freakonometrics 9
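A quick simulation makes the paradox visible; the following R sketch (illustrative values of d, µ and σ, with one observation per replication — all assumptions, not taken from the slides) compares the quadratic losses of the two estimators:

# James-Stein vs MLE, with d = 10 >= 3
set.seed(1)
d <- 10; sigma <- 1; mu <- rep(0.5, d); nsim <- 1000
err_mle <- err_js <- numeric(nsim)
for (s in 1:nsim) {
  y <- rnorm(d, mu, sigma)                          # one draw of X ~ N(mu, sigma^2 I)
  mu_js <- (1 - (d - 2) * sigma^2 / sum(y^2)) * y   # shrink towards 0
  err_mle[s] <- sum((y - mu)^2)
  err_js[s]  <- sum((mu_js - mu)^2)
}
c(mle = mean(err_mle), js = mean(err_js))           # JS has the smaller risk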
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Heuristics : consider a biased estimator, to decrease the variance.
See Efron (2010) Large-Scale Inference
@freakonometrics 10
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Motivation: Avoiding Overfit
Generalization : the model should perform well on new data (and not only on the
training ones).
@freakonometrics 11
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
Use principal components to reduce dimension (on centered and scaled variables):
we want d vectors z_1, · · · , z_d such that
the first component is z_1 = Xω_1 where
ω_1 = argmax_{‖ω‖=1} ‖X · ω‖^2 = argmax_{‖ω‖=1} ω^T X^T X ω
and the second component is z_2 = Xω_2 where
ω_2 = argmax_{‖ω‖=1} ‖X^{(1)} · ω‖^2
[Figure: log mortality rates against age, and the first two PC scores (PC score 1 vs PC score 2); years such as 1914–1919 and 1940–1944 stand out.]
with X^{(1)} = X − Xω_1 ω_1^T, where Xω_1 = z_1.
@freakonometrics 12
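In R, the components above can be obtained with prcomp; the simulated design below is an illustrative placeholder (on real data one would pass the centered and scaled variables):

# first two principal components on standardized variables
X  <- scale(matrix(rnorm(200 * 5), 200, 5))      # illustrative design matrix
pc <- prcomp(X, center = FALSE, scale. = FALSE)  # already standardized above
omega1 <- pc$rotation[, 1]                       # loading vector omega_1
z1 <- X %*% omega1                               # first component z_1 = X omega_1
z2 <- pc$x[, 2]                                  # second component z_2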
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
A regression on (the d) principal components, y = z^T β + η, could be an
interesting idea; unfortunately, principal components have no reason to be
correlated with y. The first component was z_1 = Xω_1 where
ω_1 = argmax_{‖ω‖=1} ‖X · ω‖^2 = argmax_{‖ω‖=1} ω^T X^T X ω
It is an unsupervised technique.
Instead, use partial least squares, introduced in Wold (1966) Estimation of
Principal Components and Related Models by Iterative Least Squares. The first
component is z_1 = Xω_1 where
ω_1 = argmax_{‖ω‖=1} { ⟨y, X · ω⟩ } = argmax_{‖ω‖=1} ω^T X^T y y^T X ω
@freakonometrics 13
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Terminology
Consider a dataset {y_i, x_i}, assumed to be generated from (Y, X), from an
unknown distribution P.
Let m_0(·) be the “true” model. Assume that y_i = m_0(x_i) + ε_i.
In a regression context (quadratic loss function), the risk associated to m is
R(m) = E_P[ (Y − m(X))^2 ]
An optimal model m⋆ within a class M satisfies
R(m⋆) = inf_{m∈M} R(m)
Such a model m⋆ is usually called oracle.
Observe that m⋆(x) = E[Y |X = x] is the solution of
R(m⋆) = inf_{m∈M} R(m), where M is the set of measurable functions.
@freakonometrics 14
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The empirical risk is
R̂_n(m) = (1/n) Σ_{i=1}^n ( y_i − m(x_i) )^2
For instance, m can be a linear predictor, m(x) = β_0 + x^T β, where θ = (β_0, β)
should be estimated (trained).
E[R̂_n(m)] = E[(m(X) − Y)^2] can be expressed as
E[(m(X) − E[m(X)|X])^2]   (variance of m)
+ E[(E[m(X)|X] − E[Y |X])^2]   (bias of m, where E[Y |X] = m_0(X))
+ E[(Y − E[Y |X])^2]   (variance of the noise)
The third term is the risk of the “optimal” estimator m⋆, which cannot be
decreased.
@freakonometrics 15
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
Consider a linear predictor (see #1), i.e. ŷ = m̂(x) = Ay.
Assume that y = m_0(x) + ε, with E[ε] = 0 and Var[ε] = σ^2 I.
Let ‖·‖ denote the Euclidean norm.
Empirical risk : R̂_n(m̂) = (1/n) ‖y − m̂(x)‖^2
Vapnik’s risk : E[R_n(m̂)] = (1/n) ‖m_0(x) − m̂(x)‖^2 + (1/n) E[ ‖y − m_0(x)‖^2 ], with
m_0(x) = E[Y |X = x].
Observe that
n E[R̂_n(m̂)] = E[ ‖y − m̂(x)‖^2 ] = ‖(I − A)m_0‖^2 + σ^2 ‖I − A‖^2
while
E[ ‖m_0(x) − m̂(x)‖^2 ] = ‖(I − A)m_0‖^2   (bias)   + σ^2 ‖A‖^2   (variance)
@freakonometrics 16
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
One can obtain
E[R_n(m̂)] = E[R̂_n(m̂)] + 2 (σ^2/n) trace(A),
where R̂_n is the empirical risk: if trace(A) ≥ 0, the empirical risk underestimates
the true risk of the estimator.
The number of degrees of freedom of the (linear) predictor is related to trace(A).
2 (σ^2/n) trace(A) is called Mallows’ penalty C_L.
If A is a projection matrix, trace(A) is the dimension of the projection space, p,
and we obtain Mallows’ C_P , 2 (σ^2/n) p.
Remark : Mallows (1973) Some Comments on C_p introduced this penalty while
focusing on the R^2.
@freakonometrics 17
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
C_P is associated to a quadratic risk;
an alternative is to use a distance on the (conditional) distribution of Y , namely
the Kullback-Leibler distance
discrete case : D_KL(P‖Q) = Σ_i P(i) log[ P(i)/Q(i) ]
continuous case : D_KL(P‖Q) = ∫_{−∞}^{∞} p(x) log[ p(x)/q(x) ] dx
Let f denote the true (unknown) density, and f_θ some parametric distribution,
D_KL(f‖f_θ) = ∫ f(x) log[ f(x)/f_θ(x) ] dx = ∫ f(x) log[f(x)] dx − ∫ f(x) log[f_θ(x)] dx
where the first term is the relative information.
Hence
minimize {D_KL(f‖f_θ)} ←→ maximize E[ log f_θ(X) ]
@freakonometrics 18
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Akaike (1974) A new look at the statistical model identification observed that for n
large enough
E[ log f_θ(X) ] ∼ log[L(θ̂)] − dim(θ)
Thus
AIC = −2 log L(θ̂) + 2 dim(θ)
Example : in a (Gaussian) linear model, y_i = β_0 + x_i^T β + ε_i,
AIC = n log( (1/n) Σ_{i=1}^n ε̂_i^2 ) + 2[dim(β) + 2]
@freakonometrics 19
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Remark : this is valid for large samples (rule of thumb: n/dim(θ) > 40);
otherwise use a corrected AIC,
AICc = AIC + 2k(k + 1)/(n − k − 1)   (bias correction)
where k = dim(θ),
see Sugiura (1978) Further analysis of the data by Akaike’s information criterion and
the finite corrections, the second-order AIC.
Using a Bayesian interpretation, Schwarz (1978) Estimating the dimension of a
model obtained
BIC = −2 log L(θ̂) + log(n) dim(θ).
Observe that the criteria considered are of the form
criterion = −function(L(θ̂)) + penalty(complexity)
@freakonometrics 20
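As a small illustration, these criteria are readily available in R for a Gaussian linear model (the dataset and model below are illustrative choices, not taken from the slides):

# AIC, AICc and BIC for a Gaussian linear model
fit <- lm(dist ~ speed, data = cars)        # illustrative model on a built-in dataset
k <- length(coef(fit)) + 1                  # dim(theta), counting sigma^2
n <- nobs(fit)
AIC(fit)                                    # -2 log L + 2 k
AIC(fit) + 2 * k * (k + 1) / (n - k - 1)    # corrected AICc
BIC(fit)                                    # -2 log L + log(n) k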
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
Consider a naive bootstrap procedure, based on a bootstrap sample
S_b = {(y_i^{(b)}, x_i^{(b)})}.
The plug-in estimator of the empirical risk is
R̂_n(m̂^{(b)}) = (1/n) Σ_{i=1}^n ( y_i − m̂^{(b)}(x_i) )^2
and then
R̂_n = (1/B) Σ_{b=1}^B R̂_n(m̂^{(b)}) = (1/B) Σ_{b=1}^B (1/n) Σ_{i=1}^n ( y_i − m̂^{(b)}(x_i) )^2
@freakonometrics 21
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
One might improve this estimate using an out-of-bag procedure
R̂_n = (1/n) Σ_{i=1}^n (1/#B_i) Σ_{b∈B_i} ( y_i − m̂^{(b)}(x_i) )^2
where B_i is the set of all bootstrap samples that do not contain (y_i, x_i).
Remark: P( (y_i, x_i) ∉ S_b ) = (1 − 1/n)^n ∼ e^{−1} = 36.78%.
@freakonometrics 22
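A minimal R sketch of this out-of-bag risk estimate, on simulated data with an illustrative polynomial regression (all choices below are assumptions, not taken from the slides):

# out-of-bag estimate of the risk
set.seed(1)
n <- 100; B <- 200
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = .3)
err <- matrix(NA, B, n)
for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)                 # bootstrap sample S_b
  fit <- lm(y ~ poly(x, 3), data = data.frame(x = x[idx], y = y[idx]))
  oob <- setdiff(1:n, idx)                              # observations not in S_b
  err[b, oob] <- (y[oob] - predict(fit, newdata = data.frame(x = x[oob])))^2
}
mean(colMeans(err, na.rm = TRUE), na.rm = TRUE)         # out-of-bag risk estimate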
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Least Squares Estimator : β̂ = (X^T X)^{-1} X^T y
Unbiased Estimator : E[β̂] = β
Variance : Var[β̂] = σ^2 (X^T X)^{-1},
which can be (extremely) large when det[(X^T X)] ∼ 0.
X =
  [ 1 −1  2 ]
  [ 1  0  1 ]
  [ 1  2 −1 ]
  [ 1  1  0 ]
then
X^T X =
  [ 4  2  2 ]
  [ 2  6 −4 ]
  [ 2 −4  6 ]
while
X^T X + I =
  [ 5  2  2 ]
  [ 2  7 −4 ]
  [ 2 −4  7 ]
eigenvalues : {10, 6, 0} vs. {11, 7, 1}
Ad-hoc strategy: use X^T X + λI
@freakonometrics 23
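The small numerical example above can be checked directly in R:

# eigenvalues of X'X and of the regularized X'X + I
X <- matrix(c(1, -1,  2,
              1,  0,  1,
              1,  2, -1,
              1,  1,  0), nrow = 4, byrow = TRUE)
eigen(t(X) %*% X)$values            # 10, 6, 0 : X'X is singular
eigen(t(X) %*% X + diag(3))$values  # 11, 7, 1 : regularization removes the null eigenvalue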
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Evolution of (β_1, β_2) → Σ_{i=1}^n [y_i − (β_1 x_{1,i} + β_2 x_{2,i})]^2
when cor(X_1, X_2) = r ∈ [0, 1], on top.
Below, Ridge regression
(β_1, β_2) → Σ_{i=1}^n [y_i − (β_1 x_{1,i} + β_2 x_{2,i})]^2 + λ(β_1^2 + β_2^2)
where λ ∈ [0, ∞),
when cor(X_1, X_2) ∼ 1 (collinearity).
@freakonometrics 24
[Figure: contours of the two objectives in (β_1, β_2).]
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Normalization : Euclidean ℓ_2 vs. Mahalanobis
We want to penalize complicated models :
if β_k is “too small”, we prefer to have β_k = 0.
Instead of d(x, y) = (x − y)^T(x − y),
use d_Σ(x, y) = (x − y)^T Σ^{-1} (x − y).
[Figure: contours of the penalized objective in (β_1, β_2) under the two distances.]
@freakonometrics 25
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
... like least squares, but it shrinks estimated coefficients towards 0.
β̂_λ^{ridge} = argmin { Σ_{i=1}^n (y_i − x_i^T β)^2 + λ Σ_{j=1}^p β_j^2 }
β̂_λ^{ridge} = argmin { ‖y − Xβ‖_2^2   (criteria)   + λ‖β‖_2^2   (penalty) }
λ ≥ 0 is a tuning parameter.
The constant is usually unpenalized. The true equation is
β̂_λ^{ridge} = argmin { ‖y − (β_0 + Xβ)‖_2^2   (criteria)   + λ‖β‖_2^2   (penalty) }
@freakonometrics 26
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
β̂_λ^{ridge} = argmin { ‖y − (β_0 + Xβ)‖_2^2 + λ‖β‖_2^2 }
can be seen as a constrained optimization problem
β̂_λ^{ridge} = argmin_{‖β‖_2^2 ≤ h_λ} { ‖y − (β_0 + Xβ)‖_2^2 }
Explicit solution
β̂_λ = (X^T X + λI)^{-1} X^T y
If λ → 0, β̂_0^{ridge} = β̂^{ols};
if λ → ∞, β̂_∞^{ridge} = 0.
[Figure: contours of the least-squares objective and the ℓ_2 constraint region in (β_1, β_2), for two values of λ.]
@freakonometrics 27
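The explicit solution and its two limits can be checked with a few lines of R (simulated, standardized data; the design and the values of λ are illustrative):

# explicit ridge solution (X'X + lambda I)^{-1} X'y and its limits
set.seed(1)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))
y <- X %*% c(1, -2, 0) + rnorm(n)
ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)
cbind(ols = c(ridge(0)), lambda_10 = c(ridge(10)), lambda_1e6 = c(ridge(1e6)))
# lambda -> 0 gives the OLS estimate, lambda -> infinity shrinks towards 0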
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
This penalty can be seen as rather unfair if components of x are not expressed
on the same scale
• center: x̄_j = 0, then β̂_0 = ȳ
• scale: x_j^T x_j = 1
Then compute
β̂_λ^{ridge} = argmin { ‖y − Xβ‖_2^2   (loss)   + λ‖β‖_2^2   (penalty) }
[Figure: contours and constraint regions in (β_1, β_2), before and after standardization.]
@freakonometrics 28
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
Observe that if x_{j_1} ⊥ x_{j_2}, then
β̂_λ^{ridge} = [1 + λ]^{-1} β̂^{ols}
which explains the relationship with shrinkage.
But generally, it is not the case...
Theorem. There exists λ such that mse[β̂_λ^{ridge}] ≤ mse[β̂^{ols}].
@freakonometrics 29
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
L_λ(β) = Σ_{i=1}^n (y_i − β_0 − x_i^T β)^2 + λ Σ_{j=1}^p β_j^2
∂L_λ(β)/∂β = −2X^T y + 2(X^T X + λI)β
∂^2 L_λ(β)/∂β∂β^T = 2(X^T X + λI)
where X^T X is a semi-positive definite matrix, and λI is a positive definite
matrix, so that
β̂_λ = (X^T X + λI)^{-1} X^T y
@freakonometrics 30
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Bayesian Interpretation
From a Bayesian perspective,
P[θ|y]   (posterior)   ∝ P[y|θ]   (likelihood)   · P[θ]   (prior)
i.e. log P[θ|y] = log P[y|θ]   (log-likelihood)   + log P[θ]   (penalty)
If β has a prior N(0, τ^2 I) distribution, then its posterior distribution has mean
E[β|y, X] = ( X^T X + (σ^2/τ^2) I )^{-1} X^T y.
@freakonometrics 31
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
β̂_λ = (X^T X + λI)^{-1} X^T y
E[β̂_λ] = X^T X(λI + X^T X)^{-1} β,
i.e. E[β̂_λ] ≠ β (the ridge estimator is biased).
Observe that E[β̂_λ] → 0 as λ → ∞.
Assume that X is an orthogonal design matrix, i.e. X^T X = I, then
β̂_λ = (1 + λ)^{-1} β̂^{ols}.
@freakonometrics 32
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Set W_λ = (I + λ[X^T X]^{-1})^{-1}. One can prove that
W_λ β̂^{ols} = β̂_λ.
Thus,
Var[β̂_λ] = W_λ Var[β̂^{ols}] W_λ^T
and
Var[β̂_λ] = σ^2 (X^T X + λI)^{-1} X^T X [(X^T X + λI)^{-1}]^T.
Observe that
Var[β̂^{ols}] − Var[β̂_λ] = σ^2 W_λ [ 2λ(X^T X)^{-2} + λ^2(X^T X)^{-3} ] W_λ^T ≥ 0.
@freakonometrics 33
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is
indeed smaller than the OLS one.
If X is an orthogonal design matrix,
Var[β̂_λ] = σ^2 (1 + λ)^{-2} I.
mse[β̂_λ] = σ^2 trace( W_λ(X^T X)^{-1} W_λ^T ) + β^T (W_λ − I)^T (W_λ − I) β.
If X is an orthogonal design matrix,
mse[β̂_λ] = pσ^2/(1 + λ)^2 + λ^2/(1 + λ)^2 · β^T β
@freakonometrics 34
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
mse[β̂_λ] = pσ^2/(1 + λ)^2 + λ^2/(1 + λ)^2 · β^T β
is minimal for
λ⋆ = pσ^2 / β^T β
Note that there exists λ > 0 such that mse[β̂_λ] < mse[β̂_0] = mse[β̂^{ols}].
@freakonometrics 35
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
SVD decomposition
Consider the singular value decomposition X = UDV^T. Then
β̂^{ols} = V D^{-2} D U^T y
β̂_λ = V (D^2 + λI)^{-1} D U^T y
Observe that
D_{i,i}^{-1} ≥ D_{i,i} / (D_{i,i}^2 + λ)
hence, the ridge penalty shrinks singular values.
Set now R = UD (an n × n matrix), so that X = RV^T; then
β̂_λ = V (R^T R + λI)^{-1} R^T y
@freakonometrics 36
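The SVD expression can be verified numerically in R (illustrative simulated design and value of λ):

# ridge coefficients through the SVD X = U D V'
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3)); y <- rnorm(100)
lambda <- 2
s <- svd(X)
beta_svd   <- s$v %*% diag(s$d / (s$d^2 + lambda)) %*% t(s$u) %*% y
beta_ridge <- solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)
max(abs(beta_svd - beta_ridge))   # numerically zero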
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Hat matrix and Degrees of Freedom
Recall that Ŷ = HY with
H = X(X^T X)^{-1} X^T
Similarly
H_λ = X(X^T X + λI)^{-1} X^T
trace[H_λ] = Σ_{j=1}^p d_{j,j}^2 / (d_{j,j}^2 + λ) → 0, as λ → ∞.
@freakonometrics 37
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Sparsity Issues
In several applications, k can be (very) large, but a lot of features are just noise:
β_j = 0 for many j’s. Let s denote the number of relevant features, with s ≪ k,
cf. Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity,
s = card{S} where S = {j; β_j ≠ 0}
The model is now y = X_S^T β_S + ε, where X_S^T X_S is a full rank matrix.
@freakonometrics 38
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
The Ridge regression problem was to solve
β̂ = argmin_{β∈{‖β‖_2 ≤ s}} { ‖Y − X^T β‖_2^2 }
Define ‖a‖_0 = Σ 1(|a_i| > 0).
Here dim(β) = k but ‖β‖_0 = s.
We wish we could solve
β̂ = argmin_{β∈{‖β‖_0 = s}} { ‖Y − X^T β‖_2^2 }
Problem: it is usually not possible to describe all possible constraints, since
there are (k choose s) subsets of coefficients to consider (with k (very) large).
@freakonometrics 39
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
In a convex problem, solve the dual problem,
e.g. in the Ridge regression : primal problem
min_{β∈{‖β‖_2 ≤ s}} { ‖Y − X^T β‖_2^2 }
and the dual problem
min_{β∈{‖Y − X^T β‖_2 ≤ t}} { ‖β‖_2^2 }
[Figure: primal and dual constraint regions with the objective contours in (β_1, β_2).]
@freakonometrics 40
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
Idea: solve the dual problem
β̂ = argmin_{β∈{‖Y − X^T β‖_2 ≤ h}} { ‖β‖_0 }
where we might convexify the ℓ_0 norm, ‖·‖_0.
@freakonometrics 41
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
On [−1, +1]^k, the convex hull of ‖β‖_0 is ‖β‖_1.
On [−a, +a]^k, the convex hull of ‖β‖_0 is a^{-1}‖β‖_1.
Hence, why not solve
β̂ = argmin_{β; ‖β‖_1 ≤ s̃} { ‖Y − X^T β‖_2 }
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization
problem
β̂ = argmin { ‖Y − X^T β‖_2^2 + λ‖β‖_1 }
@freakonometrics 42
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Least Absolute Shrinkage and Selection Operator
β̂ ∈ argmin { ‖Y − X^T β‖_2^2 + λ‖β‖_1 }
is a convex problem (several algorithms), but not strictly convex (no uniqueness of
the minimum). Nevertheless, predictions ŷ = x^T β̂ are unique.
Algorithms: MM (minimize majorization), coordinate descent; see Hunter & Lange (2003) A
Tutorial on MM Algorithms.
@freakonometrics 43
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
No explicit solution...
If λ → 0, β̂_0^{lasso} = β̂^{ols};
if λ → ∞, β̂_∞^{lasso} = 0.
[Figure: contours of the least-squares objective and the ℓ_1 constraint region in (β_1, β_2).]
@freakonometrics 44
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
For some λ, there are k’s such that β̂_{k,λ}^{lasso} = 0.
Further, λ → β̂_{k,λ}^{lasso} is piecewise linear.
[Figure: contours and ℓ_1 constraint regions in (β_1, β_2), for two values of λ.]
@freakonometrics 45
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
In the orthogonal case, X^T X = I,
β̂_{k,λ}^{lasso} = sign(β̂_k^{ols}) · ( |β̂_k^{ols}| − λ/2 )_+
i.e. the LASSO estimate is related to the soft-threshold function...
@freakonometrics 46
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimal LASSO Penalty
Use cross validation, e.g. K-fold,
β̂_{(−k)}(λ) = argmin { Σ_{i∉I_k} [y_i − x_i^T β]^2 + λ‖β‖_1 }
then compute the sum of the squared errors,
Q_k(λ) = Σ_{i∈I_k} [y_i − x_i^T β̂_{(−k)}(λ)]^2
and finally solve
λ⋆ = argmin { Q(λ) = (1/K) Σ_k Q_k(λ) }
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements
of Statistical Learning suggest the largest λ such that
Q(λ) ≤ Q(λ⋆) + se[λ⋆]   with   se[λ]^2 = (1/K^2) Σ_{k=1}^K [Q_k(λ) − Q(λ)]^2
@freakonometrics 47
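With glmnet, this K-fold strategy (including the one-standard-error rule) is available directly; the data below are an illustrative sparse design, not from the slides:

# K-fold cross validation for the LASSO penalty
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)   # sparse signal
cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cv$lambda.min   # lambda minimizing the cross-validated error
cv$lambda.1se   # largest lambda within one standard error of the minimum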
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO and Ridge, with R
> library(glmnet)
> chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header=TRUE, sep=";")
> standardize <- function(x) (x - mean(x)) / sd(x)
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge   <- glmnet(cbind(z1, z2), z0, alpha = 0,  intercept = FALSE, lambda = 1)
> lasso   <- glmnet(cbind(z1, z2), z0, alpha = 1,  intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net penalty : λ_1‖β‖_1 + λ_2‖β‖_2^2
@freakonometrics 48
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression, Smoothing and Overfit
LASSO can be used to avoid overfit.
@freakonometrics 49
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge vs. LASSO
Consider simulated data (output on the right).
With orthogonal variables, the shrinkage operators are shown below.
[Figure: β̂(ridge) and β̂(lasso) plotted against β̂, and coefficient paths against the L1 norm.]
@freakonometrics 50
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
First idea: given some initial guess β_(0), |β| ∼ |β_(0)| + (1/(2|β_(0)|)) (β^2 − β_(0)^2).
The LASSO estimate can probably be derived from iterated Ridge estimates:
‖y − Xβ_(k+1)‖_2^2 + λ‖β_(k+1)‖_1 ∼ ‖y − Xβ_(k+1)‖_2^2 + (λ/2) Σ_j [β_{j,(k+1)}]^2 / |β_{j,(k)}|
which is a weighted ridge penalty function.
Thus,
β_(k+1) = ( X^T X + λ∆_(k) )^{-1} X^T y
where ∆_(k) = diag[ |β_{j,(k)}|^{-1} ]. Then β_(k) → β̂^{lasso}, as k → ∞.
@freakonometrics 51
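A sketch of this reweighted-ridge recursion in R, following the recursion on the slide (the tolerance eps guarding against division by zero and the fixed iteration count are added implementation details):

# iterated ridge approximation of the LASSO estimate
lasso_irr <- function(X, y, lambda, n_iter = 50, eps = 1e-8) {
  beta <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)        # ridge start
  for (k in 1:n_iter) {
    Delta <- diag(1 / pmax(abs(as.vector(beta)), eps), nrow = ncol(X))  # diag(|beta_j,(k)|^{-1})
    beta  <- solve(t(X) %*% X + lambda * Delta, t(X) %*% y)
  }
  beta
}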
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of LASSO Estimate
From this iterative technique,
β̂_λ^{lasso} ∼ ( X^T X + λ∆ )^{-1} X^T y
where ∆ = diag[ |β̂_{j,λ}^{lasso}|^{-1} ] if β̂_{j,λ}^{lasso} ≠ 0, and 0 otherwise.
Thus,
E[β̂_λ^{lasso}] ∼ ( X^T X + λ∆ )^{-1} X^T X β
and
Var[β̂_λ^{lasso}] ∼ σ^2 ( X^T X + λ∆ )^{-1} X^T X ( X^T X + λ∆ )^{-1}
@freakonometrics 52
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Consider here a simplified problem: min_{a∈R} { (1/2)(a − b)^2 + λ|a| }, with λ > 0;
call the objective g(a). Observe that g′(0^±) = −b ± λ. Then
• if |b| ≤ λ, then a⋆ = 0
• if b ≥ λ, then a⋆ = b − λ
• if b ≤ −λ, then a⋆ = b + λ
a⋆ = argmin_{a∈R} { (1/2)(a − b)^2 + λ|a| } = S_λ(b) = sign(b) · (|b| − λ)_+,
also called the soft-thresholding operator.
@freakonometrics 53
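The soft-thresholding operator is one line of R:

# soft-thresholding S_lambda(b) = sign(b) * (|b| - lambda)_+
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-3, -0.5, 0, 0.5, 3), lambda = 1)   # -2  0  0  0  2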
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Definition: for any convex function h, define the proximal operator of h,
proximal_h(y) = argmin_{x∈R^d} { (1/2)‖x − y‖_2^2 + h(x) }
Note that
proximal_{λ‖·‖_2^2}(y) = (1/(1 + λ)) y   (shrinkage operator)
proximal_{λ‖·‖_1}(y) = S_λ(y) = sign(y) · (|y| − λ)_+
@freakonometrics 54
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
We want to solve here
θ̂ ∈ argmin_{θ∈R^d} { (1/n)‖y − m_θ(x)‖_2^2   (= f(θ))   + λ penalty(θ)   (= g(θ)) }
where f is convex and smooth, and g is convex, but not smooth...
1. Focus on f : descent lemma, ∀θ, θ′,
f(θ) ≤ f(θ′) + ⟨∇f(θ′), θ − θ′⟩ + (t/2)‖θ − θ′‖_2^2
Consider a gradient descent sequence θ_k, i.e. θ_{k+1} = θ_k − t^{-1}∇f(θ_k); then
f(θ) ≤ φ(θ) := f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (t/2)‖θ − θ_k‖_2^2,   with θ_{k+1} = argmin{φ(θ)}
@freakonometrics 55
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
2. Add the function g:
f(θ) + g(θ) ≤ ψ(θ) := f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (t/2)‖θ − θ_k‖_2^2 + g(θ)
And one can prove that
θ_{k+1} = argmin_{θ∈R^d} { ψ(θ) } = proximal_{g/t}( θ_k − t^{-1}∇f(θ_k) )
the so-called proximal gradient descent algorithm, since
argmin {ψ(θ)} = argmin { (t/2)‖θ − (θ_k − t^{-1}∇f(θ_k))‖_2^2 + g(θ) }
@freakonometrics 56
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
Consider some convex differentiable function f : R^k → R.
Consider x⋆ ∈ R^k obtained by minimizing along each coordinate axis, i.e.
f(x_1^⋆, · · · , x_{i−1}^⋆, x_i, x_{i+1}^⋆, · · · , x_k^⋆) ≥ f(x_1^⋆, · · · , x_{i−1}^⋆, x_i^⋆, x_{i+1}^⋆, · · · , x_k^⋆)
for all i. Is x⋆ a global minimizer? i.e.
f(x) ≥ f(x⋆), ∀x ∈ R^k.
Yes, if f is convex and differentiable:
∇f(x)|_{x=x⋆} = ( ∂f(x)/∂x_1, · · · , ∂f(x)/∂x_k ) = 0
There might be problems if f is not differentiable (except in each axis direction).
If f(x) = g(x) + Σ_{i=1}^k h_i(x_i) with g convex and differentiable, yes, since
f(x) − f(x⋆) ≥ ∇g(x⋆)^T(x − x⋆) + Σ_i [h_i(x_i) − h_i(x_i^⋆)]
@freakonometrics 57
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
f(x) − f(x⋆) ≥ Σ_i [ ∇_i g(x⋆)^T(x_i − x_i^⋆) + h_i(x_i) − h_i(x_i^⋆) ] ≥ 0
Thus, for functions f(x) = g(x) + Σ_{i=1}^k h_i(x_i) we can use coordinate descent to
find a minimizer, i.e. at step j
x_1^{(j)} ∈ argmin_{x_1} f(x_1, x_2^{(j−1)}, x_3^{(j−1)}, · · · , x_k^{(j−1)})
x_2^{(j)} ∈ argmin_{x_2} f(x_1^{(j)}, x_2, x_3^{(j−1)}, · · · , x_k^{(j−1)})
x_3^{(j)} ∈ argmin_{x_3} f(x_1^{(j)}, x_2^{(j)}, x_3, · · · , x_k^{(j−1)})
Tseng (2001) Convergence of Block Coordinate Descent Method: if f is continuous,
then x^∞ is a minimizer of f.
@freakonometrics 58
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application in Linear Regression
Let f(x) = (1/2)‖y − Ax‖^2, with y ∈ R^n and A ∈ M_{n×k}. Let A = [A_1, · · · , A_k].
Let us minimize in direction i. Let x_{−i} denote the vector in R^{k−1} without x_i.
Here
0 = ∂f(x)/∂x_i = A_i^T[Ax − y] = A_i^T[A_i x_i + A_{−i}x_{−i} − y]
thus, the optimal value is here
x_i^⋆ = A_i^T[ y − A_{−i}x_{−i} ] / (A_i^T A_i)
@freakonometrics 59
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to LASSO
Let f(x) = (1/2)‖y − Ax‖^2 + λ‖x‖_1, so that the non-differentiable part is
separable, since ‖x‖_1 = Σ_{i=1}^k |x_i|.
Let us minimize in direction i. Let x_{−i} denote the vector in R^{k−1} without x_i.
Here
0 = ∂f(x)/∂x_i = A_i^T[A_i x_i + A_{−i}x_{−i} − y] + λs_i
where s_i ∈ ∂|x_i|. Thus, the solution is obtained by soft-thresholding
x_i^⋆ = S_{λ/‖A_i‖^2}( A_i^T[ y − A_{−i}x_{−i} ] / (A_i^T A_i) )
@freakonometrics 60
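Putting the two previous slides together, here is a minimal R sketch of cyclic coordinate descent for the LASSO (a fixed number of sweeps is used instead of a convergence test, which is an implementation shortcut):

# cyclic coordinate descent for (1/2)||y - A x||^2 + lambda ||x||_1
lasso_cd <- function(A, y, lambda, n_sweeps = 100) {
  k <- ncol(A); x <- rep(0, k)
  soft <- function(b, t) sign(b) * pmax(abs(b) - t, 0)
  for (j in 1:n_sweeps) {
    for (i in 1:k) {
      r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]        # partial residual
      x[i] <- soft(sum(A[, i] * r_i), lambda) / sum(A[, i]^2)
    }
  }
  x
}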
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Convergence rate for LASSO
Let f(x) = g(x) + λ‖x‖_1 with
• g convex, ∇g Lipschitz with constant L > 0, and Id − ∇g/L monotone
increasing in each component
• there exists z such that, componentwise, either z ≥ S_λ(z − ∇g(z)) or
z ≤ S_λ(z − ∇g(z))
Saha & Tewari (2010), On the finite time convergence of cyclic coordinate descent
methods, proved that a coordinate descent starting from z satisfies
f(x^{(j)}) − f(x⋆) ≤ L‖z − x⋆‖^2 / (2j)
@freakonometrics 61
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Graphical Lasso and Covariance Estimation
We want to estimate an (unknown) covariance matrix Σ, or Σ^{-1}.
An estimate for Σ^{-1} is Θ̂, solution of
Θ̂ ∈ argmin_{Θ∈M_{k×k}} { − log[det(Θ)] + trace[SΘ] + λ‖Θ‖_1 }   where S = X^T X / n
and where ‖Θ‖_1 = Σ |Θ_{i,j}|.
See van Wieringen (2016) Undirected network reconstruction from high-dimensional
data and https://github.com/kaizhang/glasso
@freakonometrics 62
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Network Simplification
Can be applied to networks, to spot ‘significant’ connections...
Source: http://khughitt.github.io/graphical-lasso/
@freakonometrics 63
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Extension of Penalization Techniques
In a more general context, we want to solve
θ̂ ∈ argmin_{θ∈R^d} { (1/n) Σ_{i=1}^n ℓ(y_i, m_θ(x_i)) + λ · penalty(θ) }.
@freakonometrics 64
More Related Content

PDF
Slides ineq-3b
PDF
Classification
PDF
Slides ACTINFO 2016
PDF
Slides ensae-2016-11
PDF
Slides edf-2
PDF
Graduate Econometrics Course, part 4, 2017
PDF
Slides ineq-2
PDF
Slides ensae 8
Slides ineq-3b
Classification
Slides ACTINFO 2016
Slides ensae-2016-11
Slides edf-2
Graduate Econometrics Course, part 4, 2017
Slides ineq-2
Slides ensae 8

What's hot (20)

PDF
Slides ensae 11bis
PDF
Inequalities #3
PDF
Inequality, slides #2
PDF
Slides ensae 9
PDF
Slides erasmus
PDF
Multiattribute utility copula
PDF
Proba stats-r1-2017
PDF
Slides ineq-4
PDF
Quantile and Expectile Regression
PDF
Inequality #4
PDF
Slides erm-cea-ia
PDF
Slides barcelona Machine Learning
PDF
Slides amsterdam-2013
PDF
Slides econometrics-2018-graduate-2
PDF
Slides toulouse
PDF
Inequalities #2
PDF
Slides risk-rennes
PDF
Slides simplexe
PDF
Slides Bank England
PDF
Slides astin
Slides ensae 11bis
Inequalities #3
Inequality, slides #2
Slides ensae 9
Slides erasmus
Multiattribute utility copula
Proba stats-r1-2017
Slides ineq-4
Quantile and Expectile Regression
Inequality #4
Slides erm-cea-ia
Slides barcelona Machine Learning
Slides amsterdam-2013
Slides econometrics-2018-graduate-2
Slides toulouse
Inequalities #2
Slides risk-rennes
Slides simplexe
Slides Bank England
Slides astin
Ad

Viewers also liked (20)

PDF
Slides econometrics-2017-graduate-2
PDF
Econometrics, PhD Course, #1 Nonlinearities
PDF
Soutenance julie viard_partie_1
PDF
Lg ph d_slides_vfinal
PDF
Slides act6420-e2014-ts-1
PDF
Slides inequality 2017
PDF
So a webinar-2013-2
PDF
Slides arbres-ubuntu
PDF
Julien slides - séminaire quantact
PDF
Slides 2040-6
PDF
Slides 2040-4
PDF
Slides guanauato
PDF
Slides ensae-2016-4
PDF
Slides ensae-2016-8
PDF
Slides ensae-2016-9
PDF
Slides ensae-2016-5
PDF
Slides ensae-2016-7
PDF
Slides ensae-2016-6
PDF
Slides ensae-2016-10
Slides econometrics-2017-graduate-2
Econometrics, PhD Course, #1 Nonlinearities
Soutenance julie viard_partie_1
Lg ph d_slides_vfinal
Slides act6420-e2014-ts-1
Slides inequality 2017
So a webinar-2013-2
Slides arbres-ubuntu
Julien slides - séminaire quantact
Slides 2040-6
Slides 2040-4
Slides guanauato
Slides ensae-2016-4
Slides ensae-2016-8
Slides ensae-2016-9
Slides ensae-2016-5
Slides ensae-2016-7
Slides ensae-2016-6
Slides ensae-2016-10
Ad

Similar to Econometrics 2017-graduate-3 (20)

PDF
Slides econometrics-2018-graduate-3
PDF
Varese italie #2
PDF
Varese italie seminar
PDF
Lausanne 2019 #1
PDF
Estimation rs
PDF
Varese italie #2
PDF
ch02ans.pdf The Simple Linear Regression Model: Specification and Estimation
PDF
199ae1e6bc77d0ed5efc0cd2d83cc532_econometrics.pdf
PDF
Econometric Analysis 8th Edition Greene Solutions Manual
PPTX
Topic 1.4
PDF
Side 2019 #3
PPTX
Introduction to Econometrics
PDF
Side 2019 #9
PDF
econometrics
PDF
Slides econometrics-2018-graduate-1
PDF
PanelDadasdsadadsadasdasdasdataNotes-1b.pdf
PDF
A basic introduction to learning
PPTX
Static Models of Continuous Variables
PPTX
Linear regression, costs & gradient descent
PDF
Detection & Estimation Theory
Slides econometrics-2018-graduate-3
Varese italie #2
Varese italie seminar
Lausanne 2019 #1
Estimation rs
Varese italie #2
ch02ans.pdf The Simple Linear Regression Model: Specification and Estimation
199ae1e6bc77d0ed5efc0cd2d83cc532_econometrics.pdf
Econometric Analysis 8th Edition Greene Solutions Manual
Topic 1.4
Side 2019 #3
Introduction to Econometrics
Side 2019 #9
econometrics
Slides econometrics-2018-graduate-1
PanelDadasdsadadsadasdasdasdataNotes-1b.pdf
A basic introduction to learning
Static Models of Continuous Variables
Linear regression, costs & gradient descent
Detection & Estimation Theory

More from Arthur Charpentier (20)

PDF
Family History and Life Insurance
PDF
ACT6100 introduction
PDF
Family History and Life Insurance (UConn actuarial seminar)
PDF
Control epidemics
PDF
STT5100 Automne 2020, introduction
PDF
Family History and Life Insurance
PDF
Machine Learning in Actuarial Science & Insurance
PDF
Reinforcement Learning in Economics and Finance
PDF
Optimal Control and COVID-19
PDF
Slides OICA 2020
PDF
Lausanne 2019 #3
PDF
Lausanne 2019 #4
PDF
Lausanne 2019 #2
PDF
Side 2019 #10
PDF
Side 2019 #11
PDF
Side 2019 #12
PDF
Side 2019 #8
PDF
Side 2019 #7
PDF
Side 2019 #6
PDF
Side 2019 #5
Family History and Life Insurance
ACT6100 introduction
Family History and Life Insurance (UConn actuarial seminar)
Control epidemics
STT5100 Automne 2020, introduction
Family History and Life Insurance
Machine Learning in Actuarial Science & Insurance
Reinforcement Learning in Economics and Finance
Optimal Control and COVID-19
Slides OICA 2020
Lausanne 2019 #3
Lausanne 2019 #4
Lausanne 2019 #2
Side 2019 #10
Side 2019 #11
Side 2019 #12
Side 2019 #8
Side 2019 #7
Side 2019 #6
Side 2019 #5

Recently uploaded (20)

PPTX
Cell Types and Its function , kingdom of life
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Classroom Observation Tools for Teachers
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PPTX
Cell Structure & Organelles in detailed.
PDF
Business Ethics Teaching Materials for college
PDF
O7-L3 Supply Chain Operations - ICLT Program
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Complications of Minimal Access Surgery at WLH
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PDF
TR - Agricultural Crops Production NC III.pdf
PDF
Basic Mud Logging Guide for educational purpose
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
Pre independence Education in Inndia.pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
Cell Types and Its function , kingdom of life
STATICS OF THE RIGID BODIES Hibbelers.pdf
VCE English Exam - Section C Student Revision Booklet
Classroom Observation Tools for Teachers
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Cell Structure & Organelles in detailed.
Business Ethics Teaching Materials for college
O7-L3 Supply Chain Operations - ICLT Program
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
Supply Chain Operations Speaking Notes -ICLT Program
Complications of Minimal Access Surgery at WLH
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
TR - Agricultural Crops Production NC III.pdf
Basic Mud Logging Guide for educational purpose
102 student loan defaulters named and shamed – Is someone you know on the list?
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
Pre independence Education in Inndia.pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra

Econometrics 2017-graduate-3

  • 1. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Advanced Econometrics #3: Model & Variable Selection* A. Charpentier (Université de Rennes 1) Université de Rennes 1, Graduate Course, 2017. @freakonometrics 1
  • 2. Arthur CHARPENTIER, Advanced Econometrics Graduate Course “Great plot. Now need to find the theory that explains it” Deville (2017) http://twitter.com @freakonometrics 2
  • 3. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preliminary Results: Numerical Optimization Problem : x ∈ argmin{f(x); x ∈ Rd } Gradient descent : xk+1 = xk − η f(xk) starting from some x0 Problem : x ∈ argmin{f(x); x ∈ X ⊂ Rd } Projected descent : xk+1 = ΠX xk − η f(xk) starting from some x0 A constrained problem is said to be convex if    min{f(x)} with f convex s.t. gi(x) = 0, ∀i = 1, · · · , n with gi linear hi(x) ≤ 0, ∀i = 1, · · · , m with hi convex Lagrangian : L(x, λ, µ) = f(x) + n i=1 λigi(x) + m i=1 µihi(x) where x are primal variables and (λ, µ) are dual variables. Remark L is an affine function in (λ, µ) @freakonometrics 3
  • 4. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preliminary Results: Numerical Optimization Karush–Kuhn–Tucker conditions : a convex problem has a solution x if and only if there are (λ , µ ) such that the following condition hold • stationarity : xL(x, λ, µ) = 0 at (x , λ , µ ) • primal admissibility : gi(x ) = 0 and hi(x ) ≤ 0, ∀i • dual admissibility : µ ≥ 0 Let L denote the associated dual function L(λ, µ) = min x {L(x, λ, µ)} L is a convex function in (λ, µ) and the dual problem is max λ,µ {L(λ, µ)}. @freakonometrics 4
  • 5. Arthur CHARPENTIER, Advanced Econometrics Graduate Course References Motivation Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip: Identifying Central Individuals in a Social Networks. References Belloni, A. & Chernozhukov, V. 2009. Least squares after model selection in high-dimensional sparse models. Hastie, T., Tibshirani, R. & Wainwright, M. 2015 Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press. @freakonometrics 5
  • 6. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preambule Assume that y = m(x) + ε, where ε is some idosyncatic impredictible noise. The error E[(y − m(x))2 ] is the sume of three terms • variance of the estimator : E[(y − m(x))2 ] • bias2 of the estimator : [m(x − m(x)]2 • variance of the noise : E[(y − m(x))2 ] (the latter exists, even with a ‘perfect’ model). @freakonometrics 6
  • 7. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preambule Consider a parametric model, with true (unkown) parameter θ, then mse(ˆθ) = E (ˆθ − θ)2 = E (ˆθ − E ˆθ )2 variance + E (E ˆθ − θ)2 bias2 Let θ denote an unbiased estimator of θ. Then ˆθ = θ2 θ2 + mse(θ) · θ = θ − mse(θ) θ2 + mse(θ) · θ penalty satisfies mse(ˆθ) ≤ mse(θ). @freakonometrics 7 −2 −1 0 1 2 3 4 0.00.20.40.60.8 variance
  • 8. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Occam’s Razor The “law of parsimony”, “lex parsimoniæ” Penalize too complex models @freakonometrics 8
  • 9. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Let X ∼ N(µ, σ2 I). We want to estimate µ. µmle = Xn ∼ N µ, σ2 n I . From James & Stein (1961) Estimation with quadratic loss µJS = 1 − (d − 2)σ2 n y 2 y where · is the Euclidean norm. One can prove that if d ≥ 3, E µJS − µ 2 < E µmle − µ 2 Samworth (2015) Stein’s paradox, “one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne”. @freakonometrics 9
  • 10. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Heuristics : consider a biased estimator, to decrease the variance. See Efron (2010) Large-Scale Inference @freakonometrics 10
  • 11. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Motivation: Avoiding Overfit Generalization : the model should perform well on new data (and not only on the training ones). q q q q q q q q q q q q q 2 4 6 8 10 12 0.00.20.40.6 q q q q q q q q q q q q q @freakonometrics 11 q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 −1.5−1.0−0.50.00.51.01.5 q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 −1.5−1.0−0.50.00.51.01.5 q q q q q q q q q qq
  • 12. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Reducing Dimension with PCA Use principal components to reduce dimension (on centered and scaled variables): we want d vectors z1, · · · , zd such that First Compoment is z1 = Xω1 where ω1 = argmax ω =1 X · ω 2 = argmax ω =1 ωT XT Xω Second Compoment is z2 = Xω2 where ω2 = argmax ω =1 X (1) · ω 2 0 20 40 60 80 −8−6−4−2 Age LogMortalityRate −10 −5 0 5 10 15 −101234 PC score 1 PCscore2 q q qq q q qqq q qqq qq q qqq q q q qq q q q q q q qqq q q q q q q qq q qq q q qq q q q q q q qq q qqq q q q q q q q q q q q q q q qq q q qqq q qqq qq q qqq q q q q q q q q q q q q q q q q q q q qq q q q q qq q q qq q q qqqq q q q q q q q q q qqq q qqq qq q qq q q q q qq q q q q q q q q q q q q q 1914 1915 1916 1917 1918 1919 1940 1942 1943 1944 0 20 40 60 80 −10−8−6−4−2 Age LogMortalityRate −10 −5 0 5 10 15 −10123 PC score 1 PCscore2 qq q q q q q qq qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q q q q q q q q q q q q q q q q q qq q q q q q q qq q q q q q q q q q qq q q q q q q q q q q qq qq q q q q q qq qqqq q qqqq q q q q q q with X (1) = X − Xω1 z1 ωT 1 . @freakonometrics 12
  • 13. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Reducing Dimension with PCA A regression on (the d) principal components, y = zT β + η could be an interesting idea, unfortunatley, principal components have no reason to be correlated with y. First compoment was z1 = Xω1 where ω1 = argmax ω =1 X · ω 2 = argmax ω =1 ωT XT Xω It is a non-supervised technique. Instead, use partial least squares, introduced in Wold (1966) Estimation of Principal Components and Related Models by Iterative Least squares. First compoment is z1 = Xω1 where ω1 = argmax ω =1 { y, X · ω } = argmax ω =1 ωT XT yyT Xω @freakonometrics 13
  • 14. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Terminology Consider a dataset {yi, xi}, assumed to be generated from Y, X, from an unknown distribution P. Let m0(·) be the “true” model. Assume that yi = m0(xi) + εi. In a regression context (quadratic loss function function), the risk associated to m is R(m) = EP Y − m(X) 2 An optimal model m within a class M satisfies R(m ) = inf m∈M R(m) Such a model m is usually called oracle. Observe that m (x) = E[Y |X = x] is the solution of R(m ) = inf m∈M R(m) where M is the set of measurable functions @freakonometrics 14
  • 15. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The empirical risk is Rn(m) = 1 n n i=1 yi − m(xi) 2 For instance, m can be a linear predictor, m(x) = β0 + xT β, where θ = (β0, β) should estimated (trained). E Rn(m) = E (m(X) − Y )2 can be expressed as E (m(X) − E[m(X)|X])2 variance of m + E E[m(X)|X] − E[Y |X] m0(X) 2 bias of m + E Y − E[Y |X] m0(X) )2 variance of the noise The third term is the risk of the “optimal” estimator m, that cannot be decreased. @freakonometrics 15
  • 16. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity Consider a linear predictor (see #1), i.e. y = m(x) = Ay. Assume that y = m0(x) + ε, with E[ε] = 0 and Var[ε] = σ2 I. Let · denote the Euclidean norm Empirical risk : Rn(m) = 1 n y − m(x) 2 Vapnik’s risk : E[Rn(m)] = 1 n m0(x − m(x) 2 + 1 n E y − m0(x 2 with m0(x = E[Y |X = x]. Observe that nE Rn(m) = E y − m(x) 2 = (I − A)m0 2 + σ2 I − A 2 while = E m0(x) − m(x) 2 = 2 (I − A)m0 bias + σ2 A 2 variance @freakonometrics 16
  • 17. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity One can obtain E Rn(m) = E Rn(m) + 2 σ2 n trace(A). If trace(A) ≥ 0 the empirical risk underestimate the true risk of the estimator. The number of degrees of freedom of the (linear) predictor is related to trace(A) 2 σ2 n trace(A) is called Mallow’s penalty CL. If A is a projection matrix, trace(A) is the dimension of the projection space, p, then we obtain Mallow’s CP , 2 σ2 n p. Remark : Mallows (1973) Some Comments on Cp introduced this penalty while focusing on the R2 . @freakonometrics 17
  • 18. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood CP is associated to a quadratic risk an alternative is to use a distance on the (conditional) distribution of Y , namely Kullback-Leibler distance discrete case: DKL(P Q) = i P(i) log P(i) Q(i) continuous case : DKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dxDKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dx Let f denote the true (unknown) density, and fθ some parametric distribution, DKL(f fθ) = ∞ −∞ f(x) log f(x) fθ(x) dx= f(x) log[f(x)] dx− f(x) log[fθ(x)] dx relative information Hence minimize {DKL(f fθ)} ←→ maximize E log[fθ(X)] @freakonometrics 18
  • 19. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood Akaike (1974) A new look at the statistical model identification observe that for n large enough E log[fθ(X)] ∼ log[L(θ)] − dim(θ) Thus AIC = −2 log L(θ) + 2dim(θ) Example : in a (Gaussian) linear model, yi = β0 + xT i β + εi AIC = n log 1 n n i=1 εi + 2[dim(β) + 2] @freakonometrics 19
  • 20. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood Remark : this is valid for large sample (rule of thumb n/dim(θ) > 40), otherwise use a corrected AIC AICc = AIC + 2k(k + 1) n − k − 1 bias correction where k = dim(θ) see Sugiura (1978) Further analysis of the data by Akaike’s information criterion and the finite corrections second order AIC. Using a Bayesian interpretation, Schwarz (1978) Estimating the dimension of a model obtained BIC = −2 log L(θ) + log(n)dim(θ). Observe that the criteria considered is criteria = −function L(θ) + penality complexity @freakonometrics 20
  • 21. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Estimation of the Risk Consider a naive bootstrap procedure, based on a bootstrap sample Sb = {(y (b) i , x (b) i )}. The plug-in estimator of the empirical risk is Rn(m(b) ) = 1 n n i=1 yi − m(b) (xi) 2 and then Rn = 1 B B b=1 Rn(m(b) ) = 1 B B b=1 1 n n i=1 yi − m(b) (xi) 2 @freakonometrics 21
  • 22. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Estimation of the Risk One might improve this estimate using a out-of-bag procedure Rn = 1 n n i=1 1 #Bi b∈Bi yi − m(b) (xi) 2 where Bi is the set of all boostrap sample that contain (yi, xi). Remark: P ((yi, xi) /∈ Sb) = 1 − 1 n n ∼ e−1 = 36, 78%. @freakonometrics 22
  • 23. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Linear Regression Shortcoming Least Squares Estimator β = (XT X)−1 XT y Unbiased Estimator E[β] = β Variance Var[β] = σ2 (XT X)−1 which can be (extremely) large when det[(XT X)] ∼ 0. X =        1 −1 2 1 0 1 1 2 −1 1 1 0        then XT X =     4 2 2 2 6 −4 2 −4 6     while XT X+I =     5 2 2 2 7 −4 2 −4 7     eigenvalues : {10, 6, 0} {11, 7, 1} Ad-hoc strategy: use XT X + λI @freakonometrics 23
  • 24. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Linear Regression Shortcoming Evolution of (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 when cor(X1, X2) = r ∈ [0, 1], on top. Below, Ridge regression (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 +λ(β2 1 + β2 2) where λ ∈ [0, ∞), below, when cor(X1, X2) ∼ 1 (colinearity). @freakonometrics 24 −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 500 1000 1500 2000 2000 2500 2500 2500 2500 3000 3000 3000 3000 3500 q −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 1000 1000 2000 2000 3000 3000 4000 4000 5000 5000 6000 6000 7000
  • 25. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Normalization : Euclidean 2 vs. Mahalonobis We want to penalize complicated models : if βk is “too small”, we prefer to have βk = 0. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 Instead of d(x, y) = (x − y)T (x − y) use dΣ(x, y) = (x − y)TΣ−1 (x − y) beta1 beta2 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 25
  • 26. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression ... like the least square, but it shrinks estimated coefficients towards 0. β ridge λ = argmin    n i=1 (yi − xT i β)2 + λ p j=1 β2 j    β ridge λ = argmin    y − Xβ 2 2 =criteria + λ β 2 2 =penalty    λ ≥ 0 is a tuning parameter. The constant is usually unpenalized. The true equation is β ridge λ = argmin    y − (β0 + Xβ) 2 2 =criteria + λ β 2 2 =penalty    @freakonometrics 26
  • 27. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression β ridge λ = argmin y − (β0 + Xβ) 2 2 + λ β 2 2 can be seen as a constrained optimization problem β ridge λ = argmin β 2 2 ≤hλ y − (β0 + Xβ) 2 2 Explicit solution βλ = (XT X + λI)−1 XT y If λ → 0, β ridge 0 = β ols If λ → ∞, β ridge ∞ = 0. beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 27
  • 28. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression This penalty can be seen as rather unfair if compo- nents of x are not expressed on the same scale • center: xj = 0, then β0 = y • scale: xT j xj = 1 Then compute β ridge λ = argmin    y − Xβ 2 2 =loss + λ β 2 2 =penalty    beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 40 40 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 28
  • 29. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression Observe that if xj1 ⊥ xj2 , then β ridge λ = [1 + λ]−1 β ols λ which explain relationship with shrinkage. But generally, it is not the case... q q Theorem There exists λ such that mse[β ridge λ ] ≤ mse[β ols λ ] Ridge Regression @freakonometrics 29
  • 30. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Lλ(β) = n i=1 (yi − β0 − xT i β)2 + λ p j=1 β2 j ∂Lλ(β) ∂β = −2XT y + 2(XT X + λI)β ∂2 Lλ(β) ∂β∂βT = 2(XT X + λI) where XT X is a semi-positive definite matrix, and λI is a positive definite matrix, and βλ = (XT X + λI)−1 XT y @freakonometrics 30
  • 31. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The Bayesian Interpretation From a Bayesian perspective, P[θ|y] posterior ∝ P[y|θ] likelihood · P[θ] prior i.e. log P[θ|y] = log P[y|θ] log likelihood + log P[θ] penalty If β has a prior N(0, τ2 I) distribution, then its posterior distribution has mean E[β|y, X] = XT X + σ2 τ2 I −1 XT y. @freakonometrics 31
  • 32. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator βλ = (XT X + λI)−1 XT y E[βλ] = XT X(λI + XT X)−1 β. i.e. E[βλ] = β. Observe that E[βλ] → 0 as λ → ∞. Assume that X is an orthogonal design matrix, i.e. XT X = I, then βλ = (1 + λ)−1 β ols . @freakonometrics 32
  • 33. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator Set W λ = (I + λ[XT X]−1 )−1 . One can prove that W λβ ols = βλ. Thus, Var[βλ] = W λVar[β ols ]W T λ and Var[βλ] = σ2 (XT X + λI)−1 XT X[(XT X + λI)−1 ]T . Observe that Var[β ols ] − Var[βλ] = σ2 W λ[2λ(XT X)−2 + λ2 (XT X)−3 ]W T λ ≥ 0. @freakonometrics 33
  • 34. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator Hence, the confidence ellipsoid of ridge estimator is indeed smaller than the OLS, If X is an orthogonal design matrix, Var[βλ] = σ2 (1 + λ)−2 I. mse[βλ] = σ2 trace(W λ(XT X)−1 W T λ) + βT (W λ − I)T (W λ − I)β. If X is an orthogonal design matrix, mse[βλ] = pσ2 (1 + λ)2 + λ2 (1 + λ)2 βT β Properties of the Ridge Estimator @freakonometrics 34 0.0 0.2 0.4 0.6 0.8 −1.0−0.8−0.6−0.4−0.2 β1 β2 1 2 3 4 5 6 7
  • 35. Arthur CHARPENTIER, Advanced Econometrics Graduate Course mse[βλ] = pσ2 (1 + λ)2 + λ2 (1 + λ)2 βT β is minimal for λ = pσ2 βT β Note that there exists λ > 0 such that mse[βλ] < mse[β0] = mse[β ols ]. @freakonometrics 35
  • 36. Arthur CHARPENTIER, Advanced Econometrics Graduate Course SVD decomposition Consider the singular value decomposition X = UDV T . Then β ols = V D−2 D UT y βλ = V (D2 + λI)−1 D UT y Observe that D−1 i,i ≥ Di,i D2 i,i + λ hence, the ridge penality shrinks singular values. Set now R = UD (n × n matrix), so that X = RV T , βλ = V (RT R + λI)−1 RT y @freakonometrics 36
  • 37. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Hat matrix and Degrees of Freedom Recall that Y = HY with H = X(XT X)−1 XT Similarly Hλ = X(XT X + λI)−1 XT trace[Hλ] = p j=1 d2 j,j d2 j,j + λ → 0, as λ → ∞. @freakonometrics 37
  • 38. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Sparsity Issues In severall applications, k can be (very) large, but a lot of features are just noise: βj = 0 for many j’s. Let s denote the number of relevent features, with s << k, cf Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity, s = card{S} where S = {j; βj = 0} The model is now y = XT SβS + ε, where XT SXS is a full rank matrix. @freakonometrics 38 q q = . +
  • 39. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues The Ridge regression problem was to solve β = argmin β∈{ β 2 ≤s} { Y − XT β 2 2 } Define a 0 = 1(|ai| > 0). Here dim(β) = k but β 0 = s. We wish we could solve β = argmin β∈{ β 0 =s} { Y − XT β 2 2 } Problem: it is usually not possible to describe all possible constraints, since s k coefficients should be chosen here (with k (very) large). @freakonometrics 39 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0
  • 40. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues In a convex problem, solve the dual problem, e.g. in the Ridge regression : primal problem min β∈{ β 2 ≤s} { Y − XT β 2 2 } and the dual problem min β∈{ Y −XTβ 2 ≤t} { β 2 2 } beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 26 27 30 32 35 40 40 50 60 70 80 90 100 110 120 120 130 130 140 140 X q −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 26 27 30 32 35 40 40 50 60 70 80 90 100 110 120 120 130 130 140 140 X q −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 40
  • 41. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues Idea: solve the dual problem β = argmin β∈{ Y −XTβ 2 ≤h} { β 0 } where we might convexify the 0 norm, · 0 . @freakonometrics 41
  • 42. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues On [−1, +1]k , the convex hull of β 0 is β 1 On [−a, +a]k , the convex hull of β 0 is a−1 β 1 Hence, why not solve β = argmin β; β 1 ≤˜s { Y − XT β 2 } which is equivalent (Kuhn-Tucker theorem) to the Lagragian optimization problem β = argmin{ Y − XT β 2 2 +λ β 1 } @freakonometrics 42
  • 43. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Least Absolute Shrinkage and Selection Operator β ∈ argmin{ Y − XT β 2 2 +λ β 1 } is a convex problem (several algorithms ), but not strictly convex (no unicity of the minimum). Nevertheless, predictions y = xT β are unique. MM, minimize majorization, coordinate descent Hunter & Lange (2003) A Tutorial on MM Algorithms. @freakonometrics 43
  • 44. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Regression No explicit solution... If λ → 0, β lasso 0 = β ols If λ → ∞, β lasso ∞ = 0. beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 44
  • 45. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Regression For some λ, there are k’s such that β lasso k,λ = 0. Further, λ → β lasso k,λ is piecewise linear beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 40 40 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 45
  • 46. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
In the orthogonal case, $X^TX = \mathbb{I}$,
$\widehat{\beta}_{k,\lambda}^{\text{lasso}} = \text{sign}(\widehat{\beta}_k^{\text{ols}})\left(|\widehat{\beta}_k^{\text{ols}}| - \frac{\lambda}{2}\right)_+$
i.e. the LASSO estimate is related to the soft-threshold function...
@freakonometrics 46
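A minimal R sketch of this relation, on a simulated orthonormal design (the data, the value of λ and the variable names are only for illustration; the λ/2 threshold follows the parametrization of the objective used on the slides):
set.seed(1)
n <- 100; k <- 4
X <- qr.Q(qr(matrix(rnorm(n * k), n, k)))     # orthonormal columns, so X'X = I
beta <- c(2, -1, .2, 0)
y <- X %*% beta + rnorm(n, sd = .5)
b_ols <- drop(crossprod(X, y))                # OLS estimate, since X'X = I
lambda <- 1
b_lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)   # soft-thresholding
cbind(ols = b_ols, lasso = b_lasso)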
  • 47. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimal LASSO Penalty
Use cross validation, e.g. K-fold: fit on all folds but the k-th,
$\widehat{\beta}_{(-k)}(\lambda) = \text{argmin}\Big\{\sum_{i \notin I_k} [y_i - x_i^T\beta]^2 + \lambda\|\beta\|_1\Big\}$
then compute the sum of the squared errors on the held-out fold,
$Q_k(\lambda) = \sum_{i \in I_k} [y_i - x_i^T\widehat{\beta}_{(-k)}(\lambda)]^2$
and finally solve
$\lambda^\star = \text{argmin}\Big\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\Big\}$
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements of Statistical Learning suggest the largest λ such that
$\overline{Q}(\lambda) \leq \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star]$ with $\text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K [Q_k(\lambda) - \overline{Q}(\lambda)]^2$
@freakonometrics 47
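In R, cv.glmnet automates this K-fold procedure: lambda.min is the minimizer of the cross-validated error and lambda.1se implements the one-standard-error rule above. A sketch on simulated data (the design and the sparse signal are only for illustration; note that glmnet scales the squared loss by 1/(2n), so its λ is not on exactly the same scale as on the slide):
library(glmnet)
set.seed(1)
n <- 200; k <- 10
X <- matrix(rnorm(n * k), n, k)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)   # alpha = 1 : LASSO penalty
cvfit$lambda.min    # lambda minimizing the cross-validated error
cvfit$lambda.1se    # largest lambda within one standard error of the minimum
plot(cvfit)         # cross-validated error with +/- one standard error bands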
  • 48. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO and Ridge, with R
> library(glmnet)
> chicago = read.table("http://freakonometrics.free.fr/chicago.txt", header=TRUE, sep=";")
> standardize <- function(x) {(x - mean(x)) / sd(x)}
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge <- glmnet(cbind(z1, z2), z0, alpha = 0, intercept = FALSE, lambda = 1)
> lasso <- glmnet(cbind(z1, z2), z0, alpha = 1, intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net: $\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$
@freakonometrics 48
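As a short follow-up to the listing above, the fitted coefficients can be read off with coef(); with this (arbitrary) λ = 1, the LASSO and elastic net may set some coefficients exactly to zero while the Ridge only shrinks them:
> coef(ridge)      # shrunk, but non-zero, coefficients
> coef(lasso)      # some coefficients set exactly to zero
> coef(elastic)    # a compromise between the two penalties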
  • 49. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression, Smoothing and Overfit
LASSO can be used to avoid overfitting.
[Figure: simulated observations (x, y) with x ∈ [0, 1], used to illustrate smoothing.]
@freakonometrics 49
  • 50. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge vs. LASSO
Consider simulated data (output on the right). With orthogonal variables, the shrinkage operators are as shown below (a small R sketch follows this slide).
[Figure: left, the shrinkage operators β vs. β̂(ridge) and β vs. β̂(lasso); right, coefficient paths against the L1 norm.]
@freakonometrics 50
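The two shrinkage operators in the orthonormal case can be drawn directly; a minimal sketch (the value of λ is arbitrary, and the 1/(1 + λ) Ridge shrinkage matches the proximal-operator convention used a few slides below):
lambda <- 1
b <- seq(0, 5, by = .01)
plot(b, b / (1 + lambda), type = "l", col = "blue",
     xlab = expression(beta), ylab = "shrunk coefficient")   # Ridge: proportional shrinkage
lines(b, sign(b) * pmax(abs(b) - lambda, 0), col = "red")    # LASSO: soft-thresholding
abline(0, 1, lty = 2)                                        # no shrinkage (OLS)
legend("topleft", c("Ridge", "LASSO"), col = c("blue", "red"), lty = 1)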
  • 51. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
First idea: given some initial guess $\beta_{(0)}$,
$|\beta| \sim |\beta_{(0)}| + \frac{1}{2|\beta_{(0)}|}(\beta^2 - \beta_{(0)}^2)$
so the LASSO estimate can be derived from iterated Ridge estimates:
$\|y - X\beta_{(k+1)}\|_2^2 + \lambda\|\beta_{(k+1)}\|_1 \sim \|y - X\beta_{(k+1)}\|_2^2 + \frac{\lambda}{2}\sum_j \frac{[\beta_{j,(k+1)}]^2}{|\beta_{j,(k)}|}$
which is a weighted ridge penalty function. Thus
$\beta_{(k+1)} = \big(X^TX + \lambda\Delta_{(k)}\big)^{-1}X^Ty$, where $\Delta_{(k)} = \text{diag}[|\beta_{j,(k)}|^{-1}]$.
Then $\beta_{(k)} \to \widehat{\beta}^{\text{lasso}}$ as $k \to \infty$.
@freakonometrics 51
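A sketch of this iterated (weighted) Ridge heuristic, following the update above (the eps guard against division by zero and the fixed number of iterations are implementation choices, not part of the slide):
iterated_ridge <- function(X, y, lambda, n_iter = 100, eps = 1e-8) {
  k <- ncol(X)
  beta <- solve(crossprod(X) + lambda * diag(k), crossprod(X, y))   # Ridge as starting point
  for (it in 1:n_iter) {
    Delta <- diag(1 / pmax(abs(drop(beta)), eps), k)                # Delta_(k) = diag(1/|beta_j,(k)|)
    beta <- solve(crossprod(X) + lambda * Delta, crossprod(X, y))
  }
  drop(beta)
}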
  • 52. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the LASSO Estimate
From this iterative technique,
$\widehat{\beta}_\lambda^{\text{lasso}} \sim \big(X^TX + \lambda\Delta\big)^{-1}X^Ty$
where $\Delta = \text{diag}[|\widehat{\beta}_{j,\lambda}^{\text{lasso}}|^{-1}]$ if $\widehat{\beta}_{j,\lambda}^{\text{lasso}} \neq 0$, and 0 otherwise. Thus
$\mathbb{E}[\widehat{\beta}_\lambda^{\text{lasso}}] \sim \big(X^TX + \lambda\Delta\big)^{-1}X^TX\beta$
and
$\text{Var}[\widehat{\beta}_\lambda^{\text{lasso}}] \sim \sigma^2\big(X^TX + \lambda\Delta\big)^{-1}X^TX\big(X^TX + \lambda\Delta\big)^{-1}$
@freakonometrics 52
  • 53. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Consider here a simplified problem,
$\min_{a\in\mathbb{R}}\Big\{\underbrace{\frac{1}{2}(a - b)^2 + \lambda|a|}_{g(a)}\Big\}$ with λ > 0.
Observe that $g'(0^\pm) = -b \pm \lambda$. Then
• if |b| ≤ λ, then $a^\star = 0$
• if b ≥ λ, then $a^\star = b - \lambda$
• if b ≤ −λ, then $a^\star = b + \lambda$
i.e.
$a^\star = \underset{a\in\mathbb{R}}{\text{argmin}}\Big\{\frac{1}{2}(a - b)^2 + \lambda|a|\Big\} = S_\lambda(b) = \text{sign}(b)\cdot(|b| - \lambda)_+$
also called the soft-thresholding operator.
@freakonometrics 53
  • 54. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Definition: for any convex function h, define the proximal operator of h,
$\text{proximal}_h(y) = \underset{x\in\mathbb{R}^d}{\text{argmin}}\Big\{\frac{1}{2}\|x - y\|_2^2 + h(x)\Big\}$
Note that
$\text{proximal}_{\lambda\|\cdot\|_2^2}(y) = \frac{1}{1+\lambda}\,y$, the shrinkage operator, and
$\text{proximal}_{\lambda\|\cdot\|_1}(y) = S_\lambda(y) = \text{sign}(y)\cdot(|y| - \lambda)_+$
@freakonometrics 54
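The two proximal operators of this slide, written directly as R functions (a direct transcription, keeping the same 1/(1 + λ) convention):
prox_l2 <- function(y, lambda) y / (1 + lambda)                    # shrinkage operator
prox_l1 <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)  # soft-thresholding S_lambda
prox_l1(c(-3, -.5, .2, 2), lambda = 1)                             # returns -2, 0, 0, 1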
  • 55. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
We want to solve here
$\widehat{\theta} \in \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\Big\{\underbrace{\frac{1}{n}\|y - m_\theta(x)\|_2^2}_{f(\theta)} + \underbrace{\lambda\,\text{penalty}(\theta)}_{g(\theta)}\Big\}$
where f is convex and smooth, and g is convex, but not smooth...
1. Focus on f: descent lemma, ∀θ, θ′,
$f(\theta) \leq f(\theta') + \langle\nabla f(\theta'), \theta - \theta'\rangle + \frac{t}{2}\|\theta - \theta'\|_2^2$
Consider a gradient descent sequence $\theta_k$, i.e. $\theta_{k+1} = \theta_k - t^{-1}\nabla f(\theta_k)$; then
$f(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k), \theta - \theta_k\rangle + \frac{t}{2}\|\theta - \theta_k\|_2^2}_{\varphi(\theta)}$ with $\theta_{k+1} = \text{argmin}\{\varphi(\theta)\}$
@freakonometrics 55
  • 56. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
2. Add the function g:
$f(\theta) + g(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k), \theta - \theta_k\rangle + \frac{t}{2}\|\theta - \theta_k\|_2^2 + g(\theta)}_{\psi(\theta)}$
And one can prove that
$\theta_{k+1} = \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\{\psi(\theta)\} = \text{proximal}_{g/t}\big(\theta_k - t^{-1}\nabla f(\theta_k)\big)$
the so-called proximal gradient descent algorithm, since
$\text{argmin}\{\psi(\theta)\} = \text{argmin}\Big\{\frac{t}{2}\big\|\theta - \big(\theta_k - t^{-1}\nabla f(\theta_k)\big)\big\|_2^2 + g(\theta)\Big\}$
@freakonometrics 56
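A minimal sketch of this proximal gradient iteration (ISTA) for the LASSO objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$; the step size $t^{-1}$ uses the largest eigenvalue of $X^TX$ as Lipschitz constant, and the fixed number of iterations is only for illustration:
prox_grad_lasso <- function(X, y, lambda, n_iter = 500) {
  t <- max(eigen(crossprod(X), symmetric = TRUE)$values)   # Lipschitz constant of grad f
  beta <- rep(0, ncol(X))
  for (k in 1:n_iter) {
    grad <- drop(crossprod(X, X %*% beta - y))             # gradient of the smooth part f
    z <- beta - grad / t                                   # plain gradient step
    beta <- sign(z) * pmax(abs(z) - lambda / t, 0)         # proximal step, prox_{g/t} = S_{lambda/t}
  }
  beta
}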
  • 57. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
Consider some convex differentiable function $f : \mathbb{R}^k \to \mathbb{R}$. Consider $x^\star \in \mathbb{R}^k$ obtained by minimizing along each coordinate axis, i.e.
$f(x_1^\star, \cdots, x_{i-1}^\star, x_i, x_{i+1}^\star, \cdots, x_k^\star) \geq f(x_1^\star, \cdots, x_{i-1}^\star, x_i^\star, x_{i+1}^\star, \cdots, x_k^\star)$ for all i (and all $x_i$).
Is $x^\star$ a global minimizer, i.e. $f(x) \geq f(x^\star)$, $\forall x \in \mathbb{R}^k$?
Yes, if f is convex and differentiable, since
$\nabla f(x)\big|_{x=x^\star} = \Big(\frac{\partial f(x^\star)}{\partial x_1}, \cdots, \frac{\partial f(x^\star)}{\partial x_k}\Big) = \boldsymbol{0}$
There might be a problem if f is not differentiable (except in each axis direction).
If $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ with g convex and differentiable, the answer is again yes, since
$f(x) - f(x^\star) \geq \nabla g(x^\star)^T(x - x^\star) + \sum_i [h_i(x_i) - h_i(x_i^\star)]$
@freakonometrics 57
  • 58. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
$f(x) - f(x^\star) \geq \sum_i \underbrace{\big[\nabla_i g(x^\star)(x_i - x_i^\star) + h_i(x_i) - h_i(x_i^\star)\big]}_{\geq 0} \geq 0$
Thus, for functions $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ we can use coordinate descent to find a minimizer, i.e. at step j
$x_1^{(j)} \in \underset{x_1}{\text{argmin}}\ f(x_1, x_2^{(j-1)}, x_3^{(j-1)}, \cdots, x_k^{(j-1)})$
$x_2^{(j)} \in \underset{x_2}{\text{argmin}}\ f(x_1^{(j)}, x_2, x_3^{(j-1)}, \cdots, x_k^{(j-1)})$
$x_3^{(j)} \in \underset{x_3}{\text{argmin}}\ f(x_1^{(j)}, x_2^{(j)}, x_3, \cdots, x_k^{(j-1)})$
and so on. Tseng (2001) Convergence of a Block Coordinate Descent Method: if f is continuous, then $x^\infty$ is a minimizer of f.
@freakonometrics 58
  • 59. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Linear Regression
Let $f(x) = \frac{1}{2}\|y - Ax\|^2$, with $y\in\mathbb{R}^n$ and $A\in\mathcal{M}_{n\times k}$. Let $A = [A_1, \cdots, A_k]$. Let us minimize in direction i, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$0 = \frac{\partial f(x)}{\partial x_i} = A_i^T[Ax - y] = A_i^T[A_i x_i + A_{-i}x_{-i} - y]$
thus, the optimal value is
$x_i^\star = \frac{A_i^T[y - A_{-i}x_{-i}]}{A_i^TA_i}$
@freakonometrics 59
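One full sweep of this coordinate-wise update, as a small R sketch (the function and argument names are only for illustration):
coord_sweep_ols <- function(A, y, x) {
  for (i in seq_len(ncol(A))) {
    r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]          # partial residual, direction i left out
    x[i] <- drop(crossprod(A[, i], r_i)) / sum(A[, i]^2)
  }
  x
}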
  • 60. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to LASSO
Let $f(x) = \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1$, so that the non-differentiable part is separable, since $\|x\|_1 = \sum_{i=1}^k |x_i|$. Let us minimize in direction i, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$0 = \frac{\partial f(x)}{\partial x_i} = A_i^T[A_i x_i + A_{-i}x_{-i} - y] + \lambda s_i$
where $s_i \in \partial|x_i|$. Thus, the solution is obtained by soft-thresholding,
$x_i^\star = S_{\lambda/\|A_i\|^2}\left(\frac{A_i^T[y - A_{-i}x_{-i}]}{A_i^TA_i}\right)$
@freakonometrics 60
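Adding the soft-threshold to the previous sweep gives a cyclic coordinate-descent LASSO, sketched below (the number of sweeps is arbitrary; a comparison with glmnet would require matching its 1/(2n) scaling of the squared loss):
cd_lasso <- function(A, y, lambda, n_sweep = 200) {
  x <- rep(0, ncol(A))
  for (s in 1:n_sweep) {
    for (i in seq_len(ncol(A))) {
      r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]          # partial residual
      z    <- drop(crossprod(A[, i], r_i)) / sum(A[, i]^2) # unpenalized coordinate update
      x[i] <- sign(z) * max(abs(z) - lambda / sum(A[, i]^2), 0)  # soft-thresholding
    }
  }
  x
}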
  • 61. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Convergence rate for the LASSO
Let $f(x) = g(x) + \lambda\|x\|_1$ with
• g convex, $\nabla g$ Lipschitz with constant L > 0, and $\text{Id} - \nabla g / L$ monotone increasing in each component
• there exists z such that, componentwise, either $z \geq S_\lambda(z - \nabla g(z))$ or $z \leq S_\lambda(z - \nabla g(z))$
Saha & Tewari (2010) On the Finite Time Convergence of Cyclic Coordinate Descent Methods proved that a coordinate descent sequence started from z satisfies
$f(x^{(j)}) - f(x^\star) \leq \frac{L\|z - x^\star\|^2}{2j}$
@freakonometrics 61
  • 62. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Graphical Lasso and Covariance Estimation
We want to estimate an (unknown) covariance matrix Σ, or its inverse $\Sigma^{-1}$. An estimate of $\Sigma^{-1}$ is the solution $\widehat{\Theta}$ of
$\widehat{\Theta} \in \underset{\Theta\in\mathcal{M}_{k\times k}}{\text{argmin}}\big\{-\log[\det(\Theta)] + \text{trace}[S\Theta] + \lambda\|\Theta\|_1\big\}$
where $S = \frac{X^TX}{n}$ and where $\|\Theta\|_1 = \sum|\Theta_{i,j}|$.
See van Wieringen (2016) Undirected network reconstruction from high-dimensional data and https://github.com/kaizhang/glasso
@freakonometrics 62
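The glasso package provides an implementation of this estimator; a short sketch on simulated data (the penalty rho = .1 and the dimensions are arbitrary):
library(glasso)
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
S <- crossprod(scale(X, scale = FALSE)) / nrow(X)   # S = X'X / n, with centered columns
fit <- glasso(S, rho = .1)      # rho is the l1 penalty on Theta
round(fit$wi, 2)                # estimated (sparse) precision matrix Theta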
  • 63. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Network Simplification
This can be applied to networks, to spot 'significant' connections...
Source: http://khughitt.github.io/graphical-lasso/
@freakonometrics 63
  • 64. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Extension of Penalization Techniques
In a more general context, we want to solve
$\widehat{\theta} \in \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_i, m_\theta(x_i)) + \lambda\cdot\text{penalty}(\theta)\Big\}$
@freakonometrics 64