About functional SIR
Victor Picheny, Rémi Servien & Nathalie Villa-Vialaneix
nathalie.villa@toulouse.inra.fr
http://www.nathalievilla.org
Journées “Données fonctionnelles”
Institut de Mathématiques de Toulouse, June 19th 2017
A joint work of the SFCB team
Victor Picheny, Rémi Servien, NV2
Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations and Real data
Introduction
X a functional random variable and Y ∈ R
n i.i.d. realizations of (X, Y)
Objectives
variable selection in functional regression
selection of full intervals made of consecutive points
without any a priori information on the intervals
fully data-driven procedure
Question and mathematical framework
A functional regression problem: X: random variable (functional) & Y:
random real variable
E(Y|X)?
Data: n i.i.d. observations (x_i, y_i)_{i=1,...,n}.
x_i is not perfectly known but sampled at (fixed) points:
x_i = (x_i(t_1), . . . , x_i(t_p))^T ∈ R^p. We denote by X the (n × p) matrix
with rows x_1^T, . . . , x_n^T.
Question: Find a model that is easily interpretable and points out relevant
intervals for the prediction within the definition domain of X.
Method: Do not expand X on a functional basis but use the fact that the
entries of the digitized function x_i are ordered in a natural way.
Related works (variable selection in FDA)
LASSO / L1 regularization in linear models:
[Ferraty et al., 2010, Aneiros and Vieu, 2014] (isolated evaluation
points), [Matsui and Konishi, 2011] (selects elements of an expansion
basis)
[Fraiman et al., 2016] (blinding approach usable for various problems:
PCA, regression...)
[Gregorutti et al., 2015] adaptation of the importance of variables in
random forests to groups of variables
[Fauvel et al., 2015, Ferraty and Hall, 2015] cross-validation and a
greedy update of the selected evaluation points to select the most
relevant ones in a nonparametric framework
However, none of these approaches proposes to automatically design and
select contiguous sets of variables.
Related works (selection of groups of variables)
[James et al., 2009] L1 regularization in a linear model with sparsity on
the derivatives: piecewise-constant predictors
[Park et al., 2016] criterion based on a minimization of the overall
correlation during a greedy segmentation
[Grollemund et al., 2017] Bayesian approach from which an a posteriori
distribution of the informative intervals can be obtained
All are proposed in the framework of the linear model, and the second one
does not use the target variable to define and select the relevant intervals.
Our proposal: a semi-parametric (not entirely linear) model which selects
relevant intervals, combined with an automatic procedure to define the
intervals.
SIR in multidimensional framework
SIR: a semi-parametric regression model for X ∈ R^p:
Y = F(a_1^T X, . . . , a_d^T X, ε)
for a_1, . . . , a_d ∈ R^p (to be estimated), F : R^{d+1} → R, unknown, and ε, an
error, independent from X.
Standard assumption for SIR:
Y ⊥ X | P_A(X),
in which A is the so-called EDR space, spanned by (a_k)_{k=1,...,d}.
SIR is the regression extension of Linear Discriminant Analysis.
Estimation
Equivalence between SIR and eigendecomposition
A is included in the space spanned by the first d Σ-orthogonal
eigenvectors of the generalized eigendecomposition problem
Γa = λΣa,
with Σ the covariance matrix of X and Γ the covariance matrix of E(X|Y).
Estimation (when n > p)
compute X̄ = (1/n) Σ_{i=1}^n x_i and Σ̂ = (1/n) X^T (X − X̄)
split the range of Y into H different slices, τ_1, . . . , τ_H, and estimate
Ê(X|Y) = ( (1/n_h) Σ_{i: y_i ∈ τ_h} x_i )_{h=1,...,H}, with n_h = |{i : y_i ∈ τ_h}|,
in each slice, to obtain an estimate Γ̂
solve the eigendecomposition problem Γ̂a = λΣ̂a and obtain the
eigenvectors a_1, . . . , a_d
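To make these three steps concrete, here is a minimal sketch in plain R for the n > p case (a sketch, not the SISIR package code; slicing the ranks of y into H equal-count slices is one common choice, and all names are ours):

```r
## Minimal SIR sketch (n > p): slice y, compute within-slice means of X,
## then solve the generalized eigenproblem Gamma a = lambda Sigma a.
sir_edr <- function(X, y, H = 10, d = 2) {
  n <- nrow(X)
  Xbar  <- colMeans(X)
  Sigma <- crossprod(sweep(X, 2, Xbar)) / n              # hat(Sigma)
  slices <- cut(rank(y, ties.method = "first"), H, labels = FALSE)
  Mh <- t(sapply(1:H, function(h) colMeans(X[slices == h, , drop = FALSE])))
  ph <- as.numeric(table(slices)) / n                    # slice proportions
  Mc <- sweep(Mh, 2, Xbar)
  Gamma <- crossprod(Mc * sqrt(ph))                      # hat(Gamma)
  eig <- eigen(solve(Sigma, Gamma))                      # Sigma^{-1} Gamma
  Re(eig$vectors[, 1:d, drop = FALSE])                   # EDR directions a_1..a_d
}
```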
SIR in large dimensions: problem
In large dimensions (or in Functional Data Analysis), n < p and Σ̂ is
ill-conditioned and does not have an inverse ⇒ Z = (X − I_n X̄^T) Σ̂^{−1/2}
cannot be computed.
Different solutions have been proposed in the literature, based on:
prior dimension reduction (e.g., PCA) [Ferré and Yao, 2003] (in the
framework of FDA)
regularization (ridge...) [Li and Yin, 2008, Bernard-Michel et al., 2008]:
equivalent to the generalized eigendecomposition problem Γ̂a = λ(Σ̂ + µ_2 I)a
sparse SIR [Li and Yin, 2008, Li and Nachtsheim, 2008, Ni et al., 2005]
QZ-SIR [Coudret et al., 2014]: uses a method similar to the QR algorithm
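A sketch of the corresponding ridge fix, with Sigma and Gamma computed as in the previous sketch (names ours):

```r
## Ridge variant for n < p (sketch): the generalized eigenproblem becomes
## Gamma a = lambda (Sigma + mu2 I) a, whose left-hand matrix is invertible.
ridge_sir_edr <- function(Sigma, Gamma, mu2, d = 2) {
  eig <- eigen(solve(Sigma + mu2 * diag(nrow(Sigma)), Gamma))
  Re(eig$vectors[, 1:d, drop = FALSE])
}
```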
SIR in large dimensions: sparse versions
Specific issue to introduce sparsity in SIR
Sparsity on a multiple-index model: most authors use shrinkage
approaches, or sparsity on a single-index model and depletion (not shown)
First version: Li and Yin (2008), based on the regression formulation
Pro: sparsity common to all d dimensions
Con: minimization problem with dependent variables in R^p
Second version: Li and Nachtsheim (2008), based on the correlation
formulation
Pro: minimization problem with independent variables in R^d
Con: sparsity different in each of the d dimensions
Equivalent formulations
SIR as a regression problem: [Li and Yin, 2008] shows that SIR is
equivalent to the (double) minimization of
E(A, C) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ A C_h‖²
for X̄_h = (1/n_h) Σ_{i: y_i ∈ τ_h} x_i, A a (p × d)-matrix and C_h a vector in R^d.
Rk: Given A, C is obtained as the solution of an ordinary least squares
problem...
SIR as a Canonical Correlation problem: [Li and Nachtsheim, 2008]
shows that SIR rewrites as the double optimization problem
max_{a_j, φ} Cor(φ(Y), a_j^T X),
where φ is any function R → R and the (a_j)_j are Σ-orthonormal.
Rk: The solution is shown to satisfy φ(y) = a_j^T E(X|Y = y), and a_j is
also obtained as the solution of the mean square error problem:
min_{a_j} E[(φ(Y) − a_j^T X)²]
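The first remark above in a few lines of R: given A, each C_h is an ordinary least squares fit (a sketch, with Mh the H × p matrix of slice means from the estimation sketch):

```r
## Given A, each C_h solves min_{C_h} ||(Xbar_h - Xbar) - Sigma A C_h||^2:
## an OLS fit with regressor matrix B = Sigma A.
C_given_A <- function(A, Sigma, Mh, Xbar) {
  B <- Sigma %*% A                                    # p x d regressors
  C <- apply(Mh, 1, function(m) qr.solve(B, m - Xbar))
  matrix(C, ncol = ncol(A), byrow = TRUE)             # H x d matrix of the C_h
}
```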
SIR in large dimensions: sparse versions
First version: sparse penalization of the ridge solution
If (Â, Ĉ) are the solutions of the ridge SIR, [Ni et al., 2005, Li and Yin, 2008]
propose to shrink this solution by minimizing
E_{s,1}(α) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ Diag(α) Â Ĉ_h‖² + µ_1 ‖α‖_{L1}
(regression formulation of SIR)
Second version: [Li and Nachtsheim, 2008] derive the sparse optimization
problem from the correlation formulation of SIR:
min_{a_j^s} Σ_{i=1}^n ( P_{â_j}(X̄|y_i) − (a_j^s)^T x_i )² + µ_{1,j} ‖a_j^s‖_{L1},
in which P_{â_j}(X̄|y_i) is the projection of Ê(X|Y = y_i) = X̄_h onto the space
spanned by the solution of the ridge problem.
Characteristics of the different approaches and possible extensions

                        [Li and Yin, 2008]       [Li and Nachtsheim, 2008]
sparsity on             shrinkage coefficients   estimates
nb of optimization pbs  1                        d
sparsity                common to all dims       specific to each dim
SIR in large dimensions: our sparse version
Background: Back to the functional setting, we suppose that t_1, ..., t_p are
split into D intervals I_1, ..., I_D.
Based on the minimization problem of Li and Nachtsheim (2008).
Our adaptation: sparsity on the intervals, using α = (α_1, . . . , α_D):
∀ l = 1, . . . , p, â_{jl}^s = α̂_k â_{jl} for k such that t_l ∈ I_k.
the sparsity constraint is put on α and not directly on â_j^s
the α are made identical for all dimensions of the projection j = 1, . . . , d
Li and Nachtsheim (2008) (LASSO):
min_{a_j^s} Σ_{i=1}^n ( P_{â_j}(X̄|y_i) − (a_j^s)^T x_i )² + µ_{1,j} ‖a_j^s‖_{L1},
in which P_{â_j}(X̄|y_i) is the projection of Ê(X|Y = y_i) = X̄_h (for h such that
y_i is in slice h) onto the space spanned by the â_j.
Our adaptation:
α̂ = arg min_{α ∈ R^D} Σ_{j=1}^d Σ_{i=1}^n ( P_{â_j}(X̄|y_i) − (Λ(α) â_j)^T x_i )² + µ_1 ‖α‖_{L1}
with ∀ l = 1, . . . , p, â_{jl}^s = α̂_k â_{jl} for k such that t_l ∈ I_k, and
Λ(α) = Diag(α_1 I_{|I_1|}, . . . , α_D I_{|I_D|}) ∈ M_{p×p}.
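Below is a minimal sketch, in R, of this interval-wise LASSO step, assuming the ridge-SIR directions and the projections P_{â_j}(X̄|y_i) have already been computed; all names are illustrative, and this is a sketch rather than the SISIR package API:

```r
library(glmnet)

## Inputs (assumed, names ours): X (n x p) data matrix; ahat (p x d) ridge-SIR
## directions; proj (n x d), with proj[i, j] the scalar projection
## P_{ahat_j}(Xbar|y_i); interval: length-p vector giving, for each t_l,
## the index k of the interval I_k containing it.
fit_alpha <- function(X, ahat, proj, interval) {
  d <- ncol(ahat); D <- max(interval)
  ## column k of the design = sum over the t_l in I_k of ahat[l, j] x_i(t_l),
  ## i.e. the contribution of interval I_k to (Lambda(alpha) ahat_j)^T x_i
  Zj <- function(j) sapply(1:D, function(k)
    as.numeric(X[, interval == k, drop = FALSE] %*% ahat[interval == k, j]))
  Z <- do.call(rbind, lapply(1:d, Zj))   # (n*d) x D design, stacked over j
  r <- as.vector(proj)                   # responses, stacked in the same order
  cv <- cv.glmnet(Z, r, intercept = FALSE, standardize = FALSE)
  as.numeric(coef(cv, s = "lambda.min"))[-1]  # hat(alpha): one value per interval
}
```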
Summary: SISIR, a two-step approach
First step: solve the projection problem (using SIR and L2 regularization of
Σ), which provides the estimates (â_j)_{j∈{1,...,d}} of the vectors spanning the
EDR space.
Second step: sparsity on the D intervals, using α = (α_1, . . . , α_D) obtained by
solving a LASSO problem: this handles the functional setting by penalizing
entire intervals and not just isolated points.
SISIR: Characteristics
uses the approach based on the correlation formulation (because the
dimensionality of the optimization problem is smaller);
uses a shrinkage approach and optimizes the shrinkage coefficients in a
single optimization problem;
handles the functional setting by penalizing entire intervals and not just
isolated points.
An automatic approach to define intervals
1 Initial state: ∀ k = 1, . . . , p, τ_k = {t_k}
2 Iterate:
along the regularization path, select three values for µ_1: P% of the
coefficients are zero, P% of the coefficients are non zero, best GCV
define D− (the “strong zeros”: intervals whose coefficients are null for
the three values of µ_1) and D+ (the “strong non zeros”, defined similarly)
merge consecutive “strong zeros” (resp. “strong non zeros”), as well as
“strong zeros” (resp. “strong non zeros”) separated by a small number of
intervals of undetermined type (a sketch of this merge step is given below)
Until no more iterations can be performed.
3 Output: a collection of models (the first with p intervals, the last with 1),
M*_D (optimal for GCV) and the corresponding GCV_D versus D (number of
intervals).
Final solution: minimize GCV_D over D.
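An illustrative sketch of the merge step (the classification of the current intervals into the three types is assumed given; names are hypothetical and the actual SISIR implementation may differ):

```r
## status: one entry per current interval: "zero" for a "strong zero",
## "nonzero" for a "strong non zero", "und" for an undetermined interval.
merge_intervals <- function(status, max_gap = 2) {
  runs <- rle(status)
  v <- runs$values; l <- runs$lengths
  ## a short undetermined run flanked by the same determined type is absorbed
  for (r in seq_along(v)) {
    if (v[r] == "und" && l[r] <= max_gap && r > 1 && r < length(v) &&
        v[r - 1] == v[r + 1] && v[r - 1] != "und") v[r] <- v[r - 1]
  }
  filled <- inverse.rle(list(values = v, lengths = l))
  ## consecutive intervals of the same determined type get the same new index
  new_block <- c(TRUE, filled[-1] != filled[-length(filled)]) | filled == "und"
  cumsum(new_block)   # new interval index for each old interval
}

## e.g. merge_intervals(c("zero", "zero", "und", "zero", "nonzero"))
## returns c(1, 1, 1, 1, 2): the short undetermined gap is absorbed.
```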
Simulation framework
Data generated with:
X(t) a Gaussian process with mean µ(t) = −5 + 4t − 4t² and a
Matern covariance
a_j(t) = sin( t(2+j)π/2 − (j−1)π/3 ) 1_{I_j}(t)
Y = Σ_{j=1}^d log⟨X, a_j⟩
one model: (M1), d = 1, I_1 = [0.2, 0.4].
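A sketch of this simulation design in R; the slide only says "Matern", so the Matérn 5/2 kernel, its range, the sample size n and the absolute value inside the log are our assumptions:

```r
set.seed(1)
n <- 100; p <- 200                        # n is our choice; p as on the next slide
t_grid <- seq(0, 1, length.out = p)
mu <- -5 + 4 * t_grid - 4 * t_grid^2      # mean function of the slide
## Matern 5/2 covariance (assumed smoothness; range rho is our choice)
matern52 <- function(r, rho = 0.1)
  (1 + sqrt(5) * r / rho + 5 * r^2 / (3 * rho^2)) * exp(-sqrt(5) * r / rho)
K <- outer(t_grid, t_grid, function(s, t) matern52(abs(s - t)))
L <- chol(K + 1e-8 * diag(p))             # small jitter for numerical stability
X <- sweep(matrix(rnorm(n * p), n, p) %*% L, 2, mu, "+")  # n GP sample paths
## model (M1): d = 1, I_1 = [0.2, 0.4]
j <- 1
a1 <- sin(t_grid * (2 + j) * pi / 2 - (j - 1) * pi / 3) *
  (t_grid >= 0.2 & t_grid <= 0.4)
Y <- as.numeric(log(abs(X %*% a1 / p)))   # <X, a_1> by a Riemann sum; abs() is ours
```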
Definition of the intervals
[Figures: estimated coefficients along the iterations of the procedure, for
D = p = 200 (initial state = LASSO), D = 142, D = 41 and D = 5 intervals.]
Second model
(M2): d = 3 and I_1 = [0, 0.1], I_2 = [0.5, 0.65] and I_3 = [0.65, 0.78].
[Figures: results for (M2) obtained with SISIR and with standard sparse SIR.]
Tecator dataset
relevant intervals
easily interpretable
good MSE
Sunflower dataset
climatic time series (between 1975 and 2012, in France)
daily measures from April to October
X = evapotranspiration, Y = yield, n = 111, p = 309
Sunflower dataset
only two points identified outside the interval
focus on the second half of the interval
matches expert knowledge
Conclusion
SI-SIR:
sparse dimension reduction model adapted to functional framework
fully automated definition of relevant intervals in the range of the
predictors
Package SISIR available on CRAN at
https://cran.r-project.org/package=SISIR.
Perspectives
adaptation to multiple X
application to large-scale real data (agricultural application:
X={temperature,rainfall ...}, Y={yield})
replace CV criterion?
Aneiros, G. and Vieu, P. (2014).
Variable selection in infinite-dimensional problems.
Statistics and Probability Letters, 94:12–20.
Bernard-Michel, C., Gardes, L., and Girard, S. (2008).
A note on sliced inverse regression with regularizations.
Biometrics, 64(3):982–986.
Coudret, R., Liquet, B., and Saracco, J. (2014).
Comparison of sliced inverse regression approaches for underdetermined cases.
Journal de la Société Française de Statistique, 155(2):72–96.
Fauvel, M., Dechesne, C., Zullo, A., and Ferraty, F. (2015).
Fast forward feature selection of hyperspectral images for classification with Gaussian mixture models.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6):2824–2831.
Ferraty, F. and Hall, P. (2015).
An algorithm for nonlinear, nonparametric model choice and prediction.
Journal of Computational and Graphical Statistics, 24(3):695–714.
Ferraty, F., Hall, P., and Vieu, P. (2010).
Most-predictive design points for functional data predictors.
Biometrika, 97(4):807–824.
Ferré, L. and Yao, A. (2003).
Functional sliced inverse regression analysis.
Statistics, 37(6):475–488.
Fraiman, R., Gimenez, Y., and Svarc, M. (2016).
Feature selection for functional data.
Journal of Multivariate Analysis, 146:191–208.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015).
Grouped variable importance with random forests and application to multiple functional data analysis.
Computational Statistics and Data Analysis, 90:15–35.
Grollemund, P., Abraham, C., Baragatti, M., and Pudlo, P. (2017).
Bayesian functional linear regression with sparse step functions.
Preprint.
James, G., Wang, J., and Zhu, J. (2009).
Functional linear regression that’s interpretable.
Annals of Statistics, 37(5A):2083–2108.
Li, L. and Nachtsheim, C. (2008).
Sparse sliced inverse regression.
Technometrics, 48(4):503–510.
Li, L. and Yin, X. (2008).
Sliced inverse regression with regularizations.
Biometrics, 64(1):124–131.
Liquet, B. and Saracco, J. (2012).
A graphical tool for selecting the number of slices and the dimension of the model in SIR and SAVE approaches.
Computational Statistics, 27(1):103–125.
Matsui, H. and Konishi, S. (2011).
Variable selection for functional regression models via the L1 regularization.
Computational Statistics and Data Analysis, 55(12):3304–3310.
Ni, L., Cook, D., and Tsai, C. (2005).
A note on shrinkage sliced inverse regression.
Biometrika, 92(1):242–247.
Park, A., Aston, J., and Ferraty, F. (2016).
Stable and predictive functional domain selection with application to brain images.
Preprint arXiv 1606.02186.
Parameter estimation
H (number of slices): usually, SIR is known to be not very sensitive to
the number of slices (provided H > d + 1). We took H = 10 (i.e., 10/30
observations per slice);
µ_2 and d (ridge estimate Â):
L-fold CV for µ_2 (for a d_0 large enough). Note that GCV as described in
[Li and Yin, 2008] cannot be used, since the current version of the L2
penalty involves the use of an estimate of Σ^{−1}.
using again L-fold CV, ∀ d = 1, . . . , d_0, an estimate of
R(d) = d − E[ Tr(Π_d Π̂_d) ],
in which Π_d and Π̂_d are the projectors onto the first d dimensions of the
EDR space and its estimate, is derived similarly as in
[Liquet and Saracco, 2012]. The evolution of R̂(d) versus d is studied
to select a relevant d.
µ_1 (LASSO): glmnet is used, in which µ_1 is selected by CV along the
regularization path (see the sketch below).
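A sketch of this last step with glmnet, taking Z and r as in the interval-LASSO sketch above (cv.glmnet and lambda.min are glmnet's standard tools for CV along the regularization path):

```r
library(glmnet)
## Z, r: design and response of the interval-LASSO sketch above
path  <- glmnet(Z, r, intercept = FALSE, standardize = FALSE)  # full mu_1 path
cvfit <- cv.glmnet(Z, r, intercept = FALSE, standardize = FALSE)
mu1   <- cvfit$lambda.min                                      # CV-optimal mu_1
alpha_hat <- as.numeric(coef(cvfit, s = mu1))[-1]
```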