CS592 Presentation #5
Sparse Additive Models
20173586 Jeongmin Cha
20174463 Jaesung Choe
20184144 Andries Bruno
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
1. Brief of Additive Models
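As background for this section (our own summary, not text recovered from the slides): an additive model writes the regression function as a sum of one-dimensional smooth functions and is classically fit by backfitting, i.e. by repeatedly smoothing the partial residual on each coordinate.

$$ Y_i = \alpha + \sum_{j=1}^{p} f_j(X_{ij}) + \varepsilon_i, \qquad \hat f_j \leftarrow \mathcal{S}_j\Big[\, Y - \hat\alpha - \sum_{k \neq j} \hat f_k \,\Big] $$

Here S_j denotes a one-dimensional smoother applied to the j-th covariate; the update is cycled over j = 1, ..., p until convergence.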
1. Introduction
● Combine ideas from sparse linear models (a sparsity constraint) and additive nonparametric regression (fit by backfitting) to obtain Sparse Additive Models (SpAM).
1. Introduction
● SpAM ⋍ an additive nonparametric regression model
○ but with a sparsity constraint on the component functions
○ a functional version of the group lasso
● A nonparametric regression model relaxes the strong assumptions made by a linear model
1. Introduction
● The authors show the estimator of
● 1. Sparsistence (Sparsity pattern consistency)
○ SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically
● 2. Persistence
○ the estimator is persistent, predictive risk of estimator converges
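Informally (our paraphrase, in generic notation): sparsistency means the estimated set of nonzero components matches the true set with probability tending to one, and persistence means the excess predictive risk over the best model in the class vanishes.

$$ \mathbb{P}\big( \operatorname{supp}(\hat f) = \operatorname{supp}(f^{*}) \big) \to 1, \qquad R(\hat f_n) - \inf_{f \in \mathcal{M}_n} R(f) \xrightarrow{P} 0, \quad R(f) = \mathbb{E}\,(Y - f(X))^2 $$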
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
2. Notation and Assumptions
● Data representation
● Additive nonparametric model
2. Notation and Assumptions
● P = the joint distribution of (X_i, Y_i)
● The L2(P) norm of a function f on [0, 1]:
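The displayed definition was an image; a hedged reconstruction in our notation:

$$ \| f \|^2 = \mathbb{E}\,[ f(X)^2 ] = \int_{[0,1]} f^2(x)\, \mathrm{d}P_X(x) $$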
2. Notation and Assumptions
● On each dimension j:
● H_j = the Hilbert subspace of L2(P) of P-measurable functions f_j(x_j) of the single variable x_j
● with zero mean: E[f_j(X_j)] = 0
● The Hilbert subspace H_j has the inner product ⟨f_j, f_j'⟩ = E[f_j(X_j) f_j'(X_j)]
● H = H_1 ⊕ ⋯ ⊕ H_p = the Hilbert space of p-dimensional functions in the additive form f(x) = f_1(x_1) + ⋯ + f_p(x_p)
2. Notation and Assumptions
● {ψ_jk} = a uniformly bounded, orthonormal basis on [0, 1]
● The j-th dimensional function f_j is expanded in this basis (sketched below)
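A hedged sketch of the basis expansion the slide refers to (notation ours; in practice the sum is truncated):

$$ f_j(x_j) = \sum_{k=1}^{\infty} \beta_{jk}\, \psi_{jk}(x_j), \qquad \beta_{jk} = \mathbb{E}\,[ f_j(X_j)\, \psi_{jk}(X_j) ] $$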
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
3. Sparse Backfitting
#1. Formulate the population SpAM
● Eq. 8: standard form of the additive model optimization problem (the standard additive model).
● Eq. 9: its penalized Lagrangian form (the objective function).
● Eq. 10: the design choice of SpAM; β is a scaling parameter and g is a function in the Hilbert space. The functional mapping is Xj ↦ g(Xj) ↦ β·g(Xj), and, as in the lasso, the coefficients β become sparse.
● Eq. 11: the penalized Lagrangian form of Eq. 10 and the sample version of Eq. 9: the sparse additive model (SpAM).
● Ψ: basis functions that linearly span the function g, where q ≤ p; the basis-expanded functions g are grouped through Ψ ➝ a functional version of the group lasso.
● Minimizing the sampled objective yields a soft-thresholding update and a backfitting algorithm (Theorem 1); hedged reconstructions of these equations follow below.
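The displayed equations were images; the following is our hedged reconstruction of the population SpAM problem and its Lagrangian and sample forms, following our reading of the SpAM paper (numbering and constants may differ from the slides):

$$ \min_{\beta, g}\ \mathbb{E}\Big( Y - \sum_{j=1}^{p} \beta_j g_j(X_j) \Big)^2 \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le L,\ \ \mathbb{E}[g_j(X_j)] = 0,\ \ \mathbb{E}[g_j(X_j)^2] = 1 $$

$$ \mathcal{L}(f, \lambda) = \tfrac12\, \mathbb{E}\Big( Y - \sum_{j=1}^{p} f_j(X_j) \Big)^2 + \lambda \sum_{j=1}^{p} \sqrt{ \mathbb{E}\,[ f_j(X_j)^2 ] } $$

$$ \hat{\mathcal{L}}(\beta, \lambda) = \tfrac{1}{2n}\, \Big\| Y - \sum_{j=1}^{p} \Psi_j \beta_j \Big\|_2^2 + \lambda \sum_{j=1}^{p} \tfrac{1}{\sqrt{n}}\, \| \Psi_j \beta_j \|_2 $$

The last display has the form of a group lasso over the basis coefficients of each component, which is the sense in which SpAM is a functional group lasso.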
3. Sparse Backfitting
#2. Theorem 1
From the penalized Lagrangian form, the minimizers (f_1, ..., f_p) satisfy a soft-thresholding condition,
where P_j denotes the projection of the residual onto H_j, R_j represents the residual, and [·]+ means the positive part.
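Our hedged reconstruction of the statement (the slide's equation was an image; this follows our reading of the paper):

$$ f_j = \Big[\, 1 - \frac{\lambda}{\sqrt{ \mathbb{E}\,[ P_j(X_j)^2 ] }} \,\Big]_{+} P_j, \qquad P_j = \mathbb{E}\,[ R_j \mid X_j ], \qquad R_j = Y - \sum_{k \neq j} f_k(X_k) $$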
3. Sparse Backfitting
#2. Proof of Theorem 1
#2-1. The stationary condition is obtained by setting the Fréchet directional derivative to zero.
#2-2. Using iterated expectations, the condition can be rewritten in terms of the conditional expectation P_j = E[R_j | X_j].
#2-3. From this we obtain the equivalence between the stationary condition and a rescaling of P_j.
#2-4. This gives the soft-thresholding update for the function f_j: only the positive part survives after thresholding.
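A hedged sketch of the missing displays in steps #2-1 to #2-4 (our reconstruction of the paper's argument):

$$ \text{#2-1:}\quad \mathbb{E}\big[ (f_j(X_j) - R_j)\, v_j(X_j) \big] + \lambda\, \mathbb{E}\big[ u_j(X_j)\, v_j(X_j) \big] = 0 \quad \text{for all } v_j \in \mathcal{H}_j $$

where u_j is a subgradient of the penalty, with u_j = f_j / sqrt(E[f_j^2]) when f_j ≠ 0.

$$ \text{#2-2:}\quad f_j + \lambda\, u_j = \mathbb{E}[ R_j \mid X_j ] = P_j \quad \text{a.e.} $$

$$ \text{#2-3:}\quad \Big( 1 + \frac{\lambda}{\sqrt{\mathbb{E}[f_j^2]}} \Big) f_j = P_j \ \ \text{if } f_j \neq 0, \qquad \sqrt{\mathbb{E}[P_j^2]} \le \lambda \ \Rightarrow\ f_j = 0 $$

$$ \text{#2-4:}\quad f_j = \Big[\, 1 - \frac{\lambda}{\sqrt{\mathbb{E}[P_j^2]}} \,\Big]_{+} P_j $$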
Discussion point:
Why do you think Theorem 1 is important?
: Only the positive part survives the thresholding, so entire component functions are set exactly to zero and the fitted model becomes sparse.
3. Sparse Backfitting
#3. Backfitting algorithm
- According to Theorem 1, each component is updated by projecting the residual and then soft-thresholding.
- The projection P_j = E[R_j | X_j] is estimated with a smoother matrix S_j applied to the residual.
- Flow of the backfitting algorithm: cycle this update over j = 1, ..., p until convergence (see the Python sketch below).
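A minimal Python sketch of this loop, assuming user-supplied one-dimensional smoothers (the `smoothers` callables and all names here are ours, not code from the paper):

```python
import numpy as np

def spam_backfitting(X, y, smoothers, lam, n_iters=50, tol=1e-6):
    """Sketch of SpAM backfitting: smooth the partial residual on each
    coordinate, then soft-threshold the whole component function."""
    n, p = X.shape
    f = np.zeros((n, p))                      # fitted values f_j(X_ij)
    y = y - y.mean()                          # work with a centered response
    for _ in range(n_iters):
        f_old = f.copy()
        for j in range(p):
            r_j = y - f.sum(axis=1) + f[:, j]      # residual without component j
            p_j = smoothers[j](X[:, j], r_j)       # estimate P_j = E[R_j | X_j]
            s_j = np.sqrt(np.mean(p_j ** 2))       # estimate sqrt(E[P_j^2])
            shrink = max(0.0, 1.0 - lam / s_j) if s_j > 0 else 0.0
            f[:, j] = shrink * p_j                 # soft-threshold the function
            f[:, j] -= f[:, j].mean()              # recenter so mean(f_j) = 0
        if np.max(np.abs(f - f_old)) < tol:
            break
    return f
```

For example, `smoothers[j]` could be any kernel or spline smoother that fits the residual `r_j` on `X[:, j]` and returns the fitted values.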
3. Sparse Backfitting
#4. SpAM backfitting algorithm vs. coordinate descent (lasso)
The SpAM backfitting algorithm is the functional version of the coordinate descent algorithm for the lasso:
: functional mapping (a function update instead of a scalar coefficient update)
: iterate through the coordinates j = 1, ..., p
: smoothing matrix S_j

Discussion points:
Now we understand that SpAM is the functional version of the group lasso.
Is SpAM then always better than the lasso or the group lasso?
(Hint: the lasso is a linear model, and SpAM is ...?)
3. Sparse Backfitting
For simplicity, think of SPLAM [1] as a combination of SpAM (non-linearity) and the lasso (linearity).
Comparison: GAMSEL vs. SpAM vs. lasso (see [1], [2]).
Answer: the lasso is a linear model, while SpAM is a non-linear (additive) one; when there is non-linearity in the data, SpAM can be effective.
[1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
[2] Hastie, Trevor J. "Generalized additive models." Statistical Models in S. Routledge, 2017. 249-307.
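For comparison with the SpAM loop above, a minimal sketch of coordinate descent for the lasso, where each update is a scalar soft-threshold (our illustration, assuming standardized columns):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Coordinate descent for the lasso, assuming columns scaled so that
    (1/n) * x_j @ x_j == 1; compare with the functional updates of SpAM."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r_j / n                  # correlation with x_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft threshold
    return beta
```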
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
4. Choosing regularization parameter
#5. Risk estimation: the tuning parameter λ of the SpAM backfitting algorithm is chosen by estimating the risk of the fitted model (a generic stand-in sketch follows).
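The slide's risk estimator did not survive extraction; as a stand-in, a generic hold-out selection of λ (explicitly not the paper's estimator; `fit_spam` is a hypothetical wrapper around the backfitting sketch above):

```python
import numpy as np

def choose_lambda(X, y, lambdas, fit_spam, val_frac=0.25, seed=0):
    """Pick lambda by hold-out risk. `fit_spam(X_tr, y_tr, lam)` must
    return a `predict(X)` callable."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    risks = [np.mean((y[val] - fit_spam(X[tr], y[tr], lam)(X[val])) ** 2)
             for lam in lambdas]
    return lambdas[int(np.argmin(risks))], risks
```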
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
6.1. Synthetic Data
● Generate n = 150 samples from a 200-dimensional additive model in which only four component functions are nonzero.
● The remaining 196 features are irrelevant: their component functions are set to zero, and the response includes zero-mean Gaussian noise (see the sketch below).
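A sketch of this setup in Python; the four nonzero component functions below are illustrative stand-ins, not the exact functions used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 200
X = rng.uniform(0.0, 1.0, size=(n, p))
# four relevant (hypothetical) component functions on the first four features
f1 = np.sin(2 * np.pi * X[:, 0])
f2 = (2 * X[:, 1] - 1) ** 2
f3 = X[:, 2] - 0.5
f4 = np.exp(X[:, 3]) - np.exp(X[:, 3]).mean()
# the remaining 196 features do not enter the regression function
y = f1 + f2 + f3 + f4 + rng.normal(scale=1.0, size=n)
```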
6.1. Synthetic Data
● Empirical probability of selecting the true four variables, as a function of the sample size n.
● The same thresholding phenomenon that was shown for the lasso is observed.
6.2. Boston Housing
● There are 506 observations with 10 covariates.
● To explore the sparsistency properties of SpAM, 20 irrelevant variables are added (see the sketch after this list).
● Ten of those are randomly drawn from Uniform(0, 1).
● The remainder are permutations of the original 10 covariates.
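A small sketch of this augmentation step (`X_orig` below is a random placeholder standing in for the 506 × 10 Boston design):

```python
import numpy as np

rng = np.random.default_rng(0)
X_orig = rng.uniform(size=(506, 10))     # placeholder for the 506 x 10 Boston covariates
n = X_orig.shape[0]
uniform_cols = rng.uniform(size=(n, 10))        # 10 irrelevant Uniform(0, 1) variables
permuted_cols = X_orig[rng.permutation(n), :]   # 10 row-permuted copies of the originals
X_aug = np.hstack([X_orig, uniform_cols, permuted_cols])   # 506 x 30 augmented design
```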
6.2. Boston Housing
● SpAM identifies 6 of the nonzero components.
● Both types of irrelevant variables are correctly zeroed out.
6.3. SpAM for Spam
● The dataset consists of 3,065 emails, which serve as the training set.
● 57 attributes are available; all are numeric.
● The attributes measure the percentage of specific words in an email and the average and maximum run lengths of uppercase letters.
● Sample 300 emails from the training set and use the remainder as the test set.
6.3. SpAM for Spam
Best model
6.3. SpAM for Spam
The 33 selected variables cover 80% of the significant predictors.
6.3. Functional Sparse Coding
● Here we compare SpAM with the lasso on natural images.
● The problem setup is as follows:
● y is the data to be represented; X is an n×p matrix whose columns X_j are vectors to be learned; the L1 penalty encourages sparsity in the coefficients.
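A hedged sketch of the sparse coding objective being described (the slide's display was an image; notation ours):

$$ \min_{X,\, \beta}\ \tfrac12\, \| y - X \beta \|_2^2 + \lambda\, \| \beta \|_1 $$

In dictionary learning, y is a data vector (e.g. an image patch), X is the dictionary of codewords, and the two are typically optimized alternately.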
6.3. Functional Sparse Coding
● Sparsity allows specialization of features and enforces the capture of salient properties of the data.
6.3. Functional Sparse Coding
● When solved with lasso and SGD, 200 codewords that capture edge
features at different scales and spatial orientations are learned:
6.3. Functional Sparse Coding
● In the functional version, no assumption of linearity is made between X and
y. Instead, the following additive model is used:
● This leads to the following optimization problem:
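A hedged reconstruction of the additive model and objective referred to above (notation ours):

$$ y_i \approx \sum_{j=1}^{p} f_j(x_{ij}), \qquad \min_{\{f_j\}}\ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \lambda \sum_{j=1}^{p} \sqrt{ \tfrac{1}{n} \sum_{i=1}^{n} f_j(x_{ij})^2 } $$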
6.3. Functional Sparse Coding
● Which model is lasso and which is SpAM?
6.3. Functional Sparse Coding
● What about expressiveness?
6.3. Functional Sparse Coding
● The sparse linear model uses 8 codewords, while the functional version uses 7 with a lower residual sum of squares (RSS).
● Also, the linear and nonlinear versions learn different codewords.
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
7. Discussion Points (1)
● As the authors note, SpAM is essentially a functional version of the grouped lasso. Are there formulations for functional versions of other methods, e.g. ridge or the fused lasso? Finding a generalized functional version of the lasso family would be an interesting problem.
○ Example: functional logistic regression with a fused lasso penalty (FLR-FLP)
7. Discussion Points (1)
● Objective function = FLR loss + lasso penalty + fused lasso penalty (sketched below)
● FLR loss = the negative log-likelihood of functional logistic regression
● γ = the coefficients of the functional parameter in its basis expansion
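A hedged sketch of such an objective (our reconstruction; the exact FLR-FLP formulation may differ):

$$ \min_{\gamma}\ -\sum_{i=1}^{n} \Big[ y_i \eta_i - \log\big( 1 + e^{\eta_i} \big) \Big] + \lambda_1 \sum_{k} | \gamma_k | + \lambda_2 \sum_{k \ge 2} | \gamma_k - \gamma_{k-1} |, \qquad \eta_i = \gamma_0 + \int x_i(t)\, \beta(t)\, \mathrm{d}t, \quad \beta(t) = \sum_{k} \gamma_k\, \phi_k(t) $$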
7. Discussion Points (2)
● How might we handle group sparsity in additive models (GroupSpAM), as an analogy to the group lasso?
○ First, assume G is a partition of {1, ..., p} and that the groups do not overlap.
○ The optimization problem keeps the additive-model loss, and the regularization term becomes a group-wise penalty on whole sets of component functions (sketched below).
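A hedged sketch of the GroupSpAM objective (our reading; group weights such as the square root of the group size may appear in the actual formulation):

$$ \min_{\{f_j\}}\ \tfrac12\, \mathbb{E}\Big( Y - \sum_{j=1}^{p} f_j(X_j) \Big)^2 + \lambda \sum_{g \in G} \sqrt{ \sum_{j \in g} \mathbb{E}\,[ f_j(X_j)^2 ] } $$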
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has a close connection to the bias-variance tradeoff.
● Suppose we make our estimates too smooth. What may we expect then?
○ If our estimates are too smooth, we risk bias.
■ We make erroneous assumptions about the underlying functions.
■ In this case we miss relevant relations between features and targets, and we underfit.
7. Discussion Points (3)
● What if we make our estimates too rough? What may we expect then?
○ We risk variance. What does this mean?
● The learned model becomes sensitive to small variations in the data, and we overfit.
We must keep a balance between bias and variance by using an appropriate level of smoothing.
7. Discussion Points (4)
● Some notes on practicality: with modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over sparse additive models?
● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure.
● Our data set is so massive that either the extra processing time or the extra computer memory needed to fit and store an additive rather than a linear model is prohibitive.
Thank you for listening
CS592 Presentation #5
Sparse Additive Models
20173586 Jeongmin Cha
20174463 Jaesung Choe
20184144 Andries Bruno
77
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
1. Brief of Additive Models
1. Brief of Additive Models
1. Brief of Additive Models
1. Brief of Additive Models
1. Brief of Additive Models
1. Introduction
● Combine ideas from
Sparse
Linear
Models
Additive
Nonparametric
regression
Sparse Additive Models (SpAM)
Backfitting
sparsity
constraint
1. Introduction
● SpAM ⋍ additive nonparametric regression model
○ but, + sparsity constraint on
○ functional version of group lasso
● Nonparametric regression model
relaxes the strong assumptions made by a linear model
1. Introduction
● The authors show the estimator of
● 1. Sparsistence (Sparsity pattern consistency)
○ SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically
● 2. Persistence
○ the estimator is persistent, predictive risk of estimator converges
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
● Data Representation
● Additive Nonparametric model
2. Notation and Assumption
2. Notation and Assumption
● P = the joint distribution of ( Xi
, Yi
)
● The definition of L2
(P) norm (f on [0, 1]):
● On each dimension,
● hilbert subspace of L2
(P) of P-measurable functions
● zero mean
● The hilbert subspace has the inner product
● hilbert space
of dimensional functions in the additive form
2. Notation and Assumption
● uniformly bounded, orthonormal basis on [0,1]
● The dimensional function
2. Notation and Assumption
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
#1. Formulate to the population SpAM
3. Sparse Backfitting
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
#1. Formulate to the population SpAM
3. Sparse Backfitting
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Additive model optimization problem
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
Sparse additive model optimization problem
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
(Red) : functional mapping (Xj ↦ g(Xj) ↦β*g(Xj))
(Green) : coefficients β would become sparse.
: Lasso
Sparse additive model optimization problem
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
q q
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse(!!) additive model
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Sampled!
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Soft
thresholding
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Backfitting algorithm
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Backfitting algorithm
(Theorem. 1)
3. Sparse Backfitting
#2. Theorem 1
From the penalized lagrangian form,
3. Sparse Backfitting
#2. Theorem 1 says
From the penalized lagrangian form,
The minimizers ( ) satisfy
where denotes projection matrix, represents residual matrix
and means the positive part.
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( )
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function
Only positive parts survive after thresholding.
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function
Only positive parts survive after thresholding.
Discussion points:
Why do you think theorem 1 is important ??
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function
Only positive parts survive after thresholding.
Discussion points:
Why do you think theorem 1 is important ??
: Only positive parts survive such that function becomes sparse.
3. Sparse Backfitting
#3. Backfitting algorithm
- According to theorem 1,
- Estimate smoother projection matrix where
- Flow of the backfitting algorithm
3. Sparse Backfitting
#4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso)
SpAM backfitting algorithm is the functional version of the coordinate
descent algorithm.
: Functional mapping.
: Iterate through coordinate
: Smoothing matrix
3. Sparse Backfitting
#4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso)
SpAM backfitting algorithm is the functional version of the coordinate
descent algorithm.
: Functional mapping.
: Iterate through coordinate
: Smoothing matrix
Discussion points:
Now, we understand that SpAM is the functional version of group lasso.
Then SpAM is alway better than lasso or group lasso?
3. Sparse Backfitting
#4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso)
SpAM backfitting algorithm is the functional version of the coordinate
descent algorithm.
: Functional mapping.
: Iterate through coordinate
: Smoothing matrix
Discussion points:
Now, we understand that SpAM is the functional version of group lasso.
Then SpAM is alway better than lasso or group lasso?
(Hint : lasso is linear model, and SpAM is …? )
For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity).
3. Sparse Backfitting
[1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
GAMSEL:: 1 VS SpAM: vs Lasso:
[2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307.
For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity).
3. Sparse Backfitting
[1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
GAMSEL:: 1 VS SpAM: vs Lasso:
[2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307.
Discussion points:
Now, we understand that SpAM is the functional version of group lasso.
Then SpAM is alway better than lasso or group lasso?
(Hint : lasso is linear model, and SpAM is …? )
When there is non-linearity, SpAM can be effective.
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
SpAM backfitting algorithm
#5. Risk estimation
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
6.1. Synthetic Data
● Generate 150 samples from a 200-dimensional additive model
● The remaining 196 features are irrelevant and are set to zero plus a 0
mean gaussian noise.
6.1. Synthetic Data
● Empirical probability of selecting the true four variables as a function of
the sample size n
The same thresholding phenomenon that was
shown in the lasso is observed.
6.1. Synthetic Data
● Empirical probability of selecting the true four variables as a function of
the sample size n
6.2. Boston Housing
● There are 506 observations with 10 covariates.
● To explore the sparsistency properties of SpAM, 20 irrelevant variables are
added.
● Ten of those are randomly drawn from Uniform(0, 1)
● The remainder are permutations of the original 10 covariates.
6.2. Boston Housing
● SpAM identifies 6 of the nonzero components.
● Both types of irrelevant variables are correctly zeroed out.
6.3. SpAM for Spam
● Dataset consists of 3,065 emails which serve as training set.
● 57 attributes are available. These are all numeric
● Attributes measure the percentage of specific words in an email, the
average and maximum run lengths of uppercase letters.
● Sample on 300 emails from the training set and use the remainder as test
set.
6.3. SpAM for Spam
Best model
6.3. SpAM for Spam
The 33 selected variables cover 80% of the significant predictors.
6.3. Functional Sparse Coding
● Here we compare SpAM with lasso. We consider natural images.
● The problem setup is as follows:
● y is the data to be represented. X is an nxp matrix with columns X_j’s
vectors to be learned. The L1 penalty encourages sparsity in the
coefficients.
6.3. Functional Sparse Coding
● y is the data to be represented. X is an nxp matrix with columns X_j’s
vectors to be learned. The L1 penalty encourages sparsity in the
coefficients.
● Sparsity allows specialization of features and enforces capturing of salient
properties of the data.
6.3. Functional Sparse Coding
● When solved with lasso and SGD, 200 codewords that capture edge
features at different scales and spatial orientations are learned:
6.3. Functional Sparse Coding
● In the functional version, no assumption of linearity is made between X and
y. Instead, the following additive model is used:
● This leads to the following optimization problem:
6.3. Functional Sparse Coding
● Which model is lasso and which is SpAM?
6.3. Functional Sparse Coding
● What about expressiveness?
6.3. Functional Sparse Coding
● The sparse linear model use 8 codewords
while the functional uses 7 with a lower
residual sum of squares (RSS)
● Also, the linear and nonlinear versions learn
different codewords.
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
7. Discussion Points (1)
● As the authors said, SpAM is essentially a functional version of the
grouped lasso. Then, are there any formulations for functional versions of
other methods - e.g. ridge, fused lasso? Finding a generalized functional
version of lasso families will be an interesting problem
○ Functional logistic regression with fused lasso penalty (FLR-FLP)
7. Discussion Points (1)
● Objective function = FLR loss + lasso penalty + fused lasso penalty
● FLR loss
● gamma = coefficient in functional parameters
7. Discussion Points (2)
● How might we handle group sparsity in additive models(GroupSpAM) as
an analogy to GroupLasso?
7. Discussion Points (2)
● How might we handle group sparsity in additive models(GroupSpAM) as
an analogy to GroupLasso.
○ First we assume G is a partition of {1,..,p} and that G’s do not overlap
○ The optimization problem then becomes
7. Discussion Points (2)
● How might we handle group sparsity in additive models(GroupSpAM) as
an analogy to GroupLasso.
● The regularization term becomes
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has some connections to bias-variance tradeoffs.
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has some connections to bias-variance tradeoffs.
● Let’s suppose we make our estimates too smooth. What may we expect
then?
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has some connections to bias-variance tradeoffs.
● Let’s suppose we make our estimates too smooth. What may we expect
then?
○ If our estimates are too smooth, we risk bias.
■ Thus me wake erroneous assumptions about the underlying functions.
■ In this case we miss relevant relations between features and targets. Thus we
underfit.
7. Discussion Points (3)
● What if we make rough estimates. What may we expect then?
7. Discussion Points (3)
● What if we make rough estimates. What may we expect then?
○ We risk variance. What does this mean?
7. Discussion Points (3)
● What if we make rough estimates. What may we expect then?
○ We risk variance. What does this mean?
● The learned model becomes sensitive to small variations in the data. Thus
we overfit.
We must keep a balance between bias and variance by using an appropriate
level of smoothing.
7. Discussion Points (4)
● Some Notes on Practicality
With modern computing power, can you think of situations where a linear
sparsity inducing model such as lasso may be preferred over Sparse
Additive Models?
7. Discussion Points (4)
● Some Notes on Practicality
With modern computing power, can you think of situations where a linear
sparsity inducing model such as lasso may be preferred over Sparse
Additive Models?
● Our data analysis is guided by a credible scientific theory which
asserts linear relationships among the variables we measure.
7. Discussion Points (4)
● Some Notes on Practicality
With modern computing power, can you think of situations where a linear
sparsity inducing model such as lasso may be preferred over Sparse
Additive Models?
● Our data analysis is guided by a credible scientific theory which
asserts linear relationships among the variables we measure.
● Our data set is so massive that either the extra processing time, or the
extra computer memory needed to fit and store an additive rather than
a linear model is prohibitive.
Thank you for listening
152

More Related Content

PDF
PPTX
Computational Assignment Help
PDF
20110319 parameterized algorithms_fomin_lecture03-04
PDF
Understanding Dynamic Programming through Bellman Operators
PPTX
Simplification of cfg ppt
PPTX
Software Construction Assignment Help
PDF
PDF
Computational Assignment Help
20110319 parameterized algorithms_fomin_lecture03-04
Understanding Dynamic Programming through Bellman Operators
Simplification of cfg ppt
Software Construction Assignment Help

What's hot (20)

PDF
Polynomial Kernel for Interval Vertex Deletion
PPTX
Electrical Engineering Exam Help
PDF
Phase Responce of Pole zero
PDF
Oct.22nd.Presentation.Final
PDF
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
PDF
Filter Designing
PDF
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
PDF
Kernel for Chordal Vertex Deletion
PPTX
Position analysis and dimensional synthesis
PDF
Polylogarithmic approximation algorithm for weighted F-deletion problems
PPTX
Computer Science Assignment Help
PPTX
Basics of Integration and Derivatives
PDF
Colloquium presentation
PDF
2015 CMS Winter Meeting Poster
PPTX
Digital Signal Processing Homework Help
PDF
Fine Grained Complexity
PDF
computervision project
PDF
Guarding Polygons via CSP
Polynomial Kernel for Interval Vertex Deletion
Electrical Engineering Exam Help
Phase Responce of Pole zero
Oct.22nd.Presentation.Final
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
Filter Designing
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Kernel for Chordal Vertex Deletion
Position analysis and dimensional synthesis
Polylogarithmic approximation algorithm for weighted F-deletion problems
Computer Science Assignment Help
Basics of Integration and Derivatives
Colloquium presentation
2015 CMS Winter Meeting Poster
Digital Signal Processing Homework Help
Fine Grained Complexity
computervision project
Guarding Polygons via CSP
Ad

Similar to Sparse Additive Models (SPAM) (20)

PDF
Sparsenet
PDF
Sparsity by worst-case quadratic penalties
PDF
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
PDF
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
PDF
QMC: Operator Splitting Workshop, Sparse Non-Parametric Regression - Noah Sim...
PDF
Lecture5 kernel svm
PDF
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
PDF
Stochastic Alternating Direction Method of Multipliers
PDF
Sparse Regularization
PDF
MASSS_Presentation_20160209
PDF
Recursive Compressed Sensing
PDF
Lec17 sparse signal processing & applications
PDF
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
PDF
Tuto cvpr part1
PDF
Low Complexity Regularization of Inverse Problems
PDF
Matrix Padding Method for Sparse Signal Reconstruction
PDF
Nonconvex Compressed Sensing with the Sum-of-Squares Method
PDF
Banque de France's Workshop on Granularity: Xavier Gabaix slides, June 2016
PDF
SURF 2012 Final Report(1)
PPTX
Introduction to TreeNet (2004)
Sparsenet
Sparsity by worst-case quadratic penalties
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
QMC: Operator Splitting Workshop, Sparse Non-Parametric Regression - Noah Sim...
Lecture5 kernel svm
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Stochastic Alternating Direction Method of Multipliers
Sparse Regularization
MASSS_Presentation_20160209
Recursive Compressed Sensing
Lec17 sparse signal processing & applications
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
Tuto cvpr part1
Low Complexity Regularization of Inverse Problems
Matrix Padding Method for Sparse Signal Reconstruction
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Banque de France's Workshop on Granularity: Xavier Gabaix slides, June 2016
SURF 2012 Final Report(1)
Introduction to TreeNet (2004)
Ad

More from Jeongmin Cha (8)

PDF
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
PDF
Causal Effect Inference with Deep Latent-Variable Models
PDF
Composing graphical models with neural networks for structured representatio...
PPTX
Waterful Application (iOS + AppleWatch)
PPTX
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
PPTX
시스템 프로그램 설계1 최종발표
PPTX
마이크로프로세서 응용(2013-2)
PPTX
최종발표
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
Causal Effect Inference with Deep Latent-Variable Models
Composing graphical models with neural networks for structured representatio...
Waterful Application (iOS + AppleWatch)
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
시스템 프로그램 설계1 최종발표
마이크로프로세서 응용(2013-2)
최종발표

Recently uploaded (20)

PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PDF
Fluorescence-microscope_Botany_detailed content
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPT
Quality review (1)_presentation of this 21
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Introduction to machine learning and Linear Models
IBA_Chapter_11_Slides_Final_Accessible.pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Fluorescence-microscope_Botany_detailed content
Clinical guidelines as a resource for EBP(1).pdf
Business Ppt On Nestle.pptx huunnnhhgfvu
Business Acumen Training GuidePresentation.pptx
climate analysis of Dhaka ,Banglades.pptx
.pdf is not working space design for the following data for the following dat...
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Data_Analytics_and_PowerBI_Presentation.pptx
IB Computer Science - Internal Assessment.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Qualitative Qantitative and Mixed Methods.pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
oil_refinery_comprehensive_20250804084928 (1).pptx
Quality review (1)_presentation of this 21
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Reliability_Chapter_ presentation 1221.5784
Introduction to machine learning and Linear Models

Sparse Additive Models (SPAM)

  • 1. CS592 Presentation #5 Sparse Additive Models 20173586 Jeongmin Cha 20174463 Jaesung Choe 20184144 Andries Bruno 1
  • 2. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 3. 1. Brief of Additive Models
  • 4. 1. Brief of Additive Models
  • 5. 1. Brief of Additive Models
  • 6. 1. Brief of Additive Models
  • 7. 1. Brief of Additive Models
  • 8. 1. Introduction ● Combine ideas from Sparse Linear Models Additive Nonparametric regression Sparse Additive Models (SpAM) Backfitting sparsity constraint
  • 9. 1. Introduction ● SpAM ⋍ additive nonparametric regression model ○ but, + sparsity constraint on ○ functional version of group lasso ● Nonparametric regression model relaxes the strong assumptions made by a linear model
  • 10. 1. Introduction ● The authors show the estimator of ● 1. Sparsistence (Sparsity pattern consistency) ○ SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically ● 2. Persistence ○ the estimator is persistent, predictive risk of estimator converges
  • 11. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 12. ● Data Representation ● Additive Nonparametric model 2. Notation and Assumption
  • 13. 2. Notation and Assumption ● P = the joint distribution of ( Xi , Yi ) ● The definition of L2 (P) norm (f on [0, 1]):
  • 14. ● On each dimension, ● hilbert subspace of L2 (P) of P-measurable functions ● zero mean ● The hilbert subspace has the inner product ● hilbert space of dimensional functions in the additive form 2. Notation and Assumption
  • 15. ● uniformly bounded, orthonormal basis on [0,1] ● The dimensional function 2. Notation and Assumption
  • 16. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 17. #1. Formulate to the population SpAM 3. Sparse Backfitting Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function)
  • 18. #1. Formulate to the population SpAM 3. Sparse Backfitting Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Additive model optimization problem
  • 19. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting Sparse additive model optimization problem
  • 20. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting (Red) : functional mapping (Xj ↦ g(Xj) ↦β*g(Xj)) (Green) : coefficients β would become sparse. : Lasso Sparse additive model optimization problem
  • 21. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting q q Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse(!!) additive model
  • 22. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso
  • 23. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Sampled!
  • 24. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Soft thresholding
  • 25. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Backfitting algorithm
  • 26. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Backfitting algorithm (Theorem. 1)
  • 27. 3. Sparse Backfitting #2. Theorem 1 From the penalized lagrangian form,
  • 28. 3. Sparse Backfitting #2. Theorem 1 says From the penalized lagrangian form, The minimizers ( ) satisfy where denotes projection matrix, represents residual matrix and means the positive part.
  • 29. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting
  • 30. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
  • 31. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is,
  • 32. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( )
  • 33. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding.
  • 34. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding. Discussion points: Why do you think theorem 1 is important ??
  • 35. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding. Discussion points: Why do you think theorem 1 is important ?? : Only positive parts survive such that function becomes sparse.
  • 36. 3. Sparse Backfitting #3. Backfitting algorithm - According to theorem 1, - Estimate smoother projection matrix where - Flow of the backfitting algorithm
  • 37. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix
  • 38. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix Discussion points: Now, we understand that SpAM is the functional version of group lasso. Then SpAM is alway better than lasso or group lasso?
  • 39. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix Discussion points: Now, we understand that SpAM is the functional version of group lasso. Then SpAM is alway better than lasso or group lasso? (Hint : lasso is linear model, and SpAM is …? )
  • 40. For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity). 3. Sparse Backfitting [1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140. GAMSEL:: 1 VS SpAM: vs Lasso: [2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307.
  • 41. For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity). 3. Sparse Backfitting [1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140. GAMSEL:: 1 VS SpAM: vs Lasso: [2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307. Discussion points: Now, we understand that SpAM is the functional version of group lasso. Then SpAM is alway better than lasso or group lasso? (Hint : lasso is linear model, and SpAM is …? ) When there is non-linearity, SpAM can be effective.
  • 42. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 44. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 45. 6.1. Synthetic Data ● Generate 150 samples from a 200-dimensional additive model ● The remaining 196 features are irrelevant and are set to zero plus a 0 mean gaussian noise.
  • 46. 6.1. Synthetic Data ● Empirical probability of selecting the true four variables as a function of the sample size n The same thresholding phenomenon that was shown in the lasso is observed.
  • 47. 6.1. Synthetic Data ● Empirical probability of selecting the true four variables as a function of the sample size n
  • 48. 6.2. Boston Housing ● There are 506 observations with 10 covariates. ● To explore the sparsistency properties of SpAM, 20 irrelevant variables are added. ● Ten of those are randomly drawn from Uniform(0, 1) ● The remainder are permutations of the original 10 covariates.
  • 49. 6.2. Boston Housing ● SpAM identifies 6 of the nonzero components. ● Both types of irrelevant variables are correctly zeroed out.
  • 50. 6.3. SpAM for Spam ● Dataset consists of 3,065 emails which serve as training set. ● 57 attributes are available. These are all numeric ● Attributes measure the percentage of specific words in an email, the average and maximum run lengths of uppercase letters. ● Sample on 300 emails from the training set and use the remainder as test set.
  • 51. 6.3. SpAM for Spam Best model
  • 52. 6.3. SpAM for Spam The 33 selected variables cover 80% of the significant predictors.
  • 53. 6.3. Functional Sparse Coding ● Here we compare SpAM with lasso. We consider natural images. ● The problem setup is as follows: ● y is the data to be represented. X is an nxp matrix with columns X_j’s vectors to be learned. The L1 penalty encourages sparsity in the coefficients.
  • 54. 6.3. Functional Sparse Coding ● y is the data to be represented. X is an nxp matrix with columns X_j’s vectors to be learned. The L1 penalty encourages sparsity in the coefficients. ● Sparsity allows specialization of features and enforces capturing of salient properties of the data.
  • 55. 6.3. Functional Sparse Coding ● When solved with lasso and SGD, 200 codewords that capture edge features at different scales and spatial orientations are learned:
  • 56. 6.3. Functional Sparse Coding ● In the functional version, no assumption of linearity is made between X and y. Instead, the following additive model is used: ● This leads to the following optimization problem:
  • 57. 6.3. Functional Sparse Coding ● Which model is lasso and which is SpAM?
  • 58. 6.3. Functional Sparse Coding ● What about expressiveness?
  • 59. 6.3. Functional Sparse Coding ● The sparse linear model use 8 codewords while the functional uses 7 with a lower residual sum of squares (RSS) ● Also, the linear and nonlinear versions learn different codewords.
  • 60. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 61. 7. Discussion Points (1) ● As the authors said, SpAM is essentially a functional version of the grouped lasso. Then, are there any formulations for functional versions of other methods - e.g. ridge, fused lasso? Finding a generalized functional version of lasso families will be an interesting problem ○ Functional logistic regression with fused lasso penalty (FLR-FLP)
  • 62. 7. Discussion Points (1) ● Objective function = FLR loss + lasso penalty + fused lasso penalty ● FLR loss ● gamma = coefficient in functional parameters
  • 63. 7. Discussion Points (2) ● How might we handle group sparsity in additive models(GroupSpAM) as an analogy to GroupLasso?
  • 64. 7. Discussion Points (2) ● How might we handle group sparsity in additive models(GroupSpAM) as an analogy to GroupLasso. ○ First we assume G is a partition of {1,..,p} and that G’s do not overlap ○ The optimization problem then becomes
  • 65. 7. Discussion Points (2) ● How might we handle group sparsity in additive models(GroupSpAM) as an analogy to GroupLasso. ● The regularization term becomes
  • 66. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions?
  • 67. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs.
  • 68. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs. ● Let’s suppose we make our estimates too smooth. What may we expect then?
  • 69. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs. ● Let’s suppose we make our estimates too smooth. What may we expect then? ○ If our estimates are too smooth, we risk bias. ■ That is, we make erroneous assumptions about the underlying functions. ■ In this case we miss relevant relations between features and targets, so we underfit.
  • 70. 7. Discussion Points (3) ● What if we make our estimates too rough? What may we expect then?
  • 71. 7. Discussion Points (3) ● What if we make our estimates too rough? What may we expect then? ○ We risk variance. What does this mean?
  • 72. 7. Discussion Points (3) ● What if we make our estimates too rough? What may we expect then? ○ We risk variance. What does this mean? ● The learned model becomes sensitive to small variations in the data, so we overfit. We must balance bias and variance by choosing an appropriate level of smoothing (a toy illustration follows).
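A quick numerical illustration of this trade-off (not from the paper; the data, bandwidths, and smoother are our own assumptions) is to vary the bandwidth of a kernel smoother on noisy data and compare held-out error:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
x_tr, y_tr = x[::2], y[::2]          # even-indexed points for training
x_te, y_te = x[1::2], y[1::2]        # odd-indexed points held out

def nw_predict(x_query, x_train, y_train, bandwidth):
    """Nadaraya-Watson prediction at x_query with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

for h in (0.005, 0.05, 0.5):         # rough, moderate, oversmoothed
    test_mse = np.mean((nw_predict(x_te, x_tr, y_tr, h) - y_te) ** 2)
    print(f"bandwidth={h:<6} test MSE={test_mse:.3f}")
```

Typically the smallest bandwidth overfits (test error driven by variance), the largest underfits (test error driven by bias), and the intermediate choice does best.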
  • 73. 7. Discussion Points (4) ● Some Notes on Practicality: With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over Sparse Additive Models?
  • 74. 7. Discussion Points (4) ● Some Notes on Practicality: With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over Sparse Additive Models? ● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure.
  • 75. 7. Discussion Points (4) ● Some Notes on Practicality: With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over Sparse Additive Models? ● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure. ● Our data set is so massive that the extra processing time or the extra memory needed to fit and store an additive rather than a linear model is prohibitive.
  • 76. Thank you for listening 76