CS592 Presentation #5
Sparse Additive Models
20173586 Jeongmin Cha
20174463 Jaesung Choe
20184144 Andries Bruno
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
1. Brief of Additive Models
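As background for this section (our own summary, not text recovered from the slides): an additive model writes the regression function as a sum of one-dimensional smooth functions and is classically fit by backfitting, i.e. by repeatedly smoothing the partial residual on each coordinate.

$$ Y_i = \alpha + \sum_{j=1}^{p} f_j(X_{ij}) + \varepsilon_i, \qquad \hat f_j \leftarrow \mathcal{S}_j\Big[\, Y - \hat\alpha - \sum_{k \neq j} \hat f_k \,\Big] $$

Here S_j denotes a one-dimensional smoother applied to the j-th covariate; the update is cycled over j = 1, ..., p until convergence.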
1. Introduction
● Combine ideas from sparse linear models (a sparsity constraint) and additive nonparametric regression (fit by backfitting) to obtain Sparse Additive Models (SpAM).
1. Introduction
● SpAM ⋍ an additive nonparametric regression model
○ but with a sparsity constraint on the component functions
○ a functional version of the group lasso
● A nonparametric regression model relaxes the strong assumptions made by a linear model
1. Introduction
● The authors show the estimator of
● 1. Sparsistence (Sparsity pattern consistency)
○ SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically
● 2. Persistence
○ the estimator is persistent, predictive risk of estimator converges
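Informally (our paraphrase, in generic notation): sparsistency means the estimated set of nonzero components matches the true set with probability tending to one, and persistence means the excess predictive risk over the best model in the class vanishes.

$$ \mathbb{P}\big( \operatorname{supp}(\hat f) = \operatorname{supp}(f^{*}) \big) \to 1, \qquad R(\hat f_n) - \inf_{f \in \mathcal{M}_n} R(f) \xrightarrow{P} 0, \quad R(f) = \mathbb{E}\,(Y - f(X))^2 $$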
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
2. Notation and Assumptions
● Data representation
● Additive nonparametric model
2. Notation and Assumptions
● P = the joint distribution of (X_i, Y_i)
● The L2(P) norm of a function f on [0, 1]:
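The displayed definition was an image; a hedged reconstruction in our notation:

$$ \| f \|^2 = \mathbb{E}\,[ f(X)^2 ] = \int_{[0,1]} f^2(x)\, \mathrm{d}P_X(x) $$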
2. Notation and Assumptions
● On each dimension j:
● H_j = the Hilbert subspace of L2(P) of P-measurable functions f_j(x_j) of the single variable x_j
● with zero mean: E[f_j(X_j)] = 0
● The Hilbert subspace H_j has the inner product ⟨f_j, f_j'⟩ = E[f_j(X_j) f_j'(X_j)]
● H = H_1 ⊕ ⋯ ⊕ H_p = the Hilbert space of p-dimensional functions in the additive form f(x) = f_1(x_1) + ⋯ + f_p(x_p)
2. Notation and Assumptions
● {ψ_jk} = a uniformly bounded, orthonormal basis on [0, 1]
● The j-th dimensional function f_j is expanded in this basis (sketched below)
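A hedged sketch of the basis expansion the slide refers to (notation ours; in practice the sum is truncated):

$$ f_j(x_j) = \sum_{k=1}^{\infty} \beta_{jk}\, \psi_{jk}(x_j), \qquad \beta_{jk} = \mathbb{E}\,[ f_j(X_j)\, \psi_{jk}(X_j) ] $$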
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
3. Sparse Backfitting
#1. Formulate the population SpAM
● Eq. 8: standard form of the additive model optimization problem (the standard additive model).
● Eq. 9: its penalized Lagrangian form (the objective function).
● Eq. 10: the design choice of SpAM; β is a scaling parameter and g is a function in the Hilbert space. The functional mapping is Xj ↦ g(Xj) ↦ β·g(Xj), and, as in the lasso, the coefficients β become sparse.
● Eq. 11: the penalized Lagrangian form of Eq. 10 and the sample version of Eq. 9: the sparse additive model (SpAM).
● Ψ: basis functions that linearly span the function g, where q ≤ p; the basis-expanded functions g are grouped through Ψ ➝ a functional version of the group lasso.
● Minimizing the sampled objective yields a soft-thresholding update and a backfitting algorithm (Theorem 1); hedged reconstructions of these equations follow below.
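The displayed equations were images; the following is our hedged reconstruction of the population SpAM problem and its Lagrangian and sample forms, following our reading of the SpAM paper (numbering and constants may differ from the slides):

$$ \min_{\beta, g}\ \mathbb{E}\Big( Y - \sum_{j=1}^{p} \beta_j g_j(X_j) \Big)^2 \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le L,\ \ \mathbb{E}[g_j(X_j)] = 0,\ \ \mathbb{E}[g_j(X_j)^2] = 1 $$

$$ \mathcal{L}(f, \lambda) = \tfrac12\, \mathbb{E}\Big( Y - \sum_{j=1}^{p} f_j(X_j) \Big)^2 + \lambda \sum_{j=1}^{p} \sqrt{ \mathbb{E}\,[ f_j(X_j)^2 ] } $$

$$ \hat{\mathcal{L}}(\beta, \lambda) = \tfrac{1}{2n}\, \Big\| Y - \sum_{j=1}^{p} \Psi_j \beta_j \Big\|_2^2 + \lambda \sum_{j=1}^{p} \tfrac{1}{\sqrt{n}}\, \| \Psi_j \beta_j \|_2 $$

The last display has the form of a group lasso over the basis coefficients of each component, which is the sense in which SpAM is a functional group lasso.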
3. Sparse Backfitting
#2. Theorem 1
From the penalized Lagrangian form, the minimizers (f_1, ..., f_p) satisfy a soft-thresholding condition,
where P_j denotes the projection of the residual onto H_j, R_j represents the residual, and [·]+ means the positive part.
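Our hedged reconstruction of the statement (the slide's equation was an image; this follows our reading of the paper):

$$ f_j = \Big[\, 1 - \frac{\lambda}{\sqrt{ \mathbb{E}\,[ P_j(X_j)^2 ] }} \,\Big]_{+} P_j, \qquad P_j = \mathbb{E}\,[ R_j \mid X_j ], \qquad R_j = Y - \sum_{k \neq j} f_k(X_k) $$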
3. Sparse Backfitting
#2. Proof of Theorem 1
#2-1. The stationary condition is obtained by setting the Fréchet directional derivative to zero.
#2-2. Using iterated expectations, the condition can be rewritten in terms of the conditional expectation P_j = E[R_j | X_j].
#2-3. From this we obtain the equivalence between the stationary condition and a rescaling of P_j.
#2-4. This gives the soft-thresholding update for the function f_j: only the positive part survives after thresholding.
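A hedged sketch of the missing displays in steps #2-1 to #2-4 (our reconstruction of the paper's argument):

$$ \text{#2-1:}\quad \mathbb{E}\big[ (f_j(X_j) - R_j)\, v_j(X_j) \big] + \lambda\, \mathbb{E}\big[ u_j(X_j)\, v_j(X_j) \big] = 0 \quad \text{for all } v_j \in \mathcal{H}_j $$

where u_j is a subgradient of the penalty, with u_j = f_j / sqrt(E[f_j^2]) when f_j ≠ 0.

$$ \text{#2-2:}\quad f_j + \lambda\, u_j = \mathbb{E}[ R_j \mid X_j ] = P_j \quad \text{a.e.} $$

$$ \text{#2-3:}\quad \Big( 1 + \frac{\lambda}{\sqrt{\mathbb{E}[f_j^2]}} \Big) f_j = P_j \ \ \text{if } f_j \neq 0, \qquad \sqrt{\mathbb{E}[P_j^2]} \le \lambda \ \Rightarrow\ f_j = 0 $$

$$ \text{#2-4:}\quad f_j = \Big[\, 1 - \frac{\lambda}{\sqrt{\mathbb{E}[P_j^2]}} \,\Big]_{+} P_j $$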
Discussion point:
Why do you think Theorem 1 is important?
: Only the positive part survives the thresholding, so entire component functions are set exactly to zero and the fitted model becomes sparse.
3. Sparse Backfitting
#3. Backfitting algorithm
- According to Theorem 1, each component is updated by projecting the residual and then soft-thresholding.
- The projection P_j = E[R_j | X_j] is estimated with a smoother matrix S_j applied to the residual.
- Flow of the backfitting algorithm: cycle this update over j = 1, ..., p until convergence (see the Python sketch below).
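A minimal Python sketch of this loop, assuming user-supplied one-dimensional smoothers (the `smoothers` callables and all names here are ours, not code from the paper):

```python
import numpy as np

def spam_backfitting(X, y, smoothers, lam, n_iters=50, tol=1e-6):
    """Sketch of SpAM backfitting: smooth the partial residual on each
    coordinate, then soft-threshold the whole component function."""
    n, p = X.shape
    f = np.zeros((n, p))                      # fitted values f_j(X_ij)
    y = y - y.mean()                          # work with a centered response
    for _ in range(n_iters):
        f_old = f.copy()
        for j in range(p):
            r_j = y - f.sum(axis=1) + f[:, j]      # residual without component j
            p_j = smoothers[j](X[:, j], r_j)       # estimate P_j = E[R_j | X_j]
            s_j = np.sqrt(np.mean(p_j ** 2))       # estimate sqrt(E[P_j^2])
            shrink = max(0.0, 1.0 - lam / s_j) if s_j > 0 else 0.0
            f[:, j] = shrink * p_j                 # soft-threshold the function
            f[:, j] -= f[:, j].mean()              # recenter so mean(f_j) = 0
        if np.max(np.abs(f - f_old)) < tol:
            break
    return f
```

For example, `smoothers[j]` could be any kernel or spline smoother that fits the residual `r_j` on `X[:, j]` and returns the fitted values.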
3. Sparse Backfitting
#4. SpAM backfitting algorithm vs. coordinate descent (lasso)
The SpAM backfitting algorithm is the functional version of the coordinate descent algorithm for the lasso:
: functional mapping (a function update instead of a scalar coefficient update)
: iterate through the coordinates j = 1, ..., p
: smoothing matrix S_j

Discussion points:
Now we understand that SpAM is the functional version of the group lasso.
Is SpAM then always better than the lasso or the group lasso?
(Hint: the lasso is a linear model, and SpAM is ...?)
3. Sparse Backfitting
For simplicity, think of SPLAM [1] as a combination of SpAM (non-linearity) and the lasso (linearity).
Comparison: GAMSEL vs. SpAM vs. lasso (see [1], [2]).
Answer: the lasso is a linear model, while SpAM is a non-linear (additive) one; when there is non-linearity in the data, SpAM can be effective.
[1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
[2] Hastie, Trevor J. "Generalized additive models." Statistical Models in S. Routledge, 2017. 249-307.
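For comparison with the SpAM loop above, a minimal sketch of coordinate descent for the lasso, where each update is a scalar soft-threshold (our illustration, assuming standardized columns):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Coordinate descent for the lasso, assuming columns scaled so that
    (1/n) * x_j @ x_j == 1; compare with the functional updates of SpAM."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r_j / n                  # correlation with x_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft threshold
    return beta
```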
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
4. Choosing regularization parameter
#5. Risk estimation: the tuning parameter λ of the SpAM backfitting algorithm is chosen by estimating the risk of the fitted model (a generic stand-in sketch follows).
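The slide's risk estimator did not survive extraction; as a stand-in, a generic hold-out selection of λ (explicitly not the paper's estimator; `fit_spam` is a hypothetical wrapper around the backfitting sketch above):

```python
import numpy as np

def choose_lambda(X, y, lambdas, fit_spam, val_frac=0.25, seed=0):
    """Pick lambda by hold-out risk. `fit_spam(X_tr, y_tr, lam)` must
    return a `predict(X)` callable."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    risks = [np.mean((y[val] - fit_spam(X[tr], y[tr], lam)(X[val])) ** 2)
             for lam in lambdas]
    return lambdas[int(np.argmin(risks))], risks
```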
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
6.1. Synthetic Data
● Generate n = 150 samples from a 200-dimensional additive model in which only four component functions are nonzero.
● The remaining 196 features are irrelevant: their component functions are set to zero, and the response includes zero-mean Gaussian noise (see the sketch below).
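A sketch of this setup in Python; the four nonzero component functions below are illustrative stand-ins, not the exact functions used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 200
X = rng.uniform(0.0, 1.0, size=(n, p))
# four relevant (hypothetical) component functions on the first four features
f1 = np.sin(2 * np.pi * X[:, 0])
f2 = (2 * X[:, 1] - 1) ** 2
f3 = X[:, 2] - 0.5
f4 = np.exp(X[:, 3]) - np.exp(X[:, 3]).mean()
# the remaining 196 features do not enter the regression function
y = f1 + f2 + f3 + f4 + rng.normal(scale=1.0, size=n)
```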
6.1. Synthetic Data
● Empirical probability of selecting the true four variables, as a function of the sample size n.
● The same thresholding phenomenon that was shown for the lasso is observed.
6.2. Boston Housing
● There are 506 observations with 10 covariates.
● To explore the sparsistency properties of SpAM, 20 irrelevant variables are added (see the sketch after this list).
● Ten of those are randomly drawn from Uniform(0, 1).
● The remainder are permutations of the original 10 covariates.
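A small sketch of this augmentation step (`X_orig` below is a random placeholder standing in for the 506 × 10 Boston design):

```python
import numpy as np

rng = np.random.default_rng(0)
X_orig = rng.uniform(size=(506, 10))     # placeholder for the 506 x 10 Boston covariates
n = X_orig.shape[0]
uniform_cols = rng.uniform(size=(n, 10))        # 10 irrelevant Uniform(0, 1) variables
permuted_cols = X_orig[rng.permutation(n), :]   # 10 row-permuted copies of the originals
X_aug = np.hstack([X_orig, uniform_cols, permuted_cols])   # 506 x 30 augmented design
```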
6.2. Boston Housing
● SpAM identifies 6 of the nonzero components.
● Both types of irrelevant variables are correctly zeroed out.
6.3. SpAM for Spam
● The dataset consists of 3,065 emails, which serve as the training set.
● 57 attributes are available; all are numeric.
● The attributes measure the percentage of specific words in an email and the average and maximum run lengths of uppercase letters.
● Sample 300 emails from the training set and use the remainder as the test set.
6.3. SpAM for Spam
Best model
6.3. SpAM for Spam
The 33 selected variables cover 80% of the significant predictors.
6.3. Functional Sparse Coding
● Here we compare SpAM with the lasso on natural images.
● The problem setup is as follows:
● y is the data to be represented; X is an n×p matrix whose columns X_j are vectors to be learned; the L1 penalty encourages sparsity in the coefficients.
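A hedged sketch of the sparse coding objective being described (the slide's display was an image; notation ours):

$$ \min_{X,\, \beta}\ \tfrac12\, \| y - X \beta \|_2^2 + \lambda\, \| \beta \|_1 $$

In dictionary learning, y is a data vector (e.g. an image patch), X is the dictionary of codewords, and the two are typically optimized alternately.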
6.3. Functional Sparse Coding
● Sparsity allows specialization of features and enforces the capture of salient properties of the data.
6.3. Functional Sparse Coding
● When solved with lasso and SGD, 200 codewords that capture edge
features at different scales and spatial orientations are learned:
6.3. Functional Sparse Coding
● In the functional version, no assumption of linearity is made between X and
y. Instead, the following additive model is used:
● This leads to the following optimization problem:
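A hedged reconstruction of the additive model and objective referred to above (notation ours):

$$ y_i \approx \sum_{j=1}^{p} f_j(x_{ij}), \qquad \min_{\{f_j\}}\ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \lambda \sum_{j=1}^{p} \sqrt{ \tfrac{1}{n} \sum_{i=1}^{n} f_j(x_{ij})^2 } $$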
6.3. Functional Sparse Coding
● Which model is lasso and which is SpAM?
6.3. Functional Sparse Coding
● What about expressiveness?
6.3. Functional Sparse Coding
● The sparse linear model uses 8 codewords, while the functional version uses 7 with a lower residual sum of squares (RSS).
● Also, the linear and nonlinear versions learn different codewords.
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
7. Discussion Points (1)
● As the authors note, SpAM is essentially a functional version of the grouped lasso. Are there formulations for functional versions of other methods, e.g. ridge or the fused lasso? Finding a generalized functional version of the lasso family would be an interesting problem.
○ Example: functional logistic regression with a fused lasso penalty (FLR-FLP)
7. Discussion Points (1)
● Objective function = FLR loss + lasso penalty + fused lasso penalty (sketched below)
● FLR loss = the negative log-likelihood of functional logistic regression
● γ = the coefficients of the functional parameter in its basis expansion
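A hedged sketch of such an objective (our reconstruction; the exact FLR-FLP formulation may differ):

$$ \min_{\gamma}\ -\sum_{i=1}^{n} \Big[ y_i \eta_i - \log\big( 1 + e^{\eta_i} \big) \Big] + \lambda_1 \sum_{k} | \gamma_k | + \lambda_2 \sum_{k \ge 2} | \gamma_k - \gamma_{k-1} |, \qquad \eta_i = \gamma_0 + \int x_i(t)\, \beta(t)\, \mathrm{d}t, \quad \beta(t) = \sum_{k} \gamma_k\, \phi_k(t) $$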
7. Discussion Points (2)
● How might we handle group sparsity in additive models (GroupSpAM), as an analogy to the group lasso?
○ First, assume G is a partition of {1, ..., p} and that the groups do not overlap.
○ The optimization problem keeps the additive-model loss, and the regularization term becomes a group-wise penalty on whole sets of component functions (sketched below).
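A hedged sketch of the GroupSpAM objective (our reading; group weights such as the square root of the group size may appear in the actual formulation):

$$ \min_{\{f_j\}}\ \tfrac12\, \mathbb{E}\Big( Y - \sum_{j=1}^{p} f_j(X_j) \Big)^2 + \lambda \sum_{g \in G} \sqrt{ \sum_{j \in g} \mathbb{E}\,[ f_j(X_j)^2 ] } $$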
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has a close connection to the bias-variance tradeoff.
● Suppose we make our estimates too smooth. What may we expect then?
○ If our estimates are too smooth, we risk bias.
■ We make erroneous assumptions about the underlying functions.
■ In this case we miss relevant relations between features and targets, and we underfit.
7. Discussion Points (3)
● What if we make our estimates too rough? What may we expect then?
○ We risk variance. What does this mean?
● The learned model becomes sensitive to small variations in the data, and we overfit.
We must keep a balance between bias and variance by using an appropriate level of smoothing.
7. Discussion Points (4)
● Some notes on practicality: with modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over sparse additive models?
● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure.
● Our data set is so massive that either the extra processing time or the extra computer memory needed to fit and store an additive rather than a linear model is prohibitive.
Thank you for listening
CS592 Presentation #5
Sparse Additive Models
20173586 Jeongmin Cha
20174463 Jaesung Choe
20184144 Andries Bruno
77
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
1. Brief of Additive Models
1. Brief of Additive Models
1. Brief of Additive Models
1. Brief of Additive Models
1. Brief of Additive Models
1. Introduction
● Combine ideas from
Sparse
Linear
Models
Additive
Nonparametric
regression
Sparse Additive Models (SpAM)
Backfitting
sparsity
constraint
1. Introduction
● SpAM ⋍ additive nonparametric regression model
○ but, + sparsity constraint on
○ functional version of group lasso
● Nonparametric regression model
relaxes the strong assumptions made by a linear model
1. Introduction
● The authors show the estimator of
● 1. Sparsistence (Sparsity pattern consistency)
○ SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically
● 2. Persistence
○ the estimator is persistent, predictive risk of estimator converges
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
● Data Representation
● Additive Nonparametric model
2. Notation and Assumption
2. Notation and Assumption
● P = the joint distribution of ( Xi
, Yi
)
● The definition of L2
(P) norm (f on [0, 1]):
● On each dimension,
● hilbert subspace of L2
(P) of P-measurable functions
● zero mean
● The hilbert subspace has the inner product
● hilbert space
of dimensional functions in the additive form
2. Notation and Assumption
● uniformly bounded, orthonormal basis on [0,1]
● The dimensional function
2. Notation and Assumption
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
#1. Formulate to the population SpAM
3. Sparse Backfitting
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
#1. Formulate to the population SpAM
3. Sparse Backfitting
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Additive model optimization problem
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
Sparse additive model optimization problem
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
(Red) : functional mapping (Xj ↦ g(Xj) ↦β*g(Xj))
(Green) : coefficients β would become sparse.
: Lasso
Sparse additive model optimization problem
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
q q
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse(!!) additive model
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Sampled!
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Soft
thresholding
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Backfitting algorithm
#1. Formulate to the population SpAM
Eq8. standard form of additive model optimization problem
Eq9. Penalized Lagrangian form (objective function)
Eq10. Design choice of SpAM. (beta) is the scaling parameter
and (function g) is function in Hilbert space.
Standard additive model
3. Sparse Backfitting
q q
Ψ: base function to linearly span to the function g.
where q <= p.
Linearly dependent functions g are grouped to Ψ.
➝ Functional version of group lasso.
Eq11. Penalized Lagrangian form of Eq10 and sample version
of Eq9.
: Lasso
Sparse additive model (SpAM)
Backfitting algorithm
(Theorem. 1)
3. Sparse Backfitting
#2. Theorem 1
From the penalized lagrangian form,
3. Sparse Backfitting
#2. Theorem 1 says
From the penalized lagrangian form,
The minimizers ( ) satisfy
where denotes projection matrix, represents residual matrix
and means the positive part.
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( )
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function
Only positive parts survive after thresholding.
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function
Only positive parts survive after thresholding.
Discussion points:
Why do you think theorem 1 is important ??
#2. Proof of Theorem 1 (show :
)
3. Sparse Backfitting
#2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
#2-2. Using iterated expectations, the above condition can be re-written as
that is,
#2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function
Only positive parts survive after thresholding.
Discussion points:
Why do you think theorem 1 is important ??
: Only positive parts survive such that function becomes sparse.
3. Sparse Backfitting
#3. Backfitting algorithm
- According to theorem 1,
- Estimate smoother projection matrix where
- Flow of the backfitting algorithm
3. Sparse Backfitting
#4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso)
SpAM backfitting algorithm is the functional version of the coordinate
descent algorithm.
: Functional mapping.
: Iterate through coordinate
: Smoothing matrix
3. Sparse Backfitting
#4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso)
SpAM backfitting algorithm is the functional version of the coordinate
descent algorithm.
: Functional mapping.
: Iterate through coordinate
: Smoothing matrix
Discussion points:
Now, we understand that SpAM is the functional version of group lasso.
Then SpAM is alway better than lasso or group lasso?
3. Sparse Backfitting
#4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso)
SpAM backfitting algorithm is the functional version of the coordinate
descent algorithm.
: Functional mapping.
: Iterate through coordinate
: Smoothing matrix
Discussion points:
Now, we understand that SpAM is the functional version of group lasso.
Then SpAM is alway better than lasso or group lasso?
(Hint : lasso is linear model, and SpAM is …? )
For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity).
3. Sparse Backfitting
[1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
GAMSEL:: 1 VS SpAM: vs Lasso:
[2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307.
For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity).
3. Sparse Backfitting
[1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
GAMSEL:: 1 VS SpAM: vs Lasso:
[2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307.
Discussion points:
Now, we understand that SpAM is the functional version of group lasso.
Then SpAM is alway better than lasso or group lasso?
(Hint : lasso is linear model, and SpAM is …? )
When there is non-linearity, SpAM can be effective.
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
SpAM backfitting algorithm
#5. Risk estimation
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
6.1. Synthetic Data
● Generate 150 samples from a 200-dimensional additive model
● The remaining 196 features are irrelevant and are set to zero plus a 0
mean gaussian noise.
6.1. Synthetic Data
● Empirical probability of selecting the true four variables as a function of
the sample size n
The same thresholding phenomenon that was
shown in the lasso is observed.
6.1. Synthetic Data
● Empirical probability of selecting the true four variables as a function of
the sample size n
6.2. Boston Housing
● There are 506 observations with 10 covariates.
● To explore the sparsistency properties of SpAM, 20 irrelevant variables are
added.
● Ten of those are randomly drawn from Uniform(0, 1)
● The remainder are permutations of the original 10 covariates.
6.2. Boston Housing
● SpAM identifies 6 of the nonzero components.
● Both types of irrelevant variables are correctly zeroed out.
6.3. SpAM for Spam
● Dataset consists of 3,065 emails which serve as training set.
● 57 attributes are available. These are all numeric
● Attributes measure the percentage of specific words in an email, the
average and maximum run lengths of uppercase letters.
● Sample on 300 emails from the training set and use the remainder as test
set.
6.3. SpAM for Spam
Best model
6.3. SpAM for Spam
The 33 selected variables cover 80% of the significant predictors.
6.3. Functional Sparse Coding
● Here we compare SpAM with lasso. We consider natural images.
● The problem setup is as follows:
● y is the data to be represented. X is an nxp matrix with columns X_j’s
vectors to be learned. The L1 penalty encourages sparsity in the
coefficients.
6.3. Functional Sparse Coding
● y is the data to be represented. X is an nxp matrix with columns X_j’s
vectors to be learned. The L1 penalty encourages sparsity in the
coefficients.
● Sparsity allows specialization of features and enforces capturing of salient
properties of the data.
6.3. Functional Sparse Coding
● When solved with lasso and SGD, 200 codewords that capture edge
features at different scales and spatial orientations are learned:
6.3. Functional Sparse Coding
● In the functional version, no assumption of linearity is made between X and
y. Instead, the following additive model is used:
● This leads to the following optimization problem:
6.3. Functional Sparse Coding
● Which model is lasso and which is SpAM?
6.3. Functional Sparse Coding
● What about expressiveness?
6.3. Functional Sparse Coding
● The sparse linear model use 8 codewords
while the functional uses 7 with a lower
residual sum of squares (RSS)
● Also, the linear and nonlinear versions learn
different codewords.
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
7. Discussion Points (1)
● As the authors said, SpAM is essentially a functional version of the
grouped lasso. Then, are there any formulations for functional versions of
other methods - e.g. ridge, fused lasso? Finding a generalized functional
version of lasso families will be an interesting problem
○ Functional logistic regression with fused lasso penalty (FLR-FLP)
7. Discussion Points (1)
● Objective function = FLR loss + lasso penalty + fused lasso penalty
● FLR loss
● gamma = coefficient in functional parameters
7. Discussion Points (2)
● How might we handle group sparsity in additive models(GroupSpAM) as
an analogy to GroupLasso?
7. Discussion Points (2)
● How might we handle group sparsity in additive models(GroupSpAM) as
an analogy to GroupLasso.
○ First we assume G is a partition of {1,..,p} and that G’s do not overlap
○ The optimization problem then becomes
7. Discussion Points (2)
● How might we handle group sparsity in additive models(GroupSpAM) as
an analogy to GroupLasso.
● The regularization term becomes
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has some connections to bias-variance tradeoffs.
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has some connections to bias-variance tradeoffs.
● Let’s suppose we make our estimates too smooth. What may we expect
then?
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has some connections to bias-variance tradeoffs.
● Let’s suppose we make our estimates too smooth. What may we expect
then?
○ If our estimates are too smooth, we risk bias.
■ Thus me wake erroneous assumptions about the underlying functions.
■ In this case we miss relevant relations between features and targets. Thus we
underfit.
7. Discussion Points (3)
● What if we make rough estimates. What may we expect then?
7. Discussion Points (3)
● What if we make rough estimates. What may we expect then?
○ We risk variance. What does this mean?
7. Discussion Points (3)
● What if we make rough estimates. What may we expect then?
○ We risk variance. What does this mean?
● The learned model becomes sensitive to small variations in the data. Thus
we overfit.
We must keep a balance between bias and variance by using an appropriate
level of smoothing.
7. Discussion Points (4)
● Some Notes on Practicality
With modern computing power, can you think of situations where a linear
sparsity inducing model such as lasso may be preferred over Sparse
Additive Models?
7. Discussion Points (4)
● Some Notes on Practicality
With modern computing power, can you think of situations where a linear
sparsity inducing model such as lasso may be preferred over Sparse
Additive Models?
● Our data analysis is guided by a credible scientific theory which
asserts linear relationships among the variables we measure.
7. Discussion Points (4)
● Some Notes on Practicality
With modern computing power, can you think of situations where a linear
sparsity inducing model such as lasso may be preferred over Sparse
Additive Models?
● Our data analysis is guided by a credible scientific theory which
asserts linear relationships among the variables we measure.
● Our data set is so massive that either the extra processing time, or the
extra computer memory needed to fit and store an additive rather than
a linear model is prohibitive.
Thank you for listening
152

More Related Content

PDF
PPTX
Computational Assignment Help
PDF
20110319 parameterized algorithms_fomin_lecture03-04
PDF
Understanding Dynamic Programming through Bellman Operators
PPTX
Simplification of cfg ppt
PPTX
Software Construction Assignment Help
PDF
PDF
Computational Assignment Help
20110319 parameterized algorithms_fomin_lecture03-04
Understanding Dynamic Programming through Bellman Operators
Simplification of cfg ppt
Software Construction Assignment Help

What's hot (20)

PDF
Polynomial Kernel for Interval Vertex Deletion
PPTX
Electrical Engineering Exam Help
PDF
Phase Responce of Pole zero
PDF
Oct.22nd.Presentation.Final
PDF
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
PDF
Filter Designing
PDF
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
PDF
Kernel for Chordal Vertex Deletion
PPTX
Position analysis and dimensional synthesis
PDF
Polylogarithmic approximation algorithm for weighted F-deletion problems
PPTX
Computer Science Assignment Help
PPTX
Basics of Integration and Derivatives
PDF
Colloquium presentation
PDF
2015 CMS Winter Meeting Poster
PPTX
Digital Signal Processing Homework Help
PDF
Fine Grained Complexity
PDF
computervision project
PDF
Guarding Polygons via CSP
Polynomial Kernel for Interval Vertex Deletion
Electrical Engineering Exam Help
Phase Responce of Pole zero
Oct.22nd.Presentation.Final
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
Filter Designing
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Kernel for Chordal Vertex Deletion
Position analysis and dimensional synthesis
Polylogarithmic approximation algorithm for weighted F-deletion problems
Computer Science Assignment Help
Basics of Integration and Derivatives
Colloquium presentation
2015 CMS Winter Meeting Poster
Digital Signal Processing Homework Help
Fine Grained Complexity
computervision project
Guarding Polygons via CSP
Ad

Similar to Sparse Additive Models (SPAM) (20)

PDF
Sparsenet
PDF
Sparsity by worst-case quadratic penalties
PDF
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
PDF
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
PDF
QMC: Operator Splitting Workshop, Sparse Non-Parametric Regression - Noah Sim...
PDF
Lecture5 kernel svm
PDF
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
PDF
Stochastic Alternating Direction Method of Multipliers
PDF
Sparse Regularization
PDF
MASSS_Presentation_20160209
PDF
Recursive Compressed Sensing
PDF
Lec17 sparse signal processing & applications
PDF
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
PDF
Tuto cvpr part1
PDF
Low Complexity Regularization of Inverse Problems
PDF
Matrix Padding Method for Sparse Signal Reconstruction
PDF
Nonconvex Compressed Sensing with the Sum-of-Squares Method
PDF
Banque de France's Workshop on Granularity: Xavier Gabaix slides, June 2016
PDF
SURF 2012 Final Report(1)
PPTX
Introduction to TreeNet (2004)
Sparsenet
Sparsity by worst-case quadratic penalties
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
QMC: Operator Splitting Workshop, Sparse Non-Parametric Regression - Noah Sim...
Lecture5 kernel svm
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Stochastic Alternating Direction Method of Multipliers
Sparse Regularization
MASSS_Presentation_20160209
Recursive Compressed Sensing
Lec17 sparse signal processing & applications
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
Tuto cvpr part1
Low Complexity Regularization of Inverse Problems
Matrix Padding Method for Sparse Signal Reconstruction
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Banque de France's Workshop on Granularity: Xavier Gabaix slides, June 2016
SURF 2012 Final Report(1)
Introduction to TreeNet (2004)
Ad

More from Jeongmin Cha (8)

PDF
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
PDF
Causal Effect Inference with Deep Latent-Variable Models
PDF
Composing graphical models with neural networks for structured representatio...
PPTX
Waterful Application (iOS + AppleWatch)
PPTX
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
PPTX
시스템 프로그램 설계1 최종발표
PPTX
마이크로프로세서 응용(2013-2)
PPTX
최종발표
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
Causal Effect Inference with Deep Latent-Variable Models
Composing graphical models with neural networks for structured representatio...
Waterful Application (iOS + AppleWatch)
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
시스템 프로그램 설계1 최종발표
마이크로프로세서 응용(2013-2)
최종발표

Recently uploaded (20)

PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PDF
Fluorescence-microscope_Botany_detailed content
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPT
Quality review (1)_presentation of this 21
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Introduction to machine learning and Linear Models
IBA_Chapter_11_Slides_Final_Accessible.pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Fluorescence-microscope_Botany_detailed content
Clinical guidelines as a resource for EBP(1).pdf
Business Ppt On Nestle.pptx huunnnhhgfvu
Business Acumen Training GuidePresentation.pptx
climate analysis of Dhaka ,Banglades.pptx
.pdf is not working space design for the following data for the following dat...
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Data_Analytics_and_PowerBI_Presentation.pptx
IB Computer Science - Internal Assessment.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Qualitative Qantitative and Mixed Methods.pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
oil_refinery_comprehensive_20250804084928 (1).pptx
Quality review (1)_presentation of this 21
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Reliability_Chapter_ presentation 1221.5784
Introduction to machine learning and Linear Models

Sparse Additive Models (SPAM)

  • 1. CS592 Presentation #5 Sparse Additive Models 20173586 Jeongmin Cha 20174463 Jaesung Choe 20184144 Andries Bruno 1
  • 2. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 3. 1. Brief of Additive Models
  • 4. 1. Brief of Additive Models
  • 5. 1. Brief of Additive Models
  • 6. 1. Brief of Additive Models
  • 7. 1. Brief of Additive Models
  • 8. 1. Introduction ● Combine ideas from Sparse Linear Models Additive Nonparametric regression Sparse Additive Models (SpAM) Backfitting sparsity constraint
  • 9. 1. Introduction ● SpAM ⋍ additive nonparametric regression model ○ but, + sparsity constraint on ○ functional version of group lasso ● Nonparametric regression model relaxes the strong assumptions made by a linear model
  • 10. 1. Introduction ● The authors show the estimator of ● 1. Sparsistence (Sparsity pattern consistency) ○ SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically ● 2. Persistence ○ the estimator is persistent, predictive risk of estimator converges
  • 11. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 12. ● Data Representation ● Additive Nonparametric model 2. Notation and Assumption
  • 13. 2. Notation and Assumption ● P = the joint distribution of ( Xi , Yi ) ● The definition of L2 (P) norm (f on [0, 1]):
  • 14. ● On each dimension, ● hilbert subspace of L2 (P) of P-measurable functions ● zero mean ● The hilbert subspace has the inner product ● hilbert space of dimensional functions in the additive form 2. Notation and Assumption
  • 15. ● uniformly bounded, orthonormal basis on [0,1] ● The dimensional function 2. Notation and Assumption
  • 16. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 17. #1. Formulate to the population SpAM 3. Sparse Backfitting Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function)
  • 18. #1. Formulate to the population SpAM 3. Sparse Backfitting Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Additive model optimization problem
  • 19. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting Sparse additive model optimization problem
  • 20. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting (Red) : functional mapping (Xj ↦ g(Xj) ↦β*g(Xj)) (Green) : coefficients β would become sparse. : Lasso Sparse additive model optimization problem
  • 21. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting q q Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse(!!) additive model
  • 22. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso
  • 23. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Sampled!
  • 24. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Soft thresholding
  • 25. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Backfitting algorithm
  • 26. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is function in Hilbert space. Standard additive model 3. Sparse Backfitting q q Ψ: base function to linearly span to the function g. where q <= p. Linearly dependent functions g are grouped to Ψ. ➝ Functional version of group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Backfitting algorithm (Theorem. 1)
  • 27. 3. Sparse Backfitting #2. Theorem 1 From the penalized lagrangian form,
  • 28. 3. Sparse Backfitting #2. Theorem 1 says From the penalized lagrangian form, The minimizers ( ) satisfy where denotes projection matrix, represents residual matrix and means the positive part.
  • 29. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting
  • 30. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
  • 31. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is,
  • 32. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( )
  • 33. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding.
  • 34. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding. Discussion points: Why do you think theorem 1 is important ??
  • 35. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding. Discussion points: Why do you think theorem 1 is important ?? : Only positive parts survive such that function becomes sparse.
  • 36. 3. Sparse Backfitting #3. Backfitting algorithm - According to theorem 1, - Estimate smoother projection matrix where - Flow of the backfitting algorithm
  • 37. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix
  • 38. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix Discussion points: Now, we understand that SpAM is the functional version of group lasso. Then SpAM is alway better than lasso or group lasso?
  • 39. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix Discussion points: Now, we understand that SpAM is the functional version of group lasso. Then SpAM is alway better than lasso or group lasso? (Hint : lasso is linear model, and SpAM is …? )
  • 40. For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity). 3. Sparse Backfitting [1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140. GAMSEL:: 1 VS SpAM: vs Lasso: [2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307.
  • 41. For simplicity, let's think SPLAM is the combination of SpAM (non-linearity) and Lasso (linearity). 3. Sparse Backfitting [1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140. GAMSEL:: 1 VS SpAM: vs Lasso: [2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307. Discussion points: Now, we understand that SpAM is the functional version of group lasso. Then SpAM is alway better than lasso or group lasso? (Hint : lasso is linear model, and SpAM is …? ) When there is non-linearity, SpAM can be effective.
  • 42. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 44. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 45. 6.1. Synthetic Data ● Generate 150 samples from a 200-dimensional additive model ● The remaining 196 features are irrelevant and are set to zero plus a 0 mean gaussian noise.
  • 46. 6.1. Synthetic Data ● Empirical probability of selecting the true four variables as a function of the sample size n The same thresholding phenomenon that was shown in the lasso is observed.
  • 47. 6.1. Synthetic Data ● Empirical probability of selecting the true four variables as a function of the sample size n
  • 48. 6.2. Boston Housing ● There are 506 observations with 10 covariates. ● To explore the sparsistency properties of SpAM, 20 irrelevant variables are added. ● Ten of those are randomly drawn from Uniform(0, 1) ● The remainder are permutations of the original 10 covariates.
  • 49. 6.2. Boston Housing ● SpAM identifies 6 of the nonzero components. ● Both types of irrelevant variables are correctly zeroed out.
  • 50. 6.3. SpAM for Spam ● Dataset consists of 3,065 emails which serve as training set. ● 57 attributes are available. These are all numeric ● Attributes measure the percentage of specific words in an email, the average and maximum run lengths of uppercase letters. ● Sample on 300 emails from the training set and use the remainder as test set.
  • 51. 6.3. SpAM for Spam Best model
  • 52. 6.3. SpAM for Spam The 33 selected variables cover 80% of the significant predictors.
  • 53. 6.3. Functional Sparse Coding ● Here we compare SpAM with lasso. We consider natural images. ● The problem setup is as follows: ● y is the data to be represented. X is an nxp matrix with columns X_j’s vectors to be learned. The L1 penalty encourages sparsity in the coefficients.
  • 54. 6.3. Functional Sparse Coding ● y is the data to be represented. X is an nxp matrix with columns X_j’s vectors to be learned. The L1 penalty encourages sparsity in the coefficients. ● Sparsity allows specialization of features and enforces capturing of salient properties of the data.
  • 55. 6.3. Functional Sparse Coding ● When solved with lasso and SGD, 200 codewords that capture edge features at different scales and spatial orientations are learned:
  • 56. 6.3. Functional Sparse Coding ● In the functional version, no assumption of linearity is made between X and y. Instead, the following additive model is used: ● This leads to the following optimization problem:
  • 57. 6.3. Functional Sparse Coding ● Which model is lasso and which is SpAM?
  • 58. 6.3. Functional Sparse Coding ● What about expressiveness?
  • 59. 6.3. Functional Sparse Coding ● The sparse linear model use 8 codewords while the functional uses 7 with a lower residual sum of squares (RSS) ● Also, the linear and nonlinear versions learn different codewords.
  • 60. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 61. 7. Discussion Points (1) ● As the authors said, SpAM is essentially a functional version of the grouped lasso. Then, are there any formulations for functional versions of other methods - e.g. ridge, fused lasso? Finding a generalized functional version of lasso families will be an interesting problem ○ Functional logistic regression with fused lasso penalty (FLR-FLP)
  • 62. 7. Discussion Points (1) ● Objective function = FLR loss + lasso penalty + fused lasso penalty ● FLR loss ● gamma = coefficient in functional parameters
  • 63. 7. Discussion Points (2) ● How might we handle group sparsity in additive models(GroupSpAM) as an analogy to GroupLasso?
  • 64. 7. Discussion Points (2) ● How might we handle group sparsity in additive models(GroupSpAM) as an analogy to GroupLasso. ○ First we assume G is a partition of {1,..,p} and that G’s do not overlap ○ The optimization problem then becomes
  • 65. 7. Discussion Points (2) ● How might we handle group sparsity in additive models(GroupSpAM) as an analogy to GroupLasso. ● The regularization term becomes
  • 66. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions?
  • 67. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs.
  • 68. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs. ● Let’s suppose we make our estimates too smooth. What may we expect then?
  • 69. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs. ● Let’s suppose we make our estimates too smooth. What may we expect then? ○ If our estimates are too smooth, we risk bias. ■ That is, we make erroneous assumptions about the underlying functions. ■ In this case we miss relevant relations between features and targets, so we underfit.
  • 70. 7. Discussion Points (3) ● What if we make our estimates too rough? What may we expect then?
  • 71. 7. Discussion Points (3) ● What if we make our estimates too rough? What may we expect then? ○ We risk variance. What does this mean?
  • 72. 7. Discussion Points (3) ● What if we make our estimates too rough? What may we expect then? ○ We risk variance. What does this mean? ● The learned model becomes sensitive to small variations in the data, so we overfit. We must balance bias and variance by choosing an appropriate level of smoothing (a toy illustration follows).
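A quick numerical illustration of this trade-off (not from the paper; the data, bandwidths, and smoother are our own assumptions) is to vary the bandwidth of a kernel smoother on noisy data and compare held-out error:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
x_tr, y_tr = x[::2], y[::2]          # even-indexed points for training
x_te, y_te = x[1::2], y[1::2]        # odd-indexed points held out

def nw_predict(x_query, x_train, y_train, bandwidth):
    """Nadaraya-Watson prediction at x_query with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

for h in (0.005, 0.05, 0.5):         # rough, moderate, oversmoothed
    test_mse = np.mean((nw_predict(x_te, x_tr, y_tr, h) - y_te) ** 2)
    print(f"bandwidth={h:<6} test MSE={test_mse:.3f}")
```

Typically the smallest bandwidth overfits (test error driven by variance), the largest underfits (test error driven by bias), and the intermediate choice does best.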
  • 73. 7. Discussion Points (4) ● Some Notes on Practicality: With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over Sparse Additive Models?
  • 74. 7. Discussion Points (4) ● Some Notes on Practicality: With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over Sparse Additive Models? ● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure.
  • 75. 7. Discussion Points (4) ● Some Notes on Practicality: With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over Sparse Additive Models? ● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure. ● Our data set is so massive that the extra processing time or the extra memory needed to fit and store an additive rather than a linear model is prohibitive.
  • 76. Thank you for listening 76