This document discusses sparse linear models and Bayesian variable selection. It introduces the spike-and-slab model for Bayesian variable selection, which uses a binary indicator vector γ to encode which features are relevant. Computing the posterior p(γ|D) requires the marginal likelihood p(D|γ), and since the number of candidate models grows as 2^D in the number of features, exact enumeration is intractable; greedy search and stochastic search methods are therefore used to approximate the posterior over models.

Because posterior inference over the discrete γ is difficult, L1 regularization, also known as the lasso, is introduced as an optimization-based alternative. The lasso replaces the discrete spike-and-slab prior with a continuous sparsity-promoting prior (the Laplace distribution), so that MAP estimation becomes a convex L1-penalized optimization problem. Coordinate descent is discussed as an algorithm for optimizing the lasso objective, exploiting the fact that each one-dimensional subproblem has a closed-form soft-thresholding solution.
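
To make the Bayesian computation concrete, the sketch below enumerates all 2^d values of γ for a small Gaussian linear model and normalizes p(D|γ)p(γ) into a posterior over models. The Gaussian slab prior w ~ N(0, τ²I), the noise variance σ², and the i.i.d. Bernoulli inclusion prior are illustrative assumptions, and names like `posterior_over_models` are hypothetical; this is a minimal sketch of exhaustive enumeration, feasible only when d is small, not the search methods mentioned above.

```python
import itertools
import numpy as np

def log_marginal_likelihood(X_g, y, sigma2=1.0, tau2=1.0):
    # p(D|gamma) for y = X_g w + eps with w ~ N(0, tau2 I) and
    # eps ~ N(0, sigma2 I): marginally, y ~ N(0, tau2 X_g X_g^T + sigma2 I).
    # sigma2 and tau2 are illustrative hyperparameters, not from the text.
    n = len(y)
    C = tau2 * X_g @ X_g.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def posterior_over_models(X, y, prior_inclusion=0.5):
    # Enumerate every binary vector gamma (2^d of them, so small d only)
    # and combine the marginal likelihood with a Bernoulli prior on gamma.
    n, d = X.shape
    gammas = list(itertools.product([0, 1], repeat=d))
    log_post = []
    for g in gammas:
        idx = [j for j in range(d) if g[j] == 1]
        X_g = X[:, idx]  # (n, 0) array when no features are selected
        log_prior = sum(
            np.log(prior_inclusion) if gj else np.log(1.0 - prior_inclusion)
            for gj in g
        )
        log_post.append(log_marginal_likelihood(X_g, y) + log_prior)
    log_post = np.array(log_post)
    log_post -= np.logaddexp.reduce(log_post)  # normalize over all models
    return gammas, np.exp(log_post)
```

Greedy search would instead add or remove one feature at a time, keeping the move that most improves the score, while stochastic search (e.g., MCMC over γ) samples models in proportion to this same unnormalized posterior.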
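
The following is a minimal sketch of coordinate descent for the lasso objective (1/2)‖y − Xw‖² + λ‖w‖₁. It cycles through the coordinates, and each one-dimensional subproblem is solved exactly by the soft-thresholding operator. The function names and the fixed iteration count are illustrative choices under these assumptions, not taken from the source.

```python
import numpy as np

def soft_threshold(x, lam):
    # Soft-thresholding operator: the closed-form minimizer of the
    # one-dimensional lasso subproblem.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    # Minimize (1/2)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate
    # descent; assumes every column of X is nonzero.
    n, d = X.shape
    w = np.zeros(d)
    col_sq_norms = (X ** 2).sum(axis=0)  # precompute ||x_j||^2
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq_norms[j]
    return w

# Illustrative usage on synthetic data with a sparse true weight vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(50)
w_hat = lasso_coordinate_descent(X, y, lam=5.0)
```

Because the soft-thresholding update sets small coordinates exactly to zero, the iterates are sparse throughout, which is one reason coordinate descent is a natural fit for the lasso objective.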