4. Ensuring generalization
A useful rule of thumb: to guarantee generalization, make sure that your training data set size $n$ is at least linear in the number $d$ of free parameters in the function that you're trying to learn.
Theorem. Let $\mathcal{F}$ be a function class with size $|\mathcal{F}|$. Let $y = f^*(x)$ for some $f^* \in \mathcal{F}$. Suppose we get a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of size $n$ with each datapoint drawn i.i.d. from the data distribution $\mathcal{D}$. Let
$$f^{\mathrm{ERM}}_S = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$
For any constants $\epsilon, \delta \in (0, 1)$, if $n \ge \frac{\ln(|\mathcal{F}|/\delta)}{\epsilon}$, then with probability $(1 - \delta)$ over $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, $R(f^{\mathrm{ERM}}_S) < \epsilon$.
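To get a feel for the bound, here is a minimal calculation with made-up numbers ($|\mathcal{F}| = 10^6$, $\epsilon = 0.1$, $\delta = 0.01$ are arbitrary illustrative choices):

```python
import math

# Hypothetical example: a finite class of |F| = 10**6 candidate functions,
# target error eps = 0.1 and failure probability delta = 0.01.
F_size, eps, delta = 10**6, 0.1, 0.01

# Sufficient sample size from the theorem: n >= ln(|F|/delta) / eps.
n_required = math.ceil(math.log(F_size / delta) / eps)
print(n_required)  # 185 -- grows only logarithmically in |F|
```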
10. ℓ2 regularization: penalizing large weights
ℓ2 regularization, $\psi(w) = \|w\|_2^2$:
$$G(w) = \mathrm{RSS}(w) + \lambda\|w\|_2^2 = \|Xw - y\|_2^2 + \lambda\|w\|_2^2$$
$$\nabla G(w) = 2(X^\top X w - X^\top y) + 2\lambda w = 0$$
$$\Rightarrow \left(X^\top X + \lambda I\right) w = X^\top y$$
$$\Rightarrow w^* = \left(X^\top X + \lambda I\right)^{-1} X^\top y$$
Linear regression with ℓ2 regularization is also known as ridge regression.
From a Bayesian viewpoint, this corresponds to placing a Gaussian prior on $w$.
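A minimal numpy sketch of the closed-form ridge solution above (random data and an arbitrary $\lambda$; using `np.linalg.solve` rather than forming the inverse explicitly is the standard way to evaluate the formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Ridge / l2-regularized least squares: w* = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)
```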
11. Encouraging sparsity: ℓ0 regularization
Sparsity of $w$: the number of non-zero coefficients in $w$, i.e., $\|w\|_0$.
Advantages:
• Sparse models are a natural inductive bias in many settings. In many applications we have numerous possible features, only some of which may have any relationship with the label.
• Sparse models may also be more interpretable: they can narrow down a small number of features which carry a lot of signal.
• The data required to learn a sparse model may be significantly less than that required to learn a dense model.
We'll see more on the third point next.
12. ℓ0 regularization: The good, the bad and the ugly
Choose $\psi(w) = \|w\|_0$.
$$G(w) = \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda\|w\|_0.$$
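To see what minimizing this objective involves, here is a naive brute-force sketch that enumerates all support sets; the helper name is ours, and the approach is only feasible for tiny $d$:

```python
import itertools
import numpy as np

def l0_regularized_fit(X, y, lam):
    """Exact minimizer of sum_i (w^T x_i - y_i)^2 + lam * ||w||_0,
    found by enumerating all 2^d possible supports (tiny d only)."""
    n, d = X.shape
    best_obj, best_w = np.sum(y ** 2), np.zeros(d)   # empty support: w = 0
    for k in range(1, d + 1):
        for support in itertools.combinations(range(d), k):
            idx = list(support)
            ws, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)  # least squares on the support
            obj = np.sum((X[:, idx] @ ws - y) ** 2) + lam * k
            if obj < best_obj:
                best_obj, best_w = obj, np.zeros(d)
                best_w[idx] = ws
    return best_w
```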
16. ℓ1 regularization as a proxy for ℓ0 regularization
Choose $\psi(w) = \|w\|_1$.
$$G(w) = \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda\|w\|_1.$$
17. ℓ1 regularization as a proxy for ℓ0 regularization
Theorem. Given $n$ vectors $\{x_i \in \mathbb{R}^d, i \in [n]\}$ drawn i.i.d. from $N(0, I)$, let $y_i = w^{*\top} x_i$ for some $w^*$ with $\|w^*\|_0 = s$. Then for some fixed constant $C > 0$, the minimizer of $G(w)$ with $\psi(w) = \|w\|_1$ will be $w^*$ as long as $n > C \cdot s \log d$ (with high probability over the randomness in the training datapoints $x_i$).
[A similar result can also be proven under more general conditions.]
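As a sanity check of this flavor of result, one can run a small experiment with scikit-learn's `Lasso` (which minimizes a $\frac{1}{2n}$-rescaled version of the ℓ1-penalized objective above); all sizes and the regularization strength below are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s = 200, 5                      # ambient dimension and sparsity
n = 120                            # a small multiple of s * log(d)
w_star = np.zeros(d)
w_star[:s] = rng.normal(size=s)    # s-sparse ground truth

X = rng.normal(size=(n, d))        # rows drawn from N(0, I)
y = X @ w_star                     # noiseless labels, y_i = w*^T x_i

# Lasso minimizes (1/(2n)) ||Xw - y||^2 + alpha ||w||_1
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50_000)
lasso.fit(X, y)
print(np.nonzero(lasso.coef_)[0])            # recovered support
print(np.max(np.abs(lasso.coef_ - w_star)))  # typically small in this noiseless setup
```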
18. Why does ℓ1 regularization encourage sparse solutions?
Optimization problem:
$$\arg\min_w \mathrm{RSS}(w) \quad \text{subject to} \quad \psi(w) \le B$$
[Figure adapted from ESL: contours of RSS together with the constraint region.]
24. Let $\tilde{w}_j = (X^\top y)_j$ (the unregularized least-squares solution when $X^\top X = I$).
Using subgradients, we can show that for the ℓ1-regularized case:
$$w_j^* = \begin{cases} \tilde{w}_j - \lambda/2, & \tilde{w}_j > \lambda/2 \\ 0, & |\tilde{w}_j| \le \lambda/2 \\ \tilde{w}_j + \lambda/2, & \tilde{w}_j < -\lambda/2 \end{cases}$$
Diving deeper: ℓ1 and ℓ2 regularization for the “isotropic” case
25. Summary: Isotropic case ($X^\top X = I$). Let $\tilde{w}_j = (X^\top y)_j$.
• No regularization: $w_j^* = \tilde{w}_j$
• ℓ2 regularization: $w_j^* = \tilde{w}_j/(1 + \lambda)$
• ℓ1 regularization: $w_j^* = \tilde{w}_j - \lambda/2$ if $\tilde{w}_j > \lambda/2$; $\;0$ if $|\tilde{w}_j| \le \lambda/2$; $\;\tilde{w}_j + \lambda/2$ if $\tilde{w}_j < -\lambda/2$
Diving deeper: ℓ1 and ℓ2 regularization for the “isotropic” case
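A small numpy sketch of the three solutions in this summary; the function names are ours, and `w_tilde` plays the role of $\tilde{w} = X^\top y$:

```python
import numpy as np

def l1_isotropic(w_tilde, lam):
    """l1-regularized solution when X^T X = I: soft thresholding at lam/2."""
    return np.sign(w_tilde) * np.maximum(np.abs(w_tilde) - lam / 2, 0.0)

def l2_isotropic(w_tilde, lam):
    """l2-regularized solution when X^T X = I: uniform shrinkage by 1/(1+lam)."""
    return w_tilde / (1.0 + lam)

w_tilde = np.array([3.0, 0.4, -0.2, -2.5])
print(l1_isotropic(w_tilde, lam=1.0))  # [ 2.5  0.  -0.  -2. ]  -- small entries set exactly to 0
print(l2_isotropic(w_tilde, lam=1.0))  # [ 1.5   0.2  -0.1  -1.25]  -- all entries shrunk, none zero
```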
26. Implicit regularization
So far, we explicitly added a $\psi(w)$ term to our objective function to regularize.
In many cases, the optimization algorithms we use can themselves act as regularizers, favoring some solutions over others.
Currently a very active area of research; you'll see more in the homework.
27. Bias-variance tradeoff
The phenomenon of underfitting and overfitting is often referred to as the bias-variance tradeoff in the literature.
A model whose complexity is too small for the task will underfit. This is a model with a large bias, because the model's accuracy will not improve even if we add a lot of training data.
(Recall the sin(x) fitting example we saw in Lec 3.)
28. Bias-variance tradeoff
The phenomenon of underfitting and overfitting is often referred to as the bias-variance tradeoff in the literature.
A model whose complexity is too large for the amount of available training data will overfit. This is a model with high variance, because the model's predictions will vary a lot with the randomness in the training data (it can even fit any noise in the training data).
(Recall the sin(x) fitting example we saw in Lec 3.)
30. Recall the nonlinear function map for linear regression: $\phi : \mathbb{R}^d \to \mathbb{R}^M$.
Kernel methods give a way to choose and efficiently work with the nonlinear map $\phi$ (for linear regression, and much more broadly).
Motivation
31. Let's continue with regularized least squares with a non-linear basis:
$$w^* = \arg\min_w F(w) = \arg\min_w \left( \|\Phi w - y\|_2^2 + \lambda\|w\|_2^2 \right) = \left(\Phi^\top\Phi + \lambda I\right)^{-1}\Phi^\top y$$
where
$$\Phi = \begin{pmatrix} \phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_n)^\top \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
This operates in the space $\mathbb{R}^M$, and $M$ could be huge (and even infinite).
Regularized least squares
32. By setting the gradient of $F(w) = \|\Phi w - y\|_2^2 + \lambda\|w\|_2^2$ to be 0:
$$\Phi^\top(\Phi w^* - y) + \lambda w^* = 0$$
we know
$$w^* = \frac{1}{\lambda}\Phi^\top(y - \Phi w^*) = \Phi^\top\alpha = \sum_{i=1}^{n} \alpha_i \phi(x_i)$$
Thus the least squares solution is a linear combination of the features of the datapoints!
This calculation does not show what $\alpha$ should be, but ignore that for now.
Regularized least squares solution: Another look
33. Assuming we know $\alpha$, the prediction of $w^*$ on a new example $x$ is
$$w^{*\top}\phi(x) = \sum_{i=1}^{n} \alpha_i \phi(x_i)^\top\phi(x)$$
Therefore, only inner products in the new feature space matter!
Kernel methods are exactly about computing inner products without explicitly computing $\phi$.
But we need to figure out what $\alpha$ is first!
Why is this helpful?
34. Plugging $w = \Phi^\top\alpha$ into $F(w)$ gives
$$H(\alpha) = F(\Phi^\top\alpha) = \|\Phi\Phi^\top\alpha - y\|_2^2 + \lambda\|\Phi^\top\alpha\|_2^2 = \|K\alpha - y\|_2^2 + \lambda\alpha^\top K\alpha \qquad (K = \Phi\Phi^\top \in \mathbb{R}^{n \times n})$$
$K$ is called the Gram matrix or kernel matrix, where the $(i, j)$-th entry is
$$K_{(i,j)} = \phi(x_i)^\top\phi(x_j)$$
Solving for α, Step 1: Kernel matrix
37. Minimize (the so-called dual formulation)
$$H(\alpha) = \|K\alpha - y\|_2^2 + \lambda\alpha^\top K\alpha$$
Setting the derivative to 0, we have
$$0 = (K^2 + \lambda K)\alpha - Ky = K\left((K + \lambda I)\alpha - y\right)$$
Thus $\alpha = (K + \lambda I)^{-1} y$ is a minimizer, and we obtain
$$w^* = \Phi^\top\alpha = \Phi^\top(K + \lambda I)^{-1} y$$
Exercise: are there other minimizers? And are there other $w^*$'s?
Solving for α, Step 2: Minimize the dual
38. Minimizing $F(w)$ gives $w^* = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$.
Minimizing $H(\alpha)$ gives $w^* = \Phi^\top(\Phi\Phi^\top + \lambda I)^{-1} y$.
Note that $I$ has different dimensions in these two formulas.
Natural question: are the two solutions the same or different?
They have to be the same because $F(w)$ has a unique minimizer!
And they are:
$$(\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top(\Phi\Phi^\top + \lambda I)(\Phi\Phi^\top + \lambda I)^{-1} y$$
$$= (\Phi^\top\Phi + \lambda I)^{-1}(\Phi^\top\Phi\Phi^\top + \lambda\Phi^\top)(\Phi\Phi^\top + \lambda I)^{-1} y$$
$$= (\Phi^\top\Phi + \lambda I)^{-1}(\Phi^\top\Phi + \lambda I)\Phi^\top(\Phi\Phi^\top + \lambda I)^{-1} y$$
$$= \Phi^\top(\Phi\Phi^\top + \lambda I)^{-1} y$$
Comparing two solutions
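A quick numerical check of this identity (the feature matrix below is random and the sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, lam = 8, 30, 0.5
Phi = rng.normal(size=(n, M))
y = rng.normal(size=n)

# Primal: w* = (Phi^T Phi + lam I_M)^{-1} Phi^T y   (an M x M system)
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Dual:   w* = Phi^T (Phi Phi^T + lam I_n)^{-1} y   (an n x n system)
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

print(np.allclose(w_primal, w_dual))  # True
```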
39. If the solutions are the same, then what is the difference?
First, computing $(\Phi\Phi^\top + \lambda I)^{-1}$ can be more efficient than computing $(\Phi^\top\Phi + \lambda I)^{-1}$ when $n \le M$.
More importantly, computing $\alpha = (K + \lambda I)^{-1} y$ also only requires computing inner products in the new feature space!
Now we can conclude that the exact form of $\phi(\cdot)$ is not essential; all we need is to know the inner products $\phi(x)^\top\phi(x')$.
For some $\phi$ it is indeed possible to compute $\phi(x)^\top\phi(x')$ without computing/knowing $\phi$. This is the kernel trick.
The kernel trick
40. Consider the following polynomial basis $\phi : \mathbb{R}^2 \to \mathbb{R}^3$:
$$\phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{pmatrix}$$
What is the inner product between $\phi(x)$ and $\phi(x')$?
$$\phi(x)^\top\phi(x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x^\top x')^2$$
Therefore, the inner product in the new space is simply a function of the inner product in the original space.
The kernel trick: Example 1
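A short numerical check of this identity for arbitrary test points:

```python
import numpy as np

def phi(x):
    # Explicit polynomial feature map R^2 -> R^3 from the slide
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(phi(x) @ phi(xp))   # inner product in the feature space
print((x @ xp) ** 2)      # kernel evaluation in the original space -- same value (30.25)
```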
41. $\phi : \mathbb{R}^d \to \mathbb{R}^{2d}$ is parameterized by $\theta$:
$$\phi_\theta(x) = \begin{pmatrix} \cos(\theta x_1) \\ \sin(\theta x_1) \\ \vdots \\ \cos(\theta x_d) \\ \sin(\theta x_d) \end{pmatrix}$$
What is the inner product between $\phi_\theta(x)$ and $\phi_\theta(x')$?
$$\phi_\theta(x)^\top\phi_\theta(x') = \sum_{m=1}^{d} \cos(\theta x_m)\cos(\theta x_m') + \sin(\theta x_m)\sin(\theta x_m') = \sum_{m=1}^{d} \cos(\theta(x_m - x_m')) \quad \text{(trigonometric identity)}$$
Once again, the inner product in the new space is a simple function of the features in the original space.
The kernel trick: Example 2
42. Based on $\phi_\theta$, define $\phi_L : \mathbb{R}^d \to \mathbb{R}^{2d(L+1)}$ for some integer $L$:
$$\phi_L(x) = \begin{pmatrix} \phi_0(x) \\ \phi_{\frac{2\pi}{L}}(x) \\ \phi_{2 \cdot \frac{2\pi}{L}}(x) \\ \vdots \\ \phi_{L \cdot \frac{2\pi}{L}}(x) \end{pmatrix}$$
What is the inner product between $\phi_L(x)$ and $\phi_L(x')$?
$$\phi_L(x)^\top\phi_L(x') = \sum_{\ell=0}^{L} \phi_{\frac{2\pi\ell}{L}}(x)^\top \phi_{\frac{2\pi\ell}{L}}(x') = \sum_{\ell=0}^{L} \sum_{m=1}^{d} \cos\!\left(\frac{2\pi\ell}{L}(x_m - x_m')\right)$$
The kernel trick: Example 3
43. When $L \to \infty$, even if we cannot compute $\phi(x)$ (since it's a vector of infinite dimension), we can still compute the inner product:
$$\phi_\infty(x)^\top\phi_\infty(x') = \int_0^{2\pi} \sum_{m=1}^{d} \cos(\theta(x_m - x_m'))\, d\theta = \sum_{m=1}^{d} \frac{\sin(2\pi(x_m - x_m'))}{x_m - x_m'}$$
Again, a simple function of the original features.
Note that when using this mapping in linear regression, we are learning a weight $w^*$ with infinite dimension!
The kernel trick: Example 4
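A quick numerical check that the integral matches the closed form on the right (the test points are arbitrary; the finite sum is simply a Riemann-sum approximation of the integral, with our own helper names):

```python
import numpy as np

def kernel_riemann(x, xp, L):
    # (2*pi/L) * sum over grid angles of sum_m cos(theta * (x_m - x'_m))
    thetas = 2 * np.pi * np.arange(L) / L
    diffs = x - xp
    return (2 * np.pi / L) * np.sum(np.cos(np.outer(thetas, diffs)))

def kernel_closed_form(x, xp):
    # sum_m sin(2*pi*(x_m - x'_m)) / (x_m - x'_m)
    diffs = x - xp
    return np.sum(np.sin(2 * np.pi * diffs) / diffs)

x, xp = np.array([0.3, 1.1, -0.4]), np.array([0.9, 0.2, 0.5])
print(kernel_riemann(x, xp, L=100_000))  # Riemann-sum approximation of the integral
print(kernel_closed_form(x, xp))         # closed-form limit -- agrees as L grows
```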
44. Definition: a function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a kernel function if there exists a function $\phi : \mathbb{R}^d \to \mathbb{R}^M$ so that for any $x, x' \in \mathbb{R}^d$,
$$k(x, x') = \phi(x)^\top\phi(x')$$
Examples we have seen:
$$k(x, x') = (x^\top x')^2, \qquad k(x, x') = \sum_{m=1}^{d} \frac{\sin(2\pi(x_m - x_m'))}{x_m - x_m'}$$
Kernel functions
45. Choosing a nonlinear basis $\phi$ becomes equivalent to choosing a kernel function.
As long as computing the kernel function is more efficient, we should apply the kernel trick.
The Gram/kernel matrix becomes:
$$K = \Phi\Phi^\top = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$$
In fact, $k$ is a kernel if and only if $K$ is positive semidefinite for any $n$ and any $x_1, x_2, \ldots, x_n$ (Mercer's theorem).
• useful for proving that a function is not a kernel
Using kernel functions
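A quick numerical illustration of the "only if" direction, using the polynomial kernel $(x^\top x')^2$ from Example 1 (the random points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # 20 random points in R^3

K = (X @ X.T) ** 2               # K_ij = (x_i^T x_j)^2
eigvals = np.linalg.eigvalsh(K)  # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-8)    # True: positive semidefinite up to numerical error
```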
46. The function
$$k(x, x') = \|x - x'\|_2^2$$
is not a kernel. Why?
If it were a kernel, the kernel matrix for two data points $x_1$ and $x_2$,
$$K = \begin{pmatrix} 0 & \|x_1 - x_2\|_2^2 \\ \|x_1 - x_2\|_2^2 & 0 \end{pmatrix},$$
must be positive semidefinite, but is it?
Examples which are not kernels
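Checking numerically with two arbitrary distinct points: the 2x2 matrix above has a negative eigenvalue, so it is not positive semidefinite.

```python
import numpy as np

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
a = np.sum((x1 - x2) ** 2)            # ||x1 - x2||_2^2 = 2
K = np.array([[0.0, a], [a, 0.0]])
print(np.linalg.eigvalsh(K))          # [-2.  2.] -- a negative eigenvalue, so not PSD
```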
47. For any function $f : \mathbb{R}^d \to \mathbb{R}$, $k(x, x') = f(x)f(x')$ is a kernel.
If $k_1(\cdot, \cdot)$ and $k_2(\cdot, \cdot)$ are kernels, then the following are also kernels:
• conical combination: $\alpha k_1(\cdot, \cdot) + \beta k_2(\cdot, \cdot)$ if $\alpha, \beta \ge 0$
• product: $k_1(\cdot, \cdot)\, k_2(\cdot, \cdot)$
• exponential: $e^{k(\cdot, \cdot)}$
• · · ·
Verify these using the definition of a kernel!
Properties of kernels
48. Polynomial kernel
$$k(x, x') = (x^\top x' + c)^M$$
for $c \ge 0$ and a positive integer $M$.
What is the corresponding $\phi$?
Popular kernels
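For intuition in a tiny special case ($d = 2$, $c = 1$, $M = 2$), one valid choice of $\phi$ is the explicit 6-dimensional map below (an illustrative construction of ours, not necessarily the expansion intended on the slide):

```python
import numpy as np

def phi(x):
    # One explicit feature map for k(x, x') = (x^T x' + 1)^2 with x in R^2
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x, xp = np.array([1.0, -2.0]), np.array([0.5, 3.0])
print(phi(x) @ phi(xp))     # feature-space inner product
print((x @ xp + 1.0) ** 2)  # kernel value -- same number (20.25)
```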
49. Gaussian kernel or Radial basis function (RBF) kernel
$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)$$
for some $\sigma > 0$.
What is the corresponding $\phi$?
Popular kernels
51. As long as $w^* = \sum_{i=1}^{n} \alpha_i \phi(x_i)$, the prediction on a new example $x$ becomes
$$w^{*\top}\phi(x) = \sum_{i=1}^{n} \alpha_i \phi(x_i)^\top\phi(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x).$$
This is known as a non-parametric method. Informally speaking, this means that there is no fixed set of parameters that the model is trying to learn (remember, $w^*$ could be infinite-dimensional). Nearest neighbors is another non-parametric method we have seen.
Prediction with kernels
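Putting the pieces together, here is a minimal kernel ridge regression sketch with the RBF kernel from the previous slide (the data, $\sigma$, and $\lambda$ are arbitrary illustrative choices, and `rbf_kernel` is our own helper, not a library function):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# Fit: alpha = (K + lam I)^{-1} y
lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict on new points: f(x) = sum_i alpha_i k(x_i, x)
X_test = np.linspace(-3, 3, 7)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha
print(np.c_[X_test[:, 0], y_pred, np.sin(X_test[:, 0])])  # predictions track sin(x)
```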
52. Classification with kernels
Similar ideas extend to the classification case, and we can predict using $\mathrm{sign}(w^\top\phi(x))$.
Data may become linearly separable in the feature space!