4. Ensuring generalization
A useful rule of thumb: to guarantee generalization, make sure that your training data set size $n$ is at least linear in the number $d$ of free parameters in the function that you're trying to learn.
Theorem. Let $\mathcal{F}$ be a function class with size $|\mathcal{F}|$. Let $y = f^*(x)$ for some $f^* \in \mathcal{F}$. Suppose we get a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of size $n$ with each datapoint drawn i.i.d. from the data distribution $\mathcal{D}$. Let
$$f^{\mathrm{ERM}}_S = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$
For any constants $\epsilon, \delta \in (0, 1)$, if $n \ge \frac{\ln(|\mathcal{F}|/\delta)}{\epsilon}$, then with probability $(1 - \delta)$ over $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, $R(f^{\mathrm{ERM}}_S) < \epsilon$.
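To get a feel for the bound, here is a minimal calculation with made-up numbers ($|\mathcal{F}| = 10^6$, $\epsilon = 0.1$, $\delta = 0.01$ are arbitrary illustrative choices):

```python
import math

# Hypothetical example: a finite class of |F| = 10**6 candidate functions,
# target error eps = 0.1 and failure probability delta = 0.01.
F_size, eps, delta = 10**6, 0.1, 0.01

# Sufficient sample size from the theorem: n >= ln(|F|/delta) / eps.
n_required = math.ceil(math.log(F_size / delta) / eps)
print(n_required)  # 185 -- grows only logarithmically in |F|
```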
10. ℓ2 regularization: penalizing large weights
ℓ2 regularization, $\psi(w) = \|w\|_2^2$:
$$G(w) = \mathrm{RSS}(w) + \lambda\|w\|_2^2 = \|Xw - y\|_2^2 + \lambda\|w\|_2^2$$
$$\nabla G(w) = 2(X^\top X w - X^\top y) + 2\lambda w = 0$$
$$\Rightarrow \left(X^\top X + \lambda I\right) w = X^\top y$$
$$\Rightarrow w^* = \left(X^\top X + \lambda I\right)^{-1} X^\top y$$
Linear regression with ℓ2 regularization is also known as ridge regression.
From a Bayesian viewpoint, this corresponds to placing a Gaussian prior on $w$.
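A minimal numpy sketch of the closed-form ridge solution above (random data and an arbitrary $\lambda$; using `np.linalg.solve` rather than forming the inverse explicitly is the standard way to evaluate the formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Ridge / l2-regularized least squares: w* = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)
```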
11. Encouraging sparsity: ℓ0 regularization
Sparsity of $w$: the number of non-zero coefficients in $w$, i.e., $\|w\|_0$.
Advantages:
• Sparse models are a natural inductive bias in many settings. In many applications we have numerous possible features, only some of which may have any relationship with the label.
• Sparse models may also be more interpretable: they can narrow down a small number of features which carry a lot of signal.
• The data required to learn a sparse model may be significantly less than that required to learn a dense model.
We'll see more on the third point next.
12. ℓ0 regularization: The good, the bad and the ugly
Choose $\psi(w) = \|w\|_0$.
$$G(w) = \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda\|w\|_0.$$
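To see what minimizing this objective involves, here is a naive brute-force sketch that enumerates all support sets; the helper name is ours, and the approach is only feasible for tiny $d$:

```python
import itertools
import numpy as np

def l0_regularized_fit(X, y, lam):
    """Exact minimizer of sum_i (w^T x_i - y_i)^2 + lam * ||w||_0,
    found by enumerating all 2^d possible supports (tiny d only)."""
    n, d = X.shape
    best_obj, best_w = np.sum(y ** 2), np.zeros(d)   # empty support: w = 0
    for k in range(1, d + 1):
        for support in itertools.combinations(range(d), k):
            idx = list(support)
            ws, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)  # least squares on the support
            obj = np.sum((X[:, idx] @ ws - y) ** 2) + lam * k
            if obj < best_obj:
                best_obj, best_w = obj, np.zeros(d)
                best_w[idx] = ws
    return best_w
```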
16. ℓ1 regularization as a proxy for ℓ0 regularization
Choose $\psi(w) = \|w\|_1$.
$$G(w) = \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda\|w\|_1.$$
17. ℓ1 regularization as a proxy for ℓ0 regularization
Theorem. Given $n$ vectors $\{x_i \in \mathbb{R}^d, i \in [n]\}$ drawn i.i.d. from $N(0, I)$, let $y_i = w^{*\top} x_i$ for some $w^*$ with $\|w^*\|_0 = s$. Then for some fixed constant $C > 0$, the minimizer of $G(w)$ with $\psi(w) = \|w\|_1$ will be $w^*$ as long as $n > C \cdot s \log d$ (with high probability over the randomness in the training datapoints $x_i$).
[A similar result can also be proven under more general conditions.]
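As a sanity check of this flavor of result, one can run a small experiment with scikit-learn's `Lasso` (which minimizes a $\frac{1}{2n}$-rescaled version of the ℓ1-penalized objective above); all sizes and the regularization strength below are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s = 200, 5                      # ambient dimension and sparsity
n = 120                            # a small multiple of s * log(d)
w_star = np.zeros(d)
w_star[:s] = rng.normal(size=s)    # s-sparse ground truth

X = rng.normal(size=(n, d))        # rows drawn from N(0, I)
y = X @ w_star                     # noiseless labels, y_i = w*^T x_i

# Lasso minimizes (1/(2n)) ||Xw - y||^2 + alpha ||w||_1
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50_000)
lasso.fit(X, y)
print(np.nonzero(lasso.coef_)[0])            # recovered support
print(np.max(np.abs(lasso.coef_ - w_star)))  # typically small in this noiseless setup
```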
18. Why does ℓ1 regularization encourage sparse solutions?
Optimization problem:
$$\arg\min_w \mathrm{RSS}(w) \quad \text{subject to} \quad \psi(w) \le B$$
[Figure adapted from ESL: contours of RSS together with the constraint region.]
24. Let $\tilde{w}_j = (X^\top y)_j$ (the unregularized least-squares solution when $X^\top X = I$).
Using subgradients, we can show that for the ℓ1-regularized case:
$$w_j^* = \begin{cases} \tilde{w}_j - \lambda/2, & \tilde{w}_j > \lambda/2 \\ 0, & |\tilde{w}_j| \le \lambda/2 \\ \tilde{w}_j + \lambda/2, & \tilde{w}_j < -\lambda/2 \end{cases}$$
Diving deeper: ℓ1 and ℓ2 regularization for the “isotropic” case
25. Summary: Isotropic case ($X^\top X = I$). Let $\tilde{w}_j = (X^\top y)_j$.
• No regularization: $w_j^* = \tilde{w}_j$
• ℓ2 regularization: $w_j^* = \tilde{w}_j/(1 + \lambda)$
• ℓ1 regularization: $w_j^* = \tilde{w}_j - \lambda/2$ if $\tilde{w}_j > \lambda/2$; $\;0$ if $|\tilde{w}_j| \le \lambda/2$; $\;\tilde{w}_j + \lambda/2$ if $\tilde{w}_j < -\lambda/2$
Diving deeper: ℓ1 and ℓ2 regularization for the “isotropic” case
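A small numpy sketch of the three solutions in this summary; the function names are ours, and `w_tilde` plays the role of $\tilde{w} = X^\top y$:

```python
import numpy as np

def l1_isotropic(w_tilde, lam):
    """l1-regularized solution when X^T X = I: soft thresholding at lam/2."""
    return np.sign(w_tilde) * np.maximum(np.abs(w_tilde) - lam / 2, 0.0)

def l2_isotropic(w_tilde, lam):
    """l2-regularized solution when X^T X = I: uniform shrinkage by 1/(1+lam)."""
    return w_tilde / (1.0 + lam)

w_tilde = np.array([3.0, 0.4, -0.2, -2.5])
print(l1_isotropic(w_tilde, lam=1.0))  # [ 2.5  0.  -0.  -2. ]  -- small entries set exactly to 0
print(l2_isotropic(w_tilde, lam=1.0))  # [ 1.5   0.2  -0.1  -1.25]  -- all entries shrunk, none zero
```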
26. Implicit regularization
So far, we explicitly added a $\psi(w)$ term to our objective function to regularize.
In many cases, the optimization algorithms we use can themselves act as regularizers, favoring some solutions over others.
Currently a very active area of research; you'll see more in the homework.
27. Bias-variance tradeoff
The phenomenon of underfitting and overfitting is often referred to as the bias-variance tradeoff in the literature.
A model whose complexity is too small for the task will underfit. This is a model with a large bias, because the model's accuracy will not improve even if we add a lot of training data.
(Recall the sin(x) fitting example we saw in Lec 3.)
28. Bias-variance tradeoff
The phenomenon of underfitting and overfitting is often referred to as the bias-variance tradeoff in the literature.
A model whose complexity is too large for the amount of available training data will overfit. This is a model with high variance, because the model's predictions will vary a lot with the randomness in the training data (it can even fit any noise in the training data).
(Recall the sin(x) fitting example we saw in Lec 3.)
30. Recall the nonlinear function map for linear regression: $\phi : \mathbb{R}^d \to \mathbb{R}^M$.
Kernel methods give a way to choose and efficiently work with the nonlinear map $\phi$ (for linear regression, and much more broadly).
Motivation
31. Let's continue with regularized least squares with a non-linear basis:
$$w^* = \arg\min_w F(w) = \arg\min_w \left( \|\Phi w - y\|_2^2 + \lambda\|w\|_2^2 \right) = \left(\Phi^\top\Phi + \lambda I\right)^{-1}\Phi^\top y$$
where
$$\Phi = \begin{pmatrix} \phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_n)^\top \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
This operates in the space $\mathbb{R}^M$, and $M$ could be huge (and even infinite).
Regularized least squares
32. By setting the gradient of $F(w) = \|\Phi w - y\|_2^2 + \lambda\|w\|_2^2$ to be 0:
$$\Phi^\top(\Phi w^* - y) + \lambda w^* = 0$$
we know
$$w^* = \frac{1}{\lambda}\Phi^\top(y - \Phi w^*) = \Phi^\top\alpha = \sum_{i=1}^{n} \alpha_i \phi(x_i)$$
Thus the least squares solution is a linear combination of the features of the datapoints!
This calculation does not show what $\alpha$ should be, but ignore that for now.
Regularized least squares solution: Another look
33. Assuming we know $\alpha$, the prediction of $w^*$ on a new example $x$ is
$$w^{*\top}\phi(x) = \sum_{i=1}^{n} \alpha_i \phi(x_i)^\top\phi(x)$$
Therefore, only inner products in the new feature space matter!
Kernel methods are exactly about computing inner products without explicitly computing $\phi$.
But we need to figure out what $\alpha$ is first!
Why is this helpful?
34. Plugging $w = \Phi^\top\alpha$ into $F(w)$ gives
$$H(\alpha) = F(\Phi^\top\alpha) = \|\Phi\Phi^\top\alpha - y\|_2^2 + \lambda\|\Phi^\top\alpha\|_2^2 = \|K\alpha - y\|_2^2 + \lambda\alpha^\top K\alpha \qquad (K = \Phi\Phi^\top \in \mathbb{R}^{n \times n})$$
$K$ is called the Gram matrix or kernel matrix, where the $(i, j)$-th entry is
$$K_{(i,j)} = \phi(x_i)^\top\phi(x_j)$$
Solving for α, Step 1: Kernel matrix
37. Minimize (the so-called dual formulation)
$$H(\alpha) = \|K\alpha - y\|_2^2 + \lambda\alpha^\top K\alpha$$
Setting the derivative to 0, we have
$$0 = (K^2 + \lambda K)\alpha - Ky = K\left((K + \lambda I)\alpha - y\right)$$
Thus $\alpha = (K + \lambda I)^{-1} y$ is a minimizer, and we obtain
$$w^* = \Phi^\top\alpha = \Phi^\top(K + \lambda I)^{-1} y$$
Exercise: are there other minimizers? And are there other $w^*$'s?
Solving for α, Step 2: Minimize the dual
38. Minimizing $F(w)$ gives $w^* = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$.
Minimizing $H(\alpha)$ gives $w^* = \Phi^\top(\Phi\Phi^\top + \lambda I)^{-1} y$.
Note that $I$ has different dimensions in these two formulas.
Natural question: are the two solutions the same or different?
They have to be the same because $F(w)$ has a unique minimizer!
And they are:
$$(\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top(\Phi\Phi^\top + \lambda I)(\Phi\Phi^\top + \lambda I)^{-1} y$$
$$= (\Phi^\top\Phi + \lambda I)^{-1}(\Phi^\top\Phi\Phi^\top + \lambda\Phi^\top)(\Phi\Phi^\top + \lambda I)^{-1} y$$
$$= (\Phi^\top\Phi + \lambda I)^{-1}(\Phi^\top\Phi + \lambda I)\Phi^\top(\Phi\Phi^\top + \lambda I)^{-1} y$$
$$= \Phi^\top(\Phi\Phi^\top + \lambda I)^{-1} y$$
Comparing two solutions
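A quick numerical check of this identity (the feature matrix below is random and the sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, lam = 8, 30, 0.5
Phi = rng.normal(size=(n, M))
y = rng.normal(size=n)

# Primal: w* = (Phi^T Phi + lam I_M)^{-1} Phi^T y   (an M x M system)
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Dual:   w* = Phi^T (Phi Phi^T + lam I_n)^{-1} y   (an n x n system)
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

print(np.allclose(w_primal, w_dual))  # True
```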
39. If the solutions are the same, then what is the difference?
First, computing $(\Phi\Phi^\top + \lambda I)^{-1}$ can be more efficient than computing $(\Phi^\top\Phi + \lambda I)^{-1}$ when $n \le M$.
More importantly, computing $\alpha = (K + \lambda I)^{-1} y$ also only requires computing inner products in the new feature space!
Now we can conclude that the exact form of $\phi(\cdot)$ is not essential; all we need is to know the inner products $\phi(x)^\top\phi(x')$.
For some $\phi$ it is indeed possible to compute $\phi(x)^\top\phi(x')$ without computing/knowing $\phi$. This is the kernel trick.
The kernel trick
40. Consider the following polynomial basis $\phi : \mathbb{R}^2 \to \mathbb{R}^3$:
$$\phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{pmatrix}$$
What is the inner product between $\phi(x)$ and $\phi(x')$?
$$\phi(x)^\top\phi(x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x^\top x')^2$$
Therefore, the inner product in the new space is simply a function of the inner product in the original space.
The kernel trick: Example 1
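A short numerical check of this identity for arbitrary test points:

```python
import numpy as np

def phi(x):
    # Explicit polynomial feature map R^2 -> R^3 from the slide
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(phi(x) @ phi(xp))   # inner product in the feature space
print((x @ xp) ** 2)      # kernel evaluation in the original space -- same value (30.25)
```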
41. $\phi : \mathbb{R}^d \to \mathbb{R}^{2d}$ is parameterized by $\theta$:
$$\phi_\theta(x) = \begin{pmatrix} \cos(\theta x_1) \\ \sin(\theta x_1) \\ \vdots \\ \cos(\theta x_d) \\ \sin(\theta x_d) \end{pmatrix}$$
What is the inner product between $\phi_\theta(x)$ and $\phi_\theta(x')$?
$$\phi_\theta(x)^\top\phi_\theta(x') = \sum_{m=1}^{d} \cos(\theta x_m)\cos(\theta x_m') + \sin(\theta x_m)\sin(\theta x_m') = \sum_{m=1}^{d} \cos(\theta(x_m - x_m')) \quad \text{(trigonometric identity)}$$
Once again, the inner product in the new space is a simple function of the features in the original space.
The kernel trick: Example 2
42. Based on $\phi_\theta$, define $\phi_L : \mathbb{R}^d \to \mathbb{R}^{2d(L+1)}$ for some integer $L$:
$$\phi_L(x) = \begin{pmatrix} \phi_0(x) \\ \phi_{\frac{2\pi}{L}}(x) \\ \phi_{2 \cdot \frac{2\pi}{L}}(x) \\ \vdots \\ \phi_{L \cdot \frac{2\pi}{L}}(x) \end{pmatrix}$$
What is the inner product between $\phi_L(x)$ and $\phi_L(x')$?
$$\phi_L(x)^\top\phi_L(x') = \sum_{\ell=0}^{L} \phi_{\frac{2\pi\ell}{L}}(x)^\top \phi_{\frac{2\pi\ell}{L}}(x') = \sum_{\ell=0}^{L} \sum_{m=1}^{d} \cos\!\left(\frac{2\pi\ell}{L}(x_m - x_m')\right)$$
The kernel trick: Example 3
43. When $L \to \infty$, even if we cannot compute $\phi(x)$ (since it's a vector of infinite dimension), we can still compute the inner product:
$$\phi_\infty(x)^\top\phi_\infty(x') = \int_0^{2\pi} \sum_{m=1}^{d} \cos(\theta(x_m - x_m'))\, d\theta = \sum_{m=1}^{d} \frac{\sin(2\pi(x_m - x_m'))}{x_m - x_m'}$$
Again, a simple function of the original features.
Note that when using this mapping in linear regression, we are learning a weight $w^*$ with infinite dimension!
The kernel trick: Example 4
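A quick numerical check that the integral matches the closed form on the right (the test points are arbitrary; the finite sum is simply a Riemann-sum approximation of the integral, with our own helper names):

```python
import numpy as np

def kernel_riemann(x, xp, L):
    # (2*pi/L) * sum over grid angles of sum_m cos(theta * (x_m - x'_m))
    thetas = 2 * np.pi * np.arange(L) / L
    diffs = x - xp
    return (2 * np.pi / L) * np.sum(np.cos(np.outer(thetas, diffs)))

def kernel_closed_form(x, xp):
    # sum_m sin(2*pi*(x_m - x'_m)) / (x_m - x'_m)
    diffs = x - xp
    return np.sum(np.sin(2 * np.pi * diffs) / diffs)

x, xp = np.array([0.3, 1.1, -0.4]), np.array([0.9, 0.2, 0.5])
print(kernel_riemann(x, xp, L=100_000))  # Riemann-sum approximation of the integral
print(kernel_closed_form(x, xp))         # closed-form limit -- agrees as L grows
```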
44. Definition: a function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a kernel function if there exists a function $\phi : \mathbb{R}^d \to \mathbb{R}^M$ so that for any $x, x' \in \mathbb{R}^d$,
$$k(x, x') = \phi(x)^\top\phi(x')$$
Examples we have seen:
$$k(x, x') = (x^\top x')^2, \qquad k(x, x') = \sum_{m=1}^{d} \frac{\sin(2\pi(x_m - x_m'))}{x_m - x_m'}$$
Kernel functions
45. Choosing a nonlinear basis $\phi$ becomes equivalent to choosing a kernel function.
As long as computing the kernel function is more efficient, we should apply the kernel trick.
The Gram/kernel matrix becomes:
$$K = \Phi\Phi^\top = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$$
In fact, $k$ is a kernel if and only if $K$ is positive semidefinite for any $n$ and any $x_1, x_2, \ldots, x_n$ (Mercer's theorem).
• useful for proving that a function is not a kernel
Using kernel functions
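A quick numerical illustration of the "only if" direction, using the polynomial kernel $(x^\top x')^2$ from Example 1 (the random points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # 20 random points in R^3

K = (X @ X.T) ** 2               # K_ij = (x_i^T x_j)^2
eigvals = np.linalg.eigvalsh(K)  # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-8)    # True: positive semidefinite up to numerical error
```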
46. The function
$$k(x, x') = \|x - x'\|_2^2$$
is not a kernel. Why?
If it were a kernel, the kernel matrix for two data points $x_1$ and $x_2$,
$$K = \begin{pmatrix} 0 & \|x_1 - x_2\|_2^2 \\ \|x_1 - x_2\|_2^2 & 0 \end{pmatrix},$$
must be positive semidefinite, but is it?
Examples which are not kernels
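Checking numerically with two arbitrary distinct points: the 2x2 matrix above has a negative eigenvalue, so it is not positive semidefinite.

```python
import numpy as np

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
a = np.sum((x1 - x2) ** 2)            # ||x1 - x2||_2^2 = 2
K = np.array([[0.0, a], [a, 0.0]])
print(np.linalg.eigvalsh(K))          # [-2.  2.] -- a negative eigenvalue, so not PSD
```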
47. For any function $f : \mathbb{R}^d \to \mathbb{R}$, $k(x, x') = f(x)f(x')$ is a kernel.
If $k_1(\cdot, \cdot)$ and $k_2(\cdot, \cdot)$ are kernels, then the following are also kernels:
• conical combination: $\alpha k_1(\cdot, \cdot) + \beta k_2(\cdot, \cdot)$ if $\alpha, \beta \ge 0$
• product: $k_1(\cdot, \cdot)\, k_2(\cdot, \cdot)$
• exponential: $e^{k(\cdot, \cdot)}$
• · · ·
Verify these using the definition of a kernel!
Properties of kernels
48. Polynomial kernel
$$k(x, x') = (x^\top x' + c)^M$$
for $c \ge 0$ and a positive integer $M$.
What is the corresponding $\phi$?
Popular kernels
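For intuition in a tiny special case ($d = 2$, $c = 1$, $M = 2$), one valid choice of $\phi$ is the explicit 6-dimensional map below (an illustrative construction of ours, not necessarily the expansion intended on the slide):

```python
import numpy as np

def phi(x):
    # One explicit feature map for k(x, x') = (x^T x' + 1)^2 with x in R^2
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x, xp = np.array([1.0, -2.0]), np.array([0.5, 3.0])
print(phi(x) @ phi(xp))     # feature-space inner product
print((x @ xp + 1.0) ** 2)  # kernel value -- same number (20.25)
```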
49. Gaussian kernel or Radial basis function (RBF) kernel
$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)$$
for some $\sigma > 0$.
What is the corresponding $\phi$?
Popular kernels
51. As long as $w^* = \sum_{i=1}^{n} \alpha_i \phi(x_i)$, the prediction on a new example $x$ becomes
$$w^{*\top}\phi(x) = \sum_{i=1}^{n} \alpha_i \phi(x_i)^\top\phi(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x).$$
This is known as a non-parametric method. Informally speaking, this means that there is no fixed set of parameters that the model is trying to learn (remember, $w^*$ could be infinite-dimensional). Nearest neighbors is another non-parametric method we have seen.
Prediction with kernels
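Putting the pieces together, here is a minimal kernel ridge regression sketch with the RBF kernel from the previous slide (the data, $\sigma$, and $\lambda$ are arbitrary illustrative choices, and `rbf_kernel` is our own helper, not a library function):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# Fit: alpha = (K + lam I)^{-1} y
lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict on new points: f(x) = sum_i alpha_i k(x_i, x)
X_test = np.linspace(-3, 3, 7)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha
print(np.c_[X_test[:, 0], y_pred, np.sin(X_test[:, 0])])  # predictions track sin(x)
```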
52. Classification with kernels
Similar ideas extend to the classification case, and we can predict using $\mathrm{sign}(w^\top\phi(x))$.
Data may become linearly separable in the feature space!