CSCI 567: Machine Learning
Vatsal Sharan
Fall 2022
Lecture 4, Sep 15
Administrivia
HW2 will be released tonight, due in about 2 weeks.
We will post some practice problems for the quiz by early next
week.
Recap
Ensuring generalization
A useful rule of thumb: to guarantee generalization, make sure that your training set size $n$ is at least linear in the number $d$ of free parameters in the function that you’re trying to learn.
Theorem. Let $\mathcal{F}$ be a function class with size $|\mathcal{F}|$. Let $y = f^*(x)$ for some $f^* \in \mathcal{F}$. Suppose we get a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of size $n$ with each datapoint drawn i.i.d. from the data distribution $\mathcal{D}$. Let
$$f^{\mathrm{ERM}}_S = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$
For any constants $\epsilon, \delta \in (0, 1)$, if $n \ge \frac{\ln(|\mathcal{F}|/\delta)}{\epsilon}$, then with probability $(1 - \delta)$ over $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, $R(f^{\mathrm{ERM}}_S) < \epsilon$.
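To connect the theorem to the rule of thumb above, here is a quick plug-in calculation (an illustrative example, not from the slides): if each of $d$ free parameters is discretized to $k$ possible values, then $|\mathcal{F}| = k^d$ and the bound requires
$$n \;\ge\; \frac{\ln(k^d/\delta)}{\epsilon} \;=\; \frac{d \ln k + \ln(1/\delta)}{\epsilon},$$
which is linear in $d$. For instance, $d = 100$, $k = 1000$, $\delta = 0.01$, $\epsilon = 0.1$ gives $n \ge (100\ln 1000 + \ln 100)/0.1 \approx 6954$.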
Beyond linear models: nonlinearly transformed features
Polynomial basis functions
Underfitting and overfitting
See Colab notebook
Preventing overfitting: Regularization
Understanding regularization
ℓ2 regularization: penalizing large weights
$\ell_2$ regularization: $\psi(w) = \|w\|_2^2$.
$$G(w) = \mathrm{RSS}(w) + \lambda\|w\|_2^2 = \|Xw - y\|_2^2 + \lambda\|w\|_2^2$$
$$\nabla G(w) = 2(X^T X w - X^T y) + 2\lambda w = 0 \;\Rightarrow\; (X^T X + \lambda I)\, w = X^T y \;\Rightarrow\; w^* = (X^T X + \lambda I)^{-1} X^T y$$
Linear regression with $\ell_2$ regularization is also known as ridge regression.
From a Bayesian viewpoint, it corresponds to a Gaussian prior on $w$.
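A minimal NumPy sketch of the closed-form ridge solution above (the data and variable names are my own, not from the slides):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return w* = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=0.1))
```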
Encouraging sparsity: ℓ0 regularization
Sparsity of $w$: the number of non-zero coefficients of $w$, i.e. $\|w\|_0$.
Advantages:
Sparse models are a natural inductive bias in many settings. In many applications we have numerous possible features, only some of which may have any relationship with the label.
Sparse models may also be more interpretable. They can narrow down a small number of features which carry a lot of signal.
The data required to learn a sparse model may be significantly less than that required to learn a dense model.
We’ll see more on the third point next.
ℓ0 regularization: The good, the bad and the ugly
Choose $\psi(w) = \|w\|_0$.
$$G(w) = \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \lambda\|w\|_0.$$
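As a hedged illustration of the computational downside of this objective, exact $\ell_0$-regularized least squares can in principle be solved by brute force over all $2^d$ support sets, which quickly becomes infeasible. A rough sketch (my own, not from the slides):

```python
import numpy as np
from itertools import combinations

def l0_regularized_lstsq(X, y, lam):
    """Exact minimizer of ||Xw - y||^2 + lam * ||w||_0 by brute force over supports."""
    n, d = X.shape
    best_w, best_obj = np.zeros(d), np.sum(y ** 2)  # empty support
    for k in range(1, d + 1):
        for support in combinations(range(d), k):
            cols = list(support)
            w_s, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            obj = np.sum((y - X[:, cols] @ w_s) ** 2) + lam * k
            if obj < best_obj:
                best_obj = obj
                best_w = np.zeros(d)
                best_w[cols] = w_s
    return best_w  # cost grows as 2^d, infeasible beyond small d
```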
ℓ1 regularization as a proxy for ℓ0 regularization
Choose $\psi(w) = \|w\|_1$.
$$G(w) = \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \lambda\|w\|_1.$$
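In practice this objective (the lasso) is usually minimized with an off-the-shelf solver. A minimal sketch using scikit-learn's Lasso; note that sklearn scales the squared-error term by $1/(2n)$, so its `alpha` plays the role of $\lambda$ only up to that constant (the toy data here is my own):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, s = 100, 50, 5
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:s] = rng.normal(size=s)          # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=n)

# sklearn minimizes (1/(2n)) * ||Xw - y||^2 + alpha * ||w||_1
lasso = Lasso(alpha=0.05).fit(X, y)
print("non-zeros recovered:", np.count_nonzero(lasso.coef_))
```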
ℓ1 regularization as a proxy for ℓ0 regularization
Theorem. Given $n$ vectors $\{x_i \in \mathbb{R}^d, i \in [n]\}$ drawn i.i.d. from $N(0, I)$, let $y_i = w^{*T} x_i$ for some $w^*$ with $\|w^*\|_0 = s$. Then for some fixed constant $C > 0$, the minimizer of $G(w)$ with $\psi(w) = \|w\|_1$ will be $w^*$ as long as $n > C \cdot s \log d$ (with high probability over the randomness in the training datapoints $x_i$).
[A similar result can also be proven under more general conditions.]
Why does ℓ1 regularization encourage sparse solutions?
Optimization problem: $\arg\min_w \mathrm{RSS}(w)$ subject to $\psi(w) \le c$.
[Figure adapted from ESL: contours of $\mathrm{RSS}(w)$ in the $(w_1, w_2)$ plane, shown with the $\ell_1$ (diamond) and $\ell_2$ (circular) constraint regions.]
Diving deeper: ℓ1 and ℓ2 regularization for the “isotropic” case
Summary: isotropic case ($X^T X = I$). Let $\hat{w}_i = (X^T y)_i$, the unregularized least-squares solution. Using subgradients, we can show the following closed forms:

No regularization: $w_i = \hat{w}_i$
$\ell_2$ regularization: $w_i = \hat{w}_i / (1 + \lambda)$
$\ell_1$ regularization:
$$w_i = \begin{cases} \hat{w}_i - \lambda/2, & \hat{w}_i > \lambda/2 \\ 0, & |\hat{w}_i| \le \lambda/2 \\ \hat{w}_i + \lambda/2, & \hat{w}_i < -\lambda/2 \end{cases}$$
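A small NumPy sketch of these three closed-form solutions, under the assumption $X^T X = I$ (the data and names are my own):

```python
import numpy as np

def isotropic_solutions(X, y, lam):
    """Closed-form solutions assuming X^T X = I (isotropic case)."""
    w_hat = X.T @ y                       # unregularized least-squares solution
    w_l2 = w_hat / (1.0 + lam)            # ridge: shrink every coordinate
    w_l1 = np.sign(w_hat) * np.maximum(np.abs(w_hat) - lam / 2.0, 0.0)  # soft threshold
    return w_hat, w_l2, w_l1

# Toy usage with an orthonormal X (so X^T X = I)
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(20, 5)))   # columns are orthonormal
y = Q @ np.array([3.0, -0.1, 0.05, 2.0, -1.5]) + 0.01 * rng.normal(size=20)
print(isotropic_solutions(Q, y, lam=1.0))
```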
Implicit regularization
So far, we explicitly added a $\lambda\,\psi(w)$ term to our objective function to regularize.
In many cases, the optimization algorithm we use can itself act as a regularizer, favoring some solutions over others.
This is currently a very active area of research; you’ll see more in the homework.
Bias-variance tradeoff
The phenomenon of underfitting and overfitting is often referred to as the bias-variance tradeoff in the literature.
A model whose complexity is too small for the task will underfit. This is a model
with a large bias because the model’s accuracy will not improve even if we add
a lot of training data.
sin(x) fitting example we saw in Lec 3
Bias-variance tradeoff
The phenomenon of underfitting and overfitting is often referred to as the bias-variance tradeoff in the literature.
A model whose complexity is too large for the amount of available training data
will overfit. This is a model with high variance, because the model’s predictions
will vary a lot with the randomness in the training data (it can even fit any noise
in the training data).
sin(x) fitting example we saw in Lec 3
Kernels
Motivation
Kernel methods give a way to choose and efficiently work with a nonlinear map $\phi: \mathbb{R}^d \to \mathbb{R}^M$ (for linear regression, and much more broadly).
Recall the nonlinear feature map we used for linear regression.
Regularized least squares
Let’s continue with regularized least squares with a non-linear basis:
$$w^* = \arg\min_w F(w) = \arg\min_w \left\{ \|\Phi w - y\|_2^2 + \lambda \|w\|_2^2 \right\} = \left( \Phi^T \Phi + \lambda I \right)^{-1} \Phi^T y$$
where
$$\Phi = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_n)^T \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
This operates in the space $\mathbb{R}^M$, and $M$ could be huge (even infinite).
Regularized least squares solution: Another look
Setting the gradient of $F(w) = \|\Phi w - y\|_2^2 + \lambda\|w\|_2^2$ to $0$:
$$\Phi^T(\Phi w^* - y) + \lambda w^* = 0$$
we know
$$w^* = \frac{1}{\lambda}\, \Phi^T (y - \Phi w^*) = \Phi^T \alpha = \sum_{i=1}^{n} \alpha_i \phi(x_i)$$
Thus the least squares solution is a linear combination of the features of the datapoints!
This calculation does not show what $\alpha$ should be, but ignore that for now.
Why is this helpful?
Assuming we know $\alpha$, the prediction of $w^*$ on a new example $x$ is
$$w^{*T} \phi(x) = \sum_{i=1}^{n} \alpha_i \phi(x_i)^T \phi(x)$$
Therefore, only inner products in the new feature space matter!
Kernel methods are exactly about computing inner products without explicitly computing $\phi$.
But we need to figure out what $\alpha$ is first!
Solving for α, Step 1: Kernel matrix
Plugging $w = \Phi^T\alpha$ into $F(w)$ gives
$$H(\alpha) = F(\Phi^T\alpha) = \|\Phi\Phi^T\alpha - y\|_2^2 + \lambda\|\Phi^T\alpha\|_2^2 = \|K\alpha - y\|_2^2 + \lambda\,\alpha^T K \alpha \qquad (K = \Phi\Phi^T \in \mathbb{R}^{n \times n})$$
$K$ is called the Gram matrix or kernel matrix, where the $(i, j)$-th entry is
$$K_{(i,j)} = \phi(x_i)^T \phi(x_j)$$
Kernel matrix: Example
$$\phi(x_1) = \begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix}, \quad \phi(x_2) = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad \phi(x_3) = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$
Gram/kernel matrix:
$$K = \begin{pmatrix} \phi(x_1)^T\phi(x_1) & \phi(x_1)^T\phi(x_2) & \phi(x_1)^T\phi(x_3) \\ \phi(x_2)^T\phi(x_1) & \phi(x_2)^T\phi(x_2) & \phi(x_2)^T\phi(x_3) \\ \phi(x_3)^T\phi(x_1) & \phi(x_3)^T\phi(x_2) & \phi(x_3)^T\phi(x_3) \end{pmatrix} = \begin{pmatrix} 4 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 4 \end{pmatrix}$$
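A quick NumPy check of this example (the feature vectors are the ones on the slide):

```python
import numpy as np

Phi = np.array([[1, -1, 1, -1],   # phi(x1)^T
                [1,  0, 0,  0],   # phi(x2)^T
                [1,  1, 1,  1]])  # phi(x3)^T
K = Phi @ Phi.T                   # Gram/kernel matrix
print(K)                          # [[4 1 0], [1 1 1], [0 1 4]]
```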
Kernel matrix vs Covariance matrix
Solving for α, Step 2: Minimize the dual
Minimize (the so-called dual formulation)
$$H(\alpha) = \|K\alpha - y\|_2^2 + \lambda\,\alpha^T K \alpha$$
Setting the derivative to $0$, we have
$$0 = (K^2 + \lambda K)\alpha - Ky = K\big((K + \lambda I)\alpha - y\big)$$
Thus $\alpha = (K + \lambda I)^{-1} y$ is a minimizer, and we obtain
$$w^* = \Phi^T\alpha = \Phi^T (K + \lambda I)^{-1} y$$
Exercise: are there other minimizers $\alpha$? And are there other $w^*$’s?
Comparing two solutions
Minimizing $F(w)$ gives $w^* = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y$.
Minimizing $H(\alpha)$ gives $w^* = \Phi^T(\Phi\Phi^T + \lambda I)^{-1} y$.
Note that $I$ has different dimensions in these two formulas.
Natural question: are the two solutions the same or different?
They have to be the same, because $F(w)$ has a unique minimizer!
And indeed they are:
$$\begin{aligned}
(\Phi^T\Phi + \lambda I)^{-1}\Phi^T y
&= (\Phi^T\Phi + \lambda I)^{-1}\Phi^T(\Phi\Phi^T + \lambda I)(\Phi\Phi^T + \lambda I)^{-1} y \\
&= (\Phi^T\Phi + \lambda I)^{-1}(\Phi^T\Phi\Phi^T + \lambda\Phi^T)(\Phi\Phi^T + \lambda I)^{-1} y \\
&= (\Phi^T\Phi + \lambda I)^{-1}(\Phi^T\Phi + \lambda I)\Phi^T(\Phi\Phi^T + \lambda I)^{-1} y \\
&= \Phi^T(\Phi\Phi^T + \lambda I)^{-1} y
\end{aligned}$$
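A quick numerical sanity check of this identity (a sketch with random data; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, lam = 10, 50, 0.5
Phi = rng.normal(size=(n, M))
y = rng.normal(size=n)

w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)
print(np.allclose(w_primal, w_dual))  # True
```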
The kernel trick
If the solutions are the same, then what is the difference?
First, computing $(\Phi\Phi^T + \lambda I)^{-1}$ can be more efficient than computing $(\Phi^T\Phi + \lambda I)^{-1}$ when $n \le M$.
More importantly, computing $\alpha = (K + \lambda I)^{-1} y$ also only requires computing inner products in the new feature space!
Now we can conclude that the exact form of $\phi(\cdot)$ is not essential; all we need is to know the inner products $\phi(x)^T\phi(x')$.
For some $\phi$ it is indeed possible to compute $\phi(x)^T\phi(x')$ without computing/knowing $\phi$. This is the kernel trick.
The kernel trick: Example 1
Consider the following polynomial basis $\phi: \mathbb{R}^2 \to \mathbb{R}^3$:
$$\phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$$
What is the inner product between $\phi(x)$ and $\phi(x')$?
$$\phi(x)^T\phi(x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x^T x')^2$$
Therefore, the inner product in the new space is simply a function of the inner product in the original space.
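A quick NumPy check that this feature map really reproduces $(x^T x')^2$ (random vectors; my own sketch):

```python
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2))  # True
```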
The kernel trick: Example 2
The map $\phi_\theta: \mathbb{R}^d \to \mathbb{R}^{2d}$ is parameterized by $\theta$:
$$\phi_\theta(x) = \begin{pmatrix} \cos(\theta x_1) \\ \sin(\theta x_1) \\ \vdots \\ \cos(\theta x_d) \\ \sin(\theta x_d) \end{pmatrix}$$
What is the inner product between $\phi_\theta(x)$ and $\phi_\theta(x')$?
$$\phi_\theta(x)^T\phi_\theta(x') = \sum_{m=1}^{d} \cos(\theta x_m)\cos(\theta x_m') + \sin(\theta x_m)\sin(\theta x_m') = \sum_{m=1}^{d} \cos\big(\theta(x_m - x_m')\big) \quad \text{(trigonometric identity)}$$
Once again, the inner product in the new space is a simple function of the features in the original space.
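A numerical check of this identity (my own sketch):

```python
import numpy as np

def phi_theta(x, theta):
    # Interleave cos(theta * x_m), sin(theta * x_m) for m = 1..d
    return np.ravel(np.column_stack([np.cos(theta * x), np.sin(theta * x)]))

rng = np.random.default_rng(0)
x, xp, theta = rng.normal(size=4), rng.normal(size=4), 1.7
lhs = phi_theta(x, theta) @ phi_theta(xp, theta)
rhs = np.sum(np.cos(theta * (x - xp)))
print(np.isclose(lhs, rhs))  # True
```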
The kernel trick: Example 3
Based on $\phi_\theta$, define $\phi_L: \mathbb{R}^d \to \mathbb{R}^{2d(L+1)}$ for some integer $L$:
$$\phi_L(x) = \begin{pmatrix} \phi_0(x) \\ \phi_{\frac{2\pi}{L}}(x) \\ \phi_{2\cdot\frac{2\pi}{L}}(x) \\ \vdots \\ \phi_{L\cdot\frac{2\pi}{L}}(x) \end{pmatrix}$$
What is the inner product between $\phi_L(x)$ and $\phi_L(x')$?
$$\phi_L(x)^T\phi_L(x') = \sum_{\ell=0}^{L} \phi_{\frac{2\pi\ell}{L}}(x)^T \phi_{\frac{2\pi\ell}{L}}(x') = \sum_{\ell=0}^{L} \sum_{m=1}^{d} \cos\!\left(\frac{2\pi\ell}{L}(x_m - x_m')\right)$$
The kernel trick: Example 4
When $L \to \infty$, even if we cannot compute $\phi(x)$ (since it is a vector of infinite dimension), we can still compute the inner product:
$$\phi_\infty(x)^T\phi_\infty(x') = \int_0^{2\pi} \sum_{m=1}^{d} \cos\big(\theta(x_m - x_m')\big)\, d\theta = \sum_{m=1}^{d} \frac{\sin\big(2\pi(x_m - x_m')\big)}{x_m - x_m'}$$
Again, a simple function of the original features.
Note that when using this mapping in linear regression, we are learning a weight $w^*$ with infinite dimension!
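A numerical check of the limiting integral for a single coordinate difference $\Delta = x_m - x_m'$ (my own sketch, using a crude numerical integration):

```python
import numpy as np

delta = 0.37                                   # a single coordinate difference x_m - x'_m
theta = np.linspace(0.0, 2.0 * np.pi, 200001)
dtheta = theta[1] - theta[0]
integral = np.sum(np.cos(theta * delta)) * dtheta   # Riemann sum for the integral over [0, 2*pi]
closed_form = np.sin(2.0 * np.pi * delta) / delta
print(abs(integral - closed_form) < 1e-3)      # True
```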
Kernel functions
Definition: a function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a kernel function if there exists a function $\phi: \mathbb{R}^d \to \mathbb{R}^M$ so that for any $x, x' \in \mathbb{R}^d$,
$$k(x, x') = \phi(x)^T\phi(x')$$
Examples we have seen:
$$k(x, x') = (x^T x')^2, \qquad k(x, x') = \sum_{m=1}^{d} \frac{\sin\big(2\pi(x_m - x_m')\big)}{x_m - x_m'}$$
Using kernel functions
Choosing a nonlinear basis $\phi$ becomes equivalent to choosing a kernel function.
As long as computing the kernel function is more efficient, we should apply the kernel trick.
The Gram/kernel matrix becomes:
$$K = \Phi\Phi^T = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$$
In fact, $k$ is a kernel if and only if $K$ is positive semidefinite for any $n$ and any $x_1, x_2, \ldots, x_n$ (Mercer’s theorem).
• This is useful for proving that a function is not a kernel.
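A hedged numerical illustration of the positive-semidefiniteness condition: build a kernel matrix for random points and check that its eigenvalues are (numerically) non-negative. This only checks one instance, not the “for any $n$ and any points” condition of Mercer’s theorem (the helper below is my own):

```python
import numpy as np

def kernel_matrix(X, k):
    """Kernel matrix K with K[i, j] = k(x_i, x_j)."""
    n = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

k_poly = lambda x, xp: (x @ xp) ** 2          # the (x^T x')^2 kernel from Example 1

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
eigvals = np.linalg.eigvalsh(kernel_matrix(X, k_poly))
print(eigvals.min() >= -1e-9)   # True: no (significantly) negative eigenvalues
```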
Examples which are not kernels
The function
$$k(x, x') = \|x - x'\|_2^2$$
is not a kernel. Why?
If it were a kernel, the kernel matrix for two data points $x_1$ and $x_2$,
$$K = \begin{pmatrix} 0 & \|x_1 - x_2\|_2^2 \\ \|x_1 - x_2\|_2^2 & 0 \end{pmatrix},$$
would have to be positive semidefinite, but is it?
Properties of kernels
For any function $f: \mathbb{R}^d \to \mathbb{R}$, $k(x, x') = f(x)f(x')$ is a kernel.
If $k_1(\cdot, \cdot)$ and $k_2(\cdot, \cdot)$ are kernels, then the following are also kernels:
• conical combination: $\alpha k_1(\cdot, \cdot) + \beta k_2(\cdot, \cdot)$ if $\alpha, \beta \ge 0$
• product: $k_1(\cdot, \cdot)\, k_2(\cdot, \cdot)$
• exponential: $e^{k(\cdot, \cdot)}$
• · · ·
Verify using the definition of a kernel!
Popular kernels
Polynomial kernel:
$$k(x, x') = (x^T x' + c)^M$$
for $c \ge 0$ and a positive integer $M$. What is the corresponding $\phi$?
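As a hedged illustration (my own worked example, not from the slides), for $d = 2$ and $M = 2$ one explicit feature map is $\phi(x) = \big(c, \sqrt{2c}\,x_1, \sqrt{2c}\,x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\big)$, which can be checked numerically:

```python
import numpy as np

c = 1.5

def phi_poly2(x):
    # Explicit feature map for (x^T x' + c)^2 in d = 2 (one valid choice)
    return np.array([c, np.sqrt(2 * c) * x[0], np.sqrt(2 * c) * x[1],
                     x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi_poly2(x) @ phi_poly2(xp), (x @ xp + c) ** 2))  # True
```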
Popular kernels
Gaussian kernel, or radial basis function (RBF) kernel:
$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)$$
for some $\sigma > 0$. What is the corresponding $\phi$?
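A quick sketch of computing this kernel (the corresponding $\phi$ is infinite-dimensional, which is a standard fact about the RBF kernel; the helper below is my own):

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

x, xp = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print(rbf_kernel(x, xp))
```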
Prediction with kernels
As long as $w^* = \sum_{i=1}^{n} \alpha_i \phi(x_i)$, the prediction on a new example $x$ becomes
$$w^{*T}\phi(x) = \sum_{i=1}^{n} \alpha_i \phi(x_i)^T\phi(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x).$$
This is known as a non-parametric method. Informally speaking, this means that there is no fixed set of parameters that the model is trying to learn (remember $w^*$ could be infinite-dimensional). Nearest-neighbors is another non-parametric method we have seen.
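Putting the pieces together: a minimal kernel ridge regression sketch that fits $\alpha = (K + \lambda I)^{-1} y$ and predicts with $\sum_i \alpha_i k(x_i, x)$ using the RBF kernel (a sketch under my own choices of data and hyperparameters, not from the slides):

```python
import numpy as np

def rbf(A, B, sigma=0.5):
    """Kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = rbf(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)  # alpha = (K + lam I)^{-1} y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf(X_test, X_train) @ alpha       # sum_i alpha_i k(x_i, x)
print(np.c_[X_test, y_pred, np.sin(X_test)])
```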
Classification with kernels
Similar ideas extend to the classification case, and we can predict using $\mathrm{sign}(w^T\phi(x))$.
Data may become linearly separable in the feature space!