Support Vector Machines for Regression
July 15, 2015
Overview
1 Linear Regression
2 Non-linear Regression and Kernels
Linear Regression Model

The linear regression model:
\[
f(x) = x^T \beta + \beta_0
\]
To estimate $\beta$, we consider minimization of
\[
H(\beta, \beta_0) = \sum_{i=1}^{N} V(y_i - f(x_i)) + \frac{\lambda}{2}\|\beta\|^2
\]
with a loss function $V$ and a regularization term $\frac{\lambda}{2}\|\beta\|^2$.

• How to apply SVM to solve the linear regression problem?
Linear Regression Model (Cont)

The basic idea:
Given a training data set $(x_1, y_1), \ldots, (x_N, y_N)$.
Target: find a function $f(x)$ that has at most $\varepsilon$ deviation from the targets $y_i$ for all the training data and, at the same time, is as flat (as little complex) as possible.
In other words, we do not care about errors as long as they are less than $\varepsilon$, but we will not accept any deviation larger than this.
Linear Regression Model (Cont)

• We want to find an "$\varepsilon$-tube" that contains all the samples.
• Intuitively, a tube with a small width seems to over-fit the training data. We should find an $f(x)$ whose $\varepsilon$-tube is as wide as possible (more generalization capability, less prediction error in the future).
• With a given $\varepsilon$, a wider tube corresponds to a smaller $\|\beta\|$ (a flatter function).
• Optimization problem:
\[
\min_{\beta, \beta_0} \ \frac{1}{2}\|\beta\|^2
\quad \text{s.t.} \quad
\begin{cases}
y_i - f(x_i) \le \varepsilon \\
f(x_i) - y_i \le \varepsilon
\end{cases}
\]
Linear Regression Model (Cont)

With a given $\varepsilon$, this problem is not always feasible, so we also want to allow some errors.
Using slack variables $\xi_i, \xi_i^*$, the new optimization problem is:
\[
\min_{\beta, \beta_0, \xi, \xi^*} \ \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}(\xi_i + \xi_i^*)
\quad \text{s.t.} \quad
\begin{cases}
y_i - f(x_i) \le \varepsilon + \xi_i^* \\
f(x_i) - y_i \le \varepsilon + \xi_i \\
\xi_i, \xi_i^* \ge 0
\end{cases}
\]
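As a concrete illustration, the sketch below sets up this soft $\varepsilon$-tube primal directly as a quadratic program. It is only a sketch under stated assumptions: CVXPY is not mentioned in the slides, and the data, $\varepsilon$ and $C$ values are made up.

```python
# A minimal sketch of the soft epsilon-tube primal above, solved with CVXPY.
# The synthetic data, epsilon and C values are arbitrary choices for illustration.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

eps, C = 0.2, 10.0

beta = cp.Variable(p)
beta0 = cp.Variable()
xi = cp.Variable(N, nonneg=True)       # slack for f(x_i) - y_i > eps
xi_star = cp.Variable(N, nonneg=True)  # slack for y_i - f(x_i) > eps

f = X @ beta + beta0
objective = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi + xi_star))
constraints = [y - f <= eps + xi_star,
               f - y <= eps + xi]
cp.Problem(objective, constraints).solve()

print("beta:", beta.value, "beta0:", float(beta0.value))
```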
Linear Regression Model (Cont)

Let $\lambda = \frac{1}{C}$ and use an "$\varepsilon$-insensitive" error measure, ignoring errors of size less than $\varepsilon$:
\[
V(r) =
\begin{cases}
0 & \text{if } |r| < \varepsilon \\
|r| - \varepsilon & \text{otherwise.}
\end{cases}
\]
We then have the minimization of
\[
H(\beta, \beta_0) = \sum_{i=1}^{N} V(y_i - f(x_i)) + \frac{\lambda}{2}\|\beta\|^2
\]
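A one-function NumPy sketch of this $\varepsilon$-insensitive loss (the value of $\varepsilon$ is an arbitrary illustration):

```python
import numpy as np

def eps_insensitive_loss(r, eps=0.2):
    """V(r) = 0 if |r| < eps, |r| - eps otherwise (applied elementwise)."""
    return np.maximum(np.abs(r) - eps, 0.0)

residuals = np.array([-0.5, -0.1, 0.0, 0.15, 0.8])
print(eps_insensitive_loss(residuals))  # [0.3 0.  0.  0.  0.6]
```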
Linear Regression Model (Cont)

The Lagrange (primal) function:
\[
L_P = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}(\xi_i^* + \xi_i)
- \sum_{i=1}^{N}\alpha_i^*(\varepsilon + \xi_i^* - y_i + x_i^T\beta + \beta_0)
- \sum_{i=1}^{N}\alpha_i(\varepsilon + \xi_i + y_i - x_i^T\beta - \beta_0)
- \sum_{i=1}^{N}(\eta_i^*\xi_i^* + \eta_i\xi_i)
\]
which we minimize w.r.t. $\beta, \beta_0, \xi_i, \xi_i^*$. Setting the respective derivatives to 0, we get
\[
0 = \sum_{i=1}^{N}(\alpha_i^* - \alpha_i), \qquad
\beta = \sum_{i=1}^{N}(\alpha_i^* - \alpha_i)x_i, \qquad
\alpha_i^{(*)} = C - \eta_i^{(*)}, \ \forall i
\]
Linear Regression Model (Cont)

Substituting into the primal function, we obtain the dual optimization problem:
\[
\max_{\alpha_i, \alpha_i^*} \
-\varepsilon\sum_{i=1}^{N}(\alpha_i^* + \alpha_i)
+ \sum_{i=1}^{N} y_i(\alpha_i^* - \alpha_i)
- \frac{1}{2}\sum_{i,i'=1}^{N}(\alpha_i^* - \alpha_i)(\alpha_{i'}^* - \alpha_{i'})\langle x_i, x_{i'}\rangle
\]
\[
\text{s.t.} \quad
\begin{cases}
0 \le \alpha_i, \alpha_i^* \le C \ (= 1/\lambda) \\
\sum_{i=1}^{N}(\alpha_i^* - \alpha_i) = 0 \\
\alpha_i\alpha_i^* = 0
\end{cases}
\]
The solution function has the form
\[
\hat\beta = \sum_{i=1}^{N}(\hat\alpha_i^* - \hat\alpha_i)x_i, \qquad
\hat f(x) = \sum_{i=1}^{N}(\hat\alpha_i^* - \hat\alpha_i)\langle x, x_i\rangle + \beta_0
\]
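For illustration, a minimal sketch of this dual (with a linear kernel) solved as a QP. CVXPY and the synthetic data are assumptions not in the slides, and the non-convex complementarity constraint $\alpha_i\alpha_i^* = 0$ is omitted from the program, since it is satisfied automatically at the optimum whenever $\varepsilon > 0$.

```python
# A minimal sketch of the SVR dual above with a linear kernel, solved with CVXPY.
# Data, eps and C are made up; alpha_i * alpha_i^* = 0 is not imposed explicitly.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, p = 40, 2
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0]) + 0.2 * rng.normal(size=N)

eps, C = 0.1, 10.0
K = X @ X.T + 1e-8 * np.eye(N)   # Gram matrix <x_i, x_i'>, small jitter for numerical PSD-ness

a = cp.Variable(N)       # alpha_i
a_star = cp.Variable(N)  # alpha_i^*
d = a_star - a
objective = cp.Maximize(-eps * cp.sum(a_star + a) + y @ d - 0.5 * cp.quad_form(d, K))
constraints = [a >= 0, a <= C, a_star >= 0, a_star <= C, cp.sum(d) == 0]
cp.Problem(objective, constraints).solve()

beta_hat = X.T @ (a_star.value - a.value)   # beta = sum_i (alpha_i^* - alpha_i) x_i
print("beta_hat:", beta_hat)
```

$\hat\beta_0$ can then be recovered from the KKT conditions on the next slide, e.g. from any point with $0 < \hat\alpha_i < C$.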
Linear Regression Model (Cont)

Following the KKT conditions, we have
\[
\begin{aligned}
\hat\alpha_i^*(\varepsilon + \hat\xi_i^* - y_i + \hat f(x_i)) &= 0 \\
\hat\alpha_i(\varepsilon + \hat\xi_i + y_i - \hat f(x_i)) &= 0 \\
(C - \hat\alpha_i^*)\hat\xi_i^* &= 0 \\
(C - \hat\alpha_i)\hat\xi_i &= 0
\end{aligned}
\]
→ For all data points strictly inside the $\varepsilon$-tube, $\hat\alpha_i = \hat\alpha_i^* = 0$. Only data points on or outside the tube may have $\hat\alpha_i^* - \hat\alpha_i \ne 0$.
→ We do not need all $x_i$ to describe $\hat\beta$. The data points with nonzero coefficients are called the support vectors.
Linear Regression Model (Cont)

Parameter $\varepsilon$ controls the width of the $\varepsilon$-insensitive tube. The value of $\varepsilon$ affects the number of support vectors used to construct the regression function: the bigger $\varepsilon$, the fewer support vectors are selected and the "flatter" the estimate.
It is associated with the choice of the loss function ($\varepsilon$-insensitive loss, quadratic loss, Huber loss, etc.).
Parameter $C$ $(= \frac{1}{\lambda})$ determines the trade-off between the model complexity (flatness) and the degree to which deviations larger than $\varepsilon$ are tolerated.
It can be interpreted as a traditional regularization parameter and can be estimated, for example, by cross-validation.
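A brief sketch of the $\varepsilon$ effect using scikit-learn's SVR (an assumption; the slides do not name a library) on synthetic data: increasing $\varepsilon$ leaves more points inside the tube, so fewer support vectors are selected.

```python
# Sketch: how epsilon affects the number of support vectors in an SVR model.
# scikit-learn, the data, and the parameter grid are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

for eps in [0.01, 0.1, 0.5]:
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    # Support vectors are the training points with nonzero dual coefficients.
    print(f"epsilon={eps}: {len(model.support_)} support vectors")
```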
Non-linear Regression and Kernels

When the data is non-linear, use a map $\varphi$ to transform the data into a higher-dimensional feature space, where linear regression can then be performed.
Non-linear Regression and Kernels (Cont)

Suppose we consider approximation of the regression function in terms of a set of basis functions $\{h_m(x)\}$, $m = 1, 2, \ldots, M$:
\[
f(x) = \sum_{m=1}^{M}\beta_m h_m(x) + \beta_0
\]
To estimate $\beta$ and $\beta_0$, minimize
\[
H(\beta, \beta_0) = \sum_{i=1}^{N} V(y_i - f(x_i)) + \frac{\lambda}{2}\sum_{m=1}^{M}\beta_m^2
\]
for some general error measure $V(r)$. The solution has the form
\[
\hat f(x) = \sum_{i=1}^{N}\hat\alpha_i K(x, x_i)
\quad \text{with} \quad
K(x, x') = \sum_{m=1}^{M} h_m(x)h_m(x')
\]
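To make the identity $K(x, x') = \sum_m h_m(x)h_m(x')$ concrete, here is a small NumPy check with an explicit basis that is not taken from the slides: the usual degree-2 polynomial feature map in two dimensions reproduces the kernel $(1 + \langle x, x'\rangle)^2$.

```python
# Sketch: an explicit basis h_1..h_6 whose inner product equals (1 + <x, x'>)^2.
import numpy as np

def h(x):
    """Degree-2 polynomial feature map for x in R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
lhs = h(x) @ h(xp)                 # sum_m h_m(x) h_m(x')
rhs = (1.0 + x @ xp) ** 2          # (1 + <x, x'>)^2
print(lhs, rhs)                    # the two values agree
```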
Non-linear Regression and Kernels (Cont)

Let us work this out with $V(r) = r^2$. Let $\mathbf{H}$ be the $N \times M$ basis matrix with $im$th element $h_m(x_i)$, and for simplicity assume $\beta_0 = 0$. Estimate $\beta$ by minimizing
\[
H(\beta) = (\mathbf{y} - \mathbf{H}\beta)^T(\mathbf{y} - \mathbf{H}\beta) + \lambda\|\beta\|^2
\]
Setting the first derivative to zero, we have the solution $\hat{\mathbf{y}} = \mathbf{H}\hat\beta$ with $\hat\beta$ determined by
\[
\begin{aligned}
-2\mathbf{H}^T(\mathbf{y} - \mathbf{H}\hat\beta) + 2\lambda\hat\beta &= 0 \\
-\mathbf{H}^T(\mathbf{y} - \mathbf{H}\hat\beta) + \lambda\hat\beta &= 0 \\
-\mathbf{H}\mathbf{H}^T(\mathbf{y} - \mathbf{H}\hat\beta) + \lambda\mathbf{H}\hat\beta &= 0 \quad \text{(premultiply by } \mathbf{H}\text{)} \\
(\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})\mathbf{H}\hat\beta &= \mathbf{H}\mathbf{H}^T\mathbf{y} \\
\mathbf{H}\hat\beta &= (\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}\mathbf{H}\mathbf{H}^T\mathbf{y}
\end{aligned}
\]
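A quick numerical check of the last identity with NumPy (random $\mathbf{H}$ and $\mathbf{y}$, arbitrary $\lambda$): the fitted values from the direct ridge solution $\hat\beta = (\mathbf{H}^T\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{H}^T\mathbf{y}$ coincide with $(\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}\mathbf{H}\mathbf{H}^T\mathbf{y}$.

```python
# Check that the fitted values H beta_hat equal (HH^T + lam I)^{-1} HH^T y.
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 30, 5, 0.7
H = rng.normal(size=(N, M))   # basis matrix, H[i, m] = h_m(x_i)
y = rng.normal(size=N)

# Direct ridge solution: beta_hat = (H^T H + lam I)^{-1} H^T y
beta_hat = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ y)
fitted_direct = H @ beta_hat

# Kernel-only form derived on the slide
fitted_kernel = np.linalg.solve(H @ H.T + lam * np.eye(N), H @ H.T @ y)

print(np.allclose(fitted_direct, fitted_kernel))  # True
```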
Non-linear Regression and Kernels (Cont)

We have the estimated function:
\[
\begin{aligned}
\hat f(x) &= h(x)^T\hat\beta \\
&= h(x)^T\mathbf{H}^T(\mathbf{H}\mathbf{H}^T)^{-1}\mathbf{H}\hat\beta \\
&= h(x)^T\mathbf{H}^T(\mathbf{H}\mathbf{H}^T)^{-1}(\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}\mathbf{H}\mathbf{H}^T\mathbf{y} \\
&= h(x)^T\mathbf{H}^T[(\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})(\mathbf{H}\mathbf{H}^T)]^{-1}\mathbf{H}\mathbf{H}^T\mathbf{y} \\
&= h(x)^T\mathbf{H}^T[(\mathbf{H}\mathbf{H}^T)(\mathbf{H}\mathbf{H}^T) + \lambda(\mathbf{H}\mathbf{H}^T)]^{-1}\mathbf{H}\mathbf{H}^T\mathbf{y} \\
&= h(x)^T\mathbf{H}^T[(\mathbf{H}\mathbf{H}^T)(\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})]^{-1}\mathbf{H}\mathbf{H}^T\mathbf{y} \\
&= h(x)^T\mathbf{H}^T(\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}(\mathbf{H}\mathbf{H}^T)^{-1}\mathbf{H}\mathbf{H}^T\mathbf{y} \\
&= h(x)^T\mathbf{H}^T(\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}\mathbf{y} \\
&= [K(x, x_1)\ K(x, x_2)\ \ldots\ K(x, x_N)]\,\hat\alpha \\
&= \sum_{i=1}^{N}\hat\alpha_i K(x, x_i)
\end{aligned}
\]
where $\hat\alpha = (\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}\mathbf{y}$. (The second line uses the fact that, from the stationarity condition, $\hat\beta = \lambda^{-1}\mathbf{H}^T(\mathbf{y} - \mathbf{H}\hat\beta)$ lies in the column space of $\mathbf{H}^T$, so the projection $\mathbf{H}^T(\mathbf{H}\mathbf{H}^T)^{-1}\mathbf{H}$ leaves it unchanged.)
• The $N \times N$ matrix $\mathbf{H}\mathbf{H}^T$ consists of inner products between pairs of observations $i, i'$: $\{\mathbf{H}\mathbf{H}^T\}_{i,i'} = K(x_i, x_{i'})$.
→ We need not specify or evaluate the large set of functions $h_1(x), h_2(x), \ldots, h_M(x)$. Only the inner-product kernel $K(x_i, x_{i'})$ needs to be evaluated, at the $N$ training points and at the points $x$ where predictions are made.
• Some popular choices of $K$ are:
$d$th-degree polynomial: $K(x, x') = (1 + \langle x, x'\rangle)^d$
Radial basis: $K(x, x') = \exp(-\gamma\|x - x'\|^2)$
Neural network: $K(x, x') = \tanh(\kappa_1\langle x, x'\rangle + \kappa_2)$
• This property depends on the choice of the squared norm $\|\beta\|^2$.
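For reference, a minimal NumPy sketch of the three kernels listed above; the parameter values ($d$, $\gamma$, $\kappa_1$, $\kappa_2$) are arbitrary choices for illustration.

```python
import numpy as np

def poly_kernel(x, xp, d=3):
    """d-th degree polynomial kernel (1 + <x, x'>)^d."""
    return (1.0 + x @ xp) ** d

def rbf_kernel(x, xp, gamma=0.5):
    """Radial basis kernel exp(-gamma ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def nn_kernel(x, xp, kappa1=1.0, kappa2=-1.0):
    """'Neural network' (sigmoid) kernel tanh(kappa1 <x, x'> + kappa2)."""
    return np.tanh(kappa1 * (x @ xp) + kappa2)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, xp), rbf_kernel(x, xp), nn_kernel(x, xp))
```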