Chapter 7
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 7. Sparse Kernel Machines
Kernel-based regression & classification machines
Like the Gaussian process, many kernel-based approaches require evaluating the kernel function over the full training set.
In this section, we are going to cover some sparse-solution machines.
ā€˜What does a sparse solution mean?’
= A method that uses only part of the full dataset.
Such a method is the ā€˜support vector machine.’
** As I mentioned, I have uploaded a report which covers the detailed idea of the support vector machine!
There are some interesting aspects of the support vector machine.
1st : The support vector machine does not yield a probability for its decision. It only gives the classification result.
2nd : We can always find the global optimal solution with convex optimization.
3rd : It can be extended to Bayesian methods via the Relevance Vector Machine (covered soon).
Since I wrote the basic idea in the other file, I am going to skip the basics here!
The important idea of this method is ā€˜using decision boundaries and maximizing the margin!’
First, let’s take a look at optimization issues.
Chapter 7.0. Lagrange Multipliers and KKT condition
Lagrange Multipliers
Suppose we are maximizing a function š‘“(š‘‹) with respect to š‘‹,
and we are doing so under some constraint š‘”(š‘‹) = 0.
The constraint š‘”(š‘‹) = 0 then forms a (š· āˆ’ 1)-dimensional surface in the feature space.
As an example, consider the constraint š‘”(š‘„, š‘¦, š‘§) = š‘„ + š‘¦ āˆ’ š‘§ = 0.
This constraint forms a plane, the grey surface in the figure on the right.
The gradient of š‘” is āˆ‡š‘”(š‘„, š‘¦, š‘§) = (āˆ‚š‘”/āˆ‚š‘„, āˆ‚š‘”/āˆ‚š‘¦, āˆ‚š‘”/āˆ‚š‘§) = (1, 1, āˆ’1).
Check that this gradient is orthogonal to the constraint surface.
Now, let’s extend this idea to a general dimension.
Take a point š‘„ on the constraint surface and a small displacement šœ– that keeps š‘„ + šœ– on the surface. By a Taylor expansion,
š‘”(š‘„ + šœ–) ā‰ˆ š‘”(š‘„) + šœ–įµ€ āˆ‡š‘”(š‘„)
Since both points satisfy the constraint, š‘”(š‘„ + šœ–) = š‘”(š‘„), so as šœ– → 0 we have šœ–įµ€ āˆ‡š‘”(š‘„) ā‰ˆ 0,
which fits the result of our toy example above.
Chapter 7.0. Lagrange Multipliers and KKT condition
Lagrange Multipliers
Now, let’s get back to our original optimization problem. š‘“(š‘‹) is a function defined over this š·-dimensional space.
Here, the constrained maximum of š‘“(š‘‹) occurs where a level surface of š‘“ just kisses the constraint surface (they share a tangent).
This means the two gradient vectors are parallel: āˆ‡š‘“(š‘‹) + šœ† āˆ‡š‘”(š‘‹) = 0, where šœ† is a constant that may take either sign.
Thus, we can find the solution by using the Lagrangian below.
Now let’s consider an inequality constraint š‘”(š‘‹) ≄ 0.
Most of the parts are the same: when the constraint is active, the optimum still occurs at such a ā€˜kissing point’.
1st : However, as you can see, the constraint term is effectively turned on or off according to whether the unconstrained optimum already satisfies the constraint or not.
2nd : The sign of šœ† matters, since at an active constraint āˆ‡š‘“ has to point away from the shaded feasible region š‘”(š‘„) > 0.
Here, there is a set of conditions called the Karush-Kuhn-Tucker (KKT) conditions, which an optimal solution must satisfy.
Such conditions are,
Chapter 7.0. Lagrange Multipliers and KKT condition
Summary
So far, we have built some intuition for optimization with Lagrange multipliers.
In summary, we are solving the following problem.
1. We want to find the maximum of š‘“(š‘‹), with the constraints š‘”_š‘—(š‘‹) = 0 / ā„Ž_š‘˜(š‘‹) ≄ 0
2. The objective function is šæ(š‘‹, šœ†_š‘—, šœ‡_š‘˜) = š‘“(š‘‹) + Ī£_{š‘—=1..š½} šœ†_š‘— š‘”_š‘—(š‘‹) + Ī£_{š‘˜=1..š¾} šœ‡_š‘˜ ā„Ž_š‘˜(š‘‹)
3. But subject to šœ‡_š‘˜ ≄ 0, šœ‡_š‘˜ ā„Ž_š‘˜(š‘‹) = 0.
4. So, in short,
argmax_š‘‹ šæ(š‘‹, šœ†_š‘—, šœ‡_š‘˜) = argmax_š‘‹ [ š‘“(š‘‹) + Ī£_{š‘—=1..š½} šœ†_š‘— š‘”_š‘—(š‘‹) + Ī£_{š‘˜=1..š¾} šœ‡_š‘˜ ā„Ž_š‘˜(š‘‹) ]
s.t. šœ‡_š‘˜ ≄ 0
s.t. šœ‡_š‘˜ ā„Ž_š‘˜(š‘‹) = 0
Check how these equations are used in the optimization of the Support Vector Machine!
- Dual Representation
- Lagrange
- KKT condition
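To make the summary concrete, here is a minimal sketch (a toy problem of my own choosing, assuming scipy is available) that maximizes š‘“(š‘„) = 1 āˆ’ š‘„ā‚Ā² āˆ’ š‘„ā‚‚Ā² subject to š‘”(š‘„) = š‘„ā‚ + š‘„ā‚‚ āˆ’ 1 = 0 and then checks the stationarity condition āˆ‡š‘“ + šœ†āˆ‡š‘” = 0:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x) = 1 - x1^2 - x2^2  subject to  g(x) = x1 + x2 - 1 = 0.
# scipy minimizes, so we minimize -f.
f = lambda x: 1.0 - x[0] ** 2 - x[1] ** 2
g = lambda x: x[0] + x[1] - 1.0

res = minimize(lambda x: -f(x), x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "eq", "fun": g}])
x_star = res.x                      # expected optimum: (0.5, 0.5)

# Stationarity: grad f(x*) + lambda * grad g(x*) = 0
grad_f = np.array([-2 * x_star[0], -2 * x_star[1]])
grad_g = np.array([1.0, 1.0])
lam = -grad_f[0] / grad_g[0]        # solve for the multiplier from one component
print(x_star, lam)                  # ~ [0.5, 0.5] and lambda = 1
print(grad_f + lam * grad_g)        # ~ [0, 0]
```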
Chapter 7.1. Maximum Margin Classifiers
General formula
The output takes only two values, {āˆ’1, 1}.
š‘¦(š‘‹) = š‘Šįµ€ šœ™(š‘‹) + š‘, and the label š‘”_š‘› is 1 if š‘¦(š‘‹) > 0, āˆ’1 if š‘¦(š‘‹) < 0.
Thus, correctly classified points satisfy š‘”_š‘› š‘¦(š‘‹_š‘›) > 0.
Here, we assume the data are perfectly separable!
We discussed perpendicular distance and other related issues in chapter 4.
The distance of an arbitrary data point from the decision surface can be expressed as
Since our goal is to maximize the margin, the distance from the decision surface to the closest data point should be maximized.
By using this, we can set up our optimization problem as
Here, we are free to set the inner term š‘”_š‘› (š‘Šįµ€ šœ™(š‘‹_š‘›) + š‘) to 1 for the point closest to the decision surface
(we can achieve this by a simple re-scaling of š‘Š and š‘).
Then, the following condition holds for every data point.
Chapter 7.1. Maximum Margin Classifiers
General formula
Thus, our final objective function becomes…
Since there is the constraint š‘”_š‘› (š‘Šįµ€ šœ™(š‘‹_š‘›) + š‘) ≄ 1, we can use Lagrange multipliers! (Imposing the constraint on the objective function!)
Here, we can re-write the optimization function as
By setting āˆ‚šæ(š‘¤, š‘, š‘Ž)/āˆ‚š‘¤ = 0 (and likewise āˆ‚šæ/āˆ‚š‘ = 0), we can get the following equations,
which give the dual representation of the optimization!
Chapter 7.1. Maximum Margin Classifiers
General formula
Here, the kernel function is the inner product of two feature vectors, š‘˜(š‘‹, š‘‹ā€²) = šœ™(š‘‹)ᵀ šœ™(š‘‹ā€²).
1. By switching to the dual, the number of optimization variables can increase (one multiplier per data point).
2. However, this allows us to use the kernel function in the optimization.
For a new input, we can classify it by using this equation!
The KKT conditions should be satisfied!
Now, let’s talk about support vectors.
Most of the well-classified data satisfy š‘”_š‘› š‘¦(š‘‹_š‘›) > 1.
The support vectors are the data points that lie exactly on the margin boundary of the classifier.
They can be defined by
(Circled data in the figure are the support vectors.)
The dual representation means the following:
min (primal objective) ≄ max (dual objective).
If we maximize the dual representation, we obtain the greatest lower bound on the original problem!
The two values become equal under the KKT conditions, and the dual can be expressed entirely in terms of the kernel!
Chapter 7.1. Maximum Margin Classifiers
Finding support vectors
By optimizing the aforementioned dual, we can get the values of š‘Ž_š‘› (which are often written as šœ†).
Note the constraints of
Here, the data points satisfying š‘”_š‘› š‘¦(š‘‹_š‘›) = 1 are the support vectors.
Equivalently, the data with š’‚_š’ ≠ šŸŽ are the support vectors!
This is easy, since we have already found all š‘Ž_š‘› values! Now, think of how we can get the bias term.
Here, š‘† is the set of support vectors, and š‘”_š‘› indicates the label of a support vector. (Figure: SVM with a Gaussian kernel.)
A numerically stable solution for the bias averages over all support vectors.
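The following is a minimal numpy/scipy sketch of the hard-margin dual (my own illustration, not the book's code): it maximizes Ī£_n a_n āˆ’ ½ Ī£_n Ī£_m a_n a_m t_n t_m k(x_n, x_m) subject to a_n ≄ 0 and Ī£_n a_n t_n = 0, reads off the support vectors as the points with a_n > 0, and averages the stable bias formula over them:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Two linearly separable clusters, labels t in {-1, +1}.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
t = np.hstack([-np.ones(20), np.ones(20)])

K = X @ X.T                              # linear kernel k(x_n, x_m)
def neg_dual(a):                         # minimize the negative dual objective
    return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))

cons = [{"type": "eq", "fun": lambda a: a @ t}]       # sum_n a_n t_n = 0
bnds = [(0, None)] * len(t)                           # a_n >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(len(t)), method="SLSQP",
               bounds=bnds, constraints=cons)
a = res.x

sv = a > 1e-5                            # support vectors: a_n > 0
w = (a * t)[sv] @ X[sv]                  # w = sum_n a_n t_n phi(x_n)
# Stable bias: average of t_m - sum_n a_n t_n k(x_n, x_m) over the support vectors.
b = np.mean(t[sv] - (a * t) @ K[:, sv])
print("number of support vectors:", sv.sum())
print("train accuracy:", np.mean(np.sign(X @ w + b) == t))
```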
Chapter 7.1. Maximum Margin Classifiers
Overlapping class distributions
We have assumed the data are perfectly separable, which is rarely the case in practice.
That is, classes may overlap (right figure).
In order to take this into account, we introduce a new quantity, the slack variable.
A slack variable is defined for every data point:
šœ‰_š‘› = |š‘”_š‘› āˆ’ š‘¦(š‘‹_š‘›)| ≄ 0 for points that violate their margin, and šœ‰_š‘› = 0 otherwise.
Correctly classified (on or beyond the margin) : šœ‰ = 0
On the decision boundary : šœ‰ = 1
Misclassified : šœ‰ > 1
These slack variables should be as small as possible!
Thus, they are added to the original objective function with a hyper-parameter š¶.
Now, we are trying to minimize š‘³ under the constraints on šƒ_š’, where
š‘³ = š¶ Ī£_š‘› šœ‰_š‘› + ½ ‖š‘¤â€–Ā²
Chapter 7.1. Maximum Margin Classifiers
Optimization with slack variables
By taking the partial derivative with respect to each parameter, we can get the following equations.
Most of the parts are the same as in the previous separable case (without slack variables).
Here, the Lagrange multiplier š‘Ž_š‘› has the upper limit š‘Ŗ (box constraint 0 ≤ š‘Ž_š‘› ≤ š¶).
Dual representation (Maximization)
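As a quick check of the box constraint (a sketch that assumes scikit-learn, which the slides do not use), a fitted soft-margin SVM stores a_n t_n for its support vectors in `dual_coef_`, so the absolute values must not exceed C:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, t = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
C = 1.0
clf = SVC(kernel="rbf", C=C).fit(X, t)

# dual_coef_ holds a_n * t_n for the support vectors only.
a = np.abs(clf.dual_coef_).ravel()
print("max a_n:", a.max(), "<= C:", a.max() <= C + 1e-9)
print("support vectors at the box limit (a_n = C):", np.sum(np.isclose(a, C)))
```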
Chapter 7.1. Maximum Margin Classifiers
Slack variable optimization + šœˆ-SVM
Here again, the support vectors are the data points with š‘Ž_š‘› > 0, which satisfy š‘”_š‘› š‘¦(š‘‹_š‘›) = 1 āˆ’ šœ‰_š‘›
1. If š‘Ž_š‘› < š¶, this implies šœ‡_š‘› > 0, → šœ‰_š‘› = 0 / the point lies on the margin (well classified!)
2. If š‘Ž_š‘› = š¶, this implies šœ‡_š‘› = 0, → šœ‰_š‘› can be nonzero / again two possible cases:
2.1. šœ‰_š‘› ≤ 1 : correctly classified, but inside the margin
2.2. šœ‰_š‘› > 1 : misclassified!
To compute the bias, we again find the points with šŸŽ < š’‚_š’ < š‘Ŗ and use the corresponding data.
Note that the scalar š¶ is a trade-off parameter controlling how strongly margin violations are penalized.
In order to get a more intuitive hyper-parameter, there is an SVM variant called the šœˆ-SVM (nu-SVM).
Here, the optimization equation becomes…
Here, š‚ indicates,
- an upper bound on the fraction of margin errors (points with šœ‰ > 0, which may or may not be misclassified)
- a lower bound on the fraction of support vectors
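A small sketch with scikit-learn's `NuSVC` (my own check of the standard ν-SVM property stated above) verifies that the fraction of support vectors is at least ν:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import NuSVC

X, t = make_blobs(n_samples=300, centers=2, cluster_std=3.0, random_state=1)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf").fit(X, t)
    sv_fraction = len(clf.support_) / len(X)
    # nu lower-bounds the fraction of support vectors.
    print(f"nu={nu:.1f}  SV fraction={sv_fraction:.2f}  >= nu: {sv_fraction >= nu}")
```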
Chapter 7.1. Maximum Margin Classifiers
Characteristics & SMO
SMO from https://www.youtube.com/watch?v=vqoVIchkM7I
As mentioned above, SMO updates the Lagrange multipliers two at a time!
There are also various heuristics for selecting which pair of š‘Ž_š‘› to update.
The resulting two-variable sub-problem can be solved in closed form!
Consider the label-prediction equation of the SVM.
Do we have to store all of the data and perform a weighted sum over all data points every time?
Actually not, since well-classified data beyond the margin boundary have š‘Ž_š‘› = 0.
This means we only need the data with š’‚_š’ > šŸŽ, which are the support vectors!
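To see this sparsity at prediction time, here is a short sketch (mine, again assuming scikit-learn) that rebuilds the decision function from the support vectors alone, y(x) = Ī£_{n∈S} a_n t_n k(x, x_n) + b, and compares it with the library's output:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X, t = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=2)
clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, t)

X_new = np.random.default_rng(3).normal(size=(5, 2))
# Prediction only needs the support vectors: y(x) = sum_n a_n t_n k(x, x_n) + b
K = rbf_kernel(X_new, clf.support_vectors_, gamma=0.1)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(manual, clf.decision_function(X_new)))   # True
```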
Chapter 7.1. Maximum Margin Classifiers
Relation to the logistic regression
https://www.slideshare.net/ssuser36cf8e/prml-chapter-7-svm-supplementary-files
Check pages 15 & 16 of that file!!
Chapter 7.1. Maximum Margin Classifiers
SVM for regression
We can extend the simple idea of ā€˜error acceptance’ to linear regression.
This is called the ā€˜š-insensitive error function’.
(Figure legend) Red : šœ–-insensitive loss, Green : squared loss
That is, errors smaller than šœ– are accepted with no penalty; larger errors are penalized.
The boundary (red region) is called a ā€˜tube’.
However, not every data point can lie within the šœ–-tube.
Thus, we introduce slack variables again.
Thus, the error can be computed as
This can be viewed as a regularized error function!
There still exist the constraints
šœ‰ ≄ šŸŽ & šœ‰Ģ‚ ≄ šŸŽ (one slack variable for each side of the tube)
Chapter 7.1. Maximum Margin Classifiers
SVM for regression
Here, the Lagrangian objective function can be written as…
By plugging the resulting expressions back in…
Thus, the prediction for a new input can be written as…
Chapter 7.1. Maximum Margin Classifiers
SVM for regression
The dual representation should satisfy the KKT conditions to be the greatest lower bound, which are…
Interpretation of the Lagrange multipliers:
š‘Ž_š‘› ≠ 0 : support vectors lying on or above the upper tube boundary
š‘ŽĢ‚_š‘› ≠ 0 : support vectors lying on or below the lower tube boundary
Here, the bias term can be computed as…
Just like in the classification case, here also we can implement a š‚-SVM regressor.
Interpretation of the hyper-parameter šœˆ:
1. At most šœˆš‘ data points fall outside of the tube.
2. At least šœˆš‘ data points are support vectors.
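A short scikit-learn sketch (my own illustration) of ε-SVR: only the points on or outside the tube become support vectors, so widening the tube reduces their number:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 2 * np.pi, 80))[:, None]
t = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

for eps in (0.05, 0.2, 0.5):
    reg = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, t)
    # A wider tube leaves fewer points outside it, hence fewer support vectors.
    print(f"epsilon={eps:.2f}  support vectors: {len(reg.support_)} / {len(X)}")
```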
Chapter 7.2. Relevance Vector Machines
Limitations of SVM & motivation for RVM
One fundamental limitation of the SVM is that it does not yield a probability.
It can only decide whether a given data point belongs to a certain class or not.
In order to overcome this issue (to produce probabilities), we can think of a new model called the ā€˜relevance vector machine’.
It uses the idea of kernels, but retains the structure of a probabilistic model.
Let’s begin with a regression example.
The RVM also has the structure of a conditional pdf.
Here, the predicted mean š‘¦(š‘‹) is equal to
Here, the RVM replaces the basis functions š“(š‘æ) with kernel functions centred on the training points.
Thus, it includes š‘“ = š‘µ + šŸ parameters in total.
The basic idea is clear. Now, let’s move on to ā€˜how do we define the distributions?’
First, we have to define the likelihood function.
Chapter 7.2. Relevance Vector Machines
RVM for regression
Now, we have to define the prior distribution of the weights š‘¤, which are the parameters of the model.
Here, please note that we are fitting an individual šœ¶_š‘– value for each dimension of š’˜.
By computing the product of likelihood and prior (š‘(š‘¤|š‘”) āˆ š‘(š‘”|š‘¤) š‘(š‘¤)), we can get the posterior.
We can also use the general result which we derived in chapter 3.
Here, we still have to determine the nuisance parameters š›¼, š›½ of the model.
We use the evidence approximation, which we covered in chapter 3.
** Evidence Approx.
We remove the influence of š‘¤ by integrating it out, then compute the most likely value of each remaining parameter (type-II MLE).
Chapter 7.2. Relevance Vector Machines
RVM for regression
Thus, in order to estimate š›¼ and š›½, we have to compute
This can be transformed into the following form by taking the log.
We have to maximize ln š‘(š‘”|š‘‹, š›¼, š›½) with respect to š›¼ and š›½.
This optimization cannot be solved in closed form, so we have to use an iterative re-estimation method. That is,
Here, Ī£_š‘–š‘– is the i-th diagonal element of the posterior covariance matrix.
Take a look at š›¼: a huge šœ¶_š‘– (a precision) means the corresponding weight has mean zero and essentially zero variance.
Thus, that basis function has no influence on the model and is effectively pruned.
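Below is a compact numpy sketch of RVM regression (my own implementation of the re-estimation rules described above: m = βΣΦᵀt, Ī£ = (A + βΦᵀΦ)⁻¹, γ_i = 1 āˆ’ α_iĪ£_ii, α_i ← γ_i/m_i², β ← (N āˆ’ Σγ_i)/‖t āˆ’ Φm‖²), using a Gaussian-kernel design matrix and pruning basis functions whose α becomes huge:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(-5, 5, N))
t = np.sinc(x) + rng.normal(0, 0.1, N)

# Design matrix: a bias column plus one Gaussian kernel per training point.
def design(xq, xt, r=1.0):
    K = np.exp(-(xq[:, None] - xt[None, :]) ** 2 / (2 * r ** 2))
    return np.hstack([np.ones((len(xq), 1)), K])

Phi_full = design(x, x)                  # N x (N + 1) basis functions
active = np.arange(Phi_full.shape[1])    # indices of the surviving basis functions
alpha = np.ones(len(active))             # one precision per active weight
beta = 10.0                              # noise precision

for _ in range(100):
    Phi = Phi_full[:, active]
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)     # posterior covariance
    m = beta * Sigma @ Phi.T @ t                      # posterior mean
    gamma = 1.0 - alpha * np.diag(Sigma)              # "well-determinedness"
    alpha = gamma / (m ** 2)                          # re-estimate each alpha_i
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    keep = (alpha > 0) & (alpha < 1e6)                # huge alpha => prune basis
    active, alpha = active[keep], alpha[keep]

print("relevance vectors kept:", len(active), "out of", Phi_full.shape[1])
```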
Chapter 7.2. Relevance Vector Machines
RVM for regression
After we find the optimal values for š›¼ and š›½, we can form the predictive distribution for the target value š’•.
Now let’s compare the SVM’s regression and the RVM’s regression.
(Figure: SVM fit vs. RVM fit.)
1. The RVM requires far fewer relevance (support) vectors, which means we can save prediction time.
2. However, the RVM takes more time to train the model, due to the inversion of the š‘Ŗ matrix.
Chapter 7.2. Relevance Vector Machines
Analysis of Sparsity
Let’s focus on the parameter š›¼. How does it contribute to the model’s sparsity? (Selecting a reasonable set of basis functions.)
Consider a model with only one basis function and two data points (š‘„1, š‘”1), (š‘„2, š‘”2).
Then, the aforementioned matrix š‘Ŗ can be computed as
šœ‘ is the š‘-dimensional vector (šœ™(š‘‹1), šœ™(š‘‹2))ᵀ (here š‘ = 2), and similarly š’• = (š‘”1, š‘”2)ᵀ.
(Left figure: š›¼ has an infinite value. Right figure: finite value of š›¼.)
The direction of š‹ relative to š’• is what matters!
Chapter 7.2. Relevance Vector Machines
Mathematical perspective
We now move on to š‘-dimensional variables. We are still maximizing with respect to š›¼ and š›½, and š¶ still appears in the marginal likelihood.
We can re-write š¶ as
Here, š‹_š’Š indicates the i-th column of the design matrix šš½.
Here, we have to compute |š¶| and š¶ā»Ā¹.
However, we don’t know them directly; we have to think about how we can express them
in terms of š‘Ŗ_{āˆ’š’Š} (š¶ with basis š‘– removed), šœ¶_š’Š, and š‹_š‘–.
This can be done by using the equations below.
Chapter 7.2. Relevance Vector Machines
Mathematical perspective
We can organize all of these values in terms of two new variables š‘ _š‘– and š‘ž_š‘–.
Here, š’”_š’Š indicates the sparsity and š’’_š’Š the quality of š‹_š‘–.
1. Sparsity (š‘ _š‘–) measures the extent to which the basis function šœ‘_š‘– overlaps with the other basis vectors in the model.
2. Quality (š‘ž_š‘–) measures the alignment of the basis vector š‹_š’Š with the training targets š’• (more precisely, with the error of the model that excludes šœ‘_š‘–).
Now, in order to decide the optimal value of š›¼_š‘–, we do not need to consider the values of the other š›¼_š‘—.
So we only have to calculate the derivative of šœ†(š›¼_š‘–), which is introduced on the following page.
Chapter 7.2. Relevance Vector Machines
Mathematical perspective
The equation can be written as
Recall that š›¼_š‘– ≄ 0 (it’s a precision!), so we should think of two cases.
1. If š‘ž_š‘–Ā² < š‘ _š‘–, then š›¼_š‘– → āˆž / the second term is positive, so the first term should be close to zero.
2. If š‘ž_š‘–Ā² > š‘ _š‘–, the solution can be written as
According to these equations, we can get an iterative optimization method for the RVM.
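The per-basis decision above fits in a tiny helper (a sketch assuming the standard result α_i = s_i²/(q_i² āˆ’ s_i) when q_i² > s_i, and α_i → āˆž otherwise):

```python
import numpy as np

def optimal_alpha(s_i, q_i):
    """Sequential sparse-Bayes update for a single basis function."""
    if q_i ** 2 > s_i:
        return s_i ** 2 / (q_i ** 2 - s_i)   # keep the basis with a finite precision
    return np.inf                            # prune the basis (alpha -> infinity)

print(optimal_alpha(s_i=1.0, q_i=2.0))   # finite: basis is kept
print(optimal_alpha(s_i=1.0, q_i=0.5))   # inf: basis is pruned
```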
Chapter 7.2. Relevance Vector Machines
RVM for classification
The relevance vector machine can be extended to a classification model by simply using a logistic regression model with an ARD prior.
Just as we covered in chapter 4, we cannot integrate with respect to š‘¤ analytically. Instead, we use the Laplace approximation.
It’s been a while, so let’s briefly review the Laplace approximation.
That is, each weight parameter has its own prior, and the priors are independent!
What we need are…
1. Mode of the posterior
2. Hessian of the posterior.
Here, the mode is given by…
Note that šµ is the š‘ Ɨ š‘ diagonal matrix with elements š‘¦_š‘›(1 āˆ’ š‘¦_š‘›).
Chapter 7.2. Relevance Vector Machines
RVM for classification
Here, we don’t know the exact value of š›¼, so we have to estimate it by maximizing the evidence.
After substituting each term, we can get an estimate of š›¼ by setting the derivative of the marginal likelihood to zero.
Note that the result is equivalent to that of the regression case.
At the same time, by defining š’• as follows, we can get a much simpler path.
Note that this result matches the result of the regression example.
Thus, we can apply the same analysis of šœ¶ as we did before!
For the multi-class case, we can simply train š‘˜ different models, one for each of the š‘˜ class labels, and then use the softmax function.
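To tie the classification pieces together, here is a compact numpy sketch (my own, under the steps described above: an inner IRLS/Newton loop to find the posterior mode, the Hessian ΦᵀBΦ + A, then the ARD update α_i ← γ_i/w_i²) using a simple bias-plus-linear design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(+1, 1, (N // 2, 2))])
t = np.hstack([np.zeros(N // 2), np.ones(N // 2)])        # targets in {0, 1}

Phi = np.hstack([np.ones((N, 1)), X])                     # bias + linear features
alpha = np.ones(Phi.shape[1])
w = np.zeros(Phi.shape[1])
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for _ in range(50):
    # Inner IRLS loop: find the mode w* of the posterior for fixed alpha.
    for _ in range(10):
        y = sigmoid(Phi @ w)
        B = np.diag(y * (1 - y))                          # B = diag of y_n (1 - y_n)
        H = Phi.T @ B @ Phi + np.diag(alpha)              # negative Hessian at w
        grad = Phi.T @ (t - y) - alpha * w                # gradient of the log posterior
        w = w + np.linalg.solve(H, grad)                  # Newton / IRLS step
    Sigma = np.linalg.inv(H)                              # Laplace covariance at the mode
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = np.minimum(gamma / (w ** 2 + 1e-12), 1e6)     # re-estimate alpha (ARD)

print("learned alphas:", np.round(alpha, 2))
print("train accuracy:", np.mean((sigmoid(Phi @ w) > 0.5) == t))
```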
