Machine Learning for Data Mining
Introduction to Support Vector Machines
Andres Mendez-Vazquez
June 22, 2016
1 / 124
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels
2 / 124
History
Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
At the Institute of Control Sciences, Moscow
In the paper “Estimation of dependencies based on empirical data”
Corinna Cortes and Vladimir Vapnik in 1995
They invented the current incarnation - Soft Margins
At AT&T Labs
BTW Corinna Cortes
Danish computer scientist who is known for her contributions to the field of machine learning.
She is currently the Head of Google Research, New York.
Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on the theoretical foundations of support vector machines.
4 / 124
In addition
Alexey Yakovlevich Chervonenkis
He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory.
He died on September 22nd, 2014
At Losiny Ostrov National Park.
5 / 124
Applications
Partial List
1 Predictive Control
Control of chaotic systems.
2 Inverse Geosounding Problem
It is used to understand the internal structure of our planet.
3 Environmental Sciences
Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
Recognizing whether two different species contain similar genes.
5 Facial expression classification
6 Texture Classification
7 E-Learning
8 Handwriting Recognition
9 AND counting....
6 / 124
Separable Classes
Given
xi, i = 1, · · · , N
A set of samples belonging to two classes ω1, ω2.
Objective
We want to obtain a decision function as simple as
g (x) = wT x + w0
8 / 124
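To make the decision function concrete, here is a minimal Python sketch (the weights, bias and points are made-up illustrative values, not from the slides): it evaluates g(x) = wT x + w0 and classifies by the sign of g, as used on the following slides.

```python
import numpy as np

# Hypothetical weights and bias for a 2D problem (illustrative values only).
w = np.array([2.0, -1.0])
w0 = 0.5

def g(x):
    """Linear decision function g(x) = w^T x + w0."""
    return w @ x + w0

# Classify a few sample points by the sign of g(x):
# g(x) >= 0 -> class omega_1 (d = +1), g(x) < 0 -> class omega_2 (d = -1).
for x in [np.array([1.0, 1.0]), np.array([-1.0, 2.0])]:
    d = 1 if g(x) >= 0 else -1
    print(x, g(x), d)
```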
Such that we can do the following
A linear separation function g (x) = wT x + w0
9 / 124
In other words ...
We have the following samples
For x1, · · · , xm ∈ C1
For x1, · · · , xn ∈ C2
We want the following decision surfaces
wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1
wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2
11 / 124
What do we want?
Our goal is to search for a direction w that gives the maximum possible margin
[Figure: two candidate directions and their margins]
12 / 124
Remember
We have the following
[Figure: the distance d from the origin to the hyperplane and the distance r of a point from its projection]
13 / 124
A Little of Geometry
Thus
[Figure: a right triangle with legs A and B, hypotenuse C, and the distances d and r]
Then
d = |w0| / √(w1^2 + w2^2) , r = |g (x)| / √(w1^2 + w2^2)   (1)
14 / 124
First d = |w0| / √(w1^2 + w2^2)
We can use the following rule in a triangle with a 90° angle
Area = (1/2) C d   (2)
In addition, the area can also be calculated as
Area = (1/2) A B   (3)
Thus
d = A B / C
Remark: Can you get the rest of the values?
15 / 124
What about r = |g(x)| / √(w1^2 + w2^2)?
First, remember
g (xp) = 0 and x = xp + r (w / ||w||)   (4)
Thus, we have
g (x) = wT (xp + r (w / ||w||)) + w0
      = wT xp + w0 + r (wT w / ||w||)
      = wT xp + w0 + r (||w||^2 / ||w||)
      = g (xp) + r ||w||
Then
r = g(x) / ||w||
16 / 124
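A small numerical check of the two distances above, a sketch with the same made-up w and w0 as before: d is the distance from the origin to the hyperplane and r is the signed distance of a point x from it.

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weights w = (w1, w2)
w0 = 0.5                    # hypothetical bias

def g(x):
    return w @ x + w0

# Distance from the origin to the hyperplane: d = |w0| / sqrt(w1^2 + w2^2).
d = abs(w0) / np.linalg.norm(w)

# Signed distance of an arbitrary point x: r = g(x) / ||w||.
x = np.array([1.0, 3.0])
r = g(x) / np.linalg.norm(w)

# Sanity check: projecting x back onto the hyperplane gives g(x_p) ~ 0.
x_p = x - r * w / np.linalg.norm(w)
print(d, r, g(x_p))   # g(x_p) should be numerically ~ 0
```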
This has the following interpretation
[Figure: the distance r of x from its projection xp onto the hyperplane]
17 / 124
Now
We know that the straight line that we are looking for looks like
wT x + w0 = 0   (5)
What about something like this
wT x + w0 = δ   (6)
Clearly
This will be above or below the initial line wT x + w0 = 0.
18 / 124
Come back to the hyperplanes
We have then, for each border support line, a specific bias!!!
[Figure: the two margin hyperplanes and their support vectors]
19 / 124
Then, normalize by δ
The new margin functions
w'T x + w10 = 1
w'T x + w01 = −1
where w' = w / δ, w10 = w0 / δ, and w01 = w0 / δ
Now, we come back to the middle separator hyperplane, but with the normalized term
wT xi + w0 ≥ w'T x + w10 for di = +1
wT xi + w0 ≤ w'T x + w01 for di = −1
Where w0 is the bias of that central hyperplane!! And w' is the normalized version of w
20 / 124
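A quick numeric illustration of this normalization (δ, w and w0 are made-up values): dividing by δ maps the margin hyperplane wT x + w0 = δ onto w'T x + w0/δ = 1.

```python
import numpy as np

w = np.array([2.0, -1.0])    # hypothetical direction
w0 = 0.5                     # hypothetical bias
delta = 2.0                  # hypothetical margin offset

w_prime = w / delta          # normalized direction  w' = w / delta
w0_prime = w0 / delta        # normalized bias       w0' = w0 / delta

# x lies on the upper margin hyperplane w^T x + w0 = delta  (2*1 - 0.5 + 0.5 = 2).
x = np.array([1.0, 0.5])

print(w @ x + w0)            # equals delta -> 2.0
print(w_prime @ x + w0_prime)  # after normalization -> 1.0
```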
Come back to the hyperplanes
The meaning of what I am saying!!!
21 / 124
A little about Support Vectors
They are the vectors (Here, we assume the normalized w)
xi such that wT xi + w0 = 1 or wT xi + w0 = −1
Properties
They are the vectors nearest to the decision surface and the most difficult to classify.
Because of that, we have the name “Support Vector Machines”.
23 / 124
Now, we can summarize the decision rule for the hyperplane
For the support vectors
g (xi) = wT xi + w0 = −(+)1 for di = −(+)1   (7)
Implies
The distance to the support vectors is:
r = g (xi) / ||w|| =
1/||w|| if di = +1
−1/||w|| if di = −1
24 / 124
Therefore ...
We want the optimum value of the margin of separation as
ρ = 1/||w|| + 1/||w|| = 2/||w||   (8)
And the support vectors define the value of ρ
25 / 124
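A hedged sketch with scikit-learn (assuming it is installed; the toy data are made up): fitting a linear SVM with a very large C to approximate the hard-margin case, then reading off the margin ρ = 2/||w|| and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
d = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, d)

w = clf.coef_.ravel()                 # learned weight vector
rho = 2.0 / np.linalg.norm(w)         # margin of separation rho = 2 / ||w||
print("margin:", rho)
print("support vectors:\n", clf.support_vectors_)
```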
Quadratic Optimization
Then, we have the samples with labels
T = {(xi, di)}_{i=1}^{N}
Then we can put the decision rule as
di (wT xi + w0) ≥ 1, i = 1, · · · , N
27 / 124
Then, we have the optimization problem
The optimization problem
min_w Φ (w) = (1/2) wT w
s.t. di(wT xi + w0) ≥ 1, i = 1, · · · , N
Observations
The cost function Φ (w) is convex.
The constraints are linear with respect to w.
28 / 124
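Since this primal is a small convex QP, it can be handed to a generic constrained solver. A sketch with scipy's SLSQP (the toy data and the variable packing z = (w1, w2, w0) are my own illustrative choices, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the unknowns as z = (w1, w2, w0).
def objective(z):
    w = z[:2]
    return 0.5 * w @ w                      # Phi(w) = 1/2 w^T w

def constraint(z, i):
    w, w0 = z[:2], z[2]
    return d[i] * (w @ X[i] + w0) - 1.0     # d_i (w^T x_i + w0) - 1 >= 0

cons = [{"type": "ineq", "fun": constraint, "args": (i,)} for i in range(len(d))]
res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=cons)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0)
```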
Lagrange Multipliers
The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality
constrained optimization problems.
This is done by converting a constrained problem to an equivalent
unconstrained problem with the help of certain unspecified parameters
known as Lagrange multipliers.
30 / 124
Lagrange Multipliers
The classical problem formulation
min f (x1, x2, ..., xn)
s.t. h1 (x1, x2, ..., xn) = 0
It can be converted into
min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λ h1 (x1, x2, ..., xn)}   (9)
where
L(x, λ) is the Lagrangian function.
λ is an unspecified positive or negative constant called the Lagrange Multiplier.
31 / 124
Finding an Optimum using Lagrange Multipliers
New problem
min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λ h1 (x1, x2, ..., xn)}
We want a λ = λ* optimal
If the minimum of L (x1, x2, ..., xn, λ*) occurs at (x1, x2, ..., xn)^T = (x1, x2, ..., xn)^{T*} and (x1, x2, ..., xn)^{T*} satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn)^{T*} minimizes:
min f (x1, x2, ..., xn)
s.t. h1 (x1, x2, ..., xn) = 0
Trick
It is to find the appropriate value for the Lagrange multiplier λ.
32 / 124
Remember
Think about this
Remember First Law of Newton!!!
Yes!!!
A system in equilibrium does not move
Static Body
33 / 124
Lagrange Multipliers
Definition
Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems.
34 / 124
Lagrange was a Physicist
He was thinking of the following formula
A system in equilibrium has the following equation:
F1 + F2 + ... + FK = 0   (10)
But functions do not have forces, do they?
Are you sure?
Think about the following
The Gradient of a surface.
35 / 124
Gradient to a Surface
After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:
∇f (x) = i ∂f (x)/∂x + j ∂f (x)/∂y + k ∂f (x)/∂z   (11)
where i, j and k are unitary vectors in the directions x, y and z.
36 / 124
Example
We have f (x, y) = x exp{−x^2 − y^2}
[Figure: surface plot of f]
37 / 124
Example
With the gradient at the contours when projecting onto the 2D plane
[Figure: contour plot of f with gradient arrows]
38 / 124
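For the same example function, a short sympy sketch (assuming sympy is available) that computes the gradient symbolically:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x * sp.exp(-x**2 - y**2)

# Gradient: the vector of partial derivatives (the i and j components of grad f).
grad_f = [sp.diff(f, x), sp.diff(f, y)]
print(sp.simplify(grad_f[0]))   # d/dx: (1 - 2x^2) exp(-x^2 - y^2)
print(sp.simplify(grad_f[1]))   # d/dy: -2 x y exp(-x^2 - y^2)
```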
Now, Think about this
Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters λ
Thus, we have
F0 + λ1 F1 + ... + λK FK = 0   (12)
where F0 is the gradient of the principal cost function and Fi, for i = 1, 2, ..., K, are the gradients of the constraints.
39 / 124
Thus
If we have the following optimization:
min f (x)
s.t. g1 (x) = 0
g2 (x) = 0
40 / 124
Geometric interpretation in the case of minimization
What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by −1
Here the cost function is f (x, y) = x exp{−x^2 − y^2} and we want to minimize it
[Figure: contours of f (x) with the constraint curves g1 (x) and g2 (x)]
−∇f (x) + λ1 ∇g1 (x) + λ2 ∇g2 (x) = 0
Nevertheless: it is equivalent to ∇f (x) − λ1 ∇g1 (x) − λ2 ∇g2 (x) = 0
40 / 124
Method
Steps
1 The original problem is rewritten as: minimize L (x, λ) = f (x) − λ h1 (x)
2 Take derivatives of L (x, λ) with respect to xi and set them equal to zero.
3 Express all xi in terms of the Lagrange multiplier λ.
4 Plug x in terms of λ into the constraint h1 (x) = 0 and solve for λ.
5 Calculate x by using the just found value for λ.
From step 2
If there are n variables (i.e., x1, · · · , xn) then you will get n equations with n + 1 unknowns (i.e., n variables xi and one Lagrange multiplier λ).
42 / 124
Example
We can apply that to the following problem
min f (x, y) = x^2 − 8x + y^2 − 12y + 48
s.t. x + y = 8
43 / 124
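Following the steps above, a sympy sketch that solves this example: the stationarity equations of the Lagrangian and the constraint are solved jointly for x, y and λ.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda")
f = x**2 - 8*x + y**2 - 12*y + 48
h = x + y - 8                       # constraint written as h(x, y) = 0

L = f - lam * h                     # Lagrangian L = f - lambda * h

# Stationarity w.r.t. x and y, plus the original constraint.
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), h], [x, y, lam], dict=True)
print(sol)                          # x = 3, y = 5, lambda = -2
print(f.subs(sol[0]))               # objective value at the optimum: -2
```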
Then, Rewriting The Optimization Problem
The optimization with inequality constraints
min_w Φ (w) = (1/2) wT w
s.t. di(wT xi + w0) ≥ 1, i = 1, · · · , N
44 / 124
Then, for our problem
Using the Lagrange Multipliers (We will call them αi)
We obtain the following cost function that we want to minimize
J(w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi [di(wT xi + w0) − 1]
Observation
Minimize with respect to w and w0.
Maximize with respect to α because it dominates
−Σ_{i=1}^{N} αi [di(wT xi + w0) − 1].   (13)
45 / 124
Saddle Point?
At the left the original problem, at the right the Lagrangian!!!
[Figure: contours of f (x) with constraints g1 (x) and g2 (x), next to the saddle-shaped Lagrangian surface]
46 / 124
Karush-Kuhn-Tucker Conditions
First An Inequality Constrained Problem P
min f (x)
s.t. g1 (x) = 0
...
gN (x) = 0
A really minimal version!!! Hey, it is a patch work!!!
A point x is a local minimum of an equality constrained problem P only if a set of non-negative αj’s may be found such that:
∇L (x, α) = ∇f (x) − Σ_{i=1}^{N} αi ∇gi (x) = 0
48 / 124
Karush-Kuhn-Tucker Conditions
Important
Think about this: each constraint corresponds to a sample in one of the two classes, thus
The corresponding αi’s are going to be zero after optimization if a constraint is not active, i.e. di(wT xi + w0) − 1 > 0 (Remember the Maximization).
Again the Support Vectors
This actually defines the idea of support vectors!!!
Thus
Only the αi’s with active constraints (Support Vectors) will be different from zero, i.e. when di(wT xi + w0) − 1 = 0.
49 / 124
A small deviation from the SVM’s for the sake of Vox Populi
Theorem (Karush-Kuhn-Tucker Necessary Conditions)
Let X be a non-empty open set in Rn, and let f : Rn → R and gi : Rn → R for i = 1, ..., m. Consider the problem P to minimize f (x) subject to x ∈ X and gi (x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | gi (x) = 0}. Suppose that f and gi for i ∈ I are differentiable at x and that gi for i ∉ I are continuous at x. Furthermore, suppose that ∇gi (x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars ui for i ∈ I such that
∇f (x) + Σ_{i∈I} ui ∇gi (x) = 0
ui ≥ 0 for i ∈ I
50 / 124
It is more...
In addition to the above assumptions
If gi for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form:
∇f (x) + Σ_{i=1}^{m} ui ∇gi (x) = 0
ui gi (x) = 0 for i = 1, ..., m
ui ≥ 0 for i = 1, ..., m
51 / 124
The necessary conditions for optimality
We use the previous theorem on
J (w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi [di(wT xi + w0) − 1]   (14)
Condition 1
∂J (w, w0, α) / ∂w = 0
Condition 2
∂J (w, w0, α) / ∂w0 = 0
52 / 124
Using the conditions
We have the first condition
∂J(w, w0, α)/∂w = ∂((1/2) wT w)/∂w − ∂(Σ_{i=1}^{N} αi [di(wT xi + w0) − 1])/∂w = 0
∂J(w, w0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^{N} αi di xi
Thus
w = Σ_{i=1}^{N} αi di xi   (15)
53 / 124
In a similar way ...
We have by the second optimality condition
Σ_{i=1}^{N} αi di = 0
Note
αi [di(wT xi + w0) − 1] = 0
Because the constraint vanishes at the optimal solution, i.e. αi = 0 or di(wT xi + w0) − 1 = 0.
54 / 124
Thus
We need something extra
Our classic trick of transforming a problem into another problem
In this case
We use the Primal-Dual Problem for Lagrangian
Where
We move from a minimization to a maximization!!!
55 / 124
Lagrangian Dual Problem
Consider the following nonlinear programming problem
Primal Problem P
min f (x)
s.t. gi (x) ≤ 0 for i = 1, ..., m
hi (x) = 0 for i = 1, ..., l
x ∈ X
Lagrange Dual Problem D
max Θ (u, v)
s.t. u ≥ 0
where Θ (u, v) = inf_x { f (x) + Σ_{i=1}^{m} ui gi (x) + Σ_{i=1}^{l} vi hi (x) | x ∈ X }
57 / 124
What does this mean?
Assume that the equality constraint does not exist
We have then
min f (x)
s.t gi (x) ≤ 0 for i = 1, ..., m
x ∈ X
Now assume that we finish with only one constraint
We have then
min f (x)
s.t g (x) ≤ 0
x ∈ X
58 / 124
What does this mean?
First, we have the following figure
[Figure: the image set G in the (y, z) plane, a supporting line of slope −u, and the points A and B]
59 / 124
What does this mean?
Thus, in the y − z plane you have
G = {(y, z) | y = g (x) , z = f (x) for some x ∈ X}   (16)
Thus
Given u ≥ 0, we need to minimize f (x) + u g(x) over x ∈ X to find θ (u) - equivalent to ∇f (x) + u ∇g(x) = 0
60 / 124
What does this mean?
Thus, in the y − z plane, we have
z + uy = α   (17)
a line with slope −u.
Then, to minimize z + uy = α
We need to move the line z + uy = α parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G.
61 / 124
In other words
Move the line parallel to itself until it supports G
[Figure: the set G with the supporting line of slope −u]
Note: The set G lies above the line and touches it.
62 / 124
Thus
Thus
Then, the problem is to find the slope of the supporting hyperplane for
G.
Then its intersection with the z-axis
Gives θ(u)
63 / 124
Again
We can see the θ
[Figure: θ(u) as the z-intercept of the supporting line of G with slope −u]
64 / 124
Thus
The dual problem is equivalent
Finding the slope of the supporting hyperplane such that its intercept on
the z-axis is maximal
65 / 124
Or
Such a hyperplane has slope −u and supports G at (y, z)
[Figure: the optimal supporting line of G]
Remark: The optimal solution is u and the optimal dual objective is z.
66 / 124
For more on this Please!!!
Look at this book
From “Nonlinear Programming: Theory and Algorithms” by Mokhtar
S. Bazaraa, and C. M. Shetty. Wiley, New York, (2006)
At Page 260.
67 / 124
Example (Lagrange Dual)
Primal
min x1^2 + x2^2
s.t. −x1 − x2 + 4 ≤ 0
x1, x2 ≥ 0
Lagrange Dual
Θ(u) = inf {x1^2 + x2^2 + u(−x1 − x2 + 4) | x1, x2 ≥ 0}
68 / 124
Solution
Differentiate with respect to x1 and x2
We have two cases to take into account: u ≥ 0 and u < 0
The first case is clear
What about when u < 0?
We have that
θ (u) =
−(1/2) u^2 + 4u if u ≥ 0
4u if u < 0   (18)
69 / 124
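A quick numerical check of θ(u), a sketch assuming scipy is available: for a fixed u, the infimum over x1, x2 ≥ 0 is computed directly and compared with the closed form above.

```python
import numpy as np
from scipy.optimize import minimize

def theta_numeric(u):
    # inf over x1, x2 >= 0 of x1^2 + x2^2 + u * (-x1 - x2 + 4)
    obj = lambda x: x[0]**2 + x[1]**2 + u * (-x[0] - x[1] + 4)
    res = minimize(obj, x0=[1.0, 1.0], bounds=[(0, None), (0, None)])
    return res.fun

def theta_closed(u):
    return -0.5 * u**2 + 4*u if u >= 0 else 4*u

for u in [-1.0, 0.0, 1.0, 2.0, 4.0]:
    print(u, theta_numeric(u), theta_closed(u))   # the two values should agree
```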
Duality Theorem
First Property
If the Primal has an optimal solution, the Dual has one too.
Thus
For w* and α* to be optimal solutions of the primal and dual problems respectively, it is necessary and sufficient that w*:
Is feasible for the primal problem, and
Φ(w*) = J (w*, w0*, α*) = min_w J (w, w0*, α*)
71 / 124
Reformulate our Equations
We have then
J (w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi di wT xi − w0 Σ_{i=1}^{N} αi di + Σ_{i=1}^{N} αi
Now, using our 2nd optimality condition (Σ_{i=1}^{N} αi di = 0)
J (w, w0, α) = (1/2) wT w − Σ_{i=1}^{N} αi di wT xi + Σ_{i=1}^{N} αi
72 / 124
We have finally for the 1st Optimality Condition:
First
wT w = Σ_{i=1}^{N} αi di wT xi = Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjT xi
Second, setting J (w, w0, α) = Q (α)
Q (α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjT xi
73 / 124
From here, we have the problem
This is the problem that we really solve
Given the training sample {(xi, di)}_{i=1}^{N}, find the Lagrange multipliers {αi}_{i=1}^{N} that maximize the objective function
Q(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xjT xi
subject to the constraints
Σ_{i=1}^{N} αi di = 0   (19)
αi ≥ 0 for i = 1, · · · , N   (20)
Note
In the Primal, we were trying to minimize the cost function; for this it is necessary to maximize over α. That is the reason why we are maximizing Q (α).
74 / 124
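One way to solve this dual in practice is to hand it to a generic QP solver. A sketch using cvxopt (assuming it is installed; the toy data are made up). cvxopt minimizes (1/2) αT P α + qT α, so the signs are flipped relative to maximizing Q(α).

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)

# Maximizing Q(a) = sum a_i - 1/2 sum_ij a_i a_j d_i d_j x_i^T x_j is the same as
# minimizing 1/2 a^T P a + q^T a with P_ij = d_i d_j x_i^T x_j and q = -1.
P = matrix(np.outer(d, d) * (X @ X.T) + 1e-8 * np.eye(N))  # tiny ridge for stability
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))          # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(N))
A = matrix(d.reshape(1, -1))    # sum a_i d_i = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array(sol["x"]).ravel()
print("alpha =", alpha)
```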
Solving for α
We can compute w* once we get the optimal αi* by using (Eq. 15)
w* = Σ_{i=1}^{N} αi* di xi
In addition, we can compute the optimal bias w0* using the optimal weight w*
For this, we use the positive margin equation:
g (x^(s)) = wT x^(s) + w0 = 1
corresponding to a positive support vector.
Then
w0 = 1 − (w*)T x^(s) for d^(s) = 1   (21)
75 / 124
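Continuing the cvxopt sketch from the dual-problem slide above (alpha, d and X as defined there; still toy values), the optimal weight vector and bias can be recovered as follows:

```python
import numpy as np

# Continuation of the cvxopt sketch above: alpha, d, X come from that snippet.
w_star = (alpha * d) @ X          # w* = sum_i alpha_i* d_i x_i

# Pick a positive support vector (alpha_i clearly nonzero, d_i = +1) and apply
# the margin equation w^T x^(s) + w0 = 1 to recover the bias.
sv = np.where((alpha > 1e-6) & (d > 0))[0][0]
w0_star = 1.0 - w_star @ X[sv]
print("w* =", w_star, "w0* =", w0_star)
```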
What do we need?
Until now, we have only a maximal margin algorithm
All this works fine when the classes are separable.
Problem: What happens when they are not separable?
What can we do?
77 / 124
Map to a higher Dimensional Space
Assume that there exists a mapping
x ∈ R^l → y ∈ R^k
Then, it is possible to define the following mapping
79 / 124
Define a map to a higher Dimension
Nonlinear transformations
Given a series of nonlinear transformations
{φi (x)}_{i=1}^{m}
from the input space to the feature space.
We can define the decision surface as
Σ_{i=1}^{m} wi φi (x) + w0 = 0
80 / 124
This allows us to define
The following vector
φ (x) = (φ0 (x) , φ1 (x) , · · · , φm (x))^T
that represents the mapping.
From this mapping
We can define the following kernel function
K : X × X → R
K (xi, xj) = φ (xi)T φ (xj)
81 / 124
Example
Assume
x ∈ R^2 → y = (x1^2, √2 x1 x2, x2^2)^T
We can show that
yiT yj = (xiT xj)^2
83 / 124
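A numeric check of this identity (made-up vectors): the explicit feature map gives the same inner products as the squared dot product in the input space.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel (2D input -> 3D feature space).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi = np.array([1.0, 2.0])   # illustrative points
xj = np.array([3.0, -1.0])

lhs = phi(xi) @ phi(xj)     # y_i^T y_j
rhs = (xi @ xj) ** 2        # (x_i^T x_j)^2
print(lhs, rhs)             # both equal 1.0
```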
Example of Kernels
Polynomials
k (x, z) = (xT z + 1)^q, q > 0
Radial Basis Functions
k (x, z) = exp(−||x − z||^2 / σ^2)
Hyperbolic Tangents
k (x, z) = tanh(β xT z + γ)
84 / 124
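These three kernels are easy to write down directly; a small numpy sketch (the parameter values are arbitrary defaults, not recommendations):

```python
import numpy as np

def poly_kernel(x, z, q=2):
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    return np.tanh(beta * (x @ z) + gamma)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z), rbf_kernel(x, z), tanh_kernel(x, z))
```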
Now, How to select a Kernel?
We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.
Thus
In general, the Radial Basis Function kernel is a reasonable first choice.
Then
If this fails, we can try the other possible kernels.
86 / 124
Thus, we have something like this
Step 1
Normalize the data.
Step 2
Use cross-validation to adjust the parameters of the selected kernel.
Step 3
Train against the entire dataset.
87 / 124
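One common way to carry out the three steps above with scikit-learn (a sketch, assuming scikit-learn is available; the grid values and toy data are arbitrary):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Made-up toy data; replace with the real training set.
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
d = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 1: normalize; Step 2: cross-validate the kernel parameters; Step 3: refit on all data.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, d)   # refit=True retrains on the full set
print(search.best_params_)
```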
Optimal Hyperplane for non-separable patterns
Important
We have been considering only problems where the classes are linearly separable.
Now
What happens when the patterns are not separable?
Thus, we can still build a separating hyperplane
But errors will happen in the classification... We need to minimize them...
89 / 124
What if the following happens
Some data points invade the “margin” space
[Figure: a data point violating the margin of the optimal hyperplane]
90 / 124
Fixing the Problem - Corinna’s Style
The margin of separation between classes is said to be soft if a data point (xi, di) violates the following condition
di (wT xi + b) ≥ +1, i = 1, 2, ..., N   (22)
This violation can arise in one of two ways
The data point (xi, di) falls inside the region of separation but on the right side of the decision surface - still a correct classification.
91 / 124
We have then
Example
[Figure: a data point inside the margin but on the correct side of the optimal hyperplane]
92 / 124
Or...
This violation can arise in one of two ways
The data point (xi, di) falls on the wrong side of the decision surface - incorrect classification.
Example
[Figure: a data point on the wrong side of the optimal hyperplane]
93 / 124
Solving the problem
What to do?
We introduce a set of nonnegative scalar values {ξi}_{i=1}^{N}.
Introduce this into the decision rule
di (wT xi + b) ≥ 1 − ξi, i = 1, 2, ..., N   (23)
94 / 124
The ξi are called slack variables
What?
In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified
maximum margin idea that allows for mislabeled examples.
Ok!!!
Instead of expecting to have constant margin for all the samples, the
margin can change depending of the sample.
What do we have?
ξi measures the deviation of a data point from the ideal condition of
pattern separability.
95 / 124
Properties of ξi
What if?
You have 0 ≤ ξi ≤ 1
We have
[Figure: the data point falls inside the margin but still on the correct side of the decision surface]
96 / 124
Properties of ξi
What if?
You have ξi > 1
We have
[Figure: the data point falls on the wrong side of the decision surface - a misclassification]
97 / 124
Support Vectors
We want
Support vectors that satisfy equation (Eq. 23) even when ξi > 0
di (wT xi + b) ≥ 1 − ξi, i = 1, 2, ..., N
98 / 124
We want the following
We want to find a hyperplane
Such that the average misclassification error over all the samples
(1/N) Σ_{i=1}^{N} ei^2   (24)
is minimized.
99 / 124
First Attempt Into Minimization
We can try the following
Given
I (x) = 0 if x ≤ 0, 1 if x > 0   (25)
Minimize the following
Φ (ξ) = Σ_{i=1}^{N} I (ξi − 1)   (26)
with respect to the weight vector w subject to
1 di (wT xi + b) ≥ 1 − ξi, i = 1, 2, ..., N
2 ||w||^2 ≤ C for a given C.
100 / 124
Problem
Using this first attempt
Minimization of Φ(ξ) with respect to w is a non-convex optimization
problem that is NP-complete.
Thus, we need to use an approximation, maybe
Φ(ξ) = Σ_{i=1}^{N} ξ_i (27)
Now, we simplify the computations by incorporating the vector w into the
objective
Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i (28)
101 / 124
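To see why (Eq. 27) is a workable stand-in for (Eq. 26), the short sketch below (the slack values are made up for illustration) compares the non-convex count of points with ξi > 1 against the convex sum of slacks. Since ξ_i ≥ I(ξ_i − 1) for every nonnegative ξ_i, the sum never undercounts the errors, which is the upper-bound property noted two slides ahead:

```python
import numpy as np

# Hypothetical slack values for five training points
xi = np.array([0.0, 0.3, 0.9, 1.4, 2.5])

# Eq. 26: non-convex count of points with xi_i > 1 (the misclassified ones)
count_errors = int(np.sum(xi > 1.0))

# Eq. 27: convex surrogate, the plain sum of slacks
sum_slacks = float(np.sum(xi))

print(count_errors, sum_slacks)   # 2 5.1
```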
Important
First
Minimizing the first term in (Eq. 28) is related to minimizing the
Vapnik–Chervonenkis dimension.
Which is a measure of the capacity (complexity, expressive power,
richness, or flexibility) of a statistical classification algorithm.
Second
The second term Σ_{i=1}^{N} ξ_i is an upper bound on the number of test errors.
102 / 124
Some problems for the Parameter C
Little Problem
The parameter C has to be selected by the user.
This can be done in two ways
1 The parameter C is determined experimentally via the standard use of a
training/(validation) test set (see the sketch below).
2 It is determined analytically by estimating the Vapnik–Chervonenkis
dimension.
103 / 124
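As an illustration of option 1 (not part of the original slides), a validation-based search over C could look like the following sketch; it assumes scikit-learn is available, and the data X and labels d are purely hypothetical:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical, roughly linearly separable toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d = np.sign(X[:, 0] + X[:, 1] + 0.1)

# Pick C by cross-validated accuracy on held-out folds
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5)
search.fit(X, d)
print(search.best_params_)   # the value of C that generalizes best in validation
```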
Primal Problem
Problem, given samples {(xi, di)}_{i=1}^{N}
min_{w,ξ} Φ(w, ξ) = min_{w,ξ} [ (1/2) w^T w + C Σ_{i=1}^{N} ξ_i ]
s.t. d_i (w^T x_i + w_0) ≥ 1 − ξ_i for i = 1, · · · , N
     ξ_i ≥ 0 for all i
With C a user-specified positive parameter.
104 / 124
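As a sanity check on the primal (not part of the original slides), the sketch below evaluates Φ(w, ξ) for a candidate (w, w_0), exploiting the fact that, once the hyperplane is fixed, the smallest feasible slack is ξ_i = max(0, 1 − d_i(w^T x_i + w_0)); all names and values are hypothetical:

```python
import numpy as np

def primal_objective(w, w0, X, d, C):
    """Soft-margin primal (1/2) w^T w + C sum_i xi_i, with each xi_i taken as
    the smallest slack satisfying d_i (w^T x_i + w0) >= 1 - xi_i and xi_i >= 0."""
    xi = np.maximum(0.0, 1.0 - d * (X @ w + w0))
    return 0.5 * (w @ w) + C * np.sum(xi)

# Hypothetical usage
X = np.array([[2.0, 2.0], [0.5, 0.5], [-1.0, -1.5]])
d = np.array([1.0, 1.0, -1.0])
print(primal_objective(np.array([1.0, 1.0]), -1.0, X, d, C=1.0))   # 2.0
```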
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels 105 / 124
Final Setup
Using Lagrange multipliers and the primal-dual method, it is possible to
obtain the following setup
Given the training sample {(xi, di)}_{i=1}^{N}, find the Lagrange multipliers
{αi}_{i=1}^{N} that maximize the objective function
max_α Q(α) = max_α [ Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i ]
subject to the constraints
Σ_{i=1}^{N} α_i d_i = 0 (29)
0 ≤ α_i ≤ C for i = 1, · · · , N (30)
where C is a user-specified positive parameter.
106 / 124
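A minimal sketch of solving this dual numerically (not part of the slides), assuming the cvxopt QP solver is available; X is a hypothetical N × m matrix of inputs, d the labels in {−1, +1}, and C the soft-margin parameter. The QP is written in cvxopt's minimization form by negating Q(α):

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_soft_margin_dual(X, d, C):
    """Minimize (1/2) a^T H a - 1^T a  (i.e., -Q(alpha))
    s.t. 0 <= a_i <= C and d^T a = 0, where H_ij = d_i d_j x_i^T x_j."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d, dtype=float)
    N = X.shape[0]
    Xd = d[:, None] * X
    H = Xd @ Xd.T
    P = matrix(H)
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))          # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(d.reshape(1, -1))                            # equality: d^T a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                               # the multipliers alpha_i
```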
Remarks
Something Notable
Note that neither the slack variables nor their Lagrange multipliers
appear in the dual problem.
The dual problem for the case of non-separable patterns is thus
similar to that for the simple case of linearly separable patterns.
The only big difference
Instead of using the constraint αi ≥ 0, the new problem uses the more
stringent constraint 0 ≤ αi ≤ C.
Note the following
ξi = 0 if αi < C (31)
107 / 124
Finally
The optimal solution for the weight vector w∗
w∗ = Σ_{i=1}^{N_s} α_i^∗ d_i x_i
Where Ns is the number of support vectors.
In addition
The determination of the optimum value of w_0 proceeds in a manner similar
to that described before.
The KKT conditions are as follows
α_i [d_i (w^T x_i + w_0) − 1 + ξ_i] = 0 for i = 1, 2, ..., N.
µ_i ξ_i = 0 for i = 1, 2, ..., N.
108 / 124
Where...
The µi are Lagrange multipliers
They are used to enforce the non-negativity of the slack variables ξi for all
i.
Something Notable
At the saddle point, the derivative of the Lagrangian function for the primal
problem
(1/2) w^T w + C Σ_{i=1}^{N} ξ_i − Σ_{i=1}^{N} α_i [d_i (w^T x_i + w_0) − 1 + ξ_i] − Σ_{i=1}^{N} µ_i ξ_i (32)
with respect to ξi is equal to zero.
109 / 124
Thus
We get
α_i + µ_i = C (33)
Thus, if αi < C
Then µi > 0 ⇒ ξi = 0
We may determine w0
Using any data point (xi, di) in the training set such that 0 < α_i^∗ < C.
Then, given ξi = 0,
w_0^∗ = (1/d_i) − (w^∗)^T x_i (34)
110 / 124
Nevertheless
It is better
To take the mean value of w_0^∗ over all such data points in the training
sample (Burges, 1998).
BTW Burges’ 1998 tutorial, “A Tutorial on Support Vector Machines for
Pattern Recognition”, is a great SVM reference.
111 / 124
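Continuing the hypothetical dual sketch from before (again, not part of the slides), w∗ and the averaged w_0∗ of (Eq. 34) could be recovered from the multipliers like this:

```python
import numpy as np

def recover_hyperplane(alphas, X, d, C, tol=1e-6):
    """Recover w* = sum_i alpha_i d_i x_i over the support vectors, and average
    w0* = 1/d_i - (w*)^T x_i over the points with 0 < alpha_i < C (so xi_i = 0)."""
    sv = alphas > tol                            # support vectors
    w = ((alphas[sv] * d[sv])[:, None] * X[sv]).sum(axis=0)
    margin = sv & (alphas < C - tol)             # support vectors exactly on the margin
    w0 = np.mean(1.0 / d[margin] - X[margin] @ w)
    return w, w0
```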
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels 112 / 124
Basic Idea
Something Notable
The SVM uses the scalar product ⟨xi, xj⟩ as a measure of similarity
between xi and xj, and of distance to the hyperplane.
Since the scalar product is linear, the SVM is a linear method.
But
Using a nonlinear function instead, we can make the classifier nonlinear.
113 / 124
We do this by defining the following map
Nonlinear transformations
Given a series of nonlinear transformations
{φ_i(x)}_{i=1}^{m}
from input space to the feature space.
We can define the decision surface as
Σ_{i=1}^{m} w_i φ_i(x) + w_0 = 0
114 / 124
This allows us to define
The following vector
φ(x) = (φ_0(x), φ_1(x), · · · , φ_m(x))^T
That represents the mapping.
115 / 124
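As a concrete illustration (not in the original slides), the sketch below uses the explicit degree-2 homogeneous feature map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T for x ∈ R^2 and checks numerically that φ(x)^T φ(z) = (x^T z)^2, previewing the kernel idea developed next:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 (homogeneous) polynomial feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))   # inner product computed in the feature space: 1.0
print((x @ z) ** 2)      # the same number computed directly in the input space: 1.0
```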
Outline
1 History
The Beginning
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
5 Soft Margins
Introduction
The Soft Margin Solution
6 More About Kernels
Basic Idea
From Inner products to Kernels 116 / 124
Finally
We define the decision surface as
w^T φ(x) = 0 (35)
We now seek "linear" separability of the features, so we may write
w = Σ_{i=1}^{N} α_i d_i φ(x_i) (36)
Thus, we finish with the following decision surface
Σ_{i=1}^{N} α_i d_i φ^T(x_i) φ(x) = 0 (37)
117 / 124
Thus
The term φ^T(x_i) φ(x)
It represents the inner product of two vectors induced in the feature space
by the input patterns.
We can introduce the inner-product kernel
K(x_i, x) = φ^T(x_i) φ(x) = Σ_{j=0}^{m} φ_j(x_i) φ_j(x) (38)
Property: Symmetry
K(x_i, x) = K(x, x_i) (39)
118 / 124
This allows us to redefine the optimal hyperplane
We get
Σ_{i=1}^{N} α_i d_i K(x_i, x) = 0 (40)
Something Notable
Using kernels, we can avoid going from:
Input Space =⇒ Mapping Space =⇒ Inner Product (41)
By directly going from
Input Space =⇒ Inner Product (42)
119 / 124
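A minimal sketch (not from the slides) of evaluating the kernelized decision surface in (Eq. 40). The RBF kernel used here is just one common choice of inner-product kernel, the multipliers, labels and training points are made-up values, and a bias term w0 is kept for generality even though (Eq. 40) omits it:

```python
import numpy as np

def rbf_kernel(xi, x, gamma=0.5):
    """One common inner-product kernel: K(xi, x) = exp(-gamma ||xi - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def decision_function(x, alphas, d, X_train, w0=0.0, kernel=rbf_kernel):
    """Evaluate sum_i alpha_i d_i K(x_i, x) + w0 and classify by its sign."""
    return sum(a * di * kernel(xi, x)
               for a, di, xi in zip(alphas, d, X_train)) + w0

# Hypothetical multipliers, labels and training points
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
d = np.array([-1.0, 1.0])
alphas = np.array([0.7, 0.7])
print(np.sign(decision_function(np.array([0.9, 1.1]), alphas, d, X_train)))   # 1.0
```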
Important
Something Notable
The expansion of (Eq. 38) for the inner-product kernel K(xi, x) is an
important special case of Mercer’s theorem, which arises in functional analysis.
120 / 124
Mercer’s Theorem
Mercer’s Theorem
Let K(x, x′) be a continuous symmetric kernel that is defined in the
closed interval a ≤ x ≤ b, and likewise for x′. The kernel K(x, x′) can be
expanded in the series
K(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′) (43)
With
Positive coefficients, λi > 0 for all i.
121 / 124
Mercer’s Theorem
For this expansion to be valid and for it to converge absolutely and
uniformly
It is necessary and sufficient that the condition
∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0 (44)
holds for all ψ such that ∫_a^b ψ^2(x) dx < ∞ (an example of a quadratic
norm for functions).
122 / 124
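The finite-sample analogue of condition (44) is that the Gram matrix K_ij = K(x_i, x_j) must be positive semidefinite for any finite set of points; the sketch below (not from the slides) checks this numerically for an RBF kernel on hypothetical data:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gram matrix K_ij = exp(-gamma ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared pairwise distances
    return np.exp(-gamma * D)

X = np.random.default_rng(1).normal(size=(20, 3))     # hypothetical input points
K = rbf_gram(X)
eigvals = np.linalg.eigvalsh(K)                       # K is symmetric
print(eigvals.min() >= -1e-10)                        # True: numerically positive semidefinite
```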
Remarks
First
The functions φi(x) are called eigenfunctions of the expansion and the
numbers λi are called eigenvalues.
Second
The fact that all of the eigenvalues are positive means that the kernel
K(x, x′) is positive definite.
123 / 124
Not only that
We have that
For λi = 1, the ith image √λ_i φ_i(x) induced in the feature space by the
input vector x is an eigenfunction of the expansion.
In theory
The dimensionality of the feature space (i.e., the number of eigenvalues/
eigenfunctions) can be infinitely large.
124 / 124
Not only that
We have that
For λi = 1, the ith image of
√
λiφi (x) induced in the feature space by the
input vector x is an eigenfunction of the expansion.
In theory
The dimensionality of the feature space (i.e., the number of eigenvalues/
eigenfunctions) can be infinitely large.
124 / 124

More Related Content

PDF
An introduction to support vector machines
PDF
Open science 2014
PPTX
Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012
PDF
Support Vector Machines for Classification
PPT
Artificial Intelligence: Data Mining
PDF
Agile for Embedded & System Software Development : Presented by Priyank KS
PPT
3.7 heap sort
PDF
Interfaces to ubiquitous computing
An introduction to support vector machines
Open science 2014
Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012
Support Vector Machines for Classification
Artificial Intelligence: Data Mining
Agile for Embedded & System Software Development : Presented by Priyank KS
3.7 heap sort
Interfaces to ubiquitous computing

Viewers also liked (20)

PDF
iBeacons: Security and Privacy?
PPTX
Demystifying dependency Injection: Dagger and Toothpick
PDF
Dependency Injection with Apex
PDF
Agile London: Industrial Agility, How to respond to the 4th Industrial Revolu...
PPTX
Agile Methodology PPT
PDF
HUG Ireland Event Presentation - In-Memory Databases
PPT
Fp growth algorithm
PDF
Privacy Concerns and Social Robots
PDF
Skiena algorithm 2007 lecture07 heapsort priority queues
PDF
ScrumGuides training: Agile Software Development With Scrum
PDF
Design & Analysis of Algorithms Lecture Notes
PPTX
Going native with less coupling: Dependency Injection in C++
PDF
Final Year Project-Gesture Based Interaction and Image Processing
PPTX
In-Memory Database Performance on AWS M4 Instances
PDF
Machine learning support vector machines
PPT
ARTIFICIAL INTELLIGENCE & NEURAL NETWORKS
PDF
Sap technical deep dive in a column oriented in memory database
PDF
Dependency injection in scala
PDF
RFID Privacy & Security Issues
PPSX
Data Structure (Heap Sort)
iBeacons: Security and Privacy?
Demystifying dependency Injection: Dagger and Toothpick
Dependency Injection with Apex
Agile London: Industrial Agility, How to respond to the 4th Industrial Revolu...
Agile Methodology PPT
HUG Ireland Event Presentation - In-Memory Databases
Fp growth algorithm
Privacy Concerns and Social Robots
Skiena algorithm 2007 lecture07 heapsort priority queues
ScrumGuides training: Agile Software Development With Scrum
Design & Analysis of Algorithms Lecture Notes
Going native with less coupling: Dependency Injection in C++
Final Year Project-Gesture Based Interaction and Image Processing
In-Memory Database Performance on AWS M4 Instances
Machine learning support vector machines
ARTIFICIAL INTELLIGENCE & NEURAL NETWORKS
Sap technical deep dive in a column oriented in memory database
Dependency injection in scala
RFID Privacy & Security Issues
Data Structure (Heap Sort)
Ad

Similar to 09 Machine Learning - Introduction Support Vector Machines (20)

PDF
Deterministic Chaos In Onedimensional Continuous Systems Jan Awrejcewicz
PDF
Logic Colloquium 2007 1st Edition Franoise Delon Ulrich Kohlenbach
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PPT
Supporting the exploding dimensions of the chemical sciences via global netwo...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Causal Competences of Many Kinds
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Recent Developments In Lie Algebras Groups And Representation Theory Kailash ...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PDF
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
PPTX
Knowledge base system appl. p 3,4
PDF
A Study on Evolving Self-organizing Cellular Automata based on Neural Network...
PDF
Nonlinear Cosmic Ray Diffusion Theories Andreas Shalchi
PPTX
Assessing, Creating and Using Knowledge Graph Restrictions
PPT
Knowledge – dynamics – landscape - navigation – what have interfaces to digit...
PDF
Contemporary Mathematics Dynamical Systems And Random Processes 1st Edition J...
PDF
Does the neocortex use grid cell-like mechanisms to learn the structure of ob...
PDF
Spaceefficient Data Structures Streams And Algorithms Papers In Honor Of J Ia...
PDF
Symmetry In Particle Physics Michal Hnati Jaroslav Antos Juha Honkonen
Deterministic Chaos In Onedimensional Continuous Systems Jan Awrejcewicz
Logic Colloquium 2007 1st Edition Franoise Delon Ulrich Kohlenbach
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Supporting the exploding dimensions of the chemical sciences via global netwo...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Causal Competences of Many Kinds
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Recent Developments In Lie Algebras Groups And Representation Theory Kailash ...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Handbook of Differential Equations Evolutionary Equations Volume 4 1st Editio...
Knowledge base system appl. p 3,4
A Study on Evolving Self-organizing Cellular Automata based on Neural Network...
Nonlinear Cosmic Ray Diffusion Theories Andreas Shalchi
Assessing, Creating and Using Knowledge Graph Restrictions
Knowledge – dynamics – landscape - navigation – what have interfaces to digit...
Contemporary Mathematics Dynamical Systems And Random Processes 1st Edition J...
Does the neocortex use grid cell-like mechanisms to learn the structure of ob...
Spaceefficient Data Structures Streams And Algorithms Papers In Honor Of J Ia...
Symmetry In Particle Physics Michal Hnati Jaroslav Antos Juha Honkonen
Ad

More from Andres Mendez-Vazquez (20)

PDF
2.03 bayesian estimation
PDF
05 linear transformations
PDF
01.04 orthonormal basis_eigen_vectors
PDF
01.03 squared matrices_and_other_issues
PDF
01.02 linear equations
PDF
01.01 vector spaces
PDF
06 recurrent neural_networks
PDF
05 backpropagation automatic_differentiation
PDF
Zetta global
PDF
01 Introduction to Neural Networks and Deep Learning
PDF
25 introduction reinforcement_learning
PDF
Neural Networks and Deep Learning Syllabus
PDF
Introduction to artificial_intelligence_syllabus
PDF
Ideas 09 22_2018
PDF
Ideas about a Bachelor in Machine Learning/Data Sciences
PDF
Analysis of Algorithms Syllabus
PDF
20 k-means, k-center, k-meoids and variations
PDF
18.1 combining models
PDF
17 vapnik chervonenkis dimension
PDF
A basic introduction to learning
2.03 bayesian estimation
05 linear transformations
01.04 orthonormal basis_eigen_vectors
01.03 squared matrices_and_other_issues
01.02 linear equations
01.01 vector spaces
06 recurrent neural_networks
05 backpropagation automatic_differentiation
Zetta global
01 Introduction to Neural Networks and Deep Learning
25 introduction reinforcement_learning
Neural Networks and Deep Learning Syllabus
Introduction to artificial_intelligence_syllabus
Ideas 09 22_2018
Ideas about a Bachelor in Machine Learning/Data Sciences
Analysis of Algorithms Syllabus
20 k-means, k-center, k-meoids and variations
18.1 combining models
17 vapnik chervonenkis dimension
A basic introduction to learning

Recently uploaded (20)

PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
Artificial Intelligence
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Sustainable Sites - Green Building Construction
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPT
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
CYBER-CRIMES AND SECURITY A guide to understanding
bas. eng. economics group 4 presentation 1.pptx
Artificial Intelligence
Foundation to blockchain - A guide to Blockchain Tech
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Sustainable Sites - Green Building Construction
UNIT-1 - COAL BASED THERMAL POWER PLANTS
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Operating System & Kernel Study Guide-1 - converted.pdf
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Automation-in-Manufacturing-Chapter-Introduction.pdf
Lecture Notes Electrical Wiring System Components
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
Model Code of Practice - Construction Work - 21102022 .pdf

09 Machine Learning - Introduction Support Vector Machines

  • 1. Machine Learning for Data Mining Introduction to Support Vector Machines Andres Mendez-Vazquez June 22, 2016 1 / 124
  • 2. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 2 / 124
  • 3. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 3 / 124
  • 4. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 5. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 6. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 7. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 8. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 9. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 10. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 4 / 124
  • 11. In addition Alexey Yakovlevich Chervonenkis He was a Soviet and Russian mathematician, and, with Vladimir Vapnik, was one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning" an important part of computational learning theory. He died in September 22nd, 2014 At Losiny Ostrov National Park on 22 September 2014. 5 / 124
  • 12. In addition Alexey Yakovlevich Chervonenkis He was a Soviet and Russian mathematician, and, with Vladimir Vapnik, was one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning" an important part of computational learning theory. He died in September 22nd, 2014 At Losiny Ostrov National Park on 22 September 2014. 5 / 124
  • 13. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 14. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 15. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 16. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 17. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 18. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 19. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 20. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 21. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection In the recognition if two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 AND counting.... 6 / 124
  • 22. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 7 / 124
  • 23. Separable Classes Given xi, i = 1, · · · , N A set of samples belonging to two classes ω1, ω2. Objective We want to obtain a decision function as simple as g (x) = wT x + w0 8 / 124
  • 24. Separable Classes Given xi, i = 1, · · · , N A set of samples belonging to two classes ω1, ω2. Objective We want to obtain a decision function as simple as g (x) = wT x + w0 8 / 124
  • 25. Such that we can do the following A linear separation function g (x) = wt x + w0 9 / 124
  • 26. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 10 / 124
  • 27. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 28. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 29. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 30. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xj + w0 ≤ 0 for dj = −1 if xj ∈ C2 11 / 124
  • 31. What do we want? Our goal is to search for a direction w that gives the maximum possible margin direction 2 MARGINSdirection 1 12 / 124
  • 32. Remember We have the following d Projection r distance 0 13 / 124
  • 33. A Little of Geometry Thus r d A B C Then d = |w0| w2 1 + w2 2 , r = |g (x)| w2 1 + w2 2 (1) 14 / 124
  • 34. A Little of Geometry Thus r d A B C Then d = |w0| w2 1 + w2 2 , r = |g (x)| w2 1 + w2 2 (1) 14 / 124
  • 35. First d = |w0|√ w2 1+w2 2 We can use the following rule in a triangle with a 90o angle Area = 1 2 Cd (2) In addition, the area can be calculated also as Area = 1 2 AB (3) Thus d = AB C Remark: Can you get the rest of values? 15 / 124
  • 36. First d = |w0|√ w2 1+w2 2 We can use the following rule in a triangle with a 90o angle Area = 1 2 Cd (2) In addition, the area can be calculated also as Area = 1 2 AB (3) Thus d = AB C Remark: Can you get the rest of values? 15 / 124
  • 37. First d = |w0|√ w2 1+w2 2 We can use the following rule in a triangle with a 90o angle Area = 1 2 Cd (2) In addition, the area can be calculated also as Area = 1 2 AB (3) Thus d = AB C Remark: Can you get the rest of values? 15 / 124
  • 38. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 39. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 40. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 41. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 42. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 43. What about r = |g(x)|√ w2 1+w2 2 ? First, remember g (xp) = 0 and x = xp + r w w (4) Thus, we have g (x) =wT xp + r w w + w0 =wT xp + w0 + r wT w w =wT xp + w0 + r w 2 w =g (xp) + r w Then r = g(x) ||w|| 16 / 124
  • 44. This has the following interpretation The distance from the projection 0 17 / 124
  • 45. Now We know that the straight line that we are looking for looks like wT x + w0 = 0 (5) What about something like this wT x + w0 = δ (6) Clearly This will be above or below the initial line wT x + w0 = 0. 18 / 124
  • 46. Now We know that the straight line that we are looking for looks like wT x + w0 = 0 (5) What about something like this wT x + w0 = δ (6) Clearly This will be above or below the initial line wT x + w0 = 0. 18 / 124
  • 47. Now We know that the straight line that we are looking for looks like wT x + w0 = 0 (5) What about something like this wT x + w0 = δ (6) Clearly This will be above or below the initial line wT x + w0 = 0. 18 / 124
  • 48. Come back to the hyperplanes We have then for each border support line an specific bias!!! Support Vectors 19 / 124
  • 49. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 50. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 51. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 52. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 53. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 54. Then, normalize by δ The new margin functions w T x + w10 = 1 w T x + w01 = −1 where w = w δ , w10 = w0 δ ,and w01 = w0 δ Now, we come back to the middle separator hyperplane, but with the normalized term wT xi + w0 ≥ w T x + w10 for di = +1 wT xi + w0 ≤ w T x + w01 for di = −1 Where w0 is the bias of that central hyperplane!! And the w is the normalized direction of w 20 / 124
  • 55. Come back to the hyperplanes The meaning of what I am saying!!! 21 / 124
  • 56. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 22 / 124
  • 57. A little about Support Vectors They are the vectors (Here, we assume that w) xi such that wT xi + w0 = 1 or wT xi + w0 = −1 Properties The vectors nearest to the decision surface and the most difficult to classify. Because of that, we have the name “Support Vector Machines”. 23 / 124
  • 58. A little about Support Vectors They are the vectors (Here, we assume that w) xi such that wT xi + w0 = 1 or wT xi + w0 = −1 Properties The vectors nearest to the decision surface and the most difficult to classify. Because of that, we have the name “Support Vector Machines”. 23 / 124
  • 59. A little about Support Vectors They are the vectors (Here, we assume that w) xi such that wT xi + w0 = 1 or wT xi + w0 = −1 Properties The vectors nearest to the decision surface and the most difficult to classify. Because of that, we have the name “Support Vector Machines”. 23 / 124
  • 60. Now, we can resume the decision rule for the hyperplane For the support vectors g (xi) = wT xi + w0 = −(+)1 for di = −(+)1 (7) Implies The distance to the support vectors is: r = g (xi) ||w|| =    1 ||w|| if di = +1 − 1 ||w|| if di = −1 24 / 124
  • 61. Now, we can resume the decision rule for the hyperplane For the support vectors g (xi) = wT xi + w0 = −(+)1 for di = −(+)1 (7) Implies The distance to the support vectors is: r = g (xi) ||w|| =    1 ||w|| if di = +1 − 1 ||w|| if di = −1 24 / 124
  • 62. Therefore ... We want the optimum value of the margin of separation as ρ = 1 ||w|| + 1 ||w|| = 2 ||w|| (8) And the support vectors define the value of ρ 25 / 124
  • 63. Therefore ... We want the optimum value of the margin of separation as ρ = 1 ||w|| + 1 ||w|| = 2 ||w|| (8) And the support vectors define the value of ρ Support Vectors 25 / 124
  • 64. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 26 / 124
  • 65. Quadratic Optimization Then, we have the samples with labels T = {(xi, di)}N i=1 Then we can put the decision rule as di wT xi + w0 ≥ 1 i = 1, · · · , N 27 / 124
  • 66. Quadratic Optimization Then, we have the samples with labels T = {(xi, di)}N i=1 Then we can put the decision rule as di wT xi + w0 ≥ 1 i = 1, · · · , N 27 / 124
  • 67. Then, we have the optimization problem The optimization problem minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N Observations The cost functions Φ (w) is convex. The constrains are linear with respect to w. 28 / 124
  • 68. Then, we have the optimization problem The optimization problem minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N Observations The cost functions Φ (w) is convex. The constrains are linear with respect to w. 28 / 124
  • 69. Then, we have the optimization problem The optimization problem minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N Observations The cost functions Φ (w) is convex. The constrains are linear with respect to w. 28 / 124
  • 70. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 29 / 124
  • 71. Lagrange Multipliers The method of Lagrange multipliers Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems. This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers. 30 / 124
  • 72. Lagrange Multipliers The method of Lagrange multipliers Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems. This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers. 30 / 124
  • 73. Lagrange Multipliers The classical problem formulation min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 It can be converted into min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} (9) where L(x, λ) is the Lagrangian function. λ is an unspecified positive or negative constant called the Lagrange Multiplier. 31 / 124
  • 74. Lagrange Multipliers The classical problem formulation min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 It can be converted into min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} (9) where L(x, λ) is the Lagrangian function. λ is an unspecified positive or negative constant called the Lagrange Multiplier. 31 / 124
  • 75. Lagrange Multipliers The classical problem formulation min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 It can be converted into min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} (9) where L(x, λ) is the Lagrangian function. λ is an unspecified positive or negative constant called the Lagrange Multiplier. 31 / 124
  • 76. Finding an Optimum using Lagrange Multipliers New problem min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} We want a λ = λ∗ optimal If the minimum of L (x1, x2, ..., xn, λ∗ ) occurs at (x1, x2, ..., xn)T = (x1, x2, ..., xn)T∗ and (x1, x2, ..., xn) T ∗ satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn) T ∗ minimizes: min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 Trick It is to find appropriate value for Lagrangian multiplier λ. 32 / 124
  • 77. Finding an Optimum using Lagrange Multipliers New problem min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} We want a λ = λ∗ optimal If the minimum of L (x1, x2, ..., xn, λ∗ ) occurs at (x1, x2, ..., xn)T = (x1, x2, ..., xn)T∗ and (x1, x2, ..., xn) T ∗ satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn) T ∗ minimizes: min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 Trick It is to find appropriate value for Lagrangian multiplier λ. 32 / 124
  • 78. Finding an Optimum using Lagrange Multipliers New problem min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} We want a λ = λ∗ optimal If the minimum of L (x1, x2, ..., xn, λ∗ ) occurs at (x1, x2, ..., xn)T = (x1, x2, ..., xn)T∗ and (x1, x2, ..., xn) T ∗ satisfies h1 (x1, x2, ..., xn) = 0, then (x1, x2, ..., xn) T ∗ minimizes: min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 Trick It is to find appropriate value for Lagrangian multiplier λ. 32 / 124
  • 79. Remember Think about this Remember First Law of Newton!!! Yes!!! 33 / 124
  • 80. Remember Think about this Remember First Law of Newton!!! Yes!!! A system in equilibrium does not move Static Body 33 / 124
  • 81. Lagrange Multipliers Definition Gives a set of necessary conditions to identify optimal points of equality constrained optimization problem 34 / 124
  • 82. Lagrange was a Physicists He was thinking in the following formula A system in equilibrium has the following equation: F1 + F2 + ... + FK = 0 (10) But functions do not have forces? Are you sure? Think about the following The Gradient of a surface. 35 / 124
  • 83. Lagrange was a Physicists He was thinking in the following formula A system in equilibrium has the following equation: F1 + F2 + ... + FK = 0 (10) But functions do not have forces? Are you sure? Think about the following The Gradient of a surface. 35 / 124
  • 84. Lagrange was a Physicists He was thinking in the following formula A system in equilibrium has the following equation: F1 + F2 + ... + FK = 0 (10) But functions do not have forces? Are you sure? Think about the following The Gradient of a surface. 35 / 124
  • 85. Gradient to a Surface After all a gradient is a measure of the maximal change For example the gradient of a function of three variables: f (x) = i ∂f (x) ∂x + j ∂f (x) ∂y + k ∂f (x) ∂z (11) where i, j and k are unitary vectors in the directions x, y and z. 36 / 124
  • 86. Example We have f (x, y) = x exp {−x2 − y2 } 37 / 124
  • 87. Example With Gradient at the the contours when projecting in the 2D plane 38 / 124
  • 88. Now, Think about this Yes, we can use the gradient However, we need to do some scaling of the forces by using parameters λ Thus, we have F0 + λ1F1 + ... + λKFK = 0 (12) where F0 is the gradient of the principal cost function and Fi for i = 1, 2, .., K. 39 / 124
  • 89. Now, Think about this Yes, we can use the gradient However, we need to do some scaling of the forces by using parameters λ Thus, we have F0 + λ1F1 + ... + λKFK = 0 (12) where F0 is the gradient of the principal cost function and Fi for i = 1, 2, .., K. 39 / 124
  • 90. Thus If we have the following optimization: min f (x) s.tg1 (x) = 0 g2 (x) = 0 40 / 124
  • 91. Geometric interpretation in the case of minimization What is wrong? Gradients are going in the other direction, we can fix by simple multiplying by -1 Here the cost function is f (x, y) = x exp −x2 − y2 we want to minimize f (−→x ) g1 (−→x ) g2 (−→x ) −∇f (−→x ) + λ1∇g1 (−→x ) + λ2∇g2 (−→x ) = 0 Nevertheless: it is equivalent to f −→x − λ1 g1 −→x − λ2 g2 −→x = 0 40 / 124
  • 92. Geometric interpretation in the case of minimization What is wrong? Gradients are going in the other direction, we can fix by simple multiplying by -1 Here the cost function is f (x, y) = x exp −x2 − y2 we want to minimize f (−→x ) g1 (−→x ) g2 (−→x ) −∇f (−→x ) + λ1∇g1 (−→x ) + λ2∇g2 (−→x ) = 0 Nevertheless: it is equivalent to f −→x − λ1 g1 −→x − λ2 g2 −→x = 0 40 / 124
  • 93. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 41 / 124
  • 94. Method Steps 1 Original problem is rewritten as: 1 minimize L (x, λ) = f (x) − λh1 (x) 2 Take derivatives of L (x, λ) with respect to xi and set them equal to zero. 3 Express all xi in terms of Lagrangian multiplier λ. 4 Plug x in terms of λ in constraint h1 (x) = 0 and solve λ. 5 Calculate x by using the just found value for λ. From the step 2 If there are n variables (i.e., x1, · · · , xn) then you will get n equations with n + 1 unknowns (i.e., n variables xi and one Lagrangian multiplier λ). 42 / 124
  • 95. Method Steps 1 Original problem is rewritten as: 1 minimize L (x, λ) = f (x) − λh1 (x) 2 Take derivatives of L (x, λ) with respect to xi and set them equal to zero. 3 Express all xi in terms of Lagrangian multiplier λ. 4 Plug x in terms of λ in constraint h1 (x) = 0 and solve λ. 5 Calculate x by using the just found value for λ. From the step 2 If there are n variables (i.e., x1, · · · , xn) then you will get n equations with n + 1 unknowns (i.e., n variables xi and one Lagrangian multiplier λ). 42 / 124
  • 96. Example We can apply that to the following problem min f (x, y) = x2 − 8x + y2 − 12y + 48 s.t x + y = 8 43 / 124
  • 97. Then, Rewriting The Optimization Problem The optimization with equality constraints minwΦ (w) = 1 2wT w s.t. di(wT xi + w0) ≥ 1 i = 1, · · · , N 44 / 124
  • 98. Then, for our problem Using the Lagrange Multipliers (We will call them αi) We obtain the following cost function that we want to minimize J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1] Observation Minimize with respect to w and w_0. Maximize with respect to α because it dominates − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1]. (13) 45 / 124
  • 102. Saddle Point? At the left the original problem, at the right the Lagrangian!!! (figure: contours of f(x) with the constraints g_1(x) and g_2(x)) 46 / 124
  • 103. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 47 / 124
  • 104. Karush-Kuhn-Tucker Conditions First An Equality Constrained Problem P min f(x) s.t. g_1(x) = 0, ..., g_N(x) = 0 A really minimal version!!! Hey, it is a patch work!!! A point x is a local minimum of an equality constrained problem P only if a set of non-negative αj’s may be found such that: ∇L(x, α) = ∇f(x) − Σ_{i=1}^N α_i ∇g_i(x) = 0 48 / 124
  • 106. Karush-Kuhn-Tucker Conditions Important Think about this: each constraint corresponds to a sample in both classes, thus The corresponding αi’s are going to be zero after optimization if a constraint is not active, i.e. d_i(w^T x_i + w_0) − 1 > 0 (Remember Maximization). Again the Support Vectors This actually defines the idea of support vectors!!! Thus Only the αi’s with active constraints (Support Vectors) will be different from zero, i.e. when d_i(w^T x_i + w_0) − 1 = 0. 49 / 124
  • 109. A small deviation from the SVM’s for the sake of Vox Populi Theorem (Karush-Kuhn-Tucker Necessary Conditions) Let X be a non-empty open set in R^n, and let f : R^n → R and g_i : R^n → R for i = 1, ..., m. Consider the problem P of minimizing f(x) subject to x ∈ X and g_i(x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | g_i(x) = 0}. Suppose that f and g_i for i ∈ I are differentiable at x and that the g_i for i ∉ I are continuous at x. Furthermore, suppose that the gradients ∇g_i(x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars u_i for i ∈ I such that ∇f(x) + Σ_{i∈I} u_i ∇g_i(x) = 0, u_i ≥ 0 for i ∈ I 50 / 124
  • 110. It is more... In addition to the above assumptions If g_i for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form: ∇f(x) + Σ_{i=1}^m u_i ∇g_i(x) = 0, u_i g_i(x) = 0 for i = 1, ..., m, u_i ≥ 0 for i = 1, ..., m 51 / 124
  • 111. The necessary conditions for optimality We use the previous theorem on J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1] (14) Condition 1 ∂J(w, w_0, α)/∂w = 0 Condition 2 ∂J(w, w_0, α)/∂w_0 = 0 52 / 124
  • 114. Using the conditions We have the first condition ∂J(w, w_0, α)/∂w = ∂[(1/2) w^T w]/∂w − ∂[Σ_{i=1}^N α_i (d_i(w^T x_i + w_0) − 1)]/∂w = 0, that is ∂J(w, w_0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^N α_i d_i x_i = 0 Thus w = Σ_{i=1}^N α_i d_i x_i (15) 53 / 124
  • 117. In a similar way ... We have by the second optimality condition Σ_{i=1}^N α_i d_i = 0 Note α_i [d_i(w^T x_i + w_0) − 1] = 0 Because the constraint vanishes at the optimal solution, i.e. α_i = 0 or d_i(w^T x_i + w_0) − 1 = 0. 54 / 124
  • 119. Thus We need something extra Our classic trick of transforming a problem into another problem In this case We use the Primal-Dual Problem for Lagrangian Where We move from a minimization to a maximization!!! 55 / 124
  • 122. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 56 / 124
  • 123. Lagrangian Dual Problem Consider the following nonlinear programming problem Primal Problem P min f(x) s.t. g_i(x) ≤ 0 for i = 1, ..., m, h_i(x) = 0 for i = 1, ..., l, x ∈ X Lagrange Dual Problem D max Θ(u, v) s.t. u ≥ 0, where Θ(u, v) = inf_x { f(x) + Σ_{i=1}^m u_i g_i(x) + Σ_{i=1}^l v_i h_i(x) | x ∈ X } 57 / 124
  • 125. What does this mean? Assume that the equality constraints do not exist We have then min f(x) s.t. g_i(x) ≤ 0 for i = 1, ..., m, x ∈ X Now assume that we are left with only one constraint We have then min f(x) s.t. g(x) ≤ 0, x ∈ X 58 / 124
  • 127. What does this mean? First, we have the following figure (the set G in the (y, z) plane, together with its supporting lines and their slopes) 59 / 124
  • 128. What does this mean? Thus in the y − z plane you have G = {(y, z) | y = g(x), z = f(x) for some x ∈ X} (16) Thus Given u ≥ 0, we need to minimize f(x) + u g(x) to find θ(u) - Equivalent to ∇f(x) + u ∇g(x) = 0 60 / 124
  • 130. What does this mean? Thus in the y − z plane, we have z + uy = α (17), a line with slope −u. Then, to minimize z + uy = α We need to move the line z + uy = α parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G. 61 / 124
  • 132. In other words Move the line parallel to itself until it supports G (figure) Note The set G lies above the line and touches it. 62 / 124
  • 133. Thus Then, the problem is to find the slope of the supporting hyperplane for G. Its intersection with the z-axis Gives θ(u) 63 / 124
  • 135. Again We can see θ(u) in the figure as the intercept of the supporting line on the z-axis 64 / 124
  • 136. Thus The dual problem is equivalent to Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal 65 / 124
  • 137. Or Such a hyperplane has slope −u and supports G at (y, z) (figure) Remark: The optimal solution is u and the optimal dual objective is z. 66 / 124
  • 139. For more on this Please!!! Look at this book From “Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa, and C. M. Shetty. Wiley, New York, (2006) At Page 260. 67 / 124
  • 140. Example (Lagrange Dual) Primal min x_1² + x_2² s.t. −x_1 − x_2 + 4 ≤ 0, x_1, x_2 ≥ 0 Lagrange Dual Θ(u) = inf { x_1² + x_2² + u(−x_1 − x_2 + 4) | x_1, x_2 ≥ 0 } 68 / 124
  • 142. Solution Derive with respect to x_1 and x_2 We have two cases to take into account: u ≥ 0 and u < 0 The first case is clear (the infimum is attained at x_1 = x_2 = u/2) What about when u < 0? (the infimum is attained at x_1 = x_2 = 0) We have that θ(u) = −(1/2)u² + 4u if u ≥ 0, and θ(u) = 4u if u < 0 (18) 69 / 124
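As a numerical sanity check (a sketch on assumed tooling, not part of the slides), maximizing θ(u) confirms strong duality for this example: the dual optimum u* = 4 with θ(u*) = 8 matches the primal optimum x_1 = x_2 = 2 with objective value 8.

```python
# Sketch: evaluate and maximize the dual function theta(u) of the example above.
import numpy as np

def theta(u):
    # theta(u) = inf_{x >= 0} x1^2 + x2^2 + u * (-x1 - x2 + 4)
    # For u >= 0 the infimum is at x1 = x2 = u/2; for u < 0 it is at x1 = x2 = 0.
    return -0.5 * u**2 + 4.0 * u if u >= 0 else 4.0 * u

us = np.linspace(-2.0, 8.0, 1001)
vals = np.array([theta(u) for u in us])

print(us[vals.argmax()], vals.max())   # approximately u* = 4, theta(u*) = 8
# Matches the primal optimum x1 = x2 = 2 with objective value 8 (strong duality).
```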
  • 145. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 70 / 124
  • 146. Duality Theorem First Property If the Primal has an optimal solution, the dual does too. Thus In order for w* and α* to be optimal solutions for the primal and dual problem respectively, It is necessary and sufficient that w*: It is feasible for the primal problem and Φ(w*) = J(w*, w_0*, α*) = min_w J(w, w_0, α*) 71 / 124
  • 148. Reformulate our Equations We have then J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i − w_0 Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i Now, using our 2nd optimality condition, J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i + Σ_{i=1}^N α_i 72 / 124
  • 150. We have finally for the 1st Optimality Condition: First w^T w = Σ_{i=1}^N α_i d_i w^T x_i = Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i Second, setting J(w, w_0, α) = Q(α), Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i 73 / 124
  • 152. From here, we have the problem This is the problem that we really solve Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i subject to the constraints Σ_{i=1}^N α_i d_i = 0 (19) and α_i ≥ 0 for i = 1, · · · , N (20) Note In the Primal, we were minimizing the cost function over w and w_0, which requires maximizing over α. That is the reason why we are maximizing Q(α). 74 / 124
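This dual is a quadratic program, so a generic QP solver is enough for the separable case. Below is a minimal sketch that assumes the cvxopt package (the slides do not prescribe any solver); it also recovers w via (Eq. 15) and the bias from the support vectors, anticipating the next slide.

```python
# Sketch of the hard-margin dual QP (assumed tooling: cvxopt; any QP solver works):
#   minimize (1/2) a^T P a - 1^T a   with P_ij = d_i d_j x_i^T x_j
#   s.t.     -a_i <= 0  and  d^T a = 0
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, d):
    """X: (N, l) array of samples, d: (N,) array of labels in {-1, +1}."""
    N = X.shape[0]
    K = X @ X.T                                    # Gram matrix x_i^T x_j
    P = matrix(np.outer(d, d) * K)
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))                         # encodes alpha_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(d.reshape(1, -1).astype(float))     # encodes sum_i alpha_i d_i = 0
    b = matrix(0.0)

    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    sv = alpha > 1e-6                              # support vectors: alpha_i > 0
    w = (alpha[sv] * d[sv]) @ X[sv]                # Eq. 15: w = sum alpha_i d_i x_i
    w0 = np.mean(d[sv] - X[sv] @ w)                # from d_s (w^T x_s + w0) = 1
    return w, w0, alpha
```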
  • 154. Solving for α We can compute w* once we get the optimal α_i* by using (Eq. 15) w* = Σ_{i=1}^N α_i* d_i x_i In addition, we can compute the optimal bias w_0* using the optimal weight w* For this, we use the positive margin equation: g(x^(s)) = w^T x^(s) + w_0 = 1, corresponding to a positive support vector. Then w_0 = 1 − (w*)^T x^(s) for d^(s) = 1 (21) 75 / 124
  • 157. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 76 / 124
  • 158. What do we need? Until now, we have only a maximal margin algorithm All this works fine when the classes are separable Problem: what happens when they are not separable? What can we do? 77 / 124
  • 161. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 78 / 124
  • 162. Map to a higher Dimensional Space Assume that there exists a mapping x ∈ R^l → y ∈ R^k Then, it is possible to define the following mapping 79 / 124
  • 164. Define a map to a higher Dimension Nonlinear transformations Given a series of nonlinear transformations {φ_i(x)}_{i=1}^m from the input space to the feature space, We can define the decision surface as Σ_{i=1}^m w_i φ_i(x) + w_0 = 0 80 / 124
  • 166. This allows us to define The following vector φ(x) = (φ_0(x), φ_1(x), · · · , φ_m(x))^T that represents the mapping. From this mapping We can define the following kernel function K : X × X → R, K(x_i, x_j) = φ(x_i)^T φ(x_j) 81 / 124
  • 168. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 82 / 124
  • 169. Example Assume x ∈ R² → y = (x_1², √2 x_1 x_2, x_2²)^T We can show that y_i^T y_j = (x_i^T x_j)² 83 / 124
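A quick numerical check of this identity (a sketch assuming NumPy; not part of the slides):

```python
# Sketch: verify that phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) satisfies
# phi(x_i)^T phi(x_j) = (x_i^T x_j)^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

print(phi(xi) @ phi(xj))    # feature-space inner product
print((xi @ xj) ** 2)       # squared input-space inner product: the same value
```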
  • 171. Example of Kernels Polynomials k(x, z) = (x^T z + 1)^q, q > 0 Radial Basis Functions k(x, z) = exp(−||x − z||² / σ²) Hyperbolic Tangents k(x, z) = tanh(β x^T z + γ) 84 / 124
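For concreteness, here is a minimal sketch of these three kernels as plain functions (an illustration with assumed parameter defaults, not code from the slides):

```python
# Sketch: the three kernels listed above.
import numpy as np

def polynomial_kernel(x, z, q=2):
    return (x @ z + 1.0) ** q                     # (x^T z + 1)^q, q > 0

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def tanh_kernel(x, z, beta=1.0, gamma=-1.0):
    return np.tanh(beta * (x @ z) + gamma)        # a valid kernel only for some (beta, gamma)
```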
  • 174. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 85 / 124
  • 175. Now, How to select a Kernel? We have a problem Selecting a specific kernel and its parameters is usually done in a trial-and-error manner. Thus In general, the Radial Basis Functions kernel is a reasonable first choice. Then, if this fails, we can try the other possible kernels. 86 / 124
  • 178. Thus, we have something like this Step 1 Normalize the data. Step 2 Use cross-validation to adjust the parameters of the selected kernel. Step 3 Train against the entire dataset. 87 / 124
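The three steps map naturally onto a standard cross-validation pipeline. The sketch below assumes scikit-learn, a synthetic dataset, and a small parameter grid purely for illustration; the slides do not prescribe any library, grid, or dataset.

```python
# Sketch (assumed tooling): normalize, cross-validate the RBF kernel parameters,
# then train on the entire dataset.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),      # Step 1: normalize the data
                 ("svm", SVC(kernel="rbf"))])
grid = {"svm__C": [0.1, 1, 10, 100],               # Step 2: cross-validate parameters
        "svm__gamma": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)                                   # Step 3: the best model is refit on all data
print(search.best_params_, search.best_score_)
```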
  • 181. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 88 / 124
  • 182. Optimal Hyperplane for non-separable patterns Important We have been considering only problems where the classes are linearly separable. Now What happens when the patterns are not separable? Thus, we can still build a separating hyperplane But errors will happen in the classification... We need to minimize them... 89 / 124
  • 185. What if the following happens Some data points invade the “margin” space (figure: optimal hyperplane with a data point violating the property) 90 / 124
  • 186. Fixing the Problem - Corinna’s Style The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the following condition d_i(w^T x_i + b) ≥ +1, i = 1, 2, ..., N (22) This violation can arise in one of two ways The data point (x_i, d_i) falls inside the region of separation but on the right side of the decision surface - still a correct classification. 91 / 124
  • 188. We have then Example (figure: optimal hyperplane with a data point inside the margin but on the correct side) 92 / 124
  • 189. Or... This violation can arise in one of two ways The data point (x_i, d_i) falls on the wrong side of the decision surface - an incorrect classification. Example (figure: optimal hyperplane with a misclassified data point) 93 / 124
  • 191. Solving the problem What to do? We introduce a set of nonnegative scalar values {ξ_i}_{i=1}^N. Introduce this into the decision rule d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., N (23) 94 / 124
  • 194. The ξi are called slack variables What? In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified maximum margin idea that allows for mislabeled examples. Ok!!! Instead of expecting a constant margin for all the samples, the margin can change depending on the sample. What do we have? ξ_i measures the deviation of a data point from the ideal condition of pattern separability. 95 / 124
  • 197. Properties of ξi What if? You have 0 ≤ ξ_i ≤ 1 We have (figure: the data point falls inside the margin but on the correct side of the optimal hyperplane) 96 / 124
  • 199. Properties of ξi What if? You have ξ_i > 1 We have (figure: the data point falls on the wrong side of the optimal hyperplane - a misclassification) 97 / 124
  • 201. Support Vectors We want Support vectors are those that satisfy equation (Eq. 23) with equality even when ξ_i > 0: d_i(w^T x_i + b) = 1 − ξ_i, i = 1, 2, ..., N 98 / 124
  • 202. We want the following We want to find a hyperplane Such that the average error over all the samples, (1/N) Σ_{i=1}^N e_i² (24), is minimized 99 / 124
  • 203. First Attempt Into Minimization We can try the following Given I(x) = 0 if x ≤ 0, 1 if x > 0 (25) Minimize the following Φ(ξ) = Σ_{i=1}^N I(ξ_i − 1) (26) with respect to the weight vector w, subject to 1 d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., N 2 ||w||² ≤ C for a given C. 100 / 124
  • 205. Problem Using this first attempt Minimization of Φ(ξ) with respect to w is a non-convex optimization problem that is NP-complete. Thus, we need to use an approximation, maybe Φ(ξ) = Σ_{i=1}^N ξ_i (27) Now, we simplify the computations by incorporating the weight vector w Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^N ξ_i (28) 101 / 124
  • 208. Important First Minimizing the first term in (Eq. 28) is related to minimizing the Vapnik–Chervonenkis dimension, Which is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a statistical classification algorithm. Second The second term Σ_{i=1}^N ξ_i is an upper bound on the number of test errors. 102 / 124
  • 211. Some problems for the Parameter C Little Problem The parameter C has to be selected by the user. This can be done in two ways 1 The parameter C is determined experimentally via the standard use of a training / validation (test) set. 2 It is determined analytically by estimating the Vapnik–Chervonenkis dimension. 103 / 124
  • 214. Primal Problem Problem, given samples {(x_i, d_i)}_{i=1}^N min_{w,ξ} Φ(w, ξ) = min_{w,ξ} (1/2) w^T w + C Σ_{i=1}^N ξ_i s.t. d_i(w^T x_i + w_0) ≥ 1 − ξ_i for i = 1, · · · , N and ξ_i ≥ 0 for all i, With C a user-specified positive parameter. 104 / 124
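For intuition, note that at the optimum ξ_i = max(0, 1 − d_i(w^T x_i + w_0)), so this primal is equivalent to an unconstrained hinge-loss objective. The sketch below (an illustration, not the slides' algorithm) minimizes that form with plain subgradient descent.

```python
# Sketch: soft-margin primal via the equivalent hinge-loss form
#   min_{w, w0}  (1/2) ||w||^2 + C * sum_i max(0, 1 - d_i (w^T x_i + w0))
import numpy as np

def soft_margin_primal(X, d, C=1.0, lr=1e-3, epochs=500):
    N, l = X.shape
    w, w0 = np.zeros(l), 0.0
    for _ in range(epochs):
        margins = d * (X @ w + w0)
        viol = margins < 1.0                    # samples with positive slack
        grad_w = w - C * (d[viol] @ X[viol])    # subgradient with respect to w
        grad_w0 = -C * d[viol].sum()            # subgradient with respect to w0
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```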
  • 215. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 105 / 124
  • 216. Final Setup Using Lagrange Multipliers and the primal-dual method it is possible to obtain the following setup Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i subject to the constraints Σ_{i=1}^N α_i d_i = 0 (29) and 0 ≤ α_i ≤ C for i = 1, · · · , N (30), where C is a user-specified positive parameter. 106 / 124
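Relative to the hard-margin QP sketched earlier, the only change in a generic QP encoding is the extra upper bound α_i ≤ C. A minimal sketch of that constraint block (again assuming cvxopt, which the slides do not mention):

```python
# Sketch: inequality block encoding 0 <= alpha_i <= C for the soft-margin dual.
import numpy as np
from cvxopt import matrix

def box_constraints(N, C):
    G = matrix(np.vstack([-np.eye(N),          # -alpha_i <= 0
                           np.eye(N)]))        #  alpha_i <= C
    h = matrix(np.hstack([np.zeros(N),
                          C * np.ones(N)]))
    return G, h                                # pass to solvers.qp in place of (G, h)
```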
  • 217. Remarks Something Notable Note that neither the slack variables nor their Lagrange multipliers appear in the dual problem. The dual problem for the case of non-separable patterns is thus similar to that for the simple case of linearly separable patterns The only big difference Instead of using the constraint α_i ≥ 0, the new problem uses the more stringent constraint 0 ≤ α_i ≤ C. Note the following ξ_i = 0 if α_i < C (31) 107 / 124
  • 221. Finally The optimal solution for the weight vector w* w* = Σ_{i=1}^{N_s} α_i* d_i x_i, Where N_s is the number of support vectors. In addition The determination of the optimum values proceeds in a manner similar to that described before. The KKT conditions are as follows α_i [d_i(w^T x_i + w_0) − 1 + ξ_i] = 0 for i = 1, 2, ..., N, and μ_i ξ_i = 0 for i = 1, 2, ..., N. 108 / 124
  • 225. Where... The μi are Lagrange multipliers They are used to enforce the non-negativity of the slack variables ξ_i for all i. Something Notable At the saddle point, the derivative with respect to ξ_i of the Lagrangian function for the primal problem, (1/2) w^T w + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1 + ξ_i] − Σ_{i=1}^N μ_i ξ_i (32), is zero. 109 / 124
  • 227. Thus We get α_i + μ_i = C (33) Thus, if α_i < C Then μ_i > 0 ⇒ ξ_i = 0 We may determine w_0 Using any data point (x_i, d_i) in the training set such that 0 < α_i* < C. Then, given ξ_i = 0, w_0* = (1/d_i) − (w*)^T x_i (34) 110 / 124
  • 230. Nevertheless It is better To take the mean value of w_0* from all such data points in the training sample (Burges, 1998). BTW He wrote a great tutorial on SVM’s, “A Tutorial on Support Vector Machines for Pattern Recognition” (1998). 111 / 124
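A tiny sketch of that recommendation (helper and variable names are assumptions, not from the slides): average Eq. 34 over all margin support vectors, i.e. the points with 0 < α_i < C, for which ξ_i = 0.

```python
# Sketch: average w0 over the margin support vectors (0 < alpha_i < C).
import numpy as np

def bias_from_margin_svs(alpha, X, d, w, C, tol=1e-6):
    on_margin = (alpha > tol) & (alpha < C - tol)
    # Eq. 34 for each such point: w0 = 1/d_i - w^T x_i  (with d_i in {-1, +1})
    return np.mean(1.0 / d[on_margin] - X[on_margin] @ w)
```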
  • 231. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 112 / 124
  • 232. Basic Idea Something Notable The SVM uses the scalar product ⟨x_i, x_j⟩ as a measure of similarity between x_i and x_j, and of distance to the hyperplane. Since the scalar product is linear, the SVM is a linear method. But Using a nonlinear function instead, we can make the classifier nonlinear. 113 / 124
  • 235. We do this by defining the following map Nonlinear transformations Given a series of nonlinear transformations {φ_i(x)}_{i=1}^m from the input space to the feature space, We can define the decision surface as Σ_{i=1}^m w_i φ_i(x) + w_0 = 0. 114 / 124
  • 237. This allows us to define The following vector φ(x) = (φ_0(x), φ_1(x), · · · , φ_m(x))^T That represents the mapping. 115 / 124
  • 238. Outline 1 History The Beginning 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 5 Soft Margins Introduction The Soft Margin Solution 6 More About Kernels Basic Idea From Inner products to Kernels 116 / 124
  • 239. Finally We define the decision surface as w^T φ(x) = 0 (35) We now seek "linear" separability of features, so we may write w = Σ_{i=1}^N α_i d_i φ(x_i) (36) Thus, we finish with the following decision surface Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = 0 (37) 117 / 124
  • 242. Thus The term φ^T(x_i) φ(x) It represents the inner product of the two vectors induced in the feature space by the input patterns. We can introduce the inner-product kernel K(x_i, x) = φ^T(x_i) φ(x) = Σ_{j=0}^m φ_j(x_i) φ_j(x) (38) Property: Symmetry K(x_i, x) = K(x, x_i) (39) 118 / 124
  • 245. This allows to redefine the optimal hyperplane We get Σ_{i=1}^N α_i d_i K(x_i, x) = 0 (40) Something Notable Using kernels, we can avoid going from: Input Space =⇒ Mapping Space =⇒ Inner Product (41) By directly going from Input Space =⇒ Inner Product (42) 119 / 124
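A minimal sketch of this kernelized decision rule (names are assumptions; the slides stop at Eq. 40): classify a new point x by the sign of Σ_i α_i d_i K(x_i, x) over the support vectors, adding a bias term if one is kept explicitly.

```python
# Sketch: kernelized decision function from Eq. 40.
import numpy as np

def decision_function(x, X_sv, d_sv, alpha_sv, kernel, w0=0.0):
    """X_sv, d_sv, alpha_sv: support vectors, their labels and multipliers."""
    k = np.array([kernel(x_i, x) for x_i in X_sv])
    return (alpha_sv * d_sv) @ k + w0

# Predicted label, e.g. with the RBF kernel sketched earlier:
# np.sign(decision_function(x_new, X_sv, d_sv, alpha_sv, rbf_kernel, w0))
```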
  • 248. Important Something Notable The expansion of (Eq. 38) for the inner-product kernel K(x_i, x) is an important special case of Mercer's theorem, which arises in functional analysis. 120 / 124
  • 249. Mercer’s Theorem Mercer’s Theorem Let K(x, x′) be a continuous symmetric kernel that is defined in the closed interval a ≤ x ≤ b and likewise for x′. The kernel K(x, x′) can be expanded in the series K(x, x′) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x′) (43) With Positive coefficients, λ_i > 0 for all i. 121 / 124
  • 251. Mercer’s Theorem For this expansion to be valid and for it to converge absolutely and uniformly It is necessary and sufficient that the condition ∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0 (44) holds for all ψ such that ∫_a^b ψ²(x) dx < ∞ (an example of a quadratic norm for functions). 122 / 124
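In practice one often checks a finite-sample analogue of this condition (a sketch under that assumption; not from the slides): the Gram matrix K_ij = K(x_i, x_j) built on the training data must be symmetric positive semidefinite.

```python
# Sketch: check that a kernel's Gram matrix on data X is positive semidefinite.
import numpy as np

def is_psd_gram(X, kernel, tol=1e-10):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    eigvals = np.linalg.eigvalsh(K)        # eigenvalues of the symmetric Gram matrix
    return bool(np.all(eigvals >= -tol))
```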
  • 252. Remarks First The functions φ_i(x) are called eigenfunctions of the expansion and the numbers λ_i are called eigenvalues. Second The fact that all of the eigenvalues are positive means that the kernel K(x, x′) is positive definite. 123 / 124
  • 254. Not only that We have that For λ_i = 1, the ith image √λ_i φ_i(x) induced in the feature space by the input vector x is an eigenfunction of the expansion. In theory The dimensionality of the feature space (i.e., the number of eigenvalues / eigenfunctions) can be infinitely large. 124 / 124