Machine Learning for Data Mining
Introduction to Bayesian Classifiers
Andres Mendez-Vazquez
August 3, 2015
Outline
1 Introduction
Supervised Learning
Naive Bayes
The Naive Bayes Model
The Multi-Class Case
Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
Introduction
Gaussian Distribution
Influence of the Covariance Σ
Maximum Likelihood Principle
Maximum Likelihood on a Gaussian
Classification problem
Training Data
Samples of the form (d, h(d)), where:
d: the data objects to classify (inputs).
h(d): the correct class label for d, with h(d) ∈ {1, . . . , K}.
Classification Problem
Goal
Given dnew, provide h(dnew).
The machinery in general looks like:
[Figure: a supervised learner maps INPUT to OUTPUT, guided by the training info, i.e., the desired/target output.]
Naive Bayes Model
Task for two classes
Let ω1, ω2 be the two classes to which our samples may belong.
There is a prior probability of belonging to each class:
P (ω1) for class 1.
P (ω2) for class 2.
The rule for classification is the following one:
P (ωi|x) = P (x|ωi) P (ωi) / P (x)   (1)
Remark: Bayes' rule taken to the next level.
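A minimal numeric sketch of Eq. (1). The priors and likelihood values below are invented for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical two-class setup: priors P(wi) and class-conditional
# likelihoods p(x|wi) evaluated at a single observation x.
priors = np.array([2 / 3, 1 / 3])      # P(w1), P(w2)
likelihoods = np.array([0.05, 0.20])   # p(x|w1), p(x|w2)

evidence = np.sum(likelihoods * priors)       # P(x), the normalizer
posteriors = likelihoods * priors / evidence  # P(wi|x) by Bayes' rule

print(posteriors)  # sums to 1: here [1/3, 2/3]
```

Even though class 1 has the larger prior, the larger likelihood of class 2 at this x dominates the posterior.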
In Informal English
We have that
posterior = (likelihood × prior information) / evidence   (2)
Basically
One: if we can observe x,
Two: then we can convert the prior information into posterior information.
We have the following terms...
Likelihood
We call p (x|ωi) the likelihood of ωi given x:
This indicates that, given a category ωi, if p (x|ωi) is “large”, then ωi is the “likely” class of x.
Prior Probability
It is the known probability of a given class.
Remark: because we often lack information about this class, we tend to use the uniform distribution.
However: we can use other tricks to estimate it.
Evidence
The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.
The most important term in all this
The factor
likelihood × prior information   (3)
Example
We have the likelihoods of two classes:
[Figure: the class-conditional densities p (x|ω1) and p (x|ω2).]
Example
We have the posteriors of two classes when P (ω1) = 2/3 and P (ω2) = 1/3:
[Figure: the posteriors P (ω1|x) and P (ω2|x).]
Naive Bayes Model
In the case of two classes
P (x) = Σ_{i=1}^{2} p (x, ωi) = Σ_{i=1}^{2} p (x|ωi) P (ωi)   (4)
Error in this rule
We have that
P (error|x) = { P (ω1|x) if we decide ω2 ; P (ω2|x) if we decide ω1 }   (5)
Thus, we have that
P (error) = ∫_{−∞}^{∞} P (error, x) dx = ∫_{−∞}^{∞} P (error|x) p (x) dx   (6)
Classification Rule
Thus, we have the Bayes Classification Rule:
1 If P (ω1|x) > P (ω2|x), x is classified to ω1.
2 If P (ω1|x) < P (ω2|x), x is classified to ω2.
What if we remove the normalization factor?
Remember
P (ω1|x) + P (ω2|x) = 1   (7)
We are able to obtain the new Bayes Classification Rule:
1 If P (x|ω1) P (ω1) > P (x|ω2) P (ω2), x is classified to ω1.
2 If P (x|ω1) P (ω1) < P (x|ω2) P (ω2), x is classified to ω2.
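The equivalence of the normalized and unnormalized rules can be checked numerically. The values below are illustrative, not from the slides:

```python
import numpy as np

# Illustrative priors and likelihoods evaluated at one observation x.
priors = np.array([0.5, 0.5])
likelihoods = np.array([0.3, 0.1])   # p(x|w1), p(x|w2)

scores = likelihoods * priors        # P(x|wi) P(wi): no evidence needed
posteriors = scores / scores.sum()   # full Bayes posteriors

# Both forms pick the same class, since P(x) is a common positive factor.
assert np.argmax(scores) == np.argmax(posteriors)
print(int(np.argmax(scores)) + 1)  # class 1
```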
We have several cases
If for some x we have P (x|ω1) = P (x|ω2)
The final decision relies completely on the prior probabilities.
On the other hand, if P (ω1) = P (ω2), the “states” are equally probable
In this case the decision is based entirely on the likelihoods P (x|ωi).
What the Rule looks like
If P (ω1) = P (ω2), the Rule depends only on the term p (x|ωi):
[Figure: the two likelihoods and the decision threshold x0.]
The Error in the Second Case of Naive Bayes
Error in equiprobable classes
P (error) = (1/2) ∫_{−∞}^{x0} p (x|ω2) dx + (1/2) ∫_{x0}^{∞} p (x|ω1) dx   (8)
Remark: P (ω1) = P (ω2) = 1/2.
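Eq. (8) can be evaluated in closed form for an assumed pair of Gaussian likelihoods, say p (x|ω1) = N(0, 1) and p (x|ω2) = N(2, 1) (invented for illustration), whose densities cross at the threshold x0 = 1:

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Assumed densities: p(x|w1) = N(0,1), p(x|w2) = N(2,1), priors 1/2 each.
# Their likelihoods cross at x0 = 1, the Bayes decision threshold.
x0 = 1.0
p_err = 0.5 * norm_cdf((x0 - 2.0) / 1.0) + 0.5 * (1.0 - norm_cdf((x0 - 0.0) / 1.0))
print(round(p_err, 4))  # 0.1587
```

Both error integrals contribute the same tail mass Φ(−1) here, so P(error) ≈ 0.1587.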
What do we want to prove?
Something Notable
The Bayesian classifier is optimal with respect to minimizing the classification error probability.
Proof
Step 1
Let R1 be the region of the feature space in which we decide in favor of ω1, and R2 the region in which we decide in favor of ω2.
Step 2
Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2)   (9)
Thus
Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2)
   = P (ω1) ∫_{R2} p (x|ω1) dx + P (ω2) ∫_{R1} p (x|ω2) dx
Proof
It is more
Pe = P (ω1) ∫_{R2} [p (ω1, x) / P (ω1)] dx + P (ω2) ∫_{R1} [p (ω2, x) / P (ω2)] dx   (10)
Finally
Pe = ∫_{R2} p (ω1|x) p (x) dx + ∫_{R1} p (ω2|x) p (x) dx   (11)
Now, we choose the Bayes Classification Rule:
R1 : P (ω1|x) > P (ω2|x)
R2 : P (ω2|x) > P (ω1|x)
Proof
Thus
P (ω1) = ∫_{R1} p (ω1|x) p (x) dx + ∫_{R2} p (ω1|x) p (x) dx   (12)
Now, we have...
P (ω1) − ∫_{R1} p (ω1|x) p (x) dx = ∫_{R2} p (ω1|x) p (x) dx   (13)
Then
Pe = P (ω1) − ∫_{R1} p (ω1|x) p (x) dx + ∫_{R1} p (ω2|x) p (x) dx   (14)
Graphically P (ω1): Thanks Edith 2013 Class!!!
In red:
[Figure: the probability mass P (ω1) highlighted in red.]
Thus we have
∫_{R1} p (ω1|x) p (x) dx = ∫_{R1} p (ω1, x) dx = P_{R1} (ω1)
[Figure: the mass of class ω1 falling inside region R1.]
Finally
Finally
Pe = P (ω1) − ∫_{R1} [p (ω1|x) − p (ω2|x)] p (x) dx   (15)
Thus, we have
Pe = [ P (ω1) − ∫_{R1} p (ω1|x) p (x) dx ] + ∫_{R1} p (ω2|x) p (x) dx
Pe for a non-optimal rule
A great idea Edith!!!
[Figure: Pe under a non-optimal decision threshold.]
Which decision function for minimizing the error
A single number in this case:
[Figure: the scalar threshold that minimizes the error.]
Error is minimized by the Bayesian Naive Rule
Thus
The probability of error is minimized at the regions of space in which:
R1 : P (ω1|x) > P (ω2|x)
R2 : P (ω2|x) > P (ω1|x)
Pe for an optimal rule
A great idea Edith!!!
[Figure: Pe under the Bayes-optimal threshold.]
For M classes ω1, ω2, ..., ωM
We have that vector x is in ωi if
P (ωi|x) > P (ωj|x) ∀j ≠ i   (16)
Something Notable
It turns out that such a choice also minimizes the classification error probability.
Minimizing the Risk
Something Notable
The classification error probability is not always the best criterion to be adopted for minimization: it gives all errors the same importance.
However
Certain errors are more important than others.
For example
Really serious: a doctor makes a wrong decision and a malignant tumor gets classified as benign.
Not so serious: a benign tumor gets classified as malignant.
It is based on the following idea
In order to measure the predictive performance of a function f : X → Y,
we use a loss function
ℓ : Y × Y → R+   (17)
a non-negative function that quantifies how bad the prediction f (x) is given the true label y.
Thus, we can say that
ℓ (f (x) , y) is the loss incurred by f on the pair (x, y).
Example
In classification
In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y′ = f (x):
ℓ (y′, y) = 1 [y′ ≠ y]   (18)
Furthermore
For regression problems, some natural choices
1 Squared loss: ℓ (y′, y) = (y′ − y)²
2 Absolute loss: ℓ (y′, y) = |y′ − y|
Thus, given the loss function, we can define the risk as
R (f ) = E_{(X,Y )} [ℓ (f (X) , Y )]   (19)
Although we cannot see the expected risk, we can use the sample to estimate the following
R̂ (f ) = (1/N) Σ_{i=1}^{N} ℓ (f (xi) , yi)   (20)
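The empirical risk of Eq. (20) with the 0-1 loss is just the fraction of mistakes over the sample. A toy sketch (the sample and the threshold classifier are made up for illustration):

```python
def zero_one_loss(y_pred, y_true):
    # 0-1 loss of Eq. (18): 1 for a mistake, 0 for a correct prediction.
    return 1.0 if y_pred != y_true else 0.0

def empirical_risk(f, xs, ys):
    # R_hat(f): the average loss of f over the N sample pairs.
    return sum(zero_one_loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

# Toy labeled sample and a simple threshold classifier.
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]
f = lambda x: 1 if x > 0.5 else 0

print(empirical_risk(f, xs, ys))  # 0.0
```

A worse classifier, e.g. one that always predicts 0, would score 0.5 on this sample.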
Thus
Risk Minimization
Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM):
f̂N = arg min_{f ∈ F} R̂ (f )   (21)
If we knew the distribution P and did not restrict ourselves to F, the best function would be
f ∗ = arg min_{f} R (f )   (22)
Thus
For classification with 0-1 loss, f ∗ is called the Bayes Classifier:
f ∗ (x) = arg max_{y∈Y} P (Y = y|X = x)   (23)
The New Risk Function
Now, we have the following loss values for two classes
λ12 is the loss value if the sample is in class 1, but the function classifies it as in class 2.
λ21 is the loss value if the sample is in class 2, but the function classifies it as in class 1.
We can generate the new risk function
r = λ12 Pe (x, ω1) + λ21 Pe (x, ω2)
  = λ12 ∫_{R2} p (x, ω1) dx + λ21 ∫_{R1} p (x, ω2) dx
  = λ12 ∫_{R2} p (x|ω1) P (ω1) dx + λ21 ∫_{R1} p (x|ω2) P (ω2) dx
The New Risk Function
The new risk function to be minimized
r = λ12 P (ω1) ∫_{R2} p (x|ω1) dx + λ21 P (ω2) ∫_{R1} p (x|ω2) dx   (24)
Where the λ terms work as weight factors for each error
Thus, we could have something like λ12 > λ21!!!
Then
Errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation.
Now, Consider an M-class problem
We have
Rj, j = 1, ..., M, the regions where the classes ωj live.
Now, think of the following error
Assume x belongs to class ωk, but lies in Ri with i ≠ k −→ the vector is misclassified.
Now, for this error we associate the term λki (loss)
With that we have a loss matrix L whose (k, i) entry corresponds to such a loss.
Thus, we have a general loss associated to each class ωk
Definition
rk = Σ_{i=1}^{M} λki ∫_{Ri} p (x|ωk) dx   (25)
We want to minimize the global risk
r = Σ_{k=1}^{M} rk P (ωk) = Σ_{i=1}^{M} ∫_{Ri} Σ_{k=1}^{M} λki p (x|ωk) P (ωk) dx   (26)
For this
We want to select the set of partition regions Rj.
How do we do that?
We minimize each integral
∫_{Ri} Σ_{k=1}^{M} λki p (x|ωk) P (ωk) dx   (27)
We need to minimize
Σ_{k=1}^{M} λki p (x|ωk) P (ωk)   (28)
We can do the following
If x ∈ Ri, then
li = Σ_{k=1}^{M} λki p (x|ωk) P (ωk) < lj = Σ_{k=1}^{M} λkj p (x|ωk) P (ωk)   (29)
for all j ≠ i.
Remarks
When we use Kronecker’s delta
δki = { 1 if k = i ; 0 if k ≠ i }   (30)
We can do the following
λki = 1 − δki   (31)
We finish with
Σ_{k=1, k≠i}^{M} p (x|ωk) P (ωk) < Σ_{k=1, k≠j}^{M} p (x|ωk) P (ωk)   (32)
Then
We have that
p (x|ωj) P (ωj) < p (x|ωi) P (ωi)   (33)
for all j ≠ i.
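A sketch of the multi-class decision: with the 0-1 loss matrix λki = 1 − δki of Eq. (31), minimizing li in Eq. (29) reduces to maximizing p (x|ωi) P (ωi). The likelihoods, priors, and loss matrix below are illustrative values:

```python
import numpy as np

# Assumed 3-class setup at one point x: likelihoods p(x|wk), priors P(wk),
# and a loss matrix L with L[k, i] = lambda_ki (zeros on the diagonal).
likelihoods = np.array([0.10, 0.30, 0.05])
priors = np.array([0.5, 0.3, 0.2])
L = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])  # 0-1 losses: lambda_ki = 1 - delta_ki

joint = likelihoods * priors      # p(x|wk) P(wk)
l = L.T @ joint                   # l_i = sum_k lambda_ki * p(x|wk) P(wk)
decision = int(np.argmin(l)) + 1  # pick the class with the smallest risk

print(decision)  # 2
```

Here class 2 wins: its joint p (x|ω2) P (ω2) = 0.09 is the largest, so its risk li is the smallest, exactly the MAP choice.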
Special Case: The Two-Class Case
We get
l1 = λ11 p (x|ω1) P (ω1) + λ21 p (x|ω2) P (ω2)
l2 = λ12 p (x|ω1) P (ω1) + λ22 p (x|ω2) P (ω2)
We assign x to ω1 if l1 < l2, i.e.
(λ21 − λ22) p (x|ω2) P (ω2) < (λ12 − λ11) p (x|ω1) P (ω1)   (34)
If we assume that λii < λij (correct decisions are penalized much less than wrong ones), then
x ∈ ω1 (ω2) if l12 ≡ p (x|ω1) / p (x|ω2) > (<) [P (ω2) / P (ω1)] · [(λ21 − λ22) / (λ12 − λ11)]   (35)
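Eq. (35) written as a small helper; the default 0-1 losses and the example numbers are invented for illustration:

```python
def likelihood_ratio_test(px_w1, px_w2, p1, p2,
                          lam11=0.0, lam22=0.0, lam12=1.0, lam21=1.0):
    # Assign to w1 iff l12 = p(x|w1)/p(x|w2) exceeds the threshold
    # theta = (P(w2)/P(w1)) * (lam21 - lam22) / (lam12 - lam11), per Eq. (35).
    l12 = px_w1 / px_w2
    theta = (p2 / p1) * (lam21 - lam22) / (lam12 - lam11)
    return 1 if l12 > theta else 2

# Equal priors and 0-1 losses: the test is a plain likelihood comparison.
print(likelihood_ratio_test(0.3, 0.1, 0.5, 0.5))  # 1
```

Shifting the priors changes the threshold: with P (ω1) = 0.1, P (ω2) = 0.9 the same likelihood ratio of 3 no longer clears θ = 9, so the decision flips to ω2.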
Special Case: The Two-Class Case
Definition
l12 is known as the likelihood ratio, and the preceding test as the likelihood ratio test.
Decision Surface
Because the regions R1 and R2 are contiguous
The separating surface between them is described by
P (ω1|x) − P (ω2|x) = 0   (36)
Thus, we define the decision function as
g12 (x) = P (ω1|x) − P (ω2|x) = 0   (37)
Which decision function for the Naive Bayes
A single number in this case:
[Figure: the scalar decision threshold for the Naive Bayes rule.]
In general
First
Instead of working with probabilities, we work with an equivalent function of them, gi (x) = f (P (ωi|x)).
Classic Example: a monotonically increasing function
f (P (ωi|x)) = ln P (ωi|x).
The decision test is now
Classify x in ωi if gi (x) > gj (x) ∀j ≠ i.
The decision surfaces, separating contiguous regions, are described by
gij (x) = gi (x) − gj (x) = 0, for i, j = 1, 2, ..., M with i ≠ j.
Gaussian Distribution
We can use the Gaussian distribution
p (x|ωi) = [1 / ((2π)^{l/2} |Σi|^{1/2})] exp{ −(1/2) (x − µi)^T Σi^{−1} (x − µi) }   (38)
Example
[Figure: a two-dimensional Gaussian density.]
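Eq. (38) can be evaluated directly; as a sanity check, at the mean of a 2-D unit Gaussian (l = 2, Σ = I) the density is 1/(2π):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Multivariate normal density of Eq. (38); l is the dimension of x.
    l = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (l / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([0.0, 0.0])
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
density = gaussian_pdf(x, mu, Sigma)
print(density)  # 1/(2*pi), about 0.15915
```

For repeated evaluations one would factor Σ once (e.g. via Cholesky) rather than invert it per call; the direct form above just mirrors the formula.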
Some Properties
About Σ
It is the covariance matrix between variables.
Thus
It is positive definite.
It is symmetric.
Its inverse exists.
Influence of the Covariance Σ
Look at the following covariance
Σ = [1 0; 0 1]
It is simply the unit Gaussian with mean µ.
The Covariance Σ as a Rotation
Look at the following covariance
Σ = [4 0; 0 1]
Actually, it elongates the circle along the x-axis.
Influence of the Covariance Σ
Look at the following covariance
Σa = R Σb R^T with R = [cos θ  − sin θ; sin θ  cos θ]
It allows us to rotate the axes.
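The rotation Σa = R Σb R^T can be verified numerically; a small sketch with an assumed angle θ = π/2, which swaps the two axes:

```python
import numpy as np

# Rotating a covariance: Σ_a = R Σ_b R^T with the rotation matrix
# R = [cos θ  −sin θ; sin θ  cos θ].
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
sigma_b = np.array([[4.0, 0.0],   # stretched along the x-axis
                    [0.0, 1.0]])
sigma_a = R @ sigma_b @ R.T       # a 90-degree rotation: stretched along y
```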
Now For Two Classes
Then, we use the following trick for two classes i = 1, 2
We know that the pdf of correct classification is
p(x, ωi) = p(x|ωi)P(ωi)!!!
Thus
It is possible to generate the following decision function:
gi(x) = ln [p(x|ωi)P(ωi)] = ln p(x|ωi) + ln P(ωi)   (39)
Thus
gi(x) = −(1/2)(x − µi)^T Σi^{-1} (x − µi) + ln P(ωi) + ci   (40)
59 / 71
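The discriminant of Eq. (40) is straightforward to evaluate; the sketch below writes the constant ci out explicitly, and the means, covariances, and priors are assumptions chosen only for illustration:

```python
import numpy as np

# Discriminant of Eq. (40) for a Gaussian class, with the constant
# c_i = -(l/2) ln 2π - (1/2) ln|Σ_i| written out rather than dropped.
def discriminant(x, mu, sigma, prior):
    l = len(mu)
    diff = x - mu
    quad = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    c = -0.5 * l * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(sigma))
    return quad + np.log(prior) + c

x = np.array([0.1, -0.2])
g1 = discriminant(x, np.zeros(2), np.eye(2), 0.5)
g2 = discriminant(x, np.array([3.0, 3.0]), np.eye(2), 0.5)
# x lies near the first mean, so g1 > g2 and x is assigned to ω1.
```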
Outline
1 Introduction
Supervised Learning
Naive Bayes
The Naive Bayes Model
The Multi-Class Case
Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
Introduction
Gaussian Distribution
Influence of the Covariance Σ
Maximum Likelihood Principle
Maximum Likelihood on a Gaussian
60 / 71
Given a series of classes ω1, ω2, ..., ωM
We assume for each class ωj
The samples are drawn independently according to the probability law p(x|ωj)
We call those samples
i.i.d. — independent identically distributed random variables.
We assume in addition
p(x|ωj) has a known parametric form with vector θj of parameters.
61 / 71
Given a series of classes ω1, ω2, ..., ωM
For example
p(x|ωj) ∼ N(µj, Σj)   (41)
In our case
We will assume that there is no dependence between classes!!!
62 / 71
Now
Suppose that ωj contains n samples x1, x2, ..., xn
p(x1, x2, ..., xn|θj) = ∏_{k=1}^{n} p(xk|θj)   (42)
We can then see the function p(x1, x2, ..., xn|θj) as a function of θj:
L(θj) = ∏_{k=1}^{n} p(xk|θj)   (43)
63 / 71
Example
ln L(θj) = log ∏_{k=1}^{n} p(xk|θj) = ∑_{k=1}^{n} log p(xk|θj)
64 / 71
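Because the log turns the product of Eq. (43) into a sum, the log-likelihood is easy to compute and compare; a sketch for a 1-D Gaussian with known unit variance (the sample values are assumptions):

```python
import numpy as np

# Log-likelihood of i.i.d. samples under N(mu, 1): the log of the
# product in Eq. (43) becomes a sum of log-densities.
def log_likelihood(samples, mu):
    return float(np.sum(-0.5 * np.log(2 * np.pi)
                        - 0.5 * (samples - mu) ** 2))

samples = np.array([0.5, 1.5, 1.0])
ll_at_mean = log_likelihood(samples, samples.mean())  # mu = 1.0
ll_off = log_likelihood(samples, 2.0)
# The sample mean gives the larger (maximum) log-likelihood.
```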
Outline
1 Introduction
Supervised Learning
Naive Bayes
The Naive Bayes Model
The Multi-Class Case
Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
Introduction
Gaussian Distribution
Influence of the Covariance Σ
Maximum Likelihood Principle
Maximum Likelihood on a Gaussian
65 / 71
Maximum Likelihood on a Gaussian
Then, using the log!!!
ln L(θi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} (xj − µi)^T Σi^{-1} (xj − µi) + c2   (44)
We know that
d(x^T A x)/dx = Ax + A^T x,   d(Ax)/dx = A   (45)
Thus, we expand equation (44)
−(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} [xj^T Σi^{-1} xj − 2 xj^T Σi^{-1} µi + µi^T Σi^{-1} µi] + c2   (46)
66 / 71
Maximum Likelihood
Then
∂ ln L(θi)/∂µi = ∑_{j=1}^{n} Σi^{-1} (xj − µi) = 0
nΣi^{-1} [−µi + (1/n) ∑_{j=1}^{n} xj] = 0
µ̂i = (1/n) ∑_{j=1}^{n} xj
67 / 71
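The closed form µ̂i = (1/n) ∑ xj is just the per-coordinate sample average; a minimal sketch with made-up two-dimensional samples:

```python
import numpy as np

# ML estimate of the class mean: the per-coordinate sample average.
samples = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
mu_hat = samples.mean(axis=0)   # the ML estimate of the mean
```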
Maximum Likelihood
Then, we derive with respect to Σi
For this we use the following tricks:
1 ∂ log |Σ| / ∂Σ^{-1} = −Σ^T = −Σ (for symmetric Σ)
2 ∂Tr[AB]/∂A = ∂Tr[BA]/∂A = B^T
3 Trace(of a number) = the number
4 Tr(A^T B) = Tr(B A^T)
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} (xj − µi)^T Σi^{-1} (xj − µi) + c1   (47)
68 / 71
Maximum Likelihood
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} Trace[(xj − µi)^T Σi^{-1} (xj − µi)] + c1   (48)
Tricks!!!
f(Σi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} Trace[Σi^{-1} (xj − µi)(xj − µi)^T] + c1   (49)
69 / 71
Maximum Likelihood
Derivative with respect to Σi^{-1} (using tricks 1 and 2)
∂f(Σi)/∂Σi^{-1} = (n/2) Σi − (1/2) ∑_{j=1}^{n} [(xj − µi)(xj − µi)^T]^T   (50)
Thus, when making it equal to zero
Σ̂i = (1/n) ∑_{j=1}^{n} (xj − µi)(xj − µi)^T   (51)
70 / 71
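Note that Eq. (51) divides by n (not n − 1), i.e., it is the biased sample covariance, with the ML mean estimate plugged in for µi. A minimal sketch with assumed samples:

```python
import numpy as np

# ML covariance estimate: Σ̂ = (1/n) Σ_j (x_j − µ̂)(x_j − µ̂)^T.
samples = np.array([[0.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 2.0],
                    [2.0, 2.0]])
mu_hat = samples.mean(axis=0)             # µ̂ = [1, 1]
diff = samples - mu_hat
sigma_hat = diff.T @ diff / len(samples)  # same as np.cov(samples.T, bias=True)
```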
Exercises
Duda and Hart
Chapter 3
3.1, 3.2, 3.3, 3.13
Theodoridis
Chapter 2
2.5, 2.7, 2.10, 2.12, 2.14, 2.17
71 / 71

More Related Content

PPSX
Perceptron (neural network)
PDF
Python Anaconda Tutorial | Edureka
PPTX
Operating system critical section
PDF
Python Programming Tutorial | Edureka
PDF
Daa notes 1
PPTX
An introduction to Prolog language slide
PPT
Distributed systems scheduling
PPT
Os module 2 c
Perceptron (neural network)
Python Anaconda Tutorial | Edureka
Operating system critical section
Python Programming Tutorial | Edureka
Daa notes 1
An introduction to Prolog language slide
Distributed systems scheduling
Os module 2 c

What's hot (20)

PDF
27 NP Completness
DOCX
Python unit 3 and Unit 4
PPTX
Print input-presentation
PPTX
Introduction to R programming
PPTX
INHERITANCE IN JAVA.pptx
PPTX
Butterfly optimization algorithm
DOCX
Exceptions handling notes in JAVA
PPTX
Solutions to byzantine agreement problem
PPT
Recurrences
PPTX
Tic tac toe simple ai game
PPT
PPTX
Python variables and data types.pptx
PPTX
Type checking compiler construction Chapter #6
PPTX
Dynamic method dispatch
DOC
Branch and bound
PPT
Dbms ii mca-ch5-ch6-relational algebra-2013
PPTX
System call (Fork +Exec)
PPTX
Moore Mealy Machine Conversion
27 NP Completness
Python unit 3 and Unit 4
Print input-presentation
Introduction to R programming
INHERITANCE IN JAVA.pptx
Butterfly optimization algorithm
Exceptions handling notes in JAVA
Solutions to byzantine agreement problem
Recurrences
Tic tac toe simple ai game
Python variables and data types.pptx
Type checking compiler construction Chapter #6
Dynamic method dispatch
Branch and bound
Dbms ii mca-ch5-ch6-relational algebra-2013
System call (Fork +Exec)
Moore Mealy Machine Conversion
Ad

Viewers also liked (16)

PDF
07 Machine Learning - Expectation Maximization
PDF
08 Machine Learning Maximum Aposteriori
PDF
A Semi-naive Bayes Classifier with Grouping of Cases
PDF
Wikipedia, Dead Authors, Naive Bayes and Python
PDF
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
PPT
Modified naive bayes model for improved web page classification
PDF
02. naive bayes classifier revision
PPTX
"Naive Bayes Classifier" @ Papers We Love Bucharest
PDF
Naive Bayes Classifier
PDF
Naive Bayes
PPTX
Sentiment analysis using naive bayes classifier
PDF
Lecture10 - Naïve Bayes
PPTX
Naive Bayes Presentation
PPTX
Naive bayes
PDF
Modeling Social Data, Lecture 6: Classification with Naive Bayes
PDF
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
07 Machine Learning - Expectation Maximization
08 Machine Learning Maximum Aposteriori
A Semi-naive Bayes Classifier with Grouping of Cases
Wikipedia, Dead Authors, Naive Bayes and Python
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
Modified naive bayes model for improved web page classification
02. naive bayes classifier revision
"Naive Bayes Classifier" @ Papers We Love Bucharest
Naive Bayes Classifier
Naive Bayes
Sentiment analysis using naive bayes classifier
Lecture10 - Naïve Bayes
Naive Bayes Presentation
Naive bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
Ad

Similar to 06 Machine Learning - Naive Bayes (20)

PDF
Bayesian data analysis1
PPT
Bayes Classification
PDF
Bayesian Learning - Naive Bayes Algorithm
PPT
bayes answer jejisiowwoowwksknejejrjejej
PPT
bayesNaive.ppt
PPT
bayesNaive.ppt
PPT
bayesNaive algorithm in machine learning
PDF
Bayesian Learning- part of machine learning
PPTX
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
PPTX
UNIT II (7).pptx
PDF
Module - 4 Machine Learning -22ISE62.pdf
PPT
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
PPTX
Naive Bayes.pptx
PPT
BAYESIAN theorem and implementation of i
PDF
NBaysian classifier, Naive Bayes classifier
PPT
2.3 bayesian classification
PPTX
Unit 2 Machine Learning it's most important topic of basic
PDF
Bayes Theorem.pdf
PDF
19BayesTheoremClassification19BayesTheoremClassification.ppt
PPT
UNIT2_NaiveBayes algorithms used in machine learning
Bayesian data analysis1
Bayes Classification
Bayesian Learning - Naive Bayes Algorithm
bayes answer jejisiowwoowwksknejejrjejej
bayesNaive.ppt
bayesNaive.ppt
bayesNaive algorithm in machine learning
Bayesian Learning- part of machine learning
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
UNIT II (7).pptx
Module - 4 Machine Learning -22ISE62.pdf
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Naive Bayes.pptx
BAYESIAN theorem and implementation of i
NBaysian classifier, Naive Bayes classifier
2.3 bayesian classification
Unit 2 Machine Learning it's most important topic of basic
Bayes Theorem.pdf
19BayesTheoremClassification19BayesTheoremClassification.ppt
UNIT2_NaiveBayes algorithms used in machine learning

More from Andres Mendez-Vazquez (20)

PDF
2.03 bayesian estimation
PDF
05 linear transformations
PDF
01.04 orthonormal basis_eigen_vectors
PDF
01.03 squared matrices_and_other_issues
PDF
01.02 linear equations
PDF
01.01 vector spaces
PDF
06 recurrent neural_networks
PDF
05 backpropagation automatic_differentiation
PDF
Zetta global
PDF
01 Introduction to Neural Networks and Deep Learning
PDF
25 introduction reinforcement_learning
PDF
Neural Networks and Deep Learning Syllabus
PDF
Introduction to artificial_intelligence_syllabus
PDF
Ideas 09 22_2018
PDF
Ideas about a Bachelor in Machine Learning/Data Sciences
PDF
Analysis of Algorithms Syllabus
PDF
20 k-means, k-center, k-meoids and variations
PDF
18.1 combining models
PDF
17 vapnik chervonenkis dimension
PDF
A basic introduction to learning
2.03 bayesian estimation
05 linear transformations
01.04 orthonormal basis_eigen_vectors
01.03 squared matrices_and_other_issues
01.02 linear equations
01.01 vector spaces
06 recurrent neural_networks
05 backpropagation automatic_differentiation
Zetta global
01 Introduction to Neural Networks and Deep Learning
25 introduction reinforcement_learning
Neural Networks and Deep Learning Syllabus
Introduction to artificial_intelligence_syllabus
Ideas 09 22_2018
Ideas about a Bachelor in Machine Learning/Data Sciences
Analysis of Algorithms Syllabus
20 k-means, k-center, k-meoids and variations
18.1 combining models
17 vapnik chervonenkis dimension
A basic introduction to learning

Recently uploaded (20)

DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPT
Project quality management in manufacturing
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
composite construction of structures.pdf
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Geodesy 1.pptx...............................................
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
PPT on Performance Review to get promotions
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
DOCX
573137875-Attendance-Management-System-original
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
R24 SURVEYING LAB MANUAL for civil enggi
Project quality management in manufacturing
CH1 Production IntroductoryConcepts.pptx
composite construction of structures.pdf
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Mechanical Engineering MATERIALS Selection
Geodesy 1.pptx...............................................
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPT on Performance Review to get promotions
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
bas. eng. economics group 4 presentation 1.pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
573137875-Attendance-Management-System-original
CYBER-CRIMES AND SECURITY A guide to understanding
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Model Code of Practice - Construction Work - 21102022 .pdf

06 Machine Learning - Naive Bayes

  • 1. Machine Learning for Data Mining Introduction to Bayesian Classifiers Andres Mendez-Vazquez August 3, 2015 1 / 71
  • 2. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 2 / 71
  • 3. Classification problem Training Data Samples of the form (d, h(d)) d Where d are the data objects to classify (inputs) h (d) h(d) are the correct class info for d, h(d) ∈ 1, . . . K 3 / 71
  • 4. Classification problem Training Data Samples of the form (d, h(d)) d Where d are the data objects to classify (inputs) h (d) h(d) are the correct class info for d, h(d) ∈ 1, . . . K 3 / 71
  • 5. Classification problem Training Data Samples of the form (d, h(d)) d Where d are the data objects to classify (inputs) h (d) h(d) are the correct class info for d, h(d) ∈ 1, . . . K 3 / 71
  • 6. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 4 / 71
  • 7. Classification Problem Goal Given dnew, provide h(dnew) The Machinery in General looks... Supervised Learning Training Info: Desired/Trget Output INPUT OUTPUT 5 / 71
  • 8. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 6 / 71
  • 9. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 10. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 11. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 12. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 13. In Informal English We have that posterior = likelihood × prior − information evidence (2) Basically One: If we can observe x. Two: we can convert the prior-information to the posterior information. 8 / 71
  • 14. In Informal English We have that posterior = likelihood × prior − information evidence (2) Basically One: If we can observe x. Two: we can convert the prior-information to the posterior information. 8 / 71
  • 15. In Informal English We have that posterior = likelihood × prior − information evidence (2) Basically One: If we can observe x. Two: we can convert the prior-information to the posterior information. 8 / 71
  • 16. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 17. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 18. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 19. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 20. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 21. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 22. The most important term in all this The factor likelihood × prior − information (3) 10 / 71
  • 23. Example We have the likelihood of two classes 11 / 71
  • 24. Example We have the posterior of two classes when P (ω1) = 2 3 and P (ω2) = 1 3 12 / 71
  • 25. Naive Bayes Model In the case of two classes P (x) = 2 i=1 p (x, ωi) = 2 i=1 p (x|ωi) P (ωi) (4) 13 / 71
  • 26. Error in this rule We have that P (error|x) = P (ω1|x) if we decide ω2 P (ω2|x) if we decide ω1 (5) Thus, we have that P (error) = ˆ ∞ −∞ P (error, x) dx = ˆ ∞ −∞ P (error|x) p (x) dx (6) 14 / 71
  • 27. Error in this rule We have that P (error|x) = P (ω1|x) if we decide ω2 P (ω2|x) if we decide ω1 (5) Thus, we have that P (error) = ˆ ∞ −∞ P (error, x) dx = ˆ ∞ −∞ P (error|x) p (x) dx (6) 14 / 71
  • 28. Classification Rule Thus, we have the Bayes Classification Rule 1 If P (ω1|x) > P (ω2|x) x is classified to ω1 2 If P (ω1|x) < P (ω2|x) x is classified to ω2 15 / 71
  • 29. Classification Rule Thus, we have the Bayes Classification Rule 1 If P (ω1|x) > P (ω2|x) x is classified to ω1 2 If P (ω1|x) < P (ω2|x) x is classified to ω2 15 / 71
  • 30. What if we remove the normalization factor? Remember P (ω1|x) + P (ω2|x) = 1 (7) We are able to obtain the new Bayes Classification Rule 1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1 2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2 16 / 71
  • 31. What if we remove the normalization factor? Remember P (ω1|x) + P (ω2|x) = 1 (7) We are able to obtain the new Bayes Classification Rule 1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1 2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2 16 / 71
  • 32. What if we remove the normalization factor? Remember P (ω1|x) + P (ω2|x) = 1 (7) We are able to obtain the new Bayes Classification Rule 1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1 2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2 16 / 71
  • 33. We have several cases If for some x we have P (x|ω1) = P (x|ω2) The final decision relies completely from the prior probability. On the Other hand if P (ω1) = P (ω2), the “state” is equally probable In this case the decision is based entirely on the likelihoods P (x|ωi). 17 / 71
  • 34. We have several cases If for some x we have P (x|ω1) = P (x|ω2) The final decision relies completely from the prior probability. On the Other hand if P (ω1) = P (ω2), the “state” is equally probable In this case the decision is based entirely on the likelihoods P (x|ωi). 17 / 71
  • 35. How the Rule looks like If P (ω1) = P (ω2) the Rule depends on the term p (x|ωi) 18 / 71
  • 36. The Error in the Second Case of Naive Bayes Error in equiprobable classes P (error) = 1 2 x0ˆ −∞ p (x|ω2) dx + 1 2 ∞ˆ x0 p (x|ω1) dx (8) Remark: P (ω1) = P (ω2) = 1 2 19 / 71
  • 37. What do we want to prove? Something Notable Bayesian classifier is optimal with respect to minimizing the classification error probability. 20 / 71
  • 38. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 39. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 40. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 41. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 42. Proof It is more Pe = P (ω1) ˆ R2 p (ω1, x) P (ω1) dx + P (ω2) ˆ R1 p (ω2, x) P (ω2) dx (10) Finally Pe = ˆ R2 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (11) Now, we choose the Bayes Classification Rule R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 22 / 71
  • 43. Proof It is more Pe = P (ω1) ˆ R2 p (ω1, x) P (ω1) dx + P (ω2) ˆ R1 p (ω2, x) P (ω2) dx (10) Finally Pe = ˆ R2 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (11) Now, we choose the Bayes Classification Rule R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 22 / 71
  • 44. Proof It is more Pe = P (ω1) ˆ R2 p (ω1, x) P (ω1) dx + P (ω2) ˆ R1 p (ω2, x) P (ω2) dx (10) Finally Pe = ˆ R2 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (11) Now, we choose the Bayes Classification Rule R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 22 / 71
  • 45. Proof Thus P (ω1) = ˆ R1 p (ω1|x) p (x) dx + ˆ R2 p (ω1|x) p (x) dx (12) Now, we have... P (ω1) − ˆ R1 p (ω1|x) p (x) dx = ˆ R2 p (ω1|x) p (x) dx (13) Then Pe = P (ω1) − ˆ R1 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (14) 23 / 71
  • 46. Proof Thus P (ω1) = ˆ R1 p (ω1|x) p (x) dx + ˆ R2 p (ω1|x) p (x) dx (12) Now, we have... P (ω1) − ˆ R1 p (ω1|x) p (x) dx = ˆ R2 p (ω1|x) p (x) dx (13) Then Pe = P (ω1) − ˆ R1 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (14) 23 / 71
  • 47. Proof Thus P (ω1) = ˆ R1 p (ω1|x) p (x) dx + ˆ R2 p (ω1|x) p (x) dx (12) Now, we have... P (ω1) − ˆ R1 p (ω1|x) p (x) dx = ˆ R2 p (ω1|x) p (x) dx (13) Then Pe = P (ω1) − ˆ R1 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (14) 23 / 71
  • 48. Graphically P (ω1): Thanks Edith 2013 Class!!! In Red 24 / 71
  • 49. Thus we have´ R1 p (ω1|x) p (x) dx = ´ R1 p (ω1, x) dx = PR1 (ω1) Thus 25 / 71
  • 50. Finally Finally Pe = P (ω1) − ˆ R1 [p (ω1|x) − p (ω2|x)] p (x) dx (15) Thus, we have Pe =   P (ω1) − ˆ R1 p (ω1|x) p (x) dx    + ˆ R1 p (ω2|x) p (x) dx 26 / 71
  • 51. Finally Finally Pe = P (ω1) − ˆ R1 [p (ω1|x) − p (ω2|x)] p (x) dx (15) Thus, we have Pe =   P (ω1) − ˆ R1 p (ω1|x) p (x) dx    + ˆ R1 p (ω2|x) p (x) dx 26 / 71
  • 52. Pe for a non optimal rule A great idea Edith!!! 27 / 71
  • 53. Which decision function for minimizing the error A single number in this case 28 / 71
  • 54. Error is minimized by the Bayesian Naive Rule Thus The probability of error is minimized at the region of space in which: R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 29 / 71
  • 55. Error is minimized by the Bayesian Naive Rule Thus The probability of error is minimized at the region of space in which: R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 29 / 71
  • 56. Error is minimized by the Bayesian Naive Rule Thus The probability of error is minimized at the region of space in which: R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 29 / 71
  • 57. Pe for an optimal rule A great idea Edith!!! 30 / 71
  • 58. For M classes ω1, ω2, ..., ωM We have that vector x is in ωi P (ωi|x) > P (ωj|x) ∀j = i (16) Something Notable It turns out that such a choice also minimizes the classification error probability. 31 / 71
  • 59. For M classes ω1, ω2, ..., ωM We have that vector x is in ωi P (ωi|x) > P (ωj|x) ∀j = i (16) Something Notable It turns out that such a choice also minimizes the classification error probability. 31 / 71
  • 60. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 32 / 71
  • 61. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 62. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 63. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 64. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 65. It is based on the following idea In order to measure the predictive performance of a function f : X → Y We use a loss function : Y × Y → R+ (17) A non-negative function that quantifies how bad the prediction f (x) is the true label y. Thus, we can say that (f (x) , y) is the loss incurred by f on the pair (x, y). 34 / 71
  • 66. It is based on the following idea In order to measure the predictive performance of a function f : X → Y We use a loss function : Y × Y → R+ (17) A non-negative function that quantifies how bad the prediction f (x) is the true label y. Thus, we can say that (f (x) , y) is the loss incurred by f on the pair (x, y). 34 / 71
  • 67. It is based on the following idea In order to measure the predictive performance of a function f : X → Y We use a loss function : Y × Y → R+ (17) A non-negative function that quantifies how bad the prediction f (x) is the true label y. Thus, we can say that (f (x) , y) is the loss incurred by f on the pair (x, y). 34 / 71
Example
In classification
In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y′ = f(x):
ℓ(y′, y) = 1{y′ ≠ y} (18)
35 / 71
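The 0-1 loss of equation (18) can be sketched in a couple of lines; the helper name below is hypothetical, not from the slides:

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 when the prediction differs from the true label, else 0."""
    return 1 if y_pred != y_true else 0

# A correct prediction costs nothing; any mistake costs 1, regardless of kind.
print(zero_one_loss("spam", "spam"))  # 0
print(zero_one_loss("spam", "ham"))   # 1
```

Note that every kind of mistake costs the same, which is exactly the limitation the risk-weighting slides below address.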
Furthermore
For regression problems, some natural choices are:
1 Squared loss: ℓ(y′, y) = (y′ − y)²
2 Absolute loss: ℓ(y′, y) = |y′ − y|
Thus, given the loss function, we can define the risk as
R(f) = E_(X,Y) [ℓ(f(X), Y)] (19)
Although we cannot see the expected risk, we can use the sample to estimate it:
R̂(f) = (1/N) Σ_{i=1}^{N} ℓ(f(x_i), y_i) (20)
36 / 71
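Equation (20) is just an average of per-sample losses. A minimal sketch, with hypothetical data and an identity predictor chosen purely for illustration:

```python
def empirical_risk(f, samples, loss):
    """R_hat(f) = (1/N) * sum of loss(f(x_i), y_i) over the sample (eq. 20)."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

# The two regression losses from the slide.
squared = lambda yp, y: (yp - y) ** 2
absolute = lambda yp, y: abs(yp - y)

data = [(1.0, 1.5), (2.0, 1.8), (3.0, 3.4)]  # (x_i, y_i) pairs
f = lambda x: x  # identity predictor, just for the example

print(empirical_risk(f, data, squared))   # ≈ 0.15
print(empirical_risk(f, data, absolute))  # ≈ 0.37
```

Swapping the loss changes which predictor minimizes R̂, which is why the choice of ℓ matters before any learning happens.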
Thus
Risk Minimization
Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM):
f̂_N = arg min_{f ∈ F} R̂(f) (21)
If we knew the distribution P and did not restrict ourselves to F, the best function would be
f* = arg min_f R(f) (22)
37 / 71
Thus
For classification with the 0-1 loss, f* is called the Bayesian Classifier:
f*(x) = arg max_{y ∈ Y} P(Y = y|X = x) (23)
38 / 71
The New Risk Function
Now, we have the following loss values for the two classes
λ12 is the loss incurred if the sample is in class 1, but the function classifies it as class 2.
λ21 is the loss incurred if the sample is in class 2, but the function classifies it as class 1.
We can generate the new risk function
r = λ12 Pe(x, ω1) + λ21 Pe(x, ω2)
  = λ12 ∫_{R2} p(x, ω1) dx + λ21 ∫_{R1} p(x, ω2) dx
  = λ12 ∫_{R2} p(x|ω1) P(ω1) dx + λ21 ∫_{R1} p(x|ω2) P(ω2) dx
39 / 71
The New Risk Function
The new risk function to be minimized
r = λ12 P(ω1) ∫_{R2} p(x|ω1) dx + λ21 P(ω2) ∫_{R1} p(x|ω2) dx (24)
The λ terms work as weight factors for each error
Thus, we could have something like λ12 > λ21.
Then
Errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation.
40 / 71
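To see equation (24) in numbers, here is a sketch for 1-D Gaussian class conditionals, where the two error integrals become Gaussian tail probabilities. All names, thresholds, and loss values below are hypothetical choices for illustration:

```python
import math

def norm_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def weighted_risk(t, lam12, lam21, p1, p2, mu1, mu2, sigma=1.0):
    """Equation (24) for R1 = (-inf, t) and R2 = [t, inf):
    r = lam12*P(w1)*int_{R2} p(x|w1)dx + lam21*P(w2)*int_{R1} p(x|w2)dx."""
    err1 = 1.0 - norm_cdf(t, mu1, sigma)  # class-1 mass falling in R2
    err2 = norm_cdf(t, mu2, sigma)        # class-2 mass falling in R1
    return lam12 * p1 * err1 + lam21 * p2 * err2

# Equal priors; misclassifying class 1 is three times as costly.
risks = {t: weighted_risk(t, 3.0, 1.0, 0.5, 0.5, 0.0, 2.0)
         for t in [0.5, 1.0, 1.5]}
print(risks)  # the best threshold shifts into class-2 territory when lam12 > lam21
```

With λ12 = λ21 the optimal threshold sits at the midpoint 1.0; making λ12 larger pushes it toward class 2, exactly the weighting effect described above.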
Now, Consider an M-class problem
We have
Rj, j = 1, ..., M, the regions where the classes ωj live.
Now, think of the following error
Assume x belongs to class ωk, but lies in Ri with i ≠ k −→ the vector is misclassified.
Now, for this error we associate the term λki (loss)
With that we have a loss matrix L whose (k, i) location corresponds to such a loss.
41 / 71
Thus, we have a general loss associated to each class ωk
Definition
r_k = Σ_{i=1}^{M} λki ∫_{Ri} p(x|ωk) dx (25)
We want to minimize the global risk
r = Σ_{k=1}^{M} r_k P(ωk) = Σ_{i=1}^{M} ∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx (26)
For this
We want to select the set of partition regions Rj.
42 / 71
How do we do that?
We minimize each integral
∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx (27)
We need to minimize
Σ_{k=1}^{M} λki p(x|ωk) P(ωk) (28)
We can do the following
If x ∈ Ri, then
l_i = Σ_{k=1}^{M} λki p(x|ωk) P(ωk) < l_j = Σ_{k=1}^{M} λkj p(x|ωk) P(ωk) (29)
for all j ≠ i.
43 / 71
Remarks
When we have Kronecker’s delta
δki = 1 if k = i, 0 if k ≠ i (30)
We can do the following
λki = 1 − δki (31)
We finish with
Σ_{k=1, k≠i}^{M} p(x|ωk) P(ωk) < Σ_{k=1, k≠j}^{M} p(x|ωk) P(ωk) (32)
44 / 71
Then
We have that
p(x|ωj) P(ωj) < p(x|ωi) P(ωi) (33)
for all j ≠ i.
45 / 71
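Equation (33) says: assign x to the class with the largest p(x|ωi)P(ωi). A minimal sketch of that rule, assuming known 1-D Gaussian class conditionals (the class names and parameters are hypothetical):

```python
import math

def gauss(mu, sigma):
    """Return the pdf of N(mu, sigma^2) as a callable."""
    return lambda x: (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                      / (sigma * math.sqrt(2 * math.pi)))

def classify(x, likelihoods, priors):
    """Assign x to the class maximizing p(x|w_i) * P(w_i) (eq. 33)."""
    scores = {c: likelihoods[c](x) * priors[c] for c in priors}
    return max(scores, key=scores.get)

likelihoods = {"w1": gauss(0.0, 1.0), "w2": gauss(2.0, 1.0)}
priors = {"w1": 0.5, "w2": 0.5}
print(classify(0.3, likelihoods, priors))  # w1
print(classify(1.7, likelihoods, priors))  # w2
```

With equal priors and equal variances, the boundary sits halfway between the means; unequal priors would shift it toward the less probable class.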
Special Case: The Two-Class Case
We get
l1 = λ11 p(x|ω1) P(ω1) + λ21 p(x|ω2) P(ω2)
l2 = λ12 p(x|ω1) P(ω1) + λ22 p(x|ω2) P(ω2)
We assign x to ω1 if l1 < l2, that is
(λ21 − λ22) p(x|ω2) P(ω2) < (λ12 − λ11) p(x|ω1) P(ω1) (34)
If we assume that λii < λij (correct decisions are penalized much less than wrong ones)
x ∈ ω1 (ω2) if l12 ≡ p(x|ω1)/p(x|ω2) > (<) [P(ω2)/P(ω1)] · [(λ21 − λ22)/(λ12 − λ11)] (35)
46 / 71
Special Case: The Two-Class Case
Definition
l12 is known as the likelihood ratio, and the preceding test as the likelihood ratio test.
47 / 71
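The likelihood ratio test of equation (35) can be sketched directly. Here the class conditionals are assumed to be N(0, 1) and N(2, 1), and the loss values are hypothetical, chosen so that class-1 errors cost three times as much:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio_test(x, p1, p2, lam11, lam12, lam21, lam22):
    """Decide w1 iff l12 = p(x|w1)/p(x|w2) exceeds the threshold in eq. (35)."""
    l12 = gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 2.0, 1.0)
    threshold = (p2 / p1) * (lam21 - lam22) / (lam12 - lam11)
    return "w1" if l12 > threshold else "w2"

# Zero loss for correct decisions; lam12 = 3, lam21 = 1.
print(likelihood_ratio_test(1.2, 0.5, 0.5, 0.0, 3.0, 1.0, 0.0))  # w1
```

The unweighted boundary would be at x = 1.0; with λ12 = 3 the threshold drops to 1/3, moving the boundary to about x = 1.55, so points such as 1.2 are still handed to ω1.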
Decision Surface
Because R1 and R2 are contiguous
The separating surface between them is described by
P(ω1|x) − P(ω2|x) = 0 (36)
Thus, we define the decision function as
g12(x) = P(ω1|x) − P(ω2|x) = 0 (37)
49 / 71
Which decision function for the Naive Bayes?
A single number in this case.
50 / 71
In general
First
Instead of working with probabilities, we work with an equivalent function of them, gi(x) = f(P(ωi|x)).
Classic Example
The monotonically increasing f(P(ωi|x)) = ln P(ωi|x).
The decision test is now
Classify x in ωi if gi(x) > gj(x) ∀j ≠ i.
The decision surfaces, separating contiguous regions, are described by
gij(x) = gi(x) − gj(x) = 0, i, j = 1, 2, ..., M, i ≠ j
51 / 71
Gaussian Distribution
We can use the Gaussian distribution
p(x|ωi) = (1/((2π)^{l/2} |Σi|^{1/2})) exp(−(1/2) (x − µi)^T Σi^{−1} (x − µi)) (38)
Example
53 / 71
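Equation (38) for the common l = 2 case can be evaluated by hand, inverting the 2x2 covariance explicitly; this is a sketch, not a production density routine:

```python
import math

def gaussian_pdf_2d(x, mu, Sigma):
    """p(x|w) for a 2-D Gaussian, eq. (38) with l = 2."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu)
    maha = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))

# At the mean with unit covariance the density is 1/(2*pi) ≈ 0.1592.
print(gaussian_pdf_2d([0, 0], [0, 0], [[1, 0], [0, 1]]))
```

The exponent is exactly the quadratic form that reappears in the discriminant functions and the likelihood derivations below.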
Some Properties
About Σ
It is the covariance matrix between variables.
Thus
It is positive definite.
It is symmetric.
The inverse exists.
54 / 71
Influence of the Covariance Σ
Look at the following covariance
Σ = [[1, 0], [0, 1]]
It is simply the unit Gaussian with mean µ.
56 / 71
The Covariance Σ as a Scaling
Look at the following covariance
Σ = [[4, 0], [0, 1]]
It stretches the circle along the x-axis into an ellipse.
57 / 71
Influence of the Covariance Σ
Look at the following covariance
Σa = R Σb R^T with R = [[cos θ, −sin θ], [sin θ, cos θ]]
It allows us to rotate the axes.
58 / 71
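The rotation Σa = R Σb R^T can be checked numerically with plain 2x2 list arithmetic; this is an illustrative sketch, not library code:

```python
import math

def rotate_cov(Sigma, theta):
    """Sigma_a = R * Sigma_b * R^T for the rotation matrix R(theta)."""
    ct, st = math.cos(theta), math.sin(theta)
    R = [[ct, -st], [st, ct]]
    Rt = [[ct, st], [-st, ct]]  # transpose of R

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    return matmul(matmul(R, Sigma), Rt)

# Rotating the stretched covariance diag(4, 1) by 45 degrees
Sigma_b = [[4.0, 0.0], [0.0, 1.0]]
Sigma_a = rotate_cov(Sigma_b, math.pi / 4)
print(Sigma_a)  # off-diagonal terms appear: the ellipse axes are now tilted
```

Rotating by 45 degrees yields [[2.5, 1.5], [1.5, 2.5]]: the variance splits evenly between the axes and the nonzero covariance encodes the tilt.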
Now For Two Classes
Then, we use the following trick for the two classes i = 1, 2
We know that the pdf of correct classification is p(x, ωi) = p(x|ωi) P(ωi).
Thus
It is possible to generate the following decision function:
gi(x) = ln [p(x|ωi) P(ωi)] = ln p(x|ωi) + ln P(ωi) (39)
Thus
gi(x) = −(1/2) (x − µi)^T Σi^{−1} (x − µi) + ln P(ωi) + ci (40)
59 / 71
Given a series of classes ω1, ω2, ..., ωM
We assume for each class ωj
The samples are drawn independently according to the probability law p(x|ωj).
We call those samples i.i.d.: independent, identically distributed random variables.
We assume in addition
p(x|ωj) has a known parametric form with a vector θj of parameters.
61 / 71
Given a series of classes ω1, ω2, ..., ωM
For example
p(x|ωj) ∼ N(µj, Σj) (41)
In our case
We will assume that there is no dependence between classes.
62 / 71
Now
Suppose that ωj contains n samples x1, x2, ..., xn
p(x1, x2, ..., xn|θj) = Π_{k=1}^{n} p(xk|θj) (42)
We can then see p(x1, x2, ..., xn|θj) as a function of θj, the likelihood
L(θj) = Π_{k=1}^{n} p(xk|θj) (43)
63 / 71
Example
L(θj) = log Π_{k=1}^{n} p(xk|θj) = Σ_{k=1}^{n} log p(xk|θj)
64 / 71
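The log-likelihood above can be sketched for a 1-D Gaussian, where θ = (µ, σ); the data values are hypothetical:

```python
import math

def log_likelihood(samples, mu, sigma):
    """L(theta) = sum_k log p(x_k | theta) for a 1-D Gaussian N(mu, sigma^2)."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi))
               for x in samples)

data = [1.2, 0.8, 1.1, 0.9]
# The likelihood is largest for parameters near the sample mean (1.0 here).
print(log_likelihood(data, 1.0, 1.0), log_likelihood(data, 3.0, 1.0))
```

Maximum likelihood, developed next, picks the θ where this sum peaks; for the Gaussian that maximizer has the closed form derived on the following slides.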
Maximum Likelihood on a Gaussian
Then, using the log
ln L(θi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c2 (44)
We know that
d(x^T A x)/dx = Ax + A^T x, d(Ax)/dx = A (45)
Thus, we expand equation (44)
−(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} [xj^T Σi^{−1} xj − 2 xj^T Σi^{−1} µi + µi^T Σi^{−1} µi] + c2 (46)
66 / 71
Maximum Likelihood
Then
∂ ln L(θi)/∂µi = Σ_{j=1}^{n} Σi^{−1} (xj − µi) = 0
n Σi^{−1} (−µi + (1/n) Σ_{j=1}^{n} xj) = 0
µ̂i = (1/n) Σ_{j=1}^{n} xj
67 / 71
Maximum Likelihood
Then, we derive with respect to Σi
For this we use the following tricks:
1 ∂ log |Σ|/∂Σ^{−1} = −Σ (since log |Σ| = −log |Σ^{−1}| and ∂ log |A|/∂A = (A^{−1})^T)
2 ∂Tr[AB]/∂A = ∂Tr[BA]/∂A = B^T
3 The trace of a number is the number itself.
4 Tr(A^T B) = Tr(B A^T)
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c1 (47)
68 / 71
Maximum Likelihood
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[(xj − µi)^T Σi^{−1} (xj − µi)] + c1 (48)
Using the tricks
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[Σi^{−1} (xj − µi)(xj − µi)^T] + c1 (49)
69 / 71
Maximum Likelihood
Derivative with respect to Σi^{−1}
∂f(Σi)/∂Σi^{−1} = (n/2) Σi − (1/2) Σ_{j=1}^{n} (xj − µi)(xj − µi)^T (50)
Thus, when making it equal to zero
Σ̂i = (1/n) Σ_{j=1}^{n} (xj − µ̂i)(xj − µ̂i)^T (51)
70 / 71
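The closed-form estimators µ̂ and Σ̂ from the derivation can be sketched with plain lists (the data points are hypothetical). Note that, as in equation (51), this is the biased MLE dividing by n, not the n − 1 of the unbiased sample covariance:

```python
def ml_estimates(X):
    """Closed-form ML estimates for a Gaussian:
    mu_hat = sample mean;
    Sigma_hat = (1/n) * sum (x_j - mu_hat)(x_j - mu_hat)^T  (eq. 51)."""
    n, d = len(X), len(X[0])
    mu = [sum(x[k] for x in X) / n for k in range(d)]
    Sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in X) / n
              for b in range(d)] for a in range(d)]
    return mu, Sigma

# Four points at the corners of a square centered at (1, 1).
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu_hat, Sigma_hat = ml_estimates(X)
print(mu_hat)     # [1.0, 1.0]
print(Sigma_hat)  # [[1.0, 0.0], [0.0, 1.0]]
```

Plugging µ̂i and Σ̂i back into the discriminant gi(x) of equation (40) is exactly how the Gaussian classifier is trained from labeled samples.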
Exercises
Duda and Hart, Chapter 3
3.1, 3.2, 3.3, 3.13
Theodoridis, Chapter 2
2.5, 2.7, 2.10, 2.12, 2.14, 2.17
71 / 71