2. Probability & Bayesian Learning
• Bayesian Learning is based on probabilities
• Bayes' Theorem:
• Let X be a data tuple with n attributes. In Bayesian terms, X is considered
“evidence.”
• Let H be some hypothesis such as that the data tuple X belongs to a
specified class C
• For classification problems, we want to determine P(H|X), the
probability that the hypothesis H holds given the “evidence” or observed
data tuple X (probability that tuple X belongs to class C)
• P(H|X) is the posterior probability of H conditioned on X
• Ex.: customer data with attributes age and income (X is a 35-year-old customer
with an income of $40,000). Suppose that H is the hypothesis that our
customer will buy a computer. Then P(H|X) reflects the probability that
customer X will buy a computer given that we know the customer’s age and
income
3. Probability & Bayesian Learning
• Bayes Theorem
• P(H) is the prior probability, or a priori probability, of H.
• For our example, this is the probability that any given customer will buy a
computer, regardless of age, income, or any other information
• Similarly, P(X|H) is the posterior probability of X conditioned on H.
• Ex. it is the probability that a customer, X, is 35 years old and earns $40,000,
given that we know the customer will buy a computer.
• P(X) is the prior probability of X
• Ex. it is the probability that a person from our set of customers is 35 years
old and earns $40,000.
• P(H), P(X|H), and P(X) may be estimated from the given training
dataset
4. Probability & Bayesian Learning
• Bayes’ theorem is useful for calculating the posterior
probability, P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
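• A minimal Python sketch of this formula (the probability values below are assumed, illustrative numbers, not taken from the customer example):
```python
def posterior(p_h, p_x_given_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Illustrative (assumed) values for the prior P(H), likelihood P(X|H), and evidence P(X)
print(posterior(p_h=0.3, p_x_given_h=0.2, p_x=0.15))  # -> 0.4
```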
5. Bayesian Classification
• How a naive Bayes classifier (or simple Bayesian classifier) works:
• Let D be a training dataset of tuples and their associated class labels
• Each tuple X has n attributes A1… An with values x1… xn
• Each tuple is represented as X = (x1, …, xn)
• Suppose that there are m classes, C1, C2,..., Cm.
• Given a new tuple, X, the classifier will predict that X belongs to the class
having the highest posterior probability, conditioned on X.
• That is, the naive Bayesian classifier predicts that tuple X belongs to the
class Ci if and only if:
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• Thus, we maximize P(Ci|X)
• The class Ci for which P(Ci|X) is maximized is called the maximum a
posteriori hypothesis.
6. Bayesian Classification
• By Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
• Class prior probabilities P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of
training tuples of class Ci in D.
• Posterior probability of X conditioned on Ci, P(X|Ci), assuming class-conditional independence of the attributes:
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ··· × P(xn|Ci)
• To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci
• The predicted class label is the class Ci for which P(X|Ci)P(Ci) is the
maximum
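• A minimal sketch of this decision rule in Python, assuming the priors P(Ci) and conditional probabilities P(xk|Ci) have already been estimated from the training data (the numbers below are illustrative, not from the slides):
```python
# Naive Bayes decision rule: predict the class Ci that maximizes P(Ci) * prod_k P(xk|Ci).
priors = {"yes": 0.64, "no": 0.36}                     # assumed P(Ci)
cond = {                                               # assumed P(xk|Ci)
    "yes": {"age=youth": 0.22, "income=medium": 0.44},
    "no":  {"age=youth": 0.60, "income=medium": 0.40},
}

def predict(x_attrs):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for a in x_attrs:
            score *= cond[c][a]      # class-conditional independence assumption
        scores[c] = score            # proportional to the posterior P(Ci|X)
    return max(scores, key=scores.get)

print(predict(["age=youth", "income=medium"]))  # prints the class with the highest score
```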
7. Examples
• An insurance company insured 2000 scooter drivers, 4000
car drivers, and 6000 truck drivers. The probability of an
accident involving a scooter driver, car driver, and a truck is
0.01, 0.03, and 0.15 respectively. One of the insured
persons meets with an accident. What is the probability that
he is a scooter driver?
• Solution:-
• Let E1, E2, E3, and A be the events defined as
follows:
• E1 = person chosen is a scooter driver
• E2 = person chosen is a car driver
• E3 = person chosen is a truck driver and
• A = person meets with an accident
8. • Since there are 12,000 insured persons in total:
• P(E1) = 2000/12000 = ⅙
• P(E2) = 4000/12000 = ⅓
• P(E3) = 6000/12000 = ½
• It is given that P(A|E1) = probability that a person meets with
an accident given that he is a scooter driver = 0.01
• Similarly, P(A|E2) = 0.03 and P(A|E3) = 0.15
• You are required to find P(E1|A), i.e., given that the person
meets with an accident, what is the probability that he was a
scooter driver?
• By Bayes' theorem:
P(E1|A) = P(E1) P(A|E1) / [P(E1) P(A|E1) + P(E2) P(A|E2) + P(E3) P(A|E3)]
= (1/6 × 0.01) / [(1/6 × 0.01) + (1/3 × 0.03) + (1/2 × 0.15)]
= 1/52
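• The same calculation can be checked with a few lines of Python (a quick sketch of the total-probability denominator and the posterior):
```python
# Check Example 1: P(scooter driver | accident)
priors = [1/6, 1/3, 1/2]           # P(E1), P(E2), P(E3)
likelihoods = [0.01, 0.03, 0.15]   # P(A|E1), P(A|E2), P(A|E3)

p_accident = sum(p * l for p, l in zip(priors, likelihoods))   # total probability P(A)
print(priors[0] * likelihoods[0] / p_accident)                 # -> 0.01923... = 1/52
```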
9. Example 2
• Three urns contain 6 red, 4 black; 4 red, 6 black, and 5 red, 5
black balls respectively. One of the urns is selected at random
and a ball is drawn from it. If the ball drawn is red, find the
probability that it is drawn from the first urn.
• Solution: Let E1, E2, E3, and A be the events defined as follows:
• E1 = the first urn is chosen
• E2 = the second urn is chosen
• E3 = the third urn is chosen, and
• A = ball drawn is red
• Since there are three urns and one of the three urns is chosen
at random, therefore:
• P(E1) = P(E2) = P(E3) = ⅓
10. • If E1 has already occurred, then the first urn has been chosen,
containing 6 red and 4 black balls.
• The probability of drawing a red ball from it is 6/10.
• So, P(A|E1) = 6/10
• Similarly, you have P(A|E2) = 4/10 and P(A|E3) = 5/10
• You are required to find P(E1|A), i.e., given that the ball
drawn is red, what is the probability that it is drawn from the
first urn?
• By Bayes' theorem, you have
P(E1|A) = P(E1) P(A|E1) / [P(E1) P(A|E1) + P(E2) P(A|E2) + P(E3) P(A|E3)]
= (1/3 × 6/10) / [(1/3 × 6/10) + (1/3 × 4/10) + (1/3 × 5/10)]
= 6/15 = 2/5
11. Example 3
• Amy has two bags. Bag I has 7 red and 4 blue balls
and bag II has 5 red and 9 blue balls. Amy draws a
ball at random from one of the bags and it turns out to be red. Determine
the probability that the ball was drawn from bag I.
• Solution:
• Assume A to be the event of drawing a red ball.
• Let X and Y be the events that the ball is drawn from
bag I and bag II, respectively.
• We know that the probability of choosing a bag for
drawing a ball is 1/2, that is,
• P(X) = P(Y) = 1/2
12. • Since there are 7 red balls out of a total of 11 balls in bag I,
therefore,
• P(drawing a red ball from bag I) = P(A|X) = 7/11
• Similarly, P(drawing a red ball from bag II) = P(A|Y) = 5/14
• We need to determine the value of P(the ball drawn is from
bag I given that it is a red ball), that is, P(X|A).
• Using Bayes theorem, we have the following:
• P(X|A) = P(A|X) P(X) / [P(A|X) P(X) + P(A|Y) P(Y)]
= [(7/11)(1/2)] / [(7/11)(1/2) + (5/14)(1/2)]
≈ 0.64
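• A quick Python check of this result:
```python
# Check Example 3: P(bag I | red ball)
p_x = p_y = 1/2                 # P(X), P(Y)
p_a_x, p_a_y = 7/11, 5/14       # P(A|X), P(A|Y)

p_x_given_a = (p_a_x * p_x) / (p_a_x * p_x + p_a_y * p_y)
print(round(p_x_given_a, 2))    # -> 0.64
```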
14. Naïve Bayes Classifier
• Data tuple attributes: age, income, student, and credit rating.
• The class label attribute: buys computer ({yes, no}).
• C1: class buys computer = yes and C2: class buys computer = no.
• The tuple we wish to classify is:
X = (age = youth, income = medium, student = yes, credit rating = fair)
• We need to maximize P(X|Ci)P(Ci), for i = 1, 2.
• P(Ci), the prior probability of each class, can be computed based on
the training tuples:
• P(buys computer = yes) = P (C1) = 9/14 = 0.643
(no. of ‘YES’ tuples/total no. of tuples)
• P(buys computer = no) = P(C2) = 5/14 = 0.357
(no. of ‘NO’ tuples/total no. of tuples)
19.
Sr. No  Outlook   Temperature  Humidity  Windy  Class: Play Golf
1       Rainy     Hot          High      FALSE  No
2       Rainy     Hot          High      TRUE   No
3       Overcast  Hot          High      FALSE  Yes
4       Sunny     Mild         High      FALSE  Yes
5       Sunny     Cool         Normal    FALSE  Yes
6       Sunny     Cool         Normal    TRUE   No
7       Overcast  Cool         Normal    TRUE   Yes
8       Rainy     Mild         High      FALSE  No
9       Rainy     Cool         Normal    FALSE  Yes
10      Sunny     Mild         Normal    FALSE  Yes
11      Rainy     Mild         Normal    TRUE   Yes
12      Overcast  Mild         High      TRUE   Yes
13      Overcast  Hot          Normal    FALSE  Yes
14      Sunny     Mild         High      TRUE   No
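• To make the counting concrete, the sketch below estimates the priors P(Ci) and conditional probabilities P(xk|Ci) directly from the 14 tuples above and classifies one query; the query tuple (Sunny, Cool, High, TRUE) is an assumed example, not one taken from the slides, and no Laplace smoothing is applied.
```python
from collections import Counter, defaultdict

# The 14 training tuples from the table: (Outlook, Temperature, Humidity, Windy, PlayGolf)
data = [
    ("Rainy", "Hot", "High", "FALSE", "No"),       ("Rainy", "Hot", "High", "TRUE", "No"),
    ("Overcast", "Hot", "High", "FALSE", "Yes"),   ("Sunny", "Mild", "High", "FALSE", "Yes"),
    ("Sunny", "Cool", "Normal", "FALSE", "Yes"),   ("Sunny", "Cool", "Normal", "TRUE", "No"),
    ("Overcast", "Cool", "Normal", "TRUE", "Yes"), ("Rainy", "Mild", "High", "FALSE", "No"),
    ("Rainy", "Cool", "Normal", "FALSE", "Yes"),   ("Sunny", "Mild", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "Normal", "TRUE", "Yes"),    ("Overcast", "Mild", "High", "TRUE", "Yes"),
    ("Overcast", "Hot", "Normal", "FALSE", "Yes"), ("Sunny", "Mild", "High", "TRUE", "No"),
]

class_counts = Counter(row[-1] for row in data)      # counts for the priors P(Ci)
attr_counts = defaultdict(Counter)                   # counts for P(xk|Ci)
for row in data:
    c = row[-1]
    for k, v in enumerate(row[:-1]):
        attr_counts[(c, k)][v] += 1

def predict(x):
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                      # prior P(Ci) = |Ci,D| / |D|
        for k, v in enumerate(x):
            score *= attr_counts[(c, k)][v] / n_c    # P(xk|Ci), no smoothing
        scores[c] = score
    return max(scores, key=scores.get), scores

# Assumed query tuple (not from the slides):
print(predict(("Sunny", "Cool", "High", "TRUE")))    # -> ('No', {...}) for this data
```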
21. Linear Regression
• Regression: a supervised learning problem where we are given
example instances with known x and y values and have to
learn a function, f, so that given an unknown value x, it
predicts the value of y:
f(x) → y (for regression, y is continuous)
• Different types of functions can be used for regression, the simplest
being “Linear Regression”.
• x can have multiple features (attributes).
• Simple Regression – where x has only one feature
• We can then plot the values of x and y in feature space.
22. Linear Regression
• Linear regression considers a simple line as the
function for mapping x to y.
• Given x and y values, we need to find out the best
line that represents the data so that given a new
unknown value of x, y can be predicted.
• In simple words – Given an input x, we have to
compute y.
• Ex. Predict cost of flat (y), given number of rooms (x)
• Predict weight of a person (y), given person’s age (x)
• Predict Salary of a person (y), given work experience (x)
• x is called the independent variable
• y is the dependent variable
• Simple Regression: one dependent variable, one
independent variable (ϴ0 + ϴ1.x)
• Multiple Regression: one dependent variable, two or
more independent variables (ϴ0 + ϴ1.x1 + ϴ2.x2 + ϴ3.x3 …)
• Regression taxonomy: Simple Regression (1 feature) – Linear or Non-Linear; Multiple Regression (more than one feature) – Linear or Non-Linear.
23. Linear Regression
• Formula for linear regression is given by:
y = a + bx
• In Machine Learning, Hypothesis function for Linear Regression: y = ϴ0 + ϴ1.x
• Here,
• x: input data (training data)
• y: data labels
• ϴ0 : intercept (y-intercept)
• ϴ1: coefficient of x (slope of line)
• When we train the model, it fits the best line to predict the value of y for a given value of x.
• This is done by finding the best ϴ0 and ϴ1 values.
• Cost function (J): the aim is to predict the value of y such that the error between the predicted
value and the true value is minimum.
• So it is very important to update the θ0 and θ1 values to reach the values that minimize the error between the
predicted y value (pred) and the true y value (y):
minimize J = (1/2m) Σ (predi − yi)²   (the cost function, summed over the m training samples)
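• A small Python sketch of this cost function, using the squared-error form above with the 1/2m factor; the four (experience, salary) samples are the ones used in the worked example later in these slides.
```python
# J(θ0, θ1) = (1/2m) * Σ (pred_i - y_i)^2, where pred_i = θ0 + θ1 * x_i
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [2, 6, 5, 7]     # experience (X), from the later worked example
ys = [3, 10, 4, 3]    # salary (y) in lakhs
print(cost(0.0, 0.0, xs, ys))   # cost when θ0 = θ1 = 0
```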
24. Linear Regression
• Gradient Descent:
To update the θ0 and θ1 values in order to reduce the cost function (minimize the error) and
achieve the best-fit line, the model uses gradient descent.
• The idea is to start with random θ0 and θ1 values and then iteratively update them until we
reach the minimum cost (minimum error).
• How θ0 and θ1 get updated:
θj := θj − α · (1/m) · Σ (hθ(xi) − yi) · xij   (summed over the m training samples, with xi0 = 1)
25. • θj : weights of the hypothesis
• hθ(xi) : predicted y value for the ith input
• j : feature index number (can be 0, 1, 2, ......, n)
• α : learning rate of gradient descent
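• Putting the update rule and these symbols together, a minimal sketch of one batch gradient-descent step (the learning rate α is passed in as a parameter):
```python
# One gradient-descent update for simple linear regression:
#   θ0 := θ0 - α * (1/m) * Σ (hθ(xi) - yi)
#   θ1 := θ1 - α * (1/m) * Σ (hθ(xi) - yi) * xi
def gd_step(theta0, theta1, xs, ys, alpha):
    m = len(xs)
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```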
26. Linear Regression
• Consider the dataset below. Iteration 1: to start, θ0 and θ1 values are randomly
chosen. Let us suppose θ0 = 0 and θ1 = 0.
• Linear Regression: y = ϴ0 + ϴ1.x
• In matrix form: [y1 y2 y3 y4] = [θ0 θ1] · [1 1 1 1; 2 6 5 7]
y1 = θ0·1 + θ1·2 = 0
y2 = θ0·1 + θ1·6 = 0
y3 = θ0·1 + θ1·5 = 0
y4 = θ0·1 + θ1·7 = 0
Sample No (m)  Experience (X)  Salary (y) – in lakhs
1              2               3
2              6               10
3              5               4
4              7               3
• The cost function (error) is then computed for this iteration.
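• The matrix form above can be checked with a short NumPy sketch; the design matrix stacks a row of 1s (multiplying θ0) on top of the experience values (multiplying θ1):
```python
import numpy as np

theta = np.array([0.0, 0.0])        # [θ0, θ1] for iteration 1
X = np.array([[1, 1, 1, 1],         # bias row for θ0
              [2, 6, 5, 7]])        # experience values for θ1
y = np.array([3, 10, 4, 3])         # salaries

pred = theta @ X                    # [y1 y2 y3 y4] = [θ0 θ1] · X
print(pred)                                       # all zeros in iteration 1
print(((pred - y) ** 2).sum() / (2 * len(y)))     # cost for this iteration
```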
30. Linear Regression
Iteration 3: θ0 = 0.098 and θ1 = 0.051
• y1 = 0.098 × 1 + 0.051 × 2 = 0.200
• y2 = 0.098 × 1 + 0.051 × 6 = 0.404
• y3 = 0.098 × 1 + 0.051 × 5 = 0.353
• y4 = 0.098 × 1 + 0.051 × 7 = 0.455
• Cost J = (1/2m) Σ (predi − yi)²
= (1/8) × [(0.200 − 3)² + (0.404 − 10)² + (0.353 − 4)² + (0.455 − 3)²]
= (1/8) × (7.84 + 92.08 + 13.30 + 6.48) = 14.96
31. Linear Regression
• Updating θ0 (with learning rate α = 0.001 and m = 4, so α/m = 0.00025):
θ0 := θ0 − (α/m) × Σ (predi − yi)
= 0.098 − 0.00025 × [(0.200 − 3) + (0.404 − 10) + (0.353 − 4) + (0.455 − 3)]
= 0.098 − 0.00025 × (−2.800 − 9.596 − 3.647 − 2.545)
= 0.098 − 0.00025 × (−18.588)
= 0.098 + 0.0046 ≈ 0.103
• Updating θ1:
θ1 := θ1 − (α/m) × Σ (predi − yi) × xi
= 0.051 − 0.00025 × [(0.200 − 3) × 2 + (0.404 − 10) × 6 + (0.353 − 4) × 5 + (0.455 − 3) × 7]
= 0.051 − 0.00025 × (−5.600 − 57.576 − 18.235 − 17.815)
= 0.051 − 0.00025 × (−99.226)
= 0.051 + 0.0248 ≈ 0.076
• After this iteration: θ0 ≈ 0.103 and θ1 ≈ 0.076
The iterations are continued till the error reduces to a minimum and we get a good fit (i.e., the final values of θ0 and θ1).
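• Finally, a compact sketch of the full training loop on this dataset. The learning rate (α = 0.001) and the iteration count are assumptions chosen for illustration, so the per-iteration values will not exactly match the slides, but the loop converges to the best-fit line for this data (θ0 ≈ 2.5, θ1 ≈ 0.5).
```python
# Repeated gradient-descent updates until the parameters settle (assumed α and iteration count).
xs = [2, 6, 5, 7]      # experience (X)
ys = [3, 10, 4, 3]     # salary (y) in lakhs
theta0, theta1, alpha = 0.0, 0.0, 0.001

for _ in range(100_000):
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    theta0 -= alpha * sum(errors) / len(xs)
    theta1 -= alpha * sum(e * x for e, x in zip(errors, xs)) / len(xs)

errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
cost = sum(e * e for e in errors) / (2 * len(xs))
print(theta0, theta1, cost)   # ≈ 2.5, 0.5, and the minimum cost for this data
```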