•Machine Learning
•UNIT - 2
Probability & Bayesian Learning
• Bayesian Learning is based on probabilities
• Bayes’ theorem:
• Let X be a data tuple with n attributes. In Bayesian terms, X is considered
“evidence.”
• Let H be some hypothesis such as that the data tuple X belongs to a
specified class C
• For classification problems, we want to determine P(H|X), the
probability that the hypothesis H holds given the “evidence” or observed
data tuple X (probability that tuple X belongs to class C)
• P(H|X) is the posterior probability of H conditioned on X
• Ex: customer data with attributes age and income (X is a 35-year-old customer
with an income of $40,000). Suppose that H is the hypothesis that our
customer will buy a computer. Then P(H|X) reflects the probability that
customer X will buy a computer given that we know the customer’s age and
income
Probability & Bayesian Learning
• Bayes Theorem
• P(H) is the prior probability, or a priori probability, of H.
• For our example, this is the probability that any given customer will buy a
computer, regardless of age, income, or any other information
• Similarly, P(X|H) is the posterior probability of X conditioned on H (the likelihood of observing X when H holds).
• Ex. it is the probability that a customer, X, is 35 years old and earns $40,000,
given that we know the customer will buy a computer.
• P(X) is the prior probability of X
• Ex. it is the probability that a person from our set of customers is 35 years
old and earns $40,000.
• P(H), P(X|H), and P(X) may be estimated from the given training
dataset
Probability & Bayesian Learning
• Bayes’ theorem is useful for calculating the posterior
probability, P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
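As a quick sketch (not from the slides), the theorem can be evaluated directly once the three quantities on the right-hand side have been estimated; the function and the numbers below are illustrative only:

def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Illustrative numbers (not taken from the slides): suppose 50% of customers buy a
# computer, 10% of buyers are 35-year-olds earning $40,000, and 20% of all customers
# match that profile; then P(buys | profile) is:
print(posterior(prior_h=0.5, likelihood_x_given_h=0.1, evidence_x=0.2))  # 0.25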
Bayesian Classification
• How a naive Bayes classifier (or simple Bayesian classifier) works:
• Let D be a training dataset of tuples with their class labels
• Each tuple X has n attributes A1… An with values x1… xn
• Each Tuple is represented as X = (x1… xn)
• Suppose that there are m classes, C1, C2,..., Cm.
• Given a new tuple, X, the classifier will predict that X belongs to the class
having the highest posterior probability, conditioned on X.
• That is, the naive Bayesian classifier predicts that tuple X belongs to the
class Ci if and only if:
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• Thus, we maximize P(Ci|X)
• The class Ci for which P(Ci|X) is maximized is called the maximum
posteriori hypothesis.
Bayesian Classification
• By Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized
• Class prior probabilities P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of
training tuples of class Ci in D.
• Posterior probability of X conditioned on Ci, P(X|Ci), assuming class-conditional independence of the attributes (the “naive” assumption):
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ··· × P(xn|Ci)
• To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci
• The predicted class label is the class Ci for which P(X|Ci).P(Ci) is the
maximum
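A minimal sketch of this decision rule in Python (assuming the class priors and the per-attribute conditional probabilities have already been estimated from D and are passed in as dictionaries; all names here are illustrative):

def predict(x, priors, cond_probs):
    """Naive Bayes rule: return the class Ci maximizing P(Ci) * P(x1|Ci) * ... * P(xn|Ci).

    priors:     {class: P(Ci)}
    cond_probs: {class: {attribute: {value: P(value | Ci)}}}
    x:          {attribute: value} -- the tuple to classify
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attribute, value in x.items():
            score *= cond_probs[c][attribute].get(value, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

The buys computer example worked later in this unit is exactly this computation done by hand.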
Examples
• An insurance company insured 2000 scooter drivers, 4000 car drivers, and 6000
truck drivers. The probability of an accident involving a scooter driver, a car
driver, and a truck driver is 0.01, 0.03, and 0.15 respectively. One of the insured
persons meets with an accident. What is the probability that he is a scooter driver?
• Solution:-
• Let E1, E2, E3, and A be the events defined as
follows:
• E1 = person chosen is a scooter driver
• E2 = person chosen is a car driver
• E3 = person chosen is a truck driver and
• A = person meets with an accident
• Since there are 12000 people, therefore:
• P(E1) = 2000/12000 = ⅙
• P(E2) = 4000/12000 = ⅓
• P(E3) = 6000/12000 = ½
• It is given that P(A / E1) = Probability that a person meets with
an accident given that he is a scooter driver = 0.01
• Similarly, you have P(A / E2) = 0.03 and P(A / E3) = 0.15
• You are required to find P(E1 / A), i.e. given that the person
meets with an accident, what is the probability that he was a
scooter driver?
• P(E1/A) = [P(E1) × P(A/E1)] / [P(E1) × P(A/E1) + P(E2) × P(A/E2) + P(E3) × P(A/E3)]
= (1/6 × 0.01) / [(1/6 × 0.01) + (1/3 × 0.03) + (1/2 × 0.15)]
= 0.001667 / 0.086667 ≈ 1/52
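The same arithmetic can be checked with a few lines of Python (a sketch; the priors and conditional probabilities are taken directly from the problem statement):

priors = {"scooter": 2000 / 12000, "car": 4000 / 12000, "truck": 6000 / 12000}
p_accident_given = {"scooter": 0.01, "car": 0.03, "truck": 0.15}

# Total probability of an accident: P(A) = sum of P(Ei) * P(A|Ei) over the three groups
p_accident = sum(priors[d] * p_accident_given[d] for d in priors)

# Bayes' theorem: P(E1|A) = P(E1) * P(A|E1) / P(A)
print(priors["scooter"] * p_accident_given["scooter"] / p_accident)  # ~0.0192, i.e. 1/52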
Example 2
• Three urns contain 6 red, 4 black; 4 red, 6 black, and 5 red, 5
black balls respectively. One of the urns is selected at random
and a ball is drawn from it. If the ball drawn is red, find the
probability that it is drawn from the first urn.
• Solution:-
• Solution: Let E1, E2, E3, and A be the events defined as follows:
• E1 = urn first is chosen
• E2 = urn second is chosen
• E3 = urn third is chosen
• A = ball drawn is red
• Since there are three urns and one of the three urns is chosen
at random, therefore:
• P(E1) = P(E2) = P(E3) = ⅓
• If E1 has already occurred, then urn first has been chosen,
containing 6 red and 4 black balls.
• The probability of drawing a red ball from it is 6/10.
• So, P(A/E1) = 6/10
• Similarly, you have P(A/E2) = 4/10 and P(A/E3) = 5/10
• You are required to find the P(E1/A) i.e., given that the ball
drawn is red, what is the probability that it is drawn from the
first urn.
• By Bayes theorem, you have
• P(E1/A) = [P(E1) × P(A/E1)] / [P(E1) × P(A/E1) + P(E2) × P(A/E2) + P(E3) × P(A/E3)]
= (1/3 × 6/10) / [(1/3 × 6/10) + (1/3 × 4/10) + (1/3 × 5/10)]
= (6/30) / (15/30) = 2/5
Example 3
• Amy has two bags. Bag I has 7 red and 4 blue balls and bag II has 5 red and 9
blue balls. Amy picks a bag at random, draws a ball from it, and it turns out to
be red. Determine the probability that the ball was drawn from bag I.
• Solution:
• Assume A to be the event of drawing a red ball.
• Let X and Y be the events that the ball is from the
bag I and bag II, respectively.
• We know that the probability of choosing a bag for
drawing a ball is 1/2, that is,
• P(X) = P(Y) = 1/2
• Since there are 7 red balls out of a total of 11 balls in bag I, therefore,
• P(drawing a red ball from bag I) = P(A|X) = 7/11
• Similarly, P(drawing a red ball from bag II) = P(A|Y) = 5/14
• We need to determine P(the ball drawn is from bag I, given that it is a red ball), that is, P(X|A).
• Using Bayes theorem, we have the following:
• P(X|A) = [P(A|X) P(X)] / [P(A|X) P(X) + P(A|Y) P(Y)]
= [(7/11)(1/2)] / [(7/11)(1/2) + (5/14)(1/2)]
≈ 0.64
Naïve Bayes Classification – Dataset (14 training tuples with attributes age, income, student, credit rating and class label buys computer)
Naïve Bayes Classifier
• Data tuples attributes : age, income, student, and credit rating.
• The class label attribute: buys computer ( {yes, no}).
• C1 : class buys computer = yes and C2: buys computer = no.
• The tuple we wish to classify is:
X = (age = youth, income = medium, student = yes, credit rating = fair)
• We need to maximize P(X|Ci)P(Ci), for i = 1, 2.
• P(Ci), the prior probability of each class, can be computed based on
the training tuples:
• P(buys computer = yes) = P (C1) = 9/14 = 0.643
(no. of ‘YES’ tuples/total no. of tuples)
• P(buys computer = no) = P(C2) = 5/14 = 0.357
(no. of ‘NO’ tuples/total no. of tuples)
Naïve Bayes Classifier
• To compute P(X|Ci), for i = 1, 2, we compute the following conditional
probabilities:
• P(age = youth | buys computer = yes) = 2/9 = 0.222
• P(age = youth | buys computer = no) = 3/5 = 0.600
• P(income = medium | buys computer = yes) = 4/9 = 0.444
• P(income = medium | buys computer = no) = 2/5 = 0.400
• P(student = yes | buys computer = yes) = 6/9 = 0.667
• P(student = yes | buys computer = no) = 1/5 = 0.200
• P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
• P(credit rating = fair | buys computer = no) = 2/5 = 0.400
Naïve Bayes Classifier
• Using these probabilities, we obtain:
• P(X|buys computer = yes) =
P(age = youth | buys computer = yes)
× P(income = medium | buys computer = yes)
× P(student = yes | buys computer = yes)
× P(credit rating = fair | buys computer = yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• Similarly, P(X|buys computer = no)
= 0.600 × 0.400 × 0.200 × 0.400 = 0.019.
Naïve Bayes Classifier
• To find the class, Ci , that maximizes P(X|Ci)P(Ci), we compute:
• P(X|buys computer = yes) X P(buys computer = yes)
= 0.044 × 0.643 = 0.028
• P(X|buys computer = no) X P(buys computer = no)
= 0.019 × 0.357 = 0.007
• Therefore, the naive Bayesian classifier predicts buys computer
= yes for tuple X.
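For reference, the same numbers can be reproduced with a short script (all probabilities copied from the slides above):

p_yes, p_no = 9 / 14, 5 / 14  # class priors

# Class-conditional probabilities for X = (youth, medium, student = yes, fair)
likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)

print(likelihood_yes * p_yes)  # ~0.028 -> predict buys computer = yes
print(likelihood_no * p_no)    # ~0.007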
EXAMPLE - II
Sr. No  Outlook  Temperature  Humidity  Windy  Class: Play Golf
1 Rainy Hot High FALSE No
2 Rainy Hot High TRUE No
3 Overcast Hot High FALSE Yes
4 Sunny Mild High FALSE Yes
5 Sunny Cool Normal FALSE Yes
6 Sunny Cool Normal TRUE No
7 Overcast Cool Normal TRUE Yes
8 Rainy Mild High FALSE No
9 Rainy Cool Normal FALSE Yes
10 Sunny Mild Normal FALSE Yes
11 Rainy Mild Normal TRUE Yes
12 Overcast Mild High TRUE Yes
13 Overcast Hot Normal FALSE Yes
14 Sunny Mild High TRUE No
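One way to work Example II end to end is sketched below: the class priors and conditional probabilities are estimated by counting over the table above, and a tuple is then classified with the same argmax rule as before. The query tuple at the bottom is hypothetical (the slide gives only the training table):

from collections import Counter, defaultdict

# Training data from the table above: (Outlook, Temperature, Humidity, Windy, Play Golf)
data = [
    ("Rainy", "Hot", "High", False, "No"),       ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Sunny", "Mild", "High", True, "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Windy"]

class_counts = Counter(row[-1] for row in data)   # {"Yes": 9, "No": 5}
cond_counts = defaultdict(Counter)                # (class, attribute) -> counts of values
for *values, label in data:
    for attr, value in zip(attrs, values):
        cond_counts[(label, attr)][value] += 1

def predict(x):
    """Choose the class Ci maximizing P(Ci) * prod_k P(x_k | Ci), estimated by counting."""
    best_label, best_score = None, -1.0
    for label, n_c in class_counts.items():
        score = n_c / len(data)                   # prior P(Ci)
        for attr in attrs:
            score *= cond_counts[(label, attr)][x[attr]] / n_c
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical query tuple (not specified on the slide):
print(predict({"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Windy": True}))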
Linear Regression
• Regression: a supervised learning problem where we are given example instances
with known x and y values and must learn a function f such that, for a new unseen
value of x, it predicts the corresponding value of y.
f(x) → y (for regression, y is continuous)
• Different types of functions can be used for regression, the simplest being “Linear Regression”.
• X can have multiple features (attributes).
• Simple Regression – where x has only one feature
• We can plot now values of x and y in feature space.
Linear Regression
• Linear regression uses a straight line as the function mapping x to y.
• Given x and y values, we need to find out the best
line that represents the data so that given a new
unknown value of x, y can be predicted.
• In simple words – Given an input x, we have to
compute y.
• Ex. Predict cost of flat (y), given number of rooms (x)
• Predict weight of a person (y), given person’s age (x)
• Predict Salary of a person (y), given work experience (x)
• X – is called the independent variable
• Y – is called the dependent variable
• Simple Regression: One dependent variable, one
independent variable (ϴ0 + ϴ1.x)
• Multiple Regression – One dependent variable, two or
more independent variables (ϴ0 + ϴ1.x1 + ϴ2.x2 + ϴ3.x3 …)
Regression
• Simple Regression (1 feature): Linear or Non-Linear
• Multiple Regression (more than one feature): Linear or Non-Linear
Linear Regression
• Formula for linear regression is given by:
y = a + bx
• In Machine Learning, Hypothesis function for Linear Regression: y = ϴ0 + ϴ1.x
• Here,
• x: is input data (training Data),
• y: is data label
• ϴ0 : intercept (y-intercept)
• ϴ1: coefficient of x (slope of line)
• When we train the model, it fits the best line to predict the value of y for given value of x.
• This is done by finding the best ϴ0 and ϴ1 values.
• Cost function (J): the aim is to predict the value of y such that the error between the predicted value and the true value is minimum.
• So it is very important to update the θ0 and θ1 values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y):
minimize J(θ0, θ1) = (1/2m) Σ (pred_i − y_i)², summed over the m training samples (the cost function)
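A small sketch of this cost function (function and variable names are mine, not from the slides):

def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1/2m) * sum over i of (pred_i - y_i)^2."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Example with the experience/salary data used in the iterations below:
print(cost(0.0, 0.0, xs=[2, 6, 5, 7], ys=[3, 10, 4, 3]))  # 134/8 = 16.75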
Linear Regression
• Gradient Descent:
To update the θ0 and θ1 values in order to reduce the cost function (minimize the error) and achieve the best-fit line, the model uses gradient descent.
• The idea is to start with random θ0 and θ1 values and then iteratively update them until the cost (error) reaches a minimum.
• How θ0 and θ1 get updated (the gradient-descent update rule):
θj := θj − α · (1/m) · Σ (hθ(x_i) − y_i) · x_j^(i), summed over the m training samples (with x_0^(i) = 1 for the intercept term)
• θj : weights of the hypothesis
• hθ(x_i) : predicted y value for the i-th input
• j : feature index number (can be 0, 1, 2, ..., n)
• α : learning rate of gradient descent
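A sketch of a single simultaneous update of θ0 and θ1 following this rule (again, names are illustrative):

def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient-descent step; x_0 is implicitly 1 for the theta0 (intercept) term."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]   # h(x_i) - y_i
    grad0 = sum(errors) / m                                      # partial derivative w.r.t. theta0
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m           # partial derivative w.r.t. theta1
    return theta0 - alpha * grad0, theta1 - alpha * grad1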
Linear Regression
Consider the dataset below:
Sample No (m)   Experience (X)   Salary (y) – in lakhs
1               2                3
2               6                10
3               5                4
4               7                3
Iteration 1: to start, θ0 and θ1 values are randomly chosen. Let us suppose θ0 = 0 and θ1 = 0.
Linear Regression: y = ϴ0 + ϴ1.x, i.e. [y1 y2 y3 y4] = [θ0 θ1] multiplied by the 2×4 matrix whose first row is all 1s and whose second row holds the X values (2, 6, 5, 7):
y1 = θ0·1 + θ1·2 = 0
y2 = θ0·1 + θ1·6 = 0
y3 = θ0·1 + θ1·5 = 0
y4 = θ0·1 + θ1·7 = 0
Cost function error: J = (1/8)[(0 − 3)² + (0 − 10)² + (0 − 4)² + (0 − 3)²] = 134/8 = 16.75
Linear Regression
• Gradient Descent (update θ0 value), here j = 0:
θ0 = 0 − α · (1/4) · [(0 − 3) + (0 − 10) + (0 − 4) + (0 − 3)] = 0 − 0.001 × (1/4) × (−20) = 0.005
• Gradient Descent (update θ1 value), here j = 1:
θ1 = 0 − α · (1/4) · [(0 − 3)·2 + (0 − 10)·6 + (0 − 4)·5 + (0 − 3)·7] = 0 − 0.001 × (1/4) × (−107) = 0.02675
(taking the learning rate α = 0.001, which is consistent with the arithmetic in the later iterations)
Iteration 2 – θ0 = 0.005 and θ1 = 0.02675 (≈ 0.026)
Linear Regression
Predictions with θ0 = 0.005 and θ1 = 0.026: y1 = 0.005 + 0.026×2 = 0.057, y2 = 0.005 + 0.026×6 = 0.161, y3 = 0.005 + 0.026×5 = 0.135, y4 = 0.005 + 0.026×7 = 0.187
Cost function error: J = (1/8)[(0.057 − 3)² + (0.161 − 10)² + (0.135 − 4)² + (0.187 − 3)²] = (1/8)(8.66 + 96.80 + 14.93 + 7.91) ≈ 16.04
Linear Regression
• Gradient Descent (update θ0 value), here j = 0:
θ0 = 0.005 − α · (1/4) · [(0.057 − 3) + (0.161 − 10) + (0.135 − 4) + (0.187 − 3)]
= 0.005 − 0.001 × (1/4) × (−2.943 − 9.839 − 3.865 − 2.813)
= 0.005 − 0.001 × (1/4) × (−19.46)
= 0.005 + 0.0048 = 0.0098
• Gradient Descent (update θ1 value), here j = 1:
θ1 = 0.026 − α · (1/4) · [(0.057 − 3)·2 + (0.161 − 10)·6 + (0.135 − 4)·5 + (0.187 − 3)·7]
= 0.026 − 0.001 × (1/4) × (−5.886 − 59.034 − 19.325 − 19.691)
= 0.026 − 0.001 × (1/4) × (−103.936)
= 0.026 + 0.0259 = 0.0519
Linear Regression
Iteration 3: θ0 = 0.098 and θ1 = 0.051
Predictions y = 0.098 × 1 + 0.051 × x, giving 0.200, 0.353, 0.404, and 0.455 for x = 2, 5, 6, 7
Cost function error: J = (1/8)(7.84 + 93.06 + 12.93 + 6.47) ≈ 15.04
Linear Regression
• Gradient Descent (update θ0 value), here j = 0:
θ0 = 0.098 − α · (1/4) · [(0.2 − 3) + (0.353 − 10) + (0.404 − 4) + (0.455 − 3)]
= 0.098 − 0.001 × (1/4) × (−2.8 − 9.647 − 3.596 − 2.545)
= 0.098 − 0.001 × (1/4) × (−18.588)
= 0.098 + 0.0046 ≈ 0.102
• Gradient Descent (update θ1 value), here j = 1:
θ1 = 0.051 − α · (1/4) · [(0.2 − 3)·2 + (0.353 − 10)·6 + (0.404 − 4)·5 + (0.455 − 3)·7]
= 0.051 − 0.001 × (1/4) × (−5.6 − 57.882 − 17.98 − 17.815)
= 0.051 − 0.001 × (1/4) × (−99.277)
= 0.051 + 0.0248 ≈ 0.075
θ0 = 0.102 and θ1 = 0.075
• The iterations are continued until the error reduces to a minimum and we get a good fit (i.e., the final values of θ0 and θ1).
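Putting the pieces together, the whole walkthrough can be reproduced with a short loop. The learning rate α = 0.001 is an assumption on my part, chosen because it is consistent with the arithmetic of the iterations above:

xs, ys = [2, 6, 5, 7], [3, 10, 4, 3]      # experience (X) and salary (y) from the dataset above
theta0, theta1, alpha = 0.0, 0.0, 0.001   # start from theta = 0, assumed learning rate

for it in range(1, 1001):
    preds = [theta0 + theta1 * x for x in xs]
    cost = sum((p - y) ** 2 for p, y in zip(preds, ys)) / (2 * len(xs))
    errors = [p - y for p, y in zip(preds, ys)]
    theta0 -= alpha * sum(errors) / len(xs)
    theta1 -= alpha * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    if it <= 3:
        print(it, round(cost, 2), round(theta0, 4), round(theta1, 4))

The first few printed lines track the hand-computed iterations above; the values differ slightly because the slides round the intermediate θ values.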