Naive Bayes Classifier
Bayesian Methods
 Our focus this lecture
 Learning and classification methods based on
probability theory.
 Bayes theorem plays a critical role in probabilistic
learning and classification.
 Uses prior probability of each category given no
information about an item.
 Categorization produces a posterior probability
distribution over the possible categories given a
description of an item.
Basic Probability Formulas
 Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
 Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
 Bayes theorem: P(h|D) = P(D|h) P(h) / P(D)
 Theorem of total probability: if the events A1, …, An are
mutually exclusive and their probabilities sum to 1, then
P(B) = Σ_{i=1..n} P(B|Ai) P(Ai)
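A quick numeric check of these rules (a sketch with made-up joint probabilities, not from the slides):

p_joint = {("a", "b"): 0.12, ("a", "not_b"): 0.18,
           ("not_a", "b"): 0.28, ("not_a", "not_b"): 0.42}   # toy joint distribution
p_a = sum(v for (a, _), v in p_joint.items() if a == "a")     # P(A) = 0.30
p_b = sum(v for (_, b), v in p_joint.items() if b == "b")     # P(B) = 0.40
p_a_and_b = p_joint[("a", "b")]                               # P(A ∧ B) = 0.12
p_a_given_b = p_a_and_b / p_b                                 # product rule: P(A|B) = P(A ∧ B) / P(B)
p_b_given_a = p_a_and_b / p_a
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12     # Bayes theorem holds numerically
p_a_or_b = p_a + p_b - p_a_and_b                              # sum rule: 0.58
print(p_a_given_b, p_a_or_b)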
Bayes Theorem
 Given a hypothesis h and data D which bears on the
hypothesis:
 P(h): prior probability of h, independent of D
 P(D): probability of D, independent of h
 P(D|h): conditional probability of D given h: the likelihood
 P(h|D): conditional probability of h given D: the posterior probability
P(h|D) = P(D|h) P(h) / P(D)
Does patient have cancer or not?
 A patient takes a lab test and the result comes back positive.
It is known that the test returns a correct positive result in only
99% of the cases in which the disease is actually present, and a
correct negative result in only 95% of the cases in which it is not.
Furthermore, only 0.03 of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?
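A worked sketch (not on the original slide), reading the figures as P(+|cancer) = 0.99, P(−|¬cancer) = 0.95 and P(cancer) = 0.03:
P(cancer|+) = P(+|cancer) P(cancer) / [P(+|cancer) P(cancer) + P(+|¬cancer) P(¬cancer)]
            = 0.99 × 0.03 / (0.99 × 0.03 + 0.05 × 0.97)
            = 0.0297 / 0.0782 ≈ 0.38
P(¬cancer|+) ≈ 0.62, so even with the positive test the MAP diagnosis is "no cancer".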
Maximum A Posteriori
 Based on Bayes Theorem, we can compute the
Maximum A Posteriori (MAP) hypothesis for the data
 We are interested in the best hypothesis for some
space H given observed training data D.
h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)
H: the set of all hypotheses.
Note that we can drop P(D) as the probability of the data is constant
(and independent of the hypothesis).
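A minimal sketch of this argmax (hypothesis names and probability values are assumed, not from the slides):

priors = {"h1": 0.7, "h2": 0.3}           # P(h), assumed prior probabilities
likelihoods = {"h1": 0.2, "h2": 0.9}      # P(D|h), assumed likelihoods

h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])   # argmax P(D|h) P(h)
h_ml = max(priors, key=lambda h: likelihoods[h])                # uniform prior: P(h) drops out (ML, next slide)
print(h_map, h_ml)                                              # -> h2 h2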
Maximum Likelihood
 Now assume that all hypotheses are equally
probable a priori, i.e., P(hi) = P(hj) for all hi, hj ∈ H.
 This is called assuming a uniform prior. It
simplifies computing the posterior:
 This hypothesis is called the maximum
likelihood hypothesis.
h_ML = argmax_{h ∈ H} P(D|h)
Desirable Properties of Bayes Classifier
 Incrementality: with each training example,
the prior and the likelihood can be updated
dynamically: flexible and robust to errors.
 Combines prior knowledge and observed
data: the prior probability of a hypothesis is
multiplied by the likelihood of the training
data given that hypothesis
 Probabilistic hypothesis: outputs not only a
classification, but a probability distribution
over all classes
Bayes Classifiers
Assumption: the training set consists of instances of different classes
cj, each described as a conjunction of attribute values
Task: classify a new instance d, given as a tuple of attribute values,
into one of the classes cj ∈ C
Key idea: assign the most probable class using Bayes
Theorem.
c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)
Parameter estimation
 P(cj)
 Can be estimated from the frequency of classes in the
training examples.
 P(x1,x2,…,xn|cj)
 O(|X|^n · |C|) parameters
 Could only be estimated if a very, very large number of
training examples was available.
 Independence Assumption: attribute values are
conditionally independent given the target value: naïve
Bayes.
P(x1, x2, …, xn | cj) = Π_i P(xi | cj)
c_NB = argmax_{cj ∈ C} P(cj) Π_i P(xi | cj)
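A minimal sketch of these estimation and classification steps (the toy dataset and names are illustrative, not from the slides):

from collections import Counter, defaultdict

train = [  # (attribute values, class label) -- toy training set
    (("sunny", "hot"), "no"),
    (("sunny", "cool"), "yes"),
    (("rain", "cool"), "yes"),
    (("rain", "hot"), "no"),
]

class_counts = Counter(label for _, label in train)
p_c = {c: n / len(train) for c, n in class_counts.items()}      # P(cj) from class frequencies

attr_counts = defaultdict(Counter)                               # counts per (class, attribute position)
for values, label in train:
    for i, v in enumerate(values):
        attr_counts[(label, i)][v] += 1

def p_x_given_c(i, v, c):                                        # P(xi | cj) from frequencies
    return attr_counts[(c, i)][v] / class_counts[c]

def classify(values):                                            # cNB = argmax P(cj) * prod_i P(xi | cj)
    scores = {c: p_c[c] for c in p_c}
    for c in p_c:
        for i, v in enumerate(values):
            scores[c] *= p_x_given_c(i, v, c)
    return max(scores, key=scores.get)

print(classify(("sunny", "cool")))                               # -> "yes" on this toy data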
Properties
 Estimating P(xi|cj) instead of P(x1, x2, …, xn|cj) greatly
reduces the number of parameters (and the data
sparseness).
 The learning step in Naïve Bayes consists of
estimating P(cj) and P(xi|cj) based on the
frequencies in the training data
 An unseen instance is classified by computing the
class that maximizes the posterior
 When conditional independence is satisfied, Naïve
Bayes corresponds to MAP classification.
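As an illustrative count (numbers assumed, not from the slides): with n = 10 attributes of 3 values each and |C| = 2 classes, the full joint P(x1, …, xn|cj) needs on the order of 2 · 3^10 ≈ 118,000 parameters, whereas naïve Bayes needs only 2 · 10 · 3 = 60 estimates of P(xi|cj) plus the 2 class priors.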
Example. ‘Play Tennis’ data
Day Outlook Temperature Humidity Wind PlayTennis
Day1 Sunny Hot High Weak No
Day2 Sunny Hot High Strong No
Day3 Overcast Hot High Weak Yes
Day4 Rain Mild High Weak Yes
Day5 Rain Cool Normal Weak Yes
Day6 Rain Cool Normal Strong No
Day7 Overcast Cool Normal Strong Yes
Day8 Sunny Mild High Weak No
Day9 Sunny Cool Normal Weak Yes
Day10 Rain Mild Normal Weak Yes
Day11 Sunny Mild Normal Strong Yes
Day12 Overcast Mild High Strong Yes
Day13 Overcast Hot Normal Weak Yes
Day14 Rain Mild High Strong No
Question: For the day <sunny, cool, high, strong>, what’s
the play prediction?
Naive Bayes solution
Classify any new datum instance x=(a1,…aT) as:
 To do this, we need to estimate the following
parameters from the training examples:
 For each target value (hypothesis) h
 For each attribute value at of each datum instance
P̂(h): estimate of P(h)
P̂(at|h): estimate of P(at|h)
h_NB = argmax_h P(h) P(x|h) = argmax_h P(h) Π_t P(at|h)
Based on the examples in the table, classify the following datum x:
x=(Outl=Sunny, Temp=Cool, Hum=High, Wind=strong)
 That means: Play tennis or not?
 Working:
h_NB = argmax_{h ∈ [yes, no]} P(h) P(x|h)
     = argmax_{h ∈ [yes, no]} P(h) Π_t P(at|h)
     = argmax_{h ∈ [yes, no]} P(h) P(Outlook=sunny|h) P(Temp=cool|h) P(Humidity=high|h) P(Wind=strong|h)
Estimates from the table:
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
etc.
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
answer: PlayTennis(x) = no
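A short Python check of the arithmetic above (the conditional probabilities are the counts read off the table):

p_yes, p_no = 9/14, 5/14
cond_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}   # P(at|yes) from the table
cond_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}   # P(at|no) from the table

score_yes, score_no = p_yes, p_no
for v in cond_yes.values():
    score_yes *= v
for v in cond_no.values():
    score_no *= v

print(round(score_yes, 4), round(score_no, 4))   # ~0.0053 vs ~0.0206 -> predict "no"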
Underflow Prevention
 Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
 Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
 Class with highest final un-normalized log
probability score is still the most probable.
c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi|cj) ]
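A minimal log-space sketch of the same rule (reusing the Play-Tennis numbers above; the helper name is illustrative):

import math

def log_score(prior, cond_probs):                 # log P(c) + sum of log P(xi|c)
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

score_yes = log_score(9/14, [2/9, 3/9, 3/9, 3/9])
score_no  = log_score(5/14, [3/5, 1/5, 4/5, 3/5])
print(score_yes, score_no)                        # larger (less negative) log score wins -> "no"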