Bayesian Learning
 Each observed training example can incrementally decrease or
increase the estimated probability that a hypothesis is correct.
 This provides a more flexible approach to learning than
algorithms that completely eliminate a hypothesis if it is found to
be inconsistent with any single example.
 Prior knowledge can be combined with observed data to
determine the final probability of a hypothesis. In Bayesian
learning, prior knowledge is provided by asserting
 A prior probability for each candidate hypothesis, and a
probability distribution over observed data for each possible
hypothesis.
 Bayesian methods can accommodate hypotheses that make
probabilistic predictions
 New instances can be classified by combining the predictions of
multiple hypotheses, weighted by their probabilities.
 Even in cases where Bayesian methods prove computationally
intractable, they can provide a standard of optimal decision
making against which other practical methods can be measured
Bayesian Learning…
 Bayesian methods require initial knowledge of many probabilities. When these
probabilities are not known in advance, they are often estimated based on
background knowledge, previously available data, and assumptions about the
form of the underlying distributions.
 Significant computational cost can be required to determine the Bayes
optimal hypothesis in the general case (linear in the number of candidate
hypotheses). In certain specialized situations, this computational cost can be
significantly reduced.
Bayes Theorem
 In machine learning, we try to determine the best
hypothesis from some hypothesis space H, given the
observed training data D.
 In Bayesian learning, the best hypothesis means the
most probable hypothesis, given the data D plus any
initial knowledge about the prior probabilities of the
various hypotheses in H.
 Bayes theorem provides a way to calculate the
probability of a hypothesis based on its prior probability,
the probabilities of observing various data given the
hypothesis, and the observed data itself
Bayes Theorem..
 In ML problems, we are interested in the probability P(h|D) that h
holds given the observed training data D.
 Bayes theorem provides a way to calculate the posterior probability
P(h|D), from the prior probability P(h), together with P(D) and
P(D|h).
 Bayes Theorem: P(h|D) = P(D|h) P(h) / P(D)
 P(h) is the prior probability of hypothesis h
 P(D) is the prior probability that training data D will be observed
 P(h|D) is the posterior probability of h given D
 P(D|h) is the probability of observing data D given that hypothesis h holds (the likelihood of D given h)
 P(h|D) increases with P(h) and P(D|h) according to Bayes theorem.
 P(h|D) decreases as P(D) increases, because the more probable it is
that D will be observed independent of h, the less evidence D
provides in support of h
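To make the calculation concrete, here is a minimal sketch (not from the slides) that applies Bayes theorem to a toy two-hypothesis space; the prior and likelihood values are made-up, purely illustrative numbers.

```python
# Toy application of Bayes theorem over a two-hypothesis space.
priors = {"h1": 0.3, "h2": 0.7}           # P(h), illustrative values
likelihoods = {"h1": 0.8, "h2": 0.1}      # P(D|h), illustrative values

# P(D) by the law of total probability: sum over h of P(D|h) * P(h)
p_data = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior P(h|D) = P(D|h) * P(h) / P(D)
posteriors = {h: likelihoods[h] * priors[h] / p_data for h in priors}
print(posteriors)   # {'h1': 0.774..., 'h2': 0.225...}
```

Note how the data shift belief toward h1 even though h2 has the larger prior, because D is much more probable under h1.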
Maximum A Posteriori (MAP)
Hypothesis, hMAP
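In the standard formulation, the MAP hypothesis is the hypothesis with the greatest posterior probability given the data; applying Bayes theorem and dropping P(D), which does not depend on h:

```latex
h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D)
       = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
       = \arg\max_{h \in H} P(D \mid h)\,P(h)
```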
Maximum Likelihood (ML)
Hypothesis, hML
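If every hypothesis in H is assumed equally probable a priori (P(hi) = P(hj) for all hi, hj in H), the P(h) factor can be dropped from the MAP criterion as well, leaving the maximum likelihood hypothesis:

```latex
h_{ML} \equiv \arg\max_{h \in H} P(D \mid h)
```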
Brute-Force Bayes Concept
Learning
Brute-Force MAP Learning
Algorithm
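As commonly presented, the brute-force MAP learner simply evaluates Bayes theorem for every candidate hypothesis and outputs the one with the highest posterior. The sketch below assumes a finite, enumerable hypothesis space and user-supplied `prior` and `likelihood` functions (hypothetical placeholders for P(h) and P(D|h)):

```python
# Brute-force MAP learning: score every hypothesis, return the best one.
def brute_force_map(hypotheses, data, prior, likelihood):
    # Unnormalized posteriors P(D|h) * P(h) for every candidate hypothesis.
    scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    # P(D) is the normalizing constant; it is not needed to find the argmax,
    # but dividing by it yields the actual posterior probabilities P(h|D).
    p_data = sum(scores.values())
    posteriors = {h: s / p_data for h, s in scores.items()}
    h_map = max(posteriors, key=posteriors.get)
    return h_map, posteriors
```

The single pass over all candidate hypotheses is the source of the "linear in the number of candidate hypotheses" cost noted earlier, which is why this approach is impractical for large hypothesis spaces.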
Maximum Likelihood and Least-
Squared Error Hypotheses
 Many learning approaches such as neural network learning,
linear regression, and polynomial curve fitting try to learn a
continuous-valued target function.
 Under certain assumptions any learning algorithm that
minimizes the squared error between the output hypothesis
predictions and the training data will output a MAXIMUM
LIKELIHOOD HYPOTHESIS.
 The significance of this result is that it provides a Bayesian
justification (under certain assumptions) for many neural
network and other curve fitting methods that attempt to
minimize the sum of squared errors over the training data.
Maximum Likelihood and Least-
Squared Error Hypotheses –
Deriving hML
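A sketch of the standard derivation, under the assumption that each observed training value is di = f(xi) + ei, where the noise terms ei are drawn independently for each example from a Normal distribution with zero mean and variance σ²:

```latex
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} p(D \mid h)
        = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^{2}}}
          \exp\!\left(-\frac{(d_i - h(x_i))^{2}}{2\sigma^{2}}\right) \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m}
          \left(-\tfrac{1}{2}\ln(2\pi\sigma^{2}) - \frac{(d_i - h(x_i))^{2}}{2\sigma^{2}}\right)
        = \arg\min_{h \in H} \sum_{i=1}^{m} \bigl(d_i - h(x_i)\bigr)^{2}
\end{aligned}
```

The first equality uses the independence of the m training examples, the second takes the (monotonic) logarithm, and the last drops the terms and constant factors that do not depend on h and flips the sign, turning the maximization into a minimization of the sum of squared errors.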
Maximum Likelihood and Least-
Squared Error Hypotheses
 The maximum likelihood hypothesis hML is the one that minimizes the sum of
the squared errors between the observed training values di and the hypothesis
predictions h(xi).
 This holds under the assumption that the observed training values
di are generated by adding random noise to the true target value,
where this random noise is drawn independently for each example
from a Normal distribution with zero mean.
 Similar derivations can be performed starting with other assumed
noise distributions, producing different results.
 Why is it reasonable to choose the Normal distribution to characterize
noise? – One reason is that it allows for a mathematically straightforward
analysis. – A second reason is that the smooth, bell-shaped distribution is a
good approximation to many types of noise in physical systems.
 Minimizing the sum of squared errors is a common approach in
many neural network, curve fitting, and other approaches to
approximating real-valued functions.
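A minimal numerical sketch of this connection, using made-up data generated exactly as assumed above (a linear target function plus zero-mean Gaussian noise) and an ordinary least-squares fit over the space of linear hypotheses:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
f = 2.0 * x + 1.0                              # true target function f(x)
d = f + rng.normal(0.0, 0.1, size=x.shape)     # d_i = f(x_i) + Gaussian noise

# Least-squares fit of h(x) = w1*x + w0; under the zero-mean Gaussian noise
# assumption this minimizer of the squared error is the maximum likelihood
# hypothesis within the linear hypothesis space.
w1, w0 = np.polyfit(x, d, deg=1)
sse = np.sum((d - (w1 * x + w0)) ** 2)         # the quantity being minimized
print(w1, w0, sse)
```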
Minimum Description Length
Principle
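The principle is motivated by restating the MAP hypothesis in information-theoretic terms: taking base-2 logarithms of the MAP criterion (a monotonic transformation) gives

```latex
h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h)
        = \arg\max_{h \in H} \bigl[\log_2 P(D \mid h) + \log_2 P(h)\bigr]
        = \arg\min_{h \in H} \bigl[-\log_2 P(D \mid h) - \log_2 P(h)\bigr]
```

These two terms can then be interpreted as description lengths under suitably chosen codes, as discussed next.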
Minimum Description Length
Principle
 Consider the problem of designing a code to transmit messages
drawn at random, where the probability of encountering
message i is pi.
 We are interested here in the most compact code; that is, we
are interested in the code that minimizes the expected
number of bits we must transmit in order to encode a message
drawn at random.
 Clearly, to minimize the expected code length we should
assign shorter codes to messages that are more probable.
 We will refer to the number of bits required to encode
message i using code C as the description length of message i
with respect to C, which we denote by Lc(i).
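By Shannon's source-coding result, the optimal (expected-length-minimizing) code assigns −log2 pi bits to message i, so Lc(i) = −log2 pi under the optimal code C. A small sketch with an illustrative, made-up message distribution:

```python
import math

# Illustrative message probabilities p_i (made-up numbers).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Description length of message i under the optimal code: L_C(i) = -log2(p_i) bits.
lengths = {msg: -math.log2(prob) for msg, prob in p.items()}

# Expected number of bits per message under the optimal code.
expected_bits = sum(prob * lengths[msg] for msg, prob in p.items())
print(lengths)        # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(expected_bits)  # 1.75
```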
Minimum Description Length
Principle
 Therefore the above equation can be rewritten to show that hMAP is the
hypothesis h that minimizes the sum given by the description length of the
hypothesis plus the description length of the data given the hypothesis:
hMAP = argmin_h [ LC_H(h) + LC_D|h(D|h) ], where C_H is an optimal code for
the hypothesis space H and C_D|h is an optimal code for describing the data D
under the assumption that both sender and receiver know the hypothesis h.
Bayes Optimal Classifier
 The Bayes optimal classification of a new instance is the most probable
classification, given the training data. Rather than relying on the single most
probable (MAP) hypothesis, it combines the predictions of all hypotheses,
weighted by their posterior probabilities.
 To develop some intuitions, consider a hypothesis space containing
three hypotheses, h1, h2, and h3.
 Suppose that the posterior probabilities of these hypotheses given the
training data are .4, .3, and .3 respectively. Thus, h1 is the MAP
hypothesis.
 Suppose a new instance x is encountered, which is classified positive by
h1, but negative by h2 and h3.
 Taking all hypotheses into account, the probability that x is positive is
.4 (the posterior probability associated with h1), and the probability that it is
negative is therefore .6.
 The most probable classification (negative) in this case is different from
the classification generated by the MAP hypothesis
 If the possible classification of the new example can take on any value
vj from some set V, then the probability P(vj|D) that the correct
classification for the new instance is vj is
P(vj|D) = sum over hi in H of P(vj|hi) P(hi|D)
Bayes Optimal Classifier
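The Bayes optimal classification is then the value vj that maximizes this quantity, i.e. v_OB = argmax over vj in V of the sum over hi in H of P(vj|hi) P(hi|D). A minimal sketch that reproduces the three-hypothesis example above:

```python
# Bayes optimal classification for the h1/h2/h3 example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h_i | D)
# P(v | h_i): h1 classifies x as positive, h2 and h3 as negative.
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# P(v | D) = sum over h_i of P(v | h_i) * P(h_i | D)
p_class = {
    v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
    for v in ("+", "-")
}
v_ob = max(p_class, key=p_class.get)
print(p_class)   # {'+': 0.4, '-': 0.6}
print(v_ob)      # '-', which differs from the MAP hypothesis h1's classification
```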
Gibbs Algorithm
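As usually stated, the Gibbs algorithm trades optimality for efficiency: (1) choose a hypothesis h from H at random, according to the posterior probability distribution P(h|D); (2) use h to classify the next instance. Under appropriate conditions its expected misclassification error is at most twice that of the Bayes optimal classifier. A minimal sketch of the sampling step, where `posteriors` maps each hypothesis to P(h|D) and `predictions` maps each hypothesis to its prediction function (both illustrative placeholders):

```python
import random

# Gibbs algorithm: sample one hypothesis according to its posterior P(h|D)
# and use that single hypothesis to classify the new instance.
def gibbs_classify(posteriors, predictions, instance):
    hypotheses = list(posteriors)
    weights = [posteriors[h] for h in hypotheses]
    h = random.choices(hypotheses, weights=weights, k=1)[0]   # draw h ~ P(h|D)
    return predictions[h](instance)                           # classify with h alone
```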