Probabilistic Methods
EE 439 Introduction to Machine Learning
Lectures 13, 14, 15, 16: Probabilistic Classifiers
Draft of Chapter 2 of Tom Mitchell, “Machine
Learning”, 2nd Edition, 2018
www.cs.cmu.edu/~tom/mlbook.html.
Estimating Probabilities: Intro
❑ Many machine learning methods depend on
probabilistic approaches.
❑ The reason: when we are interested in learning some
target function f : X → Y, we can more generally learn
the probabilistic function P(Y|X).
❑ We can design algorithms that learn functions with
uncertain outcomes (e.g., predicting tomorrow’s stock
price) and that incorporate prior knowledge to guide
learning (e.g., a bias that tomorrow’s stock price is
likely to be similar to today’s price).
❑ We will discuss joint probability distributions over
many variables and show how they can be used to
calculate a target P(Y|X).
❑ We will then study the problem of learning, or estimating,
probability distributions from training data.
❖ The two most common approaches:
• maximum likelihood estimation (MLE) and
• maximum a posteriori estimation (MAP).
Joint Probability Distributions
❑ The key to building probabilistic models is
❖ to define a set of random variables, and
❖ to consider the joint probability distribution over
them.
❑ Table 1 defines a joint probability distribution over
three random variables:
❖ a person’s Gender, the number of HoursWorked each week,
and their Wealth.
❑ In general, defining a joint probability distribution over
a set of discrete-valued variables involves three
simple steps:
❑ The joint probability distribution is central to
probabilistic inference because once we know the
joint distribution, we can answer every possible
probabilistic question that can be asked about these
variables.
❖ We can calculate conditional or joint probabilities
over any subset of variables, given their joint
distribution.
❑ The probability that any single variable will take on
any specific value.
❑ P(Gender = male) = ?
❑ P(Wealth = rich) = ?
❑ P(Gender = male) = 0.6685
❖ sum the four rows for which Gender = male.
❑ P(Wealth = rich) = 0.2393
❖ add together the probabilities for the four rows
covering the cases for which Wealth=rich.
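A small sketch of how such marginals can be computed in code. The joint table is stored as a dictionary keyed by row; the probabilities below are made up for illustration, since Table 1's values are not reproduced in this text:

```python
# Joint distribution over (Gender, HoursWorked, Wealth), one entry per table row.
# These numbers are hypothetical, NOT the values of Table 1.
joint = {
    ("male",   "under40", "poor"): 0.15, ("male",   "under40", "rich"): 0.05,
    ("male",   "over40",  "poor"): 0.35, ("male",   "over40",  "rich"): 0.12,
    ("female", "under40", "poor"): 0.10, ("female", "under40", "rich"): 0.03,
    ("female", "over40",  "poor"): 0.17, ("female", "over40",  "rich"): 0.03,
}

def marginal(joint, var_index, value):
    """P(variable = value): sum the probabilities of every row that matches."""
    return sum(p for row, p in joint.items() if row[var_index] == value)

# With the real Table 1, marginal(joint, 0, "male") sums the four Gender=male
# rows and gives 0.6685, and marginal(joint, 2, "rich") gives 0.2393, as above.
```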
❑ The probability that any subset of the variables will
take on a particular joint assignment.
❑ P(Wealth=rich ∧ Gender=female) = ?
❑ P(Wealth=rich ∧ Gender=female) = 0.0362
❖ sum the two table rows that satisfy this joint
assignment.
❑ Any conditional probability P(Y|X) = P(X ∧ Y)/P(X)
defined over subsets of the variables.
❑ P(Wealth=rich | Gender=female) = ?
❑ P(Wealth=rich | Gender=female) = 0.0362/0.3315 =
0.1092.
❖ sum the appropriate rows to obtain the numerator
and the denominator of this conditional probability.
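Continuing the sketch above, a conditional probability is just a ratio of two such sums:

```python
def conditional(joint, var_a, val_a, var_b, val_b):
    """P(A = a | B = b) = P(A = a and B = b) / P(B = b)."""
    numer = sum(p for row, p in joint.items()
                if row[var_a] == val_a and row[var_b] == val_b)
    denom = sum(p for row, p in joint.items() if row[var_b] == val_b)
    return numer / denom

# With the real Table 1, conditional(joint, 2, "rich", 0, "female")
# evaluates 0.0362 / 0.3315 = 0.1092, as above.
```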
❑ If we know the joint probability distribution over an
arbitrary set of random variables {X1. . . Xn}, we can
calculate the conditional and joint probability
distributions for arbitrary subsets of these variables
(e.g., P(Xn | X1. . . Xn-1)).
❑ In theory, we can then solve
❖ any classification, regression, or other function
approximation problem defined over these
variables, and furthermore produce probabilistic
rather than deterministic predictions for any given
input to the target function.
Learning the Joint Distribution
❑ How can we learn joint distributions from observed
training data?
❑ In the example of Table 1, it will be easy if we begin
with a large database containing, say, descriptions of
a million people in terms of their values for our three
variables.
❑ Given a large data set:
❖ one can easily estimate a probability for each row
in the table by calculating the fraction of database
entries (people) that satisfy the joint assignment
specified for that row.
❖ If thousands of database entries fall into each row,
we will obtain highly reliable probability estimates
using this strategy.
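A sketch of this counting strategy; the database itself is hypothetical, with each person represented as a tuple of values for the three variables:

```python
from collections import Counter

def estimate_joint(people):
    """Estimate each row's probability as the fraction of database entries
    that satisfy that row's joint assignment of values."""
    counts = Counter(people)
    n = len(people)
    return {row: c / n for row, c in counts.items()}

# people = [("male", "over40", "poor"), ("female", "under40", "rich"), ...]
# With about a million entries, rows backed by thousands of people get
# reliable estimates; rows matched by few or no people do not.
```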
❑ However, it can be difficult to learn the joint
distribution due to the very large amount of training
data required.
❖ Consider how our learning problem would change
if we were to add additional variables to describe a
total of 100 Boolean features for each person in
Table 1 (e.g., we could add ”do they have a
college degree?”, ”are they healthy?”).
❖ Given 100 Boolean features, the number of rows
would now expand to 2^100, which is greater than
10^30.
❑ Unfortunately, even if our database describes every
single person on earth, we will not have enough data
to obtain reliable probability estimates for most rows.
❖ There are only approximately 10^10 people on
earth, which means that for most of the 2^100
rows in our table, we would have zero training
examples!
❑ This is a significant problem: real-world machine
learning applications often use many more than 100
features to describe each example (e.g., text
analysis may use millions of features to describe
the text in a given document).
❑ To successfully address the issue of learning
probabilities from available training data, we must:
❖ (1) be smart about how we estimate probability
parameters from available data, and
❖ (2) be smart about how we represent joint
probability distributions.
Estimating Probabilities
❑ Let us begin our discussion of how to estimate
probabilities with a simple example and explore two
intuitive algorithms.
❑ These two intuitive algorithms illustrate the two
primary approaches used in nearly all probabilistic
machine learning algorithms.
❑ Assume you have a coin, represented by the random
variable X. If you flip this coin, it may turn up heads
(X = 1) or tails (X = 0).
❑ The learning task is to estimate the probability that it
will turn up heads; i.e., to estimate P(X =1).
❑ We will use θ to refer to the true (but unknown)
probability of heads (i.e., P(X = 1) = θ), and use θ̂ to
refer to our learned estimate of this true θ.
❑ You gather training data by flipping the coin n times,
and observe that it turns up heads α1 times, and tails
α0 times.
n = α1 + α0
❑ What is the most intuitive approach to estimating θ =
P(X =1) from this training data?
❖ Estimate the probability by the fraction of flips that
result in heads:
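Written out, this is the estimate later referred to as Algorithm 1:
\[
\hat{\theta} = \frac{\alpha_1}{\alpha_1 + \alpha_0} = \frac{\alpha_1}{n}.
\]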
❑ This leads to our second intuitive algorithm:
❖ an algorithm that enables us to incorporate prior
assumptions along with observed training data to
produce our final estimate.
❑ Algorithm 2 allows us to express our prior
assumptions or knowledge about the coin by adding
in any number of imaginary coin flips resulting in
heads or tails.
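In symbols, writing γ1 for the number of imaginary heads and γ0 for the number of imaginary tails (this notation is introduced here), Algorithm 2's estimate is
\[
\hat{\theta} = \frac{\alpha_1 + \gamma_1}{(\alpha_1 + \gamma_1) + (\alpha_0 + \gamma_0)},
\]
so choosing, say, γ1 = 24 and γ0 = 36 expresses a prior belief that θ is around 0.4, backed by 60 imaginary flips.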
❑ Asymptotically, as the volume of actual observed
data grows toward infinity:
❖ the influence of our imaginary data goes to zero
(the fixed number of imaginary coin flips becomes
insignificant compared to a sufficiently large
number of actual observations).
❖ In other words, Algorithm 2 behaves so that priors
have the strongest influence when observations
are scarce, and their influence gradually reduces
as observations become more plentiful.
❑ Both Algorithm 1 and Algorithm 2 are intuitively quite
compelling.
❖ These two algorithms exemplify the two most
widely used approaches to machine
learning of probabilistic models from training data.
❖ They follow two different underlying principles.
• Algorithm 1 follows a principle called
Maximum Likelihood Estimation (MLE), in
which we seek an estimate of θ that maximizes
the probability of the observed data.
• Algorithm 2 follows a different principle called
Maximum a Posteriori (MAP) estimation, in
which we seek the estimate of θ that is itself
most probable, given the observed data, plus
background assumptions about its value.
❑ Both principles have been widely used to derive and
to justify a vast range of machine learning algorithms,
from Bayesian networks, to linear regression, to
neural network learning.
❖ Our coin flip example represents just one of many such
learning problems.
❑ Here the learning task is to estimate the unknown
value of θ = P(X = 1) for a Boolean valued random
variable X, based on a sample of n values of X drawn
independently (e.g., n independent flips of a coin with
probability θ of heads).
❑ Data is generated by a random number generator.
❖ The true value of θ is 0.3, and the same sequence
of training examples is used in each plot.
❑ The experimental behavior of these two algorithms is
illustrated in the figure described below.
• The blue line shows the estimates produced by MLE while
the red line shows the estimates of MAP.
• Plots on the left correspond to the correct prior assumption
about the value of θ, plots on the right reflect the incorrect
prior assumption that θ is most probably 0.4.
• Plots in the top row reflect lower confidence in the prior
assumption, by including only 60 imaginary data points,
whereas bottom plots assume 120.
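A small simulation in this spirit, using the smoothed (Algorithm 2 / MAP-style) estimate sketched above; the exact data sequence and plotting of the original figure are not reproduced, so the numbers below are only illustrative:

```python
import random

random.seed(0)
theta_true = 0.3                    # true probability of heads, as in the figure
prior_guess, imaginary_n = 0.4, 60  # e.g. the "incorrect prior, lower confidence" panel
g1 = prior_guess * imaginary_n      # imaginary heads
g0 = imaginary_n - g1               # imaginary tails

heads = 0
for n in range(1, 1001):
    heads += random.random() < theta_true          # flip one more coin
    mle = heads / n                                 # Algorithm 1
    map_est = (heads + g1) / (n + imaginary_n)      # Algorithm 2
    if n in (10, 100, 1000):
        print(f"n={n:4d}  MLE={mle:.3f}  MAP={map_est:.3f}")
```

For small n the MAP estimate stays near the prior guess; as n grows both estimates approach the true value 0.3.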
Maximum Likelihood Estimation (MLE)
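In outline, for the coin-flip setting above, maximum likelihood estimation picks the θ that maximizes the probability of the observed flips; a standard sketch of the derivation, consistent with the notation above, is:
\[
P(\mathcal{D}\mid\theta) = \theta^{\alpha_1}(1-\theta)^{\alpha_0},
\qquad
\hat{\theta}^{\,MLE} = \arg\max_{\theta}\,\big[\alpha_1 \ln\theta + \alpha_0 \ln(1-\theta)\big].
\]
Setting the derivative to zero, \( \alpha_1/\theta - \alpha_0/(1-\theta) = 0 \), gives \( \hat{\theta} = \alpha_1/(\alpha_1+\alpha_0) \), which is exactly Algorithm 1.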
Maximum a Posteriori Probability Estimation (MAP)
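In outline, MAP estimation maximizes the posterior P(θ | data) ∝ P(data | θ) P(θ). With a Beta prior over θ (a standard choice, assumed here, with hyperparameters β1 and β0 introduced for this sketch):
\[
P(\theta) \propto \theta^{\beta_1 - 1}(1-\theta)^{\beta_0 - 1},
\qquad
\hat{\theta}^{\,MAP} = \arg\max_{\theta}\, P(\mathcal{D}\mid\theta)\,P(\theta)
= \frac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}.
\]
The prior hyperparameters act exactly like the imaginary coin flips of Algorithm 2, and their influence vanishes as the number of real flips grows.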
Probabilistic Methods
Chapter 4 of E. Alpaydin, “Introduction to Machine Learning”, 3rd Edition, The MIT Press, 2014
Parametric Methods
❑ A statistic is any value that is calculated from a given
sample.
❑ In statistical inference, we make a decision using the
information provided by a sample.
❑ Our approach is parametric: we assume that
the sample is drawn from some distribution that
obeys a known model, for example, the Gaussian.
❖ The advantage is that the model is defined up to a
small number of parameters — for example,
mean, variance — the sufficient statistics of the
Gaussian distribution.
❖ Once those parameters are estimated from the
sample, the whole distribution is known.
❑ Estimate the parameters of the distribution from the
given sample,
❑ Plug in these estimates to the assumed model, and
❑ Get an estimated distribution, which we then use to
make a decision.
❑ The method we use to estimate the parameters of a
distribution is maximum likelihood estimation.
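A minimal sketch of this plug-in recipe for an assumed Gaussian model (the sample below is hypothetical):

```python
import math

sample = [4.9, 5.1, 4.7, 5.3, 5.0]     # hypothetical one-dimensional sample

# 1) estimate the parameters by maximum likelihood
m = sum(sample) / len(sample)                              # sample mean
s2 = sum((x - m) ** 2 for x in sample) / len(sample)       # ML variance (divides by N)

# 2) plug the estimates into the assumed model
def p_hat(x):
    """Estimated Gaussian density with the plugged-in parameters."""
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

# 3) use the estimated density to make decisions, e.g. compare class densities
print(m, s2, p_hat(5.0))
```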
❑ We start with density estimation, which is the general
case of estimating p(x).
❑ We then use this for classification: the estimated
densities are the class densities, p(x|Ci), and the
priors, P(Ci), from which we calculate the posteriors,
P(Ci|x), and make our decision.
❑ For now, x is one-dimensional and thus the densities
are univariate; we generalize to the multivariate case
later (Chapter 5 of Alpaydin).
Maximum Likelihood Estimation
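In outline, given an i.i.d. sample X = {x^t}, t = 1, ..., N, drawn from p(x|θ), maximum likelihood estimation maximizes the log likelihood of the sample (notation follows Alpaydin):
\[
\mathcal{L}(\theta\mid\mathcal{X}) = \log\prod_{t=1}^{N} p(x^{t}\mid\theta) = \sum_{t=1}^{N}\log p(x^{t}\mid\theta),
\qquad
\hat{\theta} = \arg\max_{\theta}\ \mathcal{L}(\theta\mid\mathcal{X}).
\]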
Bernoulli Density
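For a Bernoulli variable x ∈ {0, 1} with parameter p, the density and its maximum likelihood estimate are (standard results, stated here for reference):
\[
P(x) = p^{x}(1-p)^{1-x},
\qquad
\hat{p} = \frac{1}{N}\sum_{t=1}^{N} x^{t}.
\]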
Gaussian (Normal) Density
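For a Gaussian variable, the density and the maximum likelihood estimates of its mean and variance are (again standard results):
\[
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right],
\qquad
m = \frac{1}{N}\sum_{t} x^{t},
\qquad
s^{2} = \frac{1}{N}\sum_{t}(x^{t}-m)^{2}.
\]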
Parametric Classification
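In outline, parametric classification plugs class-conditional Gaussian densities and estimated priors into Bayes' rule; a standard form of the resulting discriminant (one univariate Gaussian per class) is:
\[
g_i(x) = \log p(x\mid C_i) + \log \hat{P}(C_i)
= -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x-m_i)^{2}}{2 s_i^{2}} + \log \hat{P}(C_i),
\]
and we choose the class with the largest g_i(x).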
An Example: Iris flower data set
❑ The Iris flower data set, or Fisher's Iris data set, is a
multivariate data set introduced by the British
statistician and biologist Ronald Fisher in 1936.
❑ It contains measurements for three species:
Iris setosa, Iris versicolor, and Iris virginica.
❑ Estimated class means and variances (m_i, s_i^2) for the three classes:
❖ m1 = 4.825, s1^2 = 0.036875
❖ m2 = 6.45, s2^2 = 0.3525
❖ m3 = 6.425, s3^2 = 0.216875
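A sketch of using these estimates in the discriminant above; the feature used, the class-to-species mapping, and the equal priors are assumptions, since the slides do not state them:

```python
import math

# Estimated (mean, variance) per class, taken from the values above.
params = {"C1": (4.825, 0.036875), "C2": (6.45, 0.3525), "C3": (6.425, 0.216875)}
prior = {"C1": 1/3, "C2": 1/3, "C3": 1/3}    # assumed equal priors

def g(x, m, s2, p):
    """Gaussian log-discriminant: log p(x|Ci) + log P(Ci)."""
    return -0.5 * math.log(2 * math.pi * s2) - (x - m) ** 2 / (2 * s2) + math.log(p)

def classify(x):
    return max(params, key=lambda c: g(x, *params[c], prior[c]))

print(classify(4.9))   # near the C1 mean -> "C1"
print(classify(6.5))   # C2 and C3 overlap heavily in this region
```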
Evaluating an Estimator: Bias and Variance
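In outline, for an estimator d = d(X) of a parameter θ, the standard quantities used to evaluate it are (stated here for reference):
\[
\mathrm{bias}_{\theta}(d) = E[d] - \theta,
\qquad
\mathrm{var}(d) = E\big[(d - E[d])^{2}\big],
\qquad
E\big[(d-\theta)^{2}\big] = \big(\mathrm{bias}_{\theta}(d)\big)^{2} + \mathrm{var}(d).
\]
For example, the sample average m is an unbiased estimator of a Gaussian mean, while the maximum likelihood variance estimate s^2 (which divides by N rather than N - 1) is biased.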
Probabilistic Methods
Chapter 5 of E. Alpaydin, “Introduction to Machine Learning”, 3rd Edition, The MIT Press, 2014
Multivariate Methods
❑ We generalize our discussion of the parametric
approach to the multivariate case:
❖ multiple inputs, where the output is a class label or
a continuous value (a function of these multiple
inputs).
❖ These inputs may be discrete or numeric.
❑ We will see how such functions can be learned from
a labeled multivariate sample and also how the
complexity of the learner can be fine-tuned to the
data at hand.
Multivariate Data
❑ In a multivariate setting, the sample may be viewed
as a data matrix whose d columns correspond to d variables.
❑ These are also called inputs, features, or attributes.
❑ The N rows correspond to independent and identically
distributed observations, examples, or instances.
❑ For example, in deciding on a loan application, an
observation vector is the information associated with
a customer
❖ age, marital status, yearly income, and so forth,
and we have N such past customers.
❑ Typically these variables are correlated.
❑ If they are not, there is no need for a multivariate
analysis.
❑ Our aim may be simplification, that is, summarizing
this large body of data by means of relatively few
parameters.
❑ Or our aim may be exploratory, and we may be
interested in generating hypotheses about data.
Parameter Estimation
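A minimal sketch of estimating the multivariate parameters (sample mean vector, covariance matrix, and correlation matrix) from an N x d data matrix; the data below is hypothetical:

```python
import numpy as np

# Hypothetical N x d data matrix: rows are observations (e.g. customers),
# columns are variables (e.g. age, yearly income).
X = np.array([[25, 30_000.0],
              [40, 52_000.0],
              [35, 41_000.0],
              [50, 60_000.0]])

m = X.mean(axis=0)                        # sample mean vector, shape (d,)
S = np.cov(X, rowvar=False, bias=True)    # maximum likelihood covariance (divides by N)
R = np.corrcoef(X, rowvar=False)          # sample correlation matrix

print(m, S, R, sep="\n")
```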
Multivariate Normal Distribution
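For reference, the d-dimensional normal density with mean vector μ and covariance matrix Σ is:
\[
p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}
\exp\!\Big[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big],
\]
where the quadratic form in the exponent is the squared Mahalanobis distance from x to μ.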
❑ When the variables are
independent, the major axes of the
density are parallel to the input
axes.
❑ The density becomes an ellipse if
the variances are different.
❑ The density rotates depending on
the sign of the covariance
(correlation).
Multivariate Classification
❑ For the class-conditional densities p(x|Ci) we again
assume the multivariate normal; the main reason for
this is its analytical simplicity.
❑ While real data may not often be exactly
multivariate normal, it is a useful approximation.
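With a separate Gaussian estimated for each class (mean vector m_i, covariance matrix S_i), the resulting discriminant, dropping the constant term that is the same for all classes, takes the standard form:
\[
g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i|
- \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^{T}\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i)
+ \log \hat{P}(C_i);
\]
if a covariance matrix shared by all classes is assumed instead, the quadratic terms cancel and the discriminant becomes linear in x.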
Discrete Features
❑ Read: Section 6.10, “An Example: Learning To Classify Text”, in
Tom Mitchell's book
❑ http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
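A minimal sketch of a naive Bayes classifier with binary (discrete) features, in the spirit of the text-classification reading above; all data below is hypothetical, and the smoothing constant plays the role of the imaginary examples discussed earlier:

```python
import math

def train_nb(docs, labels, vocab, alpha=1.0):
    """Estimate P(Ci) and P(word present | Ci) with add-alpha smoothing."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        cond[c] = {w: (sum(w in d for d in class_docs) + alpha) /
                      (len(class_docs) + 2 * alpha)
                   for w in vocab}
    return prior, cond

def predict_nb(doc, prior, cond, vocab):
    """Pick the class maximizing log P(Ci) + sum over words of log P(x_j | Ci)."""
    def log_post(c):
        s = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            s += math.log(p) if w in doc else math.log(1 - p)
        return s
    return max(prior, key=log_post)

# Hypothetical toy data: each document is the set of words it contains.
docs = [{"cheap", "pills"}, {"meeting", "today"}, {"cheap", "offer"}, {"project", "meeting"}]
labels = ["spam", "ham", "spam", "ham"]
vocab = {"cheap", "pills", "meeting", "today", "offer", "project"}
prior, cond = train_nb(docs, labels, vocab)
print(predict_nb({"cheap", "pills", "offer"}, prior, cond, vocab))   # -> "spam"
```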