CMPUT 366 F20: Probability Theory
James Wright & Vadim Bulitko
October 15, 2020
Lecture Outline
Probability Theory
PM 8.1-8.2
Uncertainty
In both search and RL we assumed that the
agent knows its current state s
That is an abstraction/simplification
in real life agents may not know the entire
state with certainty
Agent’s knowledge is uncertain
agent must consider multiple hypotheses
agent must update beliefs about which
hypotheses are likely given observations
(Image credit: Stephen Hladky, 2009)
Example*
An AI robot has to decide between three actions:
drive without wearing a seatbelt
drive while wearing a seatbelt
stay home
If the robot knows with certainty that an accident will happen, it will just stay
home
If the robot knows with certainty that an accident will not happen, it will not
bother to wear a seatbelt
Wearing a seatbelt makes sense because the robot is uncertain about whether
driving will lead to an accident
* This is a hypothetical example with a robot. As a human in real life, please always follow appropriate laws and regulations on wearing seatbelts.
Measuring Uncertainty
Probability is a way of measuring/quantifying uncertainty
The agent assigns a number between 0 and 1 to hypotheses
0 means absolutely certain that statement is false
1 means absolutely certain that statement is true
intermediate values mean more or less certain
Probability is a measurement of uncertainty, not truth
a statement with probability 0.75 is not “mostly true”
rather, the agent believes it is more likely to be true than not
Subjective versus Objective: The Frequentist Perspective
Probabilities can be interpreted as objective statements about the world, or as
subjective statements about an agent’s beliefs
Objective view is called frequentist:
The probability of an event is the proportion of times it would happen in the long
run of repeated experiments
Every event has a single, true probability
Events that can only happen once do not have a well-defined probability
Subjective versus Objective: The Bayesian Perspective
Probabilities can be interpreted as objective statements about the world or as
subjective statements about an agent’s beliefs
Subjective view is called Bayesian
The probability of an event is a measure of an agent’s belief about its likelihood
Different agents can legitimately have different beliefs, so they can legitimately
assign different probabilities to the same event
But there is only one coherent way to update those beliefs in response to new data: Bayes’ rule, introduced later in this lecture
In this course, we will primarily take the Bayesian view
Example: Dice
Discuss:
Diane rolls a fair six-sided die and gets the number X
What is P(X = 5)?
Diane truthfully tells Oliver that she rolled an odd number
What should Oliver believe P(X = 5) is?
Diane truthfully tells Greta that she rolled a number greater than or equal to 5
What should Greta believe P(X = 5) is?
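For reference (this uses conditioning, defined later in the lecture): initially P(X = 5) = 1/6; once Oliver learns the roll is odd, only the outcomes {1, 3, 5} remain possible, so he should believe P(X = 5) = 1/3; once Greta learns X ≥ 5, only {5, 6} remain, so she should believe P(X = 5) = 1/2.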
Semantics: Possible Worlds
Random variables (e.g., X) take values from a set (domain)
A possible world ω is a complete assignment of values to all random variables
A probability measure is a function P : Ω → R:
∑_{ω∈Ω} P(ω) = 1
∀ω ∈ Ω [P(ω) ≥ 0]
Discuss for the fair six-sided die example:
What is the random variable?
What is its domain?
How many worlds are there?
What is the probability measure P?
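A minimal sketch of these semantics in Python, assuming the fair-die example (the dictionary encoding and names below are ours, not from the slides): there is one random variable X with domain {1, …, 6}, so there are six possible worlds, and P gives each world weight 1/6.

```python
from fractions import Fraction

# Possible worlds for one roll of a fair six-sided die: each world is a
# complete assignment to the single random variable X.
Omega = [1, 2, 3, 4, 5, 6]

# Probability measure: uniform weight 1/6 on every world.
P = {omega: Fraction(1, 6) for omega in Omega}

# The two axioms from the slide.
assert sum(P.values()) == 1             # weights sum to 1
assert all(p >= 0 for p in P.values())  # every weight is non-negative
print(P[5])  # P(X = 5) -> 1/6
```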
Propositions
A primitive proposition is an equality or an inequality (e.g., X = 2 or X ≥ 5)
A proposition is built up from other propositions using logical connectives (e.g.,
X = 1 ∨ X = 3 ∨ X = 5)
The probability of a proposition is the sum of the probabilities of the possible
worlds in which that proposition is true
P(α) = ∑_{ω∈Ω, ω⊨α} P(ω)
Example: for the fair die, P(X ≥ 5) = P(X = 5) + P(X = 6) = 1/6 + 1/6 = 1/3
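This definition translates directly into code. A sketch reusing the die worlds from above; representing a proposition as a Boolean function of worlds is our choice of encoding:

```python
from fractions import Fraction

Omega = [1, 2, 3, 4, 5, 6]
P = {omega: Fraction(1, 6) for omega in Omega}

def prob(alpha):
    """P(alpha): sum of P(omega) over the worlds where alpha holds."""
    return sum(P[w] for w in Omega if alpha(w))

print(prob(lambda x: x >= 5))          # 1/3, matching the slide
print(prob(lambda x: x in (1, 3, 5)))  # 1/2: the proposition X=1 ∨ X=3 ∨ X=5
```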
Basic Properties
P(α ∨ β) ≥ P(α)
P(α ∨ β) ≥ P(β)
P(α & β) ≤ P(α)
P(α & β) ≤ P(β)
P(¬α) = 1 − P(α)
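For instance, with the fair die: P(X = 1 ∨ X = 2) = 1/3 ≥ 1/6 = P(X = 1); P(X = 1 & X ≥ 2) = 0 ≤ 1/6; and P(¬(X = 1)) = 5/6 = 1 − 1/6.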
Joint Distributions
In our die example there was a single random variable X
We typically want to think about the interactions of multiple random variables
A joint distribution assigns a probability to each full assignment of values to
variables
P(X = 1, Y = 5) is equivalent to P(X = 1 & Y = 5)
the cumulative probability of all worlds in which X = 1 and Y = 5
Suppose Diane now throws her fair six-sided die twice. The result of the first
throw is X and the second throw is Y
Discuss:
What is P(X = 1, Y = 5)?
What is P(X = 1)?
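A sketch answering both questions by enumerating all 36 worlds (assuming, as fairness of the die suggests, that the two throws are independent, so every world has weight 1/36):

```python
from fractions import Fraction
from itertools import product

# Worlds are (X, Y) pairs: two independent throws of a fair die.
Omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in Omega}

def prob(alpha):
    return sum(P[w] for w in Omega if alpha(w))

print(prob(lambda w: w == (1, 5)))  # P(X = 1, Y = 5) = 1/36
print(prob(lambda w: w[0] == 1))    # P(X = 1)        = 1/6
```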
Another Joint-Distribution Example
What might a day be like in Edmonton?
Two random variables:
Weather with domain {clear, snowing}
Temperature with domain {mild, cold, very_cold}
Joint distribution P(Weather, Temperature) → shown as a table on the original slide (one consistent reconstruction below)
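The table itself was lost in this text. The values below are one consistent reconstruction: the clear row matches the worked example on the Chain Rule slide (where P(mild|clear) = 0.2/(0.2 + 0.3 + 0.25)), while the assignment of 0.30 and 0.25 to cold and very_cold, and the entire snowing row, are assumptions for illustration only.

                     mild    cold    very_cold
Weather = clear      0.20    0.30    0.25
Weather = snowing    0.05    0.10    0.10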
Marginalization
Marginalization is using a joint distribution
P(X1, . . . , Xm, . . . , Xn) to compute a distribution
over a smaller number of variables P(X1, . . . , Xm)
The smaller distribution is called the marginal
distribution of its variables
We compute the marginal distribution by
summing out the other variables, for instance:
P(X, Y) = ∑_z P(X, Y, Z = z)
What is the marginal distribution of Weather?
What is P(Weather = clear)?
What is P(Weather = snowing)?
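A sketch of summing out Temperature from the assumed table above (the dictionary encoding is ours), which answers all three questions:

```python
from fractions import Fraction as F

# Assumed joint distribution P(Weather, Temperature); see the table above.
joint = {
    ("clear", "mild"): F("0.20"), ("clear", "cold"): F("0.30"),
    ("clear", "very_cold"): F("0.25"), ("snowing", "mild"): F("0.05"),
    ("snowing", "cold"): F("0.10"), ("snowing", "very_cold"): F("0.10"),
}

# Marginal distribution of Weather: sum out Temperature.
marginal = {}
for (weather, temp), p in joint.items():
    marginal[weather] = marginal.get(weather, 0) + p

print(marginal["clear"])    # 3/4
print(marginal["snowing"])  # 1/4
```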
Conditional Probability
Agents need to be able to update their beliefs based on new observations
This process is called conditioning
We write P(h|e) to denote the probability of hypothesis h given that we have
observed evidence e
P(h|e) is the probability of h conditional on e
Semantics of Conditional Probability
Evidence e lets us rule out all of the worlds that
are incompatible with e
For instance, if the agent observes that the
weather is clear, it should no longer assign any
probability to the worlds in which it is snowing
We need to normalize the probabilities of the
remaining worlds to ensure that the probabilities
of possible worlds sum to 1
Modify the joint-distribution table from the earlier slide given the evidence that the weather is clear
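A sketch of this rule-out-and-renormalize procedure on the assumed table (again, the snowing row and the encoding are ours):

```python
from fractions import Fraction as F

joint = {
    ("clear", "mild"): F("0.20"), ("clear", "cold"): F("0.30"),
    ("clear", "very_cold"): F("0.25"), ("snowing", "mild"): F("0.05"),
    ("snowing", "cold"): F("0.10"), ("snowing", "very_cold"): F("0.10"),
}

# Observe Weather = clear: rule out incompatible worlds, then renormalize.
surviving = {w: p for w, p in joint.items() if w[0] == "clear"}
total = sum(surviving.values())  # P(clear) = 3/4
posterior = {w: p / total for w, p in surviving.items()}

print(posterior[("clear", "mild")])  # 4/15 ≈ 0.267, i.e. P(mild | clear)
```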
Chain Rule
Conditional probability is defined as
P(h|e) = P(h, e) / P(e)
which is exactly the sum of probabilities of all worlds in which h & e are true
divided by the sum of probabilities of all worlds in which e is true
in the weather example, P(mild|clear) = 0.2 / (0.2 + 0.3 + 0.25) ≈ 0.267
From there we have P(h, e) = P(h|e)P(e)
More generally, we have the chain rule:
P(α1, . . . , αn) = P(α1)P(α2|α1) · · · P(αn|α1, . . . , αn−1) = ∏_{i=1}^{n} P(αi|α1, . . . , αi−1)
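A quick numeric check of the chain rule on the two-throw die example (a sketch, reusing the earlier world encoding): P(X = 1, Y = 5) should equal P(X = 1) · P(Y = 5 | X = 1).

```python
from fractions import Fraction
from itertools import product

Omega = list(product(range(1, 7), repeat=2))  # two fair-die throws
P = {w: Fraction(1, 36) for w in Omega}

def prob(alpha):
    return sum(P[w] for w in Omega if alpha(w))

def cond_prob(h, e):
    """P(h | e) = P(h, e) / P(e)."""
    return prob(lambda w: h(w) and e(w)) / prob(e)

x_is_1 = lambda w: w[0] == 1
y_is_5 = lambda w: w[1] == 5

# Chain rule: P(X=1, Y=5) = P(X=1) * P(Y=5 | X=1) = 1/6 * 1/6 = 1/36
assert prob(lambda w: x_is_1(w) and y_is_5(w)) == \
       prob(x_is_1) * cond_prob(y_is_5, x_is_1)
```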
Bayes’ Rule
We have P(h, e) = P(h|e)P(e) = P(e|h)P(h)
From here we have Bayes’ rule
P(h|e) = P(e|h)P(h) / P(e)
P(e) is the probability of the evidence
P(h) is the prior probability of a hypothesis h
P(e|h) is the likelihood, which is often easier to compute than
P(h|e), the posterior
Discuss why P(wet|rain) is easier to compute than P(rain|wet)
wet is the evidence e
rain is the hypothesis h
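A sketch of the update with made-up numbers (the prior and likelihoods below are assumptions, not from the lecture): it is easy to state how likely the ground is to be wet if it rains, and Bayes’ rule then turns that likelihood into the posterior P(rain|wet).

```python
from fractions import Fraction as F

# All numbers below are assumptions, chosen only to illustrate the formula.
p_rain = F("0.1")            # prior P(h)
p_wet_given_rain = F("0.9")  # likelihood P(e | h)
p_wet_given_dry = F("0.05")  # likelihood P(e | not h)

# P(e): marginalize the evidence over both hypotheses.
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)

# Bayes' rule: P(h | e) = P(e | h) P(h) / P(e)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(p_rain_given_wet)  # 2/3
```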
Expected Value
The expected value of a random variable X is the weighted average of that
variable over the domain, weighted by the probability of each value:
E[X] = ∑_x P(X = x) · x
The conditional expected value of a variable X conditioned on proposition y is
its expected value weighted by the conditional probability:
E[X|y] = ∑_x P(X = x|y) · x
Discuss
What is the expected value of a roll of a fair six-sided die?
What is its conditional expected value conditioned on the roll being even?
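A sketch computing both discussion answers for the fair die (the encoding is the same one used earlier):

```python
from fractions import Fraction

Omega = [1, 2, 3, 4, 5, 6]
P = {x: Fraction(1, 6) for x in Omega}

# E[X]: probability-weighted average over the domain.
ev = sum(p * x for x, p in P.items())
print(ev)  # 7/2, i.e. 3.5

# E[X | X is even]: condition on the even worlds, then take the
# average weighted by the conditional probabilities.
even = {x: p for x, p in P.items() if x % 2 == 0}
total = sum(even.values())
ev_even = sum((p / total) * x for x, p in even.items())
print(ev_even)  # 4
```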
Expected Value Examples: E[X] = 3 (figure slide; plots not reproduced here)
Summary
Probability is a numerical measure of uncertainty
Formal semantics:
positive weights, sum up to 1 over possible worlds
probability of proposition is total weight of worlds in which the proposition is true
Conditional probability updates the agent’s beliefs based on evidence
Expected value of a variable is its probability-weighted average over possible
worlds
