Artificial Intelligence
Introduction to Bayesian Networks
Andres Mendez-Vazquez
March 2, 2016
1 / 85
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
2 / 85
History
History
‘60s The first expert systems. IF-THEN rules.
1968 Attempts to use probabilities in expert systems (Gorry &
Barnett).
1973 Gave up - the calculations were too heavy! (Gorry).
1976 MYCIN: Medical predicate logic expert system with certainty
factors (Shortliffe).
1976 PROSPECTOR: Predicts the likely location of mineral deposits.
Uses Bayes’ rule. (Duda et al.).
Summary until mid ’80s
“Pure logic will solve the AI problems!”
“Probability theory is intractable to use and too complicated for
complex models.”
4 / 85
But...
More History
1986 Bayesian networks were revived and reintroduced to expert
systems (Pearl).
1988 Breakthrough for efficient calculation algorithms (Lauritzen &
Spiegelhalter): tractable calculations on Bayesian Networks.
1995 In Windows95™ for printer troubleshooting and Office
assistance (“the paper clip”).
1999 Bayesian Networks are used more and more, e.g., gene
expression analysis, business strategy, etc.
2000 Widely used - A Bayesian Network tool will be shipped with
every Windows™ Commercial Server.
5 / 85
Further on: 2000-2015
Bayesian Networks are used in
Spam Detection.
Gene Discovery.
Signal Processing.
Ranking.
Forecasting.
etc.
Something Notable
We are more and more interested in building Bayesian
Networks automatically from data!!!
6 / 85
Bayesian Network Advantages
Many of Them
1 Since a Bayesian network encodes dependencies among all variables, missing data entries
can be handled successfully.
2 When used for learning causal relationships, they help better
understand a problem domain as well as forecast consequences.
3 It is ideal to use a Bayesian network for representing prior data and
knowledge.
4 Over-fitting of data can be avoided when using Bayesian networks
and Bayesian statistical methods.
7 / 85
Bayes Theorem
One Version
P(A|B) = [P(B|A) P(A)] / P(B)
Where
P(A) is the prior probability or marginal probability of A. It is
"prior" in the sense that it does not take into account any information
about B.
P(A|B) is the conditional probability of A, given B. It is also called
the posterior probability because it is derived from or depends upon
the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called
the likelihood.
P(B) is the prior or marginal probability of B, and acts as a
normalizing constant.
9 / 85
A Simple Example
Consider two related variables:
1 Drug (D) with values y or n
2 Test (T) with values +ve or –ve
Initial Probabilities
P(D = y) = 0.001
P(T = +ve|D = y) = 0.8
P(T = +ve|D = n) = 0.01
10 / 85
A Simple Example
What is the probability that a person has taken the drug?
P(D = y|T = +ve) = [P(T = +ve|D = y) P(D = y)] / [P(T = +ve|D = y) P(D = y) + P(T = +ve|D = n) P(D = n)]
Let me develop the equation
Using simply
P (A, B) = P (A|B) P (B) (Chain Rule) (1)
11 / 85
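To make the numbers concrete, here is a minimal Python sketch that plugs the probabilities above into Bayes' theorem (the variable names are mine, not part of the original slides):

```python
# Minimal sketch: Bayes' theorem for the drug-test example above.
p_d = 0.001               # P(D = y), prior probability of having taken the drug
p_pos_given_d = 0.8       # P(T = +ve | D = y)
p_pos_given_not_d = 0.01  # P(T = +ve | D = n)

# Denominator: total probability of a positive test, P(T = +ve)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1.0 - p_d)

# Posterior: P(D = y | T = +ve)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))  # ~0.0741: a positive test is still weak evidence
```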
A More Complex Case
Increase Complexity
Suppose now that there is a similar link between Lung Cancer (L) and
a chest X-ray (X) and that we also have the following relationships:
History of smoking (S) has a direct influence on bronchitis (B) and
lung cancer (L);
L and B have a direct influence on fatigue (F).
Question
What is the probability that someone has bronchitis given that they
smoke, have fatigue and have received a positive X-ray result?
13 / 85
A More Complex Case
Short Hand
P(b1|s1, f1, x1) = P(b1, s1, f1, x1) / P(s1, f1, x1)
                 = Σ_l P(b1, s1, f1, x1, l) / Σ_{b,l} P(b, s1, f1, x1, l)
14 / 85
Values for the Complex Case
Table
Feature Value When the Feature Takes this Value
H h1 There is a history of smoking
h2 There is no history of smoking
B b1 Bronchitis is present
b2 Bronchitis is absent
L l1 Lung cancer is present
l2 Lung cancer is absent
F f1 Fatigue is present
f2 Fatigue is absent
C c1 Chest X-ray is positive
c2 Chest X-ray is negative
15 / 85
Problem with Large Instances
The joint probability distribution P(b, s, f, x, l)
For five binary variables there are 2^5 = 32 values in the joint
distribution (for 100 variables there are over 2^100 values)
How are these values to be obtained?
We can try to do inference
To obtain posterior distributions once some evidence is available
requires summation over an exponential number of terms!!!
Ok
We need something BETTER!!!
16 / 85
Bayesian Networks
Definition
A Bayesian network consists of
A Graph
Nodes represent the random variables.
Directed edges (arrows) between pairs of nodes.
It must be a Directed Acyclic Graph (DAG) – no directed cycles.
The graph represents independence relationships between variables.
This allows us to define
Conditional Probability Specifications:
The conditional probability of each variable given its parents in the
DAG.
18 / 85
Example
DAG for the previous Lung Cancer Problem
[Figure: DAG with nodes H, B, L, F, C and edges H → B, H → L, B → F, L → F, L → C]
19 / 85
Markov Condition
Definition
Suppose we have a joint probability distribution P of the random
variables in some set V and a DAG G = (V , E).
We say that (G, P) satisfies the Markov condition if, for each variable
X ∈ V, {X} is conditionally independent of the set of all its
non-descendants given the set of all its parents.
Notation
PA_X = set of parents of X.
ND_X = set of non-descendants of X.
We use the following notation
I_P({X}, ND_X | PA_X)
21 / 85
Example
We have that
[Figure: DAG with nodes H, B, L, F, C and edges H → B, H → L, B → F, L → F, L → C]
Given the previous DAG we have
Node PA_X Conditional Independence
C {L} I_P({C}, {H, B, F} | {L})
B {H} I_P({B}, {L, C} | {H})
F {B, L} I_P({F}, {H, C} | {B, L})
L {H} I_P({L}, {B} | {H})
23 / 85
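The parent sets and non-descendant sets in the table can be read off mechanically from the DAG. The following Python sketch (the helper names are mine) computes them for the example above and prints the corresponding Markov-condition independencies:

```python
# Sketch: list the Markov-condition independencies I_P({X}, ND_X | PA_X)
# for the lung-cancer DAG above (parent sets match the table on this slide).
parents = {
    "H": [],
    "B": ["H"],
    "L": ["H"],
    "F": ["B", "L"],
    "C": ["L"],
}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}
    stack, seen = list(children[node]), set()
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(children[d])
    return seen

for x in parents:
    # Non-descendants of x, excluding x itself and its parents.
    nd = set(parents) - descendants(x) - {x} - set(parents[x])
    if nd:  # H has no non-descendants here, so nothing is printed for it
        print(f"I_P({{{x}}}, {sorted(nd)} | {sorted(parents[x])})")
```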
Using the Markov Condition
First Decompose a Joint Distribution using the Chain Rule
P(c, f, l, b, h) = P(c|b, h, l, f) P(f|b, h, l) P(l|b, h) P(b|h) P(h) (2)
Using the Markov condition in the following DAG
[Figure: DAG with nodes H, B, L, F, C and edges H → B, H → L, B → F, L → F, L → C]
We have the following equivalences
P (c|b, h, l, f ) = P (c|l)
P (f |b, h, l) = P (f |b, l)
P (l|b, h) = P (l|h)
25 / 85
Using the Markov Condition
Finally
P(c, f, l, b, h) = P(c|l) P(f|b, l) P(l|h) P(b|h) P(h) (3)
26 / 85
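As a concrete illustration of equation (3), here is a small Python sketch that evaluates the factorized joint for one configuration. The numeric CPT values are made-up placeholders (the slides do not give them); only the factorization structure comes from the example.

```python
# Sketch: evaluate P(c, f, l, b, h) = P(c|l) P(f|b,l) P(l|h) P(b|h) P(h).
# All numeric values below are illustrative placeholders, not from the slides.
p_h = 0.2                                   # P(H = true)
p_b_given_h = {True: 0.25, False: 0.05}     # P(B = true | H)
p_l_given_h = {True: 0.003, False: 0.00005} # P(L = true | H)
p_c_given_l = {True: 0.6, False: 0.02}      # P(C = true | L)
p_f_given_bl = {(True, True): 0.75, (True, False): 0.10,
                (False, True): 0.50, (False, False): 0.05}  # P(F = true | B, L)

def bern(p_true, value):
    """P(X = value) for a binary X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(c, f, l, b, h):
    """Equation (3): P(c,f,l,b,h) = P(c|l) P(f|b,l) P(l|h) P(b|h) P(h)."""
    return (bern(p_c_given_l[l], c) * bern(p_f_given_bl[(b, l)], f)
            * bern(p_l_given_h[h], l) * bern(p_b_given_h[h], b) * bern(p_h, h))

print(joint(c=True, f=True, l=False, b=True, h=True))
```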
Representing the Joint Distribution
Theorem 1.4
If (G, P) satisfies the Markov condition, then P is equal to the product of
its conditional distributions of all nodes given values of their parents,
whenever these conditional distributions exist.
General Representation
In general, for a network with nodes X_1, X_2, ..., X_n:
P(x_1, x_2, ..., x_n) = Π_{i=1}^{n} P(x_i | PA(x_i))
28 / 85
Proof of Theorem 1.4
We prove the case where P is discrete
Order the nodes so that if Y is a descendant of Z, then Y follows Z in
the ordering (i.e., perform a topological sort).
This is called
an ancestral ordering.
29 / 85
Proof
For example
Two ancestral orderings are
[H, L, B, C, F] and [H, B, L, F, C] (4)
30 / 85
Proof
Now
Let X_1, X_2, ..., X_n be the resultant ordering.
For a given set of values x_1, x_2, ..., x_n,
let pa_i be the subset of these values containing the values of X_i's parents.
Thus, we need to prove that, whenever P(pa_i) ≠ 0 for 1 ≤ i ≤ n,
P(x_n, x_{n-1}, ..., x_1) = P(x_n|pa_n) P(x_{n-1}|pa_{n-1}) · · · P(x_1|pa_1) (5)
31 / 85
Proof
Something Notable
We show this using induction on the number of variables in the network.
Assume that P(pa_i) ≠ 0 for 1 ≤ i ≤ n for a combination of the x_i's
values.
Base Case of Induction
Since pa_1 is empty, then
P(x_1) = P(x_1|pa_1) (6)
Inductive Hypothesis
Suppose for this combination of values of the x_i's that
P(x_i, x_{i-1}, ..., x_1) = P(x_i|pa_i) P(x_{i-1}|pa_{i-1}) · · · P(x_1|pa_1) (7)
32 / 85
Proof
Inductive Step
We need to show for this combination of values of the x_i's that
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|pa_{i+1}) P(x_i|pa_i) · · · P(x_1|pa_1) (8)
Case 1
For this combination of values:
P(x_i, x_{i-1}, ..., x_1) = 0 (9)
By conditional probability, we have
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|x_i, ..., x_1) P(x_i, ..., x_1) = 0 (10)
33 / 85
Proof
Due to the previous equalities and the inductive hypothesis
There is some k, 1 ≤ k ≤ i, such that P(x_k|pa_k) = 0, because
P(x_i|pa_i) P(x_{i-1}|pa_{i-1}) · · · P(x_1|pa_1) = P(x_i, ..., x_1) = 0 (11)
Hence the right-hand side of (8) is also 0, and the equality holds.
Case 2
For this combination of values, P(x_i, x_{i-1}, ..., x_1) ≠ 0
34 / 85
Proof
Thus by the Rule of Conditional Probability
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|x_i, ..., x_1) P(x_i, ..., x_1)
Definition Markov Condition (Remember!!!)
Suppose we have a joint probability distribution P of the random
variables in some set V and a DAG G = (V , E).
We say that (G, P) satisfies the Markov condition if, for each variable
X ∈ V, {X} is conditionally independent of the set of all its
non-descendants given the set of all its parents.
35 / 85
Proof
Given the Markov condition and the fact that X_1, ..., X_i are all
non-descendants of X_{i+1}
We have that
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|pa_{i+1}) P(x_i, ..., x_1)
                          = P(x_{i+1}|pa_{i+1}) P(x_i|pa_i) · · · P(x_1|pa_1) (inductive hypothesis)
Q.E.D.
36 / 85
Now
OBSERVATIONS
1 An enormous saving can be made regarding the number of values
required for the joint distribution.
2 To determine the joint distribution directly for n binary variables, 2^n
values are required.
3 For a Bayesian Network with n binary variables in which each node has at
most k parents, at most 2^k · n values are required!!!
38 / 85
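As a quick sanity check of the claim, here is a tiny sketch comparing the two counts (the choice n = 100, k = 3 is just for illustration):

```python
# Sketch: full joint table size vs. Bayesian Network parameter count for
# n binary variables, each node having at most k parents.
n, k = 100, 3
full_joint = 2 ** n          # entries in the full joint distribution table
bn_upper_bound = n * 2 ** k  # at most 2^k conditional values per node
print(full_joint)            # about 1.27e30 entries
print(bn_upper_bound)        # 800
```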
It is more!!!
Theorem 1.5
Let a DAG G be given in which each node is a random variable, and
let a discrete conditional probability distribution of each node given
values of its parents in G be specified.
Then, the product of these conditional distributions yields a joint
probability distribution P of the variables, and (G, P) satisfies the
Markov condition.
Note
Notice that the theorem requires that specified conditional
distributions be discrete.
The result often still holds for continuous distributions.
39 / 85
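Theorem 1.5 can be checked numerically on a small example. The sketch below multiplies the conditional distributions of the v-structure Burglary → Alarm ← Earthquake (using the CPT values that appear later in these slides) and verifies that the product sums to 1 over all configurations:

```python
# Sketch: the product of per-node conditional distributions defines a proper
# joint distribution (it sums to 1) for the DAG Burglary -> Alarm <- Earthquake,
# with the CPT values used later in these slides.
from itertools import product

p_b, p_e = 0.001, 0.002
p_a_given_be = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

total = sum(bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)
            for b, e, a in product([True, False], repeat=3))
print(total)  # 1.0 (up to floating-point rounding), as Theorem 1.5 guarantees
```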
Causality in Bayesian Networks
Definition of a Cause
The one, such as a person, an event, or a condition, that is responsible for
an action or a result.
However
Although useful, this simple definition is certainly not the last word on the
concept of causation.
Actually, philosophers are still wrangling over the issue!!!
41 / 85
Causality in Bayesian Networks
Nevertheless, it sheds light on the issue
If the action of making a variable X take some value sometimes
changes the value taken by a variable Y, then:
Causality
Here, we assume X is responsible for sometimes changing Y ’s value
Thus, we conclude X is a cause of Y .
42 / 85
Furthermore
Formally
We say we manipulate X when we force X to take some value.
We say X causes Y if there is some manipulation of X that leads to
a change in the probability distribution of Y .
Thus
We assume causes and their effects are statistically correlated.
However
Variables can be correlated without one causing the other.
43 / 85
Precautionary Tale: Causality and Bayesian Networks
Important
Not every Bayesian Network describes causal relationships between the
variables.
Consider
Consider the dependence between Lung Cancer, L, and the X-ray
test, X.
By focusing on just these variables, we might be tempted to represent
them by the following Bayesian Network.
[Figure: a two-node DAG over L and X, with the edge drawn in one direction]
45 / 85
Precautionary Tale: Causality and Bayesian Networks
However, we can just as well draw the edge the other way
[Figure: the same two nodes with the edge between L and X reversed]
46 / 85
Remark
Be Careful
It is tempting to think that a Bayesian Network can be built simply by
drawing a DAG in which the edges represent direct causal relationships
between the variables.
47 / 85
However
Causal DAG
Given a set of variables V , if for every X, Y ∈ V we draw an edge from X
to Y ⇐⇒ X is a direct cause of Y relative to V , we call the resultant
DAG a causal DAG.
We want
If we create a causal DAG G = (V, E) and assume that the probability
distribution of the variables in V satisfies the Markov condition with G,
we say we are making the causal Markov assumption.
In General
The Markov condition holds for a causal DAG.
49 / 85
However, we still want to know if the Markov Condition
Holds
Remark
There are several things that a DAG needs to satisfy in order for the
Markov condition to hold.
Examples of those
Common Causes
Common Effects
50 / 85
How to have a Markov Assumption : Common Causes
Consider
[Figure: DAG with edges Smoking → Bronchitis and Smoking → Lung Cancer]
Markov condition
I_P({B}, {L} | {S}) ⇒ P(b|l, s) = P(b|s) (12)
51 / 85
How to have a Markov Assumption : Common Causes
If we know the causal relationships
S → B and S → L (13)
Now!!!
If we know the person is a smoker.
52 / 85
How to have a Markov Assumption : Common Causes
Then, because conditioning on Smoking blocks the flow of information
Finding out that he has Bronchitis will not give us any more information
about the probability of him having Lung Cancer.
Markov condition
It is satisfied!!!
53 / 85
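A small numerical check of equation (12): under the factorization P(s) P(b|s) P(l|s) implied by the DAG S → B, S → L, conditioning on S makes L irrelevant to B. The CPT values below are illustrative placeholders; only the structure comes from the slides.

```python
# Sketch: in the DAG S -> B, S -> L, verify P(b | l, s) = P(b | s).
# CPT values are illustrative placeholders.
p_s = 0.3
p_b_given_s = {True: 0.25, False: 0.05}
p_l_given_s = {True: 0.003, False: 0.00005}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, b, l):
    return bern(p_s, s) * bern(p_b_given_s[s], b) * bern(p_l_given_s[s], l)

s, l = True, True
p_b_given_ls = joint(s, True, l) / (joint(s, True, l) + joint(s, False, l))
print(p_b_given_ls, p_b_given_s[s])  # the two values coincide
```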
How to have a Markov Assumption : Common Effects
Consider
[Figure: DAG with edges Burglary → Alarm and Earthquake → Alarm]
Markov Condition
I_P({B}, {E}) ⇒ P(b|e) = P(b) (14)
Thus
We would expect Burglary and Earthquake to be independent of each
other which is in agreement with the Markov condition.
54 / 85
How to have a Markov Assumption : Common Effects
However
We would, however, expect them to be conditionally dependent given
Alarm.
Thus
If the alarm has gone off, news that there had been an earthquake would
‘explain away’ the idea that a burglary had taken place.
Then
Again in agreement with the Markov condition.
55 / 85
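This "explaining away" effect can be verified numerically with the CPT values of the alarm network shown later in these slides (P(B) = 0.001, P(E) = 0.002, and the P(A|B, E) table on the "Using the Structure I" slide). The sketch below compares P(B|A) with P(B|A, E):

```python
# Sketch of "explaining away" in the DAG Burglary -> Alarm <- Earthquake,
# using the CPT values from the alarm network shown later in these slides.
from itertools import product

p_b, p_e = 0.001, 0.002
p_a_given_be = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

# P(B = T | A = T): marginalize over E
num = sum(joint(True, e, True) for e in (True, False))
den = sum(joint(b, e, True) for b, e in product((True, False), repeat=2))
print(num / den)       # ~0.37

# P(B = T | A = T, E = T): the earthquake "explains away" the alarm
num_e = joint(True, True, True)
den_e = sum(joint(b, True, True) for b in (True, False))
print(num_e / den_e)   # ~0.003, much lower than ~0.37
```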
The Causal Markov Condition
What do we want?
The basic idea is that the Markov condition holds for a causal DAG.
56 / 85
Rules to construct A Causal Graph
Conditions
1 There must be no hidden common causes.
2 There must not be selection bias.
3 There must be no feedback loops.
Observations
Even with these conditions, there is a lot of controversy as to its validity.
It seems to be false in quantum mechanics.
57 / 85
Hidden Common Causes?
Given the following DAG
[Figure: DAG with a hidden common cause H of X and Y; Z appears below X and Y]
Something Notable
If a DAG is created on the basis of the causal relationships among only
the variables under consideration (without H), then X and Y would be
marginally independent according to the Markov condition.
Thus
If H is hidden, X and Y will normally be dependent, and the Markov
condition fails.
58 / 85
Inference in Bayesian Networks
What do we want from Bayesian Networks?
The main point of Bayesian Networks is to enable probabilistic inference
to be performed.
Two different types of inferences
1 Belief Updating.
2 Abduction Inference.
60 / 85
Inference in Bayesian Networks
Belief updating
It is used to obtain the posterior probability of one or more variables given
evidence concerning the values of other variables.
Abductive inference
It finds the most probable configuration of a set of variables (hypothesis)
given certain evidence.
61 / 85
Using the Structure I
Consider the following Bayesian Network
[Figure: DAG with edges Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls]
P(B) = 0.001     P(E) = 0.002
B E P(A|B,E)
T T 0.95
T F 0.94
F T 0.29
F F 0.001
A P(JC|A)
T 0.9
F 0.05
A P(MC|A)
T 0.7
F 0.01
Consider answering a query in a Bayesian Network
Q= set of query variables
e= evidence (set of instantiated variable-value pairs)
Inference = computation of conditional distribution P(Q|e)
62 / 85
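A brute-force way to answer such queries is to sum the factorized joint over every variable that is neither queried nor observed. The sketch below (the function and variable names are mine, not part of the slides) does this for the network above; for instance, it reproduces the well-known value P(Burglary = T | JohnCalls = T, MaryCalls = T) ≈ 0.28.

```python
# Sketch: inference by enumeration over the alarm network above.
from itertools import product

p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=T | B, E)
p_jc = {True: 0.9, False: 0.05}                     # P(JC=T | A)
p_mc = {True: 0.7, False: 0.01}                     # P(MC=T | A)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, jc, mc):
    """Factorized joint P(b) P(e) P(a|b,e) P(jc|a) P(mc|a)."""
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
            * bern(p_jc[a], jc) * bern(p_mc[a], mc))

def query(q_var, evidence):
    """Return P(q_var = T | evidence) by summing out all other variables."""
    order = ["B", "E", "A", "JC", "MC"]
    totals = {True: 0.0, False: 0.0}
    for values in product((True, False), repeat=len(order)):
        assignment = dict(zip(order, values))
        if all(assignment[v] == val for v, val in evidence.items()):
            totals[assignment[q_var]] += joint(*values)
    return totals[True] / (totals[True] + totals[False])

print(query("B", {"JC": True, "MC": True}))  # ~0.28
```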
Using the Structure II
Examples
P(burglary|alarm)
P(earthquake|JCalls, MCalls)
P(JCalls, MCalls|burglary, earthquake)
So
Can we use the structure of the Bayesian Network to answer such queries
efficiently?
Answer
YES
Note: Generally speaking, inference complexity decreases as the graph becomes sparser.
63 / 85
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
64 / 85
Example
DAG
[Figure: DAG with edges D → B, D → E, B → A, B → C, E → F, E → G]
We have the following model
p(a, b, c, d, e, f, g) is modeled by
p(a|b) p(c|b) p(f|e) p(g|e) p(b|d) p(e|d) p(d)
65 / 85
Example
DAG
[Figure: DAG as above]
We want to calculate the following
p(a|c, g)
66 / 85
Example
DAG
[Figure: DAG as above]
However, a direct calculation requires marginalizing over the hidden variables:
p(a|c, g) = \sum_{b,d,e,f} p(a, b, d, e, f | c, g)
For fixed values of a, c and g, this sum has complexity O(m^4), with m = max {|B|, |D|, |E|, |F|}.
67 / 85
Example
We now fix the values a = a_i, c = c_i, g = g_i (the query value and the evidence)
[Figure: DAG as above]
However, we re-express the equation using a chain-rule factorization:
p(a = a_i, b, d, e, f | c = c_i, g = g_i) = p(a = a_i|b) p(b|d, c = c_i) p(d|e) p(e, f | g = g_i)
68 / 85
Example
DAG
[Figure: DAG as above]
Now, we re-order the sum:
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) \sum_{e} p(d|e) \sum_{f} p(e, f | g = g_i)
69 / 85
Example
Now, using the relation involving E
[Figure: DAG as above]
Using this information, we can reduce one of the sums by marginalization:
\sum_{f} p(e, f | g = g_i) = p(e | g = g_i)
70 / 85
Example
DAG
[Figure: DAG as above]
Thus, we can reduce the size of our sum:
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) \sum_{e} p(d|e) p(e | g = g_i)
71 / 85
Example
DAG
[Figure: DAG as above]
Now, we can combine these terms using the chain rule (note that p(d|e) = p(d|e, g = g_i), since D is independent of G given E):
p(d|e) p(e | g = g_i) = p(d|e, g = g_i) p(e | g = g_i) = p(d, e | g = g_i)
72 / 85
Example
DAG
[Figure: DAG as above]
Substituting this back into the sum, we get
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) \sum_{e} p(d, e | g = g_i)
73 / 85
Example
DAG
[Figure: DAG as above]
Now, we sum over all possible values of E:
\sum_{e} p(d, e | g = g_i) = p(d | g = g_i)
74 / 85
Example
DAG
[Figure: DAG as above]
We get the following
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) p(d | g = g_i)
75 / 85
Example
DAG
[Figure: DAG as above]
Again, the chain rule for D:
p(b|d, c = c_i) p(d | g = g_i) = p(b|d, c = c_i, g = g_i) p(d | c = c_i, g = g_i) = p(b, d | c = c_i, g = g_i)
76 / 85
Example
DAG
[Figure: DAG as above]
Now, we sum over all possible values of D:
\sum_{b} p(a = a_i|b) p(b | c = c_i, g = g_i)
77 / 85
Example
DAG
[Figure: DAG as above]
Now, we use the chain rule to reduce again:
p(a = a_i|b) p(b | c = c_i, g = g_i) = p(a = a_i, b | c = c_i, g = g_i)
78 / 85
Example
DAG
[Figure: DAG as above]
Finally, we sum over all possible values of B:
\sum_{b} p(a = a_i, b | c = c_i, g = g_i) = p(a = a_i | c = c_i, g = g_i)
79 / 85
Complexity
Because this can be computed as a sequence of four small for loops (one per eliminated variable),
each summation now runs over a single variable in O(m) terms, compared with the O(m^4) terms of the direct marginalization.
80 / 85
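The saving can be checked in code. The sketch below (my own, with made-up CPT values; only the graph structure comes from the slides) computes p(a | c, g) twice: once by the direct sum over b, d, e, f, and once as a sequence of single-variable summations pushed inward over the factorization p(d) p(b|d) p(e|d) p(a|b) p(c|b) p(f|e) p(g|e). This is the same reordering idea as above, applied directly to the network's own factors rather than to the evidence-conditioned terms.

```python
# Hypothetical binary CPTs for the DAG D -> B, D -> E, B -> A, B -> C,
# E -> F, E -> G.  All probability values below are made up.
from itertools import product

vals = (True, False)
p_d_true = 0.30
cpt_b = {(True,): 0.70, (False,): 0.20}   # P(B=T | D)
cpt_e = {(True,): 0.60, (False,): 0.10}   # P(E=T | D)
cpt_a = {(True,): 0.90, (False,): 0.40}   # P(A=T | B)
cpt_c = {(True,): 0.80, (False,): 0.30}   # P(C=T | B)
cpt_f = {(True,): 0.50, (False,): 0.25}   # P(F=T | E)
cpt_g = {(True,): 0.65, (False,): 0.05}   # P(G=T | E)

def bern(p_true, x):
    return p_true if x else 1.0 - p_true

def joint(a, b, c, d, e, f, g):
    return (bern(p_d_true, d) * bern(cpt_b[(d,)], b) * bern(cpt_e[(d,)], e) *
            bern(cpt_a[(b,)], a) * bern(cpt_c[(b,)], c) *
            bern(cpt_f[(e,)], f) * bern(cpt_g[(e,)], g))

c_i, g_i = True, True      # the observed evidence

def p_a_direct(a):
    """Direct marginalization: one big sum over b, d, e, f (O(m^4) terms)."""
    return sum(joint(a, b, c_i, d, e, f, g_i)
               for b, d, e, f in product(vals, repeat=4))

def p_a_reordered(a):
    """Summations pushed inward: eliminate F, then E, then D, then B."""
    msg_e = {e: bern(cpt_g[(e,)], g_i) * sum(bern(cpt_f[(e,)], f) for f in vals)
             for e in vals}                                    # F summed out
    msg_d = {d: sum(bern(cpt_e[(d,)], e) * msg_e[e] for e in vals)
             for d in vals}                                    # E summed out
    msg_b = {b: sum(bern(p_d_true, d) * bern(cpt_b[(d,)], b) * msg_d[d]
                    for d in vals)
             for b in vals}                                    # D summed out
    return sum(bern(cpt_a[(b,)], a) * bern(cpt_c[(b,)], c_i) * msg_b[b]
               for b in vals)                                  # finally sum B

norm = sum(p_a_direct(a) for a in vals)
for a in vals:
    assert abs(p_a_direct(a) - p_a_reordered(a)) < 1e-12       # same answer
print("p(a=true | c, g) =", p_a_reordered(True) / norm)
```

The assertion confirms that reordering the sums changes only the amount of work, not the result.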
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
81 / 85
General Strategy for Inference
Query
Want to compute P(q|e)!!!
Step 1
P(q|e) = P(q, e)/P(e) = α P(q, e), where α = 1/P(e) is a constant with respect to Q.
Step 2
P(q, e) = \sum_{a,...,z} P(q, e, a, b, ..., z), summing over the remaining (hidden) variables a, ..., z, by the law of total probability.
82 / 85
General Strategy for Inference
Step 3
\sum_{a,...,z} P(q, e, a, b, ..., z) = \sum_{a,...,z} \prod_i P(x_i | parents(X_i)) (using the Bayesian network factorization)
Step 4
Distribute summations across product terms for efficient computation.
83 / 85
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
84 / 85
Inference – An Overview
Case 1
Trees and singly connected networks – only one path between any two
nodes:
Message passing (Pearl, 1988)
Case 2
Multiply connected networks:
A range of algorithms including cut-set conditioning (Pearl, 1988),
junction tree propagation (Lauritzen and Spiegelhalter, 1988), bucket
elimination (Dechter, 1996) to mention a few.
A range of algorithms for approximate inference.
Notes
Both exact and approximate inference are NP-hard in the worst case.
Here the focus will be on message passing and junction tree
propagation for discrete variables.
85 / 85

More Related Content

DOCX
A highly scalable key pre distribution scheme for wireless sensor networks
DOCX
A highly scalable key pre distribution scheme for wireless sensor networks
PDF
AI-SDV 2021: Mazahir Bhagat - Mapping Canadian Patented Inventions
PPTX
Knowledge representation and reasoning
PDF
Sementic nets
PPTX
Bayesian networks and the search for causality
PDF
2014 Best Sports Cars
A highly scalable key pre distribution scheme for wireless sensor networks
A highly scalable key pre distribution scheme for wireless sensor networks
AI-SDV 2021: Mazahir Bhagat - Mapping Canadian Patented Inventions
Knowledge representation and reasoning
Sementic nets
Bayesian networks and the search for causality
2014 Best Sports Cars

Viewers also liked (14)

PPTX
Accelerate Sales and Increase Revenue
PDF
Artificial Intelligence 02 uninformed search
PDF
Tea vs-coffee
PDF
The design of things you don't want to think about — WIAD 2016 Jönköping
PDF
Xtext project and PhDs in Gemany
PPT
9.6 El modelado Glaciar
PDF
HOW TO CHECK DELETED WHATSAPP MESSAGES ON IPHONE
PPT
Lipinski Jmrc Lecture1 Nov2008
PDF
Delphi7 oyutnii garin awlaga 2006 muis
PPTX
Music video audience profile
PDF
رهبری تیمهای نوآور
PPT
How to Look at Art
PPTX
Representation and organization of knowledge in memory
PDF
9 einfache Ideen für individuelle Bildmotive
Accelerate Sales and Increase Revenue
Artificial Intelligence 02 uninformed search
Tea vs-coffee
The design of things you don't want to think about — WIAD 2016 Jönköping
Xtext project and PhDs in Gemany
9.6 El modelado Glaciar
HOW TO CHECK DELETED WHATSAPP MESSAGES ON IPHONE
Lipinski Jmrc Lecture1 Nov2008
Delphi7 oyutnii garin awlaga 2006 muis
Music video audience profile
رهبری تیمهای نوآور
How to Look at Art
Representation and organization of knowledge in memory
9 einfache Ideen für individuelle Bildmotive
Ad

Similar to Artificial Intelligence 06.01 introduction bayesian_networks (20)

PDF
The Bayesia Portfolio of Research Software
PDF
Bayesianmd2
PDF
BayesiaLab_Book_V18 (1)
PDF
Bayesian inference and big data: are we there yet? by Jose Luis Hidalgo at Bi...
PPT
Basen Network
PDF
Lecture9 - Bayesian-Decision-Theory
PDF
Machine Learning History & Prbabilistic modelling
PDF
Graphical Models 4dummies
PDF
Bayesian networks
PDF
BayesiaLab 5.0 Introduction
PDF
Brief bibliography of interestingness measure, bayesian belief network and ca...
PDF
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
PDF
Bayesian Networks - A Brief Introduction
PPTX
Knowledge & Reasoning for Students study
PPTX
Knowledge & Reasoning.ppt for students study
PDF
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
PDF
Principles of Health Informatics: Artificial intelligence and machine learning
PPTX
Bayesian probabilistic interference
PPTX
Bayesian probabilistic interference
PDF
Impact of big data congestion in IT: An adaptive knowledgebased Bayesian network
The Bayesia Portfolio of Research Software
Bayesianmd2
BayesiaLab_Book_V18 (1)
Bayesian inference and big data: are we there yet? by Jose Luis Hidalgo at Bi...
Basen Network
Lecture9 - Bayesian-Decision-Theory
Machine Learning History & Prbabilistic modelling
Graphical Models 4dummies
Bayesian networks
BayesiaLab 5.0 Introduction
Brief bibliography of interestingness measure, bayesian belief network and ca...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Bayesian Networks - A Brief Introduction
Knowledge & Reasoning for Students study
Knowledge & Reasoning.ppt for students study
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
Principles of Health Informatics: Artificial intelligence and machine learning
Bayesian probabilistic interference
Bayesian probabilistic interference
Impact of big data congestion in IT: An adaptive knowledgebased Bayesian network
Ad

More from Andres Mendez-Vazquez (20)

PDF
2.03 bayesian estimation
PDF
05 linear transformations
PDF
01.04 orthonormal basis_eigen_vectors
PDF
01.03 squared matrices_and_other_issues
PDF
01.02 linear equations
PDF
01.01 vector spaces
PDF
06 recurrent neural_networks
PDF
05 backpropagation automatic_differentiation
PDF
Zetta global
PDF
01 Introduction to Neural Networks and Deep Learning
PDF
25 introduction reinforcement_learning
PDF
Neural Networks and Deep Learning Syllabus
PDF
Introduction to artificial_intelligence_syllabus
PDF
Ideas 09 22_2018
PDF
Ideas about a Bachelor in Machine Learning/Data Sciences
PDF
Analysis of Algorithms Syllabus
PDF
20 k-means, k-center, k-meoids and variations
PDF
18.1 combining models
PDF
17 vapnik chervonenkis dimension
PDF
A basic introduction to learning
2.03 bayesian estimation
05 linear transformations
01.04 orthonormal basis_eigen_vectors
01.03 squared matrices_and_other_issues
01.02 linear equations
01.01 vector spaces
06 recurrent neural_networks
05 backpropagation automatic_differentiation
Zetta global
01 Introduction to Neural Networks and Deep Learning
25 introduction reinforcement_learning
Neural Networks and Deep Learning Syllabus
Introduction to artificial_intelligence_syllabus
Ideas 09 22_2018
Ideas about a Bachelor in Machine Learning/Data Sciences
Analysis of Algorithms Syllabus
20 k-means, k-center, k-meoids and variations
18.1 combining models
17 vapnik chervonenkis dimension
A basic introduction to learning

Recently uploaded (20)

PDF
PPT on Performance Review to get promotions
PPT
introduction to datamining and warehousing
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
PPTX
Safety Seminar civil to be ensured for safe working.
PPTX
UNIT - 3 Total quality Management .pptx
PPT
A5_DistSysCh1.ppt_INTRODUCTION TO DISTRIBUTED SYSTEMS
PDF
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
PDF
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPTX
Artificial Intelligence
PDF
737-MAX_SRG.pdf student reference guides
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PPTX
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
PPTX
Current and future trends in Computer Vision.pptx
PPT
Total quality management ppt for engineering students
PPT on Performance Review to get promotions
introduction to datamining and warehousing
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
Fundamentals of safety and accident prevention -final (1).pptx
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
Safety Seminar civil to be ensured for safe working.
UNIT - 3 Total quality Management .pptx
A5_DistSysCh1.ppt_INTRODUCTION TO DISTRIBUTED SYSTEMS
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
R24 SURVEYING LAB MANUAL for civil enggi
Artificial Intelligence
737-MAX_SRG.pdf student reference guides
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
Current and future trends in Computer Vision.pptx
Total quality management ppt for engineering students

Artificial Intelligence 06.01 introduction bayesian_networks

  • 1. Artificial Intelligence Introduction to Bayesian Networks Andres Mendez-Vazquez March 2, 2016 1 / 85
  • 2. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 2 / 85
  • 3. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 3 / 85
  • 4. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 5. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 6. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 7. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 8. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 9. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 10. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 11. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 12. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 13. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 14. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 15. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 16. Furtheron 2000-2015 Bayesian Networks are use in Spam Detection. Gene Dicovery. Signal Processing. Ranking. Forecasting. etc. Something Notable We are interested more and more on building automatically Bayesian Networks using data!!! 6 / 85
  • 17. Furtheron 2000-2015 Bayesian Networks are use in Spam Detection. Gene Dicovery. Signal Processing. Ranking. Forecasting. etc. Something Notable We are interested more and more on building automatically Bayesian Networks using data!!! 6 / 85
  • 18. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 19. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 20. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 21. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 22. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 8 / 85
  • 23. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 24. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 25. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 26. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 27. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 28. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 29. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 30. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 31. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 32. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 33. A Simple Example What is the probability that a person has taken the drug? P (D = y|T = +ve) = P (T = +ve|D = y) P (D=y) P (T = +ve|D = y) P (D=y) + P (T = +ve|D = n) P (D=n) Let me develop the equation Using simply P (A, B) = P (A|B) P (B) (Chain Rule) (1) 11 / 85
  • 34. A Simple Example What is the probability that a person has taken the drug? P (D = y|T = +ve) = P (T = +ve|D = y) P (D=y) P (T = +ve|D = y) P (D=y) + P (T = +ve|D = n) P (D=n) Let me develop the equation Using simply P (A, B) = P (A|B) P (B) (Chain Rule) (1) 11 / 85
  • 35. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 12 / 85
  • 36. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 37. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 38. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 39. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 40. A More Complex Case Short Hand P (b1|s1, f1, x1) = P (b1, s1, f1, x1) P (s1, f1, x1) = l P (b1, s1, f1, x1, l) b,l P (b, s1, f1, x1, l) 14 / 85
  • 41. Values for the Complex Case Table Feature Value When the Feature Takes this Value H h1 There is a history of smoking h2 There is no history of smoking B b1 Bronchitis is present b2 Bronchitis is absent L l1 Lung cancer is present l2 Lung cancer is absent F f1 Fatigue is present f2 Fatigue is absent C c1 Chest X-ray is positive c2 Chest X-ray is negative 15 / 85
  • 42. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 43. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 44. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 45. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 46. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 17 / 85
  • 47. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 48. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 49. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 50. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 51. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 52. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 53. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 54. Example DAG for the previous Lung Cancer Problem H B L F C 19 / 85
  • 55. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 20 / 85
  • 56. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 57. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 58. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 59. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 60. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 61. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 22 / 85
  • 62. Example We have that H B L F C Given the previous DAG we have Node PA Conditional Independence C {L} IP ({C} , {H, B, F} | {L}) B {H} IP ({B} , {L, C} | {H}) F {B, L} IP ({F} , {H, C} | {B, L}) L {H} IP ({L} , {B} | {H}) 23 / 85
  • 63. Example We have that H B L F C Given the previous DAG we have Node PA Conditional Independence C {L} IP ({C} , {H, B, F} | {L}) B {H} IP ({B} , {L, C} | {H}) F {B, L} IP ({F} , {H, C} | {B, L}) L {H} IP ({L} , {B} | {H}) 23 / 85
  • 64. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 24 / 85
  • 65. Using the Markov Condition First Decompose a Joint Distribution using the Chain Rule P (c, f , l, b, h) = P (c|b, s, l, f ) P (f |b, h, l) P (l|b, h) P (b|h) P (h) (2) Using the Markov condition in the following DAG We have the following equivalences P (c|b, h, l, f ) = P (c|l) P (f |b, h, l) = P (f |b, l) P (l|b, h) = P (l|h) 25 / 85
  • 66. Using the Markov Condition First Decompose a Joint Distribution using the Chain Rule P (c, f , l, b, h) = P (c|b, s, l, f ) P (f |b, h, l) P (l|b, h) P (b|h) P (h) (2) Using the Markov condition in the following DAG H B L F C We have the following equivalences P (c|b, h, l, f ) = P (c|l) P (f |b, h, l) = P (f |b, l) P (l|b, h) = P (l|h) 25 / 85
  • 67. Using the Markov Condition First Decompose a Joint Distribution using the Chain Rule P (c, f , l, b, h) = P (c|b, s, l, f ) P (f |b, h, l) P (l|b, h) P (b|h) P (h) (2) Using the Markov condition in the following DAG H B L F C We have the following equivalences P (c|b, h, l, f ) = P (c|l) P (f |b, h, l) = P (f |b, l) P (l|b, h) = P (l|h) 25 / 85
  • 68. Using the Markov Condition Finally P (c, f , l, b, h) = P (c|l) P (f |b, l) P (l|h) P (b|h) P (h) (3) 26 / 85
  • 69. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 27 / 85
  • 70. Representing the Joint Distribution Theorem 1.4 If (G, P) satisfies the Markov condition, then P is equal to the product of its conditional distributions of all nodes given values of their parents, whenever these conditional distributions exist. General Representation In general, for a network with nodes X1, X2, ..., Xn ⇒ P (x1, x2, ..., xn) = n i=1 P (xi|PA (xi)) 28 / 85
  • 71. Representing the Joint Distribution Theorem 1.4 If (G, P) satisfies the Markov condition, then P is equal to the product of its conditional distributions of all nodes given values of their parents, whenever these conditional distributions exist. General Representation In general, for a network with nodes X1, X2, ..., Xn ⇒ P (x1, x2, ..., xn) = n i=1 P (xi|PA (xi)) 28 / 85
  • 72. Proof of Theorem 1.4 We prove the case where P is discrete Order the nodes so that if Y is a descendant of Z, then Y follows Z in the ordering. Topological Sorting. This is called Ancestral ordering. 29 / 85
  • 73. Proof of Theorem 1.4 We prove the case where P is discrete Order the nodes so that if Y is a descendant of Z, then Y follows Z in the ordering. Topological Sorting. This is called Ancestral ordering. 29 / 85
  • 74. Proof of Theorem 1.4 We prove the case where P is discrete Order the nodes so that if Y is a descendant of Z, then Y follows Z in the ordering. Topological Sorting. This is called Ancestral ordering. 29 / 85
  • 75. Proof For example The ancestral ordering are [H, L, B, C, F] and [H, B, L, F, C] (4) 30 / 85
  • 76. Proof For example The ancestral ordering are [H, L, B, C, F] and [H, B, L, F, C] (4) 30 / 85
  • 77. Proof Now Let X1, X2, ..., Xn be the resultant ordering. For a given set of values of x1, x2, ..., xn Let pai be the subsets of these values containing the values of Xi s parents Thus, we need to prove that whenever P (pai) = 0 for 1 ≤ i ≤ n P (xn, xn−1, ..., x1) = P (xn|pan) P xn−1|pan−1 ...P (x1|pa1) (5) 31 / 85
  • 78. Proof Now Let X1, X2, ..., Xn be the resultant ordering. For a given set of values of x1, x2, ..., xn Let pai be the subsets of these values containing the values of Xi s parents Thus, we need to prove that whenever P (pai) = 0 for 1 ≤ i ≤ n P (xn, xn−1, ..., x1) = P (xn|pan) P xn−1|pan−1 ...P (x1|pa1) (5) 31 / 85
  • 79. Proof Now Let X1, X2, ..., Xn be the resultant ordering. For a given set of values of x1, x2, ..., xn Let pai be the subsets of these values containing the values of Xi s parents Thus, we need to prove that whenever P (pai) = 0 for 1 ≤ i ≤ n P (xn, xn−1, ..., x1) = P (xn|pan) P xn−1|pan−1 ...P (x1|pa1) (5) 31 / 85
  • 80. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 81. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 82. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 83. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 84. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 85. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 86. Proof Inductive Step We need show for this combination of values of the xi’s that P (xi+1, xi, ..., x1) = P xi+1|pai+1 P (xi|pai) ...P (x1|pa1) (8) Case 1 For this combination of values: P (xi, xi−1, ..., x1) = 0 (9) By Conditional Probability, we have P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) = 0 (10) 33 / 85
  • 87. Proof Inductive Step We need show for this combination of values of the xi’s that P (xi+1, xi, ..., x1) = P xi+1|pai+1 P (xi|pai) ...P (x1|pa1) (8) Case 1 For this combination of values: P (xi, xi−1, ..., x1) = 0 (9) By Conditional Probability, we have P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) = 0 (10) 33 / 85
  • 88. Proof Inductive Step We need show for this combination of values of the xi’s that P (xi+1, xi, ..., x1) = P xi+1|pai+1 P (xi|pai) ...P (x1|pa1) (8) Case 1 For this combination of values: P (xi, xi−1, ..., x1) = 0 (9) By Conditional Probability, we have P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) = 0 (10) 33 / 85
  • 89. Proof Due to the previous equalities and the inductive hypothesis There is some k, 1 ≤ k ≤ i such that P (xk|pak) = 0 because after all P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) = 0 (11) Thus, the equality holds Now for the Case 2 Case 2 For this combination of values P (xi, xi−1, ..., x1) = 0 34 / 85
  • 90. Proof Due to the previous equalities and the inductive hypothesis There is some k, 1 ≤ k ≤ i such that P (xk|pak) = 0 because after all P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) = 0 (11) Thus, the equality holds Now for the Case 2 Case 2 For this combination of values P (xi, xi−1, ..., x1) = 0 34 / 85
  • 91. Proof Due to the previous equalities and the inductive hypothesis There is some k, 1 ≤ k ≤ i such that P (xk|pak) = 0 because after all P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) = 0 (11) Thus, the equality holds Now for the Case 2 Case 2 For this combination of values P (xi, xi−1, ..., x1) = 0 34 / 85
  • 92. Proof Thus by the Rule of Conditional Probability P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) Definition Markov Condition (Remember!!!) Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. 35 / 85
  • 95. Proof Given this Markov condition and the fact that X_1, ..., X_i are all non-descendants of X_{i+1}, we have P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1} | pa_{i+1}) P(x_i, ..., x_1) = P(x_{i+1} | pa_{i+1}) P(x_i | pa_i) · · · P(x_1 | pa_1) (by the inductive hypothesis). Q.E.D. 36 / 85
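To make the factorization concrete, here is a minimal numeric sketch (not from the slides) on a hypothetical three-node chain X1 → X2 → X3 with made-up CPTs: the joint built as the product of the local conditional distributions coincides with its own chain-rule expansion, and the Markov condition P(x_3 | x_2, x_1) = P(x_3 | x_2) holds, mirroring the steps of the proof.

```python
# Minimal numeric sketch (hypothetical chain X1 -> X2 -> X3, invented CPTs).
import itertools

p_x1 = {0: 0.6, 1: 0.4}                      # P(X1)
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},        # P(X2 | X1)
                 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1},        # P(X3 | X2)
                 1: {0: 0.5, 1: 0.5}}

# Joint distribution built as the product of the conditional distributions.
joint = {(x1, x2, x3): p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]
         for x1, x2, x3 in itertools.product((0, 1), repeat=3)}

def marg(assign):
    """Marginal probability of a partial assignment {position: value}."""
    return sum(p for xs, p in joint.items()
               if all(xs[i] == v for i, v in assign.items()))

for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    # Chain rule applied to the joint: P(x1) P(x2|x1) P(x3|x2,x1) ...
    chain = (marg({0: x1})
             * (marg({0: x1, 1: x2}) / marg({0: x1}))
             * (joint[(x1, x2, x3)] / marg({0: x1, 1: x2})))
    # ... equals the product of the local CPDs, as the theorem asserts.
    product = p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]
    assert abs(chain - product) < 1e-12
    # Markov condition on the chain: P(x3 | x2, x1) = P(x3 | x2).
    assert abs(joint[(x1, x2, x3)] / marg({0: x1, 1: x2})
               - p_x3_given_x2[x2][x3]) < 1e-12
print("chain-rule factorization and Markov condition hold for all 8 assignments")
```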
  • 97. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 37 / 85
  • 98. Now OBSERVATIONS 1 An enormous saving can be made in the number of values required to specify the joint distribution. 2 To specify the joint distribution directly over n binary variables, 2^n values are required. 3 For a Bayesian Network with n binary variables in which each node has at most k parents, fewer than 2^k · n values are required!!! 38 / 85
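A quick worked comparison (the values of n and k below are chosen arbitrarily for illustration):

```python
# Illustrative parameter counts for n binary variables, at most k parents each.
n, k = 30, 3
full_joint = 2 ** n - 1       # independent entries of the full joint table
bayes_net = n * 2 ** k        # upper bound on the CPT entries of the network
print(full_joint, bayes_net)  # 1073741823 vs. 240
```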
  • 101. It is more!!! Theorem 1.5 Let a DAG G be given in which each node is a random variable, and let a discrete conditional probability distribution of each node given values of its parents in G be specified. Then the product of these conditional distributions yields a joint probability distribution P of the variables, and (G, P) satisfies the Markov condition. Note: The theorem requires the specified conditional distributions to be discrete; the result often still holds in the continuous case. 39 / 85
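A small sketch of Theorem 1.5 on a hypothetical v-structure B → A ← E (all probability values below are invented): the product of the specified conditional distributions sums to 1, so it really is a joint distribution, and it satisfies the Markov condition, here visible as the marginal independence of the two parents.

```python
import itertools

p_b = {1: 0.3, 0: 0.7}                       # P(B)
p_e = {1: 0.1, 0: 0.9}                       # P(E)
p_a1 = {(1, 1): 0.95, (1, 0): 0.8,           # P(A=1 | B, E)
        (0, 1): 0.4, (0, 0): 0.05}

def p_a(a, b, e):
    return p_a1[(b, e)] if a == 1 else 1.0 - p_a1[(b, e)]

# Product of the conditional distributions ...
joint = {(b, e, a): p_b[b] * p_e[e] * p_a(a, b, e)
         for b, e, a in itertools.product((0, 1), repeat=3)}

# ... is a genuine joint probability distribution: it sums to 1 ...
assert abs(sum(joint.values()) - 1.0) < 1e-12

# ... and (G, P) satisfies the Markov condition: the parentless nodes B and E
# are non-descendants of each other, hence marginally independent under P.
for b, e in itertools.product((0, 1), repeat=2):
    p_be = sum(joint[(b, e, a)] for a in (0, 1))
    assert abs(p_be - p_b[b] * p_e[e]) < 1e-12
print("the product of the CPDs is a valid joint satisfying the Markov condition")
```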
  • 105. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 40 / 85
  • 106. Causality in Bayesian Networks Definition of a Cause: that (a person, an event, or a condition) which is responsible for an action or a result. However: Although useful, this simple definition is certainly not the last word on the concept of causation. Actually: Philosophers are still wrangling over the issue!!! 41 / 85
  • 109. Causality in Bayesian Networks Nevertheless, it sheds light on the issue: if the action of making variable X take some value sometimes changes the value taken by a variable Y, then we assume X is responsible for sometimes changing Y's value, and we conclude that X is a cause of Y. 42 / 85
  • 111. Furthermore Formally We say we manipulate X when we force X to take some value. We say X causes Y if there is some manipulation of X that leads to a change in the probability distribution of Y . Thus We assume causes and their effects are statistically correlated. However Variables can be correlated without one causing the other. 43 / 85
  • 115. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 44 / 85
  • 119. Precautionary Tale: Causality and Bayesian Networks Important: Not every Bayesian Network describes causal relationships between the variables. Consider the dependence between Lung Cancer, L, and the X-ray test, X. By focusing on just these two variables, we might be tempted to represent them by the following two-node network. [Figure: two-node network linking L and X] 45 / 85
  • 120. Precautionary Tale: Causality and Bayesian Networks However, we could just as well represent the same dependence with the edge between L and X oriented the other way. [Figure: the same two nodes, edge reversed] 46 / 85
  • 121. Remark Be Careful: It is tempting to think that Bayesian Networks can be created simply by drawing a DAG whose edges represent direct causal relationships between the variables. 47 / 85
  • 122. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 48 / 85
  • 123. However Causal DAG: Given a set of variables V, if for every X, Y ∈ V we draw an edge from X to Y ⇐⇒ X is a direct cause of Y relative to V, we call the resultant DAG a causal DAG. We want: If we create a causal DAG G = (V, E) and assume the probability distribution of the variables in V satisfies the Markov condition with G, we say we are making the causal Markov assumption. In General: The Markov condition holds for a causal DAG. 49 / 85
  • 127. However, we still want to know if the Markov Condition Holds Remark: There are several things the DAG needs to satisfy in order for the Markov condition to hold. Two illustrative situations: Common Causes and Common Effects. 50 / 85
  • 130. How to have a Markov Assumption: Common Causes Consider [Figure: Smoking → Bronchitis, Smoking → Lung Cancer] Markov condition: I_P({B}, {L} | {S}) ⇒ P(b | l, s) = P(b | s) (12) 51 / 85
  • 132. How to have a Markov Assumption: Common Causes If we know the causal relationships S → B and S → L (13) Now!!! Suppose we know the person is a smoker. 52 / 85
  • 134. How to have a Markov Assumption: Common Causes Then, because Smoking blocks the flow of information, finding out that he has Bronchitis will not give us any more information about the probability of him having Lung Cancer. Markov condition: It is satisfied!!! 53 / 85
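A short numeric check of I_P({B}, {L} | {S}) for this common-cause structure (the CPT numbers below are invented for illustration):

```python
# Common cause S -> B, S -> L: verify P(b | l, s) = P(b | s) numerically.
import itertools

p_s = {1: 0.25, 0: 0.75}          # P(S)
p_b_given_s = {1: 0.6, 0: 0.1}    # P(B=1 | S)
p_l_given_s = {1: 0.3, 0: 0.02}   # P(L=1 | S)

def bern(p_true, x):
    return p_true if x == 1 else 1.0 - p_true

joint = {(s, b, l): p_s[s] * bern(p_b_given_s[s], b) * bern(p_l_given_s[s], l)
         for s, b, l in itertools.product((0, 1), repeat=3)}

for s, b, l in itertools.product((0, 1), repeat=3):
    p_sl = sum(joint[(s, bb, l)] for bb in (0, 1))   # P(s, l)
    p_sb = sum(joint[(s, b, ll)] for ll in (0, 1))   # P(s, b)
    assert abs(joint[(s, b, l)] / p_sl - p_sb / p_s[s]) < 1e-12
print("P(b | l, s) = P(b | s): B and L are independent given S")
```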
  • 136. How to have a Markov Assumption: Common Effects Consider [Figure: Burglary → Alarm ← Earthquake] Markov Condition: I_P({B}, {E}) ⇒ P(b | e) = P(b) (14) Thus: We would expect Burglary and Earthquake to be independent of each other, which is in agreement with the Markov condition. 54 / 85
  • 139. How to have a Markov Assumption: Common Effects However: We would expect them to be conditionally dependent given Alarm. Thus: If the alarm has gone off, news that there had been an earthquake would 'explain away' the idea that a burglary had taken place. Then: This is again in agreement with the Markov condition. 55 / 85
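'Explaining away' can be seen numerically. The sketch below uses the Burglary/Earthquake/Alarm CPTs that appear later in these slides (JohnCalls and MaryCalls are omitted, which does not affect these two queries):

```python
# Explaining away on Burglary -> Alarm <- Earthquake.
import itertools

p_b, p_e = 0.001, 0.002
p_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)

def joint(b, e, a):
    pa = p_a_given[(b, e)]
    return ((p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
            * (pa if a else 1 - pa))

# P(B=T | A=T): the alarm alone makes a burglary far more likely than 0.001.
num = sum(joint(True, e, True) for e in (True, False))
den = sum(joint(b, e, True)
          for b, e in itertools.product((True, False), repeat=2))
print(num / den)        # approximately 0.374

# P(B=T | A=T, E=T): learning of the earthquake explains the alarm away,
# so the probability of a burglary drops again.
print(joint(True, True, True)
      / sum(joint(b, True, True) for b in (True, False)))  # approximately 0.0033
```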
  • 142. The Causal Markov Condition What do we want? The basic idea is that the Markov condition holds for a causal DAG. 56 / 85
  • 143. Rules to construct A Causal Graph Conditions: 1 There must be no hidden common causes. 2 There must be no selection bias. 3 There must be no feedback loops. Observations: Even with these conditions there is a lot of controversy as to the assumption's validity. It appears to fail in quantum-mechanical settings. 57 / 85
  • 148. Hidden Common Causes? Given the following DAG [Figure: DAG over X, Y and Z with H a hidden common cause of X and Y] Something Notable: If a DAG is created on the basis of the causal relationships among only the variables under consideration, then X and Y would be marginally independent according to the Markov condition. Thus: If H is hidden, they will normally be dependent, and the Markov condition fails. 58 / 85
  • 151. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 59 / 85
  • 152. Inference in Bayesian Networks What do we want from Bayesian Networks? The main point of Bayesian Networks is to enable probabilistic inference to be performed. Two different types of inference: 1 Belief Updating. 2 Abductive Inference. 60 / 85
  • 155. Inference in Bayesian Networks Belief updating It is used to obtain the posterior probability of one or more variables given evidence concerning the values of other variables. Abductive inference It finds the most probable configuration of a set of variables (hypothesis) given certain evidence. 61 / 85
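A tiny sketch contrasting the two (a hypothetical two-node network S → B with invented numbers): belief updating returns a posterior distribution over the query variable, while abductive inference returns the single most probable configuration of the unobserved variables.

```python
p_s = {1: 0.25, 0: 0.75}          # P(S)
p_b_given_s = {1: 0.6, 0: 0.1}    # P(B=1 | S)

def joint(s, b):
    pb = p_b_given_s[s]
    return p_s[s] * (pb if b == 1 else 1 - pb)

evidence_b = 1                    # observe B = 1

# Belief updating: posterior distribution of S given the evidence.
z = sum(joint(s, evidence_b) for s in (0, 1))
print({s: joint(s, evidence_b) / z for s in (0, 1)})    # {0: 0.333..., 1: 0.666...}

# Abductive inference: most probable configuration of the unobserved
# variables (here just S) given the same evidence.
print(max((0, 1), key=lambda s: joint(s, evidence_b)))  # 1
```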
  • 157. Using the Structure I Consider the following Bayesian Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls. CPTs: P(B=T) = 0.001; P(E=T) = 0.002; P(A=T|B,E): (T,T) 0.95, (T,F) 0.94, (F,T) 0.29, (F,F) 0.001; P(JC=T|A): T 0.9, F 0.05; P(MC=T|A): T 0.7, F 0.01. Consider answering a query in a Bayesian Network: Q = set of query variables, e = evidence (set of instantiated variable-value pairs), Inference = computation of the conditional distribution P(Q|e). 62 / 85
  • 161. Using the Structure II Examples: P(burglary | alarm), P(earthquake | JCalls, MCalls), P(JCalls, MCalls | burglary, earthquake). So: Can we use the structure of the Bayesian Network to answer such queries efficiently? Answer: YES. Note: Generally speaking, complexity decreases as the graph becomes sparser. 63 / 85
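As a concrete sketch of answering such a query, the snippet below performs inference by enumeration for P(Burglary | JohnCalls = T, MaryCalls = T), using the CPTs listed above (the enumeration code itself is illustrative, not the algorithm developed later in the course):

```python
# Inference by enumeration on the Burglary network.
import itertools

p_b = {True: 0.001, False: 0.999}
p_e = {True: 0.002, False: 0.998}
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)
p_jc = {True: 0.90, False: 0.05}                      # P(JC=T | A)
p_mc = {True: 0.70, False: 0.01}                      # P(MC=T | A)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, jc, mc):
    return (p_b[b] * p_e[e] * bern(p_a[(b, e)], a)
            * bern(p_jc[a], jc) * bern(p_mc[a], mc))

def query_burglary(jc, mc):
    # Sum out the hidden variables E and A for each value of B, then normalize.
    unnorm = {b: sum(joint(b, e, a, jc, mc)
                     for e, a in itertools.product((True, False), repeat=2))
              for b in (True, False)}
    z = sum(unnorm.values())
    return {b: p / z for b, p in unnorm.items()}

print(query_burglary(jc=True, mc=True))   # P(B=T | jc, mc) is roughly 0.284
```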
  • 167. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 64 / 85
  • 168. Example DAG [Figure: D → B, D → E, B → A, B → C, E → F, E → G] We have the following model: p(a, b, c, d, e, f, g) = p(a|b) p(c|b) p(f|e) p(g|e) p(b|d) p(e|d) p(d). 65 / 85
  • 170. Example DAG [same DAG as above] We want to calculate the following: p(a | c, g). 66 / 85
  • 172. Example However, a direct calculation requires marginalizing over the remaining variables: p(a | c, g) = Σ_{b,d,e,f} p(a, b, d, e, f | c, g). If we fix the values of a, c and g, this has complexity O(m^4), with m = max{|B|, |D|, |E|, |F|}. 67 / 85
  • 174. Example Suppose we fix the values (a = ai, c = ci, g = gi). We re-express the equation using a chain representation: p(a = ai, b, d, e, f | c = ci, g = gi) = p(a = ai | b) p(b | d, c = ci) p(d | e) p(e, f | g = gi). 68 / 85
  • 176. Example Now we re-order the sums: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) Σ_e p(d | e) Σ_f p(e, f | g = gi). 69 / 85
  • 178. Example Now, using the relation involving E, we can reduce one of the sums by marginalization: Σ_f p(e, f | g = gi) = p(e | g = gi). 70 / 85
  • 180. Example Thus, we can reduce the size of our sum: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) Σ_e p(d | e) p(e | g = gi). 71 / 85
  • 182. Example Now we can compute a term involving D by using the chain rule: p(d | e) p(e | g = gi) = p(d | e, g = gi) p(e | g = gi) = p(d, e | g = gi). 72 / 85
  • 184. Example Substituting this back, the sum becomes: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) Σ_e p(d, e | g = gi). 73 / 85
  • 186. Example Now we sum over all possible values of E: Σ_e p(d, e | g = gi) = p(d | g = gi). 74 / 85
  • 188. Example We get the following: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) p(d | g = gi). 75 / 85
  • 190. Example Again, the chain rule for D: p(b | d, c = ci) p(d | g = gi) = p(b | d, c = ci, g = gi) p(d | c = ci, g = gi) = p(b, d | c = ci, g = gi). 76 / 85
  • 192. Example Now we sum over all possible values of D: Σ_b p(a = ai | b) p(b | c = ci, g = gi). 77 / 85
  • 194. Example Now we use the chain rule to reduce again: p(a = ai | b) p(b | c = ci, g = gi) = p(a = ai, b | c = ci, g = gi). 78 / 85
  • 196. Example Finally, summing over all possible values of B: Σ_b p(a = ai, b | c = ci, g = gi) = p(a = ai | c = ci, g = gi). 79 / 85
  • 198. Complexity Because this can be computed as a sequence of four for-loops, one per summation, the complexity becomes O(m), compared with O(m^4) for the direct calculation. 80 / 85
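A sketch of the reordered computation as a sequence of summations, one per eliminated variable. The CPT dictionaries below (p_a_b, p_b_dc, p_d_e, p_ef_g) are hypothetical placeholders standing for p(a|b), p(b|d,c), p(d|e) and p(e,f|g); only the loop structure is the point.

```python
def query_a_given_c_g(a_i, c_i, g_i, values, p_a_b, p_b_dc, p_d_e, p_ef_g):
    """Evaluate sum_b p(a|b) sum_d p(b|d,c) sum_e p(d|e) sum_f p(e,f|g)."""
    # t1(e) = sum_f p(e, f | g = g_i)
    t1 = {e: sum(p_ef_g[(e, f, g_i)] for f in values) for e in values}
    # t2(d) = sum_e p(d | e) * t1(e)
    t2 = {d: sum(p_d_e[(d, e)] * t1[e] for e in values) for d in values}
    # t3(b) = sum_d p(b | d, c = c_i) * t2(d)
    t3 = {b: sum(p_b_dc[(b, d, c_i)] * t2[d] for d in values) for b in values}
    # Final sum over b gives p(a = a_i | c = c_i, g = g_i)
    return sum(p_a_b[(a_i, b)] * t3[b] for b in values)

# Hypothetical usage, once the four CPT dictionaries have been filled in:
# query_a_given_c_g(a_i=1, c_i=0, g_i=1, values=(0, 1),
#                   p_a_b=..., p_b_dc=..., p_d_e=..., p_ef_g=...)
```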
  • 199. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 81 / 85
  • 200. General Strategy for Inference Query: We want to compute P(q | e)!!! Step 1: P(q | e) = P(q, e) / P(e) = α P(q, e), where α = 1/P(e) is constant with respect to Q. Step 2: P(q, e) = Σ_{a..z} P(q, e, a, b, ..., z), by the law of total probability. 82 / 85
  • 203. General Strategy for Inference Step 3: Σ_{a..z} P(q, e, a, b, ..., z) = Σ_{a..z} Π_i P(variable_i | parents(variable_i)) (using the Bayesian network factorization). Step 4: Distribute the summations across the product terms for efficient computation. 83 / 85
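A worked instance of the four steps on the Burglary network from earlier, for the query P(B | j, m); the rearrangement below is a standard one and is not copied from the slides:

```latex
\begin{align*}
P(B \mid j, m) &= \alpha\, P(B, j, m)
  && \text{Step 1, with } \alpha = 1/P(j, m)\\
&= \alpha \sum_{e} \sum_{a} P(B, e, a, j, m)
  && \text{Step 2}\\
&= \alpha \sum_{e} \sum_{a} P(B)\, P(e)\, P(a \mid B, e)\, P(j \mid a)\, P(m \mid a)
  && \text{Step 3}\\
&= \alpha\, P(B) \sum_{e} P(e) \sum_{a} P(a \mid B, e)\, P(j \mid a)\, P(m \mid a)
  && \text{Step 4}
\end{align*}
```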
  • 205. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 84 / 85
  • 206. Inference – An Overview Case 1: Trees and singly connected networks (only one path between any two nodes): message passing (Pearl, 1988). Case 2: Multiply connected networks: a range of exact algorithms, including cut-set conditioning (Pearl, 1988), junction tree propagation (Lauritzen and Spiegelhalter, 1988) and bucket elimination (Dechter, 1996), to mention a few, plus a range of algorithms for approximate inference. Notes: Both exact and approximate inference are NP-hard in the worst case. Here the focus will be on message passing and junction tree propagation for discrete variables. 85 / 85