Artificial Intelligence
Introduction to Bayesian Networks
Andres Mendez-Vazquez
March 2, 2016
1 / 85
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
2 / 85
History
History
‘60s The first expert systems. IF-THEN rules.
1968 Attempts to use probabilities in expert systems (Gorry &
Barnett).
1973 Gave up - the calculations were too heavy! (Gorry).
1976 MYCIN: Medical predicate logic expert system with certainty
factors (Shortliffe).
1976 PROSPECTOR: Predicts the likely location of mineral deposits.
Uses Bayes’ rule. (Duda et al.).
Summary until mid ’80s
“Pure logic will solve the AI problems!”
“Probability theory is intractable to use and too complicated for
complex models.”
4 / 85
But...
More History
1986 Bayesian networks were revived and reintroduced to expert
systems (Pearl).
1988 Breakthrough for efficient calculation algorithms (Lauritzen &
Spiegelhalter): tractable calculations on Bayesian Networks.
1995 In Windows95™ for printer troubleshooting and Office
assistance (“the paper clip”).
1999 Bayesian Networks are used more and more, e.g., gene
expression analysis, business strategy, etc.
2000 Widely used - A Bayesian Network tool will be shipped with
every Windows™ Commercial Server.
5 / 85
Further on: 2000-2015
Bayesian Networks are used in
Spam Detection.
Gene Discovery.
Signal Processing.
Ranking.
Forecasting.
etc.
Something Notable
We are more and more interested in building Bayesian
Networks automatically from data!!!
6 / 85
Bayesian Network Advantages
Many of Them
1 Since a Bayesian network encodes dependencies among all variables, missing data entries
can be handled successfully.
2 When used for learning causal relationships, they help better
understand a problem domain as well as forecast consequences.
3 It is ideal to use a Bayesian network for representing prior data and
knowledge.
4 Over-fitting of data can be avoided when using Bayesian networks
and Bayesian statistical methods.
7 / 85
Bayes Theorem
One Version
P(A|B) = [P(B|A) P(A)] / P(B)
Where
P(A) is the prior probability or marginal probability of A. It is
"prior" in the sense that it does not take into account any information
about B.
P(A|B) is the conditional probability of A, given B. It is also called
the posterior probability because it is derived from or depends upon
the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called
the likelihood.
P(B) is the prior or marginal probability of B, and acts as a
normalizing constant.
9 / 85
A Simple Example
Consider two related variables:
1 Drug (D) with values y or n
2 Test (T) with values +ve or –ve
Initial Probabilities
P(D = y) = 0.001
P(T = +ve|D = y) = 0.8
P(T = +ve|D = n) = 0.01
10 / 85
A Simple Example
What is the probability that a person has taken the drug?
P(D = y|T = +ve) = [P(T = +ve|D = y) P(D = y)] / [P(T = +ve|D = y) P(D = y) + P(T = +ve|D = n) P(D = n)]
Let me develop the equation
Using simply
P (A, B) = P (A|B) P (B) (Chain Rule) (1)
11 / 85
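To make the numbers concrete, here is a minimal Python sketch that plugs the probabilities above into Bayes' theorem (the variable names are mine, not part of the original slides):

```python
# Minimal sketch: Bayes' theorem for the drug-test example above.
p_d = 0.001               # P(D = y), prior probability of having taken the drug
p_pos_given_d = 0.8       # P(T = +ve | D = y)
p_pos_given_not_d = 0.01  # P(T = +ve | D = n)

# Denominator: total probability of a positive test, P(T = +ve)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1.0 - p_d)

# Posterior: P(D = y | T = +ve)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))  # ~0.0741: a positive test is still weak evidence
```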
A More Complex Case
Increase Complexity
Suppose now that there is a similar link between Lung Cancer (L) and
a chest X-ray (X) and that we also have the following relationships:
History of smoking (S) has a direct influence on bronchitis (B) and
lung cancer (L);
L and B have a direct influence on fatigue (F).
Question
What is the probability that someone has bronchitis given that they
smoke, have fatigue and have received a positive X-ray result?
13 / 85
A More Complex Case
Short Hand
P(b1|s1, f1, x1) = P(b1, s1, f1, x1) / P(s1, f1, x1)
                 = Σ_l P(b1, s1, f1, x1, l) / Σ_{b,l} P(b, s1, f1, x1, l)
14 / 85
Values for the Complex Case
Table
Feature Value When the Feature Takes this Value
H h1 There is a history of smoking
h2 There is no history of smoking
B b1 Bronchitis is present
b2 Bronchitis is absent
L l1 Lung cancer is present
l2 Lung cancer is absent
F f1 Fatigue is present
f2 Fatigue is absent
C c1 Chest X-ray is positive
c2 Chest X-ray is negative
15 / 85
Problem with Large Instances
The joint probability distribution P(b, s, f, x, l)
For five binary variables there are 2^5 = 32 values in the joint
distribution (for 100 variables there are over 2^100 values)
How are these values to be obtained?
We can try to do inference
To obtain posterior distributions once some evidence is available
requires summation over an exponential number of terms!!!
Ok
We need something BETTER!!!
16 / 85
Bayesian Networks
Definition
A Bayesian network consists of
A Graph
Nodes represent the random variables.
Directed edges (arrows) between pairs of nodes.
It must be a Directed Acyclic Graph (DAG) – no directed cycles.
The graph represents independence relationships between variables.
This allows us to define
Conditional Probability Specifications:
The conditional probability of each variable given its parents in the
DAG.
18 / 85
Example
DAG for the previous Lung Cancer Problem
[Figure: DAG with nodes H, B, L, F, C and edges H → B, H → L, B → F, L → F, L → C]
19 / 85
Markov Condition
Definition
Suppose we have a joint probability distribution P of the random
variables in some set V and a DAG G = (V , E).
We say that (G, P) satisfies the Markov condition if, for each variable
X ∈ V, {X} is conditionally independent of the set of all its
non-descendants given the set of all its parents.
Notation
PA_X = set of parents of X.
ND_X = set of non-descendants of X.
We use the following notation
I_P({X}, ND_X | PA_X)
21 / 85
Example
We have that
[Figure: DAG with nodes H, B, L, F, C and edges H → B, H → L, B → F, L → F, L → C]
Given the previous DAG we have
Node PA_X Conditional Independence
C {L} I_P({C}, {H, B, F} | {L})
B {H} I_P({B}, {L, C} | {H})
F {B, L} I_P({F}, {H, C} | {B, L})
L {H} I_P({L}, {B} | {H})
23 / 85
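The parent sets and non-descendant sets in the table can be read off mechanically from the DAG. The following Python sketch (the helper names are mine) computes them for the example above and prints the corresponding Markov-condition independencies:

```python
# Sketch: list the Markov-condition independencies I_P({X}, ND_X | PA_X)
# for the lung-cancer DAG above (parent sets match the table on this slide).
parents = {
    "H": [],
    "B": ["H"],
    "L": ["H"],
    "F": ["B", "L"],
    "C": ["L"],
}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}
    stack, seen = list(children[node]), set()
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(children[d])
    return seen

for x in parents:
    # Non-descendants of x, excluding x itself and its parents.
    nd = set(parents) - descendants(x) - {x} - set(parents[x])
    if nd:  # H has no non-descendants here, so nothing is printed for it
        print(f"I_P({{{x}}}, {sorted(nd)} | {sorted(parents[x])})")
```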
Using the Markov Condition
First Decompose a Joint Distribution using the Chain Rule
P(c, f, l, b, h) = P(c|b, h, l, f) P(f|b, h, l) P(l|b, h) P(b|h) P(h) (2)
Using the Markov condition in the following DAG
[Figure: DAG with nodes H, B, L, F, C and edges H → B, H → L, B → F, L → F, L → C]
We have the following equivalences
P (c|b, h, l, f ) = P (c|l)
P (f |b, h, l) = P (f |b, l)
P (l|b, h) = P (l|h)
25 / 85
Using the Markov Condition
Finally
P(c, f, l, b, h) = P(c|l) P(f|b, l) P(l|h) P(b|h) P(h) (3)
26 / 85
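As a concrete illustration of equation (3), here is a small Python sketch that evaluates the factorized joint for one configuration. The numeric CPT values are made-up placeholders (the slides do not give them); only the factorization structure comes from the example.

```python
# Sketch: evaluate P(c, f, l, b, h) = P(c|l) P(f|b,l) P(l|h) P(b|h) P(h).
# All numeric values below are illustrative placeholders, not from the slides.
p_h = 0.2                                   # P(H = true)
p_b_given_h = {True: 0.25, False: 0.05}     # P(B = true | H)
p_l_given_h = {True: 0.003, False: 0.00005} # P(L = true | H)
p_c_given_l = {True: 0.6, False: 0.02}      # P(C = true | L)
p_f_given_bl = {(True, True): 0.75, (True, False): 0.10,
                (False, True): 0.50, (False, False): 0.05}  # P(F = true | B, L)

def bern(p_true, value):
    """P(X = value) for a binary X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(c, f, l, b, h):
    """Equation (3): P(c,f,l,b,h) = P(c|l) P(f|b,l) P(l|h) P(b|h) P(h)."""
    return (bern(p_c_given_l[l], c) * bern(p_f_given_bl[(b, l)], f)
            * bern(p_l_given_h[h], l) * bern(p_b_given_h[h], b) * bern(p_h, h))

print(joint(c=True, f=True, l=False, b=True, h=True))
```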
Representing the Joint Distribution
Theorem 1.4
If (G, P) satisfies the Markov condition, then P is equal to the product of
its conditional distributions of all nodes given values of their parents,
whenever these conditional distributions exist.
General Representation
In general, for a network with nodes X_1, X_2, ..., X_n:
P(x_1, x_2, ..., x_n) = Π_{i=1}^{n} P(x_i | PA(x_i))
28 / 85
Proof of Theorem 1.4
We prove the case where P is discrete
Order the nodes so that if Y is a descendant of Z, then Y follows Z in
the ordering (i.e., perform a topological sort).
This is called
an ancestral ordering.
29 / 85
Proof
For example
Two ancestral orderings are
[H, L, B, C, F] and [H, B, L, F, C] (4)
30 / 85
Proof
Now
Let X_1, X_2, ..., X_n be the resultant ordering.
For a given set of values x_1, x_2, ..., x_n,
let pa_i be the subset of these values containing the values of X_i's parents.
Thus, we need to prove that, whenever P(pa_i) ≠ 0 for 1 ≤ i ≤ n,
P(x_n, x_{n-1}, ..., x_1) = P(x_n|pa_n) P(x_{n-1}|pa_{n-1}) · · · P(x_1|pa_1) (5)
31 / 85
Proof
Something Notable
We show this using induction on the number of variables in the network.
Assume that P(pa_i) ≠ 0 for 1 ≤ i ≤ n for a combination of the x_i's
values.
Base Case of Induction
Since pa_1 is empty, then
P(x_1) = P(x_1|pa_1) (6)
Inductive Hypothesis
Suppose for this combination of values of the x_i's that
P(x_i, x_{i-1}, ..., x_1) = P(x_i|pa_i) P(x_{i-1}|pa_{i-1}) · · · P(x_1|pa_1) (7)
32 / 85
Proof
Inductive Step
We need to show for this combination of values of the x_i's that
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|pa_{i+1}) P(x_i|pa_i) · · · P(x_1|pa_1) (8)
Case 1
For this combination of values:
P(x_i, x_{i-1}, ..., x_1) = 0 (9)
By conditional probability, we have
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|x_i, ..., x_1) P(x_i, ..., x_1) = 0 (10)
33 / 85
Proof
Due to the previous equalities and the inductive hypothesis
There is some k, 1 ≤ k ≤ i, such that P(x_k|pa_k) = 0, because
P(x_i|pa_i) P(x_{i-1}|pa_{i-1}) · · · P(x_1|pa_1) = P(x_i, ..., x_1) = 0 (11)
Hence the right-hand side of (8) is also 0, and the equality holds.
Case 2
For this combination of values, P(x_i, x_{i-1}, ..., x_1) ≠ 0
34 / 85
Proof
Thus by the Rule of Conditional Probability
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|x_i, ..., x_1) P(x_i, ..., x_1)
Definition Markov Condition (Remember!!!)
Suppose we have a joint probability distribution P of the random
variables in some set V and a DAG G = (V , E).
We say that (G, P) satisfies the Markov condition if, for each variable
X ∈ V, {X} is conditionally independent of the set of all its
non-descendants given the set of all its parents.
35 / 85
Proof
Given the Markov condition and the fact that X_1, ..., X_i are all
non-descendants of X_{i+1}
We have that
P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1}|pa_{i+1}) P(x_i, ..., x_1)
                          = P(x_{i+1}|pa_{i+1}) P(x_i|pa_i) · · · P(x_1|pa_1) (inductive hypothesis)
Q.E.D.
36 / 85
Now
OBSERVATIONS
1 An enormous saving can be made regarding the number of values
required for the joint distribution.
2 To determine the joint distribution directly for n binary variables, 2^n
values are required.
3 For a Bayesian Network with n binary variables in which each node has at
most k parents, at most 2^k · n values are required!!!
38 / 85
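As a quick sanity check of the claim, here is a tiny sketch comparing the two counts (the choice n = 100, k = 3 is just for illustration):

```python
# Sketch: full joint table size vs. Bayesian Network parameter count for
# n binary variables, each node having at most k parents.
n, k = 100, 3
full_joint = 2 ** n          # entries in the full joint distribution table
bn_upper_bound = n * 2 ** k  # at most 2^k conditional values per node
print(full_joint)            # about 1.27e30 entries
print(bn_upper_bound)        # 800
```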
It is more!!!
Theorem 1.5
Let a DAG G be given in which each node is a random variable, and
let a discrete conditional probability distribution of each node given
values of its parents in G be specified.
Then, the product of these conditional distributions yields a joint
probability distribution P of the variables, and (G, P) satisfies the
Markov condition.
Note
Notice that the theorem requires that specified conditional
distributions be discrete.
The result often still holds for continuous distributions.
39 / 85
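Theorem 1.5 can be checked numerically on a small example. The sketch below multiplies the conditional distributions of the v-structure Burglary → Alarm ← Earthquake (using the CPT values that appear later in these slides) and verifies that the product sums to 1 over all configurations:

```python
# Sketch: the product of per-node conditional distributions defines a proper
# joint distribution (it sums to 1) for the DAG Burglary -> Alarm <- Earthquake,
# with the CPT values used later in these slides.
from itertools import product

p_b, p_e = 0.001, 0.002
p_a_given_be = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

total = sum(bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)
            for b, e, a in product([True, False], repeat=3))
print(total)  # 1.0 (up to floating-point rounding), as Theorem 1.5 guarantees
```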
Causality in Bayesian Networks
Definition of a Cause
The one, such as a person, an event, or a condition, that is responsible for
an action or a result.
However
Although useful, this simple definition is certainly not the last word on the
concept of causation.
Actually, philosophers are still wrangling over the issue!!!
41 / 85
Causality in Bayesian Networks
Nevertheless, it sheds light on the issue
If the action of making a variable X take some value sometimes
changes the value taken by a variable Y, then:
Causality
Here, we assume X is responsible for sometimes changing Y ’s value
Thus, we conclude X is a cause of Y .
42 / 85
Furthermore
Formally
We say we manipulate X when we force X to take some value.
We say X causes Y if there is some manipulation of X that leads to
a change in the probability distribution of Y .
Thus
We assume causes and their effects are statistically correlated.
However
Variables can be correlated without one causing the other.
43 / 85
Precautionary Tale: Causality and Bayesian Networks
Important
Not every Bayesian Network describes causal relationships between the
variables.
Consider
Consider the dependence between Lung Cancer, L, and the X-ray
test, X.
By focusing on just these variables, we might be tempted to represent
them by the following Bayesian Network.
[Figure: a two-node DAG over L and X, with the edge drawn in one direction]
45 / 85
Precautionary Tale: Causality and Bayesian Networks
However, we can just as well draw the edge the other way
[Figure: the same two nodes with the edge between L and X reversed]
46 / 85
Remark
Be Careful
It is tempting to think that a Bayesian Network can be built simply by
drawing a DAG in which the edges represent direct causal relationships
between the variables.
47 / 85
However
Causal DAG
Given a set of variables V , if for every X, Y ∈ V we draw an edge from X
to Y ⇐⇒ X is a direct cause of Y relative to V , we call the resultant
DAG a causal DAG.
We want
If we create a causal DAG G = (V, E) and assume that the probability
distribution of the variables in V satisfies the Markov condition with G,
we say we are making the causal Markov assumption.
In General
The Markov condition holds for a causal DAG.
49 / 85
However, we still want to know if the Markov Condition
Holds
Remark
There are several things that a DAG needs to satisfy in order for the
Markov condition to hold.
Examples of those
Common Causes
Common Effects
50 / 85
How to have a Markov Assumption : Common Causes
Consider
[Figure: DAG with edges Smoking → Bronchitis and Smoking → Lung Cancer]
Markov condition
I_P({B}, {L} | {S}) ⇒ P(b|l, s) = P(b|s) (12)
51 / 85
How to have a Markov Assumption : Common Causes
If we know the causal relationships
S → B and S → L (13)
Now!!!
If we know the person is a smoker.
52 / 85
How to have a Markov Assumption : Common Causes
Then, because conditioning on Smoking blocks the flow of information
Finding out that he has Bronchitis will not give us any more information
about the probability of him having Lung Cancer.
Markov condition
It is satisfied!!!
53 / 85
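A small numerical check of equation (12): under the factorization P(s) P(b|s) P(l|s) implied by the DAG S → B, S → L, conditioning on S makes L irrelevant to B. The CPT values below are illustrative placeholders; only the structure comes from the slides.

```python
# Sketch: in the DAG S -> B, S -> L, verify P(b | l, s) = P(b | s).
# CPT values are illustrative placeholders.
p_s = 0.3
p_b_given_s = {True: 0.25, False: 0.05}
p_l_given_s = {True: 0.003, False: 0.00005}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, b, l):
    return bern(p_s, s) * bern(p_b_given_s[s], b) * bern(p_l_given_s[s], l)

s, l = True, True
p_b_given_ls = joint(s, True, l) / (joint(s, True, l) + joint(s, False, l))
print(p_b_given_ls, p_b_given_s[s])  # the two values coincide
```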
How to have a Markov Assumption : Common Effects
Consider
[Figure: DAG with edges Burglary → Alarm and Earthquake → Alarm]
Markov Condition
I_P({B}, {E}) ⇒ P(b|e) = P(b) (14)
Thus
We would expect Burglary and Earthquake to be independent of each
other which is in agreement with the Markov condition.
54 / 85
How to have a Markov Assumption : Common Effects
However
We would, however, expect them to be conditionally dependent given
Alarm.
Thus
If the alarm has gone off, news that there had been an earthquake would
‘explain away’ the idea that a burglary had taken place.
Then
Again in agreement with the Markov condition.
55 / 85
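This "explaining away" effect can be verified numerically with the CPT values of the alarm network shown later in these slides (P(B) = 0.001, P(E) = 0.002, and the P(A|B, E) table on the "Using the Structure I" slide). The sketch below compares P(B|A) with P(B|A, E):

```python
# Sketch of "explaining away" in the DAG Burglary -> Alarm <- Earthquake,
# using the CPT values from the alarm network shown later in these slides.
from itertools import product

p_b, p_e = 0.001, 0.002
p_a_given_be = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

# P(B = T | A = T): marginalize over E
num = sum(joint(True, e, True) for e in (True, False))
den = sum(joint(b, e, True) for b, e in product((True, False), repeat=2))
print(num / den)       # ~0.37

# P(B = T | A = T, E = T): the earthquake "explains away" the alarm
num_e = joint(True, True, True)
den_e = sum(joint(b, True, True) for b in (True, False))
print(num_e / den_e)   # ~0.003, much lower than ~0.37
```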
The Causal Markov Condition
What do we want?
The basic idea is that the Markov condition holds for a causal DAG.
56 / 85
Rules to construct A Causal Graph
Conditions
1 There must be no hidden common causes.
2 There must not be selection bias.
3 There must be no feedback loops.
Observations
Even with these conditions, there is a lot of controversy as to its validity.
It seems to be false in quantum mechanics.
57 / 85
Hidden Common Causes?
Given the following DAG
[Figure: DAG with a hidden common cause H of X and Y; Z appears below X and Y]
Something Notable
If a DAG is created on the basis of the causal relationships among only
the variables under consideration (without H), then X and Y would be
marginally independent according to the Markov condition.
Thus
If H is hidden, X and Y will normally be dependent, and the Markov
condition fails.
58 / 85
Inference in Bayesian Networks
What do we want from Bayesian Networks?
The main point of Bayesian Networks is to enable probabilistic inference
to be performed.
Two different types of inferences
1 Belief Updating.
2 Abduction Inference.
60 / 85
Inference in Bayesian Networks
Belief updating
It is used to obtain the posterior probability of one or more variables given
evidence concerning the values of other variables.
Abductive inference
It finds the most probable configuration of a set of variables (hypothesis)
given certain evidence.
61 / 85
Using the Structure I
Consider the following Bayesian Network
[Figure: DAG with edges Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls]
P(B) = 0.001     P(E) = 0.002
B E P(A|B,E)
T T 0.95
T F 0.94
F T 0.29
F F 0.001
A P(JC|A)
T 0.9
F 0.05
A P(MC|A)
T 0.7
F 0.01
Consider answering a query in a Bayesian Network
Q= set of query variables
e= evidence (set of instantiated variable-value pairs)
Inference = computation of conditional distribution P(Q|e)
62 / 85
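A brute-force way to answer such queries is to sum the factorized joint over every variable that is neither queried nor observed. The sketch below (the function and variable names are mine, not part of the slides) does this for the network above; for instance, it reproduces the well-known value P(Burglary = T | JohnCalls = T, MaryCalls = T) ≈ 0.28.

```python
# Sketch: inference by enumeration over the alarm network above.
from itertools import product

p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=T | B, E)
p_jc = {True: 0.9, False: 0.05}                     # P(JC=T | A)
p_mc = {True: 0.7, False: 0.01}                     # P(MC=T | A)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, jc, mc):
    """Factorized joint P(b) P(e) P(a|b,e) P(jc|a) P(mc|a)."""
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
            * bern(p_jc[a], jc) * bern(p_mc[a], mc))

def query(q_var, evidence):
    """Return P(q_var = T | evidence) by summing out all other variables."""
    order = ["B", "E", "A", "JC", "MC"]
    totals = {True: 0.0, False: 0.0}
    for values in product((True, False), repeat=len(order)):
        assignment = dict(zip(order, values))
        if all(assignment[v] == val for v, val in evidence.items()):
            totals[assignment[q_var]] += joint(*values)
    return totals[True] / (totals[True] + totals[False])

print(query("B", {"JC": True, "MC": True}))  # ~0.28
```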
Using the Structure II
Examples
P(burglary|alarm)
P(earthquake|JCalls, MCalls)
P(JCalls, MCalls|burglary, earthquake)
So
Can we use the structure of the Bayesian Network to answer such queries
efficiently?
Answer
YES
Note: Generally speaking, inference complexity decreases as the graph becomes sparser.
63 / 85
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
64 / 85
Example
DAG
[Figure: DAG with edges D → B, D → E, B → A, B → C, E → F, E → G]
We have the following model
p(a, b, c, d, e, f, g) is modeled by
p(a|b) p(c|b) p(f|e) p(g|e) p(b|d) p(e|d) p(d)
65 / 85
Example
DAG
[Figure: DAG as above]
We want to calculate the following
p(a|c, g)
66 / 85
Example
DAG
[Figure: DAG as above]
However, a direct calculation requires marginalizing over the hidden variables:
p(a|c, g) = \sum_{b,d,e,f} p(a, b, d, e, f | c, g)
For fixed values of a, c and g, this sum has complexity O(m^4), with m = max {|B|, |D|, |E|, |F|}.
67 / 85
Example
We now fix the values a = a_i, c = c_i, g = g_i (the query value and the evidence)
[Figure: DAG as above]
However, we re-express the equation using a chain-rule factorization:
p(a = a_i, b, d, e, f | c = c_i, g = g_i) = p(a = a_i|b) p(b|d, c = c_i) p(d|e) p(e, f | g = g_i)
68 / 85
Example
DAG
[Figure: DAG as above]
Now, we re-order the sum:
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) \sum_{e} p(d|e) \sum_{f} p(e, f | g = g_i)
69 / 85
Example
Now, using the relation involving E
[Figure: DAG as above]
Using this information, we can reduce one of the sums by marginalization:
\sum_{f} p(e, f | g = g_i) = p(e | g = g_i)
70 / 85
Example
DAG
[Figure: DAG as above]
Thus, we can reduce the size of our sum:
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) \sum_{e} p(d|e) p(e | g = g_i)
71 / 85
Example
DAG
[Figure: DAG as above]
Now, we can combine these terms using the chain rule (note that p(d|e) = p(d|e, g = g_i), since D is independent of G given E):
p(d|e) p(e | g = g_i) = p(d|e, g = g_i) p(e | g = g_i) = p(d, e | g = g_i)
72 / 85
Example
DAG
[Figure: DAG as above]
Substituting this back into the sum, we get
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) \sum_{e} p(d, e | g = g_i)
73 / 85
Example
DAG
[Figure: DAG as above]
Now, we sum over all possible values of E:
\sum_{e} p(d, e | g = g_i) = p(d | g = g_i)
74 / 85
Example
DAG
[Figure: DAG as above]
We get the following
\sum_{b} p(a = a_i|b) \sum_{d} p(b|d, c = c_i) p(d | g = g_i)
75 / 85
Example
DAG
[Figure: DAG as above]
Again, the chain rule for D:
p(b|d, c = c_i) p(d | g = g_i) = p(b|d, c = c_i, g = g_i) p(d | c = c_i, g = g_i) = p(b, d | c = c_i, g = g_i)
76 / 85
Example
DAG
[Figure: DAG as above]
Now, we sum over all possible values of D:
\sum_{b} p(a = a_i|b) p(b | c = c_i, g = g_i)
77 / 85
Example
DAG
[Figure: DAG as above]
Now, we use the chain rule to reduce again:
p(a = a_i|b) p(b | c = c_i, g = g_i) = p(a = a_i, b | c = c_i, g = g_i)
78 / 85
Example
DAG
[Figure: DAG as above]
Finally, we sum over all possible values of B:
\sum_{b} p(a = a_i, b | c = c_i, g = g_i) = p(a = a_i | c = c_i, g = g_i)
79 / 85
Complexity
Because this can be computed as a sequence of four small for loops (one per eliminated variable),
each summation now runs over a single variable in O(m) terms, compared with the O(m^4) terms of the direct marginalization.
80 / 85
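The saving can be checked in code. The sketch below (my own, with made-up CPT values; only the graph structure comes from the slides) computes p(a | c, g) twice: once by the direct sum over b, d, e, f, and once as a sequence of single-variable summations pushed inward over the factorization p(d) p(b|d) p(e|d) p(a|b) p(c|b) p(f|e) p(g|e). This is the same reordering idea as above, applied directly to the network's own factors rather than to the evidence-conditioned terms.

```python
# Hypothetical binary CPTs for the DAG D -> B, D -> E, B -> A, B -> C,
# E -> F, E -> G.  All probability values below are made up.
from itertools import product

vals = (True, False)
p_d_true = 0.30
cpt_b = {(True,): 0.70, (False,): 0.20}   # P(B=T | D)
cpt_e = {(True,): 0.60, (False,): 0.10}   # P(E=T | D)
cpt_a = {(True,): 0.90, (False,): 0.40}   # P(A=T | B)
cpt_c = {(True,): 0.80, (False,): 0.30}   # P(C=T | B)
cpt_f = {(True,): 0.50, (False,): 0.25}   # P(F=T | E)
cpt_g = {(True,): 0.65, (False,): 0.05}   # P(G=T | E)

def bern(p_true, x):
    return p_true if x else 1.0 - p_true

def joint(a, b, c, d, e, f, g):
    return (bern(p_d_true, d) * bern(cpt_b[(d,)], b) * bern(cpt_e[(d,)], e) *
            bern(cpt_a[(b,)], a) * bern(cpt_c[(b,)], c) *
            bern(cpt_f[(e,)], f) * bern(cpt_g[(e,)], g))

c_i, g_i = True, True      # the observed evidence

def p_a_direct(a):
    """Direct marginalization: one big sum over b, d, e, f (O(m^4) terms)."""
    return sum(joint(a, b, c_i, d, e, f, g_i)
               for b, d, e, f in product(vals, repeat=4))

def p_a_reordered(a):
    """Summations pushed inward: eliminate F, then E, then D, then B."""
    msg_e = {e: bern(cpt_g[(e,)], g_i) * sum(bern(cpt_f[(e,)], f) for f in vals)
             for e in vals}                                    # F summed out
    msg_d = {d: sum(bern(cpt_e[(d,)], e) * msg_e[e] for e in vals)
             for d in vals}                                    # E summed out
    msg_b = {b: sum(bern(p_d_true, d) * bern(cpt_b[(d,)], b) * msg_d[d]
                    for d in vals)
             for b in vals}                                    # D summed out
    return sum(bern(cpt_a[(b,)], a) * bern(cpt_c[(b,)], c_i) * msg_b[b]
               for b in vals)                                  # finally sum B

norm = sum(p_a_direct(a) for a in vals)
for a in vals:
    assert abs(p_a_direct(a) - p_a_reordered(a)) < 1e-12       # same answer
print("p(a=true | c, g) =", p_a_reordered(True) / norm)
```

The assertion confirms that reordering the sums changes only the amount of work, not the result.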
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
81 / 85
General Strategy for Inference
Query
Want to compute P(q|e)!!!
Step 1
P(q|e) = P(q, e)/P(e) = α P(q, e), where α = 1/P(e) is a constant with respect to Q.
Step 2
P(q, e) = \sum_{a,...,z} P(q, e, a, b, ..., z), summing over the remaining (hidden) variables a, ..., z, by the law of total probability.
82 / 85
General Strategy for Inference
Step 3
\sum_{a,...,z} P(q, e, a, b, ..., z) = \sum_{a,...,z} \prod_i P(x_i | parents(X_i)) (using the Bayesian network factorization)
Step 4
Distribute summations across product terms for efficient computation.
83 / 85
Outline
1 History
The History of Bayesian Applications
2 Bayes Theorem
Everything Starts at Someplace
Why Bayesian Networks?
3 Bayesian Networks
Definition
Markov Condition
Example
Using the Markov Condition
Representing the Joint Distribution
Observations
Causality and Bayesian Networks
Precautionary Tale
Causal DAG
Inference in Bayesian Networks
Example
General Strategy of Inference
Inference - An Overview
84 / 85
Inference – An Overview
Case 1
Trees and singly connected networks – only one path between any two
nodes:
Message passing (Pearl, 1988)
Case 2
Multiply connected networks:
A range of algorithms including cut-set conditioning (Pearl, 1988),
junction tree propagation (Lauritzen and Spiegelhalter, 1988), bucket
elimination (Dechter, 1996) to mention a few.
A range of algorithms for approximate inference.
Notes
Both exact and approximate inference are NP-hard in the worst case.
Here the focus will be on message passing and junction tree
propagation for discrete variables.
85 / 85

More Related Content

DOCX
A highly scalable key pre distribution scheme for wireless sensor networks
DOCX
A highly scalable key pre distribution scheme for wireless sensor networks
PDF
AI-SDV 2021: Mazahir Bhagat - Mapping Canadian Patented Inventions
PPTX
Knowledge representation and reasoning
PDF
Sementic nets
PPTX
Bayesian networks and the search for causality
PDF
2014 Best Sports Cars
A highly scalable key pre distribution scheme for wireless sensor networks
A highly scalable key pre distribution scheme for wireless sensor networks
AI-SDV 2021: Mazahir Bhagat - Mapping Canadian Patented Inventions
Knowledge representation and reasoning
Sementic nets
Bayesian networks and the search for causality
2014 Best Sports Cars

Viewers also liked (14)

PPTX
Accelerate Sales and Increase Revenue
PDF
Artificial Intelligence 02 uninformed search
PDF
Tea vs-coffee
PDF
The design of things you don't want to think about — WIAD 2016 Jönköping
PDF
Xtext project and PhDs in Gemany
PPT
9.6 El modelado Glaciar
PDF
HOW TO CHECK DELETED WHATSAPP MESSAGES ON IPHONE
PPT
Lipinski Jmrc Lecture1 Nov2008
PDF
Delphi7 oyutnii garin awlaga 2006 muis
PPTX
Music video audience profile
PDF
رهبری تیمهای نوآور
PPT
How to Look at Art
PPTX
Representation and organization of knowledge in memory
PDF
9 einfache Ideen für individuelle Bildmotive
Accelerate Sales and Increase Revenue
Artificial Intelligence 02 uninformed search
Tea vs-coffee
The design of things you don't want to think about — WIAD 2016 Jönköping
Xtext project and PhDs in Gemany
9.6 El modelado Glaciar
HOW TO CHECK DELETED WHATSAPP MESSAGES ON IPHONE
Lipinski Jmrc Lecture1 Nov2008
Delphi7 oyutnii garin awlaga 2006 muis
Music video audience profile
رهبری تیمهای نوآور
How to Look at Art
Representation and organization of knowledge in memory
9 einfache Ideen für individuelle Bildmotive
Ad

Similar to Artificial Intelligence 06.01 introduction bayesian_networks (20)

PDF
The Bayesia Portfolio of Research Software
PDF
Bayesianmd2
PDF
BayesiaLab_Book_V18 (1)
PDF
Bayesian inference and big data: are we there yet? by Jose Luis Hidalgo at Bi...
PPT
Basen Network
PDF
Lecture9 - Bayesian-Decision-Theory
PDF
Machine Learning History & Prbabilistic modelling
PDF
Graphical Models 4dummies
PDF
Bayesian networks
PDF
BayesiaLab 5.0 Introduction
PDF
Brief bibliography of interestingness measure, bayesian belief network and ca...
PDF
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
PDF
Bayesian Networks - A Brief Introduction
PPTX
Knowledge & Reasoning for Students study
PPTX
Knowledge & Reasoning.ppt for students study
PDF
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
PDF
Principles of Health Informatics: Artificial intelligence and machine learning
PPTX
Bayesian probabilistic interference
PPTX
Bayesian probabilistic interference
PDF
Impact of big data congestion in IT: An adaptive knowledgebased Bayesian network
The Bayesia Portfolio of Research Software
Bayesianmd2
BayesiaLab_Book_V18 (1)
Bayesian inference and big data: are we there yet? by Jose Luis Hidalgo at Bi...
Basen Network
Lecture9 - Bayesian-Decision-Theory
Machine Learning History & Prbabilistic modelling
Graphical Models 4dummies
Bayesian networks
BayesiaLab 5.0 Introduction
Brief bibliography of interestingness measure, bayesian belief network and ca...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Bayesian Networks - A Brief Introduction
Knowledge & Reasoning for Students study
Knowledge & Reasoning.ppt for students study
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
Principles of Health Informatics: Artificial intelligence and machine learning
Bayesian probabilistic interference
Bayesian probabilistic interference
Impact of big data congestion in IT: An adaptive knowledgebased Bayesian network
Ad

More from Andres Mendez-Vazquez (20)

PDF
2.03 bayesian estimation
PDF
05 linear transformations
PDF
01.04 orthonormal basis_eigen_vectors
PDF
01.03 squared matrices_and_other_issues
PDF
01.02 linear equations
PDF
01.01 vector spaces
PDF
06 recurrent neural_networks
PDF
05 backpropagation automatic_differentiation
PDF
Zetta global
PDF
01 Introduction to Neural Networks and Deep Learning
PDF
25 introduction reinforcement_learning
PDF
Neural Networks and Deep Learning Syllabus
PDF
Introduction to artificial_intelligence_syllabus
PDF
Ideas 09 22_2018
PDF
Ideas about a Bachelor in Machine Learning/Data Sciences
PDF
Analysis of Algorithms Syllabus
PDF
20 k-means, k-center, k-meoids and variations
PDF
18.1 combining models
PDF
17 vapnik chervonenkis dimension
PDF
A basic introduction to learning
2.03 bayesian estimation
05 linear transformations
01.04 orthonormal basis_eigen_vectors
01.03 squared matrices_and_other_issues
01.02 linear equations
01.01 vector spaces
06 recurrent neural_networks
05 backpropagation automatic_differentiation
Zetta global
01 Introduction to Neural Networks and Deep Learning
25 introduction reinforcement_learning
Neural Networks and Deep Learning Syllabus
Introduction to artificial_intelligence_syllabus
Ideas 09 22_2018
Ideas about a Bachelor in Machine Learning/Data Sciences
Analysis of Algorithms Syllabus
20 k-means, k-center, k-meoids and variations
18.1 combining models
17 vapnik chervonenkis dimension
A basic introduction to learning

Recently uploaded (20)

PDF
PPT on Performance Review to get promotions
PPT
introduction to datamining and warehousing
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
PPTX
Safety Seminar civil to be ensured for safe working.
PPTX
UNIT - 3 Total quality Management .pptx
PPT
A5_DistSysCh1.ppt_INTRODUCTION TO DISTRIBUTED SYSTEMS
PDF
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
PDF
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPTX
Artificial Intelligence
PDF
737-MAX_SRG.pdf student reference guides
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PPTX
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
PPTX
Current and future trends in Computer Vision.pptx
PPT
Total quality management ppt for engineering students
PPT on Performance Review to get promotions
introduction to datamining and warehousing
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
Fundamentals of safety and accident prevention -final (1).pptx
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
Safety Seminar civil to be ensured for safe working.
UNIT - 3 Total quality Management .pptx
A5_DistSysCh1.ppt_INTRODUCTION TO DISTRIBUTED SYSTEMS
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
R24 SURVEYING LAB MANUAL for civil enggi
Artificial Intelligence
737-MAX_SRG.pdf student reference guides
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
Current and future trends in Computer Vision.pptx
Total quality management ppt for engineering students

Artificial Intelligence 06.01 introduction bayesian_networks

  • 1. Artificial Intelligence Introduction to Bayesian Networks Andres Mendez-Vazquez March 2, 2016 1 / 85
  • 2. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 2 / 85
  • 3. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 3 / 85
  • 4. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 5. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 6. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 7. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 8. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 9. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 10. History History ‘60s The first expert systems. IF-THEN rules. 1968 Attempts to use probabilities in expert systems (Gorry & Barnett). 1973 Gave up - to heavy calculations! (Gorry). 1976 MYCIN: Medical predicate logic expert system with certainty factors (Shortliffe). 1976 PROSPECTOR: Predicts the likely location of mineral deposits. Uses Bayes’ rule. (Duda et al.). Summary until mid ’80s “Pure logic will solve the AI problems!” “Probability theory is intractable to use and too complicated for complex models.” 4 / 85
  • 11. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 12. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 13. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 14. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 15. But... More History 1986 Bayesian networks were revived and reintroduced to expert systems (Pearl). 1988 Breakthrough for efficient calculation algorithms (Lauritzen & Spiegelhalter) tractable calculations on Bayesian Networkss. 1995 In Windows95™ for printer-trouble shooting and Office assistance (“the paper clip”). 1999 Bayesian Networks are getting more and more used. Ex. Gene expression analysis, Business strategy etc. 2000 Widely used - A Bayesian Network tool will be shipped with every Windows™ Commercial Server. 5 / 85
  • 16. Furtheron 2000-2015 Bayesian Networks are use in Spam Detection. Gene Dicovery. Signal Processing. Ranking. Forecasting. etc. Something Notable We are interested more and more on building automatically Bayesian Networks using data!!! 6 / 85
  • 17. Furtheron 2000-2015 Bayesian Networks are use in Spam Detection. Gene Dicovery. Signal Processing. Ranking. Forecasting. etc. Something Notable We are interested more and more on building automatically Bayesian Networks using data!!! 6 / 85
  • 18. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 19. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 20. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 21. Bayesian Network Advantages Many of Them 1 Since in a Bayesian network encodes all variables, missing data entries can be handled successfully. 2 When used for learning casual relationships, they help better understand a problem domain as well as forecast consequences. 3 it is ideal to use a Bayesian network for representing prior data and knowledge. 4 Over-fitting of data can be avoidable when using Bayesian networks and Bayesian statistical methods. 7 / 85
  • 22. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 8 / 85
  • 23. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 24. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 25. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 26. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 27. Bayes Theorem One Version P(A|B) = P(B|A)P(A) P(B) Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 9 / 85
  • 28. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 29. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 30. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 31. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 32. A Simple Example Consider two related variables: 1 Drug (D) with values y or n 2 Test (T) with values +ve or –ve Initial Probabilities P(D = y) = 0.001 P(T = +ve|D = y) = 0.8 P(T = +ve|D = n) = 0.01 10 / 85
  • 33. A Simple Example What is the probability that a person has taken the drug? P (D = y|T = +ve) = P (T = +ve|D = y) P (D=y) P (T = +ve|D = y) P (D=y) + P (T = +ve|D = n) P (D=n) Let me develop the equation Using simply P (A, B) = P (A|B) P (B) (Chain Rule) (1) 11 / 85
  • 34. A Simple Example What is the probability that a person has taken the drug? P (D = y|T = +ve) = P (T = +ve|D = y) P (D=y) P (T = +ve|D = y) P (D=y) + P (T = +ve|D = n) P (D=n) Let me develop the equation Using simply P (A, B) = P (A|B) P (B) (Chain Rule) (1) 11 / 85
  • 35. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 12 / 85
  • 36. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 37. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 38. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 39. A More Complex Case Increase Complexity Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L); L and B have a direct influence on fatigue (F). Question What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? 13 / 85
  • 40. A More Complex Case Short Hand P (b1|s1, f1, x1) = P (b1, s1, f1, x1) P (s1, f1, x1) = l P (b1, s1, f1, x1, l) b,l P (b, s1, f1, x1, l) 14 / 85
  • 41. Values for the Complex Case Table Feature Value When the Feature Takes this Value H h1 There is a history of smoking h2 There is no history of smoking B b1 Bronchitis is present b2 Bronchitis is absent L l1 Lung cancer is present l2 Lung cancer is absent F f1 Fatigue is present f2 Fatigue is absent C c1 Chest X-ray is positive c2 Chest X-ray is negative 15 / 85
  • 42. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 43. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 44. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 45. Problem with Large Instances The joint probability distribution P(b, s, f , x, l) For five binary variables there are 25 = 32 values in the joint distribution (for 100 variables there are over 2100 values) How are these values to be obtained? We can try to do inference To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms!!! Ok We need something BETTER!!! 16 / 85
  • 46. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 17 / 85
  • 47. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 48. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 49. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 50. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 51. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 52. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 53. Bayesian Networks Definition A Bayesian network consists of A Graph Nodes represent the random variables. Directed edges (arrows) between pairs of nodes. it must be a Directed Acyclic Graph (DAG) – no directed cycles. The graph represents independence relationships between variables. This allows to define Conditional Probability Specifications: The conditional probability of each variable given its parents in the DAG. 18 / 85
  • 54. Example DAG for the previous Lung Cancer Problem H B L F C 19 / 85
  • 55. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 20 / 85
  • 56. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 57. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 58. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 59. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 60. Markov Condition Definition Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. Notation PAX = set of parents of X. NDX = set of non-descendants of X. We use the following the notation IP ({X} , NDX |PAX ) 21 / 85
  • 61. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 22 / 85
  • 62. Example We have that H B L F C Given the previous DAG we have Node PA Conditional Independence C {L} IP ({C} , {H, B, F} | {L}) B {H} IP ({B} , {L, C} | {H}) F {B, L} IP ({F} , {H, C} | {B, L}) L {H} IP ({L} , {B} | {H}) 23 / 85
  • 63. Example We have that H B L F C Given the previous DAG we have Node PA Conditional Independence C {L} IP ({C} , {H, B, F} | {L}) B {H} IP ({B} , {L, C} | {H}) F {B, L} IP ({F} , {H, C} | {B, L}) L {H} IP ({L} , {B} | {H}) 23 / 85
  • 64. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 24 / 85
  • 65. Using the Markov Condition First Decompose a Joint Distribution using the Chain Rule P (c, f , l, b, h) = P (c|b, s, l, f ) P (f |b, h, l) P (l|b, h) P (b|h) P (h) (2) Using the Markov condition in the following DAG We have the following equivalences P (c|b, h, l, f ) = P (c|l) P (f |b, h, l) = P (f |b, l) P (l|b, h) = P (l|h) 25 / 85
  • 66. Using the Markov Condition First Decompose a Joint Distribution using the Chain Rule P (c, f , l, b, h) = P (c|b, s, l, f ) P (f |b, h, l) P (l|b, h) P (b|h) P (h) (2) Using the Markov condition in the following DAG H B L F C We have the following equivalences P (c|b, h, l, f ) = P (c|l) P (f |b, h, l) = P (f |b, l) P (l|b, h) = P (l|h) 25 / 85
  • 67. Using the Markov Condition First Decompose a Joint Distribution using the Chain Rule P (c, f , l, b, h) = P (c|b, s, l, f ) P (f |b, h, l) P (l|b, h) P (b|h) P (h) (2) Using the Markov condition in the following DAG H B L F C We have the following equivalences P (c|b, h, l, f ) = P (c|l) P (f |b, h, l) = P (f |b, l) P (l|b, h) = P (l|h) 25 / 85
  • 68. Using the Markov Condition Finally P (c, f , l, b, h) = P (c|l) P (f |b, l) P (l|h) P (b|h) P (h) (3) 26 / 85
  • 69. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markok Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 27 / 85
  • 70. Representing the Joint Distribution Theorem 1.4 If (G, P) satisfies the Markov condition, then P is equal to the product of its conditional distributions of all nodes given values of their parents, whenever these conditional distributions exist. General Representation In general, for a network with nodes X1, X2, ..., Xn ⇒ P (x1, x2, ..., xn) = n i=1 P (xi|PA (xi)) 28 / 85
  • 71. Representing the Joint Distribution Theorem 1.4 If (G, P) satisfies the Markov condition, then P is equal to the product of its conditional distributions of all nodes given values of their parents, whenever these conditional distributions exist. General Representation In general, for a network with nodes X1, X2, ..., Xn ⇒ P (x1, x2, ..., xn) = n i=1 P (xi|PA (xi)) 28 / 85
  • 72. Proof of Theorem 1.4 We prove the case where P is discrete Order the nodes so that if Y is a descendant of Z, then Y follows Z in the ordering. Topological Sorting. This is called Ancestral ordering. 29 / 85
  • 73. Proof of Theorem 1.4 We prove the case where P is discrete Order the nodes so that if Y is a descendant of Z, then Y follows Z in the ordering. Topological Sorting. This is called Ancestral ordering. 29 / 85
  • 74. Proof of Theorem 1.4 We prove the case where P is discrete Order the nodes so that if Y is a descendant of Z, then Y follows Z in the ordering. Topological Sorting. This is called Ancestral ordering. 29 / 85
  • 75. Proof For example The ancestral ordering are [H, L, B, C, F] and [H, B, L, F, C] (4) 30 / 85
  • 76. Proof For example The ancestral ordering are [H, L, B, C, F] and [H, B, L, F, C] (4) 30 / 85
  • 77. Proof Now Let X1, X2, ..., Xn be the resultant ordering. For a given set of values of x1, x2, ..., xn Let pai be the subsets of these values containing the values of Xi s parents Thus, we need to prove that whenever P (pai) = 0 for 1 ≤ i ≤ n P (xn, xn−1, ..., x1) = P (xn|pan) P xn−1|pan−1 ...P (x1|pa1) (5) 31 / 85
  • 78. Proof Now Let X1, X2, ..., Xn be the resultant ordering. For a given set of values of x1, x2, ..., xn Let pai be the subsets of these values containing the values of Xi s parents Thus, we need to prove that whenever P (pai) = 0 for 1 ≤ i ≤ n P (xn, xn−1, ..., x1) = P (xn|pan) P xn−1|pan−1 ...P (x1|pa1) (5) 31 / 85
  • 79. Proof Now Let X1, X2, ..., Xn be the resultant ordering. For a given set of values of x1, x2, ..., xn Let pai be the subsets of these values containing the values of Xi s parents Thus, we need to prove that whenever P (pai) = 0 for 1 ≤ i ≤ n P (xn, xn−1, ..., x1) = P (xn|pan) P xn−1|pan−1 ...P (x1|pa1) (5) 31 / 85
  • 80. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 81. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 82. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 83. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 84. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 85. Proof Something Notable We show this using induction on the number of variables in the network. Assume then P (pai) = 0 for 1 ≤ i ≤ n for a combination of xis values. Base Case of Induction Since pa1 is empty, then P (x1) = P (x1|pa1) (6) Inductive Hypothesis Suppose for this combination of values of the xi’s that P (xi, xi−1, ..., x1) = P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) (7) 32 / 85
  • 86. Proof Inductive Step We need show for this combination of values of the xi’s that P (xi+1, xi, ..., x1) = P xi+1|pai+1 P (xi|pai) ...P (x1|pa1) (8) Case 1 For this combination of values: P (xi, xi−1, ..., x1) = 0 (9) By Conditional Probability, we have P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) = 0 (10) 33 / 85
  • 87. Proof Inductive Step We need show for this combination of values of the xi’s that P (xi+1, xi, ..., x1) = P xi+1|pai+1 P (xi|pai) ...P (x1|pa1) (8) Case 1 For this combination of values: P (xi, xi−1, ..., x1) = 0 (9) By Conditional Probability, we have P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) = 0 (10) 33 / 85
  • 88. Proof Inductive Step We need show for this combination of values of the xi’s that P (xi+1, xi, ..., x1) = P xi+1|pai+1 P (xi|pai) ...P (x1|pa1) (8) Case 1 For this combination of values: P (xi, xi−1, ..., x1) = 0 (9) By Conditional Probability, we have P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) = 0 (10) 33 / 85
  • 89. Proof Due to the previous equalities and the inductive hypothesis There is some k, 1 ≤ k ≤ i such that P (xk|pak) = 0 because after all P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) = 0 (11) Thus, the equality holds Now for the Case 2 Case 2 For this combination of values P (xi, xi−1, ..., x1) = 0 34 / 85
  • 90. Proof Due to the previous equalities and the inductive hypothesis There is some k, 1 ≤ k ≤ i such that P (xk|pak) = 0 because after all P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) = 0 (11) Thus, the equality holds Now for the Case 2 Case 2 For this combination of values P (xi, xi−1, ..., x1) = 0 34 / 85
  • 91. Proof Due to the previous equalities and the inductive hypothesis There is some k, 1 ≤ k ≤ i such that P (xk|pak) = 0 because after all P (xi|pai) P xi−1|pai−1 ...P (x1|pa1) = 0 (11) Thus, the equality holds Now for the Case 2 Case 2 For this combination of values P (xi, xi−1, ..., x1) = 0 34 / 85
  • 92. Proof Thus by the Rule of Conditional Probability P (xi+1, xi, ..., x1) = P (xi+1|xi, ..., x1) P (xi, ..., x1) Definition Markov Condition (Remember!!!) Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V , E). We say that (G, P) satisfies the Markov condition if for each variable X ∈ V , {X} is conditionally independent of the set of all its non-descendents given the set of all its parents. 35 / 85
  • 95. Proof Given this Markov condition and the fact that X_1, ..., X_i are all non-descendants of X_{i+1}, we have P(x_{i+1}, x_i, ..., x_1) = P(x_{i+1} | pa_{i+1}) P(x_i, ..., x_1) = P(x_{i+1} | pa_{i+1}) P(x_i | pa_i) · · · P(x_1 | pa_1) (by the inductive hypothesis). Q.E.D. 36 / 85
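To make the factorization concrete, here is a minimal numeric sketch (not from the slides) on a hypothetical three-node chain X1 → X2 → X3 with made-up CPTs: the joint built as the product of the local conditional distributions coincides with its own chain-rule expansion, and the Markov condition P(x_3 | x_2, x_1) = P(x_3 | x_2) holds, mirroring the steps of the proof.

```python
# Minimal numeric sketch (hypothetical chain X1 -> X2 -> X3, invented CPTs).
import itertools

p_x1 = {0: 0.6, 1: 0.4}                      # P(X1)
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},        # P(X2 | X1)
                 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1},        # P(X3 | X2)
                 1: {0: 0.5, 1: 0.5}}

# Joint distribution built as the product of the conditional distributions.
joint = {(x1, x2, x3): p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]
         for x1, x2, x3 in itertools.product((0, 1), repeat=3)}

def marg(assign):
    """Marginal probability of a partial assignment {position: value}."""
    return sum(p for xs, p in joint.items()
               if all(xs[i] == v for i, v in assign.items()))

for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    # Chain rule applied to the joint: P(x1) P(x2|x1) P(x3|x2,x1) ...
    chain = (marg({0: x1})
             * (marg({0: x1, 1: x2}) / marg({0: x1}))
             * (joint[(x1, x2, x3)] / marg({0: x1, 1: x2})))
    # ... equals the product of the local CPDs, as the theorem asserts.
    product = p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]
    assert abs(chain - product) < 1e-12
    # Markov condition on the chain: P(x3 | x2, x1) = P(x3 | x2).
    assert abs(joint[(x1, x2, x3)] / marg({0: x1, 1: x2})
               - p_x3_given_x2[x2][x3]) < 1e-12
print("chain-rule factorization and Markov condition hold for all 8 assignments")
```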
  • 97. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 37 / 85
  • 98. Now OBSERVATIONS 1 An enormous saving can be made in the number of values required to specify the joint distribution. 2 To specify the joint distribution directly over n binary variables, 2^n values are required. 3 For a Bayesian Network with n binary variables in which each node has at most k parents, fewer than 2^k · n values are required!!! 38 / 85
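A quick worked comparison (the values of n and k below are chosen arbitrarily for illustration):

```python
# Illustrative parameter counts for n binary variables, at most k parents each.
n, k = 30, 3
full_joint = 2 ** n - 1       # independent entries of the full joint table
bayes_net = n * 2 ** k        # upper bound on the CPT entries of the network
print(full_joint, bayes_net)  # 1073741823 vs. 240
```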
  • 101. It is more!!! Theorem 1.5 Let a DAG G be given in which each node is a random variable, and let a discrete conditional probability distribution of each node given values of its parents in G be specified. Then the product of these conditional distributions yields a joint probability distribution P of the variables, and (G, P) satisfies the Markov condition. Note: The theorem requires the specified conditional distributions to be discrete; the result often still holds in the continuous case. 39 / 85
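A small sketch of Theorem 1.5 on a hypothetical v-structure B → A ← E (all probability values below are invented): the product of the specified conditional distributions sums to 1, so it really is a joint distribution, and it satisfies the Markov condition, here visible as the marginal independence of the two parents.

```python
import itertools

p_b = {1: 0.3, 0: 0.7}                       # P(B)
p_e = {1: 0.1, 0: 0.9}                       # P(E)
p_a1 = {(1, 1): 0.95, (1, 0): 0.8,           # P(A=1 | B, E)
        (0, 1): 0.4, (0, 0): 0.05}

def p_a(a, b, e):
    return p_a1[(b, e)] if a == 1 else 1.0 - p_a1[(b, e)]

# Product of the conditional distributions ...
joint = {(b, e, a): p_b[b] * p_e[e] * p_a(a, b, e)
         for b, e, a in itertools.product((0, 1), repeat=3)}

# ... is a genuine joint probability distribution: it sums to 1 ...
assert abs(sum(joint.values()) - 1.0) < 1e-12

# ... and (G, P) satisfies the Markov condition: the parentless nodes B and E
# are non-descendants of each other, hence marginally independent under P.
for b, e in itertools.product((0, 1), repeat=2):
    p_be = sum(joint[(b, e, a)] for a in (0, 1))
    assert abs(p_be - p_b[b] * p_e[e]) < 1e-12
print("the product of the CPDs is a valid joint satisfying the Markov condition")
```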
  • 105. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 40 / 85
  • 106. Causality in Bayesian Networks Definition of a Cause: that (a person, an event, or a condition) which is responsible for an action or a result. However: Although useful, this simple definition is certainly not the last word on the concept of causation. Actually: Philosophers are still wrangling over the issue!!! 41 / 85
  • 109. Causality in Bayesian Networks Nevertheless, it sheds light on the issue: if the action of making variable X take some value sometimes changes the value taken by a variable Y, then we assume X is responsible for sometimes changing Y's value, and we conclude that X is a cause of Y. 42 / 85
  • 111. Furthermore Formally We say we manipulate X when we force X to take some value. We say X causes Y if there is some manipulation of X that leads to a change in the probability distribution of Y . Thus We assume causes and their effects are statistically correlated. However Variables can be correlated without one causing the other. 43 / 85
  • 115. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 44 / 85
  • 119. Precautionary Tale: Causality and Bayesian Networks Important: Not every Bayesian Network describes causal relationships between the variables. Consider the dependence between Lung Cancer, L, and the X-ray test, X. By focusing on just these two variables, we might be tempted to represent them by the following two-node network. [Figure: two-node network linking L and X] 45 / 85
  • 120. Precautionary Tale: Causality and Bayesian Networks However, we could just as well represent the same dependence with the edge between L and X oriented the other way. [Figure: the same two nodes, edge reversed] 46 / 85
  • 121. Remark Be Careful: It is tempting to think that Bayesian Networks can be created simply by drawing a DAG whose edges represent direct causal relationships between the variables. 47 / 85
  • 122. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 48 / 85
  • 123. However Causal DAG: Given a set of variables V, if for every X, Y ∈ V we draw an edge from X to Y ⇐⇒ X is a direct cause of Y relative to V, we call the resultant DAG a causal DAG. We want: If we create a causal DAG G = (V, E) and assume the probability distribution of the variables in V satisfies the Markov condition with G, we say we are making the causal Markov assumption. In General: The Markov condition holds for a causal DAG. 49 / 85
  • 127. However, we still want to know if the Markov Condition Holds Remark: There are several things the DAG needs to satisfy in order for the Markov condition to hold. Two illustrative situations: Common Causes and Common Effects. 50 / 85
  • 130. How to have a Markov Assumption: Common Causes Consider [Figure: Smoking → Bronchitis, Smoking → Lung Cancer] Markov condition: I_P({B}, {L} | {S}) ⇒ P(b | l, s) = P(b | s) (12) 51 / 85
  • 132. How to have a Markov Assumption: Common Causes If we know the causal relationships S → B and S → L (13) Now!!! Suppose we know the person is a smoker. 52 / 85
  • 134. How to have a Markov Assumption: Common Causes Then, because Smoking blocks the flow of information, finding out that he has Bronchitis will not give us any more information about the probability of him having Lung Cancer. Markov condition: It is satisfied!!! 53 / 85
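A short numeric check of I_P({B}, {L} | {S}) for this common-cause structure (the CPT numbers below are invented for illustration):

```python
# Common cause S -> B, S -> L: verify P(b | l, s) = P(b | s) numerically.
import itertools

p_s = {1: 0.25, 0: 0.75}          # P(S)
p_b_given_s = {1: 0.6, 0: 0.1}    # P(B=1 | S)
p_l_given_s = {1: 0.3, 0: 0.02}   # P(L=1 | S)

def bern(p_true, x):
    return p_true if x == 1 else 1.0 - p_true

joint = {(s, b, l): p_s[s] * bern(p_b_given_s[s], b) * bern(p_l_given_s[s], l)
         for s, b, l in itertools.product((0, 1), repeat=3)}

for s, b, l in itertools.product((0, 1), repeat=3):
    p_sl = sum(joint[(s, bb, l)] for bb in (0, 1))   # P(s, l)
    p_sb = sum(joint[(s, b, ll)] for ll in (0, 1))   # P(s, b)
    assert abs(joint[(s, b, l)] / p_sl - p_sb / p_s[s]) < 1e-12
print("P(b | l, s) = P(b | s): B and L are independent given S")
```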
  • 136. How to have a Markov Assumption: Common Effects Consider [Figure: Burglary → Alarm ← Earthquake] Markov Condition: I_P({B}, {E}) ⇒ P(b | e) = P(b) (14) Thus: We would expect Burglary and Earthquake to be independent of each other, which is in agreement with the Markov condition. 54 / 85
  • 139. How to have a Markov Assumption: Common Effects However: We would expect them to be conditionally dependent given Alarm. Thus: If the alarm has gone off, news that there had been an earthquake would 'explain away' the idea that a burglary had taken place. Then: This is again in agreement with the Markov condition. 55 / 85
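'Explaining away' can be seen numerically. The sketch below uses the Burglary/Earthquake/Alarm CPTs that appear later in these slides (JohnCalls and MaryCalls are omitted, which does not affect these two queries):

```python
# Explaining away on Burglary -> Alarm <- Earthquake.
import itertools

p_b, p_e = 0.001, 0.002
p_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)

def joint(b, e, a):
    pa = p_a_given[(b, e)]
    return ((p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
            * (pa if a else 1 - pa))

# P(B=T | A=T): the alarm alone makes a burglary far more likely than 0.001.
num = sum(joint(True, e, True) for e in (True, False))
den = sum(joint(b, e, True)
          for b, e in itertools.product((True, False), repeat=2))
print(num / den)        # approximately 0.374

# P(B=T | A=T, E=T): learning of the earthquake explains the alarm away,
# so the probability of a burglary drops again.
print(joint(True, True, True)
      / sum(joint(b, True, True) for b in (True, False)))  # approximately 0.0033
```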
  • 142. The Causal Markov Condition What do we want? The basic idea is that the Markov condition holds for a causal DAG. 56 / 85
  • 143. Rules to construct A Causal Graph Conditions: 1 There must be no hidden common causes. 2 There must be no selection bias. 3 There must be no feedback loops. Observations: Even with these conditions there is a lot of controversy as to the assumption's validity. It appears to fail in quantum-mechanical settings. 57 / 85
  • 148. Hidden Common Causes? Given the following DAG [Figure: DAG over X, Y and Z with H a hidden common cause of X and Y] Something Notable: If a DAG is created on the basis of the causal relationships among only the variables under consideration, then X and Y would be marginally independent according to the Markov condition. Thus: If H is hidden, they will normally be dependent, and the Markov condition fails. 58 / 85
  • 151. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 59 / 85
  • 152. Inference in Bayesian Networks What do we want from Bayesian Networks? The main point of Bayesian Networks is to enable probabilistic inference to be performed. Two different types of inference: 1 Belief Updating. 2 Abductive Inference. 60 / 85
  • 155. Inference in Bayesian Networks Belief updating It is used to obtain the posterior probability of one or more variables given evidence concerning the values of other variables. Abductive inference It finds the most probable configuration of a set of variables (hypothesis) given certain evidence. 61 / 85
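A tiny sketch contrasting the two (a hypothetical two-node network S → B with invented numbers): belief updating returns a posterior distribution over the query variable, while abductive inference returns the single most probable configuration of the unobserved variables.

```python
p_s = {1: 0.25, 0: 0.75}          # P(S)
p_b_given_s = {1: 0.6, 0: 0.1}    # P(B=1 | S)

def joint(s, b):
    pb = p_b_given_s[s]
    return p_s[s] * (pb if b == 1 else 1 - pb)

evidence_b = 1                    # observe B = 1

# Belief updating: posterior distribution of S given the evidence.
z = sum(joint(s, evidence_b) for s in (0, 1))
print({s: joint(s, evidence_b) / z for s in (0, 1)})    # {0: 0.333..., 1: 0.666...}

# Abductive inference: most probable configuration of the unobserved
# variables (here just S) given the same evidence.
print(max((0, 1), key=lambda s: joint(s, evidence_b)))  # 1
```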
  • 157. Using the Structure I Consider the following Bayesian Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls. CPTs: P(B=T) = 0.001; P(E=T) = 0.002; P(A=T|B,E): (T,T) 0.95, (T,F) 0.94, (F,T) 0.29, (F,F) 0.001; P(JC=T|A): T 0.9, F 0.05; P(MC=T|A): T 0.7, F 0.01. Consider answering a query in a Bayesian Network: Q = set of query variables, e = evidence (set of instantiated variable-value pairs), Inference = computation of the conditional distribution P(Q|e). 62 / 85
  • 161. Using the Structure II Examples: P(burglary | alarm), P(earthquake | JCalls, MCalls), P(JCalls, MCalls | burglary, earthquake). So: Can we use the structure of the Bayesian Network to answer such queries efficiently? Answer: YES. Note: Generally speaking, complexity decreases as the graph becomes sparser. 63 / 85
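As a concrete sketch of answering such a query, the snippet below performs inference by enumeration for P(Burglary | JohnCalls = T, MaryCalls = T), using the CPTs listed above (the enumeration code itself is illustrative, not the algorithm developed later in the course):

```python
# Inference by enumeration on the Burglary network.
import itertools

p_b = {True: 0.001, False: 0.999}
p_e = {True: 0.002, False: 0.998}
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)
p_jc = {True: 0.90, False: 0.05}                      # P(JC=T | A)
p_mc = {True: 0.70, False: 0.01}                      # P(MC=T | A)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, jc, mc):
    return (p_b[b] * p_e[e] * bern(p_a[(b, e)], a)
            * bern(p_jc[a], jc) * bern(p_mc[a], mc))

def query_burglary(jc, mc):
    # Sum out the hidden variables E and A for each value of B, then normalize.
    unnorm = {b: sum(joint(b, e, a, jc, mc)
                     for e, a in itertools.product((True, False), repeat=2))
              for b in (True, False)}
    z = sum(unnorm.values())
    return {b: p / z for b, p in unnorm.items()}

print(query_burglary(jc=True, mc=True))   # P(B=T | jc, mc) is roughly 0.284
```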
  • 167. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 64 / 85
  • 168. Example DAG [Figure: D → B, D → E, B → A, B → C, E → F, E → G] We have the following model: p(a, b, c, d, e, f, g) = p(a|b) p(c|b) p(f|e) p(g|e) p(b|d) p(e|d) p(d). 65 / 85
  • 170. Example DAG [same DAG as above] We want to calculate the following: p(a | c, g). 66 / 85
  • 172. Example However, a direct calculation requires marginalizing over the remaining variables: p(a | c, g) = Σ_{b,d,e,f} p(a, b, d, e, f | c, g). If we fix the values of a, c and g, this has complexity O(m^4), with m = max{|B|, |D|, |E|, |F|}. 67 / 85
  • 174. Example Suppose we fix the values (a = ai, c = ci, g = gi). We re-express the equation using a chain representation: p(a = ai, b, d, e, f | c = ci, g = gi) = p(a = ai | b) p(b | d, c = ci) p(d | e) p(e, f | g = gi). 68 / 85
  • 176. Example Now we re-order the sums: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) Σ_e p(d | e) Σ_f p(e, f | g = gi). 69 / 85
  • 178. Example Now, using the relation involving E, we can reduce one of the sums by marginalization: Σ_f p(e, f | g = gi) = p(e | g = gi). 70 / 85
  • 180. Example Thus, we can reduce the size of our sum: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) Σ_e p(d | e) p(e | g = gi). 71 / 85
  • 182. Example Now we can compute a term involving D by using the chain rule: p(d | e) p(e | g = gi) = p(d | e, g = gi) p(e | g = gi) = p(d, e | g = gi). 72 / 85
  • 184. Example Substituting this back, the sum becomes: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) Σ_e p(d, e | g = gi). 73 / 85
  • 186. Example Now we sum over all possible values of E: Σ_e p(d, e | g = gi) = p(d | g = gi). 74 / 85
  • 188. Example We get the following: Σ_b p(a = ai | b) Σ_d p(b | d, c = ci) p(d | g = gi). 75 / 85
  • 190. Example Again, the chain rule for D: p(b | d, c = ci) p(d | g = gi) = p(b | d, c = ci, g = gi) p(d | c = ci, g = gi) = p(b, d | c = ci, g = gi). 76 / 85
  • 192. Example Now we sum over all possible values of D: Σ_b p(a = ai | b) p(b | c = ci, g = gi). 77 / 85
  • 194. Example Now we use the chain rule to reduce again: p(a = ai | b) p(b | c = ci, g = gi) = p(a = ai, b | c = ci, g = gi). 78 / 85
  • 196. Example Finally, summing over all possible values of B: Σ_b p(a = ai, b | c = ci, g = gi) = p(a = ai | c = ci, g = gi). 79 / 85
  • 198. Complexity Because this can be computed as a sequence of four for-loops, one per summation, the complexity becomes O(m), compared with O(m^4) for the direct calculation. 80 / 85
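A sketch of the reordered computation as a sequence of summations, one per eliminated variable. The CPT dictionaries below (p_a_b, p_b_dc, p_d_e, p_ef_g) are hypothetical placeholders standing for p(a|b), p(b|d,c), p(d|e) and p(e,f|g); only the loop structure is the point.

```python
def query_a_given_c_g(a_i, c_i, g_i, values, p_a_b, p_b_dc, p_d_e, p_ef_g):
    """Evaluate sum_b p(a|b) sum_d p(b|d,c) sum_e p(d|e) sum_f p(e,f|g)."""
    # t1(e) = sum_f p(e, f | g = g_i)
    t1 = {e: sum(p_ef_g[(e, f, g_i)] for f in values) for e in values}
    # t2(d) = sum_e p(d | e) * t1(e)
    t2 = {d: sum(p_d_e[(d, e)] * t1[e] for e in values) for d in values}
    # t3(b) = sum_d p(b | d, c = c_i) * t2(d)
    t3 = {b: sum(p_b_dc[(b, d, c_i)] * t2[d] for d in values) for b in values}
    # Final sum over b gives p(a = a_i | c = c_i, g = g_i)
    return sum(p_a_b[(a_i, b)] * t3[b] for b in values)

# Hypothetical usage, once the four CPT dictionaries have been filled in:
# query_a_given_c_g(a_i=1, c_i=0, g_i=1, values=(0, 1),
#                   p_a_b=..., p_b_dc=..., p_d_e=..., p_ef_g=...)
```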
  • 199. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 81 / 85
  • 200. General Strategy for Inference Query: We want to compute P(q | e)!!! Step 1: P(q | e) = P(q, e) / P(e) = α P(q, e), where α = 1/P(e) is constant with respect to Q. Step 2: P(q, e) = Σ_{a..z} P(q, e, a, b, ..., z), by the law of total probability. 82 / 85
  • 203. General Strategy for Inference Step 3: Σ_{a..z} P(q, e, a, b, ..., z) = Σ_{a..z} Π_i P(variable_i | parents(variable_i)) (using the Bayesian network factorization). Step 4: Distribute the summations across the product terms for efficient computation. 83 / 85
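A worked instance of the four steps on the Burglary network from earlier, for the query P(B | j, m); the rearrangement below is a standard one and is not copied from the slides:

```latex
\begin{align*}
P(B \mid j, m) &= \alpha\, P(B, j, m)
  && \text{Step 1, with } \alpha = 1/P(j, m)\\
&= \alpha \sum_{e} \sum_{a} P(B, e, a, j, m)
  && \text{Step 2}\\
&= \alpha \sum_{e} \sum_{a} P(B)\, P(e)\, P(a \mid B, e)\, P(j \mid a)\, P(m \mid a)
  && \text{Step 3}\\
&= \alpha\, P(B) \sum_{e} P(e) \sum_{a} P(a \mid B, e)\, P(j \mid a)\, P(m \mid a)
  && \text{Step 4}
\end{align*}
```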
  • 205. Outline 1 History The History of Bayesian Applications 2 Bayes Theorem Everything Starts at Someplace Why Bayesian Networks? 3 Bayesian Networks Definition Markov Condition Example Using the Markov Condition Representing the Joint Distribution Observations Causality and Bayesian Networks Precautionary Tale Causal DAG Inference in Bayesian Networks Example General Strategy of Inference Inference - An Overview 84 / 85
  • 206. Inference – An Overview Case 1: Trees and singly connected networks (only one path between any two nodes): message passing (Pearl, 1988). Case 2: Multiply connected networks: a range of exact algorithms, including cut-set conditioning (Pearl, 1988), junction tree propagation (Lauritzen and Spiegelhalter, 1988) and bucket elimination (Dechter, 1996), to mention a few, plus a range of algorithms for approximate inference. Notes: Both exact and approximate inference are NP-hard in the worst case. Here the focus will be on message passing and junction tree propagation for discrete variables. 85 / 85