Exact Inference in Bayesian Networks using MapReduce__HadoopSummit2010

1. Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov
Cloudera, Inc.

2. Session Agenda
- About Me
- About Cloudera
- Bayesian (Probabilistic) Networks
- BN Inference 101
- CPCS Network
- Why BN Inference
- Inference with MR
- Results
- Conclusions

3. About Me
- Worked on BN inference in 1995-1998 (for Ph.D.)
- Published the fastest implementation at the time
- Worked in the DM/BI field since then
- Recently joined Cloudera, Inc.
- Started looking at how to solve the world’s hardest problems

4. About Cloudera
- Founded in the summer of 2008
- Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses.
- Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.

5. Bayesian Networks
- Nodes
- Edges
- Probabilities
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances, published posthumously by his friend. Philosophical Transactions of the Royal Society of London, 53:370-418.

6. Applications
- Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)
- Medicine
- Document classification, information retrieval
- Image processing
- Data fusion
- Gaming
- Law
- On-line advertising!

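7. A Simple BN Network
Nodes: Rain, Sprinkler, Wet Driveway
Rain and Sprinkler tables, columns T, F (values in extraction order; only two row labels survive):
- 0.4, 0.6 (row F)
- 0.2, 0.8
- 0.1, 0.9 (row T)
Wet Driveway given (Sprinkler, Rain), columns T, F:
- F, F: 0.01, 0.99
- F, T: 0.8, 0.2
- T, F: 0.9, 0.1
- T, T: 0.99, 0.01
Example queries:
- Pr(Rain | Wet Driveway)
- Pr(Sprinkler Broken | !Wet Driveway & !Rain)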
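
The two queries on this slide can be answered by brute-force enumeration of the joint distribution. The sketch below is only an illustration: the Wet Driveway table is taken from the slide, but the Rain prior and the Sprinkler-given-Rain table are an assumed reading of the scrambled numbers, and "Sprinkler Broken" is treated here as Sprinkler = T. All function names are hypothetical.

```python
from itertools import product

# ASSUMPTION: one possible reading of the scrambled Rain/Sprinkler numbers.
P_RAIN = {True: 0.2, False: 0.8}                  # Pr(Rain)
P_SPRINKLER_GIVEN_RAIN = {False: 0.4, True: 0.1}  # Pr(Sprinkler=T | Rain)
# From the slide: Pr(Wet Driveway=T | Sprinkler, Rain).
P_WET_GIVEN = {
    (False, False): 0.01,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def joint():
    """Build the full joint table Pr(Rain, Sprinkler, Wet) as a product of CPT entries."""
    table = {}
    for r, s, w in product([True, False], repeat=3):
        p_s = P_SPRINKLER_GIVEN_RAIN[r] if s else 1 - P_SPRINKLER_GIVEN_RAIN[r]
        p_w = P_WET_GIVEN[(s, r)] if w else 1 - P_WET_GIVEN[(s, r)]
        table[(r, s, w)] = P_RAIN[r] * p_s * p_w
    return table


def prob(table, pred):
    """Sum the joint over all rows where pred(rain, sprinkler, wet) holds."""
    return sum(p for key, p in table.items() if pred(*key))


def conditional(table, query, evidence):
    """Pr(query | evidence) = Pr(query, evidence) / Pr(evidence)."""
    both = prob(table, lambda r, s, w: query(r, s, w) and evidence(r, s, w))
    return both / prob(table, evidence)


JPD = joint()
p_rain_given_wet = conditional(JPD, lambda r, s, w: r, lambda r, s, w: w)
p_sprinkler_given_dry_no_rain = conditional(
    JPD, lambda r, s, w: s, lambda r, s, w: (not w) and (not r)
)
```

Enumeration is fine for three binary nodes (8 rows); the rest of the talk is about what to do when the table is far too large to materialize.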

9. BN Inference 101 (in Hive)
- JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)
- PAB = SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B;
- PB = SELECT B, SUM(PROB) FROM PAB GROUP BY B;
- Pr(A|B) = Pr(A,B)/Pr(B) (Bayes’ rule)
- CPCS is 422 nodes, a table of at least 2^422 rows!
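
The Hive queries on this slide are plain group-by sums over the joint table. A minimal Python sketch of the same steps follows, assuming a made-up three-variable joint table (the numbers are illustrative, not from the talk) and a hypothetical helper `group_by_sum`:

```python
from collections import defaultdict


def group_by_sum(table, names, keep):
    """Equivalent of SELECT keep, SUM(PROB) FROM table GROUP BY keep.

    table: dict mapping a tuple of variable values to a probability.
    names: variable names, in the order used by the table keys.
    keep:  names of the variables to group by.
    """
    idx = [names.index(v) for v in keep]
    out = defaultdict(float)
    for assignment, p in table.items():
        out[tuple(assignment[i] for i in idx)] += p
    return dict(out)


# Hypothetical joint distribution over binary A, B, C (illustrative numbers).
JPD = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.20, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

PAB = group_by_sum(JPD, ("A", "B", "C"), ("A", "B"))  # Pr(A, B)
PB = group_by_sum(PAB, ("A", "B"), ("B",))            # Pr(B)


def pr_a_given_b(a, b):
    """Pr(A=a | B=b) = Pr(A=a, B=b) / Pr(B=b)."""
    return PAB[(a, b)] / PB[(b,)]


# Lower bound on the rows of a full joint table over 422 nodes (all binary).
ROWS_FOR_CPCS = 2 ** 422
```

The row count is the reason the full table is never built in practice: even with every node binary, it is a 128-digit number.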

11. CPCS Networks
- 422 nodes
- 14 nodes describe diseases
- 33 risk factors
- 375 various findings related to diseases

12. CPCS Networks

13. Why Bayesian Network Inference?
- Choose the right tool for the right job!
- BN is an abstraction for reasoning and decision making
- Easy to incorporate human insight and intuitions
- Very general, no specific ‘label’ node
- Easy to do ‘what-if’, strength of influence, value of information, analysis
- Immune to Gaussian assumptions
- It’s all just a joint probability distribution

18. Map & Reduces
[Figure: keys over the states of B, C, E (B1C1E1 through B2C2E2) pass through Map and Reduce. Inputs are the A, B entries (A1B1, A2B1, A1B2, A2B2) and the C, D entries (C1D1, C2D1, C1D2, C2D2). Aggregation 1 (+) produces the sums ∑ Pr(B1|A) and ∑ Pr(D|C1); Aggregation 2 (x) forms the product Pr(C|BE) x ∑ Pr(B1|A) x ∑ Pr(D|C1) for the clique BCE.]

19. MapReduce Implementation
For each clique in depth-first order:
MAP:
- Sum over the variables to get ‘clique message’ (requires state, custom partitioner and input format)
- Emit factors for the next clique
REDUCE:
- Multiply the factors from all children
- Include probabilities assigned to the clique
- Form the new clique values
The MAP is done over all child cliques.
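
The map and reduce steps listed above can be sketched in memory for a single parent clique. This is not the talk's implementation: the clique layout (children AB and CD, parent BCE) follows the example on the previous slide, but the numbers, the function names `map_clique` and `reduce_clique`, and the dictionary-based "shuffle" are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def map_clique(child_id, potential, names, separator):
    """MAP: sum out every variable not in the separator (in-mapper aggregation)
    and emit one partial sum per separator assignment."""
    idx = [names.index(v) for v in separator]
    sums = defaultdict(float)
    for assignment, value in potential.items():
        sums[tuple(assignment[i] for i in idx)] += value
    for sep_values, total in sums.items():
        yield (child_id, separator, sep_values), total


def reduce_clique(parent_names, domains, assigned, messages):
    """REDUCE: for each parent assignment, multiply the assigned probabilities
    by the matching message entry from every child."""
    new_values = {}
    for assignment in product(*(domains[v] for v in parent_names)):
        env = dict(zip(parent_names, assignment))
        value = assigned[assignment]
        for (_child, sep_names, sep_values), msg in messages.items():
            if all(env[n] == v for n, v in zip(sep_names, sep_values)):
                value *= msg
        new_values[assignment] = value
    return new_values


def bern(p, x):
    """Probability of outcome x for a binary variable with Pr(x=1) = p."""
    return p if x else 1.0 - p


# Hypothetical child cliques: AB holds Pr(A) Pr(B|A); CD holds Pr(D|C).
AB = {(a, b): bern(0.3, a) * bern({0: 0.2, 1: 0.9}[a], b)
      for a, b in product((0, 1), repeat=2)}
CD = {(c, d): bern({0: 0.5, 1: 0.1}[c], d)
      for c, d in product((0, 1), repeat=2)}

# Probabilities assigned to the parent clique BCE: Pr(E) Pr(C|B,E).
P_C1 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.95}  # Pr(C=1 | B, E)
ASSIGNED = {(b, c, e): bern(0.6, e) * bern(P_C1[(b, e)], c)
            for b, c, e in product((0, 1), repeat=3)}

messages = {}
messages.update(map_clique("AB", AB, ("A", "B"), ("B",)))
messages.update(map_clique("CD", CD, ("C", "D"), ("C",)))

BCE = reduce_clique(("B", "C", "E"),
                    {"B": (0, 1), "C": (0, 1), "E": (0, 1)},
                    ASSIGNED, messages)
```

In this toy setup the resulting BCE table is the joint Pr(B, C, E), so its entries sum to one. On a cluster the message dictionary would be replaced by the shuffle, with keys routed so that all factors for one parent entry meet in the same reducer.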

20. Cliques, Trees, and Parallelism
- Topological parallelism: compute branches C2 and C4 in parallel
- Clique parallelism: divide computation of each clique into maps/reducers
- Fall back to optimal factoring if a corresponding subtree is small
- Combine multiple phases together
- Reduce replication level
[Figure: clique tree with cliques C1 through C6]
Cliques may be larger than they appear!

25. CPCS Inference
CPCS:
- The 360-node subnet has the largest ‘clique’ of 11,739,896 floats (fits into 2GB)
- The full 422-node version (absent, mild, moderate, severe): 3,377,699,720,527,872 floats (or 12 PB of storage, but do not need it for all queries)
- In most cases do not need to do inference on the full network
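
The storage figures on this slide can be checked with a few lines of arithmetic. The sketch assumes 4-byte (32-bit) floats and binary units (1 PB taken as 2^50 bytes); neither is stated on the slide, but together they reproduce the quoted 12 PB.

```python
BYTES_PER_FLOAT = 4  # ASSUMPTION: 32-bit floats

SUBNET_CLIQUE_FLOATS = 11_739_896            # largest clique, 360-node subnet
FULL_CLIQUE_FLOATS = 3_377_699_720_527_872   # largest clique, full 422-node network

subnet_bytes = SUBNET_CLIQUE_FLOATS * BYTES_PER_FLOAT
full_bytes = FULL_CLIQUE_FLOATS * BYTES_PER_FLOAT

subnet_mib = subnet_bytes / 2**20  # size in MiB
full_pib = full_bytes / 2**50      # size in PiB

print(f"subnet clique: {subnet_mib:.1f} MiB")
print(f"full clique:   {full_pib:.1f} PiB")
```

Under these assumptions the subnet clique needs well under 2 GB, while the full-network clique is exactly 3 x 2^50 floats, which is where the 12 PB comes from.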

26. Results
[1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
[2] MacBook Pro, 4 GB DDR3, 2.53 GHz
[3] 10-node Linux Xeon cluster, 24 GB, quad 2-core

27. Conclusions
- Exact probabilistic inference is finally in sight for the full 422 node CPCS network
- Hadoop helps to solve the world’s hardest problems
What you should know after this talk:
- BN is a DAG and represents a joint probability distribution (JPD)
- Can compute conditional probabilities by multiplying and summing JPD
- For large networks, this may be PBytes of intermediate data, but it’s MR

28. Questions?
alexvk@{cloudera,gmail}.com

29. BACKUP

30. Optimizing BN Inference 101
- Conditioning nodes (evidence): do not need to be summed
- Barren nodes (bare child nodes, whose values sum to one): can be dropped from the network
- Noisy-OR (conditional independence of parents)
- Context specific independence (based on the specific value of one of the parents)
Wet grass given its two parents, columns T, F:
- F, F: 0.01, 0.99
- F, T: 0.8, 0.2
- T, F: 0.9, 0.1
- T, T: 0.99, 0.01
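
To illustrate why Noisy-OR helps, the sketch below builds a full conditional probability table from one parameter per parent plus an optional leak term, using the standard Noisy-OR formula Pr(child = T | parents) = 1 - (1 - leak) x product of (1 - p_i) over the active parents. The function name and the example parameters are hypothetical and are not fitted to the Wet grass table on the slide.

```python
from itertools import product


def noisy_or_cpt(link_probs, leak=0.0):
    """Return {parent_states: Pr(child=T | parent_states)} for binary parents.

    link_probs[i] is the probability that parent i alone turns the child on;
    leak is the probability that the child is on with no active parent.
    """
    cpt = {}
    for states in product((False, True), repeat=len(link_probs)):
        p_off = 1.0 - leak
        for active, p in zip(states, link_probs):
            if active:
                p_off *= 1.0 - p  # each active parent independently fails to fire
        cpt[states] = 1.0 - p_off
    return cpt


# Two parents with illustrative link probabilities and a small leak.
cpt = noisy_or_cpt([0.9, 0.8], leak=0.01)
```

A node with n binary parents needs 2^n table rows in general, but only n + 1 numbers under this model, which is what makes very wide nodes such as CPCS findings tractable to specify.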

31. GeNIe package

32. MapReduce Implementation
- No updates: have to compute clique potentials from all children and assigned probabilities
- Tree structure
- The key encodes full set of variable values (LongWritable or composite)
- The value encodes partial sums (proportional to probabilities)
- No need for TotalOrderPartitioning (we know the key distribution)
- Need custom Partitioner and WritableComparator (next slide)
- Need to do the aggregation in the Mapper (sum, next slide)
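
One way a full set of variable values could be packed into a single integer key, and how a partitioner could exploit the known key layout, is sketched below. The mixed-radix encoding and the prefix-based routing are assumptions for illustration; the slide does not say how the actual keys or the custom Partitioner were built. In Hadoop the routing logic would sit in a `Partitioner` subclass's `getPartition(key, value, numPartitions)` method, and the key would have to fit in the signed 64-bit range of a `LongWritable`.

```python
def encode_key(values, cardinalities):
    """Pack one state index per variable into a single mixed-radix integer."""
    key = 0
    for value, card in zip(values, cardinalities):
        if not 0 <= value < card:
            raise ValueError(f"state {value} out of range for cardinality {card}")
        key = key * card + value
    return key


def decode_key(key, cardinalities):
    """Inverse of encode_key: recover the tuple of state indices."""
    values = []
    for card in reversed(cardinalities):
        key, value = divmod(key, card)
        values.append(value)
    return tuple(reversed(values))


def partition(key, cardinalities, num_kept, num_reducers):
    """Route by the leading num_kept variables, so that all entries which differ
    only in the trailing (summed-out) variables reach the same reducer."""
    dropped = 1
    for card in cardinalities[num_kept:]:
        dropped *= card
    return (key // dropped) % num_reducers


# Hypothetical clique: two 4-state variables and one binary variable.
CARDS = (4, 4, 2)
k0 = encode_key((3, 1, 0), CARDS)
k1 = encode_key((3, 1, 1), CARDS)
```

Because the key space is dense and its size is known up front, reducers can be balanced by simple arithmetic on the key, which is consistent with the slide's remark that no sampling-based total-order partitioning is needed.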

33. Current implementation
- Built on top of an old 1997 C program with a few modifications
- An interactive command line program for interactive analysis
- Estimates running time from the optimal factoring plan and either:
  - executes it locally, or
  - ships a jar to a Hadoop cluster to execute
