Understanding your data
with Bayesian networks
(in python)
Bartek Wilczyński
bartek@mimuw.edu.pl
University of Warsaw
PyData Silicon Valley, May 5th 2014
Are you confused enough?
Or should I confuse you a bit more?
Image from xkcd.org/552/
Data show: Confused students score better!
Data from Eric Mazur
There may be factors we haven't thought about
● Maybe confusion helps
with learning?
● Or maybe there is
an alternative explanation?
● As long as these are just
cartoon models – we
cannot really rule out any
structure
[Figure: two candidate model structures, one over "Paying attention", "Being confused" and "Correct answer", or one over "Being confused" and "Correct answer" only]
What do I mean by data?
Sex  Age    Smoking    Stress  Lung    Heart   Feel
M    0-20   never      N       no      no      great
F    70+    sometimes  N       minor   no      OK
M    50-70  daily      Y       no      severe  Not-so-well
M    20-50  daily      N       no      minor   OK
F    70+    never      N       no      minor   great
F    20-50  sometimes  Y       severe  minor   Not-so-well
F    20-50  never      Y       no      no      great
M    20-50  sometimes  N       minor   no      great
M    50-70  never      Y       severe  no      OK
F    0-20   never      N       no      severe  OK
M    20-50  daily      Y       no      no      OK
M    0-20   daily      N       no      no      Not-so-well
M    20-50  never      N       minor   no      OK
...  ...    ...        ...     ...     ...     ...
Network of connections
Smoking (daily, sometimes, never)
Age (0-20, 20-50, 50-70, 70+)
Stressful job (yes, no)
Lung problems (no, minor, severe)
Heart problems (no, minor, severe)
Sex (male, female)
How did you feel this morning? (great, OK, not-so-well, terrible)
What is a Bayesian Network?
● A directed acyclic graph (a directed graph without cycles)
● with nodes representing random variables
● and edges between nodes representing dependencies
(not necessarily causal)
● Each edge is directed from a parent to a child, so all
nodes with edges pointing to a given node constitute its
set of parents
● Each variable is associated with a value domain and a
probability distribution conditional on parents' values
Back to our confused students
● Let us consider our model of
confused students
● We can consider the model
with an additional variable
● We need to have data on the
additional variable to be
predictive
● Sometimes we need to use
“wrong” models if they are
predictive
[Figure: the network over "Paying attention", "Being confused" and "Correct answer", shown twice, each time with one conditional probability table]

                Paying attention
                yes     no
confused        80%     0%
not confused    20%     100%

                Paying attention
                yes     no
correct         50%     20%
incorrect       50%     80%
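
The two tables above are enough to see how a common cause can make confusion look beneficial. The following is a minimal sketch, assuming that "Paying attention" is the only parent of both other variables and assuming a prior P(paying attention) = 0.5, which the slides do not give. The function name and the prior are illustrative choices, not part of the talk.

```python
# Conditional probability tables from the slide, keyed by "paying attention".
P_CONFUSED = {True: 0.8, False: 0.0}   # P(confused | attention)
P_CORRECT = {True: 0.5, False: 0.2}    # P(correct | attention)


def p_correct_given_confused(confused, p_attention=0.5):
    """P(correct | confused) with "paying attention" summed out.

    p_attention is an assumed prior; the slides do not provide one.
    """
    joint_correct = 0.0  # accumulates P(correct, confused = value)
    marginal = 0.0       # accumulates P(confused = value)
    for attention, p_a in ((True, p_attention), (False, 1.0 - p_attention)):
        p_c = P_CONFUSED[attention] if confused else 1.0 - P_CONFUSED[attention]
        marginal += p_a * p_c
        # Confusion and correctness are independent once attention is known.
        joint_correct += p_a * p_c * P_CORRECT[attention]
    return joint_correct / marginal


print(p_correct_given_confused(True))
print(p_correct_given_confused(False))
```

Under these assumptions a confused student answers correctly with probability of about 0.5 and a non-confused student with about 0.25, even though confusion has no direct effect on the answer in this model.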
Can we find the “best” Bayesian Network?
● Given a dataset with observations,
we can try to find the “best”
network topology (i.e. the best
collection of parents' sets)
● In order to do it automatically we
need a scoring function to define
what we mean by “best”
● A score function is useful if it can
be written as a sum over
variables, i.e. the best network
consists of best parent sets for
variables (modulo acyclicity)
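
To illustrate what "written as a sum over variables" means, here is a small sketch of a decomposable score. It uses a BIC-style penalised log-likelihood as a stand-in; this is an assumption made for illustration and is not claimed to be one of the scores discussed in the talk. The function names and the toy rows are hypothetical.

```python
import math
from collections import Counter


def local_score(rows, child, parents):
    """Log-likelihood of `child` given `parents`, minus a size penalty.

    Illustrative BIC-style score; any score with this per-variable
    form would decompose in the same way.
    """
    parent_counts = Counter(tuple(r[p] for p in parents) for r in rows)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    loglik = sum(n * math.log(n / parent_counts[pa])
                 for (pa, _), n in joint.items())
    # Free parameters: one distribution per observed parent configuration.
    n_params = len(parent_counts) * (len({r[child] for r in rows}) - 1)
    return loglik - 0.5 * math.log(len(rows)) * n_params


def network_score(rows, parent_sets):
    """Decomposable score: a sum of independent per-variable terms."""
    return sum(local_score(rows, child, parents)
               for child, parents in parent_sets.items())


# Toy data in which "correct" simply copies "attention".
rows = [{"attention": a, "correct": a} for a in (0, 0, 1, 1)]

print(local_score(rows, "correct", ()))
print(local_score(rows, "correct", ("attention",)))
print(network_score(rows, {"attention": (), "correct": ("attention",)}))
```

Because each term depends only on one variable and its parents, changing the parent set of one variable leaves all other terms untouched, which is what makes per-variable optimisation possible.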
How to find the best network?
● There are generally three main approaches to defining BN scores:
– Bayesian statistics, e.g. BDe (Herskovits et al. '95)
– Information Theoretic, e.g. MDL (Lam et al. '94)
– Hypothesis testing, e.g. MMPC (Salehi et al. '10)
● There are also hybrid approaches, like the recent MIT (de Campos '06)
approach that uses information theory and hypothesis testing
● We have two issues:
– There are exponentially many potential parent sets
– The desired network needs to have no cycles
● The second issue is more important and makes the problem NP-complete
(Chickering '96)
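
The first issue can be made concrete by counting candidates. The sketch below counts the possible parent sets for a single variable, with and without a cap on the number of parents. The cap (`max_parents`) is an assumption introduced here for illustration; the slides do not mention one.

```python
from math import comb


def count_parent_sets(n_variables, max_parents=None):
    """Number of candidate parent sets for one variable.

    Candidates are subsets of the other n_variables - 1 variables,
    optionally limited to at most max_parents members.
    """
    others = n_variables - 1
    limit = others if max_parents is None else min(max_parents, others)
    return sum(comb(others, k) for k in range(limit + 1))


for n in (7, 20, 50):
    print(n, count_parent_sets(n), count_parent_sets(n, max_parents=3))
```

Without a cap the count doubles with every added variable, while with a fixed cap it grows only polynomially.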
Cycles are not always a problem
● Dynamic Bayesian
Networks are a variant of
BN models that describe
temporal dependencies
● We can safely assume that
the causal links only go
forward in time
● That breaks the problem of
cycles as we now have two
versions of each variable:
“before” and “after”
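[Figure: variables X1, X2, X3 drawn as a single network, and unrolled into two copies of each variable, one at time t and one at time t+1]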
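
The "before" and "after" idea can be sketched in a few lines. The example below takes a hypothetical set of dependencies among X1, X2 and X3 that contains a feedback loop, rewrites every edge so that it points from time t to time t+1, and checks both graphs for cycles. The edge list and helper names are assumptions made for illustration.

```python
def is_acyclic(edges):
    """Return True if the directed graph of (parent, child) pairs has no cycle."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    for _, child in edges:
        indegree[child] += 1
    # Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    ready = [node for node in nodes if indegree[node] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return removed == len(nodes)


def unroll(edges):
    """Rewrite each edge X -> Y as (X at time t) -> (Y at time t+1)."""
    return [((parent, "t"), (child, "t+1")) for parent, child in edges]


# Hypothetical dependencies, including a feedback loop and a self-loop.
static_edges = [("X1", "X2"), ("X2", "X3"), ("X3", "X1"), ("X1", "X1")]

print(is_acyclic(static_edges))          # the feedback loop is a cycle
print(is_acyclic(unroll(static_edges)))  # every edge now goes from t to t+1
```

Since every unrolled edge starts in the "t" layer and ends in the "t+1" layer, no path can return to its starting node, whatever the original dependencies were.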
Different types of variables
● Another common situation is
when we have different types
of variables
● We may know that only
certain types of connections
are causal
● Or we may be interested only in
certain types of connections
● This breaks the cycles as well
[Figure: three types of variables: Mutations, Protein expression, Diseases]
BNFinder – python library for Bayesian Networks
● A library for identification of
optimal Bayesian Networks
● Works under assumption of
acyclicity by external
constraints (disjoint sets of
variables or dynamic
networks)
● Fast and efficient (relatively)
Example 1 – the simplest possible
Now, parallelize!
● Since we have external
constraints on acyclicity, we
can search for parent sets
independently
● This leads to a simple
parallelization scheme and
good efficiency
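
A sketch of this scheme, under stated assumptions: the child variables and the candidate parents are disjoint (the external constraint), so any combination of parent sets is acyclic and each child can be solved on its own. This is not BNFinder's code or API; `toy_score` is a hypothetical stand-in for a real data-driven local score, and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def best_parent_set(child, candidates, local_score, max_parents=2):
    """Exhaustively pick the best-scoring parent set for one child variable."""
    options = (subset
               for k in range(max_parents + 1)
               for subset in combinations(candidates, k))
    return max(options, key=lambda parents: local_score(child, parents))


def _solve(task):
    """Unpack one (child, candidates, local_score) task."""
    return best_parent_set(*task)


def learn_structure(children, candidates, local_score, map_fn=map):
    """Choose parent sets independently; map_fn may be a parallel map."""
    tasks = [(child, candidates, local_score) for child in children]
    return dict(zip(children, map_fn(_solve, tasks)))


def toy_score(child, parents):
    """Hypothetical score: penalise any deviation from a made-up target."""
    wanted = {"Lung": {"Smoking"}, "Heart": {"Smoking", "Stress"}}
    return -len(wanted[child] ^ set(parents))


CHILDREN = ["Lung", "Heart"]
CANDIDATES = ["Smoking", "Stress", "Age"]

# Each child is an independent task, so a pool's map can replace the builtin.
with ThreadPoolExecutor() as pool:
    network = learn_structure(CHILDREN, CANDIDATES, toy_score, map_fn=pool.map)
print(network)
```

A thread pool is used here only to keep the sketch portable; for CPU-bound scoring, the `map` method of a `concurrent.futures.ProcessPoolExecutor` can be passed in the same way, provided the score function is picklable.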
Bonn et al. Nat. Genet, 2012
Active Inactive
Making the training set for “activity” variable
Handling continuous data
Network model
Does it provide useful predictions?
• 12 positive and 4 negative predictions tested
• >90% success (1 error)
Some more continuous data with perturbations
• 8008 enhancers compiled
from 15 ChIP experiments
(almost 20k binding peaks)
• Activity data for ~140
enhancers divided into
– 3 tissues (MESO, VM, SM)
– 5 stages
(4-6, 7-8, 9-10, 11-12, 13-16)
• Gene expression data for
5082 genes from the BDGP
database
Wilczynski et al., PLoS Comp. Biol., 2012
Predictions validated:
19/20 correct stage, 10/20 correct tissue
Summary
● Bayesian Networks can provide predictive models based on
conditional probability distributions
● BNFinder is an effective tool for finding optimal networks given
tabular data. And it's open source!
● It can be used as a command-line tool or as a library
● It can use continuous data as well as discrete
● Can be run in parallel on multiple cores (with good efficiency)
● Convenience functions (cross-validation, ROC plots) included
http://launchpad.net/bnfinder
Thanks!
● Norbert Dojer
● Alina Frolova
● Paweł Bednarz
● Agnieszka Podsiadło
● Questions?

