Dirichlet Processes and Applications
Saurav Jha
Machine Learning Engineer
Copyright © 2018 FactSet Research Systems Inc. All rights reserved. Confidential: Do not forward.
Table of Contents
1. Probability 101: Mass & Density Functions
2. Probability 102: Simplex and its geometrical meaning
3. Dirichlet Distribution
4. Dirichlet Process
5. A demo
6. An application
Probability 101
Probability Mass Function (PMF) vs Density Function (PDF)
• PDF = a density whose integral over a range gives the probability that a continuous random variable falls in that range
• PMF = the probability that a discrete random variable is exactly equal to some value
• In the continuous setting:
∫ab f(x)dx = prob. that the outcome is between a and b
i.e., the units of f(x) = probability per unit length (dx)
= how dense the probability is per unit length near x
• In the discrete setting:
f(x) = Pr(X = x)
i.e., the units of f(x) = plain probability
= the mass that X places on the point x
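A quick sanity check of this distinction in Python (a minimal sketch; the normal and binomial distributions are my own illustrative choices):

```python
import numpy as np
from scipy import stats, integrate

# Continuous: the PDF value is not a probability; its integral over [a, b] is.
a, b = -1.0, 1.0
prob_cont, _ = integrate.quad(stats.norm.pdf, a, b)
print(f"P({a} <= X <= {b}) = {prob_cont:.4f}")  # ~0.6827 for a standard normal

# Discrete: the PMF value is itself a probability.
prob_disc = stats.binom.pmf(k=3, n=10, p=0.5)
print(f"P(X = 3) = {prob_disc:.4f}")
```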
Probability Simplex
• The set of all PMFs on the entire sample space:
S = { x ∈ Rn : xi ≥ 0, ∑i=1..n xi = 1 }
Probability 102: K-Simplex – geometrical meaning
• A k-simplex is a k-dimensional polytope (a geometric object with flat sides) formed from the convex hull of its k+1 vertices.
• Let u0, u1, …, uk ∈ Rk be (k+1) points; the simplex they determine is the set of points
C = { θ0u0 + … + θkuk | ∑i=0..k θi = 1 and θi ≥ 0 ∀ i }
 View u0, u1, u2 as a disjoint set of possible events whose probabilities sum to 1,
i.e. p0 + p1 + p2 = 1, where 0 ≤ pi ≤ 1.
 Consider the three probabilities as a point (p0, p1, p2) in Euclidean space.
 The resulting shape outlines the perimeter of a triangle.
 While the set C lies in a k-dimensional space (here k = 3), the object it forms is (k−1)-dimensional.
 Each point p in the simplex is a pmf in its own right, i.e. each component of p lies in [0, 1] and all its components sum to 1.
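This (k−1)-dimensional picture is easy to verify numerically; a small numpy sketch (my own illustration, which peeks ahead to the Dirichlet distribution on the next slide, since Dir(1, 1, 1) is uniform over the 2-simplex):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=1000)  # points of the 2-simplex in R^3

assert np.all(pts >= 0)                    # theta_i >= 0
assert np.allclose(pts.sum(axis=1), 1.0)   # sum of theta_i = 1

# The simplex is (k-1)-dimensional: centred points span a rank-2 subspace of R^3.
print(np.linalg.matrix_rank(pts - pts.mean(axis=0)))  # 2
```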
Dirichlet distribution
• Let Q = [Q1, Q2, …, Qk] be a random pmf, i.e. Qi ≥ 0 for i = 1, 2, …, k and ∑i=1..k Qi = 1.
• Let α = [α1, α2, …, αk], with αi > 0 for each i, and let α0 = ∑i=1..k αi.
• Then Q has a Dirichlet distribution with parameter α, denoted Q ∼ Dir(α), with density:
P(Q1 = q1, …, Qk = qk) = Γ(α0) / (Γ(α1) ⋯ Γ(αk)) · q1^(α1−1) ⋯ qk^(αk−1)
• A probability distribution whose samples lie in the (k−1)-dimensional probability simplex ∆k, i.e., a distribution over pmfs of length k.
• It ranges over the possible parameter vectors of a multinomial distribution and is the conjugate prior of the multinomial distribution.
“A distribution of distributions”
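A minimal numpy sketch of both facts (the α vector is an illustrative assumption): each draw from Dir(α) is itself a pmf, and the empirical mean of many draws approaches the Dirichlet mean αi/α0:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = np.array([4.0, 4.0, 2.0])              # illustrative parameter vector
samples = rng.dirichlet(alpha, size=100_000)   # each row is a pmf of length k = 3

assert np.allclose(samples.sum(axis=1), 1.0)   # every sample lies on the simplex

print(samples.mean(axis=0))   # ~ [0.4, 0.4, 0.2]
print(alpha / alpha.sum())    # Dirichlet mean alpha_i / alpha_0 = [0.4, 0.4, 0.2]
```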
Dirichlet distribution – an example use-case
• X = a vector of counts from n draws of a random variable with 3 possible outcomes, e.g. X = [4, 4, 2] with n = 10.
• PMF of X = the multinomial distribution: P(X) = n! / (n1! n2! n3!) · p1^n1 · p2^n2 · p3^n3
Q) What if p1, p2, p3 are unknown, i.e. there is no certainty over what the distribution of the categorical variable is?
 Solution: use a Dirichlet distribution with parameters α1, α2, α3 to first draw a pmf p ∼ Dir(α), and then draw X ∼ Multinomial(n, p).
• This introduces one level of indirection in the model for X: instead of stating which p generated X, the parameters α1, α2, α3 describe which probability distributions are likely, and samples X are then drawn according to the random p.
• Since sampling is directly from the probability k-simplex, the expected value of a draw from a k-dimensional Dirichlet equals the mean of the Dirichlet, E[Qi] = αi/α0.
• Adding the Dirichlet distribution introduces prior beliefs about which values of X are likely to occur, i.e., the random pmf has a Dirichlet distribution with parameter α. [1]
• Analogy 1: if a random pmf is a bag full of dice, then a sample from the Dirichlet is one specific die.
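A short sketch of this two-stage draw (the α values and n are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = [2.0, 2.0, 1.0]       # prior beliefs over the 3 outcomes (assumed values)
n = 10                        # number of draws per experiment

p = rng.dirichlet(alpha)      # step 1: pick a die        p ~ Dir(alpha)
x = rng.multinomial(n, p)     # step 2: roll it n times   X ~ Multinomial(n, p)
print(p, x)                   # e.g. [0.45 0.33 0.22] [5 3 2]
```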
Dirichlet Process
• In the dice analogy, each die must have a finite number of faces.
• Limitation of the Dirichlet distribution: it assumes a finite set of events.
 Dirichlet processes to the rescue! A Dirichlet process enables working with an infinite set of events, and hence modelling probability distributions over infinite sample spaces.
Analogy 2:
• Ask pedestrians on the street to choose their favourite colour out of {V, I, B, G, Y, O, R}.
• Based on the answers, model each person as a pmf over the 7 colours.
• Each person’s pmf is then a realization of a draw from a Dirichlet distribution over 7 colours.
 What if the choices are no longer restricted to 7 colours?
• Modelling each individual’s pmf over an unbounded set of colours requires a distribution over distributions on an infinite sample space.
• One solution: a Dirichlet process.
Dirichlet Process – definition
• Assign elements A, B, C, … to an unknown number of categories following the algorithm below (simulated in the sketch after this list):
 Input: H (a probability distribution, a.k.a. the base distribution) and α (a positive real number, a.k.a. the concentration parameter).
 Draw the first element from H.
 For n > 1:
 Assign the nth element to a new category (with its value drawn from H) with probability α / (α + n − 1).
 Assign the nth element to a pre-existing category x with probability nx / (α + n − 1), where nx = the number of elements already assigned to x.
• Used when modelling data that tends to repeat previous values in a “rich get richer” fashion.
• This sequential construction is also known as the Chinese restaurant process.
• Applications: morphological segmentation in NLP; modelling mutation rates of genes in evolutionary biology.
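A minimal simulation of the assignment rule above (a sketch; it tracks only category labels, with the value of each new category understood to come from H):

```python
import numpy as np

def crp_assignments(n, alpha, seed=0):
    """Assign n elements to categories via the rule above (Chinese restaurant process)."""
    rng = np.random.default_rng(seed)
    counts = []                                # counts[x] = n_x
    assignments = []
    for i in range(n):                         # the (i+1)-th element arrives
        p_new = alpha / (alpha + i)            # alpha / (alpha + n - 1); = 1 for the first element
        if rng.random() < p_new:
            counts.append(1)                   # open a new category (value ~ H)
            assignments.append(len(counts) - 1)
        else:
            probs = np.array(counts) / sum(counts)   # n_x / (n - 1), given "not new"
            x = rng.choice(len(counts), p=probs)
            counts[x] += 1                     # "rich get richer"
            assignments.append(x)
    return assignments, counts

print(crp_assignments(20, alpha=1.0))   # larger alpha -> more categories
```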
A demo [2]
An application: Learning of hierarchical morphology paradigms [3]
• A paradigm = a pair (StemList, SuffixList) where each Stem+Suffix string is a valid word.
• It can be modelled as a hierarchical structure.
• Morphologically similar words sit close to each other in the structure.
• Similarity metric = number of common morphemes.
• Notation: w = word, s = stem, m = suffix.
• Assumption: stems and suffixes are generated independently from each other.
• Probability of a word: p(w = s + m) = p(s) · p(m)
An application: Learning of hierarchical morphology paradigms [3]
1. Two Dirichlet processes generate stems and suffixes independently:
• βs = the concentration parameter, which governs the number of stem types generated by the DP.
• If β is small, new stem/suffix types are less likely to be generated.
• If β is large, new stem/suffix types are more likely, yielding a more uniform distribution.
• The authors choose β < 1, i.e. a more skewed distribution with sparse stems & suffixes.
• P = the base distribution, specifying a prior probability distribution over morpheme lengths.
• The joint probability of the stems can then be calculated from the DP’s sequential (Chinese restaurant process) predictive probabilities given in the Dirichlet Process definition above.
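A toy sketch of this generative story (my own illustration: the geometric length prior, the alphabet, and β = 0.5 are assumptions, not the paper's exact choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_morpheme(p_len=0.4):
    """Toy base distribution P: a geometric prior over morpheme length (assumed)."""
    length = rng.geometric(p_len)
    return "".join(rng.choice(list("abcdefghij"), size=length))

def dp_draw(counts, values, beta):
    """One draw from a DP over morphemes via its CRP predictive probabilities."""
    n = sum(counts)
    if n == 0 or rng.random() < beta / (beta + n):       # open a new morpheme type
        values.append(sample_morpheme())
        counts.append(1)
        return values[-1]
    x = rng.choice(len(counts), p=np.array(counts) / n)  # reuse: "rich get richer"
    counts[x] += 1
    return values[x]

stem_counts, stems = [], []
suffix_counts, suffixes = [], []
beta = 0.5                       # beta < 1 -> skewed, sparse morpheme inventory

for _ in range(10):              # p(w = s + m) = p(s) * p(m): independent draws
    s = dp_draw(stem_counts, stems, beta)
    m = dp_draw(suffix_counts, suffixes, beta)
    print(s + "+" + m)
```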
References
1. Frigyik, Bela A. et al. “Introduction to the Dirichlet Distribution and Related Processes.”
(2010).
2. http://phyletica.org/dirichlet-process/
3. Can, Burcu and Suresh Manandhar. “Probabilistic Hierarchical Clustering of
Morphological Paradigms.” EACL (2012).
THANK YOU !