Dirichlet Processes and Applications
Saurav Jha
Machine Learning Engineer
Copyright © 2018 FactSet Research Systems Inc. All rights reserved. Confidential: Do not forward.
Table of Contents
1. Probability 101: Mass & Density Functions
2. Probability 102: Simplex and its geometrical meaning
3. Dirichlet Distribution
4. Dirichlet Process
5. A demo
6. An application
Probability 101
Probability Mass Function (PMF) vs Density Function (PDF)
• PDF = a density whose integral over a range gives the probability that a continuous random variable falls in that range
• PMF = the probability that a discrete random variable is exactly equal to some value
• In the continuous setting:
∫ab f(x)dx = prob. that the outcome is between a and b
i.e., the units of f(x) = probability per unit length (dx)
= how dense the probability is per unit length near x
• In the discrete setting:
f(x) = Pr(X = x)
i.e., the units of f(x) = plain probability
= the mass that X places on the point x
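A quick sanity check of this distinction in Python (a minimal sketch; the normal and binomial distributions are my own illustrative choices):

```python
import numpy as np
from scipy import stats, integrate

# Continuous: the PDF value is not a probability; its integral over [a, b] is.
a, b = -1.0, 1.0
prob_cont, _ = integrate.quad(stats.norm.pdf, a, b)
print(f"P({a} <= X <= {b}) = {prob_cont:.4f}")  # ~0.6827 for a standard normal

# Discrete: the PMF value is itself a probability.
prob_disc = stats.binom.pmf(k=3, n=10, p=0.5)
print(f"P(X = 3) = {prob_disc:.4f}")
```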
Probability Simplex
• The set of all PMFs on the entire sample space:
S = { x ∈ Rn : xi ≥ 0, ∑i=1..n xi = 1 }
Probability 102: K-Simplex – geometrical meaning
• A k-simplex is a k-dimensional polytope (a geometric object with flat sides) formed from the convex hull of its k+1 vertices.
• Let u0, u1, …, uk ∈ Rk be (k+1) points; the simplex they determine is the set of points
C = { θ0u0 + … + θkuk | ∑i=0..k θi = 1 and θi ≥ 0 ∀ i }
 View u0, u1, u2 as a disjoint set of possible events whose probabilities sum to 1,
i.e. p0 + p1 + p2 = 1, where 0 ≤ pi ≤ 1.
 Consider the three probabilities as a point (p0, p1, p2) in Euclidean space.
 The resulting shape outlines the perimeter of a triangle.
 While the set C lies in a k-dimensional space (here k = 3), the object it forms is (k−1)-dimensional.
 Each point p in the simplex is a pmf in its own right, i.e. each component of p lies in [0, 1] and all its components sum to 1.
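This (k−1)-dimensional picture is easy to verify numerically; a small numpy sketch (my own illustration, which peeks ahead to the Dirichlet distribution on the next slide, since Dir(1, 1, 1) is uniform over the 2-simplex):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=1000)  # points of the 2-simplex in R^3

assert np.all(pts >= 0)                    # theta_i >= 0
assert np.allclose(pts.sum(axis=1), 1.0)   # sum of theta_i = 1

# The simplex is (k-1)-dimensional: centred points span a rank-2 subspace of R^3.
print(np.linalg.matrix_rank(pts - pts.mean(axis=0)))  # 2
```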
Dirichlet distribution
• Let Q = [Q1, Q2, …, Qk] be a random pmf, i.e. Qi ≥ 0 for i = 1, 2, …, k and ∑i=1..k Qi = 1.
• Let α = [α1, α2, …, αk], with αi > 0 for each i, and let α0 = ∑i=1..k αi.
• Then Q has a Dirichlet distribution with parameter α, denoted Q ∼ Dir(α), with density:
P(Q1 = q1, …, Qk = qk) = Γ(α0) / (Γ(α1) ⋯ Γ(αk)) · q1^(α1−1) ⋯ qk^(αk−1)
• A probability distribution whose samples lie in the (k−1)-dimensional probability simplex ∆k, i.e., a distribution over pmfs of length k.
• It ranges over the possible parameter vectors of a multinomial distribution and is the conjugate prior of the multinomial distribution.
“A distribution of distributions”
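A minimal numpy sketch of both facts (the α vector is an illustrative assumption): each draw from Dir(α) is itself a pmf, and the empirical mean of many draws approaches the Dirichlet mean αi/α0:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = np.array([4.0, 4.0, 2.0])              # illustrative parameter vector
samples = rng.dirichlet(alpha, size=100_000)   # each row is a pmf of length k = 3

assert np.allclose(samples.sum(axis=1), 1.0)   # every sample lies on the simplex

print(samples.mean(axis=0))   # ~ [0.4, 0.4, 0.2]
print(alpha / alpha.sum())    # Dirichlet mean alpha_i / alpha_0 = [0.4, 0.4, 0.2]
```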
Dirichlet distribution – an example use-case
• X = a vector of counts from n draws of a random variable with 3 possible outcomes, e.g. X = [4, 4, 2] with n = 10.
• PMF of X = the multinomial distribution: P(X) = n! / (n1! n2! n3!) · p1^n1 · p2^n2 · p3^n3
Q) What if p1, p2, p3 are unknown, i.e. there is no certainty over what the distribution of the categorical variable is?
 Solution: use a Dirichlet distribution with parameters α1, α2, α3 to first draw a pmf p ∼ Dir(α), and then draw X ∼ Multinomial(n, p).
• This introduces one level of indirection in the model for X: instead of stating which p generated X, the parameters α1, α2, α3 describe which probability distributions are likely, and samples X are then drawn according to the random p.
• Since sampling is directly from the probability k-simplex, the expected value of a draw from a k-dimensional Dirichlet equals the mean of the Dirichlet, E[Qi] = αi/α0.
• Adding the Dirichlet distribution introduces prior beliefs about which values of X are likely to occur, i.e., the random pmf has a Dirichlet distribution with parameter α. [1]
• Analogy 1: if a random pmf is a bag full of dice, then a sample from the Dirichlet is one specific die.
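A short sketch of this two-stage draw (the α values and n are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = [2.0, 2.0, 1.0]       # prior beliefs over the 3 outcomes (assumed values)
n = 10                        # number of draws per experiment

p = rng.dirichlet(alpha)      # step 1: pick a die        p ~ Dir(alpha)
x = rng.multinomial(n, p)     # step 2: roll it n times   X ~ Multinomial(n, p)
print(p, x)                   # e.g. [0.45 0.33 0.22] [5 3 2]
```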
Dirichlet Process
• In the dice analogy, each die must have a finite number of faces.
• Limitation of the Dirichlet distribution: it assumes a finite set of events.
 Dirichlet processes to the rescue! A Dirichlet process enables working with an infinite set of events, and hence modelling probability distributions over infinite sample spaces.
Analogy 2:
• Ask pedestrians on the street to choose their favourite colour out of {V, I, B, G, Y, O, R}.
• Based on the answers, model each person as a pmf over the 7 colours.
• Each person’s pmf is then a realization of a draw from a Dirichlet distribution over 7 colours.
 What if the choices are no longer restricted to 7 colours?
• Modelling each individual’s pmf over an unbounded set of colours requires a distribution over distributions on an infinite sample space.
• One solution: a Dirichlet process.
Dirichlet Process – definition
• Assign elements A, B, C, … to an unknown number of categories following the algorithm below (simulated in the sketch after this list):
 Input: H (a probability distribution, a.k.a. the base distribution) and α (a positive real number, a.k.a. the concentration parameter).
 Draw the first element from H.
 For n > 1:
 Assign the nth element to a new category (with its value drawn from H) with probability α / (α + n − 1).
 Assign the nth element to a pre-existing category x with probability nx / (α + n − 1), where nx = the number of elements already assigned to x.
• Used when modelling data that tends to repeat previous values in a “rich get richer” fashion.
• This sequential construction is also known as the Chinese restaurant process.
• Applications: morphological segmentation in NLP; modelling mutation rates of genes in evolutionary biology.
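A minimal simulation of the assignment rule above (a sketch; it tracks only category labels, with the value of each new category understood to come from H):

```python
import numpy as np

def crp_assignments(n, alpha, seed=0):
    """Assign n elements to categories via the rule above (Chinese restaurant process)."""
    rng = np.random.default_rng(seed)
    counts = []                                # counts[x] = n_x
    assignments = []
    for i in range(n):                         # the (i+1)-th element arrives
        p_new = alpha / (alpha + i)            # alpha / (alpha + n - 1); = 1 for the first element
        if rng.random() < p_new:
            counts.append(1)                   # open a new category (value ~ H)
            assignments.append(len(counts) - 1)
        else:
            probs = np.array(counts) / sum(counts)   # n_x / (n - 1), given "not new"
            x = rng.choice(len(counts), p=probs)
            counts[x] += 1                     # "rich get richer"
            assignments.append(x)
    return assignments, counts

print(crp_assignments(20, alpha=1.0))   # larger alpha -> more categories
```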
A demo [2]
An application: Learning of hierarchical morphology paradigms [3]
• A paradigm = a pair (StemList, SuffixList) where each Stem+Suffix string is a valid word.
• It can be modelled as a hierarchical structure.
• Morphologically similar words sit close to each other in the structure.
• Similarity metric = number of common morphemes.
• Notation: w = word, s = stem, m = suffix.
• Assumption: stems and suffixes are generated independently from each other.
• Probability of a word: p(w = s + m) = p(s) · p(m)
An application: Learning of hierarchical morphology paradigms [3]
1. Two Dirichlet processes generate stems and suffixes independently:
• βs = the concentration parameter, which governs the number of stem types generated by the DP.
• If β is small, new stem/suffix types are less likely to be generated.
• If β is large, new stem/suffix types are more likely, yielding a more uniform distribution.
• The authors choose β < 1, i.e. a more skewed distribution with sparse stems & suffixes.
• P = the base distribution, specifying a prior probability distribution over morpheme lengths.
• The joint probability of the stems can then be calculated from the DP’s sequential (Chinese restaurant process) predictive probabilities given in the Dirichlet Process definition above.
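A toy sketch of this generative story (my own illustration: the geometric length prior, the alphabet, and β = 0.5 are assumptions, not the paper's exact choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_morpheme(p_len=0.4):
    """Toy base distribution P: a geometric prior over morpheme length (assumed)."""
    length = rng.geometric(p_len)
    return "".join(rng.choice(list("abcdefghij"), size=length))

def dp_draw(counts, values, beta):
    """One draw from a DP over morphemes via its CRP predictive probabilities."""
    n = sum(counts)
    if n == 0 or rng.random() < beta / (beta + n):       # open a new morpheme type
        values.append(sample_morpheme())
        counts.append(1)
        return values[-1]
    x = rng.choice(len(counts), p=np.array(counts) / n)  # reuse: "rich get richer"
    counts[x] += 1
    return values[x]

stem_counts, stems = [], []
suffix_counts, suffixes = [], []
beta = 0.5                       # beta < 1 -> skewed, sparse morpheme inventory

for _ in range(10):              # p(w = s + m) = p(s) * p(m): independent draws
    s = dp_draw(stem_counts, stems, beta)
    m = dp_draw(suffix_counts, suffixes, beta)
    print(s + "+" + m)
```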
References
1. Frigyik, Bela A. et al. “Introduction to the Dirichlet Distribution and Related Processes.”
(2010).
2. http://phyletica.org/dirichlet-process/
3. Can, Burcu and Suresh Manandhar. “Probabilistic Hierarchical Clustering of
Morphological Paradigms.” EACL (2012).
THANK YOU !