Admixture of Poisson MRFs: A New Topic Model with Word Dependencies

David Inouye*, Pradeep Ravikumar, Inderjit Dhillon
April 30, 2015
* Presenter

Analyzing Large Collections of Documents
Pipeline: (1) Collection of Documents, (2) Digital Representation, (3) Model Computation, (4) Summary
Collection of Documents, examples:
1. Research papers
2. News articles
3. Twitter posts
Digital Representation: Bag of Words Matrix (rows: Doc 1, Doc 2, Doc 3, Doc 4, ...; columns: “networks”, “learning”, “programming”, ...)
- Removes order and syntax information
- Unrealistic but powerful
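As a minimal sketch of the bag-of-words representation, the snippet below builds a small count matrix. The three toy documents and the whitespace tokenizer are illustrative assumptions, not data or code from the talk.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of text documents."""
    # Hypothetical tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary fixes the column order of the matrix.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

# Toy documents (made up for illustration).
docs = [
    "learning networks learning",
    "networks learning networks",
    "programming networks",
]
vocab, X = bag_of_words(docs)
print(vocab)
for row in X:
    print(row)
```

Because only counts are kept, reordering the words of a document leaves its row unchanged, which is the "removes order and syntax information" point above.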

Research Paper Example - Top Words
Titles of Research Papers:
1. Machine Learning (ICML, NIPS)
2. Communication Networks (INFOCOM)
3. Programming Languages (PLDI, CAV, POPL, OOPSLA)
[Figure: the document pipeline with Top Words (Frequency) as the model computation step]

Research Paper Example - Topic Modeling
Titles of Research Papers:
1. Machine Learning (ICML, NIPS)
2. Communication Networks (INFOCOM)
3. Programming Languages (PLDI, CAV, POPL, OOPSLA)
[Figure: the document pipeline with Topic Modeling as the model computation step]

Applications for Topic Modeling
Applications
1. Summarize/Visualize [Hall et al. 2008]
2. Word sense disambiguation [Boyd-Graber et al. 2007]
3. Multi-lingual understanding [Mimno et al. 2009]
4. Information retrieval [Wei & Croft 2006]
Different domains
1. Genetics [Pritchard et al. 2000 (14,000 citations)]
2. Computer vision [Li et al. 2010]
3. Social networks [Airoldi et al. 2008]
4. Social science surveys [Roberts et al. 2014]
5. Social E-commerce [Hu et al. 2014]

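Brief History - Latent Semantic Analysis (LSA)
1. Digital Representation: $n \times p$ matrix (rows: Doc 1, Doc 2, Doc 3, Doc 4, ...; columns: “networks”, “learning”, “programming”, ...)
2. Singular Value Decomposition: $U \Sigma V^T$, with $U$ of size $n \times k$, $\Sigma$ of size $k \times k$, and $V^T$ of size $k \times p$
3. Low Dimensional Document Representation
4. “Latent Topic”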
Positive and negative values difficult to interpret
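A small sketch of the LSA step, assuming NumPy and a made-up 4 x 3 count matrix. Taking $U_k \Sigma_k$ as the document representation and the rows of $V_k^T$ as the latent topics is one common convention and is an assumption here; the talk does not spell it out.

```python
import numpy as np

# Toy n x p count matrix (n = 4 documents, p = 3 words); values are made up.
X = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0],
])
k = 2  # number of latent dimensions kept

# Thin SVD, then truncate to the top-k singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

doc_repr = U_k @ S_k           # n x k low-dimensional documents (assumed convention)
X_approx = U_k @ S_k @ Vt_k    # rank-k approximation of X

print("document representation shape:", doc_repr.shape)
print("latent topic shape:", Vt_k.shape)
print("second latent topic:", Vt_k[1])
```

For this matrix the second latent topic contains both positive and negative word weights, which illustrates why such dimensions are hard to read as topics.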

Brief History - Probabilistic Topic Models
1. Digital Representation: $n \times p$ matrix (rows: Doc 1, Doc 2, Doc 3, Doc 4, ...; columns: “networks”, “learning”, “programming”, ...)
2. Probabilistic Topic Models: the factors below are related through a probability model
3. Topic Weights per Document ($n \times k$)
4. “Topics” (Word weights) ($k \times p$)
Probability vectors are much easier to interpret
LDA - Added Bayesian priors for regularization

Comparison of 2D Projections
- SVD dimensions are difficult to interpret
- APM has smooth distribution compared to LDA
[Figure: 2D projections of the documents under SVD, LDA, and APM; legend entries: comm-net.6978, mach-learn.8925, prog-lang.6618]

Brief History - Extensions/Variants
1. Add time information [Blei & Lafferty 2006]
2. Add author information [Rosen-Zvi et al. 2004]
3. Add document category information [McAuliffe & Blei 2008]
4. Automatically discover number of topics [Teh et al. 2006]
5. Model correlation between topics [Blei & Lafferty 2006]
6. . . .
7. . . .
Previous models - topics only have weights for single words
Our model - topics have weights for pairs of words

Interpreting Topics
[Figures: topic visualizations for LDA with 3 topics, LDA with 6 topics, LDA with 30 topics, and APM with 3 topics]

Overview of APM
Admixture of Poisson MRFs (APM)
[Diagram: relates APM to Mixture, Admixture, Multinomial, LDA, Gaussian MRF, and Poisson MRF]

Mixtures
- Multiple sub-populations
- The sub-populations are usually unknown a priori
- Each individual from the population comes from exactly one sub-population
Figure source: Kalai, Moitra, and Valiant. Disentangling Gaussians. Communications of the ACM. 2012.

Admixtures
Mixtures - Draws from single
component distribution. (Top)
Admixtures - Draws from a
distribution whose parameters are a
convex combination of component
parameters. (Bottom)
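The difference between the two definitions can be sketched with a toy sampler. This is only an illustration under stated assumptions: the two components are vectors of independent Poisson means (hypothetical values), and the admixture weights come from a symmetric Dirichlet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical component parameter vectors (Poisson means for 2 words).
components = np.array([[8.0, 1.0],
                       [1.0, 8.0]])

def sample_mixture(n):
    """Each draw uses the parameters of exactly one component."""
    z = rng.integers(0, len(components), size=n)  # component label per draw
    params = components[z]
    return params, rng.poisson(params)

def sample_admixture(n, alpha=1.0):
    """Each draw uses a convex combination of the component parameters."""
    w = rng.dirichlet(alpha * np.ones(len(components)), size=n)  # weights on the simplex
    params = w @ components
    return params, rng.poisson(params)

mix_params, mix_x = sample_mixture(5)
adm_params, adm_x = sample_admixture(5)
print("mixture parameters:\n", mix_params)
print("admixture parameters:\n", adm_params)
```

The mixture parameters are always one of the two rows of `components`, while the admixture parameters fall anywhere on the line segment between them.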
[Figure: two panels with axes x1 and x2. Top (mixture): mixture components and "documents". Bottom (admixture): dense and sparse "topics" and "documents".]

Gaussian MRFs
Allows for dependencies between variables
What if the data dimension is large?
If dimension is 1000, $1000^2/2 = 500{,}000$ parameters
Assume some conditional independence between variables.

1. Each conditional ("slice") of a PMRF is 1-D Poisson.
2. Distinct from Gaussian MRF
3. Positive dependencies can model word co-occurrence.
[Figure: three panels (Independent PMRF, Positive Dependency PMRF, Negative Dependency PMRF), each plotting Count of Word 1 against Count of Word 2 on axes from 0 to 8]

Poisson MRFs [Yang et al., 2012]
P(A | B, C), P(B | A, C), P(C | A, B): is there a joint P(A, B, C)?
If we assume the node conditional distributions are Poisson, does there exist a joint MRF distribution that has these conditionals?
Poisson MRF joint distribution:
$$\Pr_{\text{PMRF}}(x \mid \theta, \Theta) \propto \exp\Big\{ \theta^T x + x^T \Theta x - \sum_{s=1}^{p} \ln(x_s!) \Big\}.$$
Node conditionals are 1-D Poissons:
$$\Pr(x_s \mid x_{-s}, \theta_s, \Theta_s) \propto \exp\{ \underbrace{(\theta_s + x^T \Theta_s)}_{\eta_s} \, x_s - \ln(x_s!) \}.$$
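A minimal sketch of the node conditional above, assuming NumPy, a hypothetical 3-word model, and a $\Theta$ with zero diagonal so that $x_s$ does not enter its own $\eta_s$. It checks numerically that normalizing $\exp\{\eta_s x_s - \ln(x_s!)\}$ gives a Poisson distribution with rate $\exp(\eta_s)$.

```python
import math
import numpy as np

def node_conditional_rate(x, theta, Theta, s):
    """Poisson rate exp(eta_s) with eta_s = theta_s + x^T Theta_s."""
    eta_s = theta[s] + x @ Theta[:, s]
    return math.exp(eta_s)

def node_conditional_pmf(x, theta, Theta, s, max_count=100):
    """Normalize exp{eta_s * x_s - ln(x_s!)} over x_s = 0..max_count."""
    eta_s = theta[s] + x @ Theta[:, s]
    counts = np.arange(max_count + 1)
    # lgamma(c + 1) equals ln(c!)
    log_unnorm = np.array([eta_s * c - math.lgamma(c + 1) for c in counts])
    unnorm = np.exp(log_unnorm - log_unnorm.max())
    return unnorm / unnorm.sum()

# Hypothetical parameters: word 0 and word 1 depend positively,
# word 1 and word 2 depend negatively. Theta has a zero diagonal.
theta = np.array([0.5, 0.2, -0.1])
Theta = np.array([[0.0, 0.1, 0.0],
                  [0.1, 0.0, -0.2],
                  [0.0, -0.2, 0.0]])
x = np.array([2, 1, 3])

rate = node_conditional_rate(x, theta, Theta, s=1)
pmf = node_conditional_pmf(x, theta, Theta, s=1)
print("conditional rate:", rate)
print("mean of normalized conditional:", (pmf * np.arange(len(pmf))).sum())
```

The truncation at `max_count` is an approximation that is only reasonable when the rate is small relative to the cutoff, as it is in this toy setting.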

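Admixture of Poisson MRFs (APM) [Inouye et al. 2014]
APM replaces standard Multinomial with Poisson MRF
$$\Pr_{\text{APM}}(x, w, \theta^{1 \ldots k}, \Theta^{1 \ldots k}) = \Pr_{\text{PMRF}}\Big(x \,\Big|\, \bar{\theta} = \sum_{j=1}^{k} w_j \theta^j,\; \bar{\Theta} = \sum_{j=1}^{k} w_j \Theta^j \Big) \, \Pr_{\text{Dir}}(w) \prod_{j=1}^{k} \Pr(\theta^j, \Theta^j)$$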
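The admixed parameters $\bar{\theta}$ and $\bar{\Theta}$ in the formula above can be sketched as follows. The function name `admix_parameters` and the toy values for two topics over three words are hypothetical; only the weighted sums come from the slide.

```python
import numpy as np

def admix_parameters(w, thetas, Thetas):
    """Return theta_bar = sum_j w_j theta^j and Theta_bar = sum_j w_j Theta^j."""
    w = np.asarray(w, dtype=float)
    if not (np.all(w >= 0) and np.isclose(w.sum(), 1.0)):
        raise ValueError("w must lie on the probability simplex")
    theta_bar = np.tensordot(w, thetas, axes=1)  # shape (p,)
    Theta_bar = np.tensordot(w, Thetas, axes=1)  # shape (p, p)
    return theta_bar, Theta_bar

k, p = 2, 3
# Hypothetical node parameters theta^1, theta^2 (one row per topic).
thetas = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
# Hypothetical symmetric edge parameters Theta^1, Theta^2.
Thetas = np.zeros((k, p, p))
Thetas[0, 0, 1] = Thetas[0, 1, 0] = 0.4
Thetas[1, 1, 2] = Thetas[1, 2, 1] = -0.2

theta_bar, Theta_bar = admix_parameters([0.25, 0.75], thetas, Thetas)
print(theta_bar)
print(Theta_bar)
```

A document with weights (0.25, 0.75) therefore inherits a scaled copy of each topic's word dependencies rather than those of a single topic.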

APM Algorithm
1. Optimization problem is not convex
2. Want to exploit parallel computing
3. Large optimization problem: APM has $O(kp^2)$ parameters versus $O(kp)$ for LDA
LDA(k = 5, p = 1000) ⇒ 5,000 parameters
APM(k = 5, p = 1000) ⇒ 5,000,000 parameters
APM(k = 5, NNZ(Θ) = 10 per word) ⇒ 50,000 free parameters
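The three counts above follow from simple arithmetic, sketched here with hypothetical helper names. The sparse count assumes p = 1000, as in the other two lines, and all three follow the slide's leading-order counting.

```python
def lda_params(k, p):
    """O(kp): one weight per topic and word."""
    return k * p

def apm_dense_params(k, p):
    """O(kp^2): one weight per topic and word pair."""
    return k * p * p

def apm_sparse_params(k, p, nnz_per_word):
    """Free parameters when each word keeps only nnz_per_word nonzero edges."""
    return k * p * nnz_per_word

print(lda_params(5, 1000))             # 5000
print(apm_dense_params(5, 1000))       # 5000000
print(apm_sparse_params(5, 1000, 10))  # 50000
```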

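Parallel Alternating Newton-like Algorithm
1. Split the algorithm into alternating convex problems
$$\arg\min_{\Phi_1, \Phi_2, \cdots, \Phi_p} \; -\frac{1}{n} \sum_{s=1}^{p} \Big[ \mathrm{tr}(\Psi_s \Phi_s) - \sum_{i=1}^{n} \exp(z_i^T \Phi_s w_i) \Big] + \sum_{s=1}^{p} \lambda \, \| \mathrm{vec}(\Phi_s)_{\setminus 1} \|_1$$
$$\arg\min_{w_1, w_2, \cdots, w_n \in \Delta^k} \; -\frac{1}{n} \sum_{i=1}^{n} \Big[ \psi_i^T w_i - \sum_{s=1}^{p} \exp(z_i^T \Phi_s w_i) \Big]$$
where
$$z_i = [1 \;\; x_i^T]^T, \quad \phi_s^j = [\theta_s^j \;\; (\Theta_s^j)^T]^T, \quad \Phi_s = [\phi_s^1 \;\; \phi_s^2 \; \cdots \; \phi_s^k], \quad \Psi_s = f(X, W), \quad \psi_i = f(X, \Phi_{1 \ldots k})$$
2. Subproblems in summation can be computed in parallel
3. Use fast Newton-like optimization method [Hsieh et al. 2014]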

Timing Results on Wikipedia Dataset (k = 5, λ = 0.5)
[Figure: bar chart "APM Training Time on Wikipedia Dataset"; y-axis: Time (hrs), 0 to 4; series: 1st Iter., Avg. Next 3 Iter.; bar labels in extraction order: 1, 3.1, 3.4, 0.6, 2.2, 2.2]
Dataset configurations:
1. n = 20,000, p = 5,000, # of Words = 50M
2. n = 100,000, p = 5,000, # of Words = 133M
3. n = 20,000, p = 10,000, # of Words = 57M
Algorithm scales approximately as $O(np^2)$

Parallel Speedup
[Figure: "Parallel Speedup on BNC Dataset"; x-axis: # of MATLAB Workers (0 to 20); y-axis: Speedup (0 to 20); series: Perfect Speedup, Actual Speedup]
BNC dataset has n = 4049 and p = 1646
Speedup could be O(min(n, p)) on distributed system

Evaluating APM: No Direct Evaluation of Edge Parameters
Previous metrics evaluate the similarity of word pairs
[Newman et al. 2010, Mimno et al. 2011, Aletras and Court 2013]
Averaged statistic for all $\binom{10}{2}$ pairs of top words computed
Attempted to correlate with human judgment
Unlike previous topic models, APM explicitly models
dependencies between words
How can we semantically evaluate the parameters for these
dependencies?

Evocation [Boyd-Graber et al. 2006]
Evocation denotes the idea of which words “evoke” or “bring
to mind” other words
Different types of evocation:
1. Rose - Flower (example)
2. Brave - Noble (kind)
3. Yell - Talk (manner)
4. Eggs - Bacon (co-occurrence)
5. Snore - Sleep (setting)
6. Wet - Desert (antonymy)
7. Work - Lazy (exclusivity)
8. Banana - Kiwi (likeness)
Distinct from word similarity or synonymy
Collected human scores for approximately 15% of word pairs

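Evocation Metric Illustration
[Figure: three tables with columns Word Pair, H, and M, linked by the steps "Rank by model weights M" and "Sum top-m human scores H"]
1. First table (all word pairs): w1 ↔ w2, w1 ↔ w3, w1 ↔ w4, w2 ↔ w3, w2 ↔ w4, w3 ↔ w4
2. Second table (ranked): w2 ↔ w3, w2 ↔ w4, w3 ↔ w4, w1 ↔ w3, w1 ↔ w2, w1 ↔ w4
3. Third table: w2 ↔ w4, w3 ↔ w4, w1 ↔ w3, w1 ↔ w4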
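The two steps named on the slide (rank by model weights M, then sum the top-m human scores H) can be sketched as below. The function name, the numeric weights, and the scores are all made up for illustration, and treating pairs without a human score as zero is an assumption rather than something the slide states.

```python
def evocation_score(model_weights, human_scores, m):
    """Sum the human scores of the m word pairs with the largest model weights."""
    # Rank word pairs by model weight M, largest first.
    ranked = sorted(model_weights, key=model_weights.get, reverse=True)
    # Assumption: pairs without a human score contribute zero.
    return sum(human_scores.get(pair, 0) for pair in ranked[:m])

# Hypothetical model weights M for all six pairs of four words.
M = {
    ("w1", "w2"): 0.10, ("w1", "w3"): 0.30, ("w1", "w4"): 0.05,
    ("w2", "w3"): 0.90, ("w2", "w4"): 0.70, ("w3", "w4"): 0.50,
}
# Hypothetical human scores H, available for only some pairs.
H = {("w2", "w3"): 60, ("w2", "w4"): 40, ("w3", "w4"): 10, ("w1", "w3"): 5}

print(evocation_score(M, H, m=3))
```

With these toy weights the ranking is w2 ↔ w3, w2 ↔ w4, w3 ↔ w4, w1 ↔ w3, w1 ↔ w2, w1 ↔ w4, so a model scores well only if its strongest word dependencies are pairs that people also rate as evocative.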

Models for Comparison
APM: Admixture of Poisson MRFs
APM-LowReg: Very small regularization parameter
APM-HeldOut: Chooses λ from held-out documents
CTM: Correlated Topic Models
HDP: Hierarchical Dirichlet Process (Non-parametric)
LDA: Latent Dirichlet Allocation
RSM: Replicated Softmax (Undirected Topic Model)
RND: Random baseline

Evocation Metric Results
[Figure: bar charts of Evocation (m = 50), y-axis from 0 to 1600, for k = 1, 3, 5, 10, 25, 50; panels: Evoc-1 (Avg. Evoc. of Topics) and Evoc-2 (Evoc. of Avg. Topic); models: APM, APM-LowReg, APM-HeldOut, CTM, HDP, LDA, RSM, RND]

Evocation Metric Top Word Pairs
Table: Top 20 Word Pairs for Best LDA
Human Score    Word Pair
100    run.v ↔ car.n
82     teach.v ↔ school.n
69     school.n ↔ class.n
63     van.n ↔ car.n
51     hour.n ↔ day.n
50     teach.v ↔ student.n
44     house.n ↔ government.n
44     week.n ↔ day.n
38     university.n ↔ institution.n
38     state.n ↔ government.n
38     woman.n ↔ man.n
38     give.v ↔ church.n
38     wife.n ↔ man.n
38     engine.n ↔ car.n
35     publish.v ↔ book.n
32     west.n ↔ state.n
32     year.n ↔ day.n
25     member.n ↔ give.v
25     dog.n ↔ animal.n
25     seat.n ↔ car.n
Table: Top 20 Word Pairs for Best APM
Human Score    Word Pair
100    telephone.n ↔ call.n
97     husband.n ↔ wife.n
82     residential.a ↔ home.n
76     politics.n ↔ political.a
75     steel.n ↔ iron.n
75     job.n ↔ employment.n
75     room.n ↔ bedroom.n
72     aunt.n ↔ uncle.n
72     printer.n ↔ print.v
60     love.v ↔ love.n
57     question.n ↔ answer.n
57     prison.n ↔ cell.n
51     mother.n ↔ baby.n
50     sun.n ↔ earth.n
50     west.n ↔ east.n
44     weekend.n ↔ sunday.n
41     wine.n ↔ drink.v
38     south.n ↔ north.n
38     morning.n ↔ afternoon.n
38     engine.n ↔ car.n

Current and Future Work
1. Visualization
2. Better inference of parameters
3. Extension to other domains
[Figure: three topic visualizations over words from the research paper titles, with words such as wireless, routing, packet; neural, bayesian, kernel; programming, verification, garbage]

Thanks for listening!
