A Note on Latent LSTM Allocation
Tomonari MASADA @ Nagasaki University
August 31, 2017
(I’m not fully confident about this note.)
1 ELBO
In latent LSTM allocation, the topic assignments $z_d = \{z_{d,1}, \ldots, z_{d,N_d}\}$ for each document $d$ are drawn from categorical distributions whose parameters are given by the softmax output of an LSTM. Based on the description of the generative process in the paper [1], we obtain the full joint distribution as follows:
\[
p(\{w_1, \ldots, w_D\}, \{z_1, \ldots, z_D\}, \phi ; \text{LSTM}, \beta) = p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi ; \text{LSTM}) \quad (1)
\]
We maximize the evidence $p(\{w_1, \ldots, w_D\}; \text{LSTM}, \beta)$, which is obtained as below.
\[
p(\{w_1, \ldots, w_D\}; \text{LSTM}, \beta)
= \sum_{\{z_1, \ldots, z_D\}} \int p(\{w_1, \ldots, w_D\}, \{z_1, \ldots, z_D\}, \phi ; \text{LSTM}, \beta) \, d\phi
= \sum_{\{z_1, \ldots, z_D\}} \int p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi ; \text{LSTM}) \, d\phi , \quad (2)
\]
where
\[
p(w_d, z_d \mid \phi ; \text{LSTM}) = p(w_d \mid z_d, \phi) \, p(z_d ; \text{LSTM})
= \prod_t p(w_{d,t} \mid z_{d,t}, \phi) \, p(z_{d,t} \mid z_{d,1:t-1} ; \text{LSTM}) \quad (3)
\]
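As a concrete illustration of this factorization, here is a minimal sketch of the generative process for a single document. It assumes an LSTM (a PyTorch `nn.LSTM` followed by a linear layer) whose softmax output over the $K$ topics gives $p(z_{d,t} \mid z_{d,1:t-1}; \text{LSTM})$, and topic-word distributions $\phi$ drawn from a symmetric Dirichlet; all module and variable names are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

K, V, H = 50, 10000, 128           # number of topics, vocabulary size, LSTM hidden size (illustrative)
lstm = torch.nn.LSTM(input_size=K, hidden_size=H, batch_first=True)
out_layer = torch.nn.Linear(H, K)  # maps the LSTM state to topic logits
# topic-word distributions phi_k ~ Dirichlet(0.1, ..., 0.1), shape (K, V)
phi = torch.distributions.Dirichlet(torch.full((V,), 0.1)).sample((K,))

def generate_document(n_tokens):
    """Sample (w_d, z_d) for one document following Eq. (3)."""
    z_prev = torch.zeros(1, 1, K)       # dummy input at t = 1 (no previous topic yet)
    state = None
    words, topics = [], []
    for _ in range(n_tokens):
        h, state = lstm(z_prev, state)                        # condition on z_{d,1:t-1}
        theta = F.softmax(out_layer(h[0, -1]), dim=-1)        # p(z_{d,t} = k | z_{d,1:t-1}; LSTM)
        z = torch.multinomial(theta, 1).item()                # draw the topic z_{d,t}
        w = torch.multinomial(phi[z], 1).item()               # draw the word w_{d,t} from phi_{z_{d,t}}
        topics.append(z); words.append(w)
        z_prev = F.one_hot(torch.tensor([[z]]), K).float()    # feed z_{d,t} back as the next input
    return words, topics
```

Feeding the one-hot encoding of the previous topic assignment back into the LSTM is one simple design choice for conditioning on $z_{d,1:t-1}$.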
Jensen’s inequality gives the following lower bound on the log evidence:
\[
\log p(\{w_1, \ldots, w_D\}; \text{LSTM}, \beta)
= \log \sum_{Z} \int p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi ; \text{LSTM}) \, d\phi
= \log \sum_{Z} \int q(Z, \phi) \, \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi ; \text{LSTM})}{q(Z, \phi)} \, d\phi
\]
\[
\geq \sum_{Z} \int q(Z, \phi) \log \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi ; \text{LSTM})}{q(Z, \phi)} \, d\phi \equiv \mathcal{L} \quad (4)
\]
This lower bound is the evidence lower bound (ELBO), denoted by $\mathcal{L}$.
We assume that the variational posterior $q(Z, \phi)$ factorizes as $\prod_k q(\phi_k) \times \prod_d q(z_d)$, where each $q(\phi_k)$ is a Dirichlet distribution with parameters $\xi_k = \{\xi_{k,1}, \ldots, \xi_{k,V}\}$.
Then the ELBO $\mathcal{L}$ can be rewritten as below.
\[
\mathcal{L} = \int q(\phi) \log p(\phi; \beta) \, d\phi
+ \sum_d \sum_{z_d} q(z_d) \log p(z_d ; \text{LSTM})
+ \sum_d \sum_{z_d} \int q(z_d) \, q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi
\]
\[
- \sum_d \sum_{z_d} q(z_d) \log q(z_d)
- \int q(\phi) \log q(\phi) \, d\phi \quad (5)
\]
Further, we assume that $q(z_d)$ factorizes as $\prod_t q(z_{d,t})$, where the $q(z_{d,t})$ are categorical distributions satisfying $\sum_{k=1}^{K} q(z_{d,t} = k) = 1$. We let $\gamma_{d,t,k}$ denote $q(z_{d,t} = k)$.
The second term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
\[
\sum_{z_d} q(z_d) \log p(z_d ; \text{LSTM})
= \sum_{z_d} \prod_t q(z_{d,t}) \sum_t \log p(z_{d,t} \mid z_{d,1:t-1} ; \text{LSTM})
\]
\[
= \sum_{z_d} \prod_t q(z_{d,t}) \Big[ \log p(z_{d,1} ; \text{LSTM}) + \log p(z_{d,2} \mid z_{d,1} ; \text{LSTM}) + \log p(z_{d,3} \mid z_{d,1}, z_{d,2} ; \text{LSTM})
+ \cdots + \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d - 1} ; \text{LSTM}) \Big]
\]
\[
= \sum_{z_{d,1}=1}^{K} q(z_{d,1}) \log p(z_{d,1} ; \text{LSTM})
+ \sum_{z_{d,1}=1}^{K} \sum_{z_{d,2}=1}^{K} q(z_{d,1}) \, q(z_{d,2}) \log p(z_{d,2} \mid z_{d,1} ; \text{LSTM})
\]
\[
+ \cdots + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d - 1}=1}^{K} q(z_{d,1}) \cdots q(z_{d,N_d - 1}) \log p(z_{d,N_d - 1} \mid z_{d,1}, \ldots, z_{d,N_d - 2} ; \text{LSTM})
\]
\[
+ \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d}=1}^{K} q(z_{d,1}) \cdots q(z_{d,N_d}) \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d - 1} ; \text{LSTM}) \quad (6)
\]
The evaluation of Eq. (6) is intractable. However, for each $t$, the $z_{d,1:t-1}$ in $p(z_{d,t} \mid z_{d,1:t-1} ; \text{LSTM})$ can be regarded as free variables whose values are set by some procedure that has nothing to do with the generative model. We obtain the values of the $z_{d,1:t-1}$ by an LSTM forward pass and denote them by $\hat{z}_{d,1:t-1}$. Then we can simplify Eq. (6) as follows:
\[
\sum_{z_d} q(z_d) \log p(z_d ; \text{LSTM})
\approx \sum_{t=1}^{N_d} \sum_{z_{d,t}=1}^{K} q(z_{d,t}) \log p(z_{d,t} \mid \hat{z}_{d,1:t-1} ; \text{LSTM})
= \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1} ; \text{LSTM}) \quad (7)
\]
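Under this approximation, the second term of the ELBO requires only the LSTM softmax outputs along the fixed sequence $\hat{z}_{d,1:t-1}$ and the variational probabilities $\gamma_{d,t,k}$. A minimal sketch, reusing the hypothetical `lstm`, `out_layer`, and $K$ from the earlier snippet:

```python
def elbo_second_term(z_hat, gamma_d):
    """Eq. (7): sum_t sum_k gamma_{d,t,k} * log p(z_{d,t} = k | z_hat_{d,1:t-1}; LSTM).

    z_hat   : list of the N_d fixed topic assignments obtained by an LSTM forward pass
    gamma_d : tensor of shape (N_d, K) holding the variational probabilities gamma_{d,t,k}
    """
    n = len(z_hat)
    # the input at step t is the one-hot encoding of z_hat_{d,t-1}; step 1 gets a dummy zero vector
    inputs = torch.zeros(1, n, K)
    if n > 1:
        inputs[0, 1:] = F.one_hot(torch.tensor(z_hat[:-1]), K).float()
    h, _ = lstm(inputs)                                      # shape (1, N_d, H)
    log_theta = F.log_softmax(out_layer(h[0]), dim=-1)       # log p(z_{d,t} = k | z_hat_{d,1:t-1}; LSTM)
    return (gamma_d * log_theta).sum()
```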
The third term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
\[
\sum_d \sum_{z_d} \int q(z_d) \, q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi
= \sum_d \int q(\phi) \sum_{z_d} q(z_d) \sum_t \log \phi_{z_{d,t}, w_{d,t}} \, d\phi
= \int q(\phi) \sum_d \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log \phi_{k, w_{d,t}} \, d\phi
\]
\[
= \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \int q(\phi_k) \log \phi_{k, w_{d,t}} \, d\phi_k
= \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \Big\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big) \Big\} \quad (8)
\]
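Since $q(\phi_k)$ is Dirichlet, $\int q(\phi_k) \log \phi_{k,v} \, d\phi_k = \Psi(\xi_{k,v}) - \Psi(\sum_v \xi_{k,v})$, so the third term is a digamma lookup weighted by $\gamma$. A sketch under the same illustrative naming, where `xi` is the $K \times V$ matrix of Dirichlet parameters:

```python
def expected_log_phi(xi):
    """E_q[log phi_{k,v}] = Psi(xi_{k,v}) - Psi(sum_v xi_{k,v}) for the Dirichlet q(phi_k); xi has shape (K, V)."""
    return torch.special.digamma(xi) - torch.special.digamma(xi.sum(dim=1, keepdim=True))

def elbo_third_term(docs, gammas, xi):
    """Eq. (8): sum_d sum_t sum_k gamma_{d,t,k} * E_q[log phi_{k, w_{d,t}}].

    docs   : list of word-index lists, one per document
    gammas : list of (N_d, K) tensors gamma_{d,t,k}
    """
    e_log_phi = expected_log_phi(xi)                            # shape (K, V)
    total = torch.tensor(0.0)
    for w_d, gamma_d in zip(docs, gammas):
        total = total + (gamma_d * e_log_phi[:, w_d].T).sum()   # both factors have shape (N_d, K)
    return total
```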
The first term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
\[
\int q(\phi) \log p(\phi; \beta) \, d\phi = \sum_k \int q(\phi_k) \log p(\phi_k ; \beta) \, d\phi_k
= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + \sum_k \sum_v (\beta - 1) \int q(\phi_k) \log \phi_{k,v} \, d\phi_k
\]
\[
= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + (\beta - 1) \sum_k \sum_v \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\} \quad (9)
\]
The fourth term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
\[
\sum_d \sum_{z_d} q(z_d) \log q(z_d) = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log q(z_{d,t} = k) \quad (10)
\]
The last term of $\mathcal{L}$ can be rewritten as below.
\[
\int q(\phi) \log q(\phi) \, d\phi = \sum_k \int q(\phi_k) \log q(\phi_k) \, d\phi_k
= \sum_k \log \Gamma\Big( \sum_v \xi_{k,v} \Big) - \sum_k \sum_v \log \Gamma(\xi_{k,v})
+ \sum_k \sum_v (\xi_{k,v} - 1) \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\} \quad (11)
\]
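For completeness, the remaining terms in Eqs. (9)-(11) depend only on $\beta$, $\xi$, and $\gamma$, so they can be evaluated directly; a sketch, again with illustrative names:

```python
def elbo_remaining_terms(gammas, xi, beta):
    """Eq. (9) - Eq. (10) - Eq. (11): the Dirichlet prior term minus the two entropy-like terms."""
    K, V = xi.shape
    e_log_phi = expected_log_phi(xi)                          # Psi(xi_{k,v}) - Psi(sum_v xi_{k,v})
    # Eq. (9): E_q[log p(phi; beta)]
    prior = (K * torch.lgamma(torch.tensor(V * beta))
             - K * V * torch.lgamma(torch.tensor(beta))
             + (beta - 1.0) * e_log_phi.sum())
    # Eq. (10): sum_d sum_t sum_k gamma_{d,t,k} log gamma_{d,t,k}
    q_z_term = sum((g * torch.log(g + 1e-12)).sum() for g in gammas)
    # Eq. (11): E_q[log q(phi)]
    q_phi_term = (torch.lgamma(xi.sum(dim=1)).sum()
                  - torch.lgamma(xi).sum()
                  + ((xi - 1.0) * e_log_phi).sum())
    return prior - q_z_term - q_phi_term
```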
2 Inference
The partial derivative of $\mathcal{L}$ with respect to $\gamma_{d,t,k}$ is
\[
\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}}
= \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1} ; \text{LSTM})
+ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big)
- \log \gamma_{d,t,k} + \text{const.} \quad (12)
\]
By solving $\partial \mathcal{L} / \partial \gamma_{d,t,k} = 0$ subject to $\sum_k \gamma_{d,t,k} = 1$, we obtain
\[
\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1} ; \text{LSTM}) , \quad (13)
\]
where $\tilde{\phi}_{k, w_{d,t}} \equiv \exp(\Psi(\xi_{k, w_{d,t}})) / \exp(\Psi(\sum_v \xi_{k,v}))$, i.e., $\exp(\mathbb{E}_q[\log \phi_{k, w_{d,t}}])$. When $t = 1$, $\gamma_{d,1,k} \propto \tilde{\phi}_{k, w_{d,1}} \, p(z_{d,1} = k ; \text{LSTM})$. Therefore, $q(z_{d,1})$ does not depend on the $z_{d,t}$ for $t > 1$, and we can draw a sample from $q(z_{d,1})$ without seeing the $z_{d,t}$ for $t > 1$. When $t = 2$, $\gamma_{d,2,k} \propto \tilde{\phi}_{k, w_{d,2}} \, p(z_{d,2} = k \mid \hat{z}_{d,1} ; \text{LSTM})$. That is, $q(z_{d,2})$ depends only on $\hat{z}_{d,1}$. One possible way to determine $\hat{z}_{d,1}$ is to draw a sample from $q(z_{d,1})$, because this drawing can be performed without seeing the $z_{d,t}$ for $t > 1$. For each $t$ such that $t > 2$, a similar argument applies. However, this procedure for determining the $\hat{z}_{d,t}$ is made possible only by the assumption that leads to the approximation given in Eq. (7), because without that assumption we cannot obtain the simple update $\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1} ; \text{LSTM})$. Moreover, the assumption says nothing about how we should sample the $z_{d,t}$; for example, we may draw the $z_{d,t}$ simply from the softmax output of the LSTM at each $t$, without using $\phi$. In any case, the assumption that leads to the approximation in Eq. (7) provides no answer to the question of why we should use $\phi$ when sampling the $z_{d,t}$.
For $\xi_{k,v}$, we obtain the usual update $\xi_{k,v} = \beta + \sum_d \sum_{\{t : w_{d,t} = v\}} \gamma_{d,t,k}$.
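Putting Eq. (13) and the $\xi_{k,v}$ update together, one possible sweep over a document looks like the following sketch, where $\hat{z}_{d,t}$ is drawn from $q(z_{d,t})$ as soon as $\gamma_{d,t,\cdot}$ is available, as discussed above. This is only one of the possible choices mentioned in the text, and the code reuses the hypothetical `lstm`, `out_layer`, and `expected_log_phi` defined earlier.

```python
@torch.no_grad()
def update_document(w_d, xi, suffstats):
    """One sweep over document d: update gamma_{d,t,.} by Eq. (13), fix z_hat_{d,t} by sampling
    from q(z_{d,t}), and accumulate the statistics for xi_{k,v} = beta + sum_d sum_{t: w_{d,t}=v} gamma_{d,t,k}."""
    phi_tilde = torch.exp(expected_log_phi(xi))       # exp(Psi(xi_{k,v})) / exp(Psi(sum_v xi_{k,v})), shape (K, V)
    gamma_d = torch.zeros(len(w_d), K)
    z_hat, state, z_prev = [], None, torch.zeros(1, 1, K)
    for t, w in enumerate(w_d):
        h, state = lstm(z_prev, state)
        theta = F.softmax(out_layer(h[0, -1]), dim=-1)        # p(z_{d,t} = k | z_hat_{d,1:t-1}; LSTM)
        g = phi_tilde[:, w] * theta                           # Eq. (13), unnormalized
        gamma_d[t] = g / g.sum()
        suffstats[:, w] += gamma_d[t]                         # statistics for the xi update
        z = torch.multinomial(gamma_d[t], 1).item()           # one possible way to fix z_hat_{d,t}
        z_hat.append(z)
        z_prev = F.one_hot(torch.tensor([[z]]), K).float()
    return gamma_d, z_hat

# after sweeping over all documents:  xi = beta + suffstats
```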
Let $\theta_{d,t,k}$ denote $p(z_{d,t} = k \mid \hat{z}_{d,1:t-1} ; \text{LSTM})$, which is a softmax output of the LSTM. The partial derivative of $\mathcal{L}$ with respect to any LSTM parameter, computed over a mini-batch $B$ of documents, is
\[
\frac{\partial \mathcal{L}}{\partial \text{LSTM}}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \frac{\partial}{\partial \text{LSTM}} \log \theta_{d,t,k}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \frac{\gamma_{d,t,k}}{\theta_{d,t,k}} \frac{\partial \theta_{d,t,k}}{\partial \text{LSTM}} \quad (14)
\]
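Because the only term of $\mathcal{L}$ involving the LSTM parameters is Eq. (7), the gradient in Eq. (14) is exactly the gradient of a weighted negative log-likelihood with $\gamma_{d,t,\cdot}$ as soft targets, so automatic differentiation produces it directly. A minimal sketch over a mini-batch $B$, again assuming the hypothetical modules from above:

```python
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(out_layer.parameters()), lr=1e-3)

def lstm_step(batch):
    """One gradient step for Eq. (14); batch is a list of (z_hat, gamma_d) pairs for the documents d in B."""
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for z_hat, gamma_d in batch:
        n = len(z_hat)
        inputs = torch.zeros(1, n, K)
        if n > 1:
            inputs[0, 1:] = F.one_hot(torch.tensor(z_hat[:-1]), K).float()
        h, _ = lstm(inputs)
        log_theta = F.log_softmax(out_layer(h[0]), dim=-1)    # log theta_{d,t,k}
        loss = loss - (gamma_d * log_theta).sum()             # negative of the Eq. (7) term for document d
    loss.backward()   # gradients of -L w.r.t. the LSTM parameters, i.e. Eq. (14) up to sign
    optimizer.step()
```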
References
[1] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering
and non-linear dynamic modeling of sequence data. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of
Machine Learning Research, pages 3967–3976, International Convention Centre, Sydney, Australia,
06–11 Aug 2017. PMLR.