(Hierarchical) Topic Modeling
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
Software Engineering2016/12/27
Outline
• Background
• Some Concepts
• Topic Modeling
• Probabilistic Latent Semantic Indexing (PLSI)
• Latent Dirichlet Allocation (LDA)
• Hierarchical Topic Modeling
• Chinese Restaurant Process (CRP)
• Parameter Estimation
• Supplement & Reference
Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model
Scope: basics, not state-of-the-art
Background
• Information Overload
Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, etc.
We need: summarization, visualization, dimensionality reduction
Background
• Text Summarization
  - Document Summarization: What do these docs (or this doc) talk about?
  - Review Summarization: What do these consumers care about or complain about?
  - Short Text/Tweets Summarization: What are people discussing?
• Basic Requirements: automatic, applicable, explainable
Topic Modeling
Some Concepts
• General Concepts
  - Latent Semantic Analysis
  - Text Mining
  - Natural Language Processing
  - Computational Linguistics
  - Information Retrieval
  - Dimension Reduction
  - Topic Modeling
[Figure: diagram relating Information Retrieval, Computational Linguistics, Natural Language Processing, Text Mining, Data Mining, Machine Learning, Machine Translation, Dimension Reduction, and LSA/Topic Model]
Topic Modeling: to learn the latent topics from a corpus/document
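Topic Modeling
• Topic modeling: an example from a Chinese corpus (from my doctoral thesis)
Corpus:
Doc 1: Continue to implement a prudent monetary policy, keep it appropriately balanced with timely anticipatory adjustments and fine-tuning, coordinate well with the supply-side structure, and make combined use of quantity-based, price-based, and other monetary policy tools
Doc 2: In terms of personnel numbers, this reform goes far beyond the scale of the troop reduction; it is a structural reform and a key step in modernizing the military's organizational structure
Doc 3: The US dollar's status as the main international currency remains irreplaceable for the foreseeable future; the only way forward is to push global governance in a more balanced direction. IMF Managing Director Lagarde, in a recent speech at the University of Maryland in the United States, called for international governance reform to recognize the reality that emerging economies are increasingly important.
Doc 4: After being "weaned" from their parent universities, independent colleges may face growing pains in branding, enrollment, and other areas; but after national, provincial, and municipal implementation guidelines encouraging private capital to enter education were issued, some independent colleges decisively cut the "umbilical cord" to their parent universities and set out on their own.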
Topic Modeling
• After topic modeling
Corpus: Doc 1 to Doc 4 (the same four documents as in the example above)
Learned topics (Topic 1 to Topic 4), each shown by its top words and their probabilities:
  - policy (政策) 0.082, reform (改革) 0.063, …
  - finance (金融) 0.074, currency (货币) 0.051, …
  - college (学院) 0.077, education (教育) 0.071, …
  - military (军队) 0.083, organization (组织) 0.079, …
Topic Modeling
• A topic
  - A word cluster, i.e., a group of words
  - Not clustered randomly, but meaningfully (not semantically)
• Models
  - Parametric models
    - Latent Semantic Indexing (LSI)
    - PLSI; Latent Dirichlet Allocation (LDA)
  - Non-parametric models (Dirichlet Process)
    - (Nested) Chinese Restaurant Process
    - Indian Buffet Process
    - Pitman-Yor Process
Topic Modeling
pLSI Model
[Figure: graph over documents $d_1, \dots, d_M$, latent topics $z_1, \dots, z_K$, and words $w_1, \dots, w_N$, with links labeled $p(d)$, $p(z \mid d)$ (multinomial distribution), and $p(w \mid z)$ (multinomial distribution); annotated "One layer of a 'Deep Neural Network'"]
• Assumptions
  - Pairs $(d, w)$ are assumed to be generated independently
  - Conditioned on $z$, $w$ is generated independently of $d$
  - Words in a document are exchangeable
  - Documents are exchangeable
  - Latent topics $z$ are independent
The generative process:
$$p(d, w) = p(w \mid d)\,p(d) = p(d) \sum_{z \in Z} p(w, z \mid d) = p(d) \sum_{z \in Z} p(w \mid z)\,p(z \mid d)$$
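The pLSI decomposition $p(d, w) = p(d) \sum_{z} p(w \mid z)\,p(z \mid d)$ can be checked numerically. The following is a minimal sketch, not taken from the slides: the function name `plsi_joint`, the toy sizes, and the randomly drawn multinomial tables are all assumptions made for illustration.

```python
import numpy as np

def plsi_joint(p_d, p_z_given_d, p_w_given_z):
    """Return the (M, V) matrix of p(d, w) = p(d) * sum_z p(w|z) p(z|d)."""
    # (M, K) @ (K, V) sums over the latent topic z for every (d, w) pair.
    p_w_given_d = p_z_given_d @ p_w_given_z
    return p_d[:, None] * p_w_given_d

# Toy sizes (assumed): M = 3 documents, K = 2 topics, V = 5 vocabulary words.
rng = np.random.default_rng(0)
p_d = np.full(3, 1.0 / 3)                        # p(d)
p_z_given_d = rng.dirichlet(np.ones(2), size=3)  # row d is p(z | d)
p_w_given_z = rng.dirichlet(np.ones(5), size=2)  # row z is p(w | z)
p_dw = plsi_joint(p_d, p_z_given_d, p_w_given_z)
```

Because each row of the two conditional tables sums to one, the resulting matrix is a proper joint distribution: its entries sum to one and its row sums recover $p(d)$.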
Topic Modeling
• Latent Dirichlet Allocation (LDA)
  - David M. Blei, Andrew Y. Ng, Michael I. Jordan
  - Hierarchical Bayesian model; Bayesian pLSI
[Figure: plate diagram with nodes $\alpha$, $\theta$, $z$, $w$, $\beta$; the plates $N$ and $M$ give the number of repetitions]
Generative process of LDA
For each document $d = \{w_1, w_2, \dots, w_N\}$:
1. Choose $N \sim \text{Poisson}(\xi)$
2. Choose $\theta \sim \text{Dir}(\alpha)$
3. For each of the $N$ words $w_n$ in $d$:
   a) Choose a topic $z_n \sim \text{Multinomial}(\theta)$
   b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial distribution conditioned on $z_n$
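The generative process of LDA translates almost line by line into sampling code. This is a hedged sketch rather than the author's implementation: the function name `generate_lda_document` and the toy values for $\alpha$, $\beta$, and $\xi$ are assumptions, and $\beta$ is taken to be a $K \times V$ matrix whose row $k$ is the word distribution of topic $k$.

```python
import numpy as np

def generate_lda_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process.

    alpha: (K,) Dirichlet parameter; beta: (K, V) topic-word probabilities;
    xi: Poisson mean for the document length.
    """
    n_words = rng.poisson(xi)                  # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)               # theta ~ Dir(alpha)
    # z_n ~ Multinomial(theta), one topic index per word
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # w_n ~ p(w | z_n, beta), a multinomial conditioned on the chosen topic
    words = np.array(
        [rng.choice(beta.shape[1], p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words

# Toy setup (assumed): K = 3 topics, V = 6 vocabulary words.
rng = np.random.default_rng(0)
alpha = np.ones(3)
beta = rng.dirichlet(np.ones(6), size=3)
theta, topics, words = generate_lda_document(alpha, beta, xi=20, rng=rng)
```

Parameter estimation runs this process in reverse: given only the observed `words`, it recovers plausible values for `theta`, `topics`, and `beta`.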
Topic Modeling
• Parameter Estimation
  - Variational Inference (+EM) || Gibbs Sampling (MCMC)
Variational EM Algorithm
Aim: $(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta} \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$
Initialize $\alpha, \beta$
Repeat until convergence:
  E-Step: with $\alpha, \beta$ fixed, run variational inference to obtain a tractable lower bound that approximates the likelihood
  M-Step: maximize that approximate likelihood with respect to $\alpha, \beta$
The key takeaway: EM is quite important
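The likelihood approximation used by variational EM can be written out as a lower bound. The form below follows the notation of Blei et al. (2003) from the reference list; the variational distribution $q$ and its parameters $\gamma$, $\phi$ do not appear on the slide and are introduced here as assumptions.

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  \;\ge\;
  \mathbb{E}_{q}\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
  - \mathbb{E}_{q}\!\left[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\right]
  \;=\; L(\gamma, \phi;\, \alpha, \beta)
```

Read this way, the E-step raises $L$ by adjusting $(\gamma, \phi)$ for each document with $(\alpha, \beta)$ held fixed, and the M-step raises $L$ by adjusting $(\alpha, \beta)$ with the variational parameters held fixed.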
Hierarchical Topic Modeling
Topic modeling is not enough
Hierarchical Structure
Hierarchical Topic Modeling
Chinese Restaurant Process (Dirichlet Process)
• A restaurant has an infinite number of tables, and customers (words) enter it sequentially. The $i$-th customer ($\theta_i$) sits at a table ($\phi_k$) according to the probability
[Equation image: seating probability of the $i$-th customer]
• $\phi_k$: Clustering ≈ 1/2 of unsupervised learning → clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
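For reference, the standard Chinese Restaurant Process seats the $i$-th customer at an occupied table $k$ with probability $n_k / (i - 1 + \gamma)$, where $n_k$ is the number of customers already at that table, and at a new table with probability $\gamma / (i - 1 + \gamma)$. The sketch below simulates that rule. The symbol $\gamma$ for the concentration parameter and the function name `sample_crp` are assumptions, since the formula on the slide is not available in the text.

```python
import numpy as np

def sample_crp(n_customers, gamma, rng):
    """Seat customers one by one under a CRP with concentration gamma.

    Returns the table index of each customer and the final table sizes.
    """
    counts = []   # counts[k] = number of customers at table k
    seating = []
    for _ in range(n_customers):
        # Existing tables are weighted by their size, a new table by gamma.
        weights = np.array(counts + [gamma], dtype=float)
        table = int(rng.choice(len(weights), p=weights / weights.sum()))
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an occupied table
        seating.append(table)
    return seating, counts

seating, counts = sample_crp(50, gamma=1.0, rng=np.random.default_rng(0))
```

The number of tables is not fixed in advance: it grows with the number of customers, which is what makes the CRP a convenient prior for clustering with an unknown number of clusters.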
Hierarchical Topic Modeling
• The generative process (nested CRP)
  - Focus on the insight
1. Let $c_1$ be the root restaurant (only one table)
2. For each level $l \in \{2, \dots, L\}$: draw a table from restaurant $c_{l-1}$ using the CRP, and set $c_l$ to be the restaurant referred to by that table
3. Draw an $L$-dimensional topic proportion vector $\theta \sim \text{Dir}(\alpha)$
4. For each word $w_n$:
   - Draw $z \in \{1, \dots, L\} \sim \text{Mult}(\theta)$
   - Draw $w_n$ from the topic associated with restaurant $c_z$
[Figure: plate diagram with nodes $\alpha$, $z_{m,n}$, $w_{m,n}$, $c_1, c_2, \dots, c_L$, $\gamma$, $\beta$, plates $N$, $M$, $T$, and indices $k$, $m$; some node labels are unreadable in the extraction]
Analogy: Matryoshka (Russian nesting) doll
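The nested CRP steps can be sketched as code: a document walks from the root down $L$ levels, choosing a child at each level with a CRP, and then draws each word from one of the topics on that path. This is an illustrative sketch only. The dictionary-based tree, the function names, and the symmetric Dirichlet prior `eta` on each topic's word distribution are assumptions that the slide does not specify.

```python
import numpy as np

def crp_choice(counts, gamma, rng):
    """Pick an existing child (weight = its count) or a new one (weight = gamma)."""
    weights = np.array(list(counts) + [gamma], dtype=float)
    return int(rng.choice(len(weights), p=weights / weights.sum()))

def draw_ncrp_path(tree, depth, gamma, rng):
    """tree maps a path tuple to the customer counts of its child restaurants."""
    path = ()                                  # () is the root restaurant c_1
    for _ in range(depth - 1):                 # levels 2..L
        children = tree.setdefault(path, [])
        k = crp_choice(children, gamma, rng)
        if k == len(children):
            children.append(0)                 # a previously unused table
        children[k] += 1
        path = path + (k,)
    return path

def generate_document(tree, topics, depth, n_words, alpha, gamma, vocab_size, rng, eta=1.0):
    path = draw_ncrp_path(tree, depth, gamma, rng)
    nodes = [path[:level] for level in range(depth)]   # c_1, ..., c_L
    for node in nodes:
        if node not in topics:                 # lazily create the node's topic (assumed prior)
            topics[node] = rng.dirichlet(np.full(vocab_size, eta))
    theta = rng.dirichlet(np.full(depth, alpha))        # theta ~ Dir(alpha)
    levels = rng.choice(depth, size=n_words, p=theta)   # z ~ Mult(theta)
    words = [int(rng.choice(vocab_size, p=topics[nodes[z]])) for z in levels]
    return path, words

rng = np.random.default_rng(0)
tree, topics = {}, {}
docs = [generate_document(tree, topics, 3, 10, 1.0, 1.0, 8, rng) for _ in range(20)]
```

Documents that share a prefix of their path share the topics on that prefix, so the root topic is used by every document while deeper topics are used by progressively fewer.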
Hierarchical Topic Modeling
Examples
[Figure: learned topic hierarchy; the parent-child links between nodes are not recoverable from the text. Top words of each node:]
  - root topic: analysis, obtain, base, system, concentration
  - thermal, polymer, acid, property, diamine
  - activity, compound, acid, derivative, active
  - compound, ligand, group, investigate, synergistic
  - reaction, derivative, yield, synthesis, microwave
  - assay, food, quality, content, analysis
  - decoction, component, radix, quality, constituent
  - compound, activity, synthesize, salt, derivative
  - antioxidant, activity, extract, inhibitory, flavonoid
  - interaction, cation, metal, energy, solution
Supplement
Some supplements
• Probabilistic Graphical Model
  - Modeling a Bayesian network using plates and circles
• Generative Model & Discriminative Model: $p(\theta \mid X)$, where $X$ is the data
  - Generative Model: $p(\theta \mid X) \propto p(X \mid \theta)\,p(\theta)$
    - Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, …: Unsupervised Learning
  - Discriminative Model: $p(\theta \mid X)$
    - LR, KNN, SVM, Boosting, Decision Tree: Supervised Learning
These can also be represented by graphical models
Reference
• My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU, as a Ph.D. student)
  - 'Topic modeling (an introduction)'
  - 'Non-parametric Bayesian learning in discrete data'
  - 'The research of topic modeling in text mining'
  - 'Matrix factorization with user generated content'
  - etc.
• Website
  - You can download all of my slides
  - http://web.xidian.edu.cn/ysxu/teach.html
  - http://liu.cs.uic.edu/yueshenxu/
  - http://www.slideshare.net/obamaxys2011
  - https://www.researchgate.net/profile/Yueshen_Xu
Reference
• David Blei et al. Latent Dirichlet Allocation. JMLR, 2003
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David Blei et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014
Q&A


Editor's Notes

  • #6: Closing remark that leads into this section: packet-switching technology: datagram / virtual circuit