Phrase Mining and Topic
Modeling from Large Corpora
JIAWEI HAN
COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
JUNE 16, 2016
Why Phrase Mining?
 Unigrams vs. phrases
 Unigrams (single words) are ambiguous
Example: “United”: United States? United Airlines? United Parcel Service?
 Phrase: A natural, meaningful, unambiguous semantic unit
Example: “United States” vs. “United Airlines”
 Mining semantically meaningful phrases
 Transform text data from word granularity to phrase granularity
 Enhance the power and efficiency of manipulating unstructured data using database technology
Mining Phrases: Why Not Use NLP Methods?
 Phrase mining: Originated from the NLP community—“Chunking”
 Model it as a sequence labeling problem (B-NP, I-NP, O, …)
 Need annotation and training
 Annotate hundreds of documents as training data
 Train a supervised model based on part-of-speech features
 Recent trend:
 Use distributional features based on web n-grams (Bergsma et al., 2010)
 State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score
 Limitations
 High annotation cost, not scalable to a new language, a new domain or genre
 May not fit domain-specific, dynamic, emerging applications
 Scientific domains, query logs, or social media, e.g., Yelp, Twitter
Data Mining Approaches
 General principle: Fully exploit information redundancy and data-
driven criteria to determine phrase boundaries and salience
 Phrase Mining and Topic Modeling from Large Corpora
 Strategy 1: Simultaneously Inferring Phrases and Topics
 Strategy 2: Post Topic Modeling Phrase Construction
 Strategy 3: First Phrase Mining then Topic Modeling (ToPMine)
 Integration of Phrase Mining with Document Segmentation
Session 1. Frequent Pattern
Mining for Text Data
Frequent Pattern Mining for Text Data: Phrase
Mining and Topic Modeling
 Motivation: Unigrams (single words) can be difficult to interpret
 Ex.: The topic that represents the area of Machine Learning
Unigram topic: learning, reinforcement, support, machine, vector, selection, feature, random, …
versus
Phrase-based topic: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …
Various Strategies: Phrase-Based Topic Modeling
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Bigram topical model [Wallach’06], topical n-gram model [Wang, et al.’07],
phrase discovering topic model [Lindsey, et al.’12]
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Label topic [Mei et al.’07], TurboTopics [Blei & Lafferty’09], KERT [Danilevsky, et al.’14]
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose on
the bag-of-words model
 ToPMine [El-Kishky, et al.’15]
Session 2. Strategy 1:
Simultaneously Inferring
Phrases and Topics
Strategy 1: Simultaneously Inferring Phrases and Topics
 Bigram Topic Model [Wallach’06]
 Probabilistic generative model that conditions on previous word and topic when
drawing next word
 Topical N-Grams (TNG) [Wang, et al.’07]
 Probabilistic model that generates words in textual order
 Create n-grams by concatenating successive bigrams (a generalization of Bigram
Topic Model)
 Phrase-Discovering LDA (PDLDA) [Lindsey, et al.’12]
 Viewing each sentence as a time-series of words, PDLDA posits that the
generative parameter (topic) changes periodically
 Each word is drawn based on previous m words (context) and current phrase topic
 High model complexity: tends to overfit; high inference cost: slow
TNG: Experiments on Research Papers
TNG: Experiments on Research Papers (2)
Session 3. Strategy 2: Post
Topic Modeling Phrase
Construction
Strategy 2: Post Topic Modeling Phrase Construction
 TurboTopics [Blei & Lafferty’09] – Phrase construction as a post-processing step
to Latent Dirichlet Allocation
 Perform Latent Dirichlet Allocation on corpus to assign each token a topic label
 Merge adjacent unigrams with the same topic label by a distribution-free permutation test on an arbitrary-length back-off model
 End recursive merging when all significant adjacent unigrams have been
merged
 KERT [Danilevsky et al.’14] – Phrase construction as a post-processing step
to Latent Dirichlet Allocation
 Perform frequent pattern mining on each topic
 Perform phrase ranking based on four different criteria
Example of TurboTopics
 Perform LDA on corpus to assign each token a topic label
 E.g., … phase_11 transition_11 … game_153 theory_127 … (the subscript is each token’s topic label)
 Then merge adjacent unigrams with same topic label
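To make the merging step concrete, here is a minimal Python sketch, assuming tokenized documents whose tokens already carry LDA topic labels. TurboTopics itself accepts a merge via a distribution-free permutation test; the raw-count threshold below (min_count) is a simplified, hypothetical stand-in for that test.

```python
from collections import Counter

def merge_same_topic_bigrams(docs, min_count=5):
    """Sketch of TurboTopics-style merging. docs: one list per document
    of (word, topic_label) pairs produced by LDA. Adjacent tokens with
    the same topic label become merge candidates; merges that recur at
    least min_count times are applied."""
    candidates = Counter()
    for doc in docs:
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if t1 == t2:                       # same topic label
                candidates[(w1, w2, t1)] += 1
    merges = {c for c, n in candidates.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i][0], doc[i + 1][0], doc[i][1]) in merges:
                out.append((doc[i][0] + " " + doc[i + 1][0], doc[i][1]))
                i += 2                         # consume both unigrams
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs

# e.g., [("phase", 11), ("transition", 11)] -> [("phase transition", 11)]
```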
KERT: Topical Keyphrase Extraction & Ranking
[Danilevsky, et al. 2014]
Input: paper titles with unigram topic assignments (Topic 1 & Topic 2), e.g.:
knowledge discovery using least squares support vector machine classifiers
support vectors for reinforcement learning
a hybrid approach to feature selection
pseudo conditional random fields
automatic web page classification in a dynamic and hierarchical way
inverse time dependency in convex regularized learning
postprocessing decision trees to extract actionable knowledge
variance minimization least squares support vector machines
…
Topical keyphrase extraction & ranking yields:
learning
support vector machines
reinforcement learning
feature selection
conditional random fields
classification
decision trees
…
Framework of KERT
1. Run bag-of-words model inference and assign topic label to each token
2. Extract candidate keyphrases within each topic
3. Rank the keyphrases in each topic
 Popularity: ‘information retrieval’ vs. ‘cross-language information retrieval’
 Discriminativeness: only frequent in documents about topic t
 Concordance: ‘active learning’ vs. ‘learning classification’
 Completeness: ‘vector machine’ vs. ‘support vector machine’
(Step 2 uses frequent pattern mining. The ranking measures have a comparability property: phrases of mixed lengths can be compared directly; a ranking sketch follows below.)
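As an illustration of step 3, here is a hedged Python sketch of a ranking loop. The four inner functions are hypothetical stand-ins for KERT's actual measures (defined in Danilevsky et al.'14); only the overall shape of the computation (score each candidate on all four criteria and sort) is meant to be faithful.

```python
import math

def rank_topical_phrases(phrase_stats):
    """Illustrative ranking loop for step 3 of KERT. phrase_stats maps
    each candidate phrase to simple corpus statistics; the scoring
    functions below are stand-ins for KERT's actual criteria."""
    def popularity(s):           # frequent within the topic
        return math.log(1 + s["freq_in_topic"])
    def discriminativeness(s):   # frequent in this topic relative to the corpus
        return s["freq_in_topic"] / max(1, s["freq_total"])
    def concordance(s):          # constituents co-occur more than chance
        return max(s["pmi"], 1e-6)
    def completeness(s):         # penalize phrases subsumed by a frequent superphrase
        return max(1.0 - s["superphrase_freq"] / max(1, s["freq_in_topic"]), 1e-6)

    scored = [(p, popularity(s) * discriminativeness(s)
                  * concordance(s) * completeness(s))
              for p, s in phrase_stats.items()]
    return sorted(scored, key=lambda x: -x[1])
```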
KERT: Topical Phrases on Machine Learning (Top-Ranked Phrases by Mining Paper Titles in DBLP)
Comparison of phrase ranking methods on the topic that represents the area of Machine Learning:

| kpRel [Zhao et al.’11] | KERT (-popularity) | KERT (-discriminativeness) | KERT (-concordance) | KERT [Danilevsky et al.’14] |
| learning | effective | support vector machines | learning | learning |
| classification | text | feature selection | classification | support vector machines |
| selection | probabilistic | reinforcement learning | selection | reinforcement learning |
| models | identification | conditional random fields | feature | feature selection |
| algorithm | mapping | constraint satisfaction | decision | conditional random fields |
| features | task | decision trees | bayesian | classification |
| decision | planning | dimensionality reduction | trees | decision trees |
| … | … | … | … | … |
Session 4. Strategy 3: First
Phrase Mining then Topic
Modeling
Strategy 3: First Phrase Mining then Topic Modeling
 ToPMine [El-Kishky et al. VLDB’15]
 First phrase construction, then topic mining
 Contrast with KERT: topic modeling, then phrase mining
 The ToPMine Framework:
 Perform frequent contiguous pattern mining to extract candidate phrases and
their counts
 Perform agglomerative merging of adjacent unigrams as guided by a significance
score—This segments each document into a “bag-of-phrases”
 The newly formed bags-of-phrases are passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic
Why First Phrase Mining then Topic Modeling ?
 With Strategy 2, tokens in the same phrase may be assigned to different topics
 Ex. knowledge discovery using least squares support vector machine classifiers…
 Knowledge discovery and support vector machine should have coherent topic labels
 Solution: switch the order of phrase mining and topic model inference
 Techniques
 Phrase mining and document segmentation
 Topic model inference with phrase constraint
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Phrase Mining: Frequent Pattern Mining + Statistical Analysis
Example segmented sentences:
[Markov blanket] [feature selection] for [support vector machines]
[knowledge discovery] using [least squares] [support vector machine] [classifiers]
… [support vector] for [machine learning] …

Raw vs. rectified (“true”) frequency of candidate phrases:
| Phrase | Raw freq. | True freq. |
| [support vector machine] | 90 | 80 |
| [vector machine] | 95 | 0 |
| [support vector] | 100 | 20 |

Quality phrases are identified based on the significance score [Church et al.’91]:
α(P1, P2) ≈ (f(P1⊕P2) − μ0(P1, P2)) / √f(P1⊕P2)
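A minimal Python sketch of the significance-guided agglomerative merging follows, taking μ0 as the frequency expected if P1 and P2 were independent (one natural choice of null model; the exact null model and threshold in ToPMine may differ in detail).

```python
import math

def significance(f_merged, f1, f2, total_tokens):
    """alpha(P1, P2) ~ (f(P1+P2) - mu0(P1, P2)) / sqrt(f(P1+P2)),
    with mu0 the frequency expected under independence of P1 and P2."""
    if f_merged == 0:
        return float("-inf")
    mu0 = f1 * f2 / total_tokens
    return (f_merged - mu0) / math.sqrt(f_merged)

def agglomerative_segment(tokens, counts, total_tokens, threshold=5.0):
    """Greedily merge the adjacent pair with the highest significance
    until no pair clears the threshold; the result is the document's
    'bag of phrases'. counts maps token tuples to corpus frequencies."""
    phrases = [(t,) for t in tokens]
    while len(phrases) > 1:
        scores = [significance(counts.get(phrases[i] + phrases[i + 1], 0),
                               counts.get(phrases[i], 0),
                               counts.get(phrases[i + 1], 0),
                               total_tokens)
                  for i in range(len(phrases) - 1)]
        best_i = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_i] < threshold:
            break
        phrases[best_i:best_i + 2] = [phrases[best_i] + phrases[best_i + 1]]
    return phrases
```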
Collocation Mining
 Collocation: A sequence of words that co-occurs more frequently than expected by chance
 Often “interesting” and, due to their non-compositionality, often convey information not portrayed by their constituent terms (e.g., “made an exception”, “strong tea”)
 Many different measures are used to extract collocations from a corpus [Dunning 93, Pedersen 96]
 E.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio
 Many of these measures can be used to guide the agglomerative phrase-
segmentation algorithm
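As an example of the simplest such measure, pointwise mutual information can be computed directly from corpus counts; the other tests listed above follow the same count-based pattern. The counts in the usage line are made up for illustration.

```python
import math

def pmi(bigram_count, w1_count, w2_count, n_tokens):
    """Pointwise mutual information, log p(w1 w2) / (p(w1) p(w2)).
    Strongly positive values mean the pair co-occurs far more often
    than chance, the hallmark of a collocation."""
    p_joint = bigram_count / n_tokens
    return math.log(p_joint / ((w1_count / n_tokens) * (w2_count / n_tokens)))

# "strong tea": both words moderately common, the pair unusually so
print(pmi(bigram_count=50, w1_count=2000, w2_count=1500, n_tokens=1_000_000))
```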
ToPMine: Phrase LDA (Constrained Topic Modeling)
 The generative model for PhraseLDA is the same as
LDA
 Difference: the model incorporates constraints
obtained from the “bag-of-phrases” input
 The chain graph shows that all words in a phrase are constrained to take on the same topic value
[knowledge discovery] using [least squares]
[support vector machine] [classifiers] …
Topic model inference with phrase constraints
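The constraint is easiest to see in code. Below is a minimal collapsed-Gibbs sketch, assuming documents are already segmented into bags of phrases: the sampler draws one topic per phrase and applies it to every word in the phrase. It simplifies the conditional (within-phrase count updates are ignored), so treat it as an illustration of the constraint rather than the exact PhraseLDA inference of El-Kishky et al.'15.

```python
import random
from collections import defaultdict

def phrase_lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01):
    """docs: list of documents, each a list of phrases, each a tuple
    of word ids. Returns one topic assignment per phrase."""
    V = 1 + max(w for d in docs for p in d for w in p)
    ndk, nkw, nk = defaultdict(int), defaultdict(int), defaultdict(int)
    z = []                                     # one topic per phrase
    for d, doc in enumerate(docs):
        z.append([random.randrange(K) for _ in doc])
        for phrase, k in zip(doc, z[d]):
            ndk[d, k] += 1
            for w in phrase:
                nkw[k, w] += 1
                nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, phrase in enumerate(doc):
                k = z[d][i]                    # remove the phrase's counts
                ndk[d, k] -= 1
                for w in phrase:
                    nkw[k, w] -= 1
                    nk[k] -= 1
                weights = []                   # phrase-level conditional:
                for t in range(K):             # all words scored under topic t
                    p = ndk[d, t] + alpha
                    for w in phrase:
                        p *= (nkw[t, w] + beta) / (nk[t] + V * beta)
                    weights.append(p)
                k = random.choices(range(K), weights)[0]
                z[d][i] = k                    # add back under the new topic
                ndk[d, k] += 1
                for w in phrase:
                    nkw[k, w] += 1
                    nk[k] += 1
    return z
```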
Example Topical Phrases: A Comparison

ToPMine [El-Kishky et al.’15], Strategy 3 (67 seconds):
| Topic 1 | Topic 2 |
| information retrieval | feature selection |
| social networks | machine learning |
| web search | semi supervised |
| search engine | large scale |
| information extraction | support vector machines |
| question answering | active learning |
| web pages | face recognition |
| … | … |

PDLDA [Lindsey et al.’12], Strategy 1 (3.72 hours):
| Topic 1 | Topic 2 |
| social networks | information retrieval |
| web search | text classification |
| time series | machine learning |
| search engine | support vector machines |
| management system | information extraction |
| real time | neural networks |
| decision trees | text categorization |
| … | … |
ToPMine: Experiments on DBLP Abstracts
ToPMine: Topics on Associated Press News (1989)
ToPMine: Experiments on Yelp Reviews
Session 5. A Comparative
Study of Three Strategies
Efficiency: Running Time of Different Strategies
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Running time: strategy 3 > strategy 2 > strategy 1 (“>” means outperforms)
Coherence of Topics: Comparison of Strategies
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Coherence measured by z-score: strategy 3 > strategy 2 > strategy 1
Phrase Intrusion: Comparison of Strategies
Phrase intrusion measured by average number of correct answers:
strategy 3 > strategy 2 > strategy 1
Phrase Quality: Comparison of Strategies
Phrase quality measured by z-score:
strategy 3 > strategy 2 > strategy 1
Summary: Strategies on Topical Phrase Mining
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Integrated complex model; phrase quality and topic inference rely on each
other
 Slow and prone to overfitting
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Phrase quality relies on topic labels for unigrams
 Can be fast; generally high-quality topics and phrases
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
 Topic inference relies on correct segmentation of documents, but is not sensitive to segmentation errors
 Can be fast; generally high-quality topics and phrases
Session 6. Mining Quality Phrases
from Massive Text Corpora: A
SegPhrase Approach
Mining Phrases: Why Not Use Raw Frequency Based
Methods?
 Traditional data-driven approaches
 Frequent pattern mining
 If AB is frequent, AB is likely to be a phrase
 Raw frequency could NOT reflect the quality of phrases
 E.g., freq(vector machine) ≥ freq(support vector machine)
 Need to rectify the frequency based on segmentation results
 Phrasal segmentation will tell
 Some words should be treated as a whole phrase whereas others are still
unigrams
SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus
Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus
Example raw-corpus documents:
Document 1: Citation recommendation is an interesting but challenging research problem in the data mining area.
Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining techniques.
Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
SegPhrase: The Overall Framework
 ClassPhrase: Frequent pattern mining, feature extraction, classification
 SegPhrase: Phrasal segmentation and phrase quality estimation
 SegPhrase+: One more round to enhance mined phrase quality
What Kind of Phrases Are of “High Quality”?
 Judging the quality of phrases
 Popularity
 “information retrieval” vs. “cross-language information retrieval”
 Concordance
 “powerful tea” vs. “strong tea”
 “active learning” vs. “learning classification”
 Informativeness
 “this paper” (frequent but not discriminative, not informative)
 Completeness
 “vector machine” vs. “support vector machine”
ClassPhrase I: Pattern Mining for Candidate Set
 Build a candidate phrases set by frequent pattern mining
 Mining frequent k-grams
 k is typically small, e.g. 6 in our experiments
 Popularity measured by the raw frequency of words and phrases mined from the corpus (a counting sketch follows below)
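A naive version of this candidate-generation step fits in a few lines of Python; a practical implementation would add Apriori-style pruning (only extend (k-1)-grams that are already frequent) to keep memory bounded on a large corpus. The threshold below is illustrative.

```python
from collections import Counter

def mine_frequent_ngrams(docs, max_k=6, min_support=10):
    """Count every contiguous k-gram (k <= max_k) across the corpus and
    keep those above a raw-frequency threshold as candidate phrases.
    docs: list of token lists."""
    counts = Counter()
    for doc in docs:
        for k in range(1, max_k + 1):
            for i in range(len(doc) - k + 1):
                counts[tuple(doc[i:i + k])] += 1
    return {g: c for g, c in counts.items() if c >= min_support}
```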
ClassPhrase II: Feature Extraction: Concordance
 Partition a phrase into two parts to check whether their co-occurrence is significantly higher than pure random
 E.g., “support vector machine” (high concordance) vs. “this paper demonstrates” (low concordance)
 Pointwise mutual information: PMI(⟨u_l, u_r⟩) = log [ p(v) / (p(u_l) p(u_r)) ]
 Pointwise KL divergence: PKL(v ∥ ⟨u_l, u_r⟩) = p(v) log [ p(v) / (p(u_l) p(u_r)) ]
(Here v is the candidate phrase and ⟨u_l, u_r⟩ is its binary split.)
 The additional factor p(v) multiplied with the pointwise mutual information leads to less bias towards rarely occurring phrases
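A small Python sketch of the two concordance features, following the formulas above; the probabilities would come from (rectified) corpus frequencies, and the numbers in the usage line are made up.

```python
import math

def concordance_features(p_phrase, p_left, p_right):
    """Concordance features for a candidate phrase v and its binary
    split <u_l, u_r>:
      PMI = log p(v) / (p(u_l) p(u_r))
      PKL = p(v) * PMI
    The extra p(v) factor in PKL down-weights rare phrases, which
    plain PMI tends to over-reward."""
    pmi = math.log(p_phrase / (p_left * p_right))
    return pmi, p_phrase * pmi

# "support vector machine": high PMI and a non-trivial p(v)
print(concordance_features(p_phrase=1e-5, p_left=4e-5, p_right=6e-5))
```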
ClassPhrase II: Feature Extraction: Informativeness
 Deriving Informativeness
 Quality phrases typically start and end with a non-stopword
 “machine learning is” vs. “machine learning”
 Use the average IDF over words in the phrase to measure informativeness
 Usually, a quality phrase appears with higher probability in quotes, in brackets, or connected by dashes (punctuation information)
 “state-of-the-art”
 We can also incorporate features using some NLP techniques, such as POS
tagging, chunking, and semantic parsing
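A minimal sketch of the first two informativeness features, assuming precomputed document frequencies and a stopword list; the punctuation and POS features above would be added the same way.

```python
import math

def informativeness_features(phrase_words, doc_freq, n_docs, stopwords):
    """Average IDF of the phrase's words, plus a boundary check that the
    phrase starts and ends with a non-stopword ('machine learning'
    passes, 'machine learning is' fails). doc_freq maps each word to
    its document frequency."""
    avg_idf = sum(math.log(n_docs / (1 + doc_freq.get(w, 0)))
                  for w in phrase_words) / len(phrase_words)
    good_boundaries = (phrase_words[0] not in stopwords
                       and phrase_words[-1] not in stopwords)
    return avg_idf, good_boundaries
```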
ClassPhrase III: Classifier
 Limited Training
 Labels: Whether a phrase is a quality one or not
 “support vector machine”: 1
 “the experiment shows”: 0
 For ~1GB corpus, only 300 labels
 Random Forest as our classifier
 Predicted phrase quality scores lie in [0, 1]
 Bootstrap many different datasets from limited labels
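A sketch of this step using scikit-learn, assuming the concordance and informativeness features above have already been computed per labeled phrase. A random forest's trees are each fit on a bootstrap sample, matching the bootstrapping idea above, and the averaged votes give a quality score in [0, 1].

```python
from sklearn.ensemble import RandomForestClassifier

def train_phrase_classifier(features, labels):
    """ClassPhrase-style classifier sketch, trained on a small labeled
    set (~300 phrases). features: one row of concordance /
    informativeness / ... features per labeled phrase;
    labels: 1 = quality phrase, 0 = not."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(features, labels)
    return clf

# quality score of each candidate = predicted probability of class 1:
# scores = clf.predict_proba(candidate_features)[:, 1]
```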
SegPhrase: Why Do We Need Phrasal
Segmentation in Corpus?
 Phrasal segmentation can tell which phrase is more appropriate
 Ex: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe...
 Rectified phrase frequency (expected influence)
 Example: in the segmentation above, the span “vector machine” crosses a segment boundary, so it is not counted towards the rectified frequency
SegPhrase: Segmentation of Phrases
 Partition a sequence of words by maximizing the likelihood (a dynamic-programming sketch follows after this list)
 Considering
 Phrase quality score
 ClassPhrase assigns a quality score for each phrase
 Probability in corpus
 Length penalty
 length penalty 𝛼: when 𝛼 > 1, it favors shorter phrases
 Filter out phrases with low rectified frequency
 Bad phrases are expected to rarely occur in the segmentation results
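Here is the dynamic-programming sketch referenced above, assuming per-phrase quality scores (from ClassPhrase) and log-probabilities are available as dictionaries. The exact SegPhrase objective differs in details; the length-penalty form shown is one choice consistent with "α > 1 favors shorter phrases".

```python
import math

def segment(words, quality, logp, alpha=1.5, max_len=6):
    """Choose the split of the word sequence that maximizes the summed
    per-segment score
      log p(seg) + log quality(seg) - log(alpha) * (len(seg) - 1).
    With alpha > 1, longer segments pay a penalty, favoring shorter
    phrases. quality and logp are dicts keyed by word tuples."""
    n = len(words)
    best = [0.0] + [float("-inf")] * n        # best[i]: best score of words[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            seg = tuple(words[j:i])
            score = (best[j]
                     + logp.get(seg, -20.0)   # harsh default for unseen segments
                     + math.log(quality.get(seg, 1e-6))
                     - math.log(alpha) * (len(seg) - 1))
            if score > best[i]:
                best[i], back[i] = score, j
    segments, i = [], n                       # recover segments via back-pointers
    while i > 0:
        segments.append(tuple(words[back[i]:i]))
        i = back[i]
    return segments[::-1]
```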
SegPhrase+: Enhancing Phrasal Segmentation
 SegPhrase+: One more round for enhanced phrasal segmentation
 Feedback
 Using rectified frequencies, re-compute the features previously computed from raw frequencies
 Process
 Classification → Phrasal segmentation // SegPhrase
 Classification → Phrasal segmentation, with one more round using the rectified features // SegPhrase+
 Effects on computing quality scores, e.g., distinguishing the complete phrase “np hard in the strong sense” from the incomplete “np hard in the strong”, and on phrases like “data base management system”
Performance Study: Methods to Be Compared
 Phrase mining methods to be compared:
 NLP chunking based methods
 Chunks as candidates
 Sorted by TF-IDF and C-value (K. Frantzi et al., 2000)
 Unsupervised raw frequency based methods
 ConExtr (A. Parameswaran et al., VLDB 2010)
 ToPMine (A. El-Kishky et al., VLDB 2015)
 Supervised method
 KEA, designed for single document keyphrases (O. Medelyan & I. H. Witten,
2006)
Performance Study: Experimental Setting
 Datasets
 Popular Wiki Phrases
 Based on internal links
 ~7K high quality phrases
 Pooling
 Sampled 500 * 7 Wiki-uncovered phrases
 Evaluated by 3 reviewers independently
| Dataset | #docs | #words | #labels |
| DBLP | 2.77M | 91.6M | 300 |
| Yelp | 4.75M | 145.1M | 300 |
Performance: Precision-Recall Curves on DBLP
(Figure: precision-recall curves comparing SegPhrase+ (1) with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine; and (2) with our 3 variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+.)
Performance Study: Processing Efficiency
 SegPhrase+ scales linearly with the size of the corpus!
Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGMOD)
Query: SIGMOD
| Rank | SegPhrase+ | Chunking (TF-IDF & C-Value) |
| 1 | data base | data base |
| 2 | database system | database system |
| 3 | relational database | query processing |
| 4 | query optimization | query optimization |
| 5 | query processing | relational database |
| … | … | … |
| 51 | sql server | database technology |
| 52 | relational data | database server |
| 53 | data structure | large volume |
| 54 | join query | performance study |
| 55 | web service | web service |
| … | … | … |
| 201 | high dimensional data | efficient implementation |
| 202 | location based service | sensor network |
| 203 | xml schema | large collection |
| 204 | two phase locking | important issue |
| 205 | deep web | frequent itemset |
| … | … | … |
(The original slide highlights phrases found only by SegPhrase+ and phrases found only by Chunking.)
Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGKDD)
Query: SIGKDD
| Rank | SegPhrase+ | Chunking (TF-IDF & C-Value) |
| 1 | data mining | data mining |
| 2 | data set | association rule |
| 3 | association rule | knowledge discovery |
| 4 | knowledge discovery | frequent itemset |
| 5 | time series | decision tree |
| … | … | … |
| 51 | association rule mining | search space |
| 52 | rule set | domain knowledge |
| 53 | concept drift | important problem |
| 54 | knowledge acquisition | concurrency control |
| 55 | gene expression data | conceptual graph |
| … | … | … |
| 201 | web content | optimal solution |
| 202 | frequent subgraph | semantic relationship |
| 203 | intrusion detection | effective way |
| 204 | categorical attribute | space complexity |
| 205 | user preference | small set |
| … | … | … |
(The original slide highlights phrases found only by SegPhrase+ and phrases found only by Chunking.)
Experimental Results: Similarity Search
 In response to a user’s phrase query, SegPhrase+ finds high-quality, semantically similar phrases
 In DBLP, query on “data mining” and “OLAP”
 In Yelp, query on “blu-ray”, “noodle”, and “valet parking”
Experimental Results: High Quality Phrases
Generated (Top-Ranked Phrases From English Gigaword)
 Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin
 Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna
 Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis
 Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh
 Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler
 Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar
 Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen
 Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova
 Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku
 Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,
 Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo
 Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate
Recent Progress
 Distant Training: No need of human labeling
 Training using general knowledge bases
 E.g., Freebase, Wikipedia
 Quality Estimation for Unigrams
 Integration of phrases and unigrams in one uniform framework
 Demo based on DBLP abstracts
 Multiple languages: Beyond English corpora
 Extensible to mining quality phrases in multiple languages
 Recent progress: SegPhrase+ works on Chinese and Arabic
Mining Quality Phrases in Multiple Languages
 Both ToPMine and SegPhrase+ are extensible to mining quality phrases in multiple languages
 SegPhrase+ on Chinese (from Chinese Wikipedia):
| Rank | Phrase | In English |
| … | … | … |
| 62 | 首席_执行官 | CEO |
| 63 | 中间_偏右 | Centre-right |
| … | … | … |
| 84 | 百度_百科 | Baidu Baike |
| 85 | 热带_气旋 | Tropical cyclone |
| 86 | 中国科学院_院士 | Fellow of the Chinese Academy of Sciences |
| … | … | … |
| 1001 | 十大_中文_金曲 | Top-10 Chinese Songs |
| 1002 | 全球_资讯网 | Global News Website |
| 1003 | 天一阁_藏_明代_科举_录_选刊 | A Chinese book title |
| … | … | … |
| 9934 | 国家_戏剧_院 | National Theater |
| 9935 | 谢谢_你 | Thank you |
| … | … | … |
 ToPMine on Arabic (from the Quran, Fus7a Arabic, no preprocessing); example Arabic phrases:
كفروا → “Those who disbelieve”
بسم الله الرحمن الرحيم → “In the name of God the Gracious and Merciful”
Summary on SegPhrase+ and Future Work
 SegPhrase+: A new phrase mining framework
 Integrating phrase mining with phrasal segmentation
 Requires only limited training or distant training
 Generates high-quality phrases, close to human judgement
 Linearly scalable on time and space
 Looking forward: High-quality, scalable phrase mining
 Facilitate entity recognition and typing in large corpora
 (See the next part of this tutorial)
 Transform massive unstructured data into semi-structured knowledge
networks
References
 D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions, arXiv:0907.1013, 2009
 K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition:
Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
 M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
 A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
 K. Frantzi, S. Ananiadou, and H. Mima, Automatic Recognition of Multi-Word Terms: the c-value/nc-value
Method. Int. Journal on Digital Libraries, 3(2), 2000
 R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A phrase-discovering topic model using hierarchical pitman-yor
processes, EMNLP-CoNLL’12.
 J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
 O. Medelyan and I. H. Witten, Thesaurus Based Automatic Keyphrase Indexing. JCDL’06
 Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models, KDD’07
 A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from
Large Datasets. VLDB’10
 X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information
retrieval, ICDM’07