Phrase Mining and Topic
Modeling from Large Corpora
JIAWEI HAN
COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
JUNE 16, 2016
Why Phrase Mining?
 Unigrams vs. phrases
 Unigrams (single words) are ambiguous
Example: “United”: United States? United Airlines? United Parcel Service?
 Phrase: A natural, meaningful, unambiguous semantic unit
Example: “United States” vs. “United Airlines”
 Mining semantically meaningful phrases
 Transform text data from word granularity to phrase granularity
 Enhance the power and efficiency of manipulating unstructured data using database technology
Mining Phrases: Why Not Use NLP Methods?
 Phrase mining: Originated from the NLP community—“Chunking”
 Model it as a sequence labeling problem (B-NP, I-NP, O, …)
 Need annotation and training
 Annotate hundreds of documents as training data
 Train a supervised model based on part-of-speech features
 Recent trend:
 Use distributional features based on web n-grams (Bergsma et al., 2010)
 State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score
 Limitations
 High annotation cost, not scalable to a new language, a new domain or genre
 May not fit domain-specific, dynamic, emerging applications
 Scientific domains, query logs, or social media, e.g., Yelp, Twitter
Data Mining Approaches
 General principle: Fully exploit information redundancy and data-
driven criteria to determine phrase boundaries and salience
 Phrase Mining and Topic Modeling from Large Corpora
 Strategy 1: Simultaneously Inferring Phrases and Topics
 Strategy 2: Post Topic Modeling Phrase Construction
 Strategy 3: First Phrase Mining then Topic Modeling (ToPMine)
 Integration of Phrase Mining with Document Segmentation
Session 1. Frequent Pattern
Mining for Text Data
Frequent Pattern Mining for Text Data: Phrase
Mining and Topic Modeling
 Motivation: Unigrams (single words) can be difficult to interpret
 Ex.: The topic that represents the area of Machine Learning
Unigram topic: learning, reinforcement, support, machine, vector, selection, feature, random, …
versus
Phrase-based topic: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …
Various Strategies: Phrase-Based Topic Modeling
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Bigram topical model [Wallach’06], topical n-gram model [Wang, et al.’07],
phrase discovering topic model [Lindsey, et al.’12]
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Label topic [Mei et al.’07], TurboTopics [Blei & Lafferty’09], KERT [Danilevsky, et al.’14]
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose on
the bag-of-words model
 ToPMine [El-Kishky, et al.’15]
Session 2. Strategy 1:
Simultaneously Inferring
Phrases and Topics
Strategy 1: Simultaneously Inferring Phrases and Topics
 Bigram Topic Model [Wallach’06]
 Probabilistic generative model that conditions on previous word and topic when
drawing next word
 Topical N-Grams (TNG) [Wang, et al.’07]
 Probabilistic model that generates words in textual order
 Create n-grams by concatenating successive bigrams (a generalization of Bigram
Topic Model)
 Phrase-Discovering LDA (PDLDA) [Lindsey, et al.’12]
 Viewing each sentence as a time-series of words, PDLDA posits that the
generative parameter (topic) changes periodically
 Each word is drawn based on previous m words (context) and current phrase topic
 High model complexity: tends to overfit; high inference cost: slow
TNG: Experiments on Research Papers
TNG: Experiments on Research Papers (2)
Session 3. Strategy 2: Post
Topic Modeling Phrase
Construction
Strategy 2: Post Topic Modeling Phrase Construction
 TurboTopics [Blei & Lafferty’09] – Phrase construction as a post-processing step
to Latent Dirichlet Allocation
 Perform Latent Dirichlet Allocation on corpus to assign each token a topic label
 Merge adjacent unigrams with the same topic label by a distribution-free permutation test on an arbitrary-length back-off model
 End recursive merging when all significant adjacent unigrams have been
merged
 KERT [Danilevsky et al.’14] – Phrase construction as a post-processing step
to Latent Dirichlet Allocation
 Perform frequent pattern mining on each topic
 Perform phrase ranking based on four different criteria
Example of TurboTopics
 Perform LDA on corpus to assign each token a topic label
 E.g., … phase_11 transition_11 … game_153 theory_127 … (the subscript is each token’s topic label)
 Then merge adjacent unigrams with same topic label
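To make the merging step concrete, here is a minimal Python sketch, assuming tokenized documents whose tokens already carry LDA topic labels. TurboTopics itself accepts a merge via a distribution-free permutation test; the raw-count threshold below (min_count) is a simplified, hypothetical stand-in for that test.

```python
from collections import Counter

def merge_same_topic_bigrams(docs, min_count=5):
    """Sketch of TurboTopics-style merging. docs: one list per document
    of (word, topic_label) pairs produced by LDA. Adjacent tokens with
    the same topic label become merge candidates; merges that recur at
    least min_count times are applied."""
    candidates = Counter()
    for doc in docs:
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if t1 == t2:                       # same topic label
                candidates[(w1, w2, t1)] += 1
    merges = {c for c, n in candidates.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i][0], doc[i + 1][0], doc[i][1]) in merges:
                out.append((doc[i][0] + " " + doc[i + 1][0], doc[i][1]))
                i += 2                         # consume both unigrams
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs

# e.g., [("phase", 11), ("transition", 11)] -> [("phase transition", 11)]
```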
KERT: Topical Keyphrase Extraction & Ranking
[Danilevsky, et al. 2014]
Input: paper titles with unigram topic assignments (Topic 1 & Topic 2), e.g.:
knowledge discovery using least squares support vector machine classifiers
support vectors for reinforcement learning
a hybrid approach to feature selection
pseudo conditional random fields
automatic web page classification in a dynamic and hierarchical way
inverse time dependency in convex regularized learning
postprocessing decision trees to extract actionable knowledge
variance minimization least squares support vector machines
…
Topical keyphrase extraction & ranking yields:
learning
support vector machines
reinforcement learning
feature selection
conditional random fields
classification
decision trees
…
Framework of KERT
1. Run bag-of-words model inference and assign topic label to each token
2. Extract candidate keyphrases within each topic
3. Rank the keyphrases in each topic
 Popularity: ‘information retrieval’ vs. ‘cross-language information retrieval’
 Discriminativeness: only frequent in documents about topic t
 Concordance: ‘active learning’ vs. ‘learning classification’
 Completeness: ‘vector machine’ vs. ‘support vector machine’
(Step 2 uses frequent pattern mining. The ranking measures have a comparability property: phrases of mixed lengths can be compared directly; a ranking sketch follows below.)
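As an illustration of step 3, here is a hedged Python sketch of a ranking loop. The four inner functions are hypothetical stand-ins for KERT's actual measures (defined in Danilevsky et al.'14); only the overall shape of the computation (score each candidate on all four criteria and sort) is meant to be faithful.

```python
import math

def rank_topical_phrases(phrase_stats):
    """Illustrative ranking loop for step 3 of KERT. phrase_stats maps
    each candidate phrase to simple corpus statistics; the scoring
    functions below are stand-ins for KERT's actual criteria."""
    def popularity(s):           # frequent within the topic
        return math.log(1 + s["freq_in_topic"])
    def discriminativeness(s):   # frequent in this topic relative to the corpus
        return s["freq_in_topic"] / max(1, s["freq_total"])
    def concordance(s):          # constituents co-occur more than chance
        return max(s["pmi"], 1e-6)
    def completeness(s):         # penalize phrases subsumed by a frequent superphrase
        return max(1.0 - s["superphrase_freq"] / max(1, s["freq_in_topic"]), 1e-6)

    scored = [(p, popularity(s) * discriminativeness(s)
                  * concordance(s) * completeness(s))
              for p, s in phrase_stats.items()]
    return sorted(scored, key=lambda x: -x[1])
```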
KERT: Topical Phrases on Machine Learning (Top-Ranked Phrases by Mining Paper Titles in DBLP)
Comparison of phrase ranking methods on the topic that represents the area of Machine Learning:

| kpRel [Zhao et al.’11] | KERT (-popularity) | KERT (-discriminativeness) | KERT (-concordance) | KERT [Danilevsky et al.’14] |
| learning | effective | support vector machines | learning | learning |
| classification | text | feature selection | classification | support vector machines |
| selection | probabilistic | reinforcement learning | selection | reinforcement learning |
| models | identification | conditional random fields | feature | feature selection |
| algorithm | mapping | constraint satisfaction | decision | conditional random fields |
| features | task | decision trees | bayesian | classification |
| decision | planning | dimensionality reduction | trees | decision trees |
| … | … | … | … | … |
Session 4. Strategy 3: First
Phrase Mining then Topic
Modeling
Strategy 3: First Phrase Mining then Topic Modeling
 ToPMine [El-Kishky et al. VLDB’15]
 First phrase construction, then topic mining
 Contrast with KERT: topic modeling, then phrase mining
 The ToPMine Framework:
 Perform frequent contiguous pattern mining to extract candidate phrases and
their counts
 Perform agglomerative merging of adjacent unigrams as guided by a significance
score—This segments each document into a “bag-of-phrases”
 The newly formed bags-of-phrases are passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic
Why First Phrase Mining then Topic Modeling ?
 With Strategy 2, tokens in the same phrase may be assigned to different topics
 Ex. knowledge discovery using least squares support vector machine classifiers…
 Knowledge discovery and support vector machine should have coherent topic labels
 Solution: switch the order of phrase mining and topic model inference
 Techniques
 Phrase mining and document segmentation
 Topic model inference with phrase constraint
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Phrase Mining: Frequent Pattern Mining + Statistical Analysis
Example segmented sentences:
[Markov blanket] [feature selection] for [support vector machines]
[knowledge discovery] using [least squares] [support vector machine] [classifiers]
… [support vector] for [machine learning] …

Raw vs. rectified (“true”) frequency of candidate phrases:
| Phrase | Raw freq. | True freq. |
| [support vector machine] | 90 | 80 |
| [vector machine] | 95 | 0 |
| [support vector] | 100 | 20 |

Quality phrases are identified based on the significance score [Church et al.’91]:
α(P1, P2) ≈ (f(P1⊕P2) − μ0(P1, P2)) / √f(P1⊕P2)
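A minimal Python sketch of the significance-guided agglomerative merging follows, taking μ0 as the frequency expected if P1 and P2 were independent (one natural choice of null model; the exact null model and threshold in ToPMine may differ in detail).

```python
import math

def significance(f_merged, f1, f2, total_tokens):
    """alpha(P1, P2) ~ (f(P1+P2) - mu0(P1, P2)) / sqrt(f(P1+P2)),
    with mu0 the frequency expected under independence of P1 and P2."""
    if f_merged == 0:
        return float("-inf")
    mu0 = f1 * f2 / total_tokens
    return (f_merged - mu0) / math.sqrt(f_merged)

def agglomerative_segment(tokens, counts, total_tokens, threshold=5.0):
    """Greedily merge the adjacent pair with the highest significance
    until no pair clears the threshold; the result is the document's
    'bag of phrases'. counts maps token tuples to corpus frequencies."""
    phrases = [(t,) for t in tokens]
    while len(phrases) > 1:
        scores = [significance(counts.get(phrases[i] + phrases[i + 1], 0),
                               counts.get(phrases[i], 0),
                               counts.get(phrases[i + 1], 0),
                               total_tokens)
                  for i in range(len(phrases) - 1)]
        best_i = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_i] < threshold:
            break
        phrases[best_i:best_i + 2] = [phrases[best_i] + phrases[best_i + 1]]
    return phrases
```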
Collocation Mining
 Collocation: A sequence of words that co-occurs more frequently than expected by chance
 Often “interesting” and, due to their non-compositionality, often convey information not portrayed by their constituent terms (e.g., “made an exception”, “strong tea”)
 Many different measures are used to extract collocations from a corpus [Dunning 93, Pedersen 96]
 E.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio
 Many of these measures can be used to guide the agglomerative phrase-
segmentation algorithm
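As an example of the simplest such measure, pointwise mutual information can be computed directly from corpus counts; the other tests listed above follow the same count-based pattern. The counts in the usage line are made up for illustration.

```python
import math

def pmi(bigram_count, w1_count, w2_count, n_tokens):
    """Pointwise mutual information, log p(w1 w2) / (p(w1) p(w2)).
    Strongly positive values mean the pair co-occurs far more often
    than chance, the hallmark of a collocation."""
    p_joint = bigram_count / n_tokens
    return math.log(p_joint / ((w1_count / n_tokens) * (w2_count / n_tokens)))

# "strong tea": both words moderately common, the pair unusually so
print(pmi(bigram_count=50, w1_count=2000, w2_count=1500, n_tokens=1_000_000))
```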
ToPMine: Phrase LDA (Constrained Topic Modeling)
 The generative model for PhraseLDA is the same as
LDA
 Difference: the model incorporates constraints
obtained from the “bag-of-phrases” input
 The chain graph shows that all words in a phrase are constrained to take on the same topic value
[knowledge discovery] using [least squares]
[support vector machine] [classifiers] …
Topic model inference with phrase constraints
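The constraint is easiest to see in code. Below is a minimal collapsed-Gibbs sketch, assuming documents are already segmented into bags of phrases: the sampler draws one topic per phrase and applies it to every word in the phrase. It simplifies the conditional (within-phrase count updates are ignored), so treat it as an illustration of the constraint rather than the exact PhraseLDA inference of El-Kishky et al.'15.

```python
import random
from collections import defaultdict

def phrase_lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01):
    """docs: list of documents, each a list of phrases, each a tuple
    of word ids. Returns one topic assignment per phrase."""
    V = 1 + max(w for d in docs for p in d for w in p)
    ndk, nkw, nk = defaultdict(int), defaultdict(int), defaultdict(int)
    z = []                                     # one topic per phrase
    for d, doc in enumerate(docs):
        z.append([random.randrange(K) for _ in doc])
        for phrase, k in zip(doc, z[d]):
            ndk[d, k] += 1
            for w in phrase:
                nkw[k, w] += 1
                nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, phrase in enumerate(doc):
                k = z[d][i]                    # remove the phrase's counts
                ndk[d, k] -= 1
                for w in phrase:
                    nkw[k, w] -= 1
                    nk[k] -= 1
                weights = []                   # phrase-level conditional:
                for t in range(K):             # all words scored under topic t
                    p = ndk[d, t] + alpha
                    for w in phrase:
                        p *= (nkw[t, w] + beta) / (nk[t] + V * beta)
                    weights.append(p)
                k = random.choices(range(K), weights)[0]
                z[d][i] = k                    # add back under the new topic
                ndk[d, k] += 1
                for w in phrase:
                    nkw[k, w] += 1
                    nk[k] += 1
    return z
```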
Example Topical Phrases: A Comparison

ToPMine [El-Kishky et al.’15], Strategy 3 (67 seconds):
| Topic 1 | Topic 2 |
| information retrieval | feature selection |
| social networks | machine learning |
| web search | semi supervised |
| search engine | large scale |
| information extraction | support vector machines |
| question answering | active learning |
| web pages | face recognition |
| … | … |

PDLDA [Lindsey et al.’12], Strategy 1 (3.72 hours):
| Topic 1 | Topic 2 |
| social networks | information retrieval |
| web search | text classification |
| time series | machine learning |
| search engine | support vector machines |
| management system | information extraction |
| real time | neural networks |
| decision trees | text categorization |
| … | … |
ToPMine: Experiments on DBLP Abstracts
ToPMine: Topics on Associated Press News (1989)
ToPMine: Experiments on Yelp Reviews
Session 5. A Comparative
Study of Three Strategies
Efficiency: Running Time of Different Strategies
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Running time: strategy 3 > strategy 2 > strategy 1 (“>” means outperforms)
Coherence of Topics: Comparison of Strategies
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Coherence measured by z-score: strategy 3 > strategy 2 > strategy 1
Phrase Intrusion: Comparison of Strategies
Phrase intrusion measured by average number of correct answers:
strategy 3 > strategy 2 > strategy 1
Phrase Quality: Comparison of Strategies
Phrase quality measured by z-score:
strategy 3 > strategy 2 > strategy 1
Summary: Strategies on Topical Phrase Mining
 Strategy 1: Generate bag-of-words → generate sequence of tokens
 Integrated complex model; phrase quality and topic inference rely on each
other
 Slow and prone to overfitting
 Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
 Phrase quality relies on topic labels for unigrams
 Can be fast; generally high-quality topics and phrases
 Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
 Topic inference relies on correct segmentation of documents, but is not sensitive to segmentation errors
 Can be fast; generally high-quality topics and phrases
Session 6. Mining Quality Phrases
from Massive Text Corpora: A
SegPhrase Approach
Mining Phrases: Why Not Use Raw Frequency Based
Methods?
 Traditional data-driven approaches
 Frequent pattern mining
 If AB is frequent, AB is likely to be a phrase
 Raw frequency could NOT reflect the quality of phrases
 E.g., freq(vector machine) ≥ freq(support vector machine)
 Need to rectify the frequency based on segmentation results
 Phrasal segmentation will tell
 Some words should be treated as a whole phrase whereas others are still
unigrams
SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus
Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus
Example raw-corpus documents:
Document 1: Citation recommendation is an interesting but challenging research problem in the data mining area.
Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining techniques.
Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
SegPhrase: The Overall Framework
 ClassPhrase: Frequent pattern mining, feature extraction, classification
 SegPhrase: Phrasal segmentation and phrase quality estimation
 SegPhrase+: One more round to enhance mined phrase quality
What Kind of Phrases Are of “High Quality”?
 Judging the quality of phrases
 Popularity
 “information retrieval” vs. “cross-language information retrieval”
 Concordance
 “powerful tea” vs. “strong tea”
 “active learning” vs. “learning classification”
 Informativeness
 “this paper” (frequent but not discriminative, not informative)
 Completeness
 “vector machine” vs. “support vector machine”
ClassPhrase I: Pattern Mining for Candidate Set
 Build a candidate phrases set by frequent pattern mining
 Mining frequent k-grams
 k is typically small, e.g. 6 in our experiments
 Popularity measured by the raw frequency of words and phrases mined from the corpus (a counting sketch follows below)
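A naive version of this candidate-generation step fits in a few lines of Python; a practical implementation would add Apriori-style pruning (only extend (k-1)-grams that are already frequent) to keep memory bounded on a large corpus. The threshold below is illustrative.

```python
from collections import Counter

def mine_frequent_ngrams(docs, max_k=6, min_support=10):
    """Count every contiguous k-gram (k <= max_k) across the corpus and
    keep those above a raw-frequency threshold as candidate phrases.
    docs: list of token lists."""
    counts = Counter()
    for doc in docs:
        for k in range(1, max_k + 1):
            for i in range(len(doc) - k + 1):
                counts[tuple(doc[i:i + k])] += 1
    return {g: c for g, c in counts.items() if c >= min_support}
```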
ClassPhrase II: Feature Extraction: Concordance
 Partition a phrase into two parts to check whether their co-occurrence is significantly higher than pure random
 E.g., “support vector machine” (high concordance) vs. “this paper demonstrates” (low concordance)
 Pointwise mutual information: PMI(⟨u_l, u_r⟩) = log [ p(v) / (p(u_l) p(u_r)) ]
 Pointwise KL divergence: PKL(v ∥ ⟨u_l, u_r⟩) = p(v) log [ p(v) / (p(u_l) p(u_r)) ]
(Here v is the candidate phrase and ⟨u_l, u_r⟩ is its binary split.)
 The additional factor p(v) multiplied with the pointwise mutual information leads to less bias towards rarely occurring phrases
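A small Python sketch of the two concordance features, following the formulas above; the probabilities would come from (rectified) corpus frequencies, and the numbers in the usage line are made up.

```python
import math

def concordance_features(p_phrase, p_left, p_right):
    """Concordance features for a candidate phrase v and its binary
    split <u_l, u_r>:
      PMI = log p(v) / (p(u_l) p(u_r))
      PKL = p(v) * PMI
    The extra p(v) factor in PKL down-weights rare phrases, which
    plain PMI tends to over-reward."""
    pmi = math.log(p_phrase / (p_left * p_right))
    return pmi, p_phrase * pmi

# "support vector machine": high PMI and a non-trivial p(v)
print(concordance_features(p_phrase=1e-5, p_left=4e-5, p_right=6e-5))
```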
ClassPhrase II: Feature Extraction: Informativeness
 Deriving Informativeness
 Quality phrases typically start and end with a non-stopword
 “machine learning is” vs. “machine learning”
 Use the average IDF over words in the phrase to measure informativeness
 Usually, a quality phrase appears with higher probability in quotes, in brackets, or connected by dashes (punctuation information)
 “state-of-the-art”
 We can also incorporate features using some NLP techniques, such as POS
tagging, chunking, and semantic parsing
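A minimal sketch of the first two informativeness features, assuming precomputed document frequencies and a stopword list; the punctuation and POS features above would be added the same way.

```python
import math

def informativeness_features(phrase_words, doc_freq, n_docs, stopwords):
    """Average IDF of the phrase's words, plus a boundary check that the
    phrase starts and ends with a non-stopword ('machine learning'
    passes, 'machine learning is' fails). doc_freq maps each word to
    its document frequency."""
    avg_idf = sum(math.log(n_docs / (1 + doc_freq.get(w, 0)))
                  for w in phrase_words) / len(phrase_words)
    good_boundaries = (phrase_words[0] not in stopwords
                       and phrase_words[-1] not in stopwords)
    return avg_idf, good_boundaries
```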
ClassPhrase III: Classifier
 Limited Training
 Labels: Whether a phrase is a quality one or not
 “support vector machine”: 1
 “the experiment shows”: 0
 For ~1GB corpus, only 300 labels
 Random Forest as our classifier
 Predicted phrase quality scores lie in [0, 1]
 Bootstrap many different datasets from limited labels
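A sketch of this step using scikit-learn, assuming the concordance and informativeness features above have already been computed per labeled phrase. A random forest's trees are each fit on a bootstrap sample, matching the bootstrapping idea above, and the averaged votes give a quality score in [0, 1].

```python
from sklearn.ensemble import RandomForestClassifier

def train_phrase_classifier(features, labels):
    """ClassPhrase-style classifier sketch, trained on a small labeled
    set (~300 phrases). features: one row of concordance /
    informativeness / ... features per labeled phrase;
    labels: 1 = quality phrase, 0 = not."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(features, labels)
    return clf

# quality score of each candidate = predicted probability of class 1:
# scores = clf.predict_proba(candidate_features)[:, 1]
```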
SegPhrase: Why Do We Need Phrasal
Segmentation in Corpus?
 Phrasal segmentation can tell which phrase is more appropriate
 Ex: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe...
 Rectified phrase frequency (expected influence)
 Example: in the segmentation above, the span “vector machine” crosses a segment boundary, so it is not counted towards the rectified frequency
SegPhrase: Segmentation of Phrases
 Partition a sequence of words by maximizing the likelihood (a dynamic-programming sketch follows after this list)
 Considering
 Phrase quality score
 ClassPhrase assigns a quality score for each phrase
 Probability in corpus
 Length penalty
 length penalty 𝛼: when 𝛼 > 1, it favors shorter phrases
 Filter out phrases with low rectified frequency
 Bad phrases are expected to rarely occur in the segmentation results
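Here is the dynamic-programming sketch referenced above, assuming per-phrase quality scores (from ClassPhrase) and log-probabilities are available as dictionaries. The exact SegPhrase objective differs in details; the length-penalty form shown is one choice consistent with "α > 1 favors shorter phrases".

```python
import math

def segment(words, quality, logp, alpha=1.5, max_len=6):
    """Choose the split of the word sequence that maximizes the summed
    per-segment score
      log p(seg) + log quality(seg) - log(alpha) * (len(seg) - 1).
    With alpha > 1, longer segments pay a penalty, favoring shorter
    phrases. quality and logp are dicts keyed by word tuples."""
    n = len(words)
    best = [0.0] + [float("-inf")] * n        # best[i]: best score of words[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            seg = tuple(words[j:i])
            score = (best[j]
                     + logp.get(seg, -20.0)   # harsh default for unseen segments
                     + math.log(quality.get(seg, 1e-6))
                     - math.log(alpha) * (len(seg) - 1))
            if score > best[i]:
                best[i], back[i] = score, j
    segments, i = [], n                       # recover segments via back-pointers
    while i > 0:
        segments.append(tuple(words[back[i]:i]))
        i = back[i]
    return segments[::-1]
```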
SegPhrase+: Enhancing Phrasal Segmentation
 SegPhrase+: One more round for enhanced phrasal segmentation
 Feedback
 Using rectified frequencies, re-compute the features previously computed from raw frequencies
 Process
 Classification → Phrasal segmentation // SegPhrase
 Classification → Phrasal segmentation, with one more round using the rectified features // SegPhrase+
 Effects on computing quality scores, e.g., distinguishing the complete phrase “np hard in the strong sense” from the incomplete “np hard in the strong”, and on phrases like “data base management system”
Performance Study: Methods to Be Compared
 Phrase mining methods to be compared:
 NLP chunking based methods
 Chunks as candidates
 Sorted by TF-IDF and C-value (K. Frantzi et al., 2000)
 Unsupervised raw frequency based methods
 ConExtr (A. Parameswaran et al., VLDB 2010)
 ToPMine (A. El-Kishky et al., VLDB 2015)
 Supervised method
 KEA, designed for single document keyphrases (O. Medelyan & I. H. Witten,
2006)
Performance Study: Experimental Setting
 Datasets
 Popular Wiki Phrases
 Based on internal links
 ~7K high quality phrases
 Pooling
 Sampled 500 * 7 Wiki-uncovered phrases
 Evaluated by 3 reviewers independently
| Dataset | #docs | #words | #labels |
| DBLP | 2.77M | 91.6M | 300 |
| Yelp | 4.75M | 145.1M | 300 |
Performance: Precision-Recall Curves on DBLP
(Figure: precision-recall curves comparing SegPhrase+ (1) with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine; and (2) with our 3 variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+.)
Performance Study: Processing Efficiency
 SegPhrase+ scales linearly with the size of the corpus!
Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGMOD)
Query: SIGMOD
| Rank | SegPhrase+ | Chunking (TF-IDF & C-Value) |
| 1 | data base | data base |
| 2 | database system | database system |
| 3 | relational database | query processing |
| 4 | query optimization | query optimization |
| 5 | query processing | relational database |
| … | … | … |
| 51 | sql server | database technology |
| 52 | relational data | database server |
| 53 | data structure | large volume |
| 54 | join query | performance study |
| 55 | web service | web service |
| … | … | … |
| 201 | high dimensional data | efficient implementation |
| 202 | location based service | sensor network |
| 203 | xml schema | large collection |
| 204 | two phase locking | important issue |
| 205 | deep web | frequent itemset |
| … | … | … |
(The original slide highlights phrases found only by SegPhrase+ and phrases found only by Chunking.)
Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGKDD)
Query: SIGKDD
| Rank | SegPhrase+ | Chunking (TF-IDF & C-Value) |
| 1 | data mining | data mining |
| 2 | data set | association rule |
| 3 | association rule | knowledge discovery |
| 4 | knowledge discovery | frequent itemset |
| 5 | time series | decision tree |
| … | … | … |
| 51 | association rule mining | search space |
| 52 | rule set | domain knowledge |
| 53 | concept drift | important problem |
| 54 | knowledge acquisition | concurrency control |
| 55 | gene expression data | conceptual graph |
| … | … | … |
| 201 | web content | optimal solution |
| 202 | frequent subgraph | semantic relationship |
| 203 | intrusion detection | effective way |
| 204 | categorical attribute | space complexity |
| 205 | user preference | small set |
| … | … | … |
(The original slide highlights phrases found only by SegPhrase+ and phrases found only by Chunking.)
Experimental Results: Similarity Search
 In response to a user’s phrase query, SegPhrase+ finds high-quality, semantically similar phrases
 In DBLP, query on “data mining” and “OLAP”
 In Yelp, query on “blu-ray”, “noodle”, and “valet parking”
Experimental Results: High Quality Phrases
Generated (Top-Ranked Phrases From English Gigaword)
 Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin
 Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna
 Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis
 Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh
 Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler
 Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar
 Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen
 Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova
 Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku
 Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,
 Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo
 Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate
Recent Progress
 Distant Training: No need of human labeling
 Training using general knowledge bases
 E.g., Freebase, Wikipedia
 Quality Estimation for Unigrams
 Integration of phrases and unigrams in one uniform framework
 Demo based on DBLP abstracts
 Multiple languages: Beyond English corpora
 Extensible to mining quality phrases in multiple languages
 Recent progress: SegPhrase+ works on Chinese and Arabic
Mining Quality Phrases in Multiple Languages
 Both ToPMine and SegPhrase+ are extensible to mining quality phrases in multiple languages
 SegPhrase+ on Chinese (from Chinese Wikipedia):
| Rank | Phrase | In English |
| … | … | … |
| 62 | 首席_执行官 | CEO |
| 63 | 中间_偏右 | Centre-right |
| … | … | … |
| 84 | 百度_百科 | Baidu Baike |
| 85 | 热带_气旋 | Tropical cyclone |
| 86 | 中国科学院_院士 | Fellow of the Chinese Academy of Sciences |
| … | … | … |
| 1001 | 十大_中文_金曲 | Top-10 Chinese Songs |
| 1002 | 全球_资讯网 | Global News Website |
| 1003 | 天一阁_藏_明代_科举_录_选刊 | A Chinese book title |
| … | … | … |
| 9934 | 国家_戏剧_院 | National Theater |
| 9935 | 谢谢_你 | Thank you |
| … | … | … |
 ToPMine on Arabic (from the Quran, Fus7a Arabic, no preprocessing); example Arabic phrases:
كفروا → “Those who disbelieve”
بسم الله الرحمن الرحيم → “In the name of God the Gracious and Merciful”
Summary on SegPhrase+ and Future Work
 SegPhrase+: A new phrase mining framework
 Integrating phrase mining with phrasal segmentation
 Requires only limited training or distant training
 Generates high-quality phrases, close to human judgement
 Linearly scalable on time and space
 Looking forward: High-quality, scalable phrase mining
 Facilitate entity recognition and typing in large corpora
 (See the next part of this tutorial)
 Transform massive unstructured data into semi-structured knowledge
networks
References
 D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions, arXiv:0907.1013, 2009
 K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition:
Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
 M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
 A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
 K. Frantzi, S. Ananiadou, and H. Mima, Automatic Recognition of Multi-Word Terms: the c-value/nc-value
Method. Int. Journal on Digital Libraries, 3(2), 2000
 R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A phrase-discovering topic model using hierarchical pitman-yor
processes, EMNLP-CoNLL’12.
 J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
 O. Medelyan and I. H. Witten, Thesaurus Based Automatic Keyphrase Indexing. JCDL’06
 Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models, KDD’07
 A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from
Large Datasets. VLDB’10
 X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information
retrieval, ICDM’07