Reducing Over-generation Errors for Automatic Keyphrase
Extraction using Integer Linear Programming
Florian Boudin
LINA - UMR CNRS 6241, Université de Nantes, France
Keyphrase 2015
Errors made by keyphrase extraction systems
Over-generation errors: 37%
Infrequency errors: 27%
Redundancy errors: 12%
Evaluation errors: 10%
[Hasan and Ng, 2014]
Motivation
Most errors are due to over-generation
System correctly outputs a keyphrase because it contains an important word, but
erroneously predicts other candidates as keyphrases because they contain the same word
e.g. olympics, olympic movement, international olympic committee
Why are over-generation errors frequent?
Candidates are ranked independently, often according to their component words
We propose a global inference model to tackle the problem of over-generation errors
Outline
Introduction
Method
Experiments
Conclusion
Proposed method
Weighting candidates vs. weighting component words
Words are easier to extract, match and weight
Useful for reducing over-generation errors
Ensure that the importance of each word is counted only once in the set of keyphrases
Keyphrases should be extracted as a set rather than independently
Finding the optimal set of keyphrases → combinatorial optimisation problem
Formulated as an integer linear program (ILP)
Solved exactly using off-the-shelf solvers
ILP model definition
Based on the concept-based model for summarization [Gillick and Favre, 2009]
The value of a set of keyphrases is the sum of the weights of its unique words
Word weights: olympic(s) = 5; game = 1; 100-meter = 2; dash = 2
Candidates: Olympics; Olympic games; 100-meter dash

{Olympic games, 100-meter dash} → 5 + 1 + 2 + 2 = 10
{Olympics, 100-meter dash} → 5 + 2 + 2 = 9
{Olympics, Olympic games} → 5 + 1 = 6
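In code, the set value is a one-liner: collect the unique words of the selected keyphrases and sum their weights once. A minimal Python sketch, assuming words are already stemmed so that olympic/olympics collapse to the same key:

```python
def set_value(keyphrases, weights):
    """Value of a keyphrase set: each unique word is counted only once."""
    unique_words = {w for kp in keyphrases for w in kp.split()}
    return sum(weights.get(w, 0.0) for w in unique_words)

# Stemmed word weights from the example above
weights = {"olympic": 5, "game": 1, "100-meter": 2, "dash": 2}

print(set_value(["olympic game", "100-meter dash"], weights))  # 5 + 1 + 2 + 2 = 10
print(set_value(["olympic", "100-meter dash"], weights))       # 5 + 2 + 2 = 9
print(set_value(["olympic", "olympic game"], weights))         # 5 + 1 = 6
```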
ILP model definition (cont.)
Let x_i and c_j be binary variables indicating the presence of word i and candidate j in the set of extracted keyphrases:

max   Σ_i w_i x_i                    ← summing over unique word weights
s.t.  Σ_j c_j ≤ N                    ← number of extracted keyphrases
      c_j Occ_ij ≤ x_i   ∀i, j       ← consistency constraints
      Σ_j c_j Occ_ij ≥ x_i   ∀i

where Occ_ij = 1 if word i occurs in candidate j
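A minimal sketch of this program using the PuLP modelling library; the choice of PuLP and the CBC solver is an assumption, since the slides only say "off-the-shelf solvers":

```python
import pulp

def extract_keyphrases(candidates, weights, N=10):
    """Select a set of keyphrases by solving the ILP exactly."""
    # Unique words occurring in at least one candidate
    words = sorted({w for cand in candidates for w in cand.split()})

    prob = pulp.LpProblem("keyphrase_extraction", pulp.LpMaximize)
    x = {i: pulp.LpVariable(f"x_{i}", cat="Binary") for i in words}
    c = {j: pulp.LpVariable(f"c_{j}", cat="Binary")
         for j in range(len(candidates))}

    # Objective: sum of the weights of the unique words in the extracted set
    prob += pulp.lpSum(weights.get(i, 0.0) * x[i] for i in words)

    # At most N extracted keyphrases
    prob += pulp.lpSum(c.values()) <= N

    # Consistency: a selected candidate selects all of its words ...
    for j, cand in enumerate(candidates):
        for i in set(cand.split()):
            prob += c[j] <= x[i]
    # ... and a word is selected only if some candidate containing it is
    for i in words:
        prob += pulp.lpSum(c[j] for j, cand in enumerate(candidates)
                           if i in cand.split()) >= x[i]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [cand for j, cand in enumerate(candidates) if c[j].value() > 0.5]
```

With the example weights above and N = 2, the solver would pick {olympic game, 100-meter dash}, the pair covering the most word weight (10).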
ILP model definition (cont.)
By summing over word weights, the model overly favors long candidates
e.g. olympics < olympic games < modern olympic games
To correct this bias in the model
1. Pruning long candidates
2. Adding constraints to prefer shorter candidates
3. Adding a regularization term to the objective function
Regularization
Let l_j be the size, in words, of candidate j, and substr_j the number of times candidate j occurs as a substring in other candidates:

max   Σ_i w_i x_i − λ Σ_j (l_j − 1) / (1 + substr_j) · c_j
The regularization term penalizes candidates made of more than one word; the penalty is dampened for candidates that occur frequently as substrings of other candidates
[Figure: example keyphrase sets extracted with low, mid and high values of λ]
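In the ILP sketch above, only the objective line changes. A hedged fragment reusing candidates, words, weights, x, c and prob from that sketch; the string-containment test for substr_j is a rough approximation:

```python
# Regularized objective (replaces the objective line of the sketch above).
# Penalizes multi-word candidates; the penalty is dampened when a candidate
# often occurs as a substring of other candidates.
lengths = [len(cand.split()) for cand in candidates]
substr = [sum(other != cand and cand in other for other in candidates)
          for cand in candidates]

lam = 1.0  # regularization strength λ, tuned on the training set
prob += (pulp.lpSum(weights.get(i, 0.0) * x[i] for i in words)
         - lam * pulp.lpSum((lengths[j] - 1) / (1 + substr[j]) * c[j]
                            for j in range(len(candidates))))
```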
Outline
Introduction
Method
Experiments
Conclusion
Experimental parameters
Experiments are carried out on the SemEval dataset [Kim et al., 2010]
Scientific articles from the ACM Digital Library
144 articles (training) + 100 articles (test)
Keyphrase candidates are sequences of nouns and adjectives (see the sketch after this list)
Evaluation in terms of precision, recall and f-measure at the top N keyphrases
Sets of combined author- and reader-assigned keyphrases as reference keyphrases
Extracted/reference keyphrases are stemmed
Regularization parameter λ tuned on the training set
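One possible reading of the candidate selection step, sketched with NLTK part-of-speech tags; keeping only maximal noun/adjective runs is an assumption, and the paper's actual selection rules may differ:

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

def select_candidates(text):
    """Keep maximal runs of nouns (NN*) and adjectives (JJ*) as candidates."""
    candidates, run = set(), []
    for token, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append(token.lower())
        else:
            if run:
                candidates.add(" ".join(run))
            run = []
    if run:
        candidates.add(" ".join(run))
    return candidates
```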
Word weighting functions
TF×IDF [Spärck Jones, 1972]
IDF weights are computed on the training set
TextRank [Mihalcea and Tarau, 2004]
The co-occurrence window is the sentence; edge weights are co-occurrence counts
Logistic regression [Hong and Nenkova, 2014]
Reference keyphrases in training data are used to generate positive/negative examples
Features: position of first occurrence, TF×IDF, presence in the first sentence
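For concreteness, a sketch of the first weighting function; the exact TF and IDF smoothing is not given on the slide, so the +1 in the denominator below is an assumption:

```python
import math
from collections import Counter

def tfidf_weights(doc_words, doc_freq, n_docs):
    """TF×IDF weight for each (stemmed) word of a document.

    doc_freq maps a word to the number of training documents containing
    it; the +1 smoothing in the IDF is an assumption, not from the slides.
    """
    tf = Counter(doc_words)
    return {w: tf[w] * math.log(n_docs / (1 + doc_freq.get(w, 0)))
            for w in tf}
```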
Baselines
sum: ranking candidates using the sum of the weights of their component words [Wan and Xiao, 2008]
norm: ranking candidates using the sum of the weights of their component words, normalized by their lengths
Redundant keyphrases are pruned from the ranked lists
1. Olympic games
2. Olympics
3. 100-meter dash
4. · · ·
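Both baselines are straightforward to sketch; the redundancy-pruning rule is not spelled out on the slide, so it is omitted here:

```python
def rank_sum(candidates, weights):
    """Baseline 'sum': score = summed weight of the component words."""
    return sorted(candidates, reverse=True,
                  key=lambda c: sum(weights.get(w, 0.0) for w in c.split()))

def rank_norm(candidates, weights):
    """Baseline 'norm': the same score, normalized by length in words."""
    return sorted(candidates, reverse=True,
                  key=lambda c: sum(weights.get(w, 0.0) for w in c.split())
                                / len(c.split()))
```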
Results
                             Top-5 candidates        Top-10 candidates
Weighting + Ranking           P     R     F           P     R     F
TF×IDF + sum                 5.6   1.9   2.8         5.3   3.5   4.2
TF×IDF + norm               19.2   6.7   9.9        15.1  10.6  12.3
TF×IDF + ilp                25.4   9.1  13.3†       17.5  12.4  14.4†
TextRank + sum               4.5   1.6   2.3         4.0   2.8   3.3
TextRank + norm             18.8   6.6   9.6        14.5  10.1  11.8
TextRank + ilp              22.6   8.0  11.7†       17.4  12.2  14.2†
Logistic regression + sum    4.2   1.5   2.2         4.7   3.4   3.9
Logistic regression + norm  23.8   8.3  12.2        18.9  13.3  15.5
Logistic regression + ilp   29.4  10.4  15.3†       19.8  14.1  16.3
Results (cont.)
                             Top-5 candidates              Top-10 candidates
Method                        P     R     F    rank         P     R     F    rank
SemEval - TF×IDF             22.0   7.5  11.2              17.7  12.1  14.4
TF×IDF + ilp                 25.4   9.1  13.3  14/20       17.5  12.4  14.4  18/20
SemEval - MaxEnt             21.4   7.3  10.9              17.3  11.8  14.0
Logistic regression + ilp    29.4  10.4  15.3  10/20       19.8  14.1  16.3  15/20
Example (J-3.txt)
TF×IDF + sum (P = 0.1)
advertis bid; certain advertis budget; keyword bid; convex hull landscap; budget optim bid; uniform bid strategi; advertis slot; advertis campaign; ward advertis; searchbas advertis

TF×IDF + norm (P = 0.2)
advertis; advertis bid; keyword; keyword bid; landscap; advertis slot; advertis campaign; ward advertis; searchbas advertis; advertis random

TF×IDF + ilp (P = 0.4)
click; advertis; uniform bid; landscap; auction; convex hull; keyword; budget optim; single-bid strategi; queri
Outline
Introduction
Method
Experiments
Conclusion
Conclusion
Proposed ILP model
Can be applied on top of any word weighting function
Reduces over-generation errors by weighting candidates as a set
Substantial improvement over commonly used word-based ranking approaches
Future work
Phrase-based model regularized by word redundancy
Thank you
florian.boudin@univ-nantes.fr
References I
Gillick, D. and Favre, B. (2009).
A scalable global model for summarization.
In Proceedings of the Workshop on Integer Linear Programming for Natural Language
Processing, pages 10–18, Boulder, Colorado. Association for Computational Linguistics.
Hasan, K. S. and Ng, V. (2014).
Automatic keyphrase extraction: A survey of the state of the art.
In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1262–1273, Baltimore, Maryland. Association for
Computational Linguistics.
References II
Hong, K. and Nenkova, A. (2014).
Improving the estimation of word importance for news multi-document summarization.
In Proceedings of the 14th Conference of the European Chapter of the Association for
Computational Linguistics, pages 712–721, Gothenburg, Sweden. Association for
Computational Linguistics.
Kim, S. N., Medelyan, O., Kan, M.-Y., and Baldwin, T. (2010).
SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles.
In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26,
Uppsala, Sweden. Association for Computational Linguistics.
Mihalcea, R. and Tarau, P. (2004).
TextRank: Bringing order into texts.
In Lin, D. and Wu, D., editors, Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain.
Association for Computational Linguistics.
References III
Spärck Jones, K. (1972).
A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation, 28:11–21.
Wan, X. and Xiao, J. (2008).
CollabRank: Towards a collaborative approach to single-document keyphrase extraction.
In Proceedings of the 22nd International Conference on Computational Linguistics (Coling
2008), pages 969–976, Manchester, UK. Coling 2008 Organizing Committee.