Reducing Over-generation Errors for Automatic Keyphrase
Extraction using Integer Linear Programming
Florian Boudin
LINA - UMR CNRS 6241, Université de Nantes, France
Keyphrase 2015
Errors made by keyphrase extraction systems
Over-generation errors: 37%
Infrequency errors: 27%
Redundancy errors: 12%
Evaluation errors: 10%
[Hasan and Ng, 2014]
Motivation
Most errors are due to over-generation
System correctly outputs a keyphrase because it contains an important word, but
erroneously predicts other candidates as keyphrases because they contain the same word
e.g. olympics, olympic movement, international olympic committee
Why are over-generation errors frequent?
Candidates are ranked independently, often according to their component words
We propose a global inference model to tackle the problem of over-generation errors
Outline
Introduction
Method
Experiments
Conclusion
Proposed method
Weighting candidates vs. weighting component words
Words are easier to extract, match and weight
Useful for reducing over-generation errors
Ensure that the importance of each word is counted only once in the set of keyphrases
Keyphrases should be extracted as a set rather than independently
Finding the optimal set of keyphrases → combinatorial optimisation problem
Formulated as an integer linear program (ILP)
Solved exactly using off-the-shelf solvers
ILP model definition
Based on the concept-based model for summarization [Gillick and Favre, 2009]
The value of a set of keyphrases is the sum of the weights of its unique words
Word weights: olympic(s) = 5; game = 1; 100-meter = 2; dash = 2
Candidates: Olympics; Olympic games; 100-meter dash

{Olympic games, 100-meter dash} → 5 + 1 + 2 + 2 = 10
{Olympics, 100-meter dash} → 5 + 2 + 2 = 9
{Olympics, Olympic games} → 5 + 1 = 6
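In code, the set value is a one-liner: collect the unique words of the selected keyphrases and sum their weights once. A minimal Python sketch, assuming words are already stemmed so that olympic/olympics collapse to the same key:

```python
def set_value(keyphrases, weights):
    """Value of a keyphrase set: each unique word is counted only once."""
    unique_words = {w for kp in keyphrases for w in kp.split()}
    return sum(weights.get(w, 0.0) for w in unique_words)

# Stemmed word weights from the example above
weights = {"olympic": 5, "game": 1, "100-meter": 2, "dash": 2}

print(set_value(["olympic game", "100-meter dash"], weights))  # 5 + 1 + 2 + 2 = 10
print(set_value(["olympic", "100-meter dash"], weights))       # 5 + 2 + 2 = 9
print(set_value(["olympic", "olympic game"], weights))         # 5 + 1 = 6
```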
ILP model definition (cont.)
Let x_i and c_j be binary variables indicating the presence of word i and candidate j in the set of extracted keyphrases:

max   Σ_i w_i x_i                    ← summing over unique word weights
s.t.  Σ_j c_j ≤ N                    ← number of extracted keyphrases
      c_j Occ_ij ≤ x_i   ∀i, j       ← consistency constraints
      Σ_j c_j Occ_ij ≥ x_i   ∀i

where Occ_ij = 1 if word i occurs in candidate j
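A minimal sketch of this program using the PuLP modelling library; the choice of PuLP and the CBC solver is an assumption, since the slides only say "off-the-shelf solvers":

```python
import pulp

def extract_keyphrases(candidates, weights, N=10):
    """Select a set of keyphrases by solving the ILP exactly."""
    # Unique words occurring in at least one candidate
    words = sorted({w for cand in candidates for w in cand.split()})

    prob = pulp.LpProblem("keyphrase_extraction", pulp.LpMaximize)
    x = {i: pulp.LpVariable(f"x_{i}", cat="Binary") for i in words}
    c = {j: pulp.LpVariable(f"c_{j}", cat="Binary")
         for j in range(len(candidates))}

    # Objective: sum of the weights of the unique words in the extracted set
    prob += pulp.lpSum(weights.get(i, 0.0) * x[i] for i in words)

    # At most N extracted keyphrases
    prob += pulp.lpSum(c.values()) <= N

    # Consistency: a selected candidate selects all of its words ...
    for j, cand in enumerate(candidates):
        for i in set(cand.split()):
            prob += c[j] <= x[i]
    # ... and a word is selected only if some candidate containing it is
    for i in words:
        prob += pulp.lpSum(c[j] for j, cand in enumerate(candidates)
                           if i in cand.split()) >= x[i]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [cand for j, cand in enumerate(candidates) if c[j].value() > 0.5]
```

With the example weights above and N = 2, the solver would pick {olympic game, 100-meter dash}, the pair covering the most word weight (10).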
ILP model definition (cont.)
By summing over word weights, the model overly favors long candidates
e.g. olympics < olympic games < modern olympic games
To correct this bias in the model
1. Pruning long candidates
2. Adding constraints to prefer shorter candidates
3. Adding a regularization term to the objective function
Regularization
Let l_j be the size, in words, of candidate j, and substr_j the number of times candidate j occurs as a substring in other candidates:

max   Σ_i w_i x_i − λ Σ_j (l_j − 1) / (1 + substr_j) · c_j
The regularization term penalizes candidates made of more than one word; the penalty is dampened for candidates that occur frequently as substrings of other candidates
[Figure: example keyphrase sets extracted with low, mid and high values of λ]
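In the ILP sketch above, only the objective line changes. A hedged fragment reusing candidates, words, weights, x, c and prob from that sketch; the string-containment test for substr_j is a rough approximation:

```python
# Regularized objective (replaces the objective line of the sketch above).
# Penalizes multi-word candidates; the penalty is dampened when a candidate
# often occurs as a substring of other candidates.
lengths = [len(cand.split()) for cand in candidates]
substr = [sum(other != cand and cand in other for other in candidates)
          for cand in candidates]

lam = 1.0  # regularization strength λ, tuned on the training set
prob += (pulp.lpSum(weights.get(i, 0.0) * x[i] for i in words)
         - lam * pulp.lpSum((lengths[j] - 1) / (1 + substr[j]) * c[j]
                            for j in range(len(candidates))))
```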
Outline
Introduction
Method
Experiments
Conclusion
Experimental parameters
Experiments are carried out on the SemEval dataset [Kim et al., 2010]
Scientific articles from the ACM Digital Library
144 articles (training) + 100 articles (test)
Keyphrase candidates are sequences of nouns and adjectives (see the sketch after this list)
Evaluation in terms of precision, recall and f-measure at the top N keyphrases
Sets of combined author- and reader-assigned keyphrases as reference keyphrases
Extracted/reference keyphrases are stemmed
Regularization parameter λ tuned on the training set
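One possible reading of the candidate selection step, sketched with NLTK part-of-speech tags; keeping only maximal noun/adjective runs is an assumption, and the paper's actual selection rules may differ:

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

def select_candidates(text):
    """Keep maximal runs of nouns (NN*) and adjectives (JJ*) as candidates."""
    candidates, run = set(), []
    for token, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append(token.lower())
        else:
            if run:
                candidates.add(" ".join(run))
            run = []
    if run:
        candidates.add(" ".join(run))
    return candidates
```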
Word weighting functions
TF×IDF [Spärck Jones, 1972]
IDF weights are computed on the training set
TextRank [Mihalcea and Tarau, 2004]
The co-occurrence window is the sentence; edge weights are co-occurrence counts
Logistic regression [Hong and Nenkova, 2014]
Reference keyphrases in training data are used to generate positive/negative examples
Features: position of first occurrence, TF×IDF, presence in the first sentence
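For concreteness, a sketch of the first weighting function; the exact TF and IDF smoothing is not given on the slide, so the +1 in the denominator below is an assumption:

```python
import math
from collections import Counter

def tfidf_weights(doc_words, doc_freq, n_docs):
    """TF×IDF weight for each (stemmed) word of a document.

    doc_freq maps a word to the number of training documents containing
    it; the +1 smoothing in the IDF is an assumption, not from the slides.
    """
    tf = Counter(doc_words)
    return {w: tf[w] * math.log(n_docs / (1 + doc_freq.get(w, 0)))
            for w in tf}
```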
Baselines
sum: ranking candidates using the sum of the weights of their component words [Wan and Xiao, 2008]
norm: ranking candidates using the sum of the weights of their component words, normalized by their lengths
Redundant keyphrases are pruned from the ranked lists
1. Olympic games
2. Olympics
3. 100-meter dash
4. · · ·
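Both baselines are straightforward to sketch; the redundancy-pruning rule is not spelled out on the slide, so it is omitted here:

```python
def rank_sum(candidates, weights):
    """Baseline 'sum': score = summed weight of the component words."""
    return sorted(candidates, reverse=True,
                  key=lambda c: sum(weights.get(w, 0.0) for w in c.split()))

def rank_norm(candidates, weights):
    """Baseline 'norm': the same score, normalized by length in words."""
    return sorted(candidates, reverse=True,
                  key=lambda c: sum(weights.get(w, 0.0) for w in c.split())
                                / len(c.split()))
```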
Results
                             Top-5 candidates        Top-10 candidates
Weighting + Ranking           P     R     F           P     R     F
TF×IDF + sum                 5.6   1.9   2.8         5.3   3.5   4.2
TF×IDF + norm               19.2   6.7   9.9        15.1  10.6  12.3
TF×IDF + ilp                25.4   9.1  13.3†       17.5  12.4  14.4†
TextRank + sum               4.5   1.6   2.3         4.0   2.8   3.3
TextRank + norm             18.8   6.6   9.6        14.5  10.1  11.8
TextRank + ilp              22.6   8.0  11.7†       17.4  12.2  14.2†
Logistic regression + sum    4.2   1.5   2.2         4.7   3.4   3.9
Logistic regression + norm  23.8   8.3  12.2        18.9  13.3  15.5
Logistic regression + ilp   29.4  10.4  15.3†       19.8  14.1  16.3
Results (cont.)
                             Top-5 candidates              Top-10 candidates
Method                        P     R     F    rank         P     R     F    rank
SemEval - TF×IDF             22.0   7.5  11.2              17.7  12.1  14.4
TF×IDF + ilp                 25.4   9.1  13.3  14/20       17.5  12.4  14.4  18/20
SemEval - MaxEnt             21.4   7.3  10.9              17.3  11.8  14.0
Logistic regression + ilp    29.4  10.4  15.3  10/20       19.8  14.1  16.3  15/20
Example (J-3.txt)
TF×IDF + sum (P = 0.1)
advertis bid; certain advertis budget; keyword bid; convex hull landscap; budget optim bid; uniform bid strategi; advertis slot; advertis campaign; ward advertis; searchbas advertis

TF×IDF + norm (P = 0.2)
advertis; advertis bid; keyword; keyword bid; landscap; advertis slot; advertis campaign; ward advertis; searchbas advertis; advertis random

TF×IDF + ilp (P = 0.4)
click; advertis; uniform bid; landscap; auction; convex hull; keyword; budget optim; single-bid strategi; queri
Outline
Introduction
Method
Experiments
Conclusion
Conclusion
Proposed ILP model
Can be applied on top of any word weighting function
Reduces over-generation errors by weighting candidates as a set
Substantial improvement over commonly used word-based ranking approaches
Future work
Phrase-based model regularized by word redundancy
Thank you
florian.boudin@univ-nantes.fr
References I
Gillick, D. and Favre, B. (2009).
A scalable global model for summarization.
In Proceedings of the Workshop on Integer Linear Programming for Natural Language
Processing, pages 10–18, Boulder, Colorado. Association for Computational Linguistics.
Hasan, K. S. and Ng, V. (2014).
Automatic keyphrase extraction: A survey of the state of the art.
In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1262–1273, Baltimore, Maryland. Association for
Computational Linguistics.
References II
Hong, K. and Nenkova, A. (2014).
Improving the estimation of word importance for news multi-document summarization.
In Proceedings of the 14th Conference of the European Chapter of the Association for
Computational Linguistics, pages 712–721, Gothenburg, Sweden. Association for
Computational Linguistics.
Kim, S. N., Medelyan, O., Kan, M.-Y., and Baldwin, T. (2010).
SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles.
In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26,
Uppsala, Sweden. Association for Computational Linguistics.
Mihalcea, R. and Tarau, P. (2004).
TextRank: Bringing order into texts.
In Lin, D. and Wu, D., editors, Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain.
Association for Computational Linguistics.
References III
Spärck Jones, K. (1972).
A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation, 28:11–21.
Wan, X. and Xiao, J. (2008).
CollabRank: Towards a collaborative approach to single-document keyphrase extraction.
In Proceedings of the 22nd International Conference on Computational Linguistics (Coling
2008), pages 969–976, Manchester, UK. Coling 2008 Organizing Committee.