A Comparison of Different Strategies for Automated Semantic Document Annotation
Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp
K-CAP 2015 | Contact: chni@informatik.uni-kiel.de
Motivation [1/2]
• Document annotation
– Helps users and search engines find documents
– Requires a huge amount of human effort
– e.g., subject indexers at ZBW have labeled 1.6 million scientific documents in economics
• Semantic document annotation
– Documents annotated with semantic entities
– e.g., PubMed and MeSH, ACM DL and ACM CCS
Focus on semantic document annotation
Necessity of automated document annotation
Motivation [2/2]
• Experiments so far have been small in scale
– Comparing only a small number of strategies
– Datasets containing a few hundred documents
• Comparison of 43 strategies for document annotation
within the developed experiment framework
– The largest number of strategies
• Experiments with three datasets from different domains
– Contain full-texts of 100,000 documents annotated by subject indexers
– The largest dataset of scientific publications
We conducted the largest scale experiment
Experiment Framework
Strategies are composed of methods from concept extraction, concept activation, and annotation selection:
1. Concept Extraction: detect concepts (candidate annotations) in each document
2. Concept Activation: compute a score for each concept of a document
3. Annotation Selection: select annotations from the concepts of each document
4. Evaluation: measure the performance of strategies against the ground truth
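Conceptually, a strategy is just a choice of three functions. The following Python sketch (hypothetical function names, not the authors' actual code) shows how the steps compose:

```python
def annotate(doc, corpus, extract, activate, select):
    """Apply one strategy = (extract, activate, select) to a document."""
    concepts = extract(doc)                                   # 1. concept extraction
    scores = {c: activate(c, doc, corpus) for c in concepts}  # 2. concept activation
    return select(scores)                                     # 3. annotation selection
    # 4. evaluation: compare the returned annotations with the ground truth
```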
Research Questions
• Research questions addressed with the experiment framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs best?
Concept Extraction [1/2]
• Entity
– Extract entities from documents using a domain-specific knowledge base
– A domain-specific knowledge base provides:
• Entities (subjects) in a specific domain (e.g., medicine)
• One or more labels for each entity
• Relationships between entities
– Entities are detected by string matching with entity labels
• Tri-gram
– Extract contiguous sequences of one, two, and three words from a document
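A minimal sketch of the two extraction methods, assuming the knowledge base is given as a dict from lowercased labels to entity IDs (the IDs below are made up for illustration):

```python
import re

def ngrams(text, max_n=3):
    """Tri-gram: contiguous sequences of one, two, and three words."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def extract_entities(text, label_to_entity):
    """Entity: detect entities by string matching against entity labels."""
    return {label_to_entity[g] for g in ngrams(text) if g in label_to_entity}

# Toy knowledge base with hypothetical entity IDs:
kb = {"financial crisis": "ex:FinancialCrisis", "central bank": "ex:CentralBank"}
print(extract_entities("The central bank reacted to the financial crisis.", kb))
```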
Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword Extraction) [Rose et al. 10]
– Unsupervised method for extracting keywords
– Incorporates word cooccurrence and frequency
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
– Unsupervised topic modeling method for inferring latent topics in a document corpus
– Topic model
• Topic: a probability distribution over words
• Document: a probability distribution over topics
– Each topic is treated as a concept
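As an illustration, a topic model can be fit with gensim's LdaModel; this is a sketch with a toy corpus, not necessarily the implementation used in the paper:

```python
from gensim import corpora, models

docs = [["bank", "interest", "rate", "bank"],
        ["tax", "bank", "financial", "crisis"],
        ["topic", "model", "latent", "corpus"],
        ["keyword", "extraction", "corpus", "model"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.get_document_topics(bow[0]))  # document: distribution over topics
print(lda.show_topic(0))                # topic: distribution over words
```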
Concept Activation [1/6]
• Three types of concept activation methods
– Statistical Methods
• Baseline
• Use only directly mentioned concepts
– Hierarchy-based Methods
• Reveal concepts that are not mentioned explicitly, using a hierarchical knowledge base
– Graph-based Methods
• Use only directly mentioned concepts
• Represent concept cooccurrences as a graph
• e.g., the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" yields a cooccurrence graph over the nodes Tax, Bank, Interest Rate, Financial Crisis, and Central Bank
Concept Activation [2/6]
• Statistical Methods
– Frequency
• $freq(c, d)$ depends on the Concept Extraction method:
– the number of appearances (Entity and Tri-gram)
– the score output by RAKE (RAKE)
– the probability of a topic for a document $d$ (LDA)

$$score_{freq}(c, d) = freq(c, d)$$

– CF-IDF [Goossen et al. 11]
• An extension of TF-IDF replacing words with concepts
• Lower scores for concepts that appear in many documents

$$score_{cfidf}(c, d) = cf(c, d) \cdot \log \frac{|D|}{|\{d' \in D : c \in d'\}|}$$
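A direct transcription of the two scores as a sketch (in practice the document frequencies would be precomputed over the corpus):

```python
import math
from collections import Counter

def score_cfidf(doc_concepts, corpus):
    """doc_concepts: concepts of one document, with repetitions (cf).
    corpus: list of concept sets, one per document (for the IDF part).
    Assumes every concept occurs in at least one corpus document."""
    cf = Counter(doc_concepts)                 # score_freq(c, d) = cf[c]
    n = len(corpus)
    return {c: cf[c] * math.log(n / sum(1 for d in corpus if c in d))
            for c in cf}

corpus = [{"bank", "tax"}, {"bank"}, {"financial crisis"}]
print(score_cfidf(["bank", "bank", "tax"], corpus))
# bank appears in 2 of 3 docs -> 2 * log(3/2); tax in 1 of 3 -> 1 * log(3)
```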
Concept Activation [3/6]
• Hierarchy-based Methods [1/2]
– Base Activation
• $C_l(c)$: the set of child concepts of a concept $c$
• $\lambda$: decay parameter, set to $\lambda = 0.4$

$$score_{base}(c, d) = freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} score_{base}(c_i, d)$$

• e.g., (figure: a concept hierarchy fragment rooted at "World Wide Web", with concepts such as Web Searching, Web Mining, Social Recommendation, Social Tagging, Site Wrapping, and Web Log Analysis) for a leaf $c_1$ mentioned once, its parent $c_2$, and grandparent $c_3$: $score_{base}(c_1, d) = 1.00$, $score_{base}(c_2, d) = 0.40$, $score_{base}(c_3, d) = 0.16$
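The recursion can be written directly; a sketch assuming the hierarchy is given as an acyclic child map, reproducing the slide's example:

```python
def score_base(c, freq, children, lam=0.4):
    """BaseActivation: a concept receives a lambda-decayed share of the
    scores of its children (children: dict concept -> list of children)."""
    return freq.get(c, 0.0) + lam * sum(
        score_base(ci, freq, children, lam) for ci in children.get(c, []))

# Leaf c1 mentioned once, parent c2, grandparent c3:
children = {"c3": ["c2"], "c2": ["c1"]}
freq = {"c1": 1.0}
print(score_base("c1", freq, children),   # 1.00
      score_base("c2", freq, children),   # 0.40
      score_base("c3", freq, children))   # 0.16 (up to float rounding)
```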
Concept Activation [4/6]
• Hierarchy-based Methods [2/2]
– Branch Activation
• $BN$: reciprocal of the number of concepts that are located one level above a concept $c$

$$score_{branch}(c, d) = freq(c, d) + \lambda \cdot BN \cdot \sum_{c_i \in C_l(c)} score_{branch}(c_i, d)$$

– OneHop Activation
• $C_d$: set of concepts in a document $d$
• Activates concepts at a maximum distance of one hop

$$score_{onehop}(c, d) = \begin{cases} freq(c, d) & \text{if } |C_l(c) \cap C_d| \geq 2 \\ freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} freq(c_i, d) & \text{otherwise} \end{cases}$$
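BranchActivation differs from BaseActivation only by the $BN$ factor; a sketch assuming parent and child maps over an acyclic hierarchy (OneHop follows the same pattern, with the case distinction above and children's raw frequencies instead of recursive scores):

```python
def score_branch(c, freq, children, parents, lam=0.4):
    """BranchActivation: the propagated child mass is scaled by
    BN = 1 / (number of concepts one level above c)."""
    bn = 1.0 / max(len(parents.get(c, [])), 1)
    return freq.get(c, 0.0) + lam * bn * sum(
        score_branch(ci, freq, children, parents, lam)
        for ci in children.get(c, []))

children = {"c3": ["c2"], "c2": ["c1"]}
parents = {"c2": ["c3"], "c1": ["c2"]}
freq = {"c1": 1.0}
print(score_branch("c3", freq, children, parents))  # 0.16: equal to base here, since every concept has a single parent
```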
Concept Activation [5/6]
• Graph-based Methods [1/2]
– Degree [Zouaq et al. 12]
• $degree(c, d)$: the number of edges linked to a concept $c$ in the cooccurrence graph

$$score_{degree}(c, d) = degree(c, d)$$

• e.g., $score_{degree}(\mathrm{Bank}, d) = 3$ in the example graph above
– HITS [Kleinberg 99; Zouaq et al. 12]
• Link analysis algorithm for search engines [Kleinberg 99]
• $C_n(c)$: the neighbors of $c$ in the graph
• $hub(c, d) = \sum_{c_i \in C_n(c)} auth(c_i, d)$ and $auth(c, d) = \sum_{c_i \in C_n(c)} hub(c_i, d)$

$$score_{hits}(c, d) = hub(c, d) + auth(c, d)$$
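Both scores fall out of the cooccurrence graph; a sketch with networkx, assuming (consistent with $degree(\mathrm{Bank}, d) = 3$ in the example) that consecutive concepts in the document are linked:

```python
import networkx as nx

seq = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
       "Central Bank", "Tax", "Interest Rate"]
G = nx.Graph()
G.add_edges_from(zip(seq, seq[1:]))          # one edge per adjacent pair

print(G.degree("Bank"))                      # 3, as on the slide
hubs, auths = nx.hits(G)                     # iterative hub/authority scores
score_hits = {c: hubs[c] + auths[c] for c in G.nodes}
print(max(score_hits, key=score_hits.get))   # highest-scoring concept
```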
Concept Activation [6/6]
• Graph-based Methods [2/2]
– PageRank [Page et al. 99; Mihalcea & Tarau 04]
• Link analysis algorithm for search engines
• Based on the intuition that a node linked from many important nodes is itself important
• $C_{in}(c)$: set of concepts with edges incoming to $c$
• $C_{out}(c)$: set of concepts with edges outgoing from $c$
• $\mu$: damping factor, $\mu = 0.85$

$$score_{page}(c, d) = (1 - \mu) + \mu \cdot \sum_{c_i \in C_{in}(c)} \frac{score_{page}(c_i, d)}{|C_{out}(c_i)|}$$
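The slide's variant uses $(1 - \mu)$ without dividing by the number of nodes (as in TextRank); a small power-iteration sketch over a directed edge map, with the undirected cooccurrence edges listed in both directions:

```python
def pagerank(out_edges, mu=0.85, iters=50):
    """score(c) = (1 - mu) + mu * sum over incoming c_i of
    score(c_i) / |C_out(c_i)| (the slide's unnormalized variant)."""
    nodes = set(out_edges) | {v for vs in out_edges.values() for v in vs}
    incoming = {c: [u for u in out_edges if c in out_edges[u]] for c in nodes}
    score = {c: 1.0 for c in nodes}
    for _ in range(iters):
        score = {c: (1 - mu) + mu * sum(score[u] / len(out_edges[u])
                                        for u in incoming[c])
                 for c in nodes}
    return score

edges = {"Bank": ["Interest Rate", "Financial Crisis", "Central Bank"],
         "Interest Rate": ["Bank", "Financial Crisis", "Tax"],
         "Financial Crisis": ["Bank", "Interest Rate"],
         "Central Bank": ["Bank", "Tax"],
         "Tax": ["Central Bank", "Interest Rate"]}
print(pagerank(edges)["Bank"])
```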
Annotation Selection
• Top-5 and Top-10
– Select the concepts whose scores rank in the top k
• k Nearest Neighbor (kNN) [Huang et al. 11]
– Based on the assumption that documents with similar concepts share similar annotations
1. Compute similarity scores between the target document and all documents that already have annotations
2. Select the union of the annotations of the k nearest documents
– Example with $k = 2$: the nearest annotated document (similarity 0.60) carries Marketing and Competition Law, the second-nearest (similarity 0.49) carries Finance and China; selected annotations: Finance, China, Marketing, Competition Law
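A sketch of kNN selection, assuming documents are given as sparse concept-to-score dicts together with their annotation sets:

```python
import math

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na, nb = (math.sqrt(sum(v * v for v in d.values())) for d in (a, b))
    return dot / (na * nb) if na and nb else 0.0

def knn_annotations(target, annotated_docs, k=2):
    """annotated_docs: list of (concept_vector, annotation_set) pairs.
    Returns the union of annotations of the k most similar documents."""
    nearest = sorted(annotated_docs, key=lambda d: cosine(target, d[0]),
                     reverse=True)[:k]
    return set().union(*(ann for _, ann in nearest))
```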
Configurations [1/5]
• A strategy combines one method from each step of the framework:
– Concept Extraction: Entity, Tri-gram, RAKE, or LDA
– Concept Activation: Statistical Methods (2), Hierarchy-based Methods (3), Graph-based Methods (3)
– Annotation Selection: Top-k (2) or kNN (1)
Configurations [2/5]
• Entity combines with all 8 Concept Activation methods and all 3 Annotation Selection methods: 24 strategies
Configurations [3/5]
• Tri-gram combines with the 2 statistical and the 3 graph-based Concept Activation methods and all 3 Annotation Selection methods: 15 strategies
Configurations [4/5]
• RAKE combines with Frequency as the only Concept Activation method and all 3 Annotation Selection methods: 3 strategies
Configurations [5/5]
• LDA combines with Frequency and kNN only: 1 strategy
• 43 strategies in total (24 + 15 + 3 + 1)
Datasets and Metrics of Experiments

                               Economics       Political Science     Computer Science
publication source             ZBW             FIV                   SemEval 2010
# of publications              62,924          28,324                244
# of annotations (mean ± SD)   5.26 (± 1.84)   12.00 (± 4.02)        5.05 (± 2.41)
knowledge base                 STW             European Thesaurus    ACM CCS
# of entities                  6,335           7,912                 2,299
# of labels                    11,679          8,421                 9,086

• Computer Science: SemEval 2010 dataset [Kim et al. 10]
– Publications were originally annotated with keywords
– We converted the keywords to entities by string matching
• All publications and entity labels are in English
• We use the full-texts of the publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure
(I) Best Performing Strategies
• Economics and Political Science datasets
– The best strategy: Entity × HITS × kNN
– F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
– The best strategy: Entity × Degree × kNN
– F-measure: 0.33 (computer science)
• The graph-based methods do not differ much from each other
In general, the document annotation strategy Entity × Graph-based method × kNN performs best
(II) Influence of Concept Extraction
• The Concept Extraction method Entity uses domain-specific knowledge bases
– Such knowledge bases are freely available and of high quality
– e.g., 32 thesauri are listed in W3C SKOS Datasets
For Concept Extraction, Entity consistently outperforms Tri-gram, RAKE, and LDA
(III) Influence of Concept Activation
• Poor performance of hierarchy-based methods
– We use full-texts in the experiments
• Full-texts already contain many different concepts (on average 203.80 unique entities, SD 24.50), so additional concepts do not need to be activated
– However, OneHop works as well as the graph-based methods
• It activates concepts only within a distance of one hop
For Concept Activation, graph-based methods are better than statistical and hierarchy-based methods
(IV) Influence of Annotation Selection
• kNN
– Requires no learning process
– Confirms the assumption that documents with similar concepts share similar annotations
For Annotation Selection, kNN can enhance the performance
Conclusion
• Large-scale experiment on automated semantic document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
– A novel combination of methods
• Best concept extraction method: Entity
• Best concept activation methods: Graph-based
– OneHop achieves similar performance at lower computational cost
Thank you!
Questions?
Appendix
Research Questions
• Research questions addressed with the experiment framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs best?
LDA (Latent Dirichlet Allocation)
(Figure illustrating the LDA topic model; source: [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.)
Entity Extraction and Conversion
• Entity extraction
– String matching with entity labels
– Longer entity labels are matched first
• e.g., from the text "financial crisis is …", only the entity "financial crisis" is detected (not "crisis")
• Converting to entities
– Tri-gram and RAKE extract words and keywords
– These are converted to entities by string matching with entity labels before annotation selection
– If no matching entity label is found, the word or keyword is discarded
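A sketch of the longest-match rule as a greedy left-to-right scan over a lowercased label set:

```python
def extract_entities_longest(text, labels):
    """Greedy longest-match: at each position, prefer the longest
    entity label that starts there."""
    words, found, i = text.lower().split(), [], 0
    max_len = max(len(l.split()) for l in labels)
    while i < len(words):
        for n in range(max_len, 0, -1):          # longer labels first
            cand = " ".join(words[i:i + n])
            if cand in labels:
                found.append(cand)
                i += n
                break
        else:
            i += 1
    return found

labels = {"financial crisis", "crisis"}
print(extract_entities_longest("the financial crisis is deep", labels))
# ['financial crisis'] -- "crisis" alone is not reported
```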
kNN [1/2]
• Similarity measure
– Each document is represented as a vector where each element is the score of a concept
– Cosine similarity is used as the similarity measure
– e.g., over the concepts (GDP, Immigration, Population, Bank, Interest rate, Canada), $d_1 = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5)$ and $d_2 = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2)$ yield $sim(d_1, d_2) = 0.52$
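The example value can be checked directly:

```python
import math

d1 = [0.3, 0.5, 0.8, 0.1, 0.0, 0.5]   # GDP, Immigration, Population, Bank, Interest rate, Canada
d2 = [0.6, 0.0, 0.4, 0.8, 0.4, 0.2]

dot = sum(a * b for a, b in zip(d1, d2))
sim = dot / (math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2)))
print(round(sim, 2))                   # 0.52
```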
kNN [2/2]
• $k = 1$: only the nearest document (similarity 0.60) contributes; selected annotations: Marketing, Competition law
• $k = 2$: the second-nearest document (similarity 0.49) adds its annotations; selected annotations: Marketing, Competition law, Finance, China
Evaluation Metrics
• Precision

$$precision = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{retrieved annotations}\}|}$$

• Recall

$$recall = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{relevant annotations}\}|}$$

• F-measure

$$F = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$
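As set operations over annotations:

```python
def evaluate(retrieved, relevant):
    """Set-based precision / recall / F-measure over annotations."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(evaluate({"bank", "tax", "gdp"}, {"bank", "gdp", "trade", "law"}))
# (0.666..., 0.5, 0.571...)
```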
Datasets
• Economics dataset
– 11 GB
• Political science dataset
– 3.8 GB
Experiments
• Preprocessing of documents
– Lemmatization
– Stop word removal
• 10-fold cross-validation
– Split each dataset into 10 equally sized subsets
– 8 subsets for training
– 1 subset for testing
– 1 subset for parameter optimization
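One possible way to realize the 8/1/1 split, rotating the test and validation folds (a sketch; the exact fold assignment is not specified on the slide):

```python
import random

def ten_fold_splits(docs, seed=0):
    """Yield (train, test, validation) splits: 8 folds for training,
    1 for testing, 1 for parameter optimization, rotating over 10 folds."""
    docs = docs[:]
    random.Random(seed).shuffle(docs)
    folds = [docs[i::10] for i in range(10)]
    for i in range(10):
        test, val = folds[i], folds[(i + 1) % 10]
        train = [d for j, f in enumerate(folds)
                 if j not in (i, (i + 1) % 10) for d in f]
        yield train, test, val
```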
Result Table: Entity [1/2]
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .14 (.17) .14 (.15) .13 (.15) .22 (.20) .11 (.10) .14 (.12) .08 (.21) .08 (.21) .08 (.21)
CF-IDF .19 (.19) .18 (.17) .18 (.16) .24 (.21) .12 (.10) .15 (.12) .29 (.32) .30 (.32) .29 (.31)
Base Act. .10 (.14) .09 (.13) .09 (.13) .18 (.19) .09 (.09) .12 (.11) .20 (.30) .20 (.30) .20 (.29)
Branch Act. .08 (.14) .08 (.12) .08 (.12) .17 (.19) .08 (.09) .11 (.11) .17 (.28) .17 (.28) .17 (.27)
OneHop .12 (.16) .12 (.14) .12 (.14) .19 (.19) .09 (.09) .12 (.11) .35 (.34) .36 (.34) .35 (.33)
Degree .15 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
HITS .14 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.10) .14 (.12) .40 (.32) .40 (.32) .39 (.31)
PageRank .14 (.17) .14 (.15) .14 (.15) .22 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.11) .18 (.16) .14 (.12) .15 (.13) .12 (.10) .13 (.10) .14 (.17) .05 (.07) .07 (.09)
CF-IDF .05 (.07) .12 (.16) .07 (.10) .07 (.09) .08 (.10) .07 (.09) .24 (.22) .14 (.14) .17 (.16)
Base Act. .05 (.08) .10 (.13) .07 (.09) .10 (.10) .10 (.09) .09 (.09) .14 (.19) .07 (.10) .09 (.12)
Branch Act. .04 (.07) .08 (.12) .05 (.08) .09 (.09) .09 (.09) .08 (.09) .12 (.17) .06 (.10) .08 (.11)
OneHop .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .27 (.21) .26 (.21) .25 (.19)
Degree .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .29 (.21) .28 (.21) .27 (.19)
HITS .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .30 (.22) .29 (.21) .28 (.20)
PageRank .10 (.09) .20 (.17) .13 (.11) .13 (.10) .14 (.11) .13 (.10) .29 (.22) .29 (.21) .27 (.20)
Result Table: Entity [2/2]
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .18 (.21) .14 (.15) .15 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .24 (.16) .30 (.17)
CF-IDF .02 (.08) .02 (.06) .02 (.06) .03 (.11) .01 (.04) .02 (.05) .47 (.29) .23 (.17) .29 (.18)
Base Act. .17 (.20) .13 (.14) .14 (.15) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .22 (.15) .29 (.17)
Branch Act. .17 (.20) .12 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .50 (.28) .22 (.15) .29 (.17)
OneHop .17 (.20) .13 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .42 (.30) .25 (.21) .29 (.20)
Degree .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .27 (.17) .33 (.18)
HITS .18 (.21) .14 (.15) .15 (.16) .21 (.22) .08 (.08) .11 (.11) .48 (.31) .27 (.18) .32 (.20)
PageRank .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .50 (.29) .25 (.15) .31 (.18)
Result Table: Tri-gram
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.15) .12 (.14) .11 (.14) .19 (.19) .10 (.10) .13 (.12) .08 (.22) .08 (.22) .08 (.21)
CF-IDF .10 (.12) .10 (.12) .09 (.11) .17 (.17) .08 (.10) .12 (.12) .07 (.20) .06 (.22) .06 (.20)
Degree .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .07 (.21) .07 (.21) .07 (.20)
HITS .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .08 (.22) .08 (.22) .07 (.21)
PageRank .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .10 (.20) .04 (.08) .05 (.11)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .06 (.08) .14 (.16) .08 (.10) .10 (.10) .11 (.11) .10 (.09) .08 (.14) .05 (.08) .06 (.09)
CF-IDF .05 (.05) .06 (.07) .05 (.06) .09 (.10) .09 (.10) .08 (.09) .09 (.15) .04 (.08) .06 (.10)
Degree .01 (.03) .03 (.07) .01 (.04) .01 (.03) .03 (.07) .01 (.04) .11 (.14) .03 (.05) .05 (.07)
HITS .01 (.03) .02 (.06) .01 (.03) .01 (.03) .00 (.06) .01 (.03) .12 (.14) .04 (.06) .06 (.08)
PageRank .01 (.04) .03 (.08) .02 (.05) .01 (.04) .03 (.08) .02 (.05) .08 (.12) .03 (.05) .04 (.06)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .26 (.24) .20 (.18) .22 (.19) .54 (.30) .20 (.13) .29 (.17) .44 (.28) .25 (.18) .30 (.19)
CF-IDF .23 (.24) .18 (.18) .19 (.19) .54 (.29) .22 (.14) .30 (.17) .48 (.28) .20 (.14) .26 (.15)
Degree .09 (.15) .07 (.11) .07 (.12) .13 (.19) .05 (.07) .07 (.09) .48 (.29) .23 (.16) .29 (.18)
HITS .05 (.14) .04 (.09) .04 (.10) .11 (.18) .04 (.06) .06 (.09) .39 (.29) .26 (.21) .28 (.19)
PageRank .02 (.06) .02 (.05) .02 (.06) .03 (.08) .01 (.03) .02 (.05) .46 (.29) .25 (.18) .30 (.18)
Result Table: RAKE
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .08 (.14) .08 (.12) .08 (.12) .15 (.18) .07 (.08) .10 (.11) .34 (.33) .34 (.33) .33 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .04 (.07) .08 (.13) .05 (.08) .07 (.09) .08 (.09) .07 (.08) .31 (.23) .18 (.15) .22 (.17)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .24 (.24) .17 (.16) .19 (.17) .42 (.28) .15 (.10) .22 (.14) .42 (.27) .20 (.13) .25 (.15)
Result Table: LDA
Economics
kNN
Recall Precision F
Frequency .19 (.30) .19 (.30) .19 (.30)
Political Science
kNN
Recall Precision F
Frequency .15 (.19) .15 (.18) .14 (.17)
Computer Science
kNN
Recall Precision F
Frequency .28 (.27) .24 (.23) .24 (.22)
Materials
• Code
– https://github.com/ggb/ShortStories
• Datasets
– Economics and political science
• Not publicly available yet
• Contact us directly if you are interested
– Computer science
• Publicly available
Presentation
• K-CAP 2015
– International Conference on Knowledge Capture
– Scope
• Knowledge Acquisition / Capture
• Knowledge Extraction from Text
• Semantic Web
• Knowledge Engineering and Modelling
• …
• Time slot
– Presentation: 25 minutes
– Q & A: 5 minutes
References
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation, JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender, WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Grosse-Bolting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases, ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles, JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base, ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles, International Workshop on Semantic Evaluation, 2010.
References (continued)
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999.
• [Mihalcea & Tarau 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts, EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, TR of Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents, Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gasevic, and M. Hatala. Voting theory for concept detection, ESWC, 2012.