Effective Named Entity Recognition for Idiosyncratic Web Collections

Roman Prokofyev, Gianluca Demartini, Philippe Cudre-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
WWW 2014
April 10, 2014

Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate Named Entities Selection
• Dataset description
• Features description
• Experimental setup & Evaluation

Problem Definition
• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …
Entity type: scientific concept

Traditional NER
Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)
Properties:
• Require extensive training
• Usually domain-specific: each collection requires training on
its own domain
• Very good at detecting types such as Location, Person, and
Organization

Proposed Approach
Our problem is defined as a classification task.
Two-step classification:
• Extract candidate named entities using a frequency filtering
algorithm.
• Classify candidate named entities using a supervised
classifier.
Candidate selection should allow us to greatly reduce the
number of n-grams to classify, possibly without significant
loss in Recall.

Pipeline
[Figure: pipeline diagram. Box labels recovered from the slide:]
• Text extraction (Apache Tika)
• Lemmatization
• POS tagging
• List of extracted n-grams
• n-gram indexing
• Candidate selection (labelled "foreach")
• n+1 grams merging
• Frequency reweighting
• List of selected n-grams
• Feature extraction (features)
• Supervised classifier
• Ranked list of n-grams

Candidate Selection: Part I
Consider all bigrams with frequency > k (k=2):
candidate named: 5
entity are: 4
entity candidate: 3
entity in: 18
entity recognition: 12
named entity: 101
of named: 10
that named: 3
the named: 4
After the NLTK stop word filter:
candidate named: 5
entity candidate: 3
named entity: 101
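The frequency threshold and stop word filter on this slide can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `candidate_bigrams`, the toy token list, and the small `STOP_WORDS` set are assumptions (the slides use the NLTK stop word list instead).

```python
from collections import Counter

# Illustrative stop word subset (assumption); the slides use the NLTK list.
STOP_WORDS = {"the", "of", "in", "are", "that", "a", "is"}

def candidate_bigrams(tokens, k=2, stop_words=STOP_WORDS):
    """Return {bigram: frequency} for bigrams seen more than k times
    that contain no stop word."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {
        " ".join(bigram): freq
        for bigram, freq in counts.items()
        if freq > k and not any(word in stop_words for word in bigram)
    }

# Toy input (hypothetical), already tokenized and lowercased.
tokens = ("the named entity in the named entity of the named entity "
          "candidate named entity").split()
print(candidate_bigrams(tokens))  # {'named entity': 4}
```

Here "the named" also occurs more than k times but is dropped because it contains a stop word.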

Candidate Selection: Part II
Trigram frequency is looked up from the n-gram index.
Selected bigrams:
candidate named: 5
entity candidate: 3
named entity: 101
After n+1 grams merging:
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12
named entity: 101
candidate named: 5
entity candidate: 3
After frequency reweighting:
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12
named entity: 81
candidate named: 0
entity candidate: 0
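The numbers on this slide are consistent with subtracting each trigram's frequency from the selected bigrams it contains (101 - 5 - 3 - 12 = 81). The sketch below reproduces that arithmetic. The function name `merge_and_reweight` and the rule "keep a trigram if it contains at least one selected bigram" are assumptions made for illustration; the slides do not spell out the exact rule.

```python
def merge_and_reweight(bigrams, trigram_index):
    """Add trigrams that extend a selected bigram, then subtract each
    trigram's frequency from the selected bigrams it contains."""
    merged = dict(bigrams)
    for trigram, freq in trigram_index.items():
        words = trigram.split()
        # The two contiguous bigrams inside the trigram.
        parts = {" ".join(words[:2]), " ".join(words[1:])}
        contained = [part for part in parts if part in bigrams]
        if not contained:
            continue  # the trigram extends no selected bigram
        merged[trigram] = freq
        for part in contained:
            merged[part] -= freq
    return merged

# Frequencies taken from the slide.
bigrams = {"candidate named": 5, "entity candidate": 3, "named entity": 101}
trigram_index = {
    "candidate named entity": 5,
    "named entity candidate": 3,
    "named entity recognition": 12,
}
result = merge_and_reweight(bigrams, trigram_index)
print(result["named entity"])  # 81
```

Bigrams whose reweighted frequency drops to 0, such as "candidate named", only ever occurred inside a longer n-gram.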

Candidate Selection: Discussion
It is possible to extract n-grams (n > 2) with frequency ≤ k

After Candidate Selection
TwiNER: named entity recognition in targeted twitter stream (SIGIR 2012)

Classifier: Overview
Machine learning algorithm:
Decision Trees from the scikit-learn package.
Feature types:
• POS Tags and their derivatives
• External Knowledge Bases (DBLP, DBPedia)
• DBPedia relation graphs
• Syntactic features

Datasets
Two collections:
• CS Collection (SIGIR 2012 Research Track): 100 papers
• Physics collection: 100 papers randomly selected from
arXiv.org High Energy Physics category
                                CS Collection   Physics Collection
Number of candidate n-grams     21 531          18 129
Number of judged n-grams        15 057          11 421
Number of valid entities        8 145           5 747
Number of invalid n-grams       6 912           5 674
Available at: github.com/XI-lab/scientific_NER_dataset

Features: POS Tags, part I
100+ different tag patterns

Features: POS Tags, part II
Two feature schemes:
• Raw POS tag patterns: each tag is a binary feature
• Regex POS tag patterns:
  • First tag match, for example: JJ NNS, JJ NN NN, JJ NN, ... → JJ*
  • Last tag match, for example: NN VB, NN NN VB, JJ NN VB, ... → *VB
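The first-tag and last-tag matches can be sketched as a handful of binary features. The feature names follow those listed on the feature importance slide (NN STARTS, JJ STARTS, NN ENDS, VB ENDS); the regular expressions themselves, including the choice to treat tag variants such as NNS as matching NN, are assumptions for illustration.

```python
import re

# One regex per binary feature; the input is a space-separated POS pattern.
REGEX_FEATURES = {
    "JJ STARTS": re.compile(r"^JJ"),
    "NN STARTS": re.compile(r"^NN"),
    "NN ENDS": re.compile(r"(^| )NN\S*$"),
    "VB ENDS": re.compile(r"(^| )VB\S*$"),
}

def regex_pos_features(tag_pattern):
    """Map a POS tag pattern such as 'JJ NN NN' to binary regex features."""
    return {
        name: bool(regex.search(tag_pattern))
        for name, regex in REGEX_FEATURES.items()
    }

print(regex_pos_features("JJ NN NN"))
```

Collapsing 100+ raw tag patterns into a few such matches is what shrinks the feature set from around 50 features to fewer than 10 in the experiments.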

Features: External Knowledge Bases
Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned
keywords for the papers
• ScienceWISE: high-quality scientific concepts (mostly for
the Physics domain), http://sciencewise.info
We perform exact string matching with these KBs.

Features: DBPedia, part I
DBPedia pages essentially represent valid entities
However, problems arise when:
• The n-gram is not an entity
• The n-gram is not a scientific concept (e.g. “Tom Cruise” in an IR
paper)
                          CS Collection         Physics Collection
                          Precision  Recall     Precision  Recall
Exact string matching     0.9045     0.2394     0.7063     0.0155
Matching with redirects   0.8457     0.4229     0.7768     0.5843
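The difference between the two matching strategies in the table can be sketched as a lookup with and without a redirect table. The page titles, the redirect entry, and both function names below are hypothetical examples, not data from DBPedia dumps.

```python
def match_exact(ngram, titles):
    """Exact string matching against a set of page titles."""
    return ngram in titles

def match_with_redirects(ngram, titles, redirects):
    """Resolve the n-gram through a redirect table before matching."""
    target = redirects.get(ngram, ngram)
    return target in titles

# Hypothetical knowledge base content, for illustration only.
titles = {"Named-entity recognition", "Information retrieval"}
redirects = {"NER": "Named-entity recognition"}

print(match_exact("NER", titles))                      # False
print(match_with_redirects("NER", titles, redirects))  # True
```

Redirects let abbreviations and spelling variants resolve to a page, which fits the large Recall gain reported for the Physics Collection.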

Features: DBPedia, part II
[Figure: two plots of "Number of components" against "Component size", one without redirects and one with redirects.]
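The plots concern connected components in a graph of DBPedia entities. As a rough sketch, and assuming the "Components" feature is the size of the connected component an entity belongs to (the slides do not define it precisely), component sizes can be computed with a breadth-first search. The node names and edges below are hypothetical.

```python
from collections import defaultdict, deque

def component_sizes(nodes, edges):
    """Return {node: size of the connected component containing it}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    sizes, seen = {}, set()
    for start in nodes:
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:  # breadth-first traversal of one component
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        for node in component:
            sizes[node] = len(component)
    return sizes

# Hypothetical entities and links.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]
print(component_sizes(nodes, edges))  # {'A': 3, 'B': 3, 'C': 3, 'D': 1}
```

An isolated entity such as "D" ends up in a component of size 1.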

Features: Syntactic
Set of common syntactic features:
• N-gram length in words
• Whether the n-gram is uppercased
• The number of other n-grams that the given n-gram is part of
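The three syntactic features can be sketched as below. Two points are assumptions: "uppercased" is read here as "written entirely in capitals", and the key `participation_count` borrows its name from the "Participation count" entry on the feature importance slide. The candidate list is a hypothetical example.

```python
def contains(words, sub):
    """True if the word list `sub` occurs contiguously inside `words`."""
    n = len(sub)
    return any(words[i:i + n] == sub for i in range(len(words) - n + 1))

def syntactic_features(ngram, all_ngrams):
    """Compute length, uppercase flag, and participation count."""
    words = ngram.split()
    participation = sum(
        1 for other in all_ngrams
        if other != ngram and contains(other.split(), words)
    )
    return {
        "length": len(words),
        "uppercased": ngram.isupper(),  # assumption: all-capitals n-gram
        "participation_count": participation,
    }

# Hypothetical candidate list.
candidates = ["named entity", "named entity recognition",
              "candidate named entity", "NER"]
print(syntactic_features("named entity", candidates))
```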

Experiments: Overview
1. Regex POS Patterns vs Normal POS tags
2. Redirects vs Non-redirects
3. Feature importance scores
4. MaxEntropy comparison
All results are averages obtained with 10-fold cross-validation.
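As a reminder of what averaging over 10-fold cross-validation involves, the sketch below splits item indices into k folds and averages one score per fold. The function names are assumptions, and the split is contiguous without shuffling for simplicity; the slides do not say how the folds were drawn.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

def mean_score(scores_per_fold):
    """Average a metric (e.g. accuracy) over the folds."""
    return sum(scores_per_fold) / len(scores_per_fold)

folds = list(k_fold_indices(25, k=10))
print(len(folds), [len(test) for _, test in folds])
```

Each item appears in exactly one test fold, so every judged n-gram is used for evaluation once.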

Experiments: Comparison I
CS Collection                       Precision  Recall   F1 score  Accuracy  N# features
Normal POS + Components             0.8794     0.8058*  0.8409*   0.8429*   54
Regex POS + Components              0.8475*    0.8524*  0.8499*   0.8448*   9
Normal POS + Components-Redirects   0.8678*    0.8305*  0.8487*   0.8473    50
Regex POS +                         0.8406*    0.8769   0.8584    0.8509    7
The symbol * indicates a statistically significant difference as compared to the
approach in bold.
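For reference, the F1 score is the harmonic mean of Precision and Recall. The short sketch below checks it against the "Regex POS + Components" row; the function name is an assumption, and because the reported values are fold averages, an F1 recomputed from averaged Precision and Recall may differ slightly in other rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and Recall from the "Regex POS + Components" row.
print(round(f1_score(0.8475, 0.8524), 4))  # 0.8499
```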

Experiments: Comparison II
Physics Collection         Precision  Recall   F1 score  Accuracy  N# features
Normal POS + Components    0.8253*    0.6567*  0.7311*   0.7567    53
Regex POS + Components     0.7941*    0.6781   0.7315*   0.7492*   4
Normal POS +               0.8339     0.6674*  0.7412    0.7653    50
Regex POS +                0.8375     0.6479*  0.7305*   0.7592*   6
The symbol * indicates a statistically significant difference as compared to the
approach in bold.

Experiments: Feature Importance
CS Collection, 7 features
Feature                   Importance
NN STARTS                 0.3091
DBLP                      0.1442
Components + DBLP         0.1125
Components                0.0789
VB ENDS                   0.0386
NN ENDS                   0.0380
JJ STARTS                 0.0364

Physics Collection, 6 features
Feature                   Importance
ScienceWISE               0.2870
Component + ScienceWISE   0.1948
Wikipedia redirect        0.1104
Components                0.1093
Wikilinks                 0.0439
Participation count       0.0370

Experiments: MaxEntropy
                  Precision  Recall   F1 score
Maximum Entropy   0.6566     0.7196   0.6867
Decision Trees    0.8121     0.8742   0.8420
The MaxEnt classifier receives full text as input
(we used a classifier from the NLTK package).
Comparison experiment: 80% of the CS Collection as training
data, 20% as the test dataset.

Lessons Learned
Classic NER approaches are not good enough for
Idiosyncratic Web Collections
Leveraging the graph of scientific concepts is a key feature
Domain-specific KBs and POS patterns work well
Experimental results show up to 85% accuracy over
different scientific collections
http://iner.exascale.info/
eXascale Infolab, http://exascale.info
