Effective Named Entity Recognition for
Idiosyncratic Web Collections
Roman Prokofyev, Gianluca Demartini, Philippe Cudre-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
WWW 2014
April 10, 2014
1
Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate Named Entities Selection
• Dataset description
• Features description
• Experimental setup & Evaluation
2
Problem Definition
• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …
Entity type: scientific concept
3
Traditional NER
Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)
Properties:
• Require extensive training
• Usually domain-specific: each collection requires training on its own domain
• Very good at detecting types such as Location, Person, Organization
4
Proposed Approach
Our problem is defined as a classification task.
Two-step classification:
• Extract candidate named entities using a frequency filtering algorithm.
• Classify candidate named entities using a supervised classifier.
Candidate selection should allow us to greatly reduce the number of n-grams to classify, possibly without significant loss in Recall.
5
Pipeline
[Figure: processing pipeline. Boxes shown: text extraction (Apache Tika), list of extracted n-grams, n-gram indexing, candidate selection (for each n-gram), list of selected n-grams, feature extraction (features), supervised classifier, ranked list of n-grams. Additional steps shown: lemmatization, POS tagging, n+1 grams merging, frequency reweighting.]
6
Candidate Selection: Part I
Consider all bigrams with frequency > k (k=2):
candidate named: 5
entity are: 4
entity candidate: 3
entity in: 18
entity recognition: 12
named entity: 101
of named: 10
that named: 3
the named: 4
After the NLTK stop word filter:
candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101
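As a rough illustration of this step, the Python sketch below keeps bigrams with frequency above k and drops any bigram that contains a stop word. The `STOP_WORDS` set is a small hand-picked stand-in (an assumption); the slide uses the NLTK stop word list. Dropping a bigram as soon as one of its words is a stop word is inferred from the example, not stated on the slide.

```python
# Bigram counts copied from the slide example.
BIGRAM_COUNTS = {
    "candidate named": 5,
    "entity are": 4,
    "entity candidate": 3,
    "entity in": 18,
    "entity recognition": 12,
    "named entity": 101,
    "of named": 10,
    "that named": 3,
    "the named": 4,
}

# Hypothetical stand-in for the NLTK English stop word list.
STOP_WORDS = {"are", "in", "of", "that", "the"}


def select_candidates(counts, k=2, stop_words=STOP_WORDS):
    """Keep n-grams with frequency > k that contain no stop word."""
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq > k and not any(word in stop_words for word in ngram.split())
    }


candidates = select_candidates(BIGRAM_COUNTS)
```

With the counts above, `candidates` holds the four bigrams that survive the filter on the slide.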
7
Candidate Selection: Part II
Trigram frequency is looked up from the n-gram index.
Selected bigrams:
named entity: 101
candidate named: 5
entity candidate: 3
entity recognition: 12
Trigrams found in the index:
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12
After merging and frequency reweighting:
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12
named entity: 81
candidate named: 0
entity candidate: 0
entity recognition: 0
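The reweighted numbers in this example are consistent with subtracting from each bigram the frequency of every selected trigram that contains it (101 - 5 - 3 - 12 = 81). The sketch below reproduces that arithmetic in Python; the function name `reweight` and the exact rule are assumptions inferred from the example, not a statement of the authors' implementation.

```python
BIGRAMS = {
    "named entity": 101,
    "candidate named": 5,
    "entity candidate": 3,
    "entity recognition": 12,
}

TRIGRAMS = {
    "candidate named entity": 5,
    "named entity candidate": 3,
    "named entity recognition": 12,
}


def reweight(bigrams, trigrams):
    """Subtract each trigram's frequency from the bigrams it contains."""
    reweighted = dict(bigrams)
    for trigram, freq in trigrams.items():
        words = trigram.split()
        # A trigram contains two bigrams: its first two and its last two words.
        for sub in (" ".join(words[:2]), " ".join(words[1:])):
            if sub in reweighted:
                reweighted[sub] -= freq
    return reweighted


reweighted = reweight(BIGRAMS, TRIGRAMS)
```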
8
Candidate Selection: Discussion
Possible to extract n-grams (n>2) with frequency ≤k
9
After Candidate Selection
TwiNER: named entity recognition in targeted twitter stream
SIGIR 2012
10
Classifier: Overview
Machine Learning algorithm:
Decision Trees from scikit-learn package.
Feature types:
• POS Tags and their derivatives
• External Knowledge Bases (DBLP, DBPedia)
• DBPedia relation graphs
• Syntactic features
11
Datasets
Two collections:
• CS Collection (SIGIR 2012 Research Track): 100 papers
• Physics collection: 100 papers randomly selected from arXiv.org High Energy Physics category

                        CS Collection   Physics Collection
N# Candidate N-grams    21 531          18 129
N# Judged N-grams       15 057          11 421
N# Valid Entities       8 145           5 747
N# Invalid N-grams      6 912           5 674

Available at: github.com/XI-lab/scientific_NER_dataset
12
Features: POS Tags, part I
100+ different tag patterns
13
Features: POS Tags, part II
Two feature schemes:
• Raw POS tag patterns, each tag is a binary feature
• Regex POS tag patterns:
  • First tag match, for example JJ*: JJ NNS, JJ NN NN, JJ NN, ...
  • Last tag match, for example *VB: NN VB, NN NN VB, JJ NN VB, ...
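A minimal Python sketch of the regex scheme, assuming Penn Treebank style tags: each feature only checks how the tag sequence starts or ends. The feature names mirror those in the feature importance slide; matching by tag prefix (so that NNS counts as NN) is an assumption.

```python
def regex_pos_features(tags):
    """Binary features from the first and last POS tag of an n-gram."""
    first, last = tags[0], tags[-1]
    return {
        "JJ STARTS": first.startswith("JJ"),
        "NN STARTS": first.startswith("NN"),
        "NN ENDS": last.startswith("NN"),
        "VB ENDS": last.startswith("VB"),
    }


# "JJ NNS" matches the JJ* pattern; "NN NN VB" matches the *VB pattern.
features_jj = regex_pos_features(["JJ", "NNS"])
features_vb = regex_pos_features(["NN", "NN", "VB"])
```

Collapsing many raw patterns into a handful of start and end checks is what keeps the regex feature sets small.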
14
Features: External Knowledge Bases
Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned keywords for the papers
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info
We perform exact string matching with these KBs.
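Exact string matching against a knowledge base can be sketched as a set lookup that yields one binary feature per KB. The two tiny sets below are hypothetical placeholders; real entries would be loaded from DBLP and ScienceWISE.

```python
# Hypothetical miniature knowledge bases, for illustration only.
DBLP_KEYWORDS = {"named entity recognition", "information retrieval"}
SCIENCEWISE_CONCEPTS = {"dark matter", "higgs boson"}


def kb_features(ngram):
    """One binary feature per KB: does the n-gram match an entry exactly?"""
    return {
        "DBLP": ngram in DBLP_KEYWORDS,
        "ScienceWISE": ngram in SCIENCEWISE_CONCEPTS,
    }


match = kb_features("named entity recognition")
no_match = kb_features("Named Entity Recognition")  # case differs, so no match
```

Because the comparison is exact, any difference in case or spelling means a miss; whether n-grams are normalised before lookup is not stated on the slide.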
15
Features: DBPedia, part I
DBPedia pages essentially represent valid entities
But there are a few problems when:
• N-gram is not an entity
• N-gram is not a scientific concept (“Tom Cruise” in IR paper)

                          CS Collection        Physics Collection
                          Precision  Recall    Precision  Recall
Exact string matching     0.9045     0.2394    0.7063     0.0155
Matching with redirects   0.8457     0.4229    0.7768     0.5843
16
Features: DBPedia, part II
[Figure: two plots with axes "Component size" and "Number of components", one without redirects and one with redirects.]
17
Features: Syntactic
Set of common syntactic features:
• N-gram length in words
• Whether n-gram is uppercased
• The number of other n-grams that the given n-gram is part of
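The three features above could be computed along these lines. This is a sketch under assumptions: "uppercased" is read as "all letters are upper case", and "part of" is read as "occurs as a contiguous word sequence inside another candidate n-gram".

```python
def contains(longer, shorter):
    """True if the word list `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def syntactic_features(ngram, candidates):
    words = ngram.split()
    participation = sum(
        1
        for other in candidates
        if other != ngram and contains(other.split(), words)
    )
    return {
        "length": len(words),            # n-gram length in words
        "uppercased": ngram.isupper(),   # assumption: fully upper-case
        "participation": participation,  # other n-grams containing this one
    }


CANDIDATES = [
    "named entity",
    "named entity recognition",
    "candidate named entity",
    "entity recognition",
]
features = syntactic_features("named entity", CANDIDATES)
```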
18
Experiments: Overview
1. Regex POS Patterns vs Normal POS tags
2. Redirects vs Non-redirects
3. Feature importance scores
4. MaxEntropy comparison
All results are averages over 10-fold cross-validation.
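For readers unfamiliar with the protocol, the sketch below shows one way to build k folds and average a score over them using only the Python standard library. The helper names and the `evaluate` callback are hypothetical, and the shuffling seed is arbitrary.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_mean(n_items, evaluate, k=10):
    """Average evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(25, k=10))
```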
19
Experiments: Comparison I

CS Collection                       Precision  Recall   F1 score  Accuracy  N# features
Normal POS + Components             0.8794     0.8058*  0.8409*   0.8429*   54
Regex POS + Components              0.8475*    0.8524*  0.8499*   0.8448*   9
Normal POS + Components-Redirects   0.8678*    0.8305*  0.8487*   0.8473    50
Regex POS + Components-Redirects    0.8406*    0.8769   0.8584    0.8509    7

The symbol * indicates a statistically significant difference as compared to the approach in bold.
20
Experiments: Comparison II

Physics Collection                  Precision  Recall   F1 score  Accuracy  N# features
Normal POS + Components             0.8253*    0.6567*  0.7311*   0.7567    53
Regex POS + Components              0.7941*    0.6781   0.7315*   0.7492*   4
Normal POS + Components-Redirects   0.8339     0.6674*  0.7412    0.7653    50
Regex POS + Components-Redirects    0.8375     0.6479*  0.7305*   0.7592*   6

The symbol * indicates a statistically significant difference as compared to the approach in bold.
21
Experiments: Feature Importance

CS Collection, 7 features
Feature                   Importance
NN STARTS                 0.3091
DBLP                      0.1442
Components + DBLP         0.1125
Components                0.0789
VB ENDS                   0.0386
NN ENDS                   0.0380
JJ STARTS                 0.0364

Physics Collection, 6 features
Feature                   Importance
ScienceWISE               0.2870
Component + ScienceWISE   0.1948
Wikipedia redirect        0.1104
Components                0.1093
Wikilinks                 0.0439
Participation count       0.0370
22
Experiments: MaxEntropy

                  Precision  Recall  F1 score
Maximum Entropy   0.6566     0.7196  0.6867
Decision Trees    0.8121     0.8742  0.8420

The MaxEnt classifier receives the full text as input (we used a classifier from the NLTK package).
Comparison experiment: 80% of the CS Collection as training data, 20% as test data.
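The F1 scores in this table follow from the reported precision and recall through the usual harmonic mean, F1 = 2PR / (P + R). A short Python check, using only the numbers on the slide:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


maxent_f1 = f1_score(0.6566, 0.7196)          # reported on the slide as 0.6867
decision_tree_f1 = f1_score(0.8121, 0.8742)   # reported on the slide as 0.8420
```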
23
Lessons Learned
Classic NER approaches are not good enough for Idiosyncratic Web Collections
Leveraging the graph of scientific concepts is a key feature
Domain-specific KBs and POS patterns work well
Experimental results show up to 85% accuracy over different scientific collections
24
http://iner.exascale.info/
eXascale Infolab, http://exascale.info
