A Framework to Automatically Extract Funding
Information from Text
Deep Kayal, Zubair Afzal, George Tsatsaronis et al.
Content and Innovation Group, Elsevier B.V., Amsterdam, NL.
d.kayal@elsevier.com
16 September, 2018
Deep Kayal, Zubair Afzal, George Tsatsaronis et al. (Elsevier B.V.)FundingFinder 16 September, 2018 1 / 25
Overview
1 Motivation and Problem Definition
2 Background
3 Methodology
4 Experiments and Results
5 Conclusions
Section 1
Motivation and Problem Definition
Motivation
Institutions and researchers are usually required to acknowledge their
funding sources and grants.
This information, if captured effectively, enables funding
organizations to justify the impact of their allocated research funds.
It also helps researchers discover appropriate
funding opportunities for their interests.
In this work, we address the problem of automating the
extraction of funding information from text, using natural
language processing and machine learning techniques.
Problem Statement
Can we automatically detect the funder information from scientific
papers?
Support for the Nurses’ Health Study and the Health Professionals
Follow-up Study was provided by grants (P01 CA87969 and UM1
CA167552, respectively) from the NCI. Support for the Women’s Health
Initiative program is provided by contracts (N01WH22110, N01WH24152,
N01WH3210032102 and N01WH32105) from the National Heart, Lung,
and Blood Institute.
Can we mark them with entities of the form Funding Body and Grant
Number?
Section 2
Background
Problem Definition
Given a scientific article as raw text input, we design a system to
perform two tasks:
1 identify all text segments which contain funding information.
2 process all the funding text segments in order to detect the set of the
funding bodies (FB) and the set of grants (GR) that appear in the text.
The former is a binary text classification task, while the latter
can be seen as a named entity recognition (NER) problem.
NER and Sequential Learning
NER extracts information, known as named entities, from
unstructured text; for example, the names of persons, locations and
organizations.
In the literature, NER systems have been found to employ rule-based,
gazetteer-based and machine learning approaches [Nadeau, 2007].
Sequential learning approaches are machine learning models that
leverage the relationships between nearby data points and their class
labels.
Hidden Markov Models [Zhou, 2002], Linear CRFs [McCallum, 2003]
and Maximum Entropy Models [Chieu, 2002] are popular ways of
modeling data for NER.
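To make the sequence-labelling view concrete, here is a minimal sketch that turns entity spans into one label per token, which is the kind of input a sequential model learns from. The BIO scheme, the whitespace tokenizer and the helper name `to_bio` are assumptions for illustration; the slides do not state which tagging scheme the toolkits were configured with.

```python
def to_bio(tokens, spans):
    """Label each token as B-/I-<type> inside an entity span, or O outside.

    spans: list of (start_token, end_token_exclusive, entity_type).
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = "B-" + etype
        for i in range(start + 1, end):
            labels[i] = "I-" + etype
    return labels


tokens = "provided by grants P01 CA87969 from the NCI".split()
# Hypothetical annotation: one grant (GR) and one funding body (FB).
spans = [(3, 5, "GR"), (7, 8, "FB")]
for token, label in zip(tokens, to_bio(tokens, spans)):
    print(token, label)
```

A label such as I-GR only makes sense directly after B-GR or another I-GR, which is the kind of dependency between neighbouring labels that HMMs and CRFs are designed to model.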
Implementations and Toolkits
The Stanford CoreNLP toolkit [1] is a Java-based toolkit that has a
CRF implementation, enhanced with long-distance features. An
important aspect of the toolkit is the ability to use distributional
similarity measures.
LingPipe [2] is another NLP toolkit, whose efficient HMM
implementation includes n-gram features.
In this work, we also use the Apache OpenNLP toolkit [3], which has a
MaxEnt implementation for NER.
Finally, this work also makes use of Elsevier's Fingerprint Engine
(FPE) [4], which is an industrial solution for annotating text with
ontological concepts, given a vocabulary.
[1] http://stanfordnlp.github.io/CoreNLP/
[2] http://alias-i.com/lingpipe/demos/tutorial/read-me.html
[3] https://opennlp.apache.org/
[4] https://www.elsevier.com/solutions/elsevier-fingerprint-engine
Section 3
Methodology
Design Choice
As mentioned earlier, we use a two-stage approach to extract funding
information from text.
This design has the following benefits:
1 it minimizes the execution time of the approach, as the costliest
component, namely NER, is applied only to the segments flagged as
containing funding information.
2 it reduces the number of false positives, as there are many text
segments in a scientific full-text article that contain strings which an
NER component could potentially annotate falsely.
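The two-stage design can be sketched as follows. This is only an illustration of the control flow: the cue list, the function names and the digit-based extractor are hypothetical stand-ins, whereas the actual system uses a trained SVM for stage 1 and trained NER models for stage 2.

```python
import re

# Hypothetical stand-in for stage 1; the slides use a trained SVM instead.
FUNDING_CUES = ("grant", "funded by", "supported by", "support for")


def has_funding_info(segment):
    """Stage 1: cheap binary filter over text segments."""
    lowered = segment.lower()
    return any(cue in lowered for cue in FUNDING_CUES)


def extract_entities(segment):
    """Stage 2: placeholder for the costly NER step (here: digit-bearing tokens)."""
    return [tok.strip("(),.") for tok in segment.split() if re.search(r"\d", tok)]


def run_pipeline(segments):
    """Run the stage-2 extractor only on segments that pass the stage-1 filter."""
    return {seg: extract_entities(seg) for seg in segments if has_funding_info(seg)}


segments = [
    "We measured 42 samples at 300 K.",
    "This work was supported by grant P01 CA87969 from the NCI.",
]
result = run_pipeline(segments)
print(result)
```

The first segment never reaches stage 2, so its numbers cannot be mistaken for grant IDs, and the expensive step runs on one segment instead of two.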
Data Collection
Silver Set:
- articles from the last 10 years were randomly sampled from the
ScienceDirect [5] database, and only their acknowledgment sections
were selected.
- FBs in these acknowledgment sections were annotated using the
Fingerprint Engine (FPE) and Crossref's open funder registry [6].
- this step retained 44,660 sections with at least one annotated FB.
Gold Set:
- journal articles were picked randomly from a large number of
publications, annotated by three different experts and harmonized.
- 1,682 articles, out of around 2,000, contained at least one
funding-related annotation, resulting in 4,537 FB and 3,156 GR
annotations in the set.
- Cohen's kappa, averaged pairwise over the annotators, was used to
measure inter-annotator agreement and assess dataset quality; it was
found to be 0.89, suggesting high quality.
[5] http://www.sciencedirect.com/
[6] http://www.crossref.org/fundingdata/registry.html
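The pairwise-averaged agreement measure mentioned for the Gold Set could be computed along these lines. This is a sketch under assumptions: the slides do not say at which granularity the annotations were compared, so the toy data below uses one label per token, and the three label sequences are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def pairwise_average_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy token-level labels from three hypothetical annotators (not the real data).
ann1 = ["FB", "O", "GR", "O", "O", "FB"]
ann2 = ["FB", "O", "GR", "O", "FB", "FB"]
ann3 = ["FB", "O", "GR", "O", "O", "FB"]
print(round(pairwise_average_kappa([ann1, ann2, ann3]), 3))
```

The sketch does not guard against the degenerate case in which chance agreement equals 1 (all annotators using a single label), where kappa is undefined.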
Data Usage
The Silver Set was used to learn word clusters for the distributional
similarity measure that can be employed within the Stanford CoreNLP
toolkit:
- word embeddings were generated from this dataset using
the Word2Vec algorithm [Mikolov, 2013],
- followed by K-means clustering using cosine similarity.
The Silver Set was also used to train models to detect FB
annotations.
The Gold Set was used to train the binary text classifier that detects
the paragraphs of text which contain funding information.
It was also used to train models to detect FB and GR annotations.
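The clustering step can be illustrated with a small K-means variant that uses cosine similarity (often called spherical K-means). Everything here is an assumption made for the sake of a runnable sketch: the two-dimensional vectors stand in for real Word2Vec embeddings, the initial centroids are fixed by hand, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors is their dot product."""
    return sum(a * b for a, b in zip(u, v))


def cosine_kmeans(vectors, centroids, iterations=10):
    """K-means with cosine similarity over a {word: vector} mapping."""
    vectors = {w: normalize(v) for w, v in vectors.items()}
    centroids = [normalize(c) for c in centroids]
    assignment = {}
    for _ in range(iterations):
        # Assign each word to its most similar centroid.
        assignment = {
            w: max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for w, v in vectors.items()
        }
        # Recompute each centroid as the normalized mean of its members.
        for k in range(len(centroids)):
            members = [vectors[w] for w, c in assignment.items() if c == k]
            if members:
                mean = [sum(dim) / len(members) for dim in zip(*members)]
                centroids[k] = normalize(mean)
    return assignment


# Toy 2-d "embeddings"; real Word2Vec vectors have far more dimensions.
embeddings = {
    "foundation": [0.9, 0.1],
    "institute": [0.8, 0.2],
    "council": [0.85, 0.15],
    "grant": [0.1, 0.9],
    "contract": [0.2, 0.8],
}
clusters = cosine_kmeans(embeddings, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(clusters)
```

The resulting word-to-cluster mapping is the kind of lookup that can be handed to a CRF as an extra feature, so that rare words share statistics with similar, more frequent ones.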
Detecting Text Blocks with Funding Information
As the first step, the text segments which contain funding information
are separated from the rest (binary classification).
To address this problem, we use a cost-sensitive L2-regularized
linear Support Vector Machine (SVM), as SVMs are known to
perform well on text classification problems.
The SVM operates on TF-IDF vectors extracted from the segments
of each input text, based on a bigram bag-of-words representation.
The SVM was trained on positive (1,682) and
negative (47,565) segments, i.e., paragraphs with and without
funding information, taken from the Gold Set.
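The feature extraction and the class imbalance can be sketched in a few lines. This is not the authors' implementation: the slides do not say whether unigrams were kept alongside bigrams, which TF-IDF variant was used, or how the misclassification costs were set, so the choices below (unigrams plus bigrams, tf * log(N / df), inverse-frequency class weights) are assumptions, and SVM training itself is left out.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lower-cased word n-grams up to length n_max (unigrams and bigrams)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(segments):
    """One sparse {feature: weight} dict per segment, using tf * log(N / df)."""
    counts = [Counter(ngrams(s)) for s in segments]
    df = Counter(feature for c in counts for feature in c)
    n = len(segments)
    return [{f: tf * math.log(n / df[f]) for f, tf in c.items()} for c in counts]


def balanced_class_weights(n_pos, n_neg):
    """Weight each class inversely to its frequency (one common cost-sensitive choice)."""
    total = n_pos + n_neg
    return {1: total / (2 * n_pos), 0: total / (2 * n_neg)}


segments = ["supported by grant P01", "we thank the reviewers", "funded by grant UM1"]
vectors = tfidf_vectors(segments)
# Segment counts taken from the slide: 1,682 positive and 47,565 negative.
weights = balanced_class_weights(n_pos=1682, n_neg=47565)
print(vectors[0])
print(weights)
```

With these counts a positive segment weighs roughly 28 times as much as a negative one, which illustrates why an unweighted classifier on such data can reach high precision while missing most funding paragraphs.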
Extracting Funding Information using NER
In order to annotate a piece of text with the FB label, a variety of
models were used:
1 pre-trained models packaged as part of the Stanford CoreNLP and
LingPipe suites; in this work they were used to identify the
Organization labels in the text, which were then stored as FB.
2 Stanford CRF, LingPipe HMM and OpenNLP MaxEnt models trained
on the Silver and Gold sets.
3 Stanford CRF classifiers using distributional similarity features based
on the word clusters created from the Silver Set data.
As for GR labels:
1 we use a rule-based approach, treating every word inside the
funding section that contains at least one digit as a grant ID.
2 we train all of the aforementioned models based on the labeled data in
the Gold Set.
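The rule-based baseline for GR labels is simple enough to sketch directly. The punctuation stripping and the function name are assumptions; the example sentence is adapted from the problem-statement slide.

```python
import re


def rule_based_grant_ids(funding_text):
    """Return every whitespace-separated word that contains at least one digit."""
    words = (w.strip("()[],.;:") for w in funding_text.split())
    return [w for w in words if re.search(r"\d", w)]


text = ("Support was provided by grants (P01 CA87969 and UM1 CA167552, "
        "respectively) from the NCI.")
print(rule_based_grant_ids(text))
```

Note that this rule returns "P01" and "CA87969" as separate items and would equally pick up years or page numbers that occur in the segment, so it trades precision for simplicity.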
Ensembling
Figure: An example of the ensemble approach for extracting funding information
from text.
Overall Pipeline
Figure: Schematic showing the overall pipeline.
Section 4
Experiments and Results
Detection of Text Blocks with Funding Information
Method                         P     R     F1
SVM                            99     5     9
Cost-sensitive L2-SVM (C=2)    95    85    90
Table: Results (precision P, recall R and F1, in %) for the identification of
text with funding information using SVMs.
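As a reminder of how the three columns relate, F1 is the harmonic mean of precision and recall. The helper below is a generic sketch, not code from the presented system.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Cost-sensitive row of the table: P = 95, R = 85.
print(round(f1_score(95, 85)))  # prints 90
```

Because the table reports rounded values, recomputing F1 from the rounded P and R can differ from the reported figure by a point, as for the plain SVM row (about 9.5 against the reported 9).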
Extraction of Funding Organization names
Method           P          R          F1
HMM-Pre          18(±0)     31(±0)     23(±0)
CRF-Pre          35(±0)     54(±0)     42(±0)
FPE              48(±0)     46(±0)     47(±0)
CRF-S            49(±0)     43(±0)     46(±0)
HMM-S            36(±0)     48(±0)     41(±0)
MaxEnt-S         50(±0)     39(±0)     44(±0)
CRF-G            64(±.2)    58(±.2)    61(±.2)
CRF-dsim-G       66(±.2)    61(±.3)    63(±.2)
HMM-G            49(±.3)    54(±.2)    52(±.2)
MaxEnt-G         64(±.4)    54(±.2)    59(±.3)
FundingFinder    72(±.3)    63(±.2)    68(±.3)
Table: NER results for the Funding Body (FB) annotation label. The best
performing model is FundingFinder and the second best is CRF-dsim-G.
Suffixes: Pre = pre-trained, S = trained on the Silver Set, G = trained on
the Gold Set; dsim = distributional similarity features.
Extraction of Grant IDs
Method           P          R           F1
Rule-based       78(±0)     89(±0)      83(±0)
CRF-G            91(±.1)    91(±.08)    91(±.1)
HMM-G            76(±.2)    77(±.2)     76(±.2)
MaxEnt-G         87(±.2)    89(±.1)     88(±.2)
FundingFinder    92(±.1)    91(±.1)     92(±.1)
Table: NER results for the Grant (GR) annotation label. The best performing
model is FundingFinder and the second best is CRF-G.
Section 5
Conclusions
Conclusions and Contributions
1 We have discussed the practically important problem of extracting
funding information from text, and have provided an experimental
overview of the state-of-the-art methods that can be applied to it.
This may give researchers working on the same problem a significant
head start.
2 Empirically, we have shown that a small, high-quality dataset is
more suitable for this NER task than a larger, but noisier, dataset.
3 We have suggested an efficient two-stage pipeline for the task of
funding information extraction.
4 We have suggested a learning mechanism, based on an ensemble of
state-of-the-art base annotators, which should be easily extensible to
any NER task.
References
Nadeau, D., Sekine, S. (2007)
A survey of named entity recognition and classification
Lingvisticae Investigationes 30(1), 3–26.
Chieu, H.L. (2002)
Named entity recognition: a maximum entropy approach using global information
Proceedings of the 2002 International Conference on Computational Linguistics, 190–196.
McCallum, A., Li, W. (2003)
Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 4, 188–191.
Zhou, G., Su, J. (2002)
Named entity recognition using an HMM-based chunk tagger
Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 473–480.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. (2013)
Distributed representations of words and phrases and their compositionality
Proceedings of the 26th International Conference on Neural Information Processing Systems, 3111–3119.
Thank You!
Please email me at d.kayal@elsevier.com for
critiques, comments, advice, dataset inquiries, etc.