 Scientific literature has grown exponentially over the past decades.
 Abstracts
 High-level key phrases
 Theories and models
 Automatic extraction of theories and models can:
 Facilitate building knowledge graphs
 Be incorporated into the faceted search feature of an academic search engine
 Support literature analysis, such as domain development and innovation composition
 We focus on extracting object names and aspects from figure captions.
 Theory entity extraction has not been extensively explored.
 No existing labeled data for extracting theory entities in social and behavioral science
(SBS) domains.
 Crowdsourcing is not appropriate
 Manual annotation is time-consuming
 We propose to use distant supervision to address the data sparsity problem of theory
extraction.
 Use an existing database, such as Wikipedia, to collect instances of entity
mentions.
 Then use these instances to automatically generate our training data.
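The distant-supervision idea above can be sketched in a few lines (a minimal illustration, not the pipeline's actual code; the titles and the sentence below are invented):

```python
# Distant supervision sketch: keep scraped Wikipedia titles that look like
# theory/model names, then use the survivors as seed mentions to label raw text.

HEAD_WORDS = {"theory", "model", "concept"}  # heuristic head-word filter

def is_theory_entity(title):
    """Keep only phrases whose last word is a known head word."""
    return title.lower().split()[-1] in HEAD_WORDS

# Hypothetical scraped titles -- only those ending in a head word become seeds.
titles = ["Social exchange theory", "Prospect theory",
          "List of cognitive biases", "Technology acceptance model"]
seeds = [t for t in titles if is_theory_entity(t)]

def label_sentence(sentence, seeds):
    """Return the seed mentions found in a sentence (case-insensitive)."""
    return [s for s in seeds if s.lower() in sentence.lower()]

sentence = "We ground our hypotheses in social exchange theory."
print(label_sentence(sentence, seeds))  # -> ['Social exchange theory']
```

Every sentence that matches a seed mention can then be emitted as automatically labeled training data, with no human annotation in the loop.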
 The pipeline is composed of
the following five modules:
 Web Scraping
 Obtaining Body Text
 Sentence Segmentation
 Elasticsearch
 Automatic Annotation
 The pipeline is composed of the following five modules:
1. Web Scraping
 Use a web scraper to obtain theory and model entities from Wikipedia webpages.
 A heuristic filter keeps only phrases ending with head words such as
“theory”, “model”, “concept”...
2. Obtaining Body Text
 Use GROBID to convert PDF documents into XML format and keep the body text
3. Sentence Segmentation
 Use Stanza to segment the body text of papers into sentences
 870,000 sentences extracted from the 2,400 sampled papers
4. Elasticsearch
 The sentences are indexed by Elasticsearch.
5. Automatic Annotation
 The seed theory mentions obtained in Web Scraping are used to query the Elasticsearch
index
 Matched sentences are annotated in the BIO (Begin, Inside, Outside) scheme
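The annotation step can be illustrated with a toy BIO tagger (a simplified sketch, not the pipeline's implementation; the sentence and the THEORY label name are invented):

```python
# BIO annotation sketch: given a seed theory mention found in a sentence,
# tag its tokens B-THEORY / I-THEORY and everything else O.

def bio_tag(tokens, mention_tokens):
    """Tag every occurrence of mention_tokens inside tokens (case-insensitive)."""
    tags = ["O"] * len(tokens)
    n = len(mention_tokens)
    target = [m.lower() for m in mention_tokens]
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            tags[i] = "B-THEORY"                 # Begin of the mention
            for j in range(i + 1, i + n):
                tags[j] = "I-THEORY"             # Inside of the mention
    return tags

tokens = "We draw on prospect theory in this study .".split()
# "prospect" gets B-THEORY, "theory" gets I-THEORY, all other tokens get O.
print(list(zip(tokens, bio_tag(tokens, ["prospect", "theory"]))))
```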
Automatic Annotation
 The parent sample comes from the Defense Advanced Research Projects Agency
(DARPA) ‘Systematizing Confidence in Open Research and Evidence’ (SCORE)
programme, and contains approximately 30,000 articles published between 2009 and
2018 in 62 major SBS journals in psychology, economics, politics, management,
education, etc.
 We obtain the text for labeling from a random sample of 2400 SBS papers.
 The ground truth dataset contains 4534 sentences with 550 unique theory mentions
automatically annotated by the pipeline.
 We compare four deep neural network architectures, including BiLSTM, BiLSTM-CRF,
Transformer, and GCN.
 BiLSTM: The BiLSTM architecture analyzes the contextual dependency for each token from
both forwards and backwards, and then assigns each token a label based on probability
scores for each tag.
 BiLSTM-CRF: BiLSTM can work together with a Conditional Random Field (CRF) layer,
which labels a token based on its own features as well as the features and labels of
nearby tokens.
 Transformer: A transformer model predicts the labels of all tokens simultaneously,
attending to the features of neighboring tokens through a multi-head attention mechanism.
 GCN: A Graph Convolutional Network (GCN) generalizes convolution to graph-structured
data.
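The CRF layer's use of neighboring labels can be illustrated with Viterbi decoding over toy scores (all numbers below are invented; a real CRF learns the emission and transition scores during training):

```python
# Viterbi decoding sketch: a CRF layer picks the globally best tag sequence
# by combining per-token emission scores with tag-transition scores.

def viterbi(emissions, transitions, tags):
    """emissions: list of {tag: score} per token; transitions: {(prev, cur): score}."""
    best = [{t: (emissions[0][t], [t]) for t in tags}]
    for em in emissions[1:]:
        layer = {}
        for cur in tags:
            # Best previous tag, accounting for the transition score into `cur`.
            prev_tag, (score, path) = max(
                ((p, best[-1][p]) for p in tags),
                key=lambda kv: kv[1][0] + transitions[(kv[0], cur)])
            layer[cur] = (score + transitions[(prev_tag, cur)] + em[cur],
                          path + [cur])
        best.append(layer)
    return max(best[-1].values())[1]

tags = ["O", "B", "I"]
# Transition scores forbid I after O: an Inside tag needs a Begin before it.
transitions = {(p, c): -10.0 if (p == "O" and c == "I") else 0.0
               for p in tags for c in tags}
emissions = [{"O": 2.0, "B": 0.1, "I": 0.0},
             {"O": 0.0, "B": 1.5, "I": 1.4},
             {"O": 0.2, "B": 0.0, "I": 1.0}]
print(viterbi(emissions, transitions, tags))  # -> ['O', 'B', 'I']
```

A greedy per-token tagger could emit an I directly after an O; the transition scores let the CRF rule out such invalid BIO sequences globally.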
 Time efficiency
 Less than half an hour to go through 870,000 sentences and check whether they contain
any of the 550 theory phrases
 Two hours to annotate the 4534 sentences
 Extraction results on a small dataset:
 All theory names extracted from these sentences were new.
 In particular, about 42% contain head words that were not in the heuristic filter.
Conclusion
 We proposed a trainable framework that extracts theory and model mentions from
scientific papers using distant supervision.
 We created a new benchmark corpus of 4534 annotated sentences from papers
in SBS domains. This dataset can be used to train future models for theory
extraction.
 We compared several NER neural architectures and investigated their
dependency on pre-trained language models. The empirical results indicated that
the RoBERTa-BiLSTM-CRF architecture achieved the best performance with an
F1 score of 77.21% and a precision of 89.72%.
