 Scientific literature has grown exponentially over the past decades.
 Abstracts
 High-level key phrases
 Theories and models
 Automatic extraction of theories and models can:
 Facilitate building knowledge graphs
 Be incorporated into the faceted search feature of an academic search engine
 Support literature analysis, such as domain development and innovation composition
 We focus on extracting object names and aspects from figure captions.
 Theory entity extraction has not been extensively explored.
 No existing labeled data for extracting theory entities in social and behavioral science
(SBS) domains.
 Crowdsourcing is not appropriate
 Manual annotation is time-consuming
 We propose to use distant supervision to address the data sparsity problem of theory
extraction.
 Use an existing database, such as Wikipedia, to collect instances of entity
mentions.
 Then use these instances to automatically generate our training data.
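The distant-supervision idea above can be sketched in a few lines (a minimal illustration, not the pipeline's actual code; the titles and the sentence below are invented):

```python
# Distant supervision sketch: keep scraped Wikipedia titles that look like
# theory/model names, then use the survivors as seed mentions to label raw text.

HEAD_WORDS = {"theory", "model", "concept"}  # heuristic head-word filter

def is_theory_entity(title):
    """Keep only phrases whose last word is a known head word."""
    return title.lower().split()[-1] in HEAD_WORDS

# Hypothetical scraped titles -- only those ending in a head word become seeds.
titles = ["Social exchange theory", "Prospect theory",
          "List of cognitive biases", "Technology acceptance model"]
seeds = [t for t in titles if is_theory_entity(t)]

def label_sentence(sentence, seeds):
    """Return the seed mentions found in a sentence (case-insensitive)."""
    return [s for s in seeds if s.lower() in sentence.lower()]

sentence = "We ground our hypotheses in social exchange theory."
print(label_sentence(sentence, seeds))  # -> ['Social exchange theory']
```

Every sentence that matches a seed mention can then be emitted as automatically labeled training data, with no human annotation in the loop.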
 The pipeline is composed of
the following five modules:
 Web Scraping
 Obtaining Body Text
 Sentence Segmentation
 Elasticsearch
 Automatic Annotation
 The pipeline is composed of the following five modules:
1. Web Scraping
 Use a web scraper to obtain theory and model entities from Wikipedia webpages.
 A heuristic filter keeps only phrases ending with head words such as
“theory”, “model”, “concept”...
2. Obtaining Body Text
 Use GROBID to convert PDF documents into XML format and keep the body text
3. Sentence Segmentation
 Use Stanza to segment the body text of papers into sentences
 870,000 sentences extracted from the 2,400 sampled papers
4. Elasticsearch
 The sentences are indexed by Elasticsearch.
5. Automatic Annotation
 The seed theory mentions obtained in Web Scraping are used to query the Elasticsearch
index
 Matched sentences are annotated in the BIO (Begin, Inside, Outside) scheme
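The annotation step can be illustrated with a toy BIO tagger (a simplified sketch, not the pipeline's implementation; the sentence and the THEORY label name are invented):

```python
# BIO annotation sketch: given a seed theory mention found in a sentence,
# tag its tokens B-THEORY / I-THEORY and everything else O.

def bio_tag(tokens, mention_tokens):
    """Tag every occurrence of mention_tokens inside tokens (case-insensitive)."""
    tags = ["O"] * len(tokens)
    n = len(mention_tokens)
    target = [m.lower() for m in mention_tokens]
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            tags[i] = "B-THEORY"                 # Begin of the mention
            for j in range(i + 1, i + n):
                tags[j] = "I-THEORY"             # Inside of the mention
    return tags

tokens = "We draw on prospect theory in this study .".split()
# "prospect" gets B-THEORY, "theory" gets I-THEORY, all other tokens get O.
print(list(zip(tokens, bio_tag(tokens, ["prospect", "theory"]))))
```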
Automatic Annotation
 The parent sample comes from the Defense Advanced Research Projects Agency
(DARPA) ‘Systematizing Confidence in Open Research and Evidence’ (SCORE)
programme, and contains approximately 30,000 articles published between 2009 and
2018 in 62 major SBS journals in psychology, economics, politics, management,
education, etc.
 We obtain the text for labeling from a random sample of 2400 SBS papers.
 The ground truth dataset contains 4534 sentences with 550 unique theory mentions
automatically annotated by the pipeline.
 We compare four deep neural network architectures, including BiLSTM, BiLSTM-CRF,
Transformer, and GCN.
 BiLSTM: The BiLSTM architecture analyzes the contextual dependency for each token from
both forwards and backwards, and then assigns each token a label based on probability
scores for each tag.
 BiLSTM-CRF: BiLSTM can work together with a Conditional Random Field (CRF) layer,
which labels a token based on its own features as well as the features and labels of
nearby tokens.
 Transformer: A transformer model predicts the labels of all tokens simultaneously,
attending to the features of neighboring tokens through a multi-head attention mechanism.
 GCN: A Graph Convolutional Network (GCN) generalizes convolution to graph-structured
data.
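The CRF layer's use of neighboring labels can be illustrated with Viterbi decoding over toy scores (all numbers below are invented; a real CRF learns the emission and transition scores during training):

```python
# Viterbi decoding sketch: a CRF layer picks the globally best tag sequence
# by combining per-token emission scores with tag-transition scores.

def viterbi(emissions, transitions, tags):
    """emissions: list of {tag: score} per token; transitions: {(prev, cur): score}."""
    best = [{t: (emissions[0][t], [t]) for t in tags}]
    for em in emissions[1:]:
        layer = {}
        for cur in tags:
            # Best previous tag, accounting for the transition score into `cur`.
            prev_tag, (score, path) = max(
                ((p, best[-1][p]) for p in tags),
                key=lambda kv: kv[1][0] + transitions[(kv[0], cur)])
            layer[cur] = (score + transitions[(prev_tag, cur)] + em[cur],
                          path + [cur])
        best.append(layer)
    return max(best[-1].values())[1]

tags = ["O", "B", "I"]
# Transition scores forbid I after O: an Inside tag needs a Begin before it.
transitions = {(p, c): -10.0 if (p == "O" and c == "I") else 0.0
               for p in tags for c in tags}
emissions = [{"O": 2.0, "B": 0.1, "I": 0.0},
             {"O": 0.0, "B": 1.5, "I": 1.4},
             {"O": 0.2, "B": 0.0, "I": 1.0}]
print(viterbi(emissions, transitions, tags))  # -> ['O', 'B', 'I']
```

A greedy per-token tagger could emit an I directly after an O; the transition scores let the CRF rule out such invalid BIO sequences globally.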
 Time efficiency
 Less than half an hour to go through 870,000 sentences and check whether they contain
any of the 550 theory phrases
 Two hours to annotate the 4534 sentences
 Extraction results on a small dataset:
 All theory names extracted from these sentences were new.
 In particular, about 42% contain head words that were not in the heuristic filter.
Conclusion
 We proposed a trainable framework that extracts theory and model mentions from
scientific papers using distant supervision.
 We created a new benchmark corpus of 4534 annotated sentences from papers
in SBS domains. This dataset can be used to train future models for theory
extraction.
 We compared several NER neural architectures and investigated their
dependency on pre-trained language models. The empirical results indicated that
the RoBERTa-BiLSTM-CRF architecture achieved the best performance with an
F1 score of 77.21% and a precision of 89.72%.
