1. The document proposes a framework to extract theory and model mentions from scientific papers using distant supervision. It automatically annotates sentences using seed theory mentions from Wikipedia to generate a labeled training dataset.
2. The authors build a benchmark corpus of 4,534 annotated sentences from social and behavioral science papers. They compare neural architectures for named entity recognition, including BiLSTM, Transformer, and GCN models; RoBERTa-BiLSTM-CRF achieves the best performance.
3. The framework efficiently annotates large text corpora and extracts new theory names that are absent from the original heuristic filter, offering a method for automatic theory extraction from the literature.
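
The distant-supervision step in point 1 can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `bio_tag`, the `B-THEORY`/`I-THEORY`/`O` label names, the regex tokenizer, and the exact-match rule are all assumptions made for illustration. It shows how a list of seed mentions could turn raw sentences into token-level labels suitable for training a named entity recognition model.

```python
import re


def bio_tag(sentence, seed_mentions):
    """Label tokens B-THEORY / I-THEORY / O by exact, case-insensitive seed matching.

    Hypothetical helper: label names and tokenization are illustrative assumptions.
    """
    # Split into word tokens and single punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)

    # Drop empty seeds, then try longer seeds first so they win over
    # shorter seeds that overlap the same span.
    seeds = [s.lower().split() for s in seed_mentions if s.strip()]
    seeds.sort(key=len, reverse=True)

    for seed in seeds:
        n = len(seed)
        for i in range(len(tokens) - n + 1):
            span_is_free = all(label == "O" for label in labels[i:i + n])
            if lowered[i:i + n] == seed and span_is_free:
                labels[i] = "B-THEORY"
                for j in range(i + 1, i + n):
                    labels[j] = "I-THEORY"
    return list(zip(tokens, labels))


if __name__ == "__main__":
    seeds = ["learning theory", "social learning theory"]
    for token, label in bio_tag("We apply social learning theory to online behavior.", seeds):
        print(f"{token}\t{label}")
```

Exact matching of this kind only labels mentions already in the seed list and ignores seeds whose punctuation differs from the tokenizer's output (for example hyphenated names). Recognizing theory names outside the seed list, as described in point 3, is the job of the model trained on these labels.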