STEREO: A Pipeline for Extracting Experiment Statistics, Conditions, and Topics from Scientific Papers

Steffen Epp, Marcel Hoffman, Nicolas Lell, Michael Mohr, Ansgar Scherp |
University of Ulm, Germany | November 24, 2021

Motivation
▶ Reporting of statistics should follow APA style guides
▶ Eases reading, avoids misunderstanding, easy to verify the stats
▶ In practice, however, we see in scientific papers ...
▶ ... style guide deviations: “Physical demand
(t(23) = −2.22, p = 0.37) and temporal demand
(t(23) = 2.72, p = .012) are significantly different”
→ Statistics are not at the end of the sentence, and the p-value has a
non-standard leading 0.
▶ ... variables missing: “Similarly fall in hemoglobin was
associated with total operative time (r = 0.49, p = 0.003)”
→ no degrees of freedom reported
▶ Different naming: “The value of Spearman correlation coefficient
showed that the threshold effects did not exist in the adult BAL
group (Spearman correlation coefficient : 0.158; P = 0.727)”
→ “Spearman correlation coefficient” instead of r(df)

Example of Extracted Statistics Record
▶ “There was no significant effect for sex, (t(38) = 1.7, p = .097)
despite women attaining higher scores than men”
▶ Extracted: {degreeOfFreedom = 38,
statisticVal = 1.7, pvalue = .097, topic = personal data,
conditions = {men, women}}
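
As a minimal sketch, the extracted record above could be held in a small Python data class. The class name `StatisticRecord` and the snake_case field names are assumptions made for illustration; they mirror the fields on the slide and are not taken from the STEREO code base.

```python
from dataclasses import dataclass, field


@dataclass
class StatisticRecord:
    """Hypothetical container for one extracted statistic (illustration only)."""
    degrees_of_freedom: int
    statistic_val: float
    pvalue: float
    topic: str
    conditions: set = field(default_factory=set)


# The example record from the slide, written with the assumed field names.
record = StatisticRecord(
    degrees_of_freedom=38,
    statistic_val=1.7,
    pvalue=0.097,
    topic="personal data",
    conditions={"men", "women"},
)
```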

Pipeline
Figure: The STEREO pipeline. Step 0: preprocessing. Step 1: rule-based statistics extraction using an active wrapper, followed by evaluation. Step 2: ABAE and GBCE, followed by evaluation.
Preprocessing:
▶ Split the text into sentences using the regular expression (\.\s?[A-Z]).
▶ Filter out all sentences without numbers.
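
The two preprocessing steps above can be sketched in Python as follows. This is only an illustration under assumptions: the boundary pattern is a reading of the regular expression on the slide (a period, an optional whitespace character, then a capital letter), and the function names are hypothetical.

```python
import re

# Assumed reading of the slide's pattern: ".", optional whitespace, capital letter.
BOUNDARY = re.compile(r"\.\s?[A-Z]")
HAS_DIGIT = re.compile(r"\d")


def split_sentences(text):
    """Split text at every boundary match, keeping the period with its sentence."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.start() + 1  # cut right after the period
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def sentences_with_numbers(text):
    """Keep only the sentences that contain at least one digit."""
    return [s for s in split_sentences(text) if HAS_DIGIT.search(s)]


text = "We ran a test. The effect was large (t(23) = 2.72, p = .012). No more."
kept = sentences_with_numbers(text)
```

A split this simple also cuts at abbreviations followed by a capital letter, which is one plausible source of the preprocessing errors reported later in the deck.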

Statistic Extraction
▶ Two sets of rules: $R^+$ and $R^-$
▶ For each rule $r^+_i \in R^+$ there is a set of sub-rules $S_i$
Example: “The applications of the protein CD45RA are significantly different
(t(23) = −2.22, p = 0.37)”
→ $r^-_j$ matches “CD45RA”
→ $r^+_i$ matches “(t(23) = −2.22, p = 0.37)”; within it, the sub-rules
$s_{i1}$, $s_{i2}$, and $s_{i3}$ match the individual parts of
“t(23) = −2.22, p = 0.37”
New Rules
▶ Sentence contains a number without a match
▶ Ask the user to input a new $R^+$ or $R^-$ rule
→ active wrapper induction
▶ “As we showed in Sec, 4.2 there is ... ” (no match for “4.2”)
→ New $R^-$ rule: $r^-_j$ = “Sec,\s*\d+\.\d+”
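
The interaction described above can be sketched as a small loop: whenever a sentence still contains a number that no rule covers, the user is asked for a new rule. The function names and the `ask_user` callback are assumptions for illustration; in an interactive tool the callback would prompt a person, while here it returns a scripted answer.

```python
import re


def uncovered_numbers(sentence, r_plus, r_minus):
    """Numbers in the sentence that no R+ or R- rule matches."""
    rest = sentence
    for rule in list(r_plus) + list(r_minus):
        rest = rule.sub(" ", rest)
    return re.findall(r"\d+(?:\.\d+)?", rest)


def induce_rules(sentences, r_plus, r_minus, ask_user):
    """Extend the rule sets until every number is covered (or a rule fails to help)."""
    for sentence in sentences:
        missing = uncovered_numbers(sentence, r_plus, r_minus)
        while missing:
            kind, pattern = ask_user(sentence, missing)  # kind is "+" or "-"
            target = r_plus if kind == "+" else r_minus
            target.append(re.compile(pattern))
            remaining = uncovered_numbers(sentence, r_plus, r_minus)
            if len(remaining) >= len(missing):
                break  # the new rule did not help; avoid looping forever
            missing = remaining
    return r_plus, r_minus


def scripted_answer(sentence, missing):
    # Stand-in for the user: always proposes the R- rule from the slide.
    return "-", r"Sec,\s*\d+\.\d+"


example = "As we showed in Sec, 4.2 there is an effect."
r_plus, r_minus = induce_rules([example], [], [], scripted_answer)
```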

ABAE Method
▶ Attention Based Aspect Extraction 1
▶ Set the number of topics K and train the model unsupervised
▶ Manually label each topic based on its representative words
Example:
▶ Associating words to aspect: {day, week, month, hour, wk,
period, lasted, time, weekly, elapse, year, thereafter, minute,
daily, 24h ... } → Topic: Time
▶ “A negative non-significant relationship between PHQ-9 total
score and age 21-29; r (340) = -0.042, p = 0.441 ...” → Topic:
Mental Health
1 He et al.: An Unsupervised Neural Attention Model for Aspect Extraction,
https://aclanthology.org/P17-1036/
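
For orientation, the core of the ABAE model can be summarized as below, following the formulation in He et al. as we recall it; the slides themselves do not give these equations. Here $e_{w_i}$ is the embedding of the $i$-th word of a sentence with $n$ words, $M$, $W$, and $b$ are learned parameters, and $T$ is the matrix of $K$ topic (aspect) embeddings. Assigning a sentence to the topic with the largest entry of $p_t$ is an assumption about how the model is used in this pipeline.

```latex
\begin{align*}
y_s &= \frac{1}{n} \sum_{i=1}^{n} e_{w_i}
  && \text{average word embedding} \\
d_i &= e_{w_i}^{\top} M \, y_s
  && \text{relevance of word } i \\
a_i &= \frac{\exp(d_i)}{\sum_{j=1}^{n} \exp(d_j)}
  && \text{attention weight} \\
z_s &= \sum_{i=1}^{n} a_i \, e_{w_i}
  && \text{sentence embedding} \\
p_t &= \operatorname{softmax}(W z_s + b)
  && \text{distribution over } K \text{ topics} \\
r_s &= T^{\top} p_t
  && \text{reconstruction from topic embeddings}
\end{align*}
```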

GBCE Method
▶ Grammar Based Condition Extraction
▶ POS and grammar annotation with spaCy
▶ Rules to identify noun phrases based on common phrases and
annotations
Example:
▶ “. . . increase in risk for men was bigger than for women . . . ”
▶ Rule scheme: Noun (subject) + verb + comparative adjective +
than + noun (object)
▶ Conditions: {men, women}
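
The rule scheme above can be illustrated with a simplified sketch that works on already tagged tokens. It assumes Penn Treebank style tags (NN*, VB*, JJR) supplied as (word, tag) pairs, for example from a tagger such as spaCy, and it uses the nearest nouns around the comparison rather than a dependency parse. The function name and this simplification are assumptions, not the GBCE implementation.

```python
def extract_conditions(tagged):
    """Find 'noun ... verb + comparative adjective + than ... noun' and return both nouns."""
    for i, (word, tag) in enumerate(tagged):
        if tag != "JJR":  # comparative adjective, e.g. "bigger"
            continue
        if i + 1 >= len(tagged) or tagged[i + 1][0].lower() != "than":
            continue
        if i == 0 or not tagged[i - 1][1].startswith("VB"):
            continue
        # Nearest noun before the verb and first noun after "than".
        left = next((w for w, t in reversed(tagged[: i - 1]) if t.startswith("NN")), None)
        right = next((w for w, t in tagged[i + 2:] if t.startswith("NN")), None)
        if left and right:
            return {left, right}
    return set()


# Hand-tagged version of the example sentence (tags are an assumption).
tagged_sentence = [
    ("increase", "NN"), ("in", "IN"), ("risk", "NN"), ("for", "IN"),
    ("men", "NNS"), ("was", "VBD"), ("bigger", "JJR"), ("than", "IN"),
    ("for", "IN"), ("women", "NNS"),
]
conditions = extract_conditions(tagged_sentence)
```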

Dataset
▶ CORD-19 dataset, version of September 21, 2020
▶ 108k scientific papers
▶ 16M sentences after preprocessing
▶ 55% of sentences contain at least one digit

Rules Learned by Wrapper Induction
▶ Rules were learned on 500 documents.
▶ 85 $R^+$ and 1,425 $R^-$ rules were found.
▶ On a sample of 10,000 unseen documents, they covered 95% of
the sentences with digits.

Rule-based Statistics Extraction: Results
Statistic               APA conform      non-APA conform
Student’s t-test        608              179
Pearson Correlation     113              4,962
Spearman Correlation    1                528
ANOVA                   0                9
Mann-Whitney U          2                34
Wilcoxon Signed-Rank    0                0
Chi-Square              14               31
not supported           not applied      19,151
not determinable        not applicable   87,904
Table: Number of extracted statistics per type. “Not supported” covers, e.g.,
odds ratios and IQRs. “Not determinable” covers, e.g., p-values reported on
their own, where the type of statistic could not be decided.
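
As a quick check of the table above, the column totals and the share of non-APA conform statistics can be tallied as follows. Counting the “not supported” and “not determinable” rows as non-APA conform is an assumption based on the column in which the table reports them.

```python
# (APA conform, non-APA conform) counts copied from the table above.
counts = {
    "Student's t-test": (608, 179),
    "Pearson Correlation": (113, 4962),
    "Spearman Correlation": (1, 528),
    "ANOVA": (0, 9),
    "Mann-Whitney U": (2, 34),
    "Wilcoxon Signed-Rank": (0, 0),
    "Chi-Square": (14, 31),
}
not_supported = 19151
not_determinable = 87904

apa_total = sum(apa for apa, _ in counts.values())
# Assumption: the two untyped rows are counted as non-APA conform.
non_apa_total = sum(non for _, non in counts.values()) + not_supported + not_determinable
non_apa_share = non_apa_total / (apa_total + non_apa_total)

print(f"APA conform: {apa_total}, non-APA conform: {non_apa_total}")
print(f"Share of non-APA conform statistics: {non_apa_share:.1%}")
```

Under that assumption, 738 of 113,536 extracted statistics are APA conform, so slightly more than 99% are not.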

Assessment of Statistic Extraction Results
Statistic               APA conform   non-APA conform
Student’s t-test        1.0           0.91
Pearson Correlation     1.0           0.98
Spearman Correlation    1.0           1.0
ANOVA                   n/a           1.0
Mann-Whitney U          1.0           1.0
Wilcoxon Signed-Rank    n/a           n/a
Chi-Square              1.0           0.97
other                   -             0.95
Table: Precision was calculated for each statistic type on 200 samples. If
fewer than 200 samples were extracted, precision was calculated on all
extracted samples.

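Topic Extraction from Experiments using ABAE: Results
emb        train      K    Result APA   Result non-APA
supp-sen   supp-sen   15   33           31
supp-sen   supp-sen   30   75           73
all-sen    supp-sen   15   51           57
all-sen    supp-sen   30   48           49
Table: Number of correctly classified sentences out of 100 per setting.
emb = data used to train the embedding, train = data used to train the model;
supp-sen = sentences with supported statistics, all-sen = all sentences.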
▶ The best result was achieved with both the embedding and the model
trained only on sentences with our supported statistics (vs. sentences
with any statistics and all sentences) and K = 30.
▶ This model correctly classified 75/100 APA and 73/100
non-APA conform sentences.
▶ Example Topics: {risk factors, statistics, mental health}

Grammar-based Condition Extraction: Results
GBCE                            Result APA   Result non-APA
Correctly classified            46           30
Reason 1: Failed grammar        4            5
Reason 2: Sentence structure    10           3
Reason 3: Preprocessing error   9            12
Reason 4: Dependency parser     18           2
Reason 5: GBCE miss             25           47
Table: Number of correctly extracted experimental conditions and reasons
why the extraction failed. In several samples, a combination of reasons was
the cause.

Generalization
▶ The whole approach should transfer to other domains.
▶ Depending on the domain, fine-tuning of the $R^-$ rules and
adding new statistic types to the $R^+$ rules may be necessary.
▶ GBCE should generalize well to other domains, since it is based
on English grammar.
▶ ABAE can be transferred to similar domains; for different
domains it needs to be re-trained.

Threats to Validity / Reproducibility
▶ For some statistics we extracted only a few samples (e.g.,
ANOVA), so these results may not be representative.
▶ It is possible that topic or condition extraction requires more
than one sentence to capture the context.
▶ In ABAE, the inference of topics could lead to errors, but all
terms were checked to reduce this possibility.
▶ An extended version of the paper can be found on arXiv 2
▶ For reproducibility, the code and ruleset are publicly available at:
github.com/Foisunt/STEREO
2 https://arxiv.org/abs/2103.14124

Summary of our Results
▶ High-quality statistics extraction (100% precision on APA conform
and 95% on non-APA conform sentences)
▶ The vast majority of statistics (> 99%) are not strictly APA
conform.
▶ Some ABAE models found “statistics” or “result reporting”
topics, which are technically correct but not useful.
▶ We found no parameter setting for ABAE (embedding, K) that
clearly worked better in general.
▶ GBCE works better on APA conform sentences, because on
average they have a better grammatical structure.
Thank you! Questions?
