STEREO: A Pipeline for Extracting Experiment Statistics,
Conditions, and Topics from Scientific Papers
Steffen Epp, Marcel Hoffman, Nicolas Lell, Michael Mohr, Ansgar Scherp |
University of Ulm, Germany | November 24, 2021
Seite 2 Motivation | STEREO | November 24, 2021
Motivation
▶ Reporting of statistics should follow APA style guides
▶ Eases reading, avoids misunderstanding, easy to verify the stats
▶ In practice, however, scientific papers contain ...
▶ ... style guide deviations: “Physical demand (t(23) = −2.22, p = 0.37) and temporal demand (t(23) = 2.72, p = .012) are significantly different”
→ Statistics are not at the end of the sentence, and one p-value has a non-standard leading 0.
▶ ... missing variables: “Similarly fall in hemoglobin was associated with total operative time (r = 0.49, p = 0.003)”
→ No degrees of freedom reported.
▶ ... different naming: “The value of Spearman correlation coefficient showed that the threshold effects did not exist in the adult BAL group (Spearman correlation coefficient : 0.158; P = 0.727)”
→ “Spearman correlation coefficient” instead of r(df).
Example of Extracted Statistics Record
▶ “There was no significant effect for sex, (t(38) = 1.7, p = .097)
despite women attaining higher scores than men”
▶ Extracted: {degreeOfFreedom = 38, statisticVal = 1.7, pvalue = .097, topic = personal data, conditions = {men, women}}
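The extracted record above could be held in a small typed structure. The following is only an illustrative sketch: the class name `StatisticRecord` and the snake_case field names are assumptions and are not taken from the STEREO code.

```python
from dataclasses import dataclass, field


@dataclass
class StatisticRecord:
    """One extracted statistic with its topic and experimental conditions.

    Hypothetical container; field names mirror the slide's record loosely.
    """
    degree_of_freedom: int
    statistic_val: float
    pvalue: float
    topic: str
    conditions: set = field(default_factory=set)


# The example record from the slide, filled in by hand.
record = StatisticRecord(
    degree_of_freedom=38,
    statistic_val=1.7,
    pvalue=0.097,
    topic="personal data",
    conditions={"men", "women"},
)
```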
Pipeline
Figure: The STEREO pipeline. Step 0: preprocessing. Step 1: rule-based extraction using active wrapper induction, followed by evaluation. Step 2: ABAE and GBCE, followed by evaluation.
Preprocessing:
▶ Split the text into sentences using the regular expression (\.\s?[A-Z]).
▶ Filter out all sentences without numbers.
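The two preprocessing steps can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes the slide's pattern reads `\.\s?[A-Z]` and uses a lookahead so the capital letter stays with the following sentence. The function names are hypothetical.

```python
import re

# Assumption: a sentence ends at a period that is followed by an optional
# whitespace character and a capital letter. The lookahead (?=[A-Z]) keeps
# the capital letter in the next sentence instead of consuming it.
SENTENCE_END = re.compile(r"\.\s?(?=[A-Z])")
HAS_DIGIT = re.compile(r"\d")


def split_sentences(text):
    """Split text at sentence boundaries (existing newlines also split)."""
    marked = SENTENCE_END.sub(".\n", text)
    return [s.strip() for s in marked.split("\n") if s.strip()]


def preprocess(text):
    """Keep only the sentences that contain at least one digit."""
    return [s for s in split_sentences(text) if HAS_DIGIT.search(s)]


text = "We ran a test. The result was t(23) = 2.72, p = .012. No more."
sentences = preprocess(text)
```

Note that a decimal point such as the one in ".012" is not treated as a boundary, because it is not followed by a capital letter.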
Statistic Extraction
▶ Two sets of rules: R+ and R−
▶ For each rule r+_i ∈ R+ there is a set of sub-rules S_i
Example: “The applications of the protein CD45RA are significantly different (t(23) = −2.22, p = 0.37)”
→ “CD45RA” is matched by a rule r−_j ∈ R−
→ “(t(23) = −2.22, p = 0.37)” is matched by a rule r+_i ∈ R+
→ Within that match, the sub-rules s_i1 and s_i2 cover parts of “t(23) = −2.22” and s_i3 covers “p = 0.37”
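How an R+ rule and its sub-rules could work together is sketched below. The regular expressions and the dictionary keys are hypothetical and written only for this example; the actual STEREO rule set is not shown on the slides. R− rules are left out here for brevity.

```python
import re

# Hypothetical R+ rule for a t-test report such as "(t(23) = -2.22, p = 0.37)",
# paired with sub-rules that pull single values out of the matched span.
# The character class [-−] accepts both the ASCII hyphen and the Unicode minus.
R_PLUS = [
    (
        re.compile(r"\(t\(\d+\)\s*=\s*[-−]?\d*\.?\d+,\s*p\s*=\s*\d*\.\d+\)"),
        {
            "degreesOfFreedom": re.compile(r"t\((\d+)\)"),
            "statisticVal": re.compile(r"=\s*([-−]?\d*\.?\d+),"),
            "pvalue": re.compile(r"p\s*=\s*(\d*\.\d+)"),
        },
    ),
]


def extract(sentence):
    """Return one dict of raw string values per R+ match in the sentence."""
    records = []
    for rule, sub_rules in R_PLUS:
        for match in rule.finditer(sentence):
            span = match.group(0)
            record = {}
            for name, sub_rule in sub_rules.items():
                sub_match = sub_rule.search(span)
                if sub_match:
                    record[name] = sub_match.group(1)
            records.append(record)
    return records


sentence = ("The applications of the protein CD45RA are significantly "
            "different (t(23) = −2.22, p = 0.37)")
records = extract(sentence)
```

The values are kept as strings so that the original notation (for example a missing leading zero) stays available for a later APA-conformity check.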
New Rules
▶ A sentence contains a number that no rule matches
▶ Ask the user to input a new R+ or R− rule
→ active wrapper induction
▶ Example: “As we showed in Sec, 4.2 there is ...” (the number 4.2 has no match)
→ New R− rule: r−_j = “Sec,\s*\d+\.\d+”
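The check that triggers a request for a new rule can be sketched as follows. This is a minimal sketch under assumptions: the helper `unmatched_numbers` and the number pattern are hypothetical, and the interactive step (asking the user) is replaced by appending the rule directly.

```python
import re

# Assumption: a "number" is an integer or a simple decimal.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unmatched_numbers(sentence, rules):
    """Return the numbers in the sentence that no rule match covers."""
    covered = [m.span() for rule in rules for m in rule.finditer(sentence)]
    return [
        m.group(0)
        for m in NUMBER.finditer(sentence)
        if not any(start <= m.start() and m.end() <= end
                   for start, end in covered)
    ]


rules = []  # combined R+ and R- rules learned so far (empty at the start)
sentence = "As we showed in Sec, 4.2 there is ..."

before = unmatched_numbers(sentence, rules)

# In the pipeline the user would now be asked for a rule; here the R- rule
# from the slide is added by hand.
rules.append(re.compile(r"Sec,\s*\d+\.\d+"))
after = unmatched_numbers(sentence, rules)
```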
ABAE Method
▶ Attention-Based Aspect Extraction (ABAE) [1]
▶ Set the number of topics K and train unsupervised
▶ Manually label each topic based on its representative words
Example:
▶ Associating words to an aspect: {day, week, month, hour, wk, period, lasted, time, weekly, elapse, year, thereafter, minute, daily, 24h ... } → Topic: Time
▶ “A negative non-significant relationship between PHQ-9 total score and age 21-29; r (340) = -0.042, p = 0.441 ...” → Topic: Mental Health
[1] He et al.: An Unsupervised Neural Attention Model for Aspect Extraction
https://aclanthology.org/P17-1036/
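At inference time, ABAE as described by He et al. builds an attention-weighted sentence embedding and maps it to a probability distribution over the K topics. The pure-Python sketch below shows only that forward pass. All numbers (2-dimensional toy embeddings, the matrices `M` and `W`, the bias `b`) and the topic labels are invented for illustration, and training by sentence reconstruction is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def abae_forward(word_vecs, M, W, b):
    """Return (attention weights per word, topic probabilities)."""
    dim = len(word_vecs[0])
    # y: average of the word embeddings (global sentence context)
    y = [sum(v[k] for v in word_vecs) / len(word_vecs) for k in range(dim)]
    # attention score per word: e_i^T M y, normalised with softmax
    My = matvec(M, y)
    attention = softmax([dot(e, My) for e in word_vecs])
    # z: attention-weighted sentence embedding
    z = [sum(a * e[k] for a, e in zip(attention, word_vecs))
         for k in range(dim)]
    # topic distribution: softmax(W z + b)
    topic_probs = softmax([wz + bk for wz, bk in zip(matvec(W, z), b)])
    return attention, topic_probs


# Toy values, invented for this example (K = 2 topics).
word_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
labels = ["Time", "Mental Health"]  # assigned manually after training

attention, topic_probs = abae_forward(word_vecs, M, W, b)
predicted = labels[topic_probs.index(max(topic_probs))]
```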
GBCE Method
▶ Grammar-Based Condition Extraction (GBCE)
▶ POS and grammar annotation with spaCy
▶ Rules identify noun phrases based on common phrases and annotations
Example:
▶ “. . . increase in risk for men was bigger than for women . . . ”
▶ Rule scheme: noun (subject) + verb + comparative adjective + than + noun (object)
▶ Conditions: {men, women}
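The rule scheme above can be illustrated without a parser by working on tokens that already carry POS tags. This is a simplified sketch, not the GBCE implementation: the tags are Penn Treebank style and written by hand here (in the pipeline they would come from spaCy), and the function `extract_conditions` is hypothetical.

```python
def extract_conditions(tagged):
    """Apply: noun + verb + comparative adjective + "than" + noun.

    `tagged` is a list of (token, POS tag) pairs. Returns the two nouns
    being compared, or an empty set if the pattern does not occur.
    """
    for i, (word, tag) in enumerate(tagged):
        if tag != "JJR":                                 # comparative adjective
            continue
        if i + 1 >= len(tagged) or tagged[i + 1][0].lower() != "than":
            continue
        if i == 0 or not tagged[i - 1][1].startswith("VB"):  # verb before it
            continue
        # nearest noun before the verb, first noun after "than"
        left = next((w for w, t in reversed(tagged[:i - 1])
                     if t.startswith("NN")), None)
        right = next((w for w, t in tagged[i + 2:]
                      if t.startswith("NN")), None)
        if left and right:
            return {left, right}
    return set()


# Hand-tagged version of the example sentence from the slide.
tagged = [
    ("increase", "NN"), ("in", "IN"), ("risk", "NN"), ("for", "IN"),
    ("men", "NNS"), ("was", "VBD"), ("bigger", "JJR"), ("than", "IN"),
    ("for", "IN"), ("women", "NNS"),
]
conditions = extract_conditions(tagged)
```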
Dataset
▶ CORD-19 dataset, version of September 21, 2020
▶ 108k scientific papers
▶ 16M sentences after preprocessing
▶ 55% of sentences contain at least one digit
Rules Learned by Wrapper Induction
▶ Rules were learned on 500 documents.
▶ 85 R+ and 1,425 R− rules were found.
▶ On a sample of 10,000 unseen documents, they covered 95% of the sentences with digits.
Rule-based Statistics Extraction: Results
Statistic               APA conform      non-APA conform
Student’s t-test        608              179
Pearson Correlation     113              4,962
Spearman Correlation    1                528
ANOVA                   0                9
Mann-Whitney U          2                34
Wilcoxon Signed-Rank    0                0
Chi-Square              14               31
not supported           not applied      19,151
not determinable        not applicable   87,904
Table: Number of extracted statistics per type. “Not supported” covers e.g. odds ratio, IQR, etc. “Not determinable” covers e.g. a solely reported p-value where the type of statistic could not be decided.
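The counts in the table can be totalled to see how small the APA-conform share is. The short calculation below is a sketch that assumes the "not supported" and "not determinable" rows count as non-APA conform; the variable names are chosen for this example.

```python
# Counts copied from the table above.
apa = {
    "Student's t-test": 608, "Pearson Correlation": 113,
    "Spearman Correlation": 1, "ANOVA": 0, "Mann-Whitney U": 2,
    "Wilcoxon Signed-Rank": 0, "Chi-Square": 14,
}
non_apa = {
    "Student's t-test": 179, "Pearson Correlation": 4962,
    "Spearman Correlation": 528, "ANOVA": 9, "Mann-Whitney U": 34,
    "Wilcoxon Signed-Rank": 0, "Chi-Square": 31,
    "not supported": 19151, "not determinable": 87904,
}

apa_total = sum(apa.values())            # 738
non_apa_total = sum(non_apa.values())    # 112,798
total = apa_total + non_apa_total        # 113,536

non_apa_share = non_apa_total / total
print(f"non-APA conform share: {non_apa_share:.2%}")
```

Under this assumption, roughly 99.35% of all extracted statistics are not APA conform.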
Assessment of Statistic Extraction Results
Statistic               APA conform   non-APA conform
Student’s t-test        1.0           0.91
Pearson Correlation     1.0           0.98
Spearman Correlation    1.0           1.0
ANOVA                   n/a           1.0
Mann-Whitney U          1.0           1.0
Wilcoxon Signed-Rank    n/a           n/a
Chi-Square              1.0           0.97
other                   -             0.95
Table: Precision per statistic type, calculated on 200 samples. If fewer than 200 samples were extracted, the precision was calculated on all extracted samples.
Topic Extraction from Experiments using ABAE: Results
emb        train      K    Result APA   Result non-APA
supp-sen   supp-sen   15   33           31
supp-sen   supp-sen   30   75           73
all-sen    supp-sen   15   51           57
all-sen    supp-sen   30   48           49
▶ The best result was achieved with K = 30 and with both the embedding and the model trained only on sentences with our supported statistics (vs. sentences with any statistics, or all sentences).
▶ This model correctly classified 75/100 APA conform and 73/100 non-APA conform sentences.
▶ Example topics: {risk factors, statistics, mental health}
Grammar-based Condition Extraction: Results
GBCE                            Result APA   Result non-APA
Correctly classified            46           30
Reason 1: Failed grammar        4            5
Reason 2: Sentence structure    10           3
Reason 3: Preprocessing error   9            12
Reason 4: Dependency parser     18           2
Reason 5: GBCE miss             25           47
Table: Number of correctly extracted experimental conditions and reasons why the extraction failed. In several samples, a combination of reasons was the cause.
Generalization
▶ The whole approach should transfer to other domains.
▶ Depending on the domain, fine-tuning the R− rules and adding new statistic types to the R+ rules may be necessary.
▶ GBCE should generalize well to other domains, since it is based on English grammar.
▶ ABAE can be transferred to similar domains; for different domains it needs to be re-trained.
Threat to Validity/Reproducibility
▶ For some statistics we extracted only a few samples (e.g. ANOVA); these results may not be representative.
▶ Topic or condition extraction may require more than one sentence to capture the context.
▶ In ABAE, the inference of topics could lead to errors, but all terms were checked to reduce this possibility.
▶ An extended version of the paper can be found on arXiv [2].
▶ For reproducibility, the code and rule set are publicly available at: github.com/Foisunt/STEREO
[2] https://arxiv.org/abs/2103.14124
Summary of our Results
▶ High-quality statistics extraction (100% accuracy on APA conform and 95% on non-APA conform sentences).
▶ The vast majority of statistics (> 99%) are not strictly APA conform.
▶ Some ABAE models found “statistics” or “result reporting” topics, which are technically correct but not useful.
▶ We found no parameter setting for ABAE (embedding, K) that clearly worked better in general.
▶ GBCE works better on APA conform sentences, because on average they have a better grammatical structure.
Thank you! Questions?
