Do Neural Models Learn Transitivity of
Veridical Inference?
Hitomi Yanaka1,2
Koji Mineshima3
Kentaro Inui4,2
1 University of Tokyo, 2 RIKEN, 3 Keio University, 4 Tohoku University
NALOMA2021@Online
1
Generalization concerns about neural models
• Deep neural network models (BERT [Devlin+,2018]) pretrained with
large-scale data have achieved high performance in language
understanding benchmark tasks (GLUE, SuperGLUE [Wang+, 2019]).
• However, many recent analyses [Liu+, 2019][McCoy+, 2020] show
that high performance on standard benchmarks does not always
mean that a model has the intended ability to understand
language (“understanding language like humans”).
2
1.Introduction
Systematicity (Fodor & Pylyshyn, 1988)
Systematicity of language/thought:
● If you understand “John loves the girl”, then you must also
understand “The girl loves John”.
Systematicity of inference:
● If you infer A from A&B, then you must also infer A&B from
A&B&C, etc.
3
1.Introduction
Question
To what extent can neural models learn the systematicity of
inference from training instances?
Systematicity in NLI
● Goal: Study the systematic generalization ability of neural
models on Natural Language Inference (NLI) [Dagan+, 2013].
● NLI: the task of judging whether a premise P entails a hypothesis H.
4
P: John knew that there was a wild deer jumping a fence
H: There was a deer jumping over a fence Entailment
1.Introduction
Related work on analyzing whether neural models can learn
systematicity
● Monotonicity inference involving quantifiers and negation
[Goodwin+2020][Geiger+ 2020][Yanaka+ 2020]
● Semantic parsing task
artificial language: SCAN[Lake and Baroni 2017]
natural language: COGS[Kim and Linzen 2020], SyGNS[Yanaka+ 2021]
● Inductive reasoning task
CLUTRR[Sinha+ 2019]
Related work
5
1.Introduction
Transitivity: a key challenge for systematicity of NLI
● If you infer B from A and C from B, then you must also be able to
infer C from A.
A → B    B → C
――――――――――――――
     A → C
● Syllogism/Cut Rule (in modern proof theory)
● Meta-logical inference ability:
○ The challenge is not to perform/learn a single pattern of inference
but to combine multiple patterns of inference.
○ Given that the sentence pairs (A, B) and (B, C) are labeled entailment, you
should also be able to judge that (A, C) is entailment.
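As a minimal formal illustration (mine, not from the slides), the cut rule can be written and machine-checked in Lean when entailment is modeled as implication between propositions:

```lean
-- Illustration only: entailment modeled as implication between propositions.
-- Cut rule: from A → B and B → C, conclude A → C.
theorem cut {A B C : Prop} (hab : A → B) (hbc : B → C) : A → C :=
  fun a => hbc (hab a)
```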
6
1.Introduction
Transitivity inference: Challenge
● If a model learns basic patterns A → B and B → C, it must be
able to compose these two and draw a new inference A → C.
● If a model lacks this generalization ability, it must memorize an
exponential number of inference combinations independently.
● How to create an NLI dataset for transitivity inference?
7
1.Introduction
Veridicality
Veridicality of clause-embedding verbs [Karttunen+,2012]
8
● A verb V is veridical when “x V that P” entails that P is true
P: John knows that [there was a deer jumping a fence]
H: There was a deer jumping a fence Entailment
● A verb V is non-veridical when “x V that P” does not entail that P is true
P: John hopes that [there was a deer jumping a fence]
H: There was a deer jumping a fence Non-entailment
1.Introduction
Transitivity inference involving veridicality
● Veridical inference lets us compose transitivity inferences at scale by
embedding various inferences under clause-embedding verbs.
● Simple heuristics (e.g., word overlap) fail on the composite inferences.
9
1.Introduction
Our work and contributions
Evaluate the systematic generalization ability of neural models on
transitivity inferences that combine veridical inferences with
various inferences
1. Provide analysis methods with two transitivity inference
datasets: a synthetic dataset and a naturalistic dataset
https://github.com/verypluming/transitivity
2. Use our datasets to analyze two standard NLI models (LSTM
and BERT) on various combination patterns
3. Analyze whether data augmentation with new combination
patterns helps models learn transitivity
10
How to test transitivity
Training
Basic 1. veridical inference: f(s1) → s1
Premise: John {knew/hoped} that Bob and Ann left. [f(s1)]
Hypothesis: Bob and Ann left. [s1] (Entailment/Non-entailment)
Basic 2. various inference patterns (e.g., Boolean): s1 → s2
Premise: Bob and Ann left. [s1]
Hypothesis: Ann left. [s2] (Entailment)
Test composite inference: f(s1) → s2
Premise: John {knew/hoped} that Bob and Ann left. [f(s1)]
Hypothesis: Ann left. [s2] (Entailment/Non-entailment)
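A minimal sketch of this setup (hypothetical helper names, not the authors' generation code): the two basic patterns go into training and the composite pattern into the test set.

```python
# Build the three premise/hypothesis pairs for one clause-embedding verb and one
# basic inference s1 -> s2. Helper and variable names are illustrative.
def embed(verb: str, sentence: str) -> str:
    """Wrap a sentence under a clause-embedding verb: f(s) = "John VERB that s"."""
    return f"John {verb} that {sentence}"

def transitivity_split(verb: str, s1: str, s2: str):
    f_s1 = embed(verb, s1)
    train = [
        (f_s1, s1),  # Basic 1, veridical inference: f(s1) -> s1
        (s1, s2),    # Basic 2, e.g. Boolean inference: s1 -> s2
    ]
    test = [
        (f_s1, s2),  # composite inference: f(s1) -> s2
    ]
    return train, test

train_pairs, test_pairs = transitivity_split("knew", "Bob and Ann left", "Ann left")
print(train_pairs)  # [('John knew that Bob and Ann left', 'Bob and Ann left'), ('Bob and Ann left', 'Ann left')]
print(test_pairs)   # [('John knew that Bob and Ann left', 'Ann left')]
```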
11
2. Method
Entailment/non-entailment labels
12
Basic (Train): f(s1) → s1   | Basic (Train): s1 → s2 | Composite (Test): f(s1) → s2
entailment (f: veridical)   | entailment             | entailment
entailment (f: veridical)   | neutral                | neutral
neutral (f: non-veridical)  | entailment             | neutral
neutral (f: non-veridical)  | neutral                | neutral
• Entailment labels of the basic patterns f(s1) → s1 are determined by
rules (e.g., if the embedding verb is veridical, the label for f(s1) → s1
is entailment).
• Entailment labels of the composite inferences f(s1) → s2 are fixed by the
composition rules in the table above: entailment only when f is veridical
and s1 entails s2, neutral otherwise.
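The composition rule in the table can be written down directly; the sketch below is illustrative (not the dataset-generation code itself).

```python
# Gold label of the composite inference f(s1) -> s2, given the veridicality of the
# embedding verb f and the gold label of the basic inference s1 -> s2.
def composite_label(f_is_veridical: bool, s1_s2_label: str) -> str:
    """Entailment only when f is veridical and s1 entails s2; neutral otherwise."""
    if f_is_veridical and s1_s2_label == "entailment":
        return "entailment"
    return "neutral"

assert composite_label(True, "entailment") == "entailment"   # e.g. "knew" + and-elimination
assert composite_label(True, "neutral") == "neutral"
assert composite_label(False, "entailment") == "neutral"     # e.g. "hoped": non-veridical
assert composite_label(False, "neutral") == "neutral"
```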
2. Method
13
Dataset creation: synthetic and naturalistic datasets
● Synthetic dataset: embeds Boolean inferences (and, or, not) generated by
CFG rules with meaning representations (simple Montague Grammar!).
Entailment labels of the Boolean inferences are checked with a theorem
prover (a toy stand-in check is sketched below).
f(s1): Someone knew that [Bob and Ann found Tom, Jim and Fred]
s1: Bob and Ann found Tom, Jim and Fred
s2: Bob found Jim
● Naturalistic dataset: embeds inferences from the SICK dataset
[Marelli+, 2014], a collection of lexical/structural inferences.
f(s1): Someone sees that [a person is brushing a cat]
s1: A person is brushing a cat
s2: A person is combing the fur of a cat
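The paper checks Boolean entailment labels with a theorem prover over Montague-style meaning representations; the toy stand-in below only illustrates the idea with a brute-force truth-table check over propositional formulas (representation and names are my own).

```python
from itertools import product

# A formula is either an atom (a string) or a tuple ("and" | "or" | "not", ...args).
def atoms(f):
    """Collect the atomic sentences occurring in a formula."""
    if isinstance(f, str):
        return {f}
    _, *args = f
    return set().union(*(atoms(a) for a in args))

def evaluate(f, valuation):
    """Evaluate a formula under a truth assignment to its atoms."""
    if isinstance(f, str):
        return valuation[f]
    op, *args = f
    if op == "and":
        return all(evaluate(a, valuation) for a in args)
    if op == "or":
        return any(evaluate(a, valuation) for a in args)
    if op == "not":
        return not evaluate(args[0], valuation)
    raise ValueError(f"unknown connective: {op}")

def entails(premise, hypothesis):
    """True iff every valuation making the premise true also makes the hypothesis true."""
    props = sorted(atoms(premise) | atoms(hypothesis))
    for values in product([True, False], repeat=len(props)):
        valuation = dict(zip(props, values))
        if evaluate(premise, valuation) and not evaluate(hypothesis, valuation):
            return False
    return True

# "Bob and Ann found Tom, Jim and Fred" decomposes into atomic "x found y" facts,
# one of which is "Bob found Jim", so s1 entails s2.
s1 = ("and", "bob_found_tom", "bob_found_jim", "bob_found_fred",
             "ann_found_tom", "ann_found_jim", "ann_found_fred")
s2 = "bob_found_jim"
print(entails(s1, s2))                                      # True
print(entails(("or", "bob_left", "ann_left"), "ann_left"))  # False
```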
2. Method
Dataset creation: clause-embedding verbs
● Choose 30 clause-embedding verbs from previous verb veridicality
datasets [White+, 2018][Ross and Pavlick, 2019]
● Insert a clause-embedding verb f into the template “Someone f
that s1” to make the main clause in f(s1)
14
Examples of how to create naturalistic datasets with SICK
f(s1): Someone sees that [a person is brushing a cat]
s1: A person is brushing a cat
s2: A person is combing the fur of a cat
Patterns: f(s1) → s1, s1 → s2, f(s1) → s2
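A minimal sketch (hypothetical helper, not the authors' code) of wrapping a SICK premise with a clause-embedding verb to obtain the three inference patterns above:

```python
def wrap(verb: str, s1: str) -> str:
    """Build f(s1) via the template "Someone VERB that s1" (naively lowercasing the first letter)."""
    return f"Someone {verb} that {s1[0].lower()}{s1[1:]}"

s1 = "A person is brushing a cat"            # SICK premise
s2 = "A person is combing the fur of a cat"  # SICK hypothesis (here s1 entails s2)
f_s1 = wrap("sees", s1)

pairs = [
    (f_s1, s1),  # basic veridical pattern:  f(s1) -> s1
    (s1, s2),    # basic SICK pattern:       s1 -> s2
    (f_s1, s2),  # composite test pattern:   f(s1) -> s2
]
for premise, hypothesis in pairs:
    print(premise, "=>", hypothesis)
```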
2. Method
Experimental setting
● Two neural NLI models
○ LSTM [Hochreiter and Schmidhuber 1997]
○ BERT [Devlin+ 2018]
● Datasets: see the table below for splits and sizes
● Evaluation metric: average accuracy over 5 runs (a minimal BERT
prediction sketch follows the table)
15
3. Experiments
Split | Pattern    | entail:non-entail | Synthetic | Naturalistic
Train | f(s1) → s1 | 1:1               | 6,000     | 30,000
Train | s1 → s2    | 1:1               | 3,000     | 1,000
Test  | f(s1) → s2 | 1:3               | 6,000     | 30,000
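A hedged sketch of how a premise/hypothesis pair is encoded and scored with BERT for two-way classification, using the Hugging Face transformers API (the label order and the two-way setup are assumptions; in the experiments the model would first be fine-tuned on the basic training patterns above):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # assumed mapping: 0 = entailment, 1 = non-entailment
)
model.eval()

premise = "Someone knew that Bob or Ann left."  # f(s1), f veridical
hypothesis = "Ann left."                        # s2, not entailed by s1

# BERT receives the pair as one sequence: [CLS] premise [SEP] hypothesis [SEP]
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted label id:", logits.argmax(dim=-1).item())
```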
● The models do not perform well on the composite inferences where the
verb f is veridical but the embedded sentence s1 does not entail s2.
● They appear to judge only by the veridicality of f (see the breakdown sketch below).
Premise: Someone knew that Bob or Ann left. [f(s1)]
Hypothesis: Ann left. [s2]
(Gold: Non-entailment, Prediction: Entailment)
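One way to surface this failure pattern is to break test accuracy down by the veridicality of f and by whether s1 entails s2; the sketch below uses hypothetical field names for illustration.

```python
from collections import defaultdict

def accuracy_breakdown(examples, predictions):
    """examples: dicts with 'veridical' (bool), 's1_entails_s2' (bool), and 'gold' label;
    predictions: model labels aligned with examples."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        key = (ex["veridical"], ex["s1_entails_s2"])
        total[key] += 1
        correct[key] += int(pred == ex["gold"])
    return {key: correct[key] / total[key] for key in total}

# Toy usage: a model that only checks the veridicality of f gets the first case wrong.
examples = [
    {"veridical": True, "s1_entails_s2": False, "gold": "non-entailment"},
    {"veridical": True, "s1_entails_s2": True, "gold": "entailment"},
]
predictions = ["entailment", "entailment"]
print(accuracy_breakdown(examples, predictions))  # {(True, False): 0.0, (True, True): 1.0}
```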
16
Results: LSTM and BERT fail to perform transitivity inference
(Result plots: Synthetic data / Naturalistic data)
3. Experiments
● (1) Use various templates to generate the main clause in f(s1)
● We manually select 40 clauses from the verb veridicality dataset [Ross and
Pavlick, 2019] and provide additional templates.
17
Is the poor performance on transitivity inference due to
overfitting to verbs? Two additional settings (1)
Type           | Template
Pronoun        | At that moment, we f that s1
Specific group | Some economists f that s1
Proper noun    | Hanson f that s1
3. Experiments
● (1) Use various templates to generate the main clause in f(s1)
● We manually select 40 clauses from the verb veridicality dataset [Ross and
Pavlick, 2019] and provide additional templates.
● The results show the same trends: the models fail on the cases where the
verb f is veridical but the embedded sentence s1 does not entail s2.
18
Is the poor performance on transitivity inference due to
overfitting to verbs? Two additional settings (1)
(Plot legend: yes = entailment, unk = neutral)
3. Experiments
● (2) Flip the gold labels of 10% of the veridical inference examples (sketched below)
● The results show the same trends; even when the complexity of
veridical inference is taken into account, the models fail to
consistently perform composite inferences.
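A minimal sketch of the label-flipping control (the data format is assumed): 10% of the basic f(s1) → s1 examples get the opposite label, so veridicality alone no longer determines the gold label deterministically.

```python
import random

def flip_labels(examples, ratio=0.1, seed=0):
    """examples: list of dicts with a 'label' in {'entailment', 'neutral'}.
    Returns a copy with the labels of a random 10% (by default) flipped."""
    rng = random.Random(seed)
    flipped = [dict(ex) for ex in examples]  # copy so the originals stay intact
    for i in rng.sample(range(len(flipped)), k=int(len(flipped) * ratio)):
        flipped[i]["label"] = (
            "neutral" if flipped[i]["label"] == "entailment" else "entailment"
        )
    return flipped
```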
19
Is the poor performance on transitivity inference due to
overfitting to verbs? Two additional settings (2)
(Plot legend: yes = entailment, unk = neutral)
3. Experiments
Data augmentation improved performance on the transitivity test sets.
20
Does data augmentation with a subset of combination
patterns help models learn transitivity?
3. Experiments
(Case shown in the plots: f veridical, i.e., f(s1) → s1: entail; s1 → s2: non-entail; f(s1) → s2: non-entail)
Data augmentation improved performance on the transitivity test sets.
However, accuracy improved even without training on the s1 → s2 patterns...
21
Does data augmentation with a subset of combination
patterns help models learn transitivity?
3. Experiments
Data augmentation improved performance on the transitivity test sets.
However, accuracy improved even without training on the s1 → s2 patterns...
Models do not “combine” basic inferences to perform transitivity
inference (a sketch of the augmentation setup follows below).
22
Does data augmentation with a subset of combination
patterns help models learn transitivity?
3. Experiments
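A hedged sketch of the augmentation probe (helper names and the split fraction are my own): add composite f(s1) → s2 examples for a subset of combination patterns to training and hold out the rest for testing; a model that truly combined basic inferences should not need the added composite examples.

```python
import random

def augment_training(train_basic, composite_pool, fraction=0.2, seed=0):
    """train_basic: basic-pattern examples; composite_pool: composite f(s1) -> s2 examples.
    Returns (augmented training set, held-out composite test set)."""
    rng = random.Random(seed)
    pool = list(composite_pool)
    rng.shuffle(pool)
    cut = int(len(pool) * fraction)
    return train_basic + pool[:cut], pool[cut:]
```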
Humans generally follow the distinction between veridical and
non-veridical verbs, as well as the transitivity of the entailment relation.
23
How well do humans perform transitivity inference?
(Results on naturalistic data)
3. Experiments
24
How well do humans perform transitivity inference?
● Humans tend to fail on the composite inferences where the verb f is
non-veridical and s1 entails s2.
● They often neglect the non-veridicality of the embedding verb
(veridicality bias [Ross and Pavlick, 2019]).
Premise: Someone believed that a man is jumping off a low wall. [f(s1)]
Hypothesis: A man is jumping a low wall. [s2]
(Gold: Non-entailment, Prediction: Entailment)
(Results on naturalistic data)
3. Experiments
Conclusion
Motivation
Evaluate the systematic generalization ability of neural NLI
models on transitivity inferences
Approach
Analyze models with synthetic and naturalistic transitivity
inference datasets involving veridicality
Main results
25
● Current models fail to consistently perform transitivity inference
● Models can memorize composite inference examples, but do not
have the intended ability to combine basic inferences
Thanks! Hitomi Yanaka hyanaka@is.s.u-tokyo.ac.jp
Data and Code: https://github.com/verypluming/transitivity