Temporal reasoning task

San Kim
2021.06.30
1. “Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal
Commonsense Understanding (EMNLP 19) – MCTACO
2. TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions (EMNLP 20)
3. Temporal Reasoning on Implicit Events from Distant Supervision (NAACL 21) – TRACIE

MCTACO (Multiple choice temporal common-sense)
Temporal commonsense
Given two events “going on a vacation” and “going for a walk,” most humans would know that a
vacation is typically longer and occurs less often than a walk, but it is still challenging for computers
to understand and reason about temporal commonsense.
5 temporal properties
• Duration (how long an event takes)
• Temporal ordering (typical order of events)
• Typical time (when an event happens)
• Frequency (how often an event occurs)
• Stationarity (whether a state holds for a very long time or
indefinitely)

MCTACO
• MCTACO is comprised of 13k tuples, in the form of (sentence, question, candidate answer).
• The sentences in those tuples are randomly selected from MultiRC
• Collect questions and candidate answers (both correct and wrong ones) using AMT.
• To ensure the quality of the results, they limit the annotations to native speakers and use
qualification tryouts.
• Step1. Question Generation
• Should ask about one of the five temporal phenomena
defined earlier
• Should not be solved simply by a word or phrase from
the original sentence
• They also require crowdsourcers to provide a correct
answer for each of their questions
• Step2. Question verification
• They ask another two crowdsourcers to
check the questions generated in Step 1:
(a) whether the two requirements are
satisfied and (b) whether the question is
grammatically and logically correct.
• For valid questions, they continue to ask
crowdsourcers to give one correct answer
and one incorrect answer

MCTACO
• Step3. Candidate answer expansion
• Until this stage, they have collected a small
set of candidate answers (3 positive and 2
negative) for each question.
• Automatically expand this set in three ways
• Use a set of rules to extract numbers
and quantities (“2”, “once”) and temporal
terms (e.g. “a.m.”, “1990”, “afternoon”,
“day”), and then randomly perturb them
based on a list of temporal
units (“second”), adjectives (“early”),
points (“a.m.”) and adverbs (“always”).
(“2 a.m.” → “3 p.m.”, “1 day” → “10
days”, “once a week” → “twice a month”)
• Mask each individual token in a
candidate answer (one at a time) and
use BERT to predict replacements for
each missing term; they rank those
predictions by the confidence level of
BERT and keep the top three.
• For candidates that represent events, the
token-level perturbations above rarely lead to
an interesting and diverse set of candidate
answers, and may produce invalid phrases
(e.g., “he left the house” → “he walked
the house”). Therefore, to perturb such
candidates, they create a pool of 60k
event phrases using PropBank and
replace the candidate answers with the
most similar ones retrieved by an
information retrieval (IR) system.
• Expand the candidate answer set to 20
candidates per question.
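The rule-based perturbation in Step 3 can be sketched roughly as follows. This is only an illustration: the function name `perturb_quantity`, the unit list, and the 1 to 12 number range are assumptions, not MCTACO's actual rule set.

```python
import random
import re

# Hypothetical unit list; the real rules are not given in the slides.
UNITS = ["second", "minute", "hour", "day", "week", "month", "year"]

def perturb_quantity(answer, rng):
    """Swap the first number and the first temporal unit for random alternatives."""
    def new_number(match):
        old = int(match.group())
        # Pick any number from 1 to 12 other than the original one.
        return str(rng.choice([n for n in range(1, 13) if n != old]))

    out = re.sub(r"\d+", new_number, answer, count=1)
    for unit in UNITS:
        pattern = rf"\b{unit}(s?)\b"
        if re.search(pattern, out):
            other = rng.choice([u for u in UNITS if u != unit])
            # Keep the plural "s" if the original unit had one.
            out = re.sub(pattern, lambda m: other + m.group(1), out, count=1)
            break
    return out
```

Calling `perturb_quantity("1 day", random.Random(0))` returns a phrase with a different number and a different unit; answers without numbers or units are returned unchanged.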
• Step4. Answer labeling
• Each (sentence, question, answer) tuple
produced earlier is labeled by 4
crowdsourcers, with three options: “likely”,
“unlikely”, or “invalid”.
• A tuple is kept only if all 4 annotators
agree on “likely” or “unlikely”.
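The Step 4 filter amounts to a unanimity check. A minimal sketch, assuming each tuple comes with a list of four vote strings (the function name `aggregate_label` is illustrative):

```python
from typing import List, Optional

def aggregate_label(votes: List[str]) -> Optional[str]:
    """Return "likely" or "unlikely" if every annotator chose it, else None (tuple dropped)."""
    if len(set(votes)) == 1 and votes[0] in {"likely", "unlikely"}:
        return votes[0]
    return None
```

A single dissenting or “invalid” vote is enough to drop the tuple, which trades dataset size for label quality.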

TORQUE
• Time is important for understanding events and stories
described in natural language text.
• “he won the championship yesterday” is different from “he
will win the championship tomorrow” (explicit)
• If we read that a woman is “expecting the birth of her first
child”, we know that the birth is in the future, while if she
is “mourning the death of her mother”, the death is in the
past. (implicit)
• These relationships between an event and a time point
(e.g. “won the championship yesterday”) or between two
events (e.g., “expecting” is before “birth” and “mourning”
is after “death”) are called temporal relations.

TORQUE
• Challenges of RC for temporal relationships
1. Reading comprehension works rarely require event understanding. Most datasets largely only
require an understanding of predicates and arguments, and would ask questions like “what was a
woman trapped in?”. But a temporal relation question would be “what started before a woman was
trapped?” To answer it, the system needs to identify events (e.g., LANDSLIDE is an event and “body”
is not), the time of these events (e.g., LANDSLIDE is correct answer, while SAID is not because of
the time when the two events happen), and look at the entire passage rather than the local
predicate-argument structures within a sentence (e.g., SNOW and RAINFALL are correct answers to
the question above).
2. There are many events in a typical passage of text, so temporal relation questions typically query
more than one relationship at the same time. This means that a question can have multiple
answers (e.g., “what happened after the landslide?”), or no answers, because the question may be
beyond the time scope (e.g., “what happened before the snow started?”)

TORQUE
3. Temporal relations queried by natural language questions are often sensitive to a few key words
such as before, after, and start. Those questions can easily be changed to make contrasting
questions with dramatically different answers. Models that are not sensitive to these small
changes in question words will perform poorly on this task.
[Figure: example passage about a landslide with temporal relation questions; the answers are lists of event triggers (e.g., “landslide”; “searching, said, found”), and some questions have no answers.]

TORQUE
• Annotate 3.2k text snippets randomly selected from the TempEval3 dataset.
• TORQUE has 25k events and 21k user-generated and fully answered temporal relation questions.
• RoBERTa-large achieves 51% in exact match on TORQUE after fine-tuning, about 30% behind
human performance.
Events
• Generally speaking, an event involves a predicate and its arguments.
• When studying time, events were defined as actions/states triggered by verbs, adjectives, and
nominals.
• This work follows this line of event definition and uses event and event trigger interchangeably.
• Define an event to be either a verb or a noun.
• In copular constructions, they choose to label the verb as the event, instead of an adjective or
preposition. This gives consistent treatment of “she was on the east coast yesterday” and “she was happy”
and is easy to teach to crowd workers. (Note that from the perspective of data collection, labeling the
copula does not lose information as one can always do post-processing using dependency parsing
or semantic role labeling to recover the connection between “was” and “happy”.)

TORQUE
Events
• Events expressed in text are not always factual. They
can be negated, uncertain, hypothetical or have
associated modalities.
• Prior work dealing with events often tried to
categorize and label these various aspects because
they were crucial for determining temporal relation.
• TORQUE simply has people label all events, irrespective of
their modality, and use natural language to describe
relations between them.

TORQUE
Temporal Relations
• The relationship between two events with respect to time, or
between one event and a fixed time point.
• (A, r, B) – A and B are events or time points, and r is a
temporal relation. (e.g. (HAD, happened before, SLEPT) – first
sentence in Fig. 3)
• In previous works, every event is assumed to be associated
with a time interval. When comparing two events, there are
13 possible relation labels.
• There are still many relations that cannot be expressed
because the assumption that every event has a time interval
is inaccurate: The time scope of an event may be fuzzy, an
event can have a non-factual modality, or events can be
repetitive and invoke multiple intervals.
• To better handle these phenomena, they use natural
language to annotate the relationships between events.
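For reference, the 13 interval labels can be enumerated with a small function. This sketch assumes the 13 labels are those of Allen's interval algebra and that each event is a closed interval with start < end; the function name is illustrative.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen relation of interval A to interval B (each needs start < end)."""
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only partial overlaps remain at this point.
    return "overlaps" if a_start < b_start else "overlapped-by"
```

Every label requires exact start and end points for both events, which is precisely what fuzzy, non-factual, or repetitive events fail to provide.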

TORQUE
Natural Language Annotation of Temporal Relations
• (A, r, B): a temporal relation between two events
• (?, r, B) : a temporal relation question
• (?, happened before, SLEPT): natural language
expression → “what happened before a lion slept?”
• (A, r, B) holds, assuming for any deictic expression A
or B the time point when the passage was written,
and assuming that the passage is true.

TORQUE
Advantages of Natural Language Annotation
• DISRUPTION and FLOODING happened at about
the same time, but we do not know for sure which
one is earlier, so we have to choose vague.
• SNOW and DISRUPTION, we do not know which
one ends earlier and have to choose vague.
• The question-answer (QA) pairs can naturally
capture these fuzzy relations.

TORQUE
• Natural language questions can conveniently
incorporate different modes of events.
• Example (figure): the relation between “having a meal” and
“sleeping”
• If we could only choose one label, we would have to
choose before for all these relations, although
these relations are actually different.
• A repetitive event may be a series of intervals
rather than a single one, and “often before” is
very different from “before”.

TORQUE
• The format of natural language questions
bypasses the need for explicit annotation of
properties of events or other theories.
• The annotator naturally avoids event pairs that
do not have relations.
• “what happened after the service
industries are hardest hit?”
• “what happened after a passerby reported
the body?”
• “what was expected to happen when the
crisis hit America?”
• “what was supposed to happen after a
passerby called the police?”
• It still remains difficult to have a theory
explaining why hit can compare to expected and crisis,
but not to gains.

TORQUE
Penalize Shortcuts by Contrast Sets
• An important problem in building datasets is to
avoid trivial solutions.
• Contrast questions: which slightly modify the
original questions, but dramatically change the
answers
• For an existing question (?, r, B) (e.g., “what
happened after he ate his breakfast?”)
• Keep using B and change r (e.g., “what
happened before/shortly after/… he ate his
breakfast?”)
• Modify it to ask about the start/end time
(e.g., “what happened after he started
eating his breakfast?” or “what would finish
after he ate his breakfast?”)
• Check that the answers to the new question
are different from the original one to avoid
trivial modifications (e.g., changing “what
happened” to “what occurred”)
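The contrast-question procedure can be sketched as a filter over relation swaps. In TORQUE the questions and answers come from annotators; here the relation list, the question template, and the `answers` lookup are toy assumptions used only to show the “answers must differ” check.

```python
from typing import Dict, FrozenSet, List

# Hypothetical relation phrases; annotators write these freely in TORQUE.
RELATIONS = ["happened before", "happened after", "happened shortly after"]

def render(relation: str, event: str) -> str:
    """Turn (?, r, B) into a natural language question."""
    return f"what {relation} {event}?"

def contrast_questions(relation: str, event: str,
                       answers: Dict[str, FrozenSet[str]]) -> List[str]:
    """Keep B, change r, and keep only questions whose answers differ from the original."""
    original = answers[render(relation, event)]
    kept = []
    for r in RELATIONS:
        question = render(r, event)
        if r != relation and question in answers and answers[question] != original:
            kept.append(question)
    return kept
```

A modification whose answer set equals the original one is discarded, which is the same reason a rewording such as “what occurred” does not count as a contrast question.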

TORQUE
Data Collection
• Passages that consist of two contiguous
sentences, as this is sufficient to capture the
vast majority of non-trivial temporal relations.
• Create a pool of 26k two-sentence passages
from the TempEval3 workshop (2.8k articles)
• 1. Label all the events
• 2. Repeatedly do the following
• (a) Ask a temporal relation question and
point out all the answers from the list of
events
• (b) Modify the temporal relation to create
one or more new questions and answer
them.
Quality Control
• Qualification: crowd workers were trained and
tested on 3 capabilities: labeling events,
asking temporal relation questions, and
question answering. Crowd workers were
considered level-1 qualified if they could pass
the test within 3 attempts. (1/3 workers
passed the qualification.)
• Pilot: asked level-1 crowd workers to do a
small amount of the real task. They manually
checked the annotations and gave feedback
to them. Roughly 1 out of 3 pilot submissions
received a level-2 qualification. In the end,
there were 63 level-2 annotators, and 60 of
them actually worked on the large-scale task.
• Validation: 20% of articles. 5 different level-2
annotators (including the original annotator) validated
the events and answers. They intentionally
added noise to the original data for quality
control. They did not do additional validation
for the questions because there were no bad
questions in a random sample of 100.

TORQUE
Cost
• 3 passages were presented.
• The crowd worker could decide to use some
or all of them.
• For each passage a worker decided to use,
they needed to label the events, answer 3
hard-coded warm-up questions, and then ask and
answer at least 12 questions (including
contrast questions). The final reward is a base
pay of $6 plus $0.5 for each extra question (up
to $4).
• Incentive
• (1) use fewer passages so that they can
do event labeling and warm-up questions
fewer times.
• (2) modify questions instead of asking
from scratch
• (3) ask extra questions in each job.
• In practice, crowd workers on average used 2
passages in each job.
• Validating the events in each passage and the
answers to a specific question both cost $0.1.
• In total, TORQUE cost $15k for an average of
$0.7/question.
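The reward scheme is simple arithmetic. A minimal sketch, assuming “(up to $4)” means the bonus for extra questions is capped at $4 (the function and parameter names are illustrative):

```python
def job_reward(extra_questions: int, base: float = 6.0,
               per_extra: float = 0.5, bonus_cap: float = 4.0) -> float:
    """Base pay plus a per-question bonus, with the bonus capped (assumed reading)."""
    if extra_questions < 0:
        raise ValueError("extra_questions must be non-negative")
    return base + min(per_extra * extra_questions, bonus_cap)
```

Under this reading a job pays between $6 and $10. The reported average is also easy to check: $15k over 21.2k questions is about $0.7 per question.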
Statistics
• 3.2k passage annotations (~50 tokens/passage)
• 24.9k events (7.9 events/passage)
• 21.2k user-provided questions (half of them
were labeled by crowd workers as
modifications of existing ones)
• 94 / 200 questions query relations
that cannot be directly represented by the
previous single-interval-based labels.

TRACIE: Temporal Reasoning on Implicit Events from Distant Supervision
MATRES, ACL 18
TempEval1-3, TimeBank-Dense(TB-Dense), EventTimeCorpus
Before, after, equal, vague
Based on start-points

• When reading a story, a human can construct
a latent timeline about events’ start and end
times.
• The timeline not only contains the placements
of explicitly mentioned events (e.g., ride a
bicycle), but also accounts for implicit events
(e.g. Farrah was distracted so she looked away).
• The ability to construct such a timeline is
essential for understanding the causal
dynamics of a situation.
• Contributions
• A temporal relation dataset TRACIE
focusing on implicit events
• A distant supervision process for temporal
understanding of implicit events
• A reasoning model that makes end-time
comparisons using predictions of start-
time distances and durations

• Such tests in TRACIE take the form of multi-premise textual entailment (TE)
• Each TRACIE instance contains
• A context story (or premise) consisting of a sequence of explicit narrative events
• An implicit event in the form of a natural language phrase that is unmentioned but has
some role in the story
• An explicit event also in the form of a phrase
• A comparator of either {starts, ends}
• A temporal relation of either {before, after} that marks the relationship in the dimension
defined by the comparator between the implicit-event and the explicit-event

• Such tests in TRACIE take the form of multi-premise textual entailment (TE)
• Premise: context story
• Hypotheses: temporal queries about pair-wise relations between implicit and explicit events
• E.g. “avoids” is implicit-event, “starts” is the comparator, “removed” is explicit-event and “before”
is the temporal-relation.
• Flip the temporal-relation (i.e., “before” to “after” and vice versa) to create
negative(contradiction) instances.
• Use start times of explicit-events as reference points and compare the implicit-event’s start or
end time with them, according to the label definitions (Fig. 3)
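A TRACIE instance and the label-flipping step can be sketched with a small record type. The class name, field names, and the example phrases in the usage note are illustrative assumptions; only the five components and the before/after flip come from the slides.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TracieInstance:
    story: str            # premise: the context story
    implicit_event: str   # phrase not mentioned in the story
    comparator: str       # "starts" or "ends"
    relation: str         # "before" or "after"
    explicit_event: str   # phrase taken from the story
    label: str = "entailment"

    def hypothesis(self) -> str:
        """Render the temporal query as a single hypothesis sentence."""
        return f"{self.implicit_event} {self.comparator} {self.relation} {self.explicit_event}"

def flip(instance: TracieInstance) -> TracieInstance:
    """Swap before/after and the entailment label to get the opposite instance."""
    relation = "after" if instance.relation == "before" else "before"
    label = "contradiction" if instance.label == "entailment" else "entailment"
    return replace(instance, relation=relation, label=label)
```

For example, flipping an entailed hypothesis such as “Tom avoids the mess starts before he removed the lid” yields the contradiction “Tom avoids the mess starts after he removed the lid”.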

Implicit Event Generation
• Randomly sample short stories from the
ROCStories dataset
• For each story, one annotator writes 5 implicit
event phrases that are not explicitly mentioned
by the given story, but are inferable and
relevant.
• The annotator additionally rewrites the two explicit events closest
to the implicit event’s start and end time,
respectively.
• Build two TRACIE instances (minus the
temporal-relation) per implicit event
Automatic Instance Generation
• Extract all verbs and relevant arguments with
the semantic role labeling model in AllenNLP
• Construct a pool of explicit events in the form
of short phrases (using verbs and their
arguments)
• For each implicit event, randomly select two
{explicit-event, comparator} pairs from the pool.
Label Collection
• For each of the 20 instances per story,
annotate the temporal-relation with four
different annotators.
• Take the majority label as the final label and filter
out instances without agreement.
• Two authors additionally verified the instances
with ambiguous verbs (e.g., “have”) and
corrected 5% of the end-time instances.
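The label aggregation can be sketched as a majority vote. This assumes “majority” means a strict majority of the four annotators, so a 2-2 split is filtered out; the slides do not spell out the tie rule.

```python
from collections import Counter
from typing import List, Optional

def majority_label(votes: List[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None
```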

Pattern-Based Pre-Training
• Distant Supervision
• Within-sentence Extraction
• Collect start-time comparisons between pairs of events heuristically from free text using
“before/after” keywords
• Use AllenNLP’s SRL model to process each input sentence and find verbs with a temporal
argument that starts with either “before” or “after” and contains at least one other verb.
• If there are multiple verbs in the temporal argument, take the one with the largest number
of tokens as arguments.
• 2.8M instances from a Wikipedia dump (May 2020)

Pattern-Based Pre-Training (PTNTIME)
• Distant Supervision
• Cross-sentence Extraction
• The data collected from the within-sentence patterns
does not reveal the relative distance between two
start times.
• Finds direct temporal expressions of hours and dates.
• Because these temporal expressions (e.g., 2021-01-01)
are globally comparable, the compared events can be
anywhere in a document.
• This process collects more supervision signals about
time-point comparisons and their relative distance
on event pairs with trivial causal relation.
• Find exact temporal values by filling unmentioned
elements of a temporal expression with the nearest
previous mention (e.g., add “January” to the expression
“the 10th” in Fig. 4)

• Cross-sentence Extraction
• Construct supervision instances under the assumption
that the extracted temporal expressions describe the
start times of the associated verbs (e.g., “went” started
on January 1st)
• Represent the differences between the two start times
as one of seven coarse temporal units: {<=minutes,
hours, days, weeks, months, years, >= decades}
• E.g., “go to park” is weeks before “write review”, as shown in
Fig. 4
• Couple the specialized temporal pre-training data
described above with additional paragraphs that are used
to perform conventional language model pretraining using
the original denoising task (T5).
• Input sequence: “event: [EventA] starts [Relation]
[EventB]. story: [Paragraph]”; output sequence:
“answer: [Label] [Distance]”.
• [Paragraph] is non-empty only for cross-sentence extractions.
• [Label] is either positive or negative.
• [Distance] is one of the 7 coarse temporal units,
represented with a set of blank tokens [extra_id_N].
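The cross-sentence instances can be sketched as follows. The thresholds that map a time difference to one of the seven units, the unit-to-sentinel mapping (`<extra_id_0>` to `<extra_id_6>`), and the example dates are all assumptions made for illustration; the slides only state the seven units and the input/output layout.

```python
from datetime import datetime

UNITS = ["<=minutes", "hours", "days", "weeks", "months", "years", ">=decades"]
# Hypothetical upper bounds (in seconds) for every unit except the last one.
BOUNDS = [3600, 86400, 7 * 86400, 30 * 86400, 365 * 86400, 3650 * 86400]

def coarse_distance(start_a: datetime, start_b: datetime) -> str:
    """Bucket the absolute gap between two start times into a coarse unit."""
    seconds = abs((start_b - start_a).total_seconds())
    for unit, bound in zip(UNITS, BOUNDS):
        if seconds < bound:
            return unit
    return UNITS[-1]

def build_example(event_a, relation, event_b, start_a, start_b, paragraph=""):
    """Build one (input, output) text pair in the layout described above."""
    holds = start_a < start_b if relation == "before" else start_a > start_b
    label = "positive" if holds else "negative"
    unit_index = UNITS.index(coarse_distance(start_a, start_b))
    source = f"event: {event_a} starts {relation} {event_b}. story: {paragraph}"
    target = f"answer: {label} <extra_id_{unit_index}>"  # assumed sentinel mapping
    return source, target
```

With start times two weeks apart (for instance January 1st and January 15th), `coarse_distance` returns `"weeks"`, matching the “weeks before” reading of the example.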

Symbolic Temporal Reasoning Model (SYSTIME)
• This model makes end-time comparisons by symbolically combining start-time distance and
duration, each taken from a separate prediction.
• Does not rely on explicit annotations on timepoints, but only relative comparisons between
them.

Symbolic Temporal Reasoning Model (SYSTIME)

Duration estimation – pretrain sequence-to-sequence model
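$r_{\mathrm{ends}}(e_1, e_2) = \mathit{before} \iff \mathrm{dist}(e_1, e_2) + \mathrm{dur}(e_1) < 0$
$r_{\mathrm{ends}}(e_1, e_2) = \mathit{after} \iff \mathrm{dist}(e_1, e_2) + \mathrm{dur}(e_1) > 0$
$r_{\mathrm{starts}}(e_1, e_2) = \mathit{before} \iff \mathrm{dist}(e_1, e_2) < 0$
$r_{\mathrm{starts}}(e_1, e_2) = \mathit{after} \iff \mathrm{dist}(e_1, e_2) > 0$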
• Use duration data from TimeM (1M events and
duration values)
• Input sequence event: [Event] story: [Story]
• Output sequence answer: [Value]
• [Event] represents the tokens of an event with
the trigger verb marked by a special token to its
left
• [Story] represents tokens from the context
• [Value] is one of the 7 unit labels (i.e., {<= minutes,
hours, …})
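The four decision rules on this slide translate directly into code. A minimal sketch, assuming dist and dur are already numeric values on the same scale and that the undefined boundary case (a sum of exactly 0) falls to “after”:

```python
def start_relation(dist: float) -> str:
    """r_starts(e1, e2): e1 starts before e2 iff dist(e1, e2) < 0."""
    return "before" if dist < 0 else "after"  # dist == 0 is not covered by the rules

def end_relation(dist: float, dur_e1: float) -> str:
    """r_ends(e1, e2): e1 ends before e2 starts iff dist(e1, e2) + dur(e1) < 0."""
    return "before" if dist + dur_e1 < 0 else "after"
```

An event that starts before another can still end after it: with `dist = -1` and `dur_e1 = 2`, `start_relation` gives `"before"` while `end_relation` gives `"after"`.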

Approximate dist() function using output from PTNTIME
• Input sequence: “event: [EventA] starts
[Relation] [EventB]. story: [Paragraph]”; output
sequence: “answer: [Label] [Distance]”.
• [EventA]: the textual description of e1
• [EventB]: the textual description of e2
• [Paragraph] the context (premise)
• Fix [Relation] to be before.
• By taking the values of the vocabulary indices
corresponding to “positive” and “negative” from
the logits of [Label] and applying a softmax
operation, get P_before, P_after. P = [P_before,
P_after]
• Apply softmax to the logits of [Distance] over the
7 words representing the temporal units to obtain
7 values that approximate the prob. of each distance.
Place the 7 values in the temporal units’ increasing
order in vector d, with c = [0, 1, 2, 3, 4, 5, 6].
• To get the direction, apply the tanh function to
the difference between the two prob. in P.
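One way to put these pieces together is sketched below. The slides do not give the final formula, so two choices here are assumptions: the magnitude is taken as the expected unit index (the dot product of d and c), and the sign is taken as tanh(P_after − P_before) so that “before” corresponds to dist < 0, consistent with the r_starts rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def approx_dist(label_logits, distance_logits):
    """Approximate dist(e1, e2) from the [Label] and [Distance] logits.

    label_logits: (logit of "positive", logit of "negative") with [Relation] = before.
    distance_logits: 7 logits, ordered from the smallest to the largest temporal unit.
    """
    p_before, p_after = softmax(list(label_logits))
    d = softmax(list(distance_logits))
    c = [0, 1, 2, 3, 4, 5, 6]
    magnitude = sum(d_i * c_i for d_i, c_i in zip(d, c))  # expected unit index
    direction = math.tanh(p_after - p_before)             # assumed sign convention
    return direction * magnitude
```

When the model is confident that e1 starts before e2, the result is negative; when the two label logits are equal, the direction term is 0 and so is the distance.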

• T5-Large for PTNTIME and the duration model.
• PTNTIME: 45k steps (1.4M instances); duration
model: 80k steps (2.6M instances)
• Plugging these pretrained weights into SYSTIME gives SYSTIME
ZEROSHOT, which uses no TRACIE supervision.
• Story-wide exact match metric, which is the
percentage of stories with all its related
hypotheses answered correctly.
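The story-wide exact match metric can be computed as below. This is a sketch that assumes predictions arrive as (story id, gold label, predicted label) triples; the function name is illustrative.

```python
from collections import defaultdict

def story_exact_match(records):
    """Fraction of stories whose hypotheses are all predicted correctly.

    records: iterable of (story_id, gold_label, predicted_label) triples.
    """
    all_correct = defaultdict(lambda: True)
    for story_id, gold, predicted in records:
        all_correct[story_id] = all_correct[story_id] and gold == predicted
    if not all_correct:
        return 0.0
    return sum(all_correct.values()) / len(all_correct)
```

A single wrong hypothesis zeroes out its whole story, so this metric is much stricter than per-instance accuracy.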

• Uniform-dist: in the i.i.d. training set, 70% of the examples with the comparator “ends” and relation “after” are
positive. For the uniform-dist setting, randomly remove instances from the majority classes.

• Train and evaluate only on the instances with a label
of either “before” or “after”, which account for
about 80% of all instances.
• OT-NS(original test, no story): train and test with
only the sentences containing the trigger verbs
• OT: train and test with the entire document as an
auxiliary input
• OT-MS(original test, minimal supervision): train
with 1.2k (6%) training instances
• PT (perturbed test): train with the complete
training set and test on a perturbed test set from
“Evaluating Models’ Local Decision Boundaries via
Contrast Sets”.
