TIE: A Framework for Embedding-based Incremental Temporal Knowledge Graph Completion
Jiapeng Wu (myself), Yishi Xu, Yingxue Zhang, Chen Ma, Mark Coates, Jackie Cheung
McGill University, School of Computer Science
Quebec AI Institute (MILA)
Montreal Research Center, Huawei Noah’s Ark Lab
University of Montreal (UdeM)
Answer queries (?, r, o, t) and (s, r, ?, t)
Temporal Knowledge Graph
Image is from Trivedi, Rakshit, et al. "Know-evolve: Deep temporal reasoning for dynamic
knowledge graphs." International Conference on Machine Learning. PMLR, 2017.
Challenges and Contributions
Notation
• A TKG is denoted as a sequence of graphs: , where
• and denote the sets of entities and relations at time step t
• denotes the set of observed triples at time t
• Let denote the set of true triples at time t such that
• The set of missing facts can be written as .
• TKG Completion (TKGC): given an object query for a quadruple, the model aims to rank the correct entity o as high as possible.
Challenge 1
Previous work in TKGC does not address the incremental learning setting, where new changes (additions and deletions) to the knowledge base become available at each new time step.
The model is expected to adapt to these changes while maintaining its knowledge of historical facts.
Naively fine-tuning the model on all the data at each new time step causes catastrophic forgetting (CF), meaning that performance on historical facts degrades.
Contribution 1
• We introduce a new task dubbed incremental
TKGC.
• We propose Temporal Incremental Embedding
(TIE), a training and evaluation framework that
integrates incremental learning techniques with
standard TKGC approaches.
• To address this challenge, TIE combines TKG representation learning, experience replay, and temporal regularization to improve model performance and alleviate catastrophic forgetting.
Challenge 2
• Standard TKGC metrics such as Hits@10 and MRR only measure overall performance across all time steps and omit the dynamic aspects.
• There are no metrics that evaluate how well a model forgets deleted facts.
• For example, the quadruple (Trump, presidentOf, US, 2020) is true, whereas (Trump, presidentOf, US, 2021) is false. Hence, we would like the model to rank Biden higher than Trump given the query (?, presidentOf, US, 2021).
Contribution 2
• To measure TKGC models’ ability to discern
facts that were true in the past but false at
present, we propose new evaluation metrics
dubbed Deleted Facts Hits@10 (DF) and
Reciprocal Rank Difference Measure (RRD).
• We explicitly associate deleted facts with negative labels and integrate them into the training process, which yields improvements on both metrics.
Challenge 3
• A naïve way to adapt to changes in the knowledge base is to train a new model on all the available data at each new time step, which is very inefficient.
• A more practical alternative is a sliding-window approach, which trains the model on the data accumulated over the past x time steps.
Contribution 3
• We show that training using only the added facts significantly speeds up training and reduces the dataset size by around 10x.
• Meanwhile, it maintains a performance level similar to naïve fine-tuning.
Main Results
We adapt two existing TKGC models to our incremental learning framework and conduct experiments on the Wikidata12k and YAGO11k datasets.
The proposed TIE framework reduces training time by about 10x and improves some of the proposed metrics compared to full-batch training, with no significant loss on any traditional metric.
Extensive ablation studies reveal the
performance trade-offs among different
evaluation metrics, providing insights for
choosing among model variations.
Task setup
Standard vs Incremental TKGC
• Training and evaluation — Standard: once, over time steps 1 to T; Incremental: T times, once at each time step t
• Training set
• Test set
• Candidates
TKG Encoder-Decoder
TKG Encoder-Decoder Framework
• Encoder: learns entity and relation embedding matrices
and , together with a time-sensitive function. Examples:
• Diachronic Embedding (Goel et al., 2019)
• HyTE (Dasgupta et al., 2018)
• Decoder: scores each triple (s, r, o) so that true triples receive much higher values than false ones. Examples:
• TransE (Bordes et al., 2013)
• DistMult (Yang et al., 2014)
TKG Encoder-Decoder Framework
• We use and to denote the representation of entity i and
relation r at time step t, and use to denote the parametrized
model at time t. The score for a quadruple is written as:
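To make the encoder-decoder split concrete, here is a minimal PyTorch sketch that pairs a diachronic-style entity encoder with a DistMult decoder; the sine activation, dimensions, and all names are illustrative assumptions rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ToyTKGModel(nn.Module):
    """Minimal encoder-decoder TKG scorer (illustrative, not the paper's exact model)."""

    def __init__(self, num_entities, num_relations, dim=64):
        super().__init__()
        # Encoder: a static entity part plus a time-sensitive (diachronic-style) part.
        self.ent_static = nn.Embedding(num_entities, dim)
        self.ent_amp = nn.Embedding(num_entities, dim)   # amplitude of the temporal signal
        self.ent_freq = nn.Embedding(num_entities, dim)  # frequency of the temporal signal
        self.rel = nn.Embedding(num_relations, dim)

    def encode_entity(self, e, t):
        # Time-sensitive entity representation (sine activation is an assumption).
        t = t.unsqueeze(-1).float()
        return self.ent_static(e) + self.ent_amp(e) * torch.sin(self.ent_freq(e) * t)

    def score(self, s, r, o, t):
        # DistMult-style decoder: true quadruples should receive higher scores.
        return (self.encode_entity(s, t) * self.rel(r) * self.encode_entity(o, t)).sum(-1)

# Usage: score one quadruple (s=0, r=1, o=2, t=5).
model = ToyTKGModel(num_entities=100, num_relations=10)
print(model.score(torch.tensor([0]), torch.tensor([1]), torch.tensor([2]), torch.tensor([5])))
```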
Evaluation Metrics
Current and Average Metrics
Assuming we are at time step t, the standard Hits@10 is defined as:
Let be the Hits@10 value evaluated on using the
model trained at time step t.
The current Hits@10 is often of the most interest:
The average Hits@10 measures the historical average performance,
which implies the degree of catastrophic forgetting:
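As a sketch of how these aggregates can be computed, assume hits10[k][t'] stores the Hits@10 value on the test set of time step t' evaluated with the model trained at step k; the storage layout and names are hypothetical.

```python
def hits_at_10(ranks):
    """Standard Hits@10: fraction of queries whose correct entity ranks in the top 10."""
    return sum(r <= 10 for r in ranks) / len(ranks)

def current_hits(hits10, t):
    """Current Hits@10: model trained at step t, evaluated on the test set of step t."""
    return hits10[t][t]

def average_hits(hits10, t):
    """Average Hits@10: model trained at step t, averaged over all test sets up to t.
    A large drop relative to current_hits indicates catastrophic forgetting."""
    return sum(hits10[t][tau] for tau in range(t + 1)) / (t + 1)

# Usage with a toy 3-step history (numbers are made up).
hits10 = {2: {0: 0.41, 1: 0.48, 2: 0.60}}
print(current_hits(hits10, 2), average_hits(hits10, 2))
```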
Intransigence Metrics
• Intransigence: the inability of an algorithm to identify knowledge that was true in the past but is false at present. We call such a fact a deleted fact.
• E.g., a student is no longer affiliated with their college after graduating.
• Deleted Facts Hits@10 (DF): Hits@10 computed over the deleted facts; the lower the better.
Intransigence Metrics
• Reciprocal Rank Difference (RRD): reciprocal of the difference
between the ranks of the valid facts and deleted facts, the higher the
better:
• Here is the collection of deleted objects, and is the
normalizing constant.
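A small sketch of how DF-Hits@10 and an RRD-style score could be computed from ranks; the exact pairing of valid and deleted objects and the normalizing constant used in the paper are not recoverable from the slides, so this form is an assumption.

```python
def df_hits_at_10(deleted_ranks):
    """Deleted Facts Hits@10 (DF): fraction of deleted facts still ranked in the
    top 10 by the current model -- the lower the better."""
    return sum(r <= 10 for r in deleted_ranks) / len(deleted_ranks)

def rrd(valid_ranks, deleted_ranks):
    """Reciprocal Rank Difference (RRD), the higher the better.
    Assumed form: average reciprocal of the rank gap over (valid, deleted) pairs."""
    pairs = [(rv, rd) for rv in valid_ranks for rd in deleted_ranks if rd != rv]
    return sum(1.0 / (rd - rv) for rv, rd in pairs) / len(pairs)

# Usage: the valid object ranks 1st; two deleted objects rank 4th and 12th.
print(df_hits_at_10([4, 12]), rrd([1], [4, 12]))
```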
TIE: Temporal Incremental Embedding
Overview
• Learning with deleted facts and added facts
• Experience Replay
• Temporal Regularization
Learning with deleted facts and added facts
Learning with Deleted Facts
• Construct a set of deleted facts:
• Then, associate each quadruple with a negative label, and derive the
binary cross entropy loss:
• The goal is to reduce the intransigence of the learned model.
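A minimal sketch of the deleted-fact objective, assuming each deleted quadruple receives label 0 and its score is pushed down with binary cross-entropy; it reuses the toy model's score() interface from the earlier sketch, and the exact loss form is an assumption.

```python
import torch
import torch.nn.functional as F

def deleted_fact_loss(model, deleted_batch):
    """Binary cross-entropy that assigns label 0 to deleted facts, pushing their
    scores down in order to reduce the intransigence of the learned model."""
    s, r, o, t = deleted_batch  # index tensors for subjects, relations, objects, times
    logits = model.score(s, r, o, t)
    return F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
```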
Dataset Statistics
Figure 2: Dataset statistics of Wikidata12k (left) and YAGO11k (right). The three curves show the numbers of facts (total, common, and added) at each time step.
Learning with Added Facts
• Construct a set of added facts:
• The softmax cross entropy loss is derived as:
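A sketch of the added-fact objective under the assumption that a softmax cross-entropy is taken over the true object against uniformly sampled negative objects; the candidate construction and batch layout are illustrative.

```python
import torch
import torch.nn.functional as F

def added_fact_loss(model, added_batch, num_entities, num_negatives=64):
    """Softmax cross-entropy over the true object vs. uniformly sampled negatives."""
    s, r, o, t = added_batch                          # shape (batch,) index tensors
    neg_o = torch.randint(num_entities, (s.shape[0], num_negatives))
    cand = torch.cat([o.unsqueeze(1), neg_o], dim=1)  # true object sits in column 0
    logits = model.score(s.unsqueeze(1).expand_as(cand),
                         r.unsqueeze(1).expand_as(cand),
                         cand,
                         t.unsqueeze(1).expand_as(cand))
    target = torch.zeros(s.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, target)
```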
Experience Replay
Experience Replay
• The replay buffer:
• The goal is to extract , a set of positive examples from
• Uniform Sampling: is a random subset of
• Frequency-based sampling: draw samples from based on their frequencies
Frequency-based sampling
• A pattern of the triple (𝑠, 𝑟, 𝑜) refers to a regular expression with some of the elements replaced by the wildcard symbol ‘∗’.
• The set of patterns P is defined as follows:
• We use and to denote the historical and current frequency of a pattern 𝑝 at time 𝑡.
• For example, the frequencies are:
Frequency-based sampling
• The unnormalized sampling probability of a quadruple is calculated as .
• The frequency-dependent score is calculated from the temporal frequencies and a set of hyperparameters:
• The time-dependent term is simply defined as an exponential decay function:
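As a concrete sketch of this sampling scheme, under stated assumptions: the weight of each buffered quadruple multiplies a frequency-dependent score for its patterns by an exponential time decay. The pattern set, the hyperparameter names alpha and lam, and sampling with replacement are illustrative choices; the inverse-frequency variant discussed on the next slide is shown as a one-line swap (inverse=True).

```python
import math
import random
from collections import Counter

def patterns(s, r, o):
    """Patterns of a triple: some elements replaced by the wildcard '*'."""
    return [(s, r, "*"), ("*", r, o), (s, "*", o), ("*", r, "*")]

def replay_sample(buffer, t_now, k, alpha=1.0, lam=0.1, inverse=False):
    """Draw k quadruples from the replay buffer with probability proportional to
    a frequency-dependent score times an exponential time decay (with replacement,
    for simplicity)."""
    freq = Counter(p for (s, r, o, t) in buffer for p in patterns(s, r, o))
    weights = []
    for (s, r, o, t) in buffer:
        f = sum(freq[p] for p in patterns(s, r, o))
        score = (1.0 / f) if inverse else float(f) ** alpha   # inverse variant favors rare patterns
        weights.append(score * math.exp(-lam * (t_now - t)))  # exponential time decay
    return random.choices(buffer, weights=weights, k=k)

# Usage on a toy buffer of (s, r, o, t) quadruples.
buf = [("a", "livesIn", "x", 1), ("b", "livesIn", "x", 2), ("c", "worksAt", "y", 3)]
print(replay_sample(buf, t_now=4, k=2))
```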
Inverse frequency-based sampling
• What can go wrong with ?
• It overly favors patterns with higher temporal frequencies.
• For example, the pattern (*, instance of, commune of France) has many matches from time steps 62–77, so the algorithm ends up selecting most of them as samples.
• An alternative formulation that increases diversity and favors low-frequency patterns is .
Representation learning: cross entropy
• Time dependent negative sampling:
• The logits output by the current model :
• The resulting replay cross-entropy (RCE) loss:
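A sketch of the replay cross-entropy (RCE) loss, assuming time-dependent negatives are drawn from the entities active at the replayed quadruple's own time step; the active_entities map is a hypothetical stand-in for however the paper restricts negatives.

```python
import torch
import torch.nn.functional as F

def replay_ce_loss(model, replay_batch, active_entities, num_negatives=64):
    """Cross-entropy over replayed positives vs. time-dependent negative objects.
    active_entities[t] is assumed to list the entity ids observed at time step t."""
    s, r, o, t = replay_batch
    neg_cols = []
    for ti in t.tolist():
        pool = torch.tensor(active_entities[ti])             # negatives restricted to step ti
        neg_cols.append(pool[torch.randint(len(pool), (num_negatives,))])
    neg_o = torch.stack(neg_cols)                            # (batch, num_negatives)
    cand = torch.cat([o.unsqueeze(1), neg_o], dim=1)
    logits = model.score(s.unsqueeze(1).expand_as(cand),
                         r.unsqueeze(1).expand_as(cand),
                         cand,
                         t.unsqueeze(1).expand_as(cand))
    return F.cross_entropy(logits, torch.zeros(len(s), dtype=torch.long))
```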
Representation learning: knowledge distillation
• Inspired by iCaRL (Rebuffi et al., 2017), we apply the knowledge distillation loss in
addition to the standard cross entropy loss.
• Before training at time step t, we store the output of the model with the network parameters obtained after training at time step t − 1 as:
• The resulting replay knowledge distillation (RKD) loss:
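A sketch of an iCaRL-style replay knowledge distillation (RKD) loss: the previous model's sigmoid outputs on the replay buffer are stored before training at step t and then matched by the current model with binary cross-entropy; the use of per-quadruple sigmoid targets is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def store_old_outputs(old_model, replay_batch):
    """Record the step t-1 model's soft (sigmoid) targets on the replay buffer."""
    s, r, o, t = replay_batch
    return torch.sigmoid(old_model.score(s, r, o, t))

def replay_kd_loss(model, replay_batch, old_targets):
    """Distillation: the current model's outputs should match the stored targets."""
    s, r, o, t = replay_batch
    return F.binary_cross_entropy_with_logits(model.score(s, r, o, t), old_targets)
```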
Temporal Regularization
Initialization
• The model parameters at time step 𝑡 are initialized using the model parameters trained at time step 𝑡 − 1. Taking the initialization of the entity embedding matrix as an example:
• We only need to regularize the embeddings of entities that are also in .
Temporal Regularization
• Inspired by prior work (Song et al., 2018), we impose the following L2 regularization term to smooth drastic changes in the parameter space, thus alleviating catastrophic forgetting.
• Note:
• and are weight and bias matrices in the Diachronic Embedding model.
• . The same notation holds for all hat symbols.
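A sketch combining the warm-start initialization with the L2 temporal regularizer, assuming the penalty is applied only to the entity embeddings of entities shared between steps t−1 and t; the parameter names follow the toy model above and the weight reg_weight is hypothetical.

```python
import torch

def warm_start(model, prev_state_dict):
    """Initialize the step-t model from the parameters trained at step t-1."""
    model.load_state_dict(prev_state_dict)

def temporal_l2(model, prev_state_dict, shared_entity_ids, reg_weight=0.01):
    """L2 penalty between current and previous embeddings of the entities shared
    between steps t-1 and t, smoothing drastic parameter changes."""
    idx = torch.tensor(shared_entity_ids, dtype=torch.long)
    cur = model.ent_static.weight[idx]
    prev = prev_state_dict["ent_static.weight"][idx]
    return reg_weight * ((cur - prev) ** 2).sum()
```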
Optimization
• The final loss function for TIE is a weighted combination of the loss terms defined above.
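Putting the pieces together, a sketch of the combined TIE objective that reuses the helper functions from the earlier sketches; the weights are hypothetical hyperparameters, and the exact weighting scheme is not recoverable from the slides.

```python
def tie_loss(model, added, deleted, replay, prev_state, shared_ids,
             active_entities, old_targets, num_entities,
             w_add=1.0, w_del=1.0, w_rce=1.0, w_rkd=1.0, w_reg=0.01):
    """Weighted combination of the TIE loss terms (weights are hypothetical
    hyperparameters), reusing the helper functions sketched above."""
    loss = w_add * added_fact_loss(model, added, num_entities)
    loss = loss + w_del * deleted_fact_loss(model, deleted)
    loss = loss + w_rce * replay_ce_loss(model, replay, active_entities)
    loss = loss + w_rkd * replay_kd_loss(model, replay, old_targets)
    return loss + temporal_l2(model, prev_state, shared_ids, reg_weight=w_reg)
```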
Experimental Setting
Dataset
• YAGO11k: transformed from YAGO3 in the following way (see the sketch below):
(Pétala Monteiro, isAffiliatedTo, Democratic Labour Party (Brazil), [1999 - 2014])
-> (Pétala Monteiro, isAffiliatedTo, Democratic Labour Party (Brazil), 1999),
…
(Pétala Monteiro, isAffiliatedTo, Democratic Labour Party (Brazil), 2014)
• Wikidata12k: facts from Wikidata with time stamps.
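A small sketch of the interval-to-yearly expansion used to build the incremental dataset, mirroring the slide's YAGO11k example; the helper name is hypothetical.

```python
def expand_interval(s, r, o, start_year, end_year):
    """Turn one interval-annotated fact into one quadruple per year it holds."""
    return [(s, r, o, year) for year in range(start_year, end_year + 1)]

quads = expand_interval("Pétala Monteiro", "isAffiliatedTo",
                        "Democratic Labour Party (Brazil)", 1999, 2014)
print(len(quads), quads[0], quads[-1])  # 16 yearly quadruples
```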
Training and Validation
Metrics
• C@10, or C-Hits@10 (↑): Current Hits@10.
• DF@10, or DF-Hits@10 (↓) : Deleted Facts Hits@10.
• RRD (↑) : Reciprocal Rank Difference.
• A@10, or A-Hits@10 (↑) : Average Hits@10.
• Note
• ↑: the higher the better
• ↓: the lower the better
Base Model and Baselines
• We use DE (Diachronic Embedding) and HyTE as our base models. DE uses ComplEx and HyTE uses TransE as their respective decoding functions.
• Baselines:
• Fine-tuning (FT): only uses the added facts .
• Temporal Regularization (TR): adds the temporal regularization loss on top of FT.
• Proposed model TIE: combines all the proposed methods. The experience replay sampling method is chosen based on validation set performance.
Skyline models
• Full-batch (FB): uses all the quadruples in and to fine-tune the model at time step t.
• Full-batch with future data (FB_future): uses all the data in to train the model at each time step. This is treated as an oracle since it has access to future data at each time step.
Results
Ablation Study (1)
Comparing experience replay methods
Ablation Study (2)
Effect of learning with Deleted Facts
Ablation Study (3)
Effect of learning with Added Facts
Ablation Study (4)
Effect of experience replay sampling size
Conclusion
• We present TIE, a novel incremental learning framework for TKGC tasks. TIE combines TKG representation learning, frequency-based experience replay, and temporal regularization to improve the model’s performance on both current and past time steps.
• We propose the DF and RRD metrics to measure the intransigence of the model.
• Extensive ablation studies show each proposed component’s effectiveness. They also provide insights for deciding among model variations by revealing performance trade-offs among the evaluation metrics.
• This work is a first attempt to apply incremental learning to TKGC tasks. Future work might explore other incremental learning techniques, such as constrained optimization, to achieve more robust performance across datasets and metrics.
Thank you!