TIE: A Framework for Embedding-based Incremental Temporal Knowledge Graph Completion
Jiapeng Wu (myself), Yishi Xu, Yingxue Zhang, Chen Ma, Mark Coates, Jackie Cheung
McGill University, School of Computer Science
Quebec AI Institute (MILA)
Montreal Research Center, Huawei Noah’s Ark Lab
University of Montreal (UdeM)
Answer queries (?, r, o, t) and (s, r, ?, t)
Temporal Knowledge Graph
Image is from Trivedi, Rakshit, et al. "Know-evolve: Deep temporal reasoning for dynamic
knowledge graphs." International Conference on Machine Learning. PMLR, 2017.
Challenges and Contributions
Notation
• A TKG is denoted as a sequence of graphs: , where
• and denote the sets of entities and relations at time step t
• denotes the set of observed triples at time t
• Let denote the set of true triples at time t such that
• The set of missing facts can be written as .
• TKG Completion (TKGC): given an object query for a quadruple, the model aims to rank the correct entity o as high as possible.
Challenge 1
Previous work in TKGC does not address the incremental learning setting, where new changes (additions and deletions) to the knowledge base become available at each new time step.
The model is expected to adapt to these changes while maintaining its knowledge of historical facts.
Naively fine-tuning the model on all the data at each new time step causes catastrophic forgetting (CF), meaning that performance on historical facts degrades.
Contribution 1
• We introduce a new task dubbed incremental
TKGC.
• We propose Temporal Incremental Embedding
(TIE), a training and evaluation framework that
integrates incremental learning techniques with
standard TKGC approaches.
• To address this challenge, TIE combines TKG representation learning, experience replay, and temporal regularization to improve model performance and alleviate catastrophic forgetting.
Challenge 2
• Standard TKGC metrics such as Hits@10 and MRR only measure overall performance across all time steps and omit the dynamic aspects.
• There are no metrics that evaluate how well a model forgets deleted facts.
• For example, the quadruple (Trump, presidentOf, US, 2020) is true, whereas (Trump, presidentOf, US, 2021) is false. Hence, we would like the model to rank Biden higher than Trump given the query (?, presidentOf, US, 2021).
Contribution 2
• To measure TKGC models’ ability to discern
facts that were true in the past but false at
present, we propose new evaluation metrics
dubbed Deleted Facts Hits@10 (DF) and
Reciprocal Rank Difference Measure (RRD).
• We explicitly associate deleted facts with negative labels and integrate them into the training process, which yields improvements on both metrics.
Challenge 3
• A naïve way to adapt to changes in the knowledge base is to train a new model on all the available data at each new time step, which is very inefficient.
• A more practical alternative is a sliding-window approach, which trains the model on the data accumulated over the past x time steps.
Contribution 3
• We show that training using only the added facts significantly speeds up training and reduces the dataset size by around 10x.
• Meanwhile, it maintains a performance level similar to naïve fine-tuning.
Main Results
We adapt two existing TKGC models to our incremental learning framework and conduct experiments on the Wikidata12k and YAGO11k datasets.
The proposed TIE framework reduces training time by about 10x and improves some of the proposed metrics compared to full-batch training, with no significant loss on any traditional metric.
Extensive ablation studies reveal the
performance trade-offs among different
evaluation metrics, providing insights for
choosing among model variations.
Task setup
Standard vs Incremental TKGC
• Training and evaluation — Standard: once, over time steps 1 to T; Incremental: T times, once at each time step t
• Training set
• Test set
• Candidates
TKG Encoder-Decoder
TKG Encoder-Decoder Framework
• Encoder: learns entity and relation embedding matrices
and , together with a time-sensitive function. Examples:
• Diachronic Embedding (Goel et al., 2019)
• HyTE (Dasgupta et al., 2018)
• Decoder: scores each triple (s, r, o) so that true triples receive much higher values than false ones. Examples:
• TransE (Bordes et al., 2013)
• DistMult (Yang et al., 2014)
TKG Encoder-Decoder Framework
• We use and to denote the representation of entity i and
relation r at time step t, and use to denote the parametrized
model at time t. The score for a quadruple is written as:
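To make the encoder-decoder split concrete, here is a minimal PyTorch sketch that pairs a diachronic-style entity encoder with a DistMult decoder; the sine activation, dimensions, and all names are illustrative assumptions rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ToyTKGModel(nn.Module):
    """Minimal encoder-decoder TKG scorer (illustrative, not the paper's exact model)."""

    def __init__(self, num_entities, num_relations, dim=64):
        super().__init__()
        # Encoder: a static entity part plus a time-sensitive (diachronic-style) part.
        self.ent_static = nn.Embedding(num_entities, dim)
        self.ent_amp = nn.Embedding(num_entities, dim)   # amplitude of the temporal signal
        self.ent_freq = nn.Embedding(num_entities, dim)  # frequency of the temporal signal
        self.rel = nn.Embedding(num_relations, dim)

    def encode_entity(self, e, t):
        # Time-sensitive entity representation (sine activation is an assumption).
        t = t.unsqueeze(-1).float()
        return self.ent_static(e) + self.ent_amp(e) * torch.sin(self.ent_freq(e) * t)

    def score(self, s, r, o, t):
        # DistMult-style decoder: true quadruples should receive higher scores.
        return (self.encode_entity(s, t) * self.rel(r) * self.encode_entity(o, t)).sum(-1)

# Usage: score one quadruple (s=0, r=1, o=2, t=5).
model = ToyTKGModel(num_entities=100, num_relations=10)
print(model.score(torch.tensor([0]), torch.tensor([1]), torch.tensor([2]), torch.tensor([5])))
```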
Evaluation Metrics
Current and Average Metrics
Assuming we are at time step t, the standard Hits@10 is defined as:
Let be the Hits@10 value evaluated on using the
model trained at time step t.
The current Hits@10 is often of the most interest:
The average Hits@10 measures the historical average performance,
which implies the degree of catastrophic forgetting:
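As a sketch of how these aggregates can be computed, assume hits10[k][t'] stores the Hits@10 value on the test set of time step t' evaluated with the model trained at step k; the storage layout and names are hypothetical.

```python
def hits_at_10(ranks):
    """Standard Hits@10: fraction of queries whose correct entity ranks in the top 10."""
    return sum(r <= 10 for r in ranks) / len(ranks)

def current_hits(hits10, t):
    """Current Hits@10: model trained at step t, evaluated on the test set of step t."""
    return hits10[t][t]

def average_hits(hits10, t):
    """Average Hits@10: model trained at step t, averaged over all test sets up to t.
    A large drop relative to current_hits indicates catastrophic forgetting."""
    return sum(hits10[t][tau] for tau in range(t + 1)) / (t + 1)

# Usage with a toy 3-step history (numbers are made up).
hits10 = {2: {0: 0.41, 1: 0.48, 2: 0.60}}
print(current_hits(hits10, 2), average_hits(hits10, 2))
```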
Intransigence Metrics
• Intransigence: the inability of an algorithm to identify knowledge that was true in the past but is false at present. We call such a fact a deleted fact.
• E.g., a student is no longer affiliated with their college after graduating.
• Deleted Facts Hits@10 (DF): Hits@10 computed over the deleted facts; the lower the better.
Intransigence Metrics
• Reciprocal Rank Difference (RRD): reciprocal of the difference
between the ranks of the valid facts and deleted facts, the higher the
better:
• Here is the collection of deleted objects, and is the
normalizing constant.
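A small sketch of how DF-Hits@10 and an RRD-style score could be computed from ranks; the exact pairing of valid and deleted objects and the normalizing constant used in the paper are not recoverable from the slides, so this form is an assumption.

```python
def df_hits_at_10(deleted_ranks):
    """Deleted Facts Hits@10 (DF): fraction of deleted facts still ranked in the
    top 10 by the current model -- the lower the better."""
    return sum(r <= 10 for r in deleted_ranks) / len(deleted_ranks)

def rrd(valid_ranks, deleted_ranks):
    """Reciprocal Rank Difference (RRD), the higher the better.
    Assumed form: average reciprocal of the rank gap over (valid, deleted) pairs."""
    pairs = [(rv, rd) for rv in valid_ranks for rd in deleted_ranks if rd != rv]
    return sum(1.0 / (rd - rv) for rv, rd in pairs) / len(pairs)

# Usage: the valid object ranks 1st; two deleted objects rank 4th and 12th.
print(df_hits_at_10([4, 12]), rrd([1], [4, 12]))
```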
TIE: Temporal Incremental Embedding
Overview
• Learning with deleted facts and added facts
• Experience Replay
• Temporal Regularization
Learning with deleted facts and added facts
Learning with Deleted Facts
• Construct a set of deleted facts:
• Then, associate each quadruple with a negative label, and derive the
binary cross entropy loss:
• The goal is to reduce the intransigence of the learned model.
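A minimal sketch of the deleted-fact objective, assuming each deleted quadruple receives label 0 and its score is pushed down with binary cross-entropy; it reuses the toy model's score() interface from the earlier sketch, and the exact loss form is an assumption.

```python
import torch
import torch.nn.functional as F

def deleted_fact_loss(model, deleted_batch):
    """Binary cross-entropy that assigns label 0 to deleted facts, pushing their
    scores down in order to reduce the intransigence of the learned model."""
    s, r, o, t = deleted_batch  # index tensors for subjects, relations, objects, times
    logits = model.score(s, r, o, t)
    return F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
```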
Dataset Statistics
Figure 2: Dataset statistics of Wikidata12k (left) and YAGO11k (right). The three curves show the numbers of facts (total, common, and added) at each time step.
Learning with Added Facts
• Construct a set of added facts:
• The softmax cross entropy loss is derived as:
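A sketch of the added-fact objective under the assumption that a softmax cross-entropy is taken over the true object against uniformly sampled negative objects; the candidate construction and batch layout are illustrative.

```python
import torch
import torch.nn.functional as F

def added_fact_loss(model, added_batch, num_entities, num_negatives=64):
    """Softmax cross-entropy over the true object vs. uniformly sampled negatives."""
    s, r, o, t = added_batch                          # shape (batch,) index tensors
    neg_o = torch.randint(num_entities, (s.shape[0], num_negatives))
    cand = torch.cat([o.unsqueeze(1), neg_o], dim=1)  # true object sits in column 0
    logits = model.score(s.unsqueeze(1).expand_as(cand),
                         r.unsqueeze(1).expand_as(cand),
                         cand,
                         t.unsqueeze(1).expand_as(cand))
    target = torch.zeros(s.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, target)
```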
Experience Replay
Experience Replay
• The replay buffer:
• The goal is to extract , a set of positive examples from
• Uniform Sampling: is a random subset of
• Frequency-based sampling: draw samples from based on their frequencies
Frequency-based sampling
• A pattern of the triple (𝑠, 𝑟, 𝑜) refers to a regular expression with some of the elements replaced by the wildcard symbol ‘∗’.
• The set of patterns P is defined as follows:
• We use and to denote the historical and current frequency of a pattern 𝑝 at time 𝑡.
• For example, the frequencies are:
Frequency-based sampling
• The unnormalized sampling probability of a quadruple is calculated as .
• The frequency-dependent score is calculated from the temporal frequencies and a set of hyperparameters:
• The time-dependent term is simply defined as an exponential decay function:
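As a concrete sketch of this sampling scheme, under stated assumptions: the weight of each buffered quadruple multiplies a frequency-dependent score for its patterns by an exponential time decay. The pattern set, the hyperparameter names alpha and lam, and sampling with replacement are illustrative choices; the inverse-frequency variant discussed on the next slide is shown as a one-line swap (inverse=True).

```python
import math
import random
from collections import Counter

def patterns(s, r, o):
    """Patterns of a triple: some elements replaced by the wildcard '*'."""
    return [(s, r, "*"), ("*", r, o), (s, "*", o), ("*", r, "*")]

def replay_sample(buffer, t_now, k, alpha=1.0, lam=0.1, inverse=False):
    """Draw k quadruples from the replay buffer with probability proportional to
    a frequency-dependent score times an exponential time decay (with replacement,
    for simplicity)."""
    freq = Counter(p for (s, r, o, t) in buffer for p in patterns(s, r, o))
    weights = []
    for (s, r, o, t) in buffer:
        f = sum(freq[p] for p in patterns(s, r, o))
        score = (1.0 / f) if inverse else float(f) ** alpha   # inverse variant favors rare patterns
        weights.append(score * math.exp(-lam * (t_now - t)))  # exponential time decay
    return random.choices(buffer, weights=weights, k=k)

# Usage on a toy buffer of (s, r, o, t) quadruples.
buf = [("a", "livesIn", "x", 1), ("b", "livesIn", "x", 2), ("c", "worksAt", "y", 3)]
print(replay_sample(buf, t_now=4, k=2))
```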
Inverse frequency-based sampling
• What can go wrong with ?
• It overly favors patterns with higher temporal frequencies.
• For example, the pattern (*, instance of, commune of France) has many matches from time steps 62–77, so the algorithm ends up selecting most of them as samples.
• An alternative formulation that increases diversity and favors low-frequency patterns is .
Representation learning: cross entropy
• Time dependent negative sampling:
• The logits output by the current model :
• The resulting replay cross-entropy (RCE) loss:
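A sketch of the replay cross-entropy (RCE) loss, assuming time-dependent negatives are drawn from the entities active at the replayed quadruple's own time step; the active_entities map is a hypothetical stand-in for however the paper restricts negatives.

```python
import torch
import torch.nn.functional as F

def replay_ce_loss(model, replay_batch, active_entities, num_negatives=64):
    """Cross-entropy over replayed positives vs. time-dependent negative objects.
    active_entities[t] is assumed to list the entity ids observed at time step t."""
    s, r, o, t = replay_batch
    neg_cols = []
    for ti in t.tolist():
        pool = torch.tensor(active_entities[ti])             # negatives restricted to step ti
        neg_cols.append(pool[torch.randint(len(pool), (num_negatives,))])
    neg_o = torch.stack(neg_cols)                            # (batch, num_negatives)
    cand = torch.cat([o.unsqueeze(1), neg_o], dim=1)
    logits = model.score(s.unsqueeze(1).expand_as(cand),
                         r.unsqueeze(1).expand_as(cand),
                         cand,
                         t.unsqueeze(1).expand_as(cand))
    return F.cross_entropy(logits, torch.zeros(len(s), dtype=torch.long))
```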
Representation learning: knowledge distillation
• Inspired by iCaRL (Rebuffi et al., 2017), we apply the knowledge distillation loss in
addition to the standard cross entropy loss.
• Before training at time step t, we store the output of the model with the network parameters obtained after training at time step t − 1 as:
• The resulting replay knowledge distillation (RKD) loss:
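A sketch of an iCaRL-style replay knowledge distillation (RKD) loss: the previous model's sigmoid outputs on the replay buffer are stored before training at step t and then matched by the current model with binary cross-entropy; the use of per-quadruple sigmoid targets is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def store_old_outputs(old_model, replay_batch):
    """Record the step t-1 model's soft (sigmoid) targets on the replay buffer."""
    s, r, o, t = replay_batch
    return torch.sigmoid(old_model.score(s, r, o, t))

def replay_kd_loss(model, replay_batch, old_targets):
    """Distillation: the current model's outputs should match the stored targets."""
    s, r, o, t = replay_batch
    return F.binary_cross_entropy_with_logits(model.score(s, r, o, t), old_targets)
```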
Temporal Regularization
Initialization
• The model parameters at time step 𝑡 are initialized using the model parameters trained at time step 𝑡 − 1. Taking the initialization of the entity embedding matrix as an example:
• We only need to regularize the embeddings of entities that are also in .
Temporal Regularization
• Inspired by prior work (Song et al., 2018), we impose the following L2 regularization term to smooth drastic changes in the parameter space, thus alleviating catastrophic forgetting.
• Note:
• and are weight and bias matrices in the Diachronic Embedding model.
• . The same notation holds for all hat symbols.
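A sketch combining the warm-start initialization with the L2 temporal regularizer, assuming the penalty is applied only to the entity embeddings of entities shared between steps t−1 and t; the parameter names follow the toy model above and the weight reg_weight is hypothetical.

```python
import torch

def warm_start(model, prev_state_dict):
    """Initialize the step-t model from the parameters trained at step t-1."""
    model.load_state_dict(prev_state_dict)

def temporal_l2(model, prev_state_dict, shared_entity_ids, reg_weight=0.01):
    """L2 penalty between current and previous embeddings of the entities shared
    between steps t-1 and t, smoothing drastic parameter changes."""
    idx = torch.tensor(shared_entity_ids, dtype=torch.long)
    cur = model.ent_static.weight[idx]
    prev = prev_state_dict["ent_static.weight"][idx]
    return reg_weight * ((cur - prev) ** 2).sum()
```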
Optimization
• The final loss function for TIE is a weighted combination of the loss terms defined above.
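Putting the pieces together, a sketch of the combined TIE objective that reuses the helper functions from the earlier sketches; the weights are hypothetical hyperparameters, and the exact weighting scheme is not recoverable from the slides.

```python
def tie_loss(model, added, deleted, replay, prev_state, shared_ids,
             active_entities, old_targets, num_entities,
             w_add=1.0, w_del=1.0, w_rce=1.0, w_rkd=1.0, w_reg=0.01):
    """Weighted combination of the TIE loss terms (weights are hypothetical
    hyperparameters), reusing the helper functions sketched above."""
    loss = w_add * added_fact_loss(model, added, num_entities)
    loss = loss + w_del * deleted_fact_loss(model, deleted)
    loss = loss + w_rce * replay_ce_loss(model, replay, active_entities)
    loss = loss + w_rkd * replay_kd_loss(model, replay, old_targets)
    return loss + temporal_l2(model, prev_state, shared_ids, reg_weight=w_reg)
```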
Experimental Setting
Dataset
• YAGO11k: transformed from YAGO3 in the following way (see the sketch below):
(Pétala Monteiro, isAffiliatedTo, Democratic Labour Party (Brazil), [1999 - 2014])
-> (Pétala Monteiro, isAffiliatedTo, Democratic Labour Party (Brazil), 1999),
…
(Pétala Monteiro, isAffiliatedTo, Democratic Labour Party (Brazil), 2014)
• Wikidata12k: facts from Wikidata with time stamps.
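A small sketch of the interval-to-yearly expansion used to build the incremental dataset, mirroring the slide's YAGO11k example; the helper name is hypothetical.

```python
def expand_interval(s, r, o, start_year, end_year):
    """Turn one interval-annotated fact into one quadruple per year it holds."""
    return [(s, r, o, year) for year in range(start_year, end_year + 1)]

quads = expand_interval("Pétala Monteiro", "isAffiliatedTo",
                        "Democratic Labour Party (Brazil)", 1999, 2014)
print(len(quads), quads[0], quads[-1])  # 16 yearly quadruples
```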
Training and Validation
Metrics
• C@10, or C-Hits@10 (↑): Current Hits@10.
• DF@10, or DF-Hits@10 (↓) : Deleted Facts Hits@10.
• RRD (↑) : Reciprocal Rank Difference.
• A@10, or A-Hits@10 (↑) : Average Hits@10.
• Note
• ↑: the higher the better
• ↓: the lower the better
Base Model and Baselines
• We use DE (Diachronic Embedding) and HyTE as our base models. DE uses ComplEx and HyTE uses TransE as their respective decoding functions.
• Baselines:
• Fine-tuning (FT): only uses the added facts .
• Temporal Regularization (TR): adds the temporal regularization loss on top of FT.
• Proposed model TIE: combines all the proposed methods. The experience replay sampling method is chosen based on validation set performance.
Skyline models
• Full-batch (FB): uses all the quadruples in and to fine-tune the model at time step t.
• Full-batch with future data (FB_future): uses all the data in to train the model at each time step. This is treated as an oracle since it has access to future data at each time step.
Results
Ablation Study (1)
Comparing experience replay methods
Ablation Study (2)
Effect of learning with Deleted Facts
Ablation Study (3)
Effect of learning with Added Facts
Ablation Study (4)
Effect of experience replay sampling size
Conclusion
• We present TIE, a novel incremental learning framework for TKGC tasks. TIE combines TKG representation learning, frequency-based experience replay, and temporal regularization to improve the model’s performance on both current and past time steps.
• We propose the DF and RRD metrics to measure the intransigence of the model.
• Extensive ablation studies show each proposed component’s effectiveness. They also provide insights for deciding among model variations by revealing performance trade-offs among the evaluation metrics.
• This work is a first attempt to apply incremental learning to TKGC tasks. Future work might explore other incremental learning techniques, such as constrained optimization, to achieve more robust performance across datasets and metrics.
Thank you!