LAB SEMINAR
Nguyen Thanh Sang
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: sang.ngt99@gmail.com
Hierarchical Graph Transformer with Adaptive Node Sampling
--- Zaixi Zhang, Qi Liu, Qingyong Hu, Chee-Kong Lee ---
2023-07-03
Contents
⮚ Paper
▪ Introduction
▪ Problem
▪ Contributions
▪ Framework
▪ Experiment
▪ Conclusion
Introduction
 Transformer-based models have achieved unprecedented successes in natural language processing (NLP) and computer vision (CV).
 When it comes to graph-structured data, Transformers have not achieved competitive performance, especially on large graphs.
Problems
+ Existing Transformer models for graph data treat each node as a token and design dedicated positional encodings.
 They only focus on small graphs such as molecular graphs with tens of atoms.
+ Graphormer achieves SOTA performance on molecular property prediction tasks.
 For large graphs, the quadratic computational and storage complexity of the vanilla Transformer in the number of nodes inhibits practical application.
+ Sparse Transformer methods can improve the efficiency of the vanilla Transformer.
 However, they have not exploited the unique characteristics of graph data and still require quadratic or at least sub-quadratic space complexity
 still unaffordable in most practical cases.
Problems
+ Existing Graph Transformers have the following deficiencies:
(1) The fixed node sampling strategies in existing Graph Transformers are ignorant of graph properties
 they may sample uninformative nodes for attention.
 an adaptive node sampling strategy aware of the graph properties is needed.
(2) Although node sampling enables scalability, most sampling strategies focus on local neighbors and neglect the long-range dependencies and global contexts of graphs.
 incorporating complementary global information is necessary for the Graph Transformer.
Contributions
+ Adaptive Node Sampling for Graph Transformer (ANS-GT): modifies a multi-armed bandit algorithm to adaptively sample nodes for attention.
+ Introduces coarse-grained global attention with graph coarsening
 helps the Graph Transformer capture long-range dependencies while improving efficiency.
+ Evaluates the method on six benchmark datasets to show its advantage over existing Graph Transformers and popular GNNs.
Transformer Architecture
+ Each Transformer layer has two parts:
- a multi-head self-attention (MHA) module.
- a position-wise feed-forward network (FFN).
+ The MHA module first projects the input H to query, key, and value spaces: Q = H·W_Q, K = H·W_K, V = H·W_V.
+ The scaled dot-product attention mechanism: Attn(H) = softmax(Q·Kᵀ / √d_K) · V.
+ The outputs from different heads are concatenated and transformed to obtain the final output of MHA (see the sketch below).
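A minimal single-head sketch of the projection and scaled dot-product attention steps above (NumPy; the weight matrices, dimensions, and random inputs are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def scaled_dot_product_attention(H, W_q, W_k, W_v):
    """One attention head: project H to Q/K/V, then apply softmax(QK^T / sqrt(d_K)) V."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) pairwise attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d_v) attended values

# toy example: 5 tokens (nodes), hidden size 8, head size 4
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(scaled_dot_product_attention(H, W_q, W_k, W_v).shape)  # (5, 4)
```

In the multi-head case, the outputs of several such heads are concatenated and passed through a final linear transformation, as stated in the last bullet.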
Graph Coarsening
+ Goal: reduce the number of nodes in a graph by clustering them into super-nodes while preserving the global information of the graph as much as possible.
+ The coarse graph G′ is a smaller weighted graph.
+ G′ is obtained from the original graph by first computing a partition of the node set V into clusters.
+ The clusters C_1, ..., C_|V′| are disjoint and cover all the nodes in V.
+ Each cluster C_i corresponds to a super-node in G′ (see the sketch below).
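A small sketch of building the coarse weighted graph from a given node partition. The partition itself could come from any coarsening/clustering algorithm; here a fixed assignment is used purely for illustration, and the self-loop handling is an assumption:

```python
import numpy as np

def coarsen(A, assignment, n_clusters):
    """Collapse each cluster of nodes into a super-node.

    A          : (n, n) adjacency matrix of the original graph
    assignment : length-n array, assignment[i] = cluster id of node i
    Returns the (n_clusters, n_clusters) weighted adjacency of the coarse graph G',
    where entry (p, q) sums the edge weights between clusters p and q.
    """
    n = A.shape[0]
    P = np.zeros((n, n_clusters))        # membership matrix: P[i, c] = 1 if node i is in cluster c
    P[np.arange(n), assignment] = 1.0
    A_coarse = P.T @ A @ P               # aggregate edge weights between clusters
    np.fill_diagonal(A_coarse, 0.0)      # drop super-node self-loops (an illustrative choice)
    return A_coarse

# toy example: 6 nodes grouped into 3 super-nodes
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)
assignment = np.array([0, 0, 0, 1, 2, 2])
print(coarsen(A, assignment, 3))
```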
Motivating Observations
+ For large graphs, existing Graph Transformer models typically choose to sample a batch of nodes for attention.
+ Real-world graph datasets exhibit different properties, which makes a fixed node sampling strategy unsuitable for all kinds of graphs.
+ Four popular node sampling strategies are compared for node classification: 1-hop neighbors, 2-hop neighbors, PPR, and KNN
 check the performance of the Graph Transformer on graphs with different properties (a standard edge-homophily measure, related to the α levels below, is sketched after this slide).
+ For strong homophily (e.g., α = 1.0): sampling 1-hop neighbors or nodes with top PPR scores works best.
+ For graphs with strong heterophily (e.g., α = 0.05): 2-hop neighbors are the better neighborhoods.
+ KNN achieves the best performance (i.e., 77.2% accuracy) when all the nodes are connected randomly.
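The homophily level α above controls how often edges connect nodes of the same class. A standard edge-homophily measure (not necessarily the exact generator parameter used in the paper) can be computed as follows:

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges connecting same-label nodes (≈1.0 = strong homophily, ≈0 = strong heterophily)."""
    src, dst = edge_index                      # two parallel arrays of node indices
    return float(np.mean(labels[src] == labels[dst]))

# toy example: 2 of 3 edges join same-label nodes
edge_index = (np.array([0, 1, 2]), np.array([1, 2, 3]))
labels = np.array([0, 0, 0, 1])
print(edge_homophily(edge_index, labels))      # 0.666...
```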
Framework
Adaptive Node Sampling
+ Adaptive node sampling: adaptively choose the batch of most informative nodes via a multi-armed bandit mechanism.
+ Four representative sampling heuristics (a combined sketch follows below):
- 1-hop / 2-hop neighbors: adopt the normalized adjacency matrix Â; the 1-hop sampling score between nodes i and j is Â_ij, and the 2-hop score is (Â²)_ij.
- KNN: adopts the cosine similarity of node attributes to measure node similarity. The similarity score is S_ij = (x_i · x_j) / (‖x_i‖ ‖x_j‖).
- PPR: the Personalized PageRank matrix is S = c(I − (1 − c)Â)⁻¹, where the teleport factor c ∈ [0, 1] (set to 0.15 in the experiments).
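A minimal sketch of the four heuristic score matrices listed above. The row normalization and the dense matrix inverse are simplifications for illustration; on large graphs PPR would be approximated iteratively rather than inverted exactly:

```python
import numpy as np

def sampling_heuristics(A, X, c=0.15):
    """Build the four node-sampling score matrices used as bandit arms (illustrative implementation).

    A : (n, n) unweighted adjacency matrix
    X : (n, d) node attribute matrix
    Returns a dict of (n, n) score matrices: 1-hop, 2-hop, KNN (cosine), and PPR.
    """
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    A_hat = A / deg                                       # row-normalized adjacency (one common choice)

    # cosine similarity of node attributes for the KNN heuristic
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True).clip(min=1e-12)
    knn = X_norm @ X_norm.T

    # Personalized PageRank: S = c (I - (1 - c) A_hat)^{-1}
    n = A.shape[0]
    ppr = c * np.linalg.inv(np.eye(n) - (1 - c) * A_hat)

    return {"1-hop": A_hat, "2-hop": A_hat @ A_hat, "knn": knn, "ppr": ppr}
```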
Adaptive Node Sampling
+ The final node sampling probability is a weighted combination of the heuristics: for center node i, p_i = Σ_k w_k · S^(k)_i, where w is the probability vector over the heuristics (updated by the multi-armed bandit) and S^(k) is the node sampling matrix of the k-th heuristic (see the sketch below).
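A hedged sketch of the adaptive weighting idea, using a generic Exp3-style bandit update. The learning rate, the exact update rule, and the reward are assumptions; in the paper the reward is derived from how much attention the sampled nodes actually receive:

```python
import numpy as np

class HeuristicBandit:
    """Exp3-style weights over K sampling heuristics (illustrative; not the paper's exact update rule)."""
    def __init__(self, n_heuristics, lr=0.1):
        self.log_w = np.zeros(n_heuristics)   # log-weights, one per heuristic ("arm")
        self.lr = lr                          # assumed learning rate

    def probs(self):
        w = np.exp(self.log_w - self.log_w.max())
        return w / w.sum()

    def update(self, chosen, reward):
        """Importance-weighted boost of the chosen heuristic; reward in [0, 1] is a stand-in signal."""
        p = self.probs()
        self.log_w[chosen] += self.lr * reward / p[chosen]

def sample_nodes(S_list, weights, center, batch_size, rng):
    """Mix the heuristic score matrices S^(k) with the bandit weights and draw a node batch."""
    scores = sum(w * S[center] for w, S in zip(weights, S_list))
    p = scores / scores.sum()                 # final node sampling probability for this center node
    return rng.choice(len(p), size=batch_size, replace=False, p=p)
```

In use, the bandit's `probs()` would supply the weights w for `sample_nodes`, and `update()` would be called after each training step with the observed reward of the heuristic that was drawn.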
Hierarchical Graph Attention
+ Goal: efficiently capture both the local and global information in the graph, with fine-grained attention over the adaptively sampled nodes and coarse-grained attention over the super-nodes of the coarsened graph.
+ The positional encoding is built from powers of the normalized adjacency matrix with self-loops, with one channel per position up to the number of positions considered.
+ The Graphormer framework is then used to obtain the output of the l-th Transformer layer (a sketch of assembling the input sequence and positional encoding follows below).
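An illustrative sketch of how one input sequence and its adjacency-power positional encoding could be assembled; the token layout (center node + sampled fine-grained nodes + coarse super-nodes) and the number of positions are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def build_input_sequence(center, sampled, X, X_coarse, A_hat, n_positions=2):
    """Assemble the token sequence for one center node (illustrative layout).

    Tokens = [center node] + [adaptively sampled fine-grained nodes] + [coarse super-nodes].
    The relative positional encoding among fine-grained tokens stacks powers of the
    normalized adjacency matrix with self-loops, one channel per position.
    """
    fine_ids = np.concatenate(([center], sampled))
    tokens = np.vstack([X[fine_ids], X_coarse])             # (n_fine + n_coarse, d) token features

    powers, P = [], np.eye(A_hat.shape[0])
    for _ in range(n_positions):
        P = P @ A_hat
        powers.append(P[np.ix_(fine_ids, fine_ids)])        # restrict each power to the sampled tokens
    pos_enc = np.stack(powers, axis=-1)                      # (n_fine, n_fine, n_positions)
    return tokens, pos_enc
```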
Optimization and Inference
+ Sample S input sequences for each center node and use its representation from the final Transformer layer for prediction.
+ Predict the node class from the center-node representation, e.g. ŷ_i = softmax(W · h_i^(L)).
+ Minimize the average cross-entropy loss over the labeled training nodes: L = −(1/|V_L|) Σ_{i∈V_L} Σ_c y_ic log ŷ_ic.
+ In the inference stage, a bagging aggregation over the S sampled sequences improves accuracy and reduces variance: ŷ_i = (1/S) Σ_s ŷ_i^(s) (a sketch follows below).
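A small sketch of the bagging aggregation at inference time. `predict_proba` and `sample_sequence` are hypothetical callables standing in for the trained model's forward pass and the adaptive node-sampling step above:

```python
import numpy as np

def bagged_prediction(predict_proba, sample_sequence, center, S, rng):
    """Average the class distributions predicted from S independently sampled input sequences.

    predict_proba   : callable mapping a token sequence to a class-probability vector (stand-in)
    sample_sequence : callable drawing one input sequence for `center` (stand-in)
    """
    probs = [predict_proba(sample_sequence(center, rng)) for _ in range(S)]
    return np.mean(probs, axis=0)        # averaged class distribution = bagged prediction

# toy usage with dummy callables
rng = np.random.default_rng(0)
dummy_sample = lambda center, rng: rng.integers(0, 10, size=5)   # fake token ids
dummy_predict = lambda seq: np.ones(3) / 3                        # uniform class probabilities
print(bagged_prediction(dummy_predict, dummy_sample, center=0, S=4, rng=rng))
```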
Experiments
Datasets
+ Six benchmark datasets including:
- Citation graphs Cora, CiteSeer, and PubMed;
- Wikipedia graphs Chameleon, Squirrel;
- The Actor co-occurrence graph;
+ And WebKB datasets including Cornell, Texas, and Wisconsin.
Experiments
Node Classification Performance
+ ANS-GT outperforms all Graph Transformer baselines and achieves state-of-the-art results on all 6 datasets => shows the effectiveness of the proposed model.
+ Some Graph Transformer baselines perform poorly compared with GNN models
=> due to their full-graph attention mechanisms or fixed node sampling schemes.
Experiment results
Effectiveness of Adaptive Node Sampling
+ PPR and 1-hop neighbors receive high weights on Cora, while 2-hop neighbors dominate the other sampling strategies on Squirrel
 Cora and Squirrel are strongly homophilous and strongly heterophilous datasets, respectively.
+ For Citeseer and Actor, the weight of KNN first goes up and then gradually decreases
 nodes with similar attributes are most useful in the early stage of training.
+ ANS-GT has a large advantage over the variant without adaptive sampling.
Experiment results
Graph Coarsening Methods
+ There is no significant difference between different coarsening algorithms
 this shows the robustness of ANS-GT.
+ As for the coarsening rate, the results indicate that rates between 0.01 and 0.10 give the best performance.
Conclusions
• This paper proposes Adaptive Node Sampling for Graph Transformer (ANS-GT), which modifies a multi-armed bandit algorithm to adaptively sample nodes for attention.
• To incorporate long-range dependencies and global contexts, the authors design a hierarchical graph attention scheme in which coarse-grained attention is achieved with graph coarsening.
• The proposed method is evaluated on six benchmark datasets, showing its advantage over existing Graph Transformers and popular GNNs.
• The adaptive node sampling module can effectively adjust the sampling strategy according to graph properties.
Thank you!