LAB SEMINAR
Nguyen Thanh Sang
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: sang.ngt99@gmail.com
Hierarchical Graph Transformer with Adaptive Node Sampling
--- Zaixi Zhang, Qi Liu, Qingyong Hu, Chee-Kong Lee ---
2023-07-03
Contents
⮚ Paper
▪ Introduction
▪ Problem
▪ Contributions
▪ Framework
▪ Experiment
▪ Conclusion
Introduction
 Transformer-based models have achieved unprecedented successes in natural language processing (NLP) and computer vision (CV).
 When it comes to graph-structured data, Transformers have not achieved competitive performance, especially on large graphs.
Problems
+ Existing Transformer models for graph data treat each node as a token and design dedicated positional encodings.
 They only focus on small graphs such as molecular graphs with tens of atoms.
+ Graphormer achieves SOTA performance on molecular property prediction tasks.
 For large graphs, the quadratic computational and storage complexity of the vanilla Transformer in the number of nodes inhibits practical application.
+ Sparse Transformer methods can improve the efficiency of the vanilla Transformer.
 However, they have not exploited the unique characteristics of graph data and still require quadratic or at least sub-quadratic space complexity
 still unaffordable in most practical cases.
Problems
+ Existing Graph Transformers have the following deficiencies:
(1) The fixed node sampling strategies in existing Graph Transformers are ignorant of graph properties
 they may sample uninformative nodes for attention.
 an adaptive node sampling strategy aware of the graph properties is needed.
(2) Although node sampling enables scalability, most sampling strategies focus on local neighbors and neglect the long-range dependencies and global contexts of graphs.
 incorporating complementary global information is necessary for the Graph Transformer.
Contributions
+ Adaptive Node Sampling for Graph Transformer (ANS-GT): modifies a multi-armed bandit algorithm to adaptively sample nodes for attention.
+ Introduces coarse-grained global attention with graph coarsening
 helps the Graph Transformer capture long-range dependencies while improving efficiency.
+ Evaluates the method on six benchmark datasets to show its advantage over existing Graph Transformers and popular GNNs.
Transformer Architecture
+ Each Transformer layer has two parts:
- a multi-head self-attention (MHA) module.
- a position-wise feed-forward network (FFN).
+ The MHA module first projects the input H to query, key, and value spaces: Q = H·W_Q, K = H·W_K, V = H·W_V.
+ The scaled dot-product attention mechanism: Attn(H) = softmax(Q·Kᵀ / √d_K) · V.
+ The outputs from different heads are concatenated and transformed to obtain the final output of MHA (see the sketch below).
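A minimal single-head sketch of the projection and scaled dot-product attention steps above (NumPy; the weight matrices, dimensions, and random inputs are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def scaled_dot_product_attention(H, W_q, W_k, W_v):
    """One attention head: project H to Q/K/V, then apply softmax(QK^T / sqrt(d_K)) V."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) pairwise attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d_v) attended values

# toy example: 5 tokens (nodes), hidden size 8, head size 4
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(scaled_dot_product_attention(H, W_q, W_k, W_v).shape)  # (5, 4)
```

In the multi-head case, the outputs of several such heads are concatenated and passed through a final linear transformation, as stated in the last bullet.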
Graph Coarsening
+ Goal: reduce the number of nodes in a graph by clustering them into super-nodes while preserving the global information of the graph as much as possible.
+ The coarse graph G′ is a smaller weighted graph.
+ G′ is obtained from the original graph by first computing a partition of the node set V into clusters.
+ The clusters C_1, ..., C_|V′| are disjoint and cover all the nodes in V.
+ Each cluster C_i corresponds to a super-node in G′ (see the sketch below).
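A small sketch of building the coarse weighted graph from a given node partition. The partition itself could come from any coarsening/clustering algorithm; here a fixed assignment is used purely for illustration, and the self-loop handling is an assumption:

```python
import numpy as np

def coarsen(A, assignment, n_clusters):
    """Collapse each cluster of nodes into a super-node.

    A          : (n, n) adjacency matrix of the original graph
    assignment : length-n array, assignment[i] = cluster id of node i
    Returns the (n_clusters, n_clusters) weighted adjacency of the coarse graph G',
    where entry (p, q) sums the edge weights between clusters p and q.
    """
    n = A.shape[0]
    P = np.zeros((n, n_clusters))        # membership matrix: P[i, c] = 1 if node i is in cluster c
    P[np.arange(n), assignment] = 1.0
    A_coarse = P.T @ A @ P               # aggregate edge weights between clusters
    np.fill_diagonal(A_coarse, 0.0)      # drop super-node self-loops (an illustrative choice)
    return A_coarse

# toy example: 6 nodes grouped into 3 super-nodes
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)
assignment = np.array([0, 0, 0, 1, 2, 2])
print(coarsen(A, assignment, 3))
```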
Motivating Observations
+ For large graphs, existing Graph Transformer models typically choose to sample a batch of nodes for attention.
+ Real-world graph datasets exhibit different properties, which makes a fixed node sampling strategy unsuitable for all kinds of graphs.
+ Four popular node sampling strategies are compared for node classification: 1-hop neighbors, 2-hop neighbors, PPR, and KNN
 check the performance of the Graph Transformer on graphs with different properties (a standard edge-homophily measure, related to the α levels below, is sketched after this slide).
+ For strong homophily (e.g., α = 1.0): sampling 1-hop neighbors or nodes with top PPR scores works best.
+ For graphs with strong heterophily (e.g., α = 0.05): 2-hop neighbors are the better neighborhoods.
+ KNN achieves the best performance (i.e., 77.2% accuracy) when all the nodes are connected randomly.
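The homophily level α above controls how often edges connect nodes of the same class. A standard edge-homophily measure (not necessarily the exact generator parameter used in the paper) can be computed as follows:

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges connecting same-label nodes (≈1.0 = strong homophily, ≈0 = strong heterophily)."""
    src, dst = edge_index                      # two parallel arrays of node indices
    return float(np.mean(labels[src] == labels[dst]))

# toy example: 2 of 3 edges join same-label nodes
edge_index = (np.array([0, 1, 2]), np.array([1, 2, 3]))
labels = np.array([0, 0, 0, 1])
print(edge_homophily(edge_index, labels))      # 0.666...
```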
Framework
Adaptive Node Sampling
+ Adaptive node sampling: adaptively choose the batch of most informative nodes via a multi-armed bandit mechanism.
+ Four representative sampling heuristics (a combined sketch follows below):
- 1-hop / 2-hop neighbors: adopt the normalized adjacency matrix Â; the 1-hop sampling score between nodes i and j is Â_ij, and the 2-hop score is (Â²)_ij.
- KNN: adopts the cosine similarity of node attributes to measure node similarity. The similarity score is S_ij = (x_i · x_j) / (‖x_i‖ ‖x_j‖).
- PPR: the Personalized PageRank matrix is S = c(I − (1 − c)Â)⁻¹, where the teleport factor c ∈ [0, 1] (set to 0.15 in the experiments).
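A minimal sketch of the four heuristic score matrices listed above. The row normalization and the dense matrix inverse are simplifications for illustration; on large graphs PPR would be approximated iteratively rather than inverted exactly:

```python
import numpy as np

def sampling_heuristics(A, X, c=0.15):
    """Build the four node-sampling score matrices used as bandit arms (illustrative implementation).

    A : (n, n) unweighted adjacency matrix
    X : (n, d) node attribute matrix
    Returns a dict of (n, n) score matrices: 1-hop, 2-hop, KNN (cosine), and PPR.
    """
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    A_hat = A / deg                                       # row-normalized adjacency (one common choice)

    # cosine similarity of node attributes for the KNN heuristic
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True).clip(min=1e-12)
    knn = X_norm @ X_norm.T

    # Personalized PageRank: S = c (I - (1 - c) A_hat)^{-1}
    n = A.shape[0]
    ppr = c * np.linalg.inv(np.eye(n) - (1 - c) * A_hat)

    return {"1-hop": A_hat, "2-hop": A_hat @ A_hat, "knn": knn, "ppr": ppr}
```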
Adaptive Node Sampling
+ The final node sampling probability is a weighted combination of the heuristics: for center node i, p_i = Σ_k w_k · S^(k)_i, where w is the probability vector over the heuristics (updated by the multi-armed bandit) and S^(k) is the node sampling matrix of the k-th heuristic (see the sketch below).
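A hedged sketch of the adaptive weighting idea, using a generic Exp3-style bandit update. The learning rate, the exact update rule, and the reward are assumptions; in the paper the reward is derived from how much attention the sampled nodes actually receive:

```python
import numpy as np

class HeuristicBandit:
    """Exp3-style weights over K sampling heuristics (illustrative; not the paper's exact update rule)."""
    def __init__(self, n_heuristics, lr=0.1):
        self.log_w = np.zeros(n_heuristics)   # log-weights, one per heuristic ("arm")
        self.lr = lr                          # assumed learning rate

    def probs(self):
        w = np.exp(self.log_w - self.log_w.max())
        return w / w.sum()

    def update(self, chosen, reward):
        """Importance-weighted boost of the chosen heuristic; reward in [0, 1] is a stand-in signal."""
        p = self.probs()
        self.log_w[chosen] += self.lr * reward / p[chosen]

def sample_nodes(S_list, weights, center, batch_size, rng):
    """Mix the heuristic score matrices S^(k) with the bandit weights and draw a node batch."""
    scores = sum(w * S[center] for w, S in zip(weights, S_list))
    p = scores / scores.sum()                 # final node sampling probability for this center node
    return rng.choice(len(p), size=batch_size, replace=False, p=p)
```

In use, the bandit's `probs()` would supply the weights w for `sample_nodes`, and `update()` would be called after each training step with the observed reward of the heuristic that was drawn.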
Hierarchical Graph Attention
+ Goal: efficiently capture both the local and global information in the graph, with fine-grained attention over the adaptively sampled nodes and coarse-grained attention over the super-nodes of the coarsened graph.
+ The positional encoding is built from powers of the normalized adjacency matrix with self-loops, with one channel per position up to the number of positions considered.
+ The Graphormer framework is then used to obtain the output of the l-th Transformer layer (a sketch of assembling the input sequence and positional encoding follows below).
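An illustrative sketch of how one input sequence and its adjacency-power positional encoding could be assembled; the token layout (center node + sampled fine-grained nodes + coarse super-nodes) and the number of positions are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def build_input_sequence(center, sampled, X, X_coarse, A_hat, n_positions=2):
    """Assemble the token sequence for one center node (illustrative layout).

    Tokens = [center node] + [adaptively sampled fine-grained nodes] + [coarse super-nodes].
    The relative positional encoding among fine-grained tokens stacks powers of the
    normalized adjacency matrix with self-loops, one channel per position.
    """
    fine_ids = np.concatenate(([center], sampled))
    tokens = np.vstack([X[fine_ids], X_coarse])             # (n_fine + n_coarse, d) token features

    powers, P = [], np.eye(A_hat.shape[0])
    for _ in range(n_positions):
        P = P @ A_hat
        powers.append(P[np.ix_(fine_ids, fine_ids)])        # restrict each power to the sampled tokens
    pos_enc = np.stack(powers, axis=-1)                      # (n_fine, n_fine, n_positions)
    return tokens, pos_enc
```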
Optimization and Inference
+ Sample S input sequences for each center node and use its representation from the final Transformer layer for prediction.
+ Predict the node class from the center-node representation, e.g. ŷ_i = softmax(W · h_i^(L)).
+ Minimize the average cross-entropy loss over the labeled training nodes: L = −(1/|V_L|) Σ_{i∈V_L} Σ_c y_ic log ŷ_ic.
+ In the inference stage, a bagging aggregation over the S sampled sequences improves accuracy and reduces variance: ŷ_i = (1/S) Σ_s ŷ_i^(s) (a sketch follows below).
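A small sketch of the bagging aggregation at inference time. `predict_proba` and `sample_sequence` are hypothetical callables standing in for the trained model's forward pass and the adaptive node-sampling step above:

```python
import numpy as np

def bagged_prediction(predict_proba, sample_sequence, center, S, rng):
    """Average the class distributions predicted from S independently sampled input sequences.

    predict_proba   : callable mapping a token sequence to a class-probability vector (stand-in)
    sample_sequence : callable drawing one input sequence for `center` (stand-in)
    """
    probs = [predict_proba(sample_sequence(center, rng)) for _ in range(S)]
    return np.mean(probs, axis=0)        # averaged class distribution = bagged prediction

# toy usage with dummy callables
rng = np.random.default_rng(0)
dummy_sample = lambda center, rng: rng.integers(0, 10, size=5)   # fake token ids
dummy_predict = lambda seq: np.ones(3) / 3                        # uniform class probabilities
print(bagged_prediction(dummy_predict, dummy_sample, center=0, S=4, rng=rng))
```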
Experiments
Datasets
+ Six benchmark datasets including:
- Citation graphs Cora, CiteSeer, and PubMed;
- Wikipedia graphs Chameleon, Squirrel;
- The Actor co-occurrence graph;
+ And WebKB datasets including Cornell, Texas, and Wisconsin.
Experiments
Node Classification Performance
+ ANS-GT outperforms all Graph Transformer baselines and achieves state-of-the-art results on all 6 datasets => shows the effectiveness of the proposed model.
+ Some Graph Transformer baselines perform poorly compared with GNN models
=> due to their full-graph attention mechanisms or fixed node sampling schemes.
Experiment results
Effectiveness of Adaptive Node Sampling
+ PPR and 1-hop neighbors receive high weights on Cora, while 2-hop neighbors dominate the other sampling strategies on Squirrel
 Cora and Squirrel are strongly homophilous and strongly heterophilous datasets, respectively.
+ For Citeseer and Actor, the weight of KNN first goes up and then gradually decreases
 nodes with similar attributes are most useful in the early stage of training.
+ ANS-GT has a large advantage over the variant without adaptive sampling.
Experiment results
Graph Coarsening Methods
+ There is no significant difference between different coarsening algorithms
 this shows the robustness of ANS-GT.
+ As for the coarsening rate, the results indicate that rates between 0.01 and 0.10 give the best performance.
Conclusions
• This paper proposes Adaptive Node Sampling for Graph Transformer (ANS-GT), which modifies a multi-armed bandit algorithm to adaptively sample nodes for attention.
• To incorporate long-range dependencies and global contexts, the authors design a hierarchical graph attention scheme in which coarse-grained attention is achieved with graph coarsening.
• The proposed method is evaluated on six benchmark datasets, showing its advantage over existing Graph Transformers and popular GNNs.
• The adaptive node sampling module can effectively adjust the sampling strategy according to graph properties.
Thank you!