[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition.pptx

Quang-Huy Tran
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: huytran1126@gmail.com
2024-09-02
Dynamic Semantic-Based Spatial Graph
Convolution Network for Skeleton-Based
Human Action Recognition
Jianyang Xie et al.
AAAI-2024: The Thirty-Eighth AAAI Conference on Artificial Intelligence

2
OUTLINE
• MOTIVATION
• METHODOLOGY
• EXPERIMENT & RESULT
• CONCLUSION

3
MOTIVATION
Overview and Limitation
• Human action recognition (HAR) is an essential topic:
o It is central to computer vision and has a wide range of applications.
o It is commonly based on skeleton sensor data.
o Traditional methods (CNN/RNN) or STGNNs extract handcrafted features from the skeleton sequence.
• Challenges:
o SOTA ST-GCN models use a fixed graph.
→ Insufficient to capture changeable movements.
o Adaptive adjacency-based methods ignore semantic information.
→ Insufficient to capture the semantic properties of actions.
o Semantic-guided methods rely on explicit input encoding.
→ Not flexible, and does not cooperate well with deeper GCN layers.

4
INTRODUCTION
Contribution
• Propose a dynamic semantic-based graph convolution network (DS-GCN):
o Encodes the dynamic semantic information of joints and edges implicitly.
o Each joint/edge type is encoded with a different transform function, each of which represents a specific distribution.

5
METHODOLOGY
Problem Definition
• Skeleton data is constructed as a spatial-temporal graph G = (V, E, X):
o V: N body joints in T frames.
o E: spatial and temporal links.
o X: joint coordinates as the node features; d is the feature dimension.
o Spatial graph: intra-body connections between joints in the same frame.
o Temporal graph: the same joint connected along consecutive frames.
o ST-GCN can be divided into S-GCN (the focus here) and T-GCN, where T-GCN uses 1D temporal convolution.
• Topology-Fixed Graph Convolution Network:
o Updates the node representation by aggregating information from its neighborhood.
o The adjacency matrix is divided into three partitions.
o The S-GCN output is obtained from the input by aggregating over these partitions.
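The fixed-topology aggregation described above can be sketched as follows. This is a minimal illustration that assumes the commonly used ST-GCN-style form (a sum over row-normalized partition matrices, each with its own weight). The function names, the 3-joint chain, and the identity weights are hypothetical and are not taken from the slides.

```python
import numpy as np

def normalize_rows(a):
    """Row-normalize an adjacency matrix; all-zero rows are left as zeros."""
    deg = a.sum(axis=1, keepdims=True)
    deg = np.where(deg == 0, 1.0, deg)  # avoid division by zero
    return a / deg

def fixed_spatial_gcn(x, partitions, weights):
    """Topology-fixed spatial graph convolution for one frame.

    x:          (N, d_in) joint features
    partitions: list of (N, N) adjacency partitions (assumed: three of them)
    weights:    list of (d_in, d_out) matrices, one per partition
    """
    out = np.zeros((x.shape[0], weights[0].shape[1]))
    for a_p, w_p in zip(partitions, weights):
        out += normalize_rows(a_p) @ x @ w_p
    return out

# Toy skeleton: a 3-joint chain 0 - 1 - 2 with joint 0 as the root.
a_self = np.eye(3)                       # the joint itself
a_in = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])       # neighbor closer to the root
a_out = a_in.T                           # neighbor farther from the root

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = [np.eye(2), np.eye(2), np.eye(2)]
out = fixed_spatial_gcn(x, [a_self, a_in, a_out], weights)
```

Because the three partition matrices are fixed before training, the same neighborhood structure is used for every action, which is the limitation the adaptive variants try to address.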

6
METHODOLOGY
Problem Definition
• Topology-Adaptive Graph Convolution Network:
o The adaptive matrix is learned dynamically with a self-attention mechanism.
o Given two transformation functions, the correlation between two joints is computed from their transformed features.
• Semantic-Guided Graph Convolution Network:
o The input feature is refined by adding a one-hot vector of joint types.
o The adaptive matrix of S-GCN is then computed from the refined features.
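The two ideas above can be sketched together. This is only an illustration under stated assumptions: the adaptive matrix is taken to be a row-wise softmax over dot products of two linear projections (a common self-attention form), and the one-hot joint type is mapped to the feature dimension with a hypothetical embedding matrix before being added. All names and sizes are made up for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def add_joint_type(x, joint_type, type_embedding):
    """Semantic-guided refinement: add an embedded one-hot joint type.

    x: (N, d), joint_type: (N,) ints, type_embedding: (M, d) (assumed).
    """
    one_hot = np.eye(type_embedding.shape[0])[joint_type]  # (N, M)
    return x + one_hot @ type_embedding                    # (N, d)

def adaptive_adjacency(x, w_theta, w_phi):
    """Self-attention style adaptive matrix; each row sums to 1."""
    theta = x @ w_theta            # first transformation, (N, k)
    phi = x @ w_phi                # second transformation, (N, k)
    return softmax(theta @ phi.T, axis=-1)

rng = np.random.default_rng(0)
num_joints, dim, hidden, num_types = 5, 4, 3, 2
x = rng.normal(size=(num_joints, dim))
joint_type = np.array([0, 1, 1, 0, 1])
type_embedding = rng.normal(size=(num_types, dim))
w_theta = rng.normal(size=(dim, hidden))
w_phi = rng.normal(size=(dim, hidden))

x_sem = add_joint_type(x, joint_type, type_embedding)
adj = adaptive_adjacency(x_sem, w_theta, w_phi)
```

Here the semantic information enters only once, at the input, which is why the slides describe this scheme as explicit input encoding.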

7
METHODOLOGY
Main Architecture: DS-GCN

8
METHODOLOGY
Dynamic Semantic-Based GCN
• Topology-adaptive GCN:
o Joint and edge types are encoded dynamically.
o The skeleton is modeled as a directed graph G = (V, E, A, R, X), where A and R denote the type mappings for nodes and edges, respectively.
o A semantic-based adaptive graph is built for both nodes and edges.

9
METHODOLOGY
• Node Type-Aware Adaptive Topology
o Node features are projected into their individual feature spaces with a node type mapping function.
o The topology is calculated according to the non-local mechanism.
→ With s and t as two nodes of different types, a node-aware feature representation is computed for each.
o A directed correlation between nodes s and t is computed along the channel dimension.
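A minimal sketch of the node type-aware step, under assumptions: each node type has its own linear projection, and the directed channel-wise correlation is taken to be tanh of the pairwise feature difference (a form used in channel-wise topology refinement models, assumed here rather than confirmed by the slides). Function names and sizes are hypothetical.

```python
import numpy as np

def type_aware_projection(x, node_type, weights):
    """Project each node with the weight matrix of its own type.

    x: (N, d), node_type: (N,) ints in [0, M), weights: (M, d, k).
    Returns (N, k).
    """
    per_node = weights[node_type]                  # (N, d, k)
    return np.einsum("nd,ndk->nk", x, per_node)

def node_type_aware_topology(x, node_type, w_src, w_dst):
    """Directed, channel-wise correlation between every node pair (s, t).

    Returns (N, N, k); entry [s, t] is generally different from [t, s].
    """
    src = type_aware_projection(x, node_type, w_src)   # (N, k)
    dst = type_aware_projection(x, node_type, w_dst)   # (N, k)
    return np.tanh(src[:, None, :] - dst[None, :, :])

rng = np.random.default_rng(0)
num_joints, dim, hidden, num_types = 4, 3, 2, 2
x = rng.normal(size=(num_joints, dim))
node_type = np.array([0, 1, 1, 0])
w_src = rng.normal(size=(num_types, dim, hidden))
w_dst = rng.normal(size=(num_types, dim, hidden))

topo = node_type_aware_topology(x, node_type, w_src, w_dst)
```

Unlike a one-hot added at the input, the type information here shapes the projection itself, so it can be applied at any layer.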

10
METHODOLOGY
• Edge Type-Aware Adaptive Topology
o A separate convolution kernel is applied on the adaptive graph for each edge type.
o Given three nodes s, t and u of different types, an edge type-aware adaptive correlation is computed.
o The edge type-aware topology is indexed by the node types of its endpoints.
→ s and t are node type indices; M is the number of types.
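One way to picture the edge type-aware step is sketched below, assuming that an edge's type is determined by the ordered pair of its endpoint node types (so M types give M × M edge types) and that the "separate kernel" is a per-channel scaling of the adaptive correlation. Both points are assumptions made for illustration; the names are hypothetical.

```python
import numpy as np

def edge_type_aware_topology(corr, node_type, edge_kernels):
    """Apply a separate kernel to each edge according to its type.

    corr:         (N, N, k) adaptive correlation between node pairs
    node_type:    (N,) ints in [0, M)
    edge_kernels: (M, M, k), one kernel per (source type, target type) pair
    """
    # Broadcasted lookup: kernel for edge (s, t) is edge_kernels[type[s], type[t]].
    per_edge = edge_kernels[node_type[:, None], node_type[None, :]]  # (N, N, k)
    return per_edge * corr

# Toy example: 3 joints, 2 node types, 1 channel, all-ones correlation.
node_type = np.array([0, 1, 1])
corr = np.ones((3, 3, 1))
edge_kernels = np.array([[[0.0], [1.0]],
                         [[10.0], [11.0]]])   # value encodes (src type, dst type)

topo = edge_type_aware_topology(corr, node_type, edge_kernels)
```

With the all-ones correlation, `topo[s, t, 0]` simply shows which kernel was selected for each directed edge.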

11
METHODOLOGY
• DS-GCN is decomposed into three branches:
o The node type-aware branch, the edge type-aware branch, and the general branch.
o A branch-wise weight:
→ Learnable, and used when combining with the shared correlation matrix.
o For each branch, the combination of a shared correlation matrix and a self-adaptive graph is used for the spatial graph convolution operation.
→ The three branches are concatenated along the feature channel dimension and followed by a 1 × 1 convolution kernel.
→ The result is the output of DS-GCN.
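The three-branch combination can be sketched as below. The exact way the branch-wise weight enters is not recoverable from the slides, so this assumes each branch uses `shared + alpha * adaptive` as its topology; the 1 × 1 convolution is written as a per-joint linear map over channels. All names and sizes are hypothetical.

```python
import numpy as np

def branch_output(x, shared, adaptive, alpha, weight):
    """One branch: spatial graph convolution on a combined topology (assumed form)."""
    topology = shared + alpha * adaptive    # (N, N)
    return topology @ x @ weight            # (N, k)

def ds_gcn_fusion(x, shared, adaptives, alphas, weights, w_fuse):
    """Concatenate the branch outputs along channels, then apply a 1x1 convolution."""
    branches = [
        branch_output(x, shared, adaptive, alpha, weight)
        for adaptive, alpha, weight in zip(adaptives, alphas, weights)
    ]
    stacked = np.concatenate(branches, axis=-1)   # (N, 3 * k)
    return stacked @ w_fuse                       # 1x1 conv as a channel-wise linear map

num_joints, dim, hidden, out_dim = 4, 2, 3, 5
x = np.ones((num_joints, dim))
shared = np.eye(num_joints)
# Node type-aware, edge type-aware, and general branches (zeros for this toy case).
adaptives = [np.zeros((num_joints, num_joints)) for _ in range(3)]
alphas = [0.5, 0.5, 0.5]
weights = [np.ones((dim, hidden)) for _ in range(3)]
w_fuse = np.ones((3 * hidden, out_dim))

out = ds_gcn_fusion(x, shared, adaptives, alphas, weights, w_fuse)
```

In this toy setting every output entry equals 18: each branch yields 2 per channel (the sum over two input features), and the fusion sums nine such channels.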

12
METHODOLOGY
Model Architecture
• Ten blocks in series:
o Followed by global average pooling and a softmax classifier.
o The number of basic feature channels is 64, doubled at the 5th and 8th blocks.
o Each block: one DS-GCN and a multi-scale temporal module (temporal convolution network).
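The channel schedule stated above can be written out explicitly. The helper below is a hypothetical illustration of that rule only (10 blocks, 64 base channels, doubling at blocks 5 and 8); it does not build the network.

```python
def block_channels(num_blocks=10, base=64, double_at=(5, 8)):
    """Return the output channel count of each block (blocks are 1-indexed)."""
    channels = []
    current = base
    for block in range(1, num_blocks + 1):
        if block in double_at:
            current *= 2
        channels.append(current)
    return channels

schedule = block_channels()
# schedule == [64, 64, 64, 64, 128, 128, 128, 256, 256, 256]
```

Under this reading, the classifier after global average pooling receives a 256-dimensional feature vector.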

13
EXPERIMENT AND RESULT
Experiment Settings
• Dataset: human action recognition
o NTU-RGB+D and Kinetics-400.
• Baselines:
o STGNN or GNN: ST-GCN [1], SGN [2], AS-GCN [3], RA-GCN [4], 2s-AGCN [5], DGNN [6], FGCN [7], Shift-GCN [8], DSTA-Net [9], MS-G3D [10], CTR-GCN [11] and ST-GCN++ [12].
o CNN: PoseConv3D [13].
[1] Yan, S., Xiong, Y., & Lin, D. (2018, April). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1).
[2] Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., & Zheng, N. (2020). Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1112-1121).
[3] Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., & Tian, Q. (2019). Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3595-3603).
[4] Song, Y. F., Zhang, Z., Shan, C., & Wang, L. (2020). Richly activated graph convolutional network for robust skeleton-based action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 31(5), 1915-1925.
[5] Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12026-12035).
[6] Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7912-7921).
[7] Yang, H., Yan, D., Zhang, L., Sun, Y., Li, D., & Maybank, S. J. (2021). Feedback graph convolutional network for skeleton-based action recognition. IEEE Transactions on Image Processing, 31, 164-175.
[8] Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., & Lu, H. (2020). Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 183-192).
[9] Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2020). Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian conference on computer vision.
[10] Liu, Z., Zhang, H., Chen, Z., Wang, Z., & Ouyang, W. (2020). Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 143-152).
[11] Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., & Hu, W. (2021). Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 13359-13368).
[12] Duan, H., Wang, J., Chen, K., & Lin, D. (2022, October). Pyskl: Towards good practices for skeleton action recognition. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 7351-7354).
[13] Duan, H., Zhao, Y., Chen, K., Lin, D., & Dai, B. (2022). Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2969-2978).
• Measurement:
o Accuracy (ACC).

14
Result – Overall Performance
Tab. Classification accuracy comparison against state-of-the-art methods.

15
Result – Ablation Study
Tab. Generalization of the proposed semantic module.
Tab. Ablation on the edge/node type encoding.
Tab. Comparison of DS-GCN with different learnable weight schemes.
Tab. Exploration of the semantic encoding stage.

16
CONCLUSION
Summarization
• Proposes two dynamic semantic-based adaptive graphs:
o Node type-aware and edge type-aware adaptive graphs.
o Can be applied to any ST-GCN model for skeleton-based recognition.
• Builds a dynamic semantic-based graph neural network for skeleton-based human action recognition:
o Outperforms SOTA methods notably on both NTU-RGB+D and Kinetics-400.
