DL Seminar: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

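DL Seminar (Paper Introduction)
ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Faculty of Information Science and Technology, Hokkaido University
Laboratory of Harmonious Systems Engineering, Research Group of Synergetic Information Engineering, Division of Computer Science and Information Technology
森雄斗, second-year doctoral student
2023/06/12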

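Copyright © 2020 Laboratory of Harmonious Systems Engineering, Research Group of Synergetic Information Engineering, Division of Computer Science and Information Technology, Faculty of Information Science and Technology, Hokkaido University. All rights reserved.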
Paper information
Title
ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Authors
Yufei Xu1*, Jing Zhang1*, Qiming Zhang1, Dacheng Tao2,1
1 School of Computer Science, The University of Sydney, Australia
2 JD Explore Academy, China
Venue
NeurIPS 2022
URL
Demo page (Hugging Face Spaces)
https://huggingface.co/spaces/hysts/ViTPose_video
GitHub
https://github.com/ViTAE-Transformer/ViTPose
Paper
https://proceedings.neurips.cc/paper_files/paper/2022/file/fbb10d319d44f8c3b4720873e4177c65-Paper-Conference.pdf

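Overview
Proposes a human pose estimation model built on a plain Vision Transformer
Key features include a simple model structure and scalable model size
Sits on the Pareto front of throughput versus accuracy and achieves state-of-the-art accuracy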

Pose Estimation
A computer vision task
Estimates the coordinates of human keypoints from images and video
https://github.com/ViTAE-Transformer/ViTPose

Development of pose estimation
CNN-based
DeepPose [1] (2014)
ResNet-50 baseline [2] (2018)
HRNet [3] (2019)
Transformer-based
HRFormer [4] (2021)
TokenPose [5] (2021)
TransPose [6] (2021)
[1] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1653–1660, 2014.
[2] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), 2018.
[3] K. Sun, B. Xiao, D. Liu, and J. Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5693–5703, 2019.
[4] Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang. Hrformer: High-resolution transformer for dense prediction. In Advances in Neural Information Processing Systems, 2021.
[5] Y. Li, S. Zhang, Z. Wang, S. Yang, W. Yang, S.-T. Xia, and E. Zhou. Tokenpose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[6] S. Yang, Z. Quan, M. Nie, and W. Yang. Transpose: Keypoint localization via transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
Figure: HRNet output

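Issues with prior methods
HRFormer
Extracts features with a Transformer and obtains high-resolution representations through multi-resolution parallel transformer branches
Issues
Prior methods either need an extra CNN for feature extraction or require careful design of the transformer structure
The authors' question
How well can a plain Transformer handle pose estimation?
Figure: network structure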

Strengths of ViTPose
1. Simplicity
• Uses a simple, non-hierarchical Vision Transformer [1]
• Requires no task-specific domain knowledge
• The decoder consists of up-sampling layers and a convolutional prediction layer
2. Scalability
• Inference speed and accuracy can be traded off through the number of Transformer layers
3. Flexibility
• Input image resolution
• Adaptation from single-pose to multi-pose settings
[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.

ViTPose network structure
Figure labels: Transformer block; classic decoder; simple decoder; set of decoders for multiple datasets

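1. Simplicity: from input to encoder
Input image: $X \in \mathbb{R}^{H \times W \times 3}$
Patch embedding layer: $F \in \mathbb{R}^{\frac{H}{d} \times \frac{W}{d} \times C}$
$d$: downsampling ratio
$C$: channel dimension
Figure: inside the Transformer block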
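
The shape bookkeeping of the patch embedding layer can be checked with a few lines of Python. This is only a minimal sketch: the function name is hypothetical, and the 256x192 input, d = 16 and C = 768 are illustrative assumptions rather than values stated on this slide.

```python
def patch_embedding_shape(height, width, d=16, channels=768):
    """Return (H/d, W/d, C), the token-grid shape after patch embedding.

    d and channels are assumed example values, not taken from the slide.
    """
    if height % d or width % d:
        raise ValueError("height and width must be divisible by d")
    return (height // d, width // d, channels)


# Assumed example: a 256x192 person crop
grid = patch_embedding_shape(256, 192)
num_tokens = grid[0] * grid[1]
print(grid, num_tokens)  # (16, 12, 768) 192
```

Under these assumptions the encoder processes 192 tokens, each a 768-dimensional vector.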

1. Simplicity: Transformer block
MHSA = multi-head self-attention
LN = layer normalization (Norm)
FFN = feed-forward network
[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
Figure: ViT [1] network diagram
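
The three components combine into the usual pre-norm residual block of ViT. As a sketch, writing $F_i$ for the output of the $i$-th block and $F_0$ for the patch embedding output (this indexing is an assumption made here for illustration):

```latex
\begin{align*}
F'_{i+1} &= F_i + \mathrm{MHSA}\bigl(\mathrm{LN}(F_i)\bigr) \\
F_{i+1}  &= F'_{i+1} + \mathrm{FFN}\bigl(\mathrm{LN}(F'_{i+1})\bigr)
\end{align*}
```

Each block keeps the token-grid shape unchanged, which is why the encoder is non-hierarchical.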

1. Simplicity: Decoder
Classic decoder
Simple decoder
Bilinear: bilinear interpolation
BN: batch normalization
Predictor: a convolutional layer that outputs the heatmaps
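
The two decoders can be written compactly as follows. This is a sketch: $F_{out}$ (encoder output), $K$ (keypoint heatmaps), $N_k$ (number of keypoints) and the kernel sizes are notation and details assumed here, not stated on the slide. Each deconvolution in the classic decoder is taken to be followed by BN and ReLU.

```latex
\begin{align*}
\text{Classic:}\quad K &= \mathrm{Conv}_{1\times 1}\bigl(\mathrm{Deconv}(\mathrm{Deconv}(F_{out}))\bigr) \\
\text{Simple:}\quad  K &= \mathrm{Conv}_{3\times 3}\bigl(\mathrm{Bilinear}(\mathrm{ReLU}(F_{out}))\bigr) \\
K &\in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times N_k}
\end{align*}
```

With $d = 16$, both variants upsample the token grid by a factor of 4: two 2x deconvolutions in the classic decoder, one 4x bilinear interpolation in the simple one.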

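2. Scalability: varying the number of transformer layers
Representational capacity can be scaled up or down with the number of transformer layers
ViT-B, ViT-L, ViT-H
ViTAE-G
A ViT variant that incorporates inductive biases for better generality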
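
To give a feel for how model size grows with depth and width, the sketch below estimates encoder parameter counts with the rule of thumb 12 * L * C^2 (attention projections plus a feed-forward network with expansion ratio 4). The depth and width values are the standard ViT configurations and are not taken from the slides. The estimate ignores embeddings, biases and normalization layers, and ViTAE-G is omitted because its configuration is not given here.

```python
# Standard ViT configurations as (depth, embedding width); assumed, not from the slides.
VIT_CONFIGS = {"ViT-B": (12, 768), "ViT-L": (24, 1024), "ViT-H": (32, 1280)}


def approx_encoder_params(depth, width, mlp_ratio=4):
    """Rough parameter count of the transformer blocks only."""
    attn = 4 * width * width             # Q, K, V and output projections
    ffn = 2 * mlp_ratio * width * width  # two linear layers of the FFN
    return depth * (attn + ffn)


for name, (depth, width) in VIT_CONFIGS.items():
    millions = approx_encoder_params(depth, width) / 1e6
    print(f"{name}: ~{millions:.0f}M parameters")
```

The estimates come out at roughly 85M, 302M and 629M parameters for ViT-B, ViT-L and ViT-H.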

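3. Flexibility
Pre-training data
Pre-training with a Masked Autoencoder makes training possible even with little training data
Resolution
The input size can be changed
The downsampling ratio $d$ can also be changed
Attention type
Two techniques are used to reduce the memory load
Shift window
Pooling window
Patch embedding layer: $F \in \mathbb{R}^{\frac{H}{d} \times \frac{W}{d} \times C}$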
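
Why window-based attention helps with memory can be seen from a simple count of query-key pairs. The sketch below is a rough estimate under stated assumptions: the function name is hypothetical, windows are non-overlapping squares that tile the token grid exactly, and the shift and pooling mechanisms themselves are not modeled.

```python
def attention_pairs(height, width, d, window=None):
    """Count query-key pairs per head for full or windowed attention."""
    rows, cols = height // d, width // d
    tokens = rows * cols
    if window is None:
        return tokens * tokens       # full attention: every token attends to all tokens
    return tokens * window * window  # windowed: each token attends only within its window


full_d16 = attention_pairs(256, 192, d=16)           # 192 tokens
full_d8 = attention_pairs(256, 192, d=8)             # 768 tokens
win_d8 = attention_pairs(256, 192, d=8, window=8)    # 8x8 windows
print(full_d16, full_d8, win_d8)  # 36864 589824 49152
```

Halving $d$ quadruples the token count and multiplies the full-attention cost by 16, while the windowed cost grows only linearly with the number of tokens.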

Related work
Self-supervised learning for Vision Transformers [1]
The Masked Autoencoder (MAE) is the Vision Transformer counterpart of the masked-token pre-training used in BERT
ViTPose uses a ViT pre-trained with masked image modeling (MIM)
[1] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.

3. Flexibility
Fine-tuning
Even with the MHSA parameters frozen, performance stays close to that of training all parameters
Table 6: rows 1 and 2

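Experiments: preliminary checks
Backbone details
Measures against overfitting
Ablation study
With a ViT backbone, the simple decoder works without problems
$AP_{50}$: Average Precision at an OKS threshold of 0.50 (OKS is the measure used to decide whether a prediction matches the ground truth)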

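Experiments: dataset
• Dataset
– COCO Keypoint Detection
• More than 200,000 images labeled with 17 keypoints
• https://cocodataset.org/#home
• Evaluation metric
– Object Keypoint Similarity (OKS)
$OKS = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$
$d_i$: Euclidean distance between the predicted and ground-truth coordinates
$s$: object scale, where $s^2$ is the area of the person region
$k_i$: per-keypoint constant that controls the falloff (eyes < nose < … < ankles < hips); a larger $k_i$ tolerates a wider range
$v_i$: ground-truth visibility flag (whether the body part is present in the image)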
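
The OKS definition translates directly into code. The following is a minimal sketch for a single person. The function signature is hypothetical, `area` is taken to be $s^2$ (the person region area), and the values in the usage lines are made up for illustration rather than real COCO constants.

```python
import math


def oks(pred, gt, visibility, area, k):
    """Object Keypoint Similarity for one person.

    pred, gt: lists of (x, y) coordinates; visibility: list of v_i flags;
    area: person region area, used as s^2; k: per-keypoint constants k_i.
    """
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # unlabeled keypoints are excluded from both sums
        d2 = (px - gx) ** 2 + (py - gy) ** 2  # squared Euclidean distance d_i^2
        total += math.exp(-d2 / (2 * area * ki ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0


# Illustrative values only
gt = [(10.0, 10.0), (20.0, 30.0)]
perfect = oks(gt, gt, visibility=[2, 1], area=100.0, k=[0.1, 0.1])
print(perfect)  # 1.0
```

A prediction counts as correct for $AP_{50}$ when its OKS against the matched ground truth is at least 0.50.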

Experiments: comparison with SoTA methods
ViTPose achieves high accuracy
* Trained on multiple datasets

ViTPose results

On the Pareto front

Limitations and discussion
Achieves SoTA without any specialized structure
Further accuracy gains are expected from more elaborate decoder designs or FPN structures

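Summary
Proposes a human pose estimation model built on a plain Vision Transformer
Key features include a simple model structure and scalable model size
Sits on the Pareto front of throughput versus accuracy and achieves state-of-the-art accuracy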

Status as of June 2023
PCT [1]
Presented at CVPR 2023
The backbone is a Swin Transformer
The decoder uses MLP-Mixer
[1] Geng, Zigang, et al. "Human Pose as Compositional Tokens." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Status as of June 2023
ViTPose+ [1]
Same authors as ViTPose
Posted to arXiv in December 2022
[1] Xu, Yufei, et al. "ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation." arXiv preprint arXiv:2212.04246 (2022).
Figure: new decoder
