InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Presenter: 김병현
Image Processing Team:
강인하, 김현진, 안종식, 이주영, 이희재, 현청천

InternImage
▪ State-of-the-art backbone network on COCO test-dev
Released by OpenGVLab

Introduction
▪ Success of Transformers in computer vision tasks
▪ CNN-based foundation models can also achieve comparable or even better performance than ViTs when equipped with similar operator-/architecture-level designs, scaling-up parameters, and massive data.

Introduction
▪ Gap between CNNs and ViTs
Operator level
• Long-range dependency
• Adaptive spatial aggregation
Architecture level
• Advanced components
– Layer Normalization
– Feed-Forward Networks
– GELU
Recent long-range CNNs
• Very large kernels (31x31)
• Still show a performance gap with SOTA ViTs

Introduction
▪ Comparison of different core operators

Introduction
▪ Global Attention: Vision Transformer
Architecture of ViT
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers
for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

Introduction
▪ Local Attention: Swin Transformer
Architecture of Swin Transformer
Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using
shifted windows." Proceedings of the IEEE/CVF international conference
on computer vision. 2021.

Introduction
▪ Large Kernel: SLaK
Liu, Shiwei, et al. "More convnets in the 2020s: Scaling up kernels beyond
51x51 using sparsity." arXiv preprint arXiv:2207.03620 (2022).

Introduction
▪ Concentration on a CNN-based model
InternImage
• Brand-new CNN-based backbone network
• Characteristics
– Dynamic sparse convolutional layer
» Uses only 3x3 kernels
» Adaptive spatial aggregation
» Reduced inductive bias
» Low computational cost compared to large-kernel convolutional layers
– Adopts the overall architecture of ViTs
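To give a rough sense of the cost argument, the sketch below counts weights and multiply-accumulates for a dense depth-wise convolution with a 3x3 versus a 31x31 kernel. The feature-map size (56 x 56 x 64) and the helper name `depthwise_conv_cost` are illustrative assumptions, and the count deliberately ignores the extra cost a deformable layer pays for predicting offsets and modulation scalars and for bilinear sampling.

```python
def depthwise_conv_cost(height, width, channels, kernel_size):
    """Return (weights, multiply-accumulates) for a stride-1, same-padded depth-wise conv."""
    weights = channels * kernel_size * kernel_size
    macs = height * width * weights
    return weights, macs


# Hypothetical feature map: 56 x 56 spatial positions, 64 channels.
small = depthwise_conv_cost(56, 56, 64, 3)
large = depthwise_conv_cost(56, 56, 64, 31)

# Both costs scale with k * k, so the ratio is 961 / 9 regardless of the map size.
ratio = large[1] / small[1]
print(small, large, round(ratio, 1))
```

Under these assumptions the 31x31 kernel touches about 107 times as many sampling points per output location as the 3x3 kernel, which is the gap a sparse, offset-driven 3x3 operator tries to avoid.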

Introduction
▪ Contributions
First CNN-based backbone with more than 1 billion parameters
Adds long-range dependencies and adaptive spatial aggregation with 3x3 DCN
SOTA accuracy on the COCO dataset

Overall Architecture
1. Deformable Convolutional Layer v3
2. Architecture Design

Deformable Convolutional Layer v3
▪ Revisiting DCNv2
Source: https://www.taokong.org/report/DCN_v2_paper_reading.pdf

1. Sharing weights among convolutional neurons
• DCNv2 is computationally heavy: every sampling point has its own independent linear projection weights
• Its memory complexity is therefore linear in the total number of sampling points
• Remedy: borrow the idea of separable convolution and split the original convolution weights into depth-wise and point-wise parts
2. Introducing a multi-group mechanism
• Split the spatial aggregation process into G groups
3. Normalizing modulation scalars along sampling points
• Replace the element-wise sigmoid normalization with softmax normalization along the sampling points
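The three changes are easiest to see by comparing the two aggregation formulas. The notation below is restated from the DCNv2 and InternImage papers rather than from these slides, so treat it as a reference sketch: K is the number of sampling points, p_k the k-th regular grid location, Δp the learned offset, m the modulation scalar, and w the projection weights.

```latex
% DCNv2: one projection weight w_k per sampling point, sigmoid-bounded m_k
\mathbf{y}(p_0) = \sum_{k=1}^{K} \mathbf{w}_k \, m_k \, \mathbf{x}(p_0 + p_k + \Delta p_k)

% DCNv3: weights shared within each of G groups, softmax-normalized m_{gk}
\mathbf{y}(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} \mathbf{w}_g \, m_{gk} \, \mathbf{x}_g(p_0 + p_k + \Delta p_{gk}),
\qquad \sum_{k=1}^{K} m_{gk} = 1
```

Read against the list above: w_k becoming w_g is the weight sharing, the outer sum over g is the multi-group mechanism, and the constraint on m_{gk} is the softmax normalization.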

▪ Normal Convolutional Layer
[Figure: input tensor (h × w × d) → convolutional kernel (k × k, D filters) → output tensor (H × W × D)]

▪ Deformable Convolutional Layer v1
[Figure: input tensor (h × w × d) → k × k kernel with D filters → output tensor (H × W × D); an offset layer (k × k convolutional kernel with 2N output channels, N = k²) produces an H × W × 2N offset map]
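The role of the 2N-channel offset map can be illustrated with a single-channel sketch in plain Python: each of the N = k² kernel taps gets its own (dy, dx) pair, and the input is read at the shifted, generally fractional, location by bilinear interpolation. The function names and the toy 4 x 4 input are hypothetical; a real layer predicts the offsets with a convolution and runs over all channels and positions.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D list at fractional (y, x); taps outside the map count as zero."""
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < height and 0 <= xi < width:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def deformable_conv_at(feature, kernel, offsets, py, px):
    """Output at (py, px) of a single-channel k x k deformable convolution.

    `offsets` holds one (dy, dx) pair per kernel tap, i.e. 2N numbers for N = k * k.
    """
    k = len(kernel)
    radius = k // 2
    out = 0.0
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i * k + j]
            out += kernel[i][j] * bilinear_sample(
                feature, py + i - radius + dy, px + j - radius + dx)
    return out


# Toy input whose value at (row, col) is 4 * row + col, and an averaging kernel.
feature = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0 / 9] * 3 for _ in range(3)]

regular = deformable_conv_at(feature, kernel, [(0.0, 0.0)] * 9, 1, 1)
shifted = deformable_conv_at(feature, kernel, [(0.0, 0.5)] * 9, 1, 1)
print(regular, shifted)
```

With all offsets at zero the result matches an ordinary 3x3 convolution; shifting every tap half a pixel to the right moves the average by 0.5 on this linear ramp.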

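[Figure: Deformable Convolutional Layer v2 (title inferred from the sigmoid modulation branch): input tensor (h × w × d) → k × k kernel with D filters → output tensor (H × W × D); an offset layer produces an H × W × 2N offset map and a modulation layer produces an H × W × N modulation map (N = k²), which is passed through a sigmoid]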

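[Figure: Deformable Convolutional Layer v3 (title inferred from the grouped, softmax-normalized design): input tensor (h × w × d, with d = C′ × G) → k × k convolutional kernel over C′ channels with parameters shared in kernels → output tensor (H × W × D); the offset layer (H × W × 2N offset map) and the modulation layer (H × W × N modulation map, N = k²) are each repeated × G; the modulation map is normalized with softmax (axis: depth)]
*Details of the convolutional kernels in the offset and modulation layers were not found in the paper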
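The switch from sigmoid to softmax, together with the split into G groups, can be sketched in a few lines of plain Python. This is a simplified illustration, assuming scalar sampled values per point (a real layer aggregates C′-channel vectors and then applies the shared projection); the function names are hypothetical.

```python
import math


def sigmoid_modulation(logits):
    """DCNv2-style: each sampling point is squashed independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_modulation(logits):
    """DCNv3-style: scalars are normalized across the sampling points of one group."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_aggregate(samples, logits):
    """Aggregate per group; samples[g][k] is the value sampled at point k of group g."""
    outputs = []
    for group_samples, group_logits in zip(samples, logits):
        weights = softmax_modulation(group_logits)
        outputs.append(sum(w * s for w, s in zip(weights, group_samples)))
    return outputs


logits = [0.0] * 9                   # N = 9 sampling points of a 3x3 kernel
sig = sigmoid_modulation(logits)     # sums to 4.5: the total is unconstrained
soft = softmax_modulation(logits)    # sums to 1 by construction
grouped = grouped_aggregate([[1.0, 3.0], [2.0, 2.0]],
                            [[0.0, 0.0], [5.0, -5.0]])
print(sum(sig), sum(soft), grouped)
```

The sigmoid scalars can sum to anything between 0 and N, while the softmax scalars always sum to 1 within a group, which keeps the scale of the aggregated output bounded.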

InternImage Model
1. Basic Block

InternImage Model
2. Stem Layer & Downsampling

InternImage Model
3. Stacking Rules
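The figure for this slide did not survive extraction. As a minimal sketch, assuming the stacking rules as given in the InternImage paper (C_i = 2^(i-1) · C_1, G_i = C_i / C′, L_1 = L_2 = L_4, and L_1 ≤ L_3), the four stages can be derived from four free hyper-parameters. The function name and the example values (C_1 = 64, C′ = 16, L_1 = 4, L_3 = 18) are illustrative assumptions, not taken from the slides.

```python
def stage_configs(c1, group_channels, l1, l3):
    """Derive per-stage (channels, groups, depth) from four free hyper-parameters."""
    if l1 > l3:
        raise ValueError("stacking rule L1 <= L3 is violated")
    depths = [l1, l1, l3, l1]  # L1 = L2 = L4
    configs = []
    for i, depth in enumerate(depths):
        channels = c1 * 2 ** i  # C_i = 2^(i-1) * C_1 for stages i = 1..4
        if channels % group_channels:
            raise ValueError("channels must be divisible by C'")
        configs.append((channels, channels // group_channels, depth))
    return configs


configs = stage_configs(c1=64, group_channels=16, l1=4, l3=18)
print(configs)
```

The point of such rules is to shrink the search space: instead of choosing channels, groups, and depth for each of the four stages separately, only C_1, C′, L_1, and L_3 remain to be tuned.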

InternImage Model
4. Scaling Rules

InternImage Model
5. Hyper-parameters for models of different scales
Four-stage models have been prevalent since DETR
• Swin Transformer
• MetaFormer

Experiments
▪ Image Classification (Tiny Model)

Experiments
▪ Image Classification (Large Model)

Experiments
▪ Object Detection & Instance Segmentation

Experiments
▪ Semantic Segmentation

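Image Processing Team Review Comments
▪ Lack of analysis of Deformable Conv v3
No ablation study
• It would have been very helpful to verify whether it also improves existing convolution-based models such as ResNet
No qualitative analysis of the advantages of Deformable Conv v3
• Visualizations showing the benefit of the diverse offset maps within a single layer, which arise from splitting the convolution into groups, would have been welcome
No supporting evidence for the claim that inductive bias is reduced
▪ The code has not been released yet, so precise verification is difficult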
