InternImage: Exploring Large-Scale Vision
Foundation Models with
Deformable Convolutions
Presenter: 김병현
Image Processing Team:
강인하, 김현진, 안종식, 이주영, 이희재, 현청천
InternImage
▪ State-of-the-Art (COCO test-dev) Backbone Network
Released by OpenGVLab
Introduction
▪ Success of Transformers in Computer Vision Tasks
▪ CNN-based foundation models can also achieve
comparable or even better performance than ViTs when
equipped with similar operator-/architecture-level designs,
scaling-up parameters, and massive data.
Introduction
▪ Gap between CNNs and ViTs
Operator Level
• Long-range dependency
• Adaptive spatial aggregation
Architecture View
• Advanced components
– Layer Normalization
– Feed Forward Networks
– GELU
Recent Long-Range CNNs
• Very large kernels (31x31)
• Still a performance gap with SOTA ViTs
Introduction
▪ Comparison of different core operators
Introduction
▪ Global Attention: Vision Transformer
Architecture of ViT
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers
for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
Introduction
▪ Local Attention: Swin Transformer
Architecture of Swin Transformer
Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using
shifted windows." Proceedings of the IEEE/CVF international conference
on computer vision. 2021.
Introduction
▪ Large Kernel: SLaK
Liu, Shiwei, et al. "More convnets in the 2020s: Scaling up kernels beyond
51x51 using sparsity." arXiv preprint arXiv:2207.03620 (2022).
Introduction
▪ Concentration on CNN-based Model
InternImage
• Brand-new CNN-based backbone network
• Characteristics
– Dynamic sparse convolutional layer
» Uses only 3x3 kernels
» Adaptive spatial aggregation
» Reduced inductive bias
» Low computational cost compared to large convolutional layers
– Overall architecture similar to ViT
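As a rough illustration of the cost argument above, the sketch below counts the weights of a single k x k convolution. The channel width of 64 and the helper names `conv_weight_count` and `depthwise_weight_count` are assumptions made for illustration and are not figures from the paper. Large-kernel CNNs often use depth-wise kernels in practice, so both cases are shown.

```python
def conv_weight_count(k: int, c_in: int, c_out: int) -> int:
    """Weights in one dense k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_weight_count(k: int, channels: int) -> int:
    """Weights in one depth-wise k x k convolution (one filter per channel)."""
    return k * k * channels


channels = 64  # hypothetical width, for illustration only
small = conv_weight_count(3, channels, channels)
large = conv_weight_count(31, channels, channels)
print(small, large)  # 36864 3936256
print(depthwise_weight_count(3, channels), depthwise_weight_count(31, channels))  # 576 61504
```

In both cases the 31x31 kernel carries 961/9 (roughly 107) times as many weights as the 3x3 kernel, which is the gap a 3x3 operator with learned sampling offsets tries to avoid.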
Introduction
▪ Contributions
First CNN-based backbone with more than 1 billion parameters.
Adds long-range dependencies and adaptive spatial aggregation
with 3x3 DCN
SOTA accuracy on the COCO dataset
Q & A
Overall Architecture
1. Deformable Convolutional Layer v3
2. Architecture Design
Deformable Convolutional Layer v3
 Revisiting DCNv2
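For reference, the DCNv2 operator revisited here is usually written as follows. The symbol names are a notational assumption, since the slide shows only a figure: K sampling points, projection weight w_k, modulation scalar m_k, regular grid position p_k, and learned offset Δp_k, evaluated at output location p_0.

```latex
\mathbf{y}(p_0) = \sum_{k=1}^{K} \mathbf{w}_k \, \mathbf{m}_k \, \mathbf{x}(p_0 + p_k + \Delta p_k)
```

Each sampling point has its own weight w_k, and each m_k is squashed independently by a sigmoid.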
https://www.taokong.org/report/DCN_v2_paper_reading.pdf
Deformable Convolutional Layer v3
1. Sharing weights among convolutional neurons
• DCNv2 has a heavy computational cost
• Each sampling point has independent linear projection weights
• Parameter and memory complexity is linear in the total number of sampling points
• To remedy this, the paper borrows the idea of separable
convolution and splits the original convolution weights into
depth-wise and point-wise parts
2. Introducing a multi-group mechanism
• Split the spatial aggregation process into G groups
3. Normalizing modulation scalars along sampling points
• Change element-wise sigmoid normalization to softmax
normalization along the sampling points.
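The third change can be illustrated with a small sketch that compares the two normalizations on the K = 9 modulation logits of one output location. The function names and the logit values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid_modulation(logits):
    """Element-wise sigmoid: each scalar lies in (0, 1) independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_modulation(logits):
    """Softmax along the sampling points: the scalars sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0] * 9  # hypothetical logits for the 9 points of a 3x3 kernel
sig = sigmoid_modulation(logits)
soft = softmax_modulation(logits)
print(sum(sig), sum(soft))
```

With the sigmoid, the sum of the modulation scalars can fall anywhere between 0 and K, so the scale of the aggregated feature varies from location to location. With the softmax, the sum is fixed at 1.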
Deformable Convolutional Layer v3
▪ Normal Convolutional Layer
[Figure: Input Tensor (h × w × d), Convolutional Kernel (k × k × D), Output Tensor (H × W × D)]
Deformable Convolutional Layer v3
▪ Deformable Convolutional Layer v1
[Figure: Input Tensor (h × w × d), Convolutional Kernel (k × k × D), Output Tensor (H × W × D); Offset Layer (Convolutional Kernel, k × k × 2N, N = k²) producing an Offset map (H × W × 2N)]
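A minimal single-channel sketch of the v1 idea: each of the N = k² kernel taps reads the input at its regular grid position plus a learned 2-D offset, using bilinear interpolation because the shifted position is fractional. The function names, the toy feature map, and the fixed offsets are assumptions for illustration. In a real layer the offsets come from the offset map predicted by the offset layer.

```python
import math

# Regular 3x3 sampling grid (dy, dx) around the output location.
GRID_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear_sample(feat, y, x):
    """Sample a 2-D list at fractional (y, x); points outside count as zero."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value


def deform_conv_at(feat, weights, offsets, y, x):
    """Sum of w_k * feat(p0 + p_k + offset_k) over the 9 kernel taps."""
    out = 0.0
    for w_k, (dy, dx), (oy, ox) in zip(weights, GRID_3X3, offsets):
        out += w_k * bilinear_sample(feat, y + dy + oy, x + dx + ox)
    return out


feat = [[4 * i + j for j in range(4)] for i in range(4)]  # toy 4x4 input
weights = [1.0 / 9] * 9                                    # averaging kernel
plain = deform_conv_at(feat, weights, [(0.0, 0.0)] * 9, 1, 1)
shifted = deform_conv_at(feat, weights, [(0.5, 0.5)] * 9, 1, 1)
print(plain, shifted)
```

With all offsets at zero the result matches an ordinary 3x3 convolution at that location. Non-zero offsets move the receptive field without enlarging the kernel.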
Deformable Convolutional Layer v3
▪ Deformable Convolutional Layer v2
[Figure: Input Tensor (h × w × d), Convolutional Kernel (k × k × D), Output Tensor (H × W × D); Offset Layer (Convolutional Kernel, k × k × 2N, N = k²) producing an Offset map (H × W × 2N); Modulation Layer (Convolutional Kernel, k × k × N) producing a Modulation map (H × W × N) passed through a Sigmoid]
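The channel counts in the v2 diagram follow directly from N = k². The short sketch below works them out for one layer. The helper name `dcnv2_map_shapes` and the 56 x 56 output size are hypothetical, chosen only for illustration.

```python
def dcnv2_map_shapes(k: int, out_h: int, out_w: int) -> dict:
    """Shapes (H, W, channels) of the two auxiliary maps for a k x k kernel."""
    n = k * k  # number of sampling points per output location
    return {
        "offset_map": (out_h, out_w, 2 * n),   # one (dy, dx) pair per point
        "modulation_map": (out_h, out_w, n),   # one scalar per point
    }


shapes = dcnv2_map_shapes(k=3, out_h=56, out_w=56)
print(shapes["offset_map"])      # (56, 56, 18)
print(shapes["modulation_map"])  # (56, 56, 9)
```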
Deformable Convolutional Layer v3
▪ Deformable Convolutional Layer v3
[Figure: Input Tensor (h × w × d, with d = C′ × G), Convolutional Kernel (k × k × C′; shared params in kernels), Output Tensor (H × W × D); Offset Layer (Convolutional Kernel*, k × k × 2N, N = k²) producing an Offset map (H × W × 2N); Modulation Layer (Convolutional Kernel*, k × k × N) producing a Modulation map (H × W × N) normalized with Softmax (axis: depth); × G]
*Details not found in the paper
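The three changes (shared weights, G groups, softmax normalization) can be summarized in one expression. This is a sketch in the notation commonly used for deformable convolution, with the symbols stated as assumptions: G groups, K sampling points per group, a projection weight w_g shared by all sampling points of group g, the slice x_g of the input belonging to that group, and per-group modulation scalars m_gk and offsets Δp_gk.

```latex
\mathbf{y}(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} \mathbf{w}_g \, \mathbf{m}_{gk} \, \mathbf{x}_g(p_0 + p_k + \Delta p_{gk}),
\qquad \sum_{k=1}^{K} \mathbf{m}_{gk} = 1
```

The constraint on the right expresses the softmax normalization along the K sampling points. The weight w_g no longer carries the index k, which reflects the weight sharing.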
InternImage Model
1. Basic Block
2. Stem & Downsampling
3. Stacking Rules
4. Scaling Rules
5. Hyper-parameters for models of different scales
4-stage models have been prevalent since DETR
• Swin Transformer
• MetaFormer
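To make the idea of stacking rules concrete, the sketch below derives per-stage widths for a 4-stage backbone. It assumes that the channel count doubles at each stage and that the number of DCN groups is the channel count divided by a fixed per-group width C′. The slide only names the rules, so these two relations, the helper name `stage_config`, and the values 64 and 16 are illustrative assumptions.

```python
def stage_config(c1: int, c_prime: int, num_stages: int = 4):
    """Per-stage channels and group counts under the assumed rules."""
    channels = [c1 * 2 ** i for i in range(num_stages)]  # assumed: doubles per stage
    groups = [c // c_prime for c in channels]            # assumed: channels / C'
    return channels, groups


channels, groups = stage_config(c1=64, c_prime=16)
print(channels)  # [64, 128, 256, 512]
print(groups)    # [4, 8, 16, 32]
```

Under these assumptions the whole hierarchy is fixed by two numbers, which is what makes a small search over scaling factors practical.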
Q & A
Experiments
▪ Image Classification (Tiny Model)
▪ Image Classification (Large Model)
▪ Object Detection & Instance Segmentation
▪ Semantic Segmentation
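Image Processing Team Review Comments
▪ Analysis of Deformable Conv v3 is lacking
No ablation study
• It would have been very helpful to verify whether it also improves
existing convolution-based models such as ResNet
No qualitative analysis of the advantages of Deformable Conv v3
• Visualizations showing the benefit of the diverse offset maps that a single
layer obtains by splitting the convolution into groups would have been helpful
No supporting evidence for the claim that inductive bias is reduced
▪ The code has not been released yet, so precise verification is difficult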
Q & A
