DeepLabV3+
Encoder-Decoder with Atrous Separable Convolution
for Semantic Image Segmentation
Background
▪ DeepLabV3+ is the latest version of the DeepLab models.
▪ DeepLab V1: Semantic Image Segmentation with Deep Convolutional Nets and
Fully Connected CRFs. ICLR 2015.
▪ DeepLab V2: DeepLab: Semantic Image Segmentation with Deep Convolutional
Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI 2017.
▪ DeepLab V3: Rethinking Atrous Convolution for Semantic Image Segmentation.
arXiv 2017.
▪ DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic
Image Segmentation. arXiv 2018.
Semantic Segmentation
▪ Classifying all pixels in an image
into classes.
▪ Classification at the pixel level.
▪ Does not have to separate different
instances of the same class.
▪ Has important applications in
Medical Imaging.
Current Results on Pascal VOC 2012
Motivation and Key Concepts
▪ Use Atrous Convolution and Separable Convolutions to reduce computation.
▪ Combine Atrous Spatial Pyramid Pooling Modules and Encoder-Decoder
Structures.
▪ ASPPs capture contextual information at multiple scales by pooling features at
different resolutions.
▪ Encoder-Decoders can obtain sharp object boundaries.
Architecture Overview
Advanced Convolutions
Convolution (Cross-Correlation) for 1 Channel
Left: Convolution with Zero-Padding. Right: Display with Convolution Kernel.
Blue maps: inputs, Cyan maps: outputs, Kernel: not displayed
Other Convolutions (Cross-Correlations)
Left: Strided Convolution with Padding. Right: Atrous (Dilated) Convolution with r=2.
Blue maps: inputs, Cyan maps: outputs, Kernel: not displayed
Atrous Convolution
▪ "à trous" is French for "with holes".
▪ Atrous Convolution is also known as Dilated Convolution.
▪ Atrous Convolution with rate r=1 is the same as ordinary convolution.
▪ The image on the left shows 1D atrous convolution.
Receptive Field of Atrous Convolutions
▪ Left: r=1, Middle: r=2, Right: r=4
▪ Atrous Convolution has a larger receptive field than normal convolution
with the same number of parameters.
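A minimal sketch of this, assuming PyTorch: the dilation argument of nn.Conv2d implements atrous convolution, and increasing the rate grows the receptive field without adding any weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 65, 65)  # (N, C_in, H, W)

# Ordinary 3x3 convolution (r=1): 3x3 receptive field.
conv_r1 = nn.Conv2d(3, 8, kernel_size=3, dilation=1, padding=1)
# Atrous 3x3 convolution with rate r=2: 5x5 receptive field,
# but exactly the same number of weights as the r=1 version.
conv_r2 = nn.Conv2d(3, 8, kernel_size=3, dilation=2, padding=2)

assert conv_r1.weight.shape == conv_r2.weight.shape  # identical parameter count
print(conv_r1(x).shape, conv_r2(x).shape)  # same spatial size with matched padding
```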
Depth-wise Separable Convolution
▪ A special case of Grouped Convolution.
▪ Separate the convolution operation along the depth (channel) dimension.
▪ It can refer to both orderings: (depth-wise -> point-wise) and (point-wise -> depth-wise).
▪ It is only meaningful for multi-channel convolutions (cross-correlations).
Review: Multi-Channel 2D Convolution
Exact Shapes and Terminology
▪ Filter: a collection of C_in kernels of shape (K_H, K_W), concatenated channel-wise.
▪ Input tensor shape: (N, C_in, H, W) or (N, H, W, C_in).
▪ Filters are 3D, kernels are 2D; in 2D CNNs, all filters are concatenated into a single 4D array.
(N: batch size, H: feature height, W: feature width,
C_in: number of input channels, C_out: number of output channels, K_H: kernel height, K_W: kernel width)
Step 1: Convolution on Input Tensor Channels
Step 2: Summation along Input Channel Dimension
Step 3: Add Bias Term
Key Points
▪ Each kernel of a filter slides over only one channel of the input tensor.
▪ The number of filters is C_out; each filter generates one output channel.
▪ Each 2D kernel is different from all the other kernels in its 3D filter.
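To make Steps 1-3 concrete, here is a hedged sketch (PyTorch assumed) that performs the per-channel convolutions, the channel summation, and the bias addition by hand, then checks the result against the library call:

```python
import torch
import torch.nn.functional as F

N, C_in, C_out, H, W, K = 1, 3, 4, 8, 8, 3
x = torch.randn(N, C_in, H, W)
filters = torch.randn(C_out, C_in, K, K)  # C_out filters, each holding C_in kernels
bias = torch.randn(C_out)

outputs = []
for f in range(C_out):
    # Step 1: convolve each input channel with its own 2D kernel.
    maps = [F.conv2d(x[:, c:c+1], filters[f:f+1, c:c+1], padding=1)
            for c in range(C_in)]
    # Step 2: sum along the input-channel dimension.
    summed = torch.stack(maps, dim=0).sum(dim=0)
    # Step 3: add the bias term for this filter.
    outputs.append(summed + bias[f])
out = torch.cat(outputs, dim=1)  # (N, C_out, H, W)

# Matches the library implementation.
ref = F.conv2d(x, filters, bias=bias, padding=1)
assert torch.allclose(out, ref, atol=1e-5)
```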
Normal Convolution
▪ Top: Input Tensor
▪ Middle: Filter
▪ Bottom: Output Tensor
Depth-wise Separable Convolution
▪ Replace Step 2.
▪ Instead of summation, use a point-wise convolution (1x1 convolution).
▪ There is now only one (C_in, K_H, K_W) filter.
▪ The number of 1x1 filters is C_out.
▪ Bias is usually added only once, at the end of the two-convolution sequence.
▪ The term usually refers to depth-wise convolution -> point-wise convolution.
▪ Xception uses point-wise convolution -> depth-wise convolution.
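As a sketch of the usual (depth-wise -> point-wise) ordering, assuming PyTorch, where groups=C_in gives each input channel its own kernel and the bias appears only on the final 1x1 convolution:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise convolution (groups=C_in) followed by a point-wise 1x1 convolution."""
    def __init__(self, c_in, c_out, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        # One (C_in, K_H, K_W) filter: each input channel gets its own 2D kernel.
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size, padding=pad,
                                   dilation=dilation, groups=c_in, bias=False)
        # C_out point-wise (1x1) filters replace the channel summation of Step 2.
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=True)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```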
Depth-wise Separable Convolution
Characteristics
▪ Depth-wise Separable Convolution can be used as a drop-in replacement for
ordinary convolution in DCNNs.
▪ The number of parameters is reduced significantly (sparse representation).
▪ The number of FLOPs is reduced substantially, roughly by a factor of
1/C_out + 1/(K_H * K_W), about 8-fold in the example below (computationally efficient).
▪ There is no significant drop in performance (performance may even improve).
▪ Wall-clock time reduction is less dramatic due to GPU memory access patterns.
Example: FLOP Comparison (with Padding, without Bias)
Ordinary Convolution
▪ H * W * K_H * K_W * C_in * C_out
▪ For a 256x256x3 image with 128 filters of kernel size 3x3, the number of FLOPs is
256 * 256 * 3 * 3 * 3 * 128 = 226,492,416
Depth-wise Separable Convolution
▪ H * W * K_H * K_W * C_in + H * W * C_in * C_out
▪ Left term: depth-wise convolution; right term: point-wise convolution.
▪ For the same input, the number of FLOPs is
256 * 256 * 3 * 3 * 3 + 256 * 256 * 3 * 128
= 1,769,472 + 25,165,824 = 26,935,296
▪ This is roughly an 8-fold reduction in FLOPs.
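The arithmetic can be reproduced directly (plain Python, matching the formulas above):

```python
H, W, K_H, K_W, C_in, C_out = 256, 256, 3, 3, 3, 128

ordinary = H * W * K_H * K_W * C_in * C_out                   # 226,492,416
separable = H * W * K_H * K_W * C_in + H * W * C_in * C_out   # 26,935,296
print(ordinary / separable)  # ~8.4, the "8-fold" reduction
```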
Example: Parameter Comparison (Excluding Bias Terms)
Ordinary Convolution
▪ K_H * K_W * C_in * C_out
▪ For a 256x256x3 image with 128 filters of kernel size 3x3, the number of weights is
3 * 3 * 3 * 128 = 3,456
Depth-wise Separable Convolution
▪ K_H * K_W * C_in + C_in * C_out
▪ For the same configuration, the number of weights is
3 * 3 * 3 + 3 * 128 = 411
▪ This is again roughly an 8-fold reduction, this time in parameter count.
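The same counts can be read off a framework's own layers (a sketch assuming PyTorch):

```python
import torch.nn as nn

ordinary = nn.Conv2d(3, 128, kernel_size=3, bias=False)
depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)
pointwise = nn.Conv2d(3, 128, kernel_size=1, bias=False)

print(ordinary.weight.numel())                               # 3456
print(depthwise.weight.numel() + pointwise.weight.numel())   # 27 + 384 = 411
```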
Atrous Depth-wise Separable Convolution
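Making the depth-wise stage atrous is a one-argument change; reusing the DepthwiseSeparableConv sketch above:

```python
# Atrous (r=2) depth-wise separable convolution: the core operation in DeepLabV3+.
atrous_sep = DepthwiseSeparableConv(c_in=256, c_out=256, kernel_size=3, dilation=2)
```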
Architecture Overview
Encoder-Decoder Structures
▪ The Encoder reduces the spatial size of the feature maps while extracting higher-level
semantic information.
▪ The Decoder gradually recovers the spatial information.
▪ U-Nets are a classic example of encoder-decoder structures.
▪ In DeepLabV3+, DeepLabV3 is used as the encoder.
Architecture Overview
Decoder Layer
Structure
1. Apply 4-fold bilinear up-sampling to the ASPP outputs.
2. Apply a 1x1 convolution with a reduced filter count to an intermediate (low-level) feature map.
3. Concatenate the ASPP outputs with the intermediate features.
4. Apply two 3x3 convolutions.
5. Apply 4-fold bilinear up-sampling (see the sketch below).
Purpose & Implementation
▪ The ASPP is poor at capturing fine details.
▪ The decoder is used to recover spatial detail in the segmentation output.
▪ A 1x1 convolution reduces the channel count of the intermediate features.
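A hedged sketch of the five steps as a forward pass (PyTorch assumed; the 48-channel reduction and 256-channel widths follow common configurations, and the low-level feature map is assumed to arrive at the up-sampled resolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, aspp_channels=256, low_level_channels=256, num_classes=21):
        super().__init__()
        # Step 2: 1x1 convolution shrinks the intermediate (low-level) features
        # to a small channel count (48 is a common choice).
        self.reduce = nn.Conv2d(low_level_channels, 48, kernel_size=1)
        # Step 4: two 3x3 convolutions refine the concatenated features.
        self.refine = nn.Sequential(
            nn.Conv2d(aspp_channels + 48, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, aspp_out, low_level_feat):
        # Step 1: 4-fold bilinear up-sampling of the ASPP output.
        x = F.interpolate(aspp_out, scale_factor=4, mode='bilinear',
                          align_corners=False)
        # Step 3: concatenate with the reduced intermediate features channel-wise.
        x = torch.cat([x, self.reduce(low_level_feat)], dim=1)
        x = self.refine(x)
        # Step 5: final 4-fold bilinear up-sampling to input resolution.
        return F.interpolate(self.classify(x), scale_factor=4, mode='bilinear',
                             align_corners=False)
```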
ASPP: Atrous Spatial Pyramid Pooling
The ASPP Layer
▪ Encodes multi-scale contextual information by applying convolutions at multiple atrous rates.
▪ Concatenates all extracted features, together with an up-sampled global average
pooling layer, channel-wise.
▪ Uses atrous depth-wise separable convolutions at the multiple rates.
▪ Poor at capturing sharp object boundaries.
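A minimal sketch of the ASPP layer under stated assumptions (PyTorch; rates 6, 12, 18 plus a 1x1 branch and image-level pooling, as commonly configured; plain convolutions stand in for the atrous separable convolutions, and batch norm/ReLU are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, c_in, c_out=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, kernel_size=1)] +            # 1x1 branch
            [nn.Conv2d(c_in, c_out, kernel_size=3, padding=r,    # atrous branches
                       dilation=r) for r in rates])
        self.image_pool = nn.Conv2d(c_in, c_out, kernel_size=1)  # after global avg pool
        self.project = nn.Conv2d(c_out * (len(rates) + 2), c_out, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        # Image-level feature: global average pool, 1x1 conv, up-sample back.
        pooled = self.image_pool(F.adaptive_avg_pool2d(x, 1))
        feats.append(F.interpolate(pooled, size=(h, w), mode='bilinear',
                                   align_corners=False))
        # Concatenate channel-wise and project back down to c_out channels.
        return self.project(torch.cat(feats, dim=1))
```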
Modified Aligned Xception Network
▪ Xception: Extreme Inception Network.
▪ The backbone network for DeepLabV3+.
▪ Uses residual blocks and separable
convolutions.
Explanation of Xception
▪ Takes the "Inception Hypothesis" to the extreme: cross-channel correlations and
spatial correlations are sufficiently decoupled that it is preferable not to
map them jointly.
▪ The extensive use of separable convolutions and atrous convolutions allows
the model to fit in GPU memory despite the huge number of layers.
▪ Originally applied point-wise convolution before depth-wise convolution.
▪ Invented by François Chollet.
Architecture Review
The End
