DeconvNet, DecoupledNet, TransferNet in Image Segmentation

NamHyuk Ahn @ Ajou Univ.
2016. 05. 11

Contents
- Semantic Segmentation
- Deconvolution Network for Supervised Learning
- Decoupled Network for Semi-Supervised Learning
- Transfer Learning in Semantic Segmentation

Semantic Segmentation
- Predict a class label for every pixel in the image
[Shotton et al. 2007]

Datasets
- PASCAL VOC
• 20 classes
• 12K training / 1K test images
- MS COCO
• 91 classes
• 120K training / 40K test images

Deconvolution Network for Supervised Learning

Problems of FCN
- FCN handles only single-scale semantics, since it has a fixed-size receptive field
- The label map is small, so detailed object structures tend to be lost

DeconvNet
- To address these issues, DeconvNet uses "deconvolution"
- The convolution network extracts features (VGG-16 net)
- The deconvolution network generates a probability map (same size as the input image)
- The probability map gives, for each pixel, the probability that it belongs to each class

Deconvolution Network
- Unpooling
• Reconstructs the structure of the original activation map
• The original activation size is restored, but the map is still sparse
- Deconvolution
• Densifies the sparse (enlarged) activation map
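
The two operations above can be sketched in pure Python. This is a minimal illustration, assuming 2x2 non-overlapping pooling windows and an input whose sides are divisible by the window size. The function names (`max_pool_with_switches`, `unpool`, `transposed_conv1d`) are hypothetical and not taken from the DeconvNet code, and the 1-D kernel is hand-picked, whereas the filters in the paper are learned.

```python
def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records where each maximum came from."""
    pooled, switches = [], []
    for i in range(0, len(x), k):
        prow, srow = [], []
        for j in range(0, len(x[0]), k):
            # Collect (value, location) pairs inside the current window.
            window = [(x[a][b], (a, b))
                      for a in range(i, i + k)
                      for b in range(j, j + k)]
            value, location = max(window, key=lambda t: t[0])
            prow.append(value)
            srow.append(location)
        pooled.append(prow)
        switches.append(srow)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place each pooled value back at its recorded location; the rest stays 0."""
    out = [[0.0] * width for _ in range(height)]
    for prow, srow in zip(pooled, switches):
        for value, (a, b) in zip(prow, srow):
            out[a][b] = value
    return out


def transposed_conv1d(x, kernel, stride=1):
    """1-D transposed convolution: each input value spreads a scaled kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


x = [[1, 2, 0, 1],
     [3, 4, 1, 0],
     [0, 1, 5, 2],
     [2, 0, 1, 3]]
pooled, switches = max_pool_with_switches(x)
unpooled = unpool(pooled, switches, 4, 4)   # same size as x, mostly zeros
dense = transposed_conv1d(unpooled[1], [0.5, 1.0, 0.5])  # one sparse row, densified
```

Unpooling restores the size and the positions of the maxima but leaves most entries at zero; the transposed convolution then spreads each surviving activation over its neighbours, which is the densifying step.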

Analysis of DeconvNet
- DeconvNet is better at segmentation since it produces a dense, enlarged pixel-wise map
- Shallow layers tend to capture the overall structure of an object (shape, region, position), while deep layers capture more complicated patterns
- Unpooling captures example-specific structure, so it can reconstruct object details at higher resolution
- Deconvolution captures class-specific shape, so activations closely related to the target class are amplified and noisy activations are suppressed

More details of DeconvNet
- Instance-wise segmentation
- Use batch normalization in both networks
- Two-stage training
- Ensemble with FCN
• FCN and DeconvNet are complementary
• The ensemble gives the best result

Instance-wise Segmentation
- Feed object proposals to the network (not the entire image)
- Get the proposals using the EdgeBox algorithm
- Identify finer object details by handling multiple scales
- Reduces the search space, which reduces memory use during training

Two-stage Training
- DeconvNet has many parameters, but little segmentation data is available (10K in PASCAL VOC)
• Use two-stage training to address this issue
• First stage: input center-cropped images
• Second stage: input proposal sub-images
- As a result, the network generalizes better

Result
- 2nd best among methods trained only on PASCAL VOC
- Note: the paper reports a mean IoU of 72.5, but the presentation files report 74.8
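
For reference, mean IoU (intersection over union, averaged over classes) can be computed as in the sketch below. It is a minimal illustration on a single pair of label maps, assuming integer class labels in `range(num_classes)`; the function name `mean_iou` is hypothetical, and benchmark evaluation protocols typically accumulate the counts over a whole dataset rather than one image.

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two 2-D integer label maps."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Pixel is in both the predicted and the true region of class p.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel counts toward the union of both classes involved.
                union[p] += 1
                union[g] += 1
    # Average only over classes that appear in the prediction or ground truth
    # (assumes at least one such class exists).
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


pred = [[0, 1],
        [1, 1]]
gt = [[0, 1],
      [0, 1]]
score = mean_iou(pred, gt, num_classes=2)  # class 0: 1/2, class 1: 2/3
```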

Recap
- Produces dense, precise segmentation masks through coarse-to-fine reconstruction
- With instance-wise segmentation, it can handle object scale variation
- But it has many parameters (almost 2x VGG-16), so an additional training stage is needed

Decoupled Network for Semi-Supervised Learning

Motivation
- Creating pixel-level ground truth for segmentation is expensive, so use semi-supervised learning
- Utilize many image-level annotations and few pixel-level annotations
- Modify DeconvNet
- With little pixel-level data (25 annotations per class), achieves a good result (62.5 mean IoU)

Main idea
- Semantic segmentation can be decomposed into multi-label classification and binary segmentation
[Figure: semantic segmentation = multi-label classification (e.g. Person, Bottle) + binary segmentation]

Overview
- Classification network for multi-label classification
- Segmentation network for binary segmentation
- Bridging layers for delivering class-specific
information to segmentation network

Architecture
- Classification Network (same as VGG-16)
- Segmentation Network
• Takes the class-specific activation map from the bridging layers and performs binary segmentation (the main difference from DeconvNet)
• Binary segmentation reduces the number of parameters, so the network can be trained with few pixel-wise annotations

Architecture
- Bridging Layers
• The segmentation network needs class-specific and spatial information to produce a class-specific segmentation mask
• Spatial information comes from pool5 in the classification network
• The pool5 output is useful for shape generation, but it mixes information from all relevant labels → class-specific activations must be identified
• A saliency map is used to identify the class-specific activations

Architecture
- Saliency Map
1. Produce the score vector, then set its gradient (dscore) to 0 everywhere except 1 at the index of the label to track
2. Backprop to an arbitrary layer (pool5 in this paper)
- The saliency map gives class-specific information for each label (class)
[Figure: qualitative example of a saliency map, Karen Simonyan et al., 2014]
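
The two steps above can be sketched on a toy network. This is a minimal illustration, assuming a two-layer model `scores = W2 * relu(W1 * x)` with hand-picked weights; the real model is VGG-16 with backprop stopped at pool5, and the names `class_saliency`, `matvec` and `transpose` are hypothetical.

```python
def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def class_saliency(W1, W2, x, label):
    """Gradient of one class score with respect to the input x."""
    # Forward pass.
    pre = matvec(W1, x)
    hidden = [max(0.0, p) for p in pre]
    scores = matvec(W2, hidden)
    # Step 1: dscore is all zeros except a 1 at the tracked label.
    dscore = [0.0] * len(scores)
    dscore[label] = 1.0
    # Step 2: backprop through the linear layer, the ReLU, and the first layer.
    dhidden = matvec(transpose(W2), dscore)
    dpre = [g if p > 0 else 0.0 for g, p in zip(dhidden, pre)]
    dx = matvec(transpose(W1), dpre)
    return scores, dx


W1 = [[1.0, -1.0],
      [0.5, 2.0]]
W2 = [[1.0, 0.0],
      [1.0, 1.0]]
x = [1.0, 2.0]
scores, saliency_for_1 = class_saliency(W1, W2, x, label=1)
_, saliency_for_0 = class_saliency(W1, W2, x, label=0)
```

In this toy case the first hidden unit is inactive (its pre-activation is negative), so class 0, which depends only on that unit, gets a zero saliency vector, while class 1 gets a non-zero one. Different labels therefore yield different gradients at the same layer, which is what makes the map class-specific.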

Architecture
- Bridging Layers
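• Combine the spatial information (pool5) and the class-specific saliency map to produce the class-specific activation map g
• Pass the combination through an fc layer and feed it to the segmentation network
• g has both spatial and class-specific information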

Inference
- Compute a segmentation map for each identified label
- Aggregate the segmentation maps M pixel-wise
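
One way to picture the pixel-wise aggregation is the sketch below. It is only an illustration under stated assumptions: each identified label comes with a foreground-probability map of the same size, the label with the highest probability wins at each pixel, and a pixel falls back to background when no map reaches a threshold. The function name `aggregate_masks` and the 0.5 threshold are hypothetical, and the exact aggregation rule in the paper may differ.

```python
def aggregate_masks(maps, bg_threshold=0.5, background="background"):
    """Merge per-label foreground maps into a single label map."""
    labels = list(maps)
    height = len(maps[labels[0]])
    width = len(maps[labels[0]][0])
    result = []
    for i in range(height):
        row = []
        for j in range(width):
            # Pick the label whose binary segmentation is most confident here.
            best = max(labels, key=lambda name: maps[name][i][j])
            if maps[best][i][j] >= bg_threshold:
                row.append(best)
            else:
                row.append(background)
        result.append(row)
    return result


maps = {
    "person": [[0.9, 0.2],
               [0.6, 0.1]],
    "bottle": [[0.1, 0.7],
               [0.8, 0.3]],
}
label_map = aggregate_masks(maps)
```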

Training
- Train the classification network with many image-level annotations
- Train the segmentation network and bridging layers with few pixel-level annotations

Recap
- Utilizes many image-level annotations and few pixel-level annotations
- Adds bridging layers to DeconvNet and uses binary segmentation to reduce parameters
- The bridging layers output both spatial and class-specific information for each class (label)
- The two networks are trained separately (decoupled)
• Performance is worse under full supervision, where joint optimization is more desirable
- With few strong annotations (25 per class), achieves a good result (62.5 mean IoU)

Transfer Learning in Semantic Segmentation

Motivation
- Pre-train a network and run inference on a new dataset (e.g. train on MS COCO, run inference on PASCAL VOC)
- This idea doesn't work well with DecoupledNet
• DecoupledNet is trained with class-specific input, so it can't generalize to new classes
• Train the network with class-independent input!

Overview
- The attention model identifies the salient region of each class associated with the input image
• The output of the attention model holds location information for each class in a coarse feature map
- The encoder extracts features; the decoder generates a dense foreground segmentation mask for each focused region
- Training stage
• Fix the (pre-trained) encoder and train the decoder and attention model using pixel-level annotations from the source domain
• Train the attention model using image-level annotations from both domains
- After training, the decoder has been trained on the source domain only, while the attention model has been trained on both domains, so the attention is adapted to the target domain

Overview
- The decoupled encoder-decoder makes it possible to share shape-generation information among different classes
- The attention model provides
• Predictions for localization
• Class-specific information → enables the decoder to be adapted to the target domain
- With the attention model, the network obtains information that is transferable across domains and provides a useful segmentation prior

Architecture
- Encoder
• Extract a feature descriptor A ∈ R^{M×D}; A is obtained from the last conv layer to retain spatial information
• M and D are the number of hidden units per channel (20x20) and the number of channels, respectively
- Attention model
• Learn a per-class weight vector over the M locations, where each entry represents the relevance of that location to class l
• Formally,
• An extra technique from [R. Memisevic, 2013] is used to reduce the number of parameters

Architecture
- Attention model
• To apply attention in this model, it has to be trainable in both domains
• Add additional layers on top of the attention model, and train both under a classification objective
• Finally, z represents the class-specific feature
• z can be optimized using weak annotations from both domains
[Figure: example of attention]

Architecture
- Decoder
• The output of the attention model is sparse due to the softmax, so it may lose information needed for shape generation
• Feed A as an additional input and multiply it with z → densified attention
• With the densified attention, optimize the segmentation loss; the procedure is the same as in DecoupledNet, but the decoder is optimized with the source domain only
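
The sparse-then-densified attention can be sketched as follows. The slide's formulas did not survive, so this is an illustration under stated assumptions: the attention weights are a softmax over per-location scores, z is taken to be the attention-weighted sum of the rows of A, and the densified attention is the product of A with z. The function names `softmax`, `attend` and `densify` and the toy numbers are hypothetical; in the real model the scores come from the learned attention layers.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def attend(A, logits):
    """Return attention weights over the M rows of A and the pooled feature z."""
    alpha = softmax(logits)
    num_channels = len(A[0])
    # Assumed form: z is the attention-weighted sum of the location features.
    z = [sum(alpha[i] * A[i][d] for i in range(len(A)))
         for d in range(num_channels)]
    return alpha, z


def densify(A, z):
    """Correlate every location feature with z to get one score per location."""
    return [sum(a * b for a, b in zip(row, z)) for row in A]


# M = 3 locations, D = 2 channels; locations 0 and 2 share a channel.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
alpha, z = attend(A, logits=[10.0, 0.0, 0.0])  # attention peaks on location 0
dense = densify(A, z)
```

Here the softmax puts almost all weight on location 0, yet the densified map also scores location 2 highly because its feature overlaps with z. That is the sense in which multiplying by A recovers shape information that the sparse attention alone would drop.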

Analysis of TransferNet
- The decoder generates a foreground segmentation for the attended region of each label
- By decoupling classification (a domain-specific task) from segmentation, the decoder can capture class-independent information for shape generation and apply it to unseen classes
- Because the attention model is trained with image-level as well as pixel-level annotations, it can handle unseen classes
• In DecoupledNet, the bridging layers are trained with pixel-level data only

Train / Inference
- Training: optimize the joint objective
• Training with class labels alone works, but training jointly with segmentation labels regularizes the noise
• After training, remove the additional layers on top of the attention model, since they are required only during training to learn attention from the target domain
- Inference
1. Iteratively obtain the attention and segmentation mask for each label
2. Aggregate the masks (same as DecoupledNet)

Reference
- Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. "Learning deconvolution network for semantic segmentation." Proceedings of the IEEE International Conference on Computer Vision. 2015.
- Seunghoon Hong, Hyeonwoo Noh, and Bohyung Han. "Decoupled deep neural network for semi-supervised semantic segmentation." Advances in Neural Information Processing Systems. 2015.
- Seunghoon Hong, et al. "Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network." arXiv preprint arXiv:1512.07928 (2015).
- Hyeonwoo Noh. "Semantic Segmentation and Visual Question Answering" (https://drive.google.com/file/d/0B5xl2L77gZfVRXZxQWNmSGlBemc/view)
