FeatUp: A Model-Agnostic Framework for Features at Any Resolution
Kim, Sungchul
Contents
▪ Introduction
▪ FeatUp
▪ Experiments
▪ Conclusions
Introduction
Introduction
▪ Deep features often sacrifice spatial resolution for semantic quality
• ResNet-50 produces 7 × 7 deep features from a 224 × 224 pixel input (32× resolution reduction)
• Vision Transformers (ViTs) incur a significant resolution reduction
▪ FeatUp: a novel framework to improve the resolution of any vision model’s features
• inspired by 3D reconstruction frameworks like NeRF, in which multi-view consistency of low-resolution signals can
supervise the construction of high-resolution signals
Introduction
▪ Proposed method, FeatUp
• to significantly improve the spatial resolution of any model’s features, parametrized as either a fast feedforward
upsampling network or an implicit network
• includes a fast CUDA implementation of Joint Bilateral Upsampling that is orders of magnitude more efficient than existing implementations
• can be used as drop-in replacements for ordinary features to improve performance on dense prediction tasks and
model explainability
FeatUp
▪ Intuition
• Compute high-resolution features by observing multiple different “views”
• Two steps in the pipeline
Step 1. generate low-resolution feature views to refine into a single high-resolution output
Step 2. construct a consistent high-resolution feature map from these views
FeatUp
▪ Step 1
• Generate low-resolution feature views to refine into a single high-resolution output
• Perturb the input image with small pads, scales, and horizontal flips
• Apply the model to each transformed image to extract a collection of low-resolution feature maps (see the sketch below)
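A minimal sketch of this view-generation step, assuming a frozen PyTorch backbone `f`; the helper name `jittered_views` and the transform ranges are illustrative, not the paper's exact settings:

```python
import random
import torch
import torch.nn.functional as F

def jittered_views(x, f, n_views=8, max_pad=16):
    """Extract low-res feature maps from randomly jittered copies of x.

    x: (1, 3, H, W) image; f: frozen backbone returning (1, C, h, w).
    Returns the feature views and the parameters used for each one, so the
    same transforms can later be applied to the high-res features.
    """
    views, params = [], []
    for _ in range(n_views):
        pad = random.randint(0, max_pad)
        flip = random.random() < 0.5
        t = torch.flip(x, dims=[3]) if flip else x          # horizontal flip
        t = F.pad(t, (pad, pad, pad, pad), mode="reflect")  # small pad
        t = F.interpolate(t, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)              # rescale back (a small zoom)
        with torch.no_grad():
            views.append(f(t))                              # low-res feature view
        params.append({"pad": pad, "flip": flip})
    return views, params
```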
FeatUp
▪ Step 2
• Construct a consistent high-resolution feature map from these views
• Track the parameters used to “jitter” each image rather than estimating them
• Apply the same transformation to our learned high-resolution features prior to downsampling
• Compare downsampled features to the true model outputs using a Gaussian likelihood loss
FeatUp
▪ Reconstruction Loss
$$\mathcal{L}_{rec} = \frac{1}{|T|} \sum_{t \in T} \left( \frac{1}{2s^2} \left\lVert f(t(x)) - \sigma_\downarrow\big(t(F_{hr})\big) \right\rVert_2^2 + \log s \right)$$
• 𝑡 ∈ 𝑇 : a collection of small transforms such as pads, zooms, crops, horizontal flips, and their compositions
• 𝑥 : an input image
• 𝑓 : model backbone
• 𝜎↓ : a learned downsampler
• 𝜎↑ : a learned upsampler
• $F_{hr} = \sigma_\uparrow(f(x), x)$ : the predicted high-res features
• $s = \mathcal{N}(f(t(x)))$ : a spatially-varying adaptive uncertainty
• 𝒩 : a small linear network
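A sketch of this loss in PyTorch, where `apply_t`, `sigma_down`, and `uncertainty_net` are hypothetical stand-ins for the jitter re-application, the learned downsampler σ↓, and the uncertainty network 𝒩:

```python
import torch

def reconstruction_loss(views, params, F_hr, sigma_down, apply_t, uncertainty_net):
    """Gaussian NLL between true low-res features and downsampled predictions."""
    loss = 0.0
    for f_t, t in zip(views, params):
        pred = sigma_down(apply_t(F_hr, t))       # σ↓(t(F_hr))
        s = uncertainty_net(f_t).clamp(min=1e-4)  # spatially varying std s
        loss = loss + ((f_t - pred) ** 2 / (2 * s ** 2) + s.log()).mean()
    return loss / len(views)
```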
FeatUp
▪ Choosing a Downsampler
• Two options:
• A fast and simple learned blur kernel
• A more flexible attention-based downsampler
FeatUp
▪ Choosing a Downsampler
• A fast and simple learned blur kernel
• Blurs the features with a learned blur kernel and can be implemented as a convolution applied independently to each channel
• Normalized to be non-negative and sum to 1 to ensure the features remain in the same space
• Cannot capture dynamic receptive fields, object salience, or other nonlinear effects → too restrictive in practice; prefer the attention-based downsampler
FeatUp
▪ Choosing a Downsampler
• A more flexible attention-based downsampler
• Uses a 1x1 convolution to predict a saliency map from the high-resolution features
• Combines this saliency map with learned spatially-invariant weight and bias kernels
• Normalizes the results to create a spatially-varying blur kernel that interpolates the features
$$\sigma_\downarrow(F_{hr})[i, j] = \mathrm{softmax}\big(w \odot \mathrm{Conv}(F_{hr}[\Omega_{ij}]) + b\big) \cdot F_{hr}[\Omega_{ij}]$$
• $F_{hr}[\Omega_{ij}]$ : a patch of high-resolution features corresponding to the $(i, j)$ location in the downsampled features
• The main hyperparameter for the downsampler is the kernel size, which should be larger for models with larger receptive fields
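A rough sketch of such an attention-based downsampler, assuming square feature maps and a kernel size no larger than the downsampling stride; the class name and defaults are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDownsampler(nn.Module):
    """1x1-conv saliency + learned weight/bias kernels, softmax-normalized
    into a spatially varying blur that interpolates the features."""

    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.k = kernel_size
        self.salience = nn.Conv2d(dim, 1, kernel_size=1)
        self.w = nn.Parameter(torch.ones(kernel_size * kernel_size))
        self.b = nn.Parameter(torch.zeros(kernel_size * kernel_size))

    def forward(self, F_hr, out_size):
        B, C, H, W = F_hr.shape
        stride = H // out_size                   # assumes k <= stride
        patches = F.unfold(F_hr, self.k, stride=stride)             # (B, C*k*k, L)
        patches = patches.view(B, C, self.k * self.k, -1)
        sal = F.unfold(self.salience(F_hr), self.k, stride=stride)  # (B, k*k, L)
        logits = self.w[None, :, None] * sal + self.b[None, :, None]
        attn = logits.softmax(dim=1)             # normalize the blur kernel
        out = (patches * attn.unsqueeze(1)).sum(dim=2)
        return out.view(B, C, out_size, out_size)
```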
FeatUp
▪ Choosing an Upsampler
• Two variants:
• “JBU” FeatUp parameterizes 𝜎↑ with a guided upsampler based on a stack of Joint Bilateral Upsamplers (JBU)
• Learns an upsampling strategy that generalizes across a corpus of images
• “Implicit” FeatUp uses an implicit network to parameterize 𝜎↑ and can yield remarkably crisp features when overfit to a single image
• Both methods are trained using the same broader architecture and loss
FeatUp
▪ Choosing an Upsampler
• Joint Bilateral Upsampler (JBU)
$$F_{hr} = \big(\mathrm{JBU}(\cdot, x) \circ \mathrm{JBU}(\cdot, x) \circ \cdots\big)\big(f(x)\big)$$
• This architecture is fast and directly incorporates high-frequency details from the input image into the upsampling process
• Generalizes the original JBU to high-dimensional signals and makes the operation learnable
$$\hat{F}_{hr}[i, j] = \frac{1}{Z} \sum_{(a, b) \in \Omega} F_{lr}[a, b] \, k_{range}\big(G[i, j], G[a, b]\big) \, k_{spatial}\big((i, j), (a, b)\big)$$
• $G$ : a high-resolution signal used as guidance for the low-resolution features $F_{lr}$
• $\Omega$ : a neighborhood of each pixel in the guidance (a 3x3 square centered at each pixel)
• $k(\cdot, \cdot)$ : a similarity kernel that measures how “close” two vectors are
• $Z$ : a normalization factor to ensure the kernel sums to 1
FeatUp
▪ Choosing an Upsampler
• Joint Bilateral Upsampler (JBU)
$$\hat{F}_{hr}[i, j] = \frac{1}{Z} \sum_{(a, b) \in \Omega} F_{lr}[a, b] \, k_{range}\big(G[i, j], G[a, b]\big) \, k_{spatial}\big((i, j), (a, b)\big)$$
• $k_{spatial}$ : a learnable Gaussian kernel on the Euclidean distance between coordinate vectors, with learnable width $\sigma_{spatial}$
$$k_{spatial}(x, y) = \exp\left(-\frac{\lVert x - y \rVert_2^2}{2\sigma_{spatial}^2}\right)$$
• $k_{range}$ : a temperature-weighted softmax applied to the inner products of an MLP applied to the guidance signal $G$
$$k_{range}\big(G[i, j], G[a, b]\big) = \mathrm{softmax}_{(a, b) \in \Omega}\left(\frac{1}{\sigma_{range}^2} \, \mathrm{MLP}(G[i, j]) \cdot \mathrm{MLP}(G[a, b])\right)$$
• $\sigma_{range}^2$ acts as the temperature
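A readable, unfused sketch of one learned JBU step under these definitions; the paper uses a fused CUDA kernel that avoids materializing the neighborhoods, so this dense version is for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedJBU(nn.Module):
    """One learned Joint Bilateral Upsampling step: combine k_spatial
    (Gaussian on pixel offsets, learnable width) and k_range (softmax over
    MLP-feature inner products) over a window Ω around each pixel."""

    def __init__(self, guidance_dim, hidden=32, radius=1):
        super().__init__()
        self.radius = radius  # Ω is a (2r+1) x (2r+1) window
        # 1x1 convs act as the per-pixel MLP on the guidance signal G
        self.mlp = nn.Sequential(nn.Conv2d(guidance_dim, hidden, 1),
                                 nn.ReLU(), nn.Conv2d(hidden, hidden, 1))
        self.log_sigma_spatial = nn.Parameter(torch.zeros(()))
        self.log_sigma_range = nn.Parameter(torch.zeros(()))

    def forward(self, F_lr, guidance):
        B, C, _, _ = F_lr.shape
        H, W = guidance.shape[-2:]
        k = 2 * self.radius + 1
        # resample low-res features to guidance resolution before filtering
        F_up = F.interpolate(F_lr, (H, W), mode="bilinear", align_corners=False)
        G = self.mlp(guidance)
        pad = [self.radius] * 4
        nbr_F = F.unfold(F.pad(F_up, pad, mode="replicate"), k).view(B, C, k * k, H * W)
        nbr_G = F.unfold(F.pad(G, pad, mode="replicate"), k).view(B, -1, k * k, H * W)
        # k_range logits: inner products of MLP features, temperature sigma_range^2
        center = G.view(B, -1, 1, H * W)
        range_logits = (center * nbr_G).sum(1) / self.log_sigma_range.exp()
        # log of k_spatial: Gaussian on the coordinate offsets within Ω
        ys, xs = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        d2 = ((ys - self.radius) ** 2 + (xs - self.radius) ** 2).float().to(F_lr.device)
        log_spatial = -d2.view(1, k * k, 1) / (2 * self.log_sigma_spatial.exp() ** 2)
        # softmax of summed logits = normalized product of both kernels (the 1/Z)
        weights = (range_logits + log_spatial).softmax(dim=1)
        out = (nbr_F * weights.unsqueeze(1)).sum(dim=2)
        return out.view(B, C, H, W)
```

In the full pipeline several such modules are stacked, as in the composition shown earlier.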
FeatUp
▪ Choosing an Upsampler
• Joint Bilateral Upsampler (JBU)
• An efficient CUDA implementation of the spatially adaptive kernel used in the JBU
FeatUp
▪ Choosing an Upsampler
• Implicit
• Draws a direct analogy with NeRF by parametrizing the high-res features of a single image with an implicit function $F_{hr} = \mathrm{MLP}(z)$
• Uses a small MLP to map image coordinates and intensities to a high-dimensional feature for the given location
• Uses Fourier features to improve the spatial resolution of our implicit representations
• Adding Fourier color features allows the network to use high-frequency color information from the original image
$$F_{hr} = \mathrm{MLP}\big(h(e_i : e_j : x, \hat{w})\big)$$
• $h(z, \hat{w})$ : the component-wise discrete Fourier transform of an input signal $z$, with a vector of frequencies $\hat{w}$
• $e_i, e_j$ : the two-dimensional pixel coordinate fields ranging in the interval [-1, 1]
• $:$ : concatenation along the channel dimension
• MLP is a small 3-layer ReLU network with dropout (p = 0.1) and layer normalization
• At test time, the pixel coordinate field can be queried to yield features $F_{hr}$ at any resolution
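A sketch of such an implicit network with Fourier-encoded coordinates and colors; the frequency schedule and hidden width are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ImplicitFeaturizer(nn.Module):
    """MLP from Fourier features of (e_i : e_j : x) to a per-pixel feature."""

    def __init__(self, out_dim, color_dim=3, n_freqs=10, hidden=256):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        in_dim = (2 + color_dim) * n_freqs * 2   # sin & cos per frequency
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.1), nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.1), nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        # x: (1, 3, H, W); build coordinate fields e_i, e_j in [-1, 1].
        # Querying with a larger H, W yields features at a higher resolution.
        _, _, H, W = x.shape
        ei, ej = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing="ij")
        z = torch.cat([ei[None, None], ej[None, None], x], dim=1)  # e_i : e_j : x
        zf = z.unsqueeze(2) * self.freqs.view(1, 1, -1, 1, 1)      # h(z, w)
        feats = torch.cat([zf.sin(), zf.cos()], dim=2).flatten(1, 2)
        return self.mlp(feats.permute(0, 2, 3, 1))  # (1, H, W, out_dim)
```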
FeatUp
▪ Additional Method Details
• Accelerated Training with Feature Compression
• To reduce the memory footprint and further speed up the training of FeatUp’s implicit network, compress the spatially-varying
features to their top k = 128 principal components
• This operation is approximately lossless as the top 128 components explain ∼ 96% of the variance across a single image’s
features
• This improves training time by a factor of 60× for ResNet-50, reduces the memory footprint, enables larger batches, and does
not have any observable effect on learned feature quality
• When training the JBU upsampler, we sample random projection matrices in each batch to avoid computing PCA in the inner loop
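A minimal sketch of the compression step built on torch.pca_lowrank, which may differ from the authors' exact routine:

```python
import torch

def compress_features(f_lr, k=128):
    """Project (B, C, h, w) features onto their top-k principal components."""
    B, C, h, w = f_lr.shape
    flat = f_lr.permute(0, 2, 3, 1).reshape(-1, C)  # one row per spatial location
    mean = flat.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(flat - mean, q=k)   # V: (C, k) component basis
    comp = (flat - mean) @ V                        # compressed features
    return comp.view(B, h, w, k).permute(0, 3, 1, 2), V, mean
```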
• Total Variation Prior
• To avoid spurious noise in the high-resolution features, add a small ($\lambda_{tv} = 0.05$) total variation smoothness prior on the implicit
feature magnitudes
$$\mathcal{L}_{tv} = \sum_{i, j} \big(\lVert F_{hr}[i, j] \rVert - \lVert F_{hr}[i-1, j] \rVert\big)^2 + \big(\lVert F_{hr}[i, j] \rVert - \lVert F_{hr}[i, j-1] \rVert\big)^2$$
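Sketched in PyTorch, reading the prior as acting on per-location feature magnitudes as the bullet above states:

```python
import torch

def tv_prior(F_hr, lam=0.05):
    """Total-variation smoothness on feature magnitudes (lam = λ_tv)."""
    mag = F_hr.norm(dim=1)                        # (B, H, W) magnitudes
    dy = (mag[:, 1:, :] - mag[:, :-1, :]) ** 2    # vertical neighbors
    dx = (mag[:, :, 1:] - mag[:, :, :-1]) ** 2    # horizontal neighbors
    return lam * (dy.sum() + dx.sum())
```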
Experiments
▪ Comparison with several key upsampling baselines
• Bilinear upsampling
• Resize-conv
• Stride (i.e. reducing the stride of the backbone’s patch extractor)
• Large Image (i.e. using a larger input image)
• CARAFE (Wang et al., 2019)
• SAPA (Lu et al., 2022)
• FADE (Lu et al., 2022)
▪ Upsample ViT features by 16× (to the resolution of the input image) with every method except the strided and large-image baselines, which are computationally infeasible above 8× upsampling
Experiments
▪ Qualitative Comparisons
• Visualizing upsampling methods
Experiments
▪ Qualitative Comparisons
• Robustness across vision backbones
Experiments
▪ Transfer Learning for Semantic Segmentation and Depth Estimation
• Train linear probes on top of low-resolution features for both semantic segmentation and depth estimation
• Use a frozen pre-trained ViT-S/16 → Upsample the features (14x14 → 224x224) → extract maps by applying a linear layer
• Semantic segmentation
• Train a linear projection to predict the 27 coarse classes of the COCO-Stuff training set using a cross-entropy loss
• Depth prediction
• Train on pseudo-labels from the MiDaS (DPT-Hybrid) depth estimation network using their scale- and shift-invariant MSE
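A minimal sketch of the segmentation probe, where `featup` stands in for a trained upsampler; the 384 channels and 27 classes follow ViT-S/16 and COCO-Stuff's coarse labels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

probe = nn.Conv2d(384, 27, kernel_size=1)    # linear layer applied per pixel

def probe_loss(x, labels, backbone, featup):
    feats_hr = featup(backbone(x), x)        # (B, 384, 224, 224) features
    logits = probe(feats_hr)                 # per-pixel class scores
    return F.cross_entropy(logits, labels)   # labels: (B, 224, 224) class ids
```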
Experiments
▪ Class Activation Map Quality
• FeatUp features can be dropped into existing CAM analyses to yield stronger and more precise explanations
• Metrics for CAM quality: Average Drop (A.D.) and Average Increase (A.I.):
$$\mathrm{A.D.} = \frac{100}{N} \sum_{i=1}^{N} \frac{\max(0, Y_i^c - O_i^c)}{Y_i^c}, \qquad \mathrm{A.I.} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\big[Y_i^c < O_i^c\big]$$
• $Y_i^c$ : the classifier’s softmax output on sample $i$ for class $c$
• $O_i^c$ : the classifier’s softmax output on the CAM-masked sample $i$ for class $c$
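Given per-sample scores, both metrics reduce to a few lines; a sketch:

```python
import torch

def cam_metrics(Y, O):
    """Average Drop / Average Increase from (N,) tensors of class-c softmax
    scores on the original (Y) and CAM-masked (O) samples."""
    ad = (torch.clamp(Y - O, min=0) / Y).mean() * 100   # A.D.
    ai = (Y < O).float().mean() * 100                   # A.I.
    return ad.item(), ai.item()
```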
Experiments
▪ End-to-End Semantic Segmentation
• FeatUp not only improves the resolution of pre-trained features but can also improve models learned end-to-end
• Train SegFormer on the ADE20K dataset with the JBU upsampler
Conclusion
▪ FeatUp upsamples deep features using multi-view consistency
▪ The JBU-based upsampler imposes strong spatial priors to accurately recover lost spatial
information with a fast feedforward network based on a novel generalization of Joint Bilateral
Upsampling
▪ Implicit FeatUp can learn high quality features at arbitrary resolutions
▪ Both variants dramatically outperform a wide range of baselines across linear probe transfer
learning, model interpretability, and end-to-end semantic segmentation